making a custom way to train CNNs, and I am noticing that avgpool is SIGNIFICANTLY faster than maxpool in forward and backwards passes…
I'm designing a custom training procedure for a CNN that differs from backpropagation in that I use manual update rules for layers or sets of layers. I'm studying my gradient for two types of layers, "conv + actfun + maxpool" and "conv + actfun + avgpool", which are identical except that the final step uses a different pooling type.
I compared the two layer types with identical data dimensions to measure the time difference between maxpool and avgpool in both the forward pass and the backward pass of the pooling layer. All other steps in computing the gradient were exactly the same between the two layers and showed the same time costs. But looking specifically at the forward and backward passes of the pooling operations, I get significantly different times (averaged over 5000 runs of the gradient; all measurements in milliseconds):
gradient step | AvgPool | MaxPool | Difference
--------------------------|---------|---------|----------
pooling (forward pass) | 0.4165 | 38.6316 | +38.2151
unpooling (backward pass) | 9.9468 | 46.1667 | +36.2199
For reference, all my data arrays are dlarrays on the GPU (gpuArrays wrapped in dlarrays), all single precision, and the pooling operation reduces 32-by-32 feature maps (2 channels, batch size 16384) to 16-by-16 feature maps (same number of channels and batch size), so just a 2-by-2 pooling.
You can see here that the maxpool forward pass (using the "maxpool" function) is about 92 times slower than the avgpool forward pass (using "avgpool"), and the maxpool backward pass (using "maxunpool") is about 4.6 times slower than the avgpool backward pass (using a custom "avgunpool" function, which Anthropic's Claude had to create for me since MATLAB has no "avgunpool").
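(For reference, for tiled 2-by-2 pooling the custom "avgunpool" backward step amounts to roughly the following; this is a simplified sketch of the idea, not the exact generated function: each upstream gradient is spread uniformly over its pool window.)
function dLdX = avgunpool_sketch(dLdY, poolSize)
% Backward pass of non-overlapping (tiled) average pooling: upsample the
% pooled gradient over each pool window and divide by the window area.
ph = poolSize(1); pw = poolSize(2);
dLdX = repelem(dLdY, ph, pw) / (ph*pw);
end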
These results are extremely suspect to me. For the forward pass, comparing MATLAB's built-in "maxpool" to the built-in "avgpool" gives a 92x difference, yet searching online, people generally claim that max pooling is supposed to be faster to train than average pooling, which contradicts the results here.
For simplicity, see the code example below, which runs just "maxpool" and "avgpool" (no other functions) and compares their times:
function analyze_pooling_timing()
% GPU setup
g = gpuDevice();
fprintf('GPU: %s\n', g.Name);
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
% Number of iterations for averaging
num_iter = 1000;
fprintf('Running %d iterations for each timing measurement...\n\n', num_iter);
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time AVERAGE POOLING
fprintf('=== AVERAGE POOLING (conv_af_ap) ===\n');
times_ap = struct();
for iter = 1:num_iter
% Average pooling
tic;
Oj_pooled = avgpool(Oj, pool_params.pool_size, ...
'Stride', pool_params.pool_stride, ...
'Padding', pool_params.pool_padding);
wait(g);
times_ap.pooling(iter) = toc;
end
%% Time MAX POOLING
fprintf('\n=== MAX POOLING (conv_af_mp) ===\n');
times_mp = struct();
for iter = 1:num_iter
% Max pooling with indices
tic;
[Oj_pooled, max_indices] = maxpool(Oj, pool_params.pool_size, ...
'Stride', pool_params.pool_stride, ...
'Padding', pool_params.pool_padding);
wait(g);
times_mp.pooling(iter) = toc;
end
%% Compute statistics and display results
fprintf('\n=== TIMING RESULTS (milliseconds) ===\n');
fprintf('%-25s %12s %12s %12s\n', 'Step', 'AvgPool', 'MaxPool', 'Difference');
fprintf('%s\n', repmat('-', 1, 65));
steps_common = { 'pooling'};
total_ap = 0;
total_mp = 0;
for i = 1:length(steps_common)
step = steps_common{i};
if isfield(times_ap, step) && isfield(times_mp, step)
mean_ap = mean(times_ap.(step)) * 1000; % times 1000 to convert seconds to milliseconds
mean_mp = mean(times_mp.(step)) * 1000; % times 1000 to convert seconds to milliseconds
total_ap = total_ap + mean_ap;
total_mp = total_mp + mean_mp;
diff = mean_mp - mean_ap;
fprintf('%-25s %12.4f %12.4f %+12.4f\n', step, mean_ap, mean_mp, diff);
end
end
fprintf('%s\n', repmat('-', 1, 65));
%fprintf('%-25s %12.4f %12.4f %+12.4f\n', 'TOTAL', total_ap, total_mp, total_mp - total_ap);
fprintf('%-25s %12s %12s %12.2fx\n', 'max/avg', '', '', total_mp/total_ap);
end
Now you can see one main difference between how I call maxpool and avgpool: for avgpool I request only 1 output (the pooled values), but for maxpool I request 2 outputs (the pooled values and the index locations of those max values).
This is important because if I replace the maxpool call with one requesting only the pooled values (1 output), maxpool is faster than avgpool, as expected:
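That is, the only change in the script above is this single-output call (everything else identical):
% single-output call (no indices requested): this is the fast case
Oj_pooled = maxpool(Oj, pool_params.pool_size, ...
    'Stride', pool_params.pool_stride, ...
    'Padding', pool_params.pool_padding);
With that change, the timing output becomes: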
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...
=== AVERAGE POOLING (conv_af_ap) ===
=== MAX POOLING (conv_af_mp) ===
=== TIMING RESULTS (milliseconds) ===
Step AvgPool MaxPool Difference
-----------------------------------------------------------------
pooling 0.4358 0.2330 -0.2028
-----------------------------------------------------------------
max/avg 0.53x
>>
However, if I call maxpool as shown here with 2 outputs (pooled values and indices), or with 3 outputs (pooled values, indices, and input size), maxpool is extremely slow:
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...
=== AVERAGE POOLING (conv_af_ap) ===
=== MAX POOLING (conv_af_mp) ===
=== TIMING RESULTS (milliseconds) ===
Step AvgPool MaxPool Difference
-----------------------------------------------------------------
pooling 0.4153 38.9818 +38.5665
-----------------------------------------------------------------
max/avg 93.86x
>>
So it appears this is the reason maxpool is so much slower: asking the maxpool function to also output the indices leads to a massive slowdown.
Unfortunately, the indices are needed later to differentiate (backward pass) the maxpool layer... so I need the indices...
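(Concretely, the backward step for my maxpool layer consumes the indices roughly like this; a simplified sketch with illustrative variable names, not my exact code:)
% forward pass: request the indices and input size along with the pooled values
[Yp, max_indices, input_size] = maxpool(Oj, pool_params.pool_size, ...
    'Stride', pool_params.pool_stride, 'Padding', pool_params.pool_padding);
dLdYp = Yp;  % stand-in for the upstream gradient arriving at the pooled resolution
% backward pass: maxunpool routes each upstream gradient entry to its argmax location
dLdOj = maxunpool(dLdYp, max_indices, input_size);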
I'd assume that whenever someone wants to train a CNN in MATLAB using a maxpool layer, they would have to call maxpool with indices, and thus I'd expect a similar slowdown...
Can anyone here comment on this? My application needs me to train maxpool layers, so I need both the forward and backward passes through them, and thus I need these indices. But it appears that MATLAB's version of maxpool may not be the best implementation of a maxpool operation... maybe some inefficient process in obtaining the indices?
1 Comment
Richard
2025-9-11, 16:12
Edited: Richard
2025-9-11, 16:13
We will investigate making improvements to calculating the second output from maxpool; however, I am also interested in exactly how you are using the indices. You say "the indices are needed to later differentiate (backwards pass) the maxpool layer", but this is not the case when calling dlgradient:
X = dlarray(gpuArray.rand(1000, 'single'), 'SSCB');
grad = dlfeval(@(x) dlgradient(sum(maxpool(x,3),'all'),x), X);
The second output from maxpool was only intended for use when there is a maxunpool layer later in the forwards pass. We didn't expect maxunpool to be needed as the backward implementation for maxpool.
I understand you are performing a custom update loop and not a simple backward pass across a full network, but at the same time it sounds like you still want the true gradients from maxpool as part of that, and of course dlgradient can provide them. If you can share some more details about how you are using the network fragments and what you need the gradient of, we may be able to suggest code that does that with dlgradient.
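For example, something along these lines (purely illustrative; the function names and the upstream-gradient argument dLdY are placeholders, so adjust to your actual fragment) returns the gradient of a conv + ReLU + maxpool fragment with respect to its input via dlgradient, without ever requesting the index output:
function dLdX = fragmentInputGradient(X, W, b, dLdY, poolSize, poolStride)
% Vector-Jacobian product through conv + ReLU + maxpool, traced by dlfeval
dLdX = dlfeval(@(x) iGrad(x, W, b, dLdY, poolSize, poolStride), X);
end
function dLdX = iGrad(X, W, b, dLdY, poolSize, poolStride)
S = dlconv(X, W, b, 'Padding', 'same');
Y = maxpool(max(S, 0), poolSize, 'Stride', poolStride);
% sum(Y .* dLdY) is a scalar whose gradient with respect to X is the desired VJP
dLdX = dlgradient(sum(Y .* dLdY, 'all'), X);
end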
Answers (2)
Matt J
2025-8-25,22:16
Edited: Matt J
2025-8-26,14:41
I think the pooling operations are simply not optimized for the specific situation when the pooling is tiled (i.e., poolsize=stride). This is ironic, I guess, since this is a very common case.
Below I demonstrate how to optimize the forward pass for both kinds of pooling using sepblockfun(), downloadable from the File Exchange.
For the maxpool backward pass, it's not really the maximizing indices you need; the more direct thing you need is a binary map of the maxima locations. When you have tiled pooling, this map can be generated much faster, as done below in tiledmaxpoolBACK. There are ways to eliminate the second repelem for an even faster implementation, but this should make the point.
analyze_pooling_timing()
GPU: NVIDIA GeForce RTX 4080 SUPER
Milliseconds =
1×4 table
AvgPool(Forw) AvgPool(Back) MaxPool(Forw) MaxPool(Back)
_____________ _____________ _____________ _____________
3.6987 1.121 1.1704 2.0492
function analyze_pooling_timing()
% GPU setup
g = gpuDevice();
fprintf('GPU: %s\n', g.Name);
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time Tests
millisec = 1000; % Conversion factor
sz=pool_params.pool_size;
Y=tiledmaxpool(Oj, sz); %fake input to back pass
dLdY=Y;
apforwTime = gputimeit(@() tiledavgpool(Oj, sz) )*millisec;
apbackTime = gputimeit(@() tiledavgpoolBACK(dLdY,Oj,Y, sz) )*millisec;
mpforwTime = gputimeit(@() tiledmaxpool(Oj, sz) )*millisec;
mpbackTime = gputimeit(@() tiledmaxpoolBACK(dLdY,Oj,Y, sz) )*millisec;
Milliseconds = table(apforwTime, apbackTime, mpforwTime,mpbackTime,...
'Var',{'AvgPool(Forw)' 'AvgPool(Back)',...
'MaxPool(Forw)', 'MaxPool(Back)'})
end
function Y=tiledavgpool(X,s)
% Tiled (non-overlapping) average pooling via sepblockfun
Y=sepblockfun(X,[s,1,1],'mean');
end
function Q=tiledavgpoolBACK(dLdY,X,Y,s)
% Backward of tiled average pooling: spread each upstream gradient
% uniformly over its pool window (divide by the window area)
Q=repelem(dLdY/prod(s),s(1),s(2));
end
function Y=tiledmaxpool(X,s)
% Tiled (non-overlapping) max pooling via sepblockfun
Y=sepblockfun(X,[s,1,1],'max');
end
function Q=tiledmaxpoolBACK(dLdY,X,Y,s)
% Backward of tiled max pooling: upsample the gradient and keep it only
% where the input equals its block maximum (a binary map of the maxima)
Q=repelem(dLdY,s(1),s(2)) .* (X==repelem(Y,s(1),s(2)));
end
0 Comments
Catalytic
2025-8-26,13:50
Edited: Catalytic
2025-8-26,13:56
As @Matt J says, it may be worthwhile to make customized pooling layers specifically for the case of tiled pools. However, it might also be that your analysis is tainted because you are using the avgpool and maxpool functions to implement pooling instead of implementing it with actual layers. When I test that, I don't see such a big difference between average pool and max pool performance -
analyze_pooling_timing()
Milliseconds =
1×2 table
AvgPool MaxPool
_______ _______
5.0346 4.4899
function analyze_pooling_timing()
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time Tests
millisec = 1000; % Conversion factor
sz=pool_params.pool_size;
for i=1:25
net=dlnetwork(averagePooling2dLayer(sz, Stride=sz),Oj);
fun = @() dlfeval(@modelGradients, net, Oj);
apforwTime(i) = gputimeit( fun,2 )*millisec;
net=dlnetwork(maxPooling2dLayer(sz, Stride=sz),Oj);
fun = @() dlfeval(@modelGradients, net, Oj);
mpforwTime(i) = gputimeit(fun,2 )*millisec;
end
Milliseconds = table(mean(apforwTime), mean(mpforwTime),...
'Var',{'AvgPool','MaxPool'})
end
function [loss, gradients] = modelGradients(net, X)
% Forward pass and loss
Y = forward(net, X);
loss = Y(1); % use one output element as a scalar surrogate loss for timing
% Backpropagation (automatic differentiation)
gradients = dlgradient(loss, X);
end
0 Comments