Why is my GPU code faster with the profiler on when running on RTX GPUs?
I need to process large multidimensional arrays with a series of 1-D convolutions, and I found it's faster to implement the convolution by hand in a for loop instead of using conv because the kernel is so small. However, on certain GPUs my code runs significantly faster when the profiler is on: it is consistently 1.5x to 2x faster on an Nvidia RTX 3080 or an Nvidia RTX 2070, while on an Nvidia A4500 or Nvidia A5000 there is no significant difference. This matters because a single dataset can take hours to process.
This behavior is consistent across multiple computers, all running Linux (Ubuntu 22.04), and tested with R2021a and R2022a and with Nvidia driver versions 515 and 520. My question is: how can I make sure I get the "fast" performance without having to embed profile on and profile off in the relevant parts of my code? I have actually done this, and I benefit directly from the improved performance when processing an entire dataset, but it is hacky and will interfere with the expected use of the profiler in the rest of the code.
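The hack in question looks roughly like this (a minimal sketch; processOneChunk is a hypothetical stand-in for the actual processing code):
% Hacky workaround: wrap only the hot section so the profiler's code path is active
profile('on')
processOneChunk();  % hypothetical placeholder for the convolution code in the MWE below
profile('off')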
An MWE follows. I am placing the fastest run first to avoid confusion about the second instance potentially running faster due to the JIT or caching. I am also clearing the large variables between runs to avoid confusion about memory allocation, and I am using the results to calculate arrayMean to avoid confusion about the JIT optimizing away (i.e., skipping) operations for unused results. Interestingly, these three concerns do not matter in practice, and the code runs consistently faster with the profiler on.
% Define common parameters
clear
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);
% Run with profiler on. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('on')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOn = toc;
profile('off')
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler ON: %g seconds.\n', timeProfOn)
% Run with profiler off. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('off')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOff = toc;
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler OFF: %g seconds.\n', timeProfOff)
0 Comments
Accepted Answer
Joss Knight
2022-12-1
This is due to an optimization that is not performing ideally under memory pressure. If you reduce the size of your input, you'll find that the discrepancy only appears when you're near the limit of your GPU memory.
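As a quick sanity check, you can compare your array footprint against the device's free memory (a minimal sketch using the gpuDevice properties AvailableMemory and TotalMemory):
gpu = gpuDevice();
fprintf('GPU memory: %.2f GiB free of %.2f GiB total\n', ...
    gpu.AvailableMemory / 2^30, gpu.TotalMemory / 2^30)
% The input in the MWE alone is 47*47*47*5000 singles, roughly 1.9 GiB,
% before any temporaries are allocated.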
When PCT sees a series of element-wise operations like this, it fuses them together so it can run a single kernel, as in
largeArrayConv = largeArrayConv + k1.*largeArray(idx1) + k2.*largeArray(idx2) + k3.*largeArray(idx3) + ...
Unfortunately, this means that memory must be allocated for the intermediates, and when you're low on memory you'll end up with a lot of raw allocs and frees. When the profiler is on, this optimization is disabled so that the per-line measurements make sense, and so you only ever need one temporary array allocation per loop iteration.
Of the various possible workarounds, the easiest is probably just to add wait(gpuDevice) before the end of your for loop.
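Applied to the loop from the question, that is just one extra line (a sketch reusing the variables from the MWE):
for thisShift = -convSize:convSize
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :) / (2 * convSize + 1);
    wait(gpuDevice)  % synchronize here so fused kernels and their temporaries cannot pile up
end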
I agree that the optimization is misbehaving in this case and we'll take a look at how it might be improved.
2 Comments
Joss Knight
2022-12-2
I'm surprised about that. This is how I adapted your code:
clear
gpu = gpuDevice();
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);
% Run with profiler on. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single');
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
wait(gpu)
profile off
gputimeit(@()runConvolutionFull(convSize, largeArray, convKernel))
profile on
gputimeit(@()runConvolutionFull(convSize, largeArray, convKernel))
profile off
function largeArrayConv = runConvolutionFull(convSize, largeArray, convKernel)
largeArrayConv = 0;
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    k = convKernel(convSize + 1 + thisShift) / (2 * convSize + 1);
    largeArrayPiece = largeArray(:, :, idx, :, :, :);
    largeArrayConv = k .* largeArrayPiece + largeArrayConv;
    wait(gpuDevice)
end
end
This makes 100% sure we're only timing the things that are consistent between the two scenarios.
More Answers (0)