Why is my GPU code faster with the profiler on when running on RTX GPUs?
I need to process large multidimensional arrays with a series of 1-D convolutions, and I found it's faster to implement the convolution by hand in a for loop instead of using conv because the kernel is so small. However, on certain GPUs my code runs significantly faster when the profiler is on: it is consistently 1.5x to 2x faster on an Nvidia RTX 3080 or an Nvidia RTX 2070, while on an Nvidia A4500 or Nvidia A5000 there is no significant difference. This matters because a single dataset can take hours to process.
This behavior is consistent across multiple computers, all running Linux (Ubuntu 22.04), and tested with R2021a and R2022a and with Nvidia driver versions 515 and 520. My question is: how can I make sure I get the "fast" performance without having to embed profile on and profile off in the relevant parts of my code? I have actually done this, and I benefit directly from the improved performance when processing an entire dataset, but it is hacky and will interfere with the expected use of the profiler in the rest of the code.
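The hack in question looks roughly like this (a minimal sketch; processOneChunk is a hypothetical stand-in for the actual processing code):
% Hacky workaround: wrap only the hot section so the profiler's code path is active
profile('on')
processOneChunk();  % hypothetical placeholder for the convolution code in the MWE below
profile('off')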
An MWE follows. I am placing the fastest run first to avoid confusion about the second instance potentially running faster due to the JIT or caching. I am also clearing the large variables between runs to avoid confusion about memory allocation, and I am using the results to calculate arrayMean to avoid confusion about the JIT optimizing away (i.e., skipping) operations for unused results. Interestingly, these three concerns do not matter in practice, and the code runs consistently faster with the profiler on.
% Define common parameters
clear
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);
% Run with profiler on. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('on')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOn = toc;
profile('off')
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler ON: %g seconds.\n', timeProfOn)
% Run with profiler off. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('off')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOff = toc;
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler OFF: %g seconds.\n', timeProfOff)
0 Comments
Accepted Answer
Joss Knight
2022-12-1
This is due to an optimization that is not performing ideally under memory pressure. If you reduce the size of your input, you'll find that the discrepancy only appears when you're near the limit of your GPU memory.
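As a quick sanity check, you can compare your array footprint against the device's free memory (a minimal sketch using the gpuDevice properties AvailableMemory and TotalMemory):
gpu = gpuDevice();
fprintf('GPU memory: %.2f GiB free of %.2f GiB total\n', ...
    gpu.AvailableMemory / 2^30, gpu.TotalMemory / 2^30)
% The input in the MWE alone is 47*47*47*5000 singles, roughly 1.9 GiB,
% before any temporaries are allocated.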
When PCT sees a series of element-wise operations like this, it fuses them together so it can run a single kernel, as in
largeArrayConv = largeArrayConv + k1.*largeArray(idx1) + k2.*largeArray(idx2) + k3.*largeArray(idx3) + ...
Unfortunately, this means that memory must be allocated for the intermediates, and when you're low on memory you'll end up with a lot of raw allocs and frees. When the profiler is on, this optimization is disabled so that the per-line measurements make sense, and so you only ever need one temporary array allocation per loop iteration.
Of the various possible workarounds, the easiest is probably just to add wait(gpuDevice) before the end of your for loop.
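Applied to the loop from the question, that is just one extra line (a sketch reusing the variables from the MWE):
for thisShift = -convSize:convSize
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :) / (2 * convSize + 1);
    wait(gpuDevice)  % synchronize here so fused kernels and their temporaries cannot pile up
end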
I agree that the optimization is misbehaving in this case and we'll take a look at how it might be improved.
2 Comments
Joss Knight
2022-12-2
I'm surprised about that. This is how I adapted your code:
clear
gpu = gpuDevice();
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);
% Run with profiler on. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single');
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
wait(gpu)
profile off
gputimeit(@()runConvolutionFull(convSize, largeArray, convKernel))
profile on
gputimeit(@()runConvolutionFull(convSize, largeArray, convKernel))
profile off
function largeArrayConv = runConvolutionFull(convSize, largeArray, convKernel)
largeArrayConv = 0;
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    k = convKernel(convSize + 1 + thisShift) / (2 * convSize + 1);
    largeArrayPiece = largeArray(:, :, idx, :, :, :);
    largeArrayConv = k .* largeArrayPiece + largeArrayConv;
    wait(gpuDevice)
end
end
This makes 100% sure we're only timing the things that are consistent between the two scenarios.
More Answers (0)