Why is pagemtimes slower on GPU than a CPU?

Question

David Ho 2020-12-1


https://ww2.mathworks.cn/matlabcentral/answers/671413-why-is-pagemtimes-slower-on-gpu-than-a-cpu

Answered: Joss Knight 2021-5-30

I'm a physics researcher, and a lot of my numerical work involves batch multiplication of small, complex-valued matrices. This made me very excited to see that in R2020b the pagemtimes function has been implemented, as this is exactly what I need.

On a speed comparison test between GPU and CPU, however, GPU performs significantly worse. Here's a minimal example of such a test:

batchSize = [100 100 100];
matSize = [2 2];
a = complex(rand([matSize batchSize]), rand([matSize batchSize]));
gpuA = gpuArray(a);
f = @() pagemtimes(a,a);
gpuF = @() pagemtimes(gpuA, gpuA);
timeit(f) % I get around 0.05 seconds
gputimeit(gpuF) % I get around 0.3 seconds

Is the significant slowdown simply because the batch/matrix size is too small for GPU optimisation to beat the overheads? Or is there something else going on that I've missed?

I'm testing this on an NVIDIA Quadro P1000 GPU.

5 Comments (3 older comments not shown)

David Ho 2020-12-2

Hi Joss, that's interesting to know. For my applications it would be helpful to use a [2 2 N N N] grid, as it represents a matrix-valued field in 3D space. If it's a significant performance increase, though, it might be possible to flatten the batch dimensions and postprocess afterwards.

Joss Knight 2021-5-30

Flattening and unflattening only changes array metadata so it is essentially a free operation.
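The flatten-and-restore approach discussed in these comments could look roughly like the sketch below. It is only an illustration under stated assumptions: the grid size N and the variable names aFlat, cFlat, c and err are hypothetical and do not come from the thread, and it runs on the CPU so that it works without a GPU (wrapping a in gpuArray would be the GPU variant).

```matlab
% Sketch: collapse the three batch dimensions into a single page dimension,
% multiply page-wise, then restore the original grid shape.
N = 10;                                        % hypothetical grid size per axis
a = complex(rand(2,2,N,N,N), rand(2,2,N,N,N)); % 2-by-2 matrix at each grid point

aFlat = reshape(a, 2, 2, []);                  % 2-by-2-by-N^3
cFlat = pagemtimes(aFlat, aFlat);              % batched multiply over all pages
c     = reshape(cFlat, size(a));               % back to 2-by-2-by-N-by-N-by-N

% Spot-check one grid point against an ordinary matrix product.
err = norm(c(:,:,3,4,5) - a(:,:,3,4,5) * a(:,:,3,4,5));
```

Because reshape keeps the element order, page k of aFlat corresponds to the grid point with linear index k, so the final reshape puts every product back at its original location.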


Answer 1

Matt J 2020-12-1


https://ww2.mathworks.cn/matlabcentral/answers/671413-why-is-pagemtimes-slower-on-gpu-than-a-cpu#answer_561603

Edited: Matt J 2020-12-1


I don't have a MATLAB version that supports pagemtimes, but for the GPU version, it might be advisable to try pagefun(@mtimes,...) instead. I do see an improvement relative to mtimesx.

batchSize = [100 100 100];
matSize = [2 2];
a = complex(rand([matSize batchSize]), rand([matSize batchSize]));
gpuA = gpuArray(a);
f = @() mtimesx(a,a);
gpuF = @() pagefun(@mtimes,gpuA, gpuA);
timeit(f) % I get around 0.8175 seconds
gputimeit(gpuF) % I get around 0.1033 seconds

1 Comment

David Ho 2020-12-2

Hi Matt, thanks for answering. When I test pagefun I get exactly the same performance as pagemtimes. It's interesting that your CPU and GPU times are very different to mine, though.


Answer 2

Nathan Zechar 2021-5-29


https://ww2.mathworks.cn/matlabcentral/answers/671413-why-is-pagemtimes-slower-on-gpu-than-a-cpu#answer_712305


Hello David, I have a similar problem. I have found that pagemtimes is slower than expanding the equation by hand and coding it element-wise, on both CPU and GPU. On the GPU it is exceptionally slow.

Here is an example. This can be coded up two different ways. Notice the performance of pagemtimes with just the CPU.

clear all
Nx = 100;
Ny = 100;
Nz = 100;
[A1,A2,A3,B1,B2,B3,C1,C2,C3,...
E11,E12,E13,E21,E22,E23,E31,E32,E33,...
F11,F12,F13,F21,F22,F23,F31,F32,F33] = deal(rand(Nx,Ny,Nz));
tic
for i = 1:20
    %% Electric Field Update
    C1 = F11.*(A1.*E11+B1)+F12.*(A2.*E12+B2)+F13.*(A3.*E13+B3);
    
    C2 = F21.*(A2.*E21+B2)+F22.*(A2.*E22+B2)+F23.*(A3.*E23+B3);
    
    C3 = F31.*(A3.*E31+B3)+F32.*(A3.*E32+B3)+F33.*(A3.*E33+B3);
end
toc
[A,B,C] = deal(rand(3,1,Nx,Ny,Nz));
[E,F] = deal(rand(3,3,Nx,Ny,Nz));
tic
for i = 1:20
    C = pagemtimes(F,(B+pagemtimes(E,A)));
end
toc

Without pagemtimes - "Elapsed time is 0.032141 seconds".

With pagemtimes - "Elapsed time is 0.325006 seconds"

Using gpuArray() on the variables in the deal() function, the difference in times is even larger!

Without pagemtimes - "Elapsed time is 0.012688 seconds."

With pagemtimes - "Elapsed time is 5.357220 seconds."
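A caveat when reading tic/toc figures for GPU code: MATLAB can run GPU operations asynchronously, so toc may return before the queued work has finished unless wait(gpuDevice) is called first. The sketch below times the same pagemtimes expression with gputimeit, which handles that synchronization. It assumes Parallel Computing Toolbox and a supported GPU, and falls back to timeit on the CPU otherwise. The names useGPU, g and t are hypothetical, and no timing results are implied.

```matlab
% Sketch: time the batched expression with gputimeit instead of tic/toc.
Nx = 100; Ny = 100; Nz = 100;
useGPU = canUseGPU;                 % true only if a supported GPU is available

A = rand(3,1,Nx,Ny,Nz);
B = rand(3,1,Nx,Ny,Nz);
E = rand(3,3,Nx,Ny,Nz);
F = rand(3,3,Nx,Ny,Nz);

if useGPU
    % Move the operands to the device before timing, so that transfer
    % cost is not included in the measurement.
    A = gpuArray(A); B = gpuArray(B);
    E = gpuArray(E); F = gpuArray(F);
end

g = @() pagemtimes(F, B + pagemtimes(E, A));

if useGPU
    t = gputimeit(g);               % synchronizes with the GPU before stopping the clock
else
    t = timeit(g);                  % CPU fallback
end
fprintf('One evaluation takes about %.4f seconds.\n', t);
```

If tic/toc is preferred, calling wait(gpuDevice) immediately before toc is the usual way to make sure the measured interval covers the GPU work.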


Answer 3

Joss Knight 2021-5-30


https://ww2.mathworks.cn/matlabcentral/answers/671413-why-is-pagemtimes-slower-on-gpu-than-a-cpu#answer_712610

This might be simply because you are running double-precision math on a device designed for single precision operations. gpuBench doesn't show much improvement for double precision operations over the CPU on these devices. Can you convert your data to single precision and see if there is an improvement on GPU?
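The single-precision test suggested here could be set up along the following lines, reusing the sizes from the original question. This is a minimal sketch, assuming Parallel Computing Toolbox for the GPU branch. The names aSingle, tCPU and tGPU are hypothetical, and no particular speedup is implied.

```matlab
% Sketch: repeat the question's benchmark with single-precision data.
batchSize = [100 100 100];
matSize   = [2 2];

% Generate the data directly in single precision.
aSingle = complex(rand([matSize batchSize], 'single'), ...
                  rand([matSize batchSize], 'single'));

tCPU = timeit(@() pagemtimes(aSingle, aSingle));
fprintf('CPU, single precision: %.4f seconds\n', tCPU);

if canUseGPU                        % skip the GPU branch if no supported device
    gpuA = gpuArray(aSingle);       % a gpuArray keeps the class of its source data
    tGPU = gputimeit(@() pagemtimes(gpuA, gpuA));
    fprintf('GPU, single precision: %.4f seconds\n', tGPU);
end
```

Comparing these two figures with the double-precision timings from the question would show whether the precision of the data explains the gap.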
