3D gpuArray vs cells of 2D gpuArrays major speed difference!

Question

0 votes

Can anybody explain why these two pieces of code have drastically different runtimes?

I have a shared setup routine:

clear all
y = gpuArray.rand(1000, 1000, 'single');
W = cell(1, 5);
WFull = gpuArray.zeros(1000, 1000, 5);
for j = 1:5
   W{j} = gpuArray.rand(1000, 1000, 'single');
   WFull(:,:,j) = W{j};
end

Version 1 (finishes in 1.4 seconds on my machine)

z = gpuArray.zeros(1000, 1000, 5);
tic
for i = 1:1000
   for j = 1:size(W)
      z(:,:,j) = W{j}*y;
   end
end
toc

vs. Version 2 (finishes in 39 seconds on my machine, about 27x slower)

z = gpuArray.zeros(1000, 1000, 5);
tic
for i = 1:1000
   for j = 1:size(WFull, 3)
      z(:,:,j) = WFull(:,:,j)*y;
   end
end
toc

Do you think that slicing large 3D gpuArrays is just really slow compared to looking up cell array values?

Answer 1

Matt J 2013-5-24

Edited: Matt J 2013-5-24

2 votes

"Do you think that slicing large 3D gpuArrays is just really slow compared to looking up cell array values?"

Yes, it is faster to look up a cell than to pull a slice out of a 3D array, and that's true for normal arrays as well, as long as there are only a few slices/cells. Of course, you should really be including the time needed to allocate memory for each W{j} in your comparison.

Another reason is that you have a bug in your for-loop over W{j}. W is a 1x5 cell array, so size(W) returns [1 5], and the colon operator uses only the first element of that vector. The loop therefore does only 1 iteration instead of 5,

   >> for j=1:size(W), j, end 
j =
       1

This is biasing the comparison to some degree.
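The loop-bound issue described above can be reproduced on the CPU without any GPU code. The following is a minimal sketch, assuming a stand-in 1-by-5 cell array like the one in the question; the variable names sz, buggyRange, and fixedRange are illustrative and not from the original post.

```matlab
% Hypothetical stand-in for the cell array W from the question (CPU only).
W = cell(1, 5);

sz = size(W);              % size vector of a 1-by-5 cell array: [1 5]
buggyRange = 1:size(W);    % colon keeps only the first element of [1 5], i.e. 1:1
fixedRange = 1:numel(W);   % numel(W) is 5, so this is 1:5

fprintf('size(W) = [%d %d]\n', sz);
fprintf('1:size(W) visits %d index(es)\n', numel(buggyRange));
fprintf('1:numel(W) visits %d index(es)\n', numel(fixedRange));
```

Using numel(W) or size(W, 2) as the loop bound avoids the problem. Some newer MATLAB releases may also warn about nonscalar colon operands, which would make this kind of mistake easier to spot.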

2 comments

Dan Ryan 2013-5-24

I caught a couple of other issues: I had left 'single' off the gpuArray creation for some arrays but included it for others. I also changed size(W) to size(W, 2), and now the comparison is much closer.

Here is the new code:

% Version 1: cell array of 2D gpuArrays
clear all
y = gpuArray.rand(1000, 1000, 'single');
z = gpuArray.zeros(1000, 1000, 5, 'single');
W = cell(1, 5);
for j = 1:5
   W{j} = gpuArray.rand(1000, 1000, 'single');
end
tic
for i = 1:500
   for j = 1:size(W, 2)
      z(:,:,j) = W{j}*y;
   end
end
toc

% Version 2: a single 3D gpuArray, sliced along the third dimension
clear all
y = gpuArray.rand(1000, 1000, 'single');
z = gpuArray.zeros(1000, 1000, 5, 'single');
WMat = gpuArray.rand(1000, 1000, 5, 'single');
tic
for i = 1:500
   for j = 1:size(WMat, 3)
      z(:,:,j) = WMat(:,:,j)*y;
   end
end
toc

What is really strange to me is that the execution time is very nonlinear in the number of outer-loop iterations (the upper bound of i). There must be some sort of memory flush going on when that count gets large, though I'm not really sure why...

i = 100 -> runtimes are 0.10 and 0.14 seconds

i = 200 -> runtimes are 0.73 and 1.98 seconds

i = 500 -> runtimes are 10.3 and 11.7 seconds (notice the large jump for version 1!)

i = 1000 -> runtimes are 26.3 and 28.0 seconds!

Does anyone have a clue about this highly nonlinear trend? I don't see why GPU memory would come into play, since I am basically just overwriting existing values and performing the exact same computations in every iteration!

Dan Ryan 2013-5-30

James Lebak from MathWorks helped me out with a really good tip: use a wait(gpuDevice) command before the toc command when timing GPU code.

Now the timings increase linearly with the number of loop iterations and the two implementations give very similar results. Good to know!
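For reference, the tip above can be sketched as follows. GPU work in MATLAB is generally queued asynchronously, so toc can return before the multiplications have actually finished; that would be consistent with the nonlinear timings reported earlier, although the thread does not confirm the exact cause. The array sizes mirror the thread, while nIter, dev, and elapsed are illustrative names and the iteration count is an arbitrary choice. Running this requires Parallel Computing Toolbox and a supported GPU.

```matlab
% Sketch only: requires Parallel Computing Toolbox and a supported GPU.
% Array sizes mirror the thread; nIter is an illustrative choice.
nIter = 100;
y = gpuArray.rand(1000, 1000, 'single');
WMat = gpuArray.rand(1000, 1000, 5, 'single');
z = gpuArray.zeros(1000, 1000, 5, 'single');

dev = gpuDevice;   % handle to the currently selected GPU
wait(dev);         % let setup work finish before starting the clock

tic
for i = 1:nIter
   for j = 1:size(WMat, 3)
      z(:,:,j) = WMat(:,:,j)*y;
   end
end
wait(dev);         % block until all queued GPU work is done
elapsed = toc;

fprintf('%d iterations: %.3f s\n', nIter, elapsed);
```

Without the second wait(dev), the measured time may reflect only how long it took to queue the work. Where available, gputimeit is another option, since it handles this synchronization itself.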
