3D gpuArray vs cells of 2D gpuArrays major speed difference!

Can anybody explain why these two versions of my code have drastically different runtimes?
I have a shared setup routine:
clear all
y = gpuArray.rand(1000, 1000, 'single');
W = cell(1, 5);
WFull = gpuArray.zeros(1000, 1000, 5);
for j = 1:5
W{j} = gpuArray.rand(1000, 1000, 'single');
WFull(:,:,j) = W{j};
end
Version 1 (finishes in 1.4 seconds on my machine)
z = gpuArray.zeros(1000, 1000, 5);
tic
for i = 1:1000
for j = 1:size(W)
z(:,:,j) = W{j}*y;
end
end
toc
vs. Version 2 (finishes in 39 seconds on my machine, about 27x slower)
z = gpuArray.zeros(1000, 1000, 5);
tic
for i = 1:1000
for j = 1:size(WFull, 3)
z(:,:,j) = WFull(:,:,j)*y;
end
end
toc
Do you think that slicing large 3D gpuArrays is just really slow compared to looking up cell array values?

Accepted Answer

Matt J 2013-5-24
Edited: Matt J 2013-5-24
Do you think that slicing large 3D gpuArrays is just really slow compared to looking up cell array values?
Yes, it is faster to look up a cell than to pull a slice out of a 3D array, and that's true for normal arrays as well, as long as there are only a small number of slices/cells. Of course, for a fair comparison you should also include the time needed to allocate memory for each W{j}.
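A rough way to check this claim on ordinary (CPU) arrays is to time both access patterns with timeit. This is only a sketch: the array sizes mirror the question, the names A, C, tCell, and tSlice are illustrative, and the numbers will vary by machine and MATLAB release.

```matlab
% CPU-only comparison of cell lookup vs. slicing a 3D array (no GPU needed)
A = rand(1000, 1000, 5, 'single');   % one 3D array holding 5 slices
C = cell(1, 5);                      % the same data as 5 separate 2D arrays
for j = 1:5
    C{j} = A(:,:,j);
end

tCell  = timeit(@() C{3});           % look up one cell
tSlice = timeit(@() A(:,:,3));       % extract one slice from the 3D array

fprintf('cell lookup: %.3g s, slice: %.3g s\n', tCell, tSlice);
```

timeit may warn that the cell lookup is too fast to measure accurately, which is itself a hint about how cheap that operation is.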
Another reason is that your for-loop over W{j} has a bug (a logic error rather than a syntax error, so MATLAB does not complain). size(W) returns the vector [1 5], so 1:size(W) does only 1 loop iteration instead of 5:
>> for j=1:size(W), j, end
j =
1
This is biasing the comparison to some degree.
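The loop count can be checked without a GPU. When a colon operand is nonscalar, MATLAB uses only its first element, so 1:size(W) on a 1-by-5 cell behaves like 1:1. In this minimal sketch the variable names nBuggy, nDim, and nNumel are made up for illustration:

```matlab
W = cell(1, 5);                 % same shape as the cell array in the question

nBuggy = numel(1:size(W));      % size(W) is [1 5]; colon uses only the first element
nDim   = numel(1:size(W, 2));   % explicit dimension: covers all 5 cells
nNumel = numel(1:numel(W));     % numel works regardless of the cell's orientation

fprintf('1:size(W)    -> %d iteration(s)\n', nBuggy);
fprintf('1:size(W, 2) -> %d iteration(s)\n', nDim);
fprintf('1:numel(W)   -> %d iteration(s)\n', nNumel);
```

Either size(W, 2) or numel(W) fixes the loop; numel(W) has the small advantage of still working if W is ever stored as a column.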
2 Comments
Dan Ryan 2013-5-24
I caught a couple of other issues: I had left 'single' off the gpuArray creation for some arrays but included it for others. I also changed size(W) to size(W, 2), and now the comparison is much closer.
Here is the new code:
% Version 1: cell array of 2D gpuArrays
clear all
y = gpuArray.rand(1000, 1000, 'single');
z = gpuArray.zeros(1000, 1000, 5, 'single');
W = cell(1, 5);
for j = 1:5
    W{j} = gpuArray.rand(1000, 1000, 'single');
end
tic
for i = 1:500
    for j = 1:size(W, 2)
        z(:,:,j) = W{j}*y;
    end
end
toc

% Version 2: one 3D gpuArray
clear all
y = gpuArray.rand(1000, 1000, 'single');
z = gpuArray.zeros(1000, 1000, 5, 'single');
WMat = gpuArray.rand(1000, 1000, 5, 'single');
tic
for i = 1:500
    for j = 1:size(WMat, 3)
        z(:,:,j) = WMat(:,:,j)*y;
    end
end
toc
What is really strange to me is that the execution time is very nonlinear in the number of outer-loop iterations (the upper bound of i). There must be some sort of memory flush going on when i gets large, but I'm not really sure why...
i = 100 -> runtimes are 0.10 and 0.14 seconds
i = 200 -> runtimes are 0.73 and 1.98 seconds
i = 500 -> runtimes are 10.3 and 11.7 seconds (notice the large jump for version 1!)
i = 1000 -> runtimes are 26.3 and 28.0 seconds!
Does anyone have a clue about this highly nonlinear trend? I don't see why GPU memory would come into play, since I am basically just overwriting existing values and performing the exact same computations in every iteration!
Dan Ryan 2013-5-30
James Lebak from MathWorks helped me out with a really good tip: use a wait(gpuDevice) command before the toc command when timing GPU code.
Now the timings increase linearly with number of loop iterations and the two implementations give very similar results. Good to know!
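The tip works because gpuArray operations are launched asynchronously: MATLAB can return from a line like W{j}*y before the GPU has finished it, so toc may stop the clock while work is still queued. A minimal sketch of the corrected timing pattern follows, assuming Parallel Computing Toolbox and a supported GPU. The sizes come from the code above, while dev, nOuter, and elapsed are illustrative names.

```matlab
dev = gpuDevice;                               % handle to the currently selected GPU

y = gpuArray.rand(1000, 1000, 'single');
z = gpuArray.zeros(1000, 1000, 5, 'single');
W = cell(1, 5);
for j = 1:numel(W)
    W{j} = gpuArray.rand(1000, 1000, 'single');
end

nOuter = 500;                                  % number of outer-loop iterations

wait(dev);                                     % make sure the setup work has finished
tic
for i = 1:nOuter
    for j = 1:numel(W)
        z(:,:,j) = W{j} * y;
    end
end
wait(dev);                                     % block until all queued GPU work is done
elapsed = toc;

fprintf('%d iterations: %.3f s\n', nOuter, elapsed);
```

Newer MATLAB releases also provide gputimeit, which handles this synchronization itself, and may prefer the rand(..., 'gpuArray') form over gpuArray.rand.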
