Can this GPU code snippet be redone without nested loops?

Question

0 votes

Hello, I have two matrices: matrix1 is a logical array of 1s and 0s (1000 x 800), and matrix2 is a different logical array (2000 x 800).

I am essentially taking a row of matrix1 and a row of matrix2 and calculating the number of elements set in both divided by the number of elements set in either, for every pair of rows. Both of these arrays are gpuArrays. What I am finding:

for j=gpuArray.colon(1,x)
    for k=gpuArray.colon(1,y)
        output(j,k)=sum(matrix1(j,:) & matrix2(k,:)) / sum(matrix1(j,:) | matrix2(k,:));
    end
end

This runs very fast for small values of x and y, but once x and y are large it takes exponentially longer to run on the GPU.

I am investigating the use of repmat here, but I am not sure how to implement it. Any ideas? Or is there another way to get rid of the nested for loops?

Thanks

Answer 1

Joss Knight 2013-11-12

3 votes

The GPU isn't going to work well with your nested loops. This looks like a classic case for bsxfun:

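% Move the 800 columns to the 3rd dimension so that bsxfun pairs every row
% of matrix1 (along dim 1) with every row of matrix2 (along dim 2).
matrix1 = permute(matrix1, [1 3 2]);   % 1000 x 1 x 800
matrix2 = permute(matrix2, [3 1 2]);   % 1 x 2000 x 800
output = sum( bsxfun(@and, matrix1, matrix2), 3 ) ./ ...
         sum( bsxfun(@or, matrix1, matrix2), 3);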

I can't promise it will run faster than on your CPU though, if you have a lot of cores.
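
As a quick sanity check, the bsxfun form can be compared against the nested loops from the question on small arrays. The sketch below runs on the CPU with made-up sizes (5 x 8 and 7 x 8) and random test data, none of which come from the thread; the names a, b, expected and sameResult are illustrative only.

```matlab
% Small CPU-side check that the bsxfun form matches the original double loop.
% The sizes and the rand threshold are arbitrary test values, not from the thread.
rng(0);                               % reproducible test data
matrix1 = rand(5, 8) > 0.5;           % stand-in for the 1000 x 800 logical array
matrix2 = rand(7, 8) > 0.5;           % stand-in for the 2000 x 800 logical array

% Reference: the nested loops from the question, with output preallocated.
expected = zeros(size(matrix1, 1), size(matrix2, 1));
for j = 1:size(matrix1, 1)
    for k = 1:size(matrix2, 1)
        expected(j, k) = sum(matrix1(j, :) & matrix2(k, :)) / ...
                         sum(matrix1(j, :) | matrix2(k, :));
    end
end

% Vectorized: rows of matrix1 along dim 1, rows of matrix2 along dim 2,
% columns moved to dim 3 so that bsxfun expands every row pair.
a = permute(matrix1, [1 3 2]);        % 5 x 1 x 8
b = permute(matrix2, [3 1 2]);        % 1 x 7 x 8
output = sum(bsxfun(@and, a, b), 3) ./ sum(bsxfun(@or, a, b), 3);

% isequaln treats NaN as equal to NaN (0/0 arises if both rows are all zeros).
sameResult = isequaln(output, expected);
```

If sameResult is true on small inputs, the same two permute calls and the bsxfun line should carry over to the full-size gpuArrays unchanged.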

Jill Reese 2013-11-13

Amr, the original code was processing fewer elements at once. When Joss mentioned that "The GPU isn't going to work well with your nested loops", he was referring to the fact that the GPU wasn't being provided enough work to keep it busy.

Joss Knight 2013-11-19

Yes, with the loop the GPU is being asked to do thousands of very small computations in series - entirely the opposite of what it's good at. Instead we created 1000x2000x800 arrays containing all possible AND and OR combinations (using bsxfun) and then summed along the 3rd dimension to reduce down to the 1000x2000 matrix you were after.
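
One thing to keep in mind is that a 1000x2000x800 logical array holds 1.6e9 one-byte elements, roughly 1.6 GB each, which may not fit on a smaller GPU. If memory becomes the limit, one possible alternative (not proposed in this thread) is to get the intersection counts from a matrix product and the union counts from inclusion-exclusion, |A or B| = |A| + |B| - |A and B|. The sketch below uses small made-up CPU arrays; the assumption is that the same lines also work on gpuArray inputs, and the names m1, m2, common, n1, n2, total and reference are illustrative only.

```matlab
% Alternative sketch (not from the thread): intersection counts via a matrix
% product, union counts via inclusion-exclusion. Shown on small CPU arrays.
rng(1);                               % reproducible test data
matrix1 = rand(6, 10) > 0.5;          % stand-in for the 1000 x 800 array
matrix2 = rand(4, 10) > 0.5;          % stand-in for the 2000 x 800 array

m1 = double(matrix1);                 % work in double so the product yields counts
m2 = double(matrix2);

common = m1 * m2.';                   % common(j,k) = columns set in both rows
n1 = sum(m1, 2);                      % number of ones per row of matrix1 (6 x 1)
n2 = sum(m2, 2).';                    % number of ones per row of matrix2 (1 x 4)
total = bsxfun(@plus, n1, n2) - common;   % |A or B| = |A| + |B| - |A and B|
output = common ./ total;

% Cross-check against the bsxfun formulation from the answer.
a = permute(matrix1, [1 3 2]);
b = permute(matrix2, [3 1 2]);
reference = sum(bsxfun(@and, a, b), 3) ./ sum(bsxfun(@or, a, b), 3);
sameResult = isequaln(output, reference);
```

This version never builds a 3-D intermediate: the largest arrays are the 2-D count matrices the same size as the output, at the cost of converting the logical inputs to double first.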

Answer 2

Sean de Wolski 2013-11-11

Edited: Sean de Wolski 2013-11-11

0 votes

Is output preallocated?

Before the loops:

output = gpuArray.zeros(x,y);

This should speed it up dramatically.

Sean de Wolski 2013-11-11

Edited: Sean de Wolski 2013-11-11

Do matrix1 and matrix2 already live on the GPU, i.e. are they gpuArrays?

Amr Ragab 2013-11-11

Yes, they live on the GPU as gpuArrays. It's all transferred over before this code snippet.
