Batched matrix multiplication with CUDA

Question

Peter Egli 2020-4-28

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/521345-batched-matrix-multiplicaion-with-cuda

Edited: Erik Meade 2020-5-5

Hi,

I saw that MATLAB R2020a adds new features to GPU Coder, in particular gpucoder.stridedMatrixMultiply. However, I don't understand how the batch is defined there. In the generated CUDA code shown in the example, the batch size is given as 1 (cf. the NVIDIA documentation). Also, the variables A, B and C are expected to be 2D, with the dimensions of the matrices to be processed.

How do I use the function correctly? I have a 3D array in MATLAB which holds many small matrices: A(:,:,1), A(:,:,2) and so on. The same applies to B. I would like to process them all at the same time using CUDA, that is, calculate A(:,:,1)*B(:,:,1), A(:,:,2)*B(:,:,2), etc. with a CUDA function. How can I achieve that with the new GPU Coder functionality? How do I call it from MATLAB?

Peter

Answer 1

Erik Meade 2020-5-5

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/521345-batched-matrix-multiplicaion-with-cuda#answer_430460

Edited: Erik Meade 2020-5-5

Hi Peter,

gpucoder.stridedMatrixMultiply does exactly what you want. You can pass the 3D arrays A and B directly to gpucoder.stridedMatrixMultiply, and it will multiply each pair of pages, A(:,:,i)*B(:,:,i).

A small example, say you have a function called stridedMultiply:

function c = stridedMultiply(a, b)
    c = gpucoder.stridedMatrixMultiply(a, b);
end

Then we can generate code for it and verify that the answer is correct with the following code:

% 3D array inputs: 100 pairs of 5x4 and 4x5 matrices
a = rand(5,4,100);
b = rand(4,5,100);
% Generate Code
codegen -config coder.gpuConfig('mex') -args {a, b} stridedMultiply
% Verify correctness
c_mex = stridedMultiply_mex(a, b);
c = zeros(size(c_mex));
for i = 1:100
   c(:,:,i) = a(:,:,i) * b(:,:,i); 
end
% Check MATLAB answer vs. stridedMatrixMultiply generated code
tolerance = 1e-8;
assert(all(abs(c(:) - c_mex(:)) < tolerance));
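If you would rather not write the reference loop by hand, a loop-free CPU reference is possible with pagemtimes. This is only a sketch and assumes MATLAB R2020b or later, where pagemtimes was introduced, so it is not available in R2020a. The variable names c_ref and c_loop are mine, not part of the example above.

```matlab
% Sketch: page-wise product on the CPU without an explicit loop.
% Assumes MATLAB R2020b or later (pagemtimes does not exist in R2020a).
a = rand(5,4,100);
b = rand(4,5,100);

% pagemtimes multiplies matching pages: c_ref(:,:,i) = a(:,:,i) * b(:,:,i)
c_ref = pagemtimes(a, b);

% The explicit loop, for comparison
c_loop = zeros(5,5,100);
for i = 1:100
    c_loop(:,:,i) = a(:,:,i) * b(:,:,i);
end

% Both references should agree up to floating-point rounding
assert(max(abs(c_ref(:) - c_loop(:))) < 1e-12);
```

Either result can then be compared against the output of the generated MEX function in the same way as above.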

If we look at the generated code, we will see that the batch size has been properly set to 100:

  cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N, 5,
    5, 4, (double *)gpu_alpha1, (double *)&(*gpu_a)[0], 5, 20, (double *)
    &(*gpu_b)[0], 4, 20, (double *)gpu_beta1, (double *)&(*gpu_c)[0], 5, 25, 100);
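To see where the numbers in that call come from, here is a small sketch that derives them from the array sizes. The names m, n, k, lda, ldb, ldc, strideA, strideB, strideC and batchCount follow the parameter names in the cuBLAS documentation for the strided batched GEMM. The mapping is my reading of the generated call and assumes no transposes (CUBLAS_OP_N) and tightly packed, column-major pages, which is how MATLAB stores arrays.

```matlab
% Sketch: derive the cublasDgemmStridedBatched size arguments from the inputs.
a = rand(5,4,100);
b = rand(4,5,100);

[m, k, batchCount] = size(a);   % m = 5, k = 4, batchCount = 100
n = size(b, 2);                 % n = 5

% Leading dimensions: the number of rows of each column-major page
lda = m;        % 5
ldb = k;        % 4
ldc = m;        % 5

% Strides: the number of elements between the starts of consecutive pages
strideA = m*k;  % 20
strideB = k*n;  % 20
strideC = m*n;  % 25

% Page i of a begins (i-1)*strideA elements after the first element of a
i = 3;
pageFromStride = reshape(a((i-1)*strideA + (1:strideA)), m, k);
assert(isequal(pageFromStride, a(:,:,i)));

fprintf('m=%d n=%d k=%d lda=%d strideA=%d ldb=%d strideB=%d ldc=%d strideC=%d batchCount=%d\n', ...
    m, n, k, lda, strideA, ldb, strideB, ldc, strideC, batchCount);
```

For the 5x4x100 and 4x5x100 inputs this gives 5, 5, 4, then 5 and 20 for a, 4 and 20 for b, 5 and 25 for c, and a batch count of 100, which matches the arguments in the generated call.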

With regard to the example on the doc page you cited: the input matrices in that example are both 2D, so there is only one batch to compute, and the parameter is therefore set to 1. gpucoder.stridedMatrixMultiply is mostly intended for 3D inputs: it multiplies along the first two dimensions only, so each page A(:,:,i) is multiplied by the matching page B(:,:,i). I understand how that example can be confusing, and we will look into updating it.
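The 2D case can be checked in plain MATLAB, without GPU Coder. The sketch below only illustrates the page-wise semantics described above. It does not call gpucoder.stridedMatrixMultiply, and the variable names a2, b2 and c2 are hypothetical.

```matlab
% Sketch: a 2D matrix behaves like a 3D array with a single page.
a2 = rand(5,4);
b2 = rand(4,5);

% size along the third dimension of a 2D matrix is 1, so the batch count is 1
batchCount = size(a2, 3);
assert(batchCount == 1);

% The page-wise loop then runs once and reduces to an ordinary product
c2 = zeros(5,5,batchCount);
for i = 1:batchCount
    c2(:,:,i) = a2(:,:,i) * b2(:,:,i);
end
assert(max(abs(c2(:) - reshape(a2*b2, [], 1))) < 1e-12);
```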

I hope that answers your question!