Vectorizing nonlinear matrix operation on many small matrices

1 次查看（过去 30 天）

显示更早的评论

Adam Shaw 2020-12-18

1
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/697645-vectorizing-nonlinear-matrix-operation-on-many-small-matrices

评论： Matt J 2020-12-19

采纳的回答： Matt J

在 MATLAB Online 中打开

I am trying to optimize the following generic matrix operation:

m = 3; % small number in general
n = 2^20; % large power of 2 in general
A = rand(m,n);
B = zeros(m^2,m^2);
for ii = 1:size(A,2)
    a = A(:,ii);
    r = a*a';
    B = B + kron(r,r);
end
% return B

On my computer the above takes ~7s. By compiling to a MEX file with MATLAB Coder I can improve this by ~15x. I have tried compiling to CUDA with GPU Coder, but this seems to be quite inefficient.

I think the difficulty comes from two different sources:

1) I am not sure of an efficient way to vectorize the creation of the "r" matrices from the columns of the A matrix, and so have to resort to the outer for loop approach

2) I think the Kronecker product is inefficient to implement on the gpu due to the small matrix size

The speedup from compiling to MEX is nice, but I just have this feeling that I am still doing something quite inefficiently. I would appreciate if anyone has any ideas on how to optimize the above calculation, either along the lines of the two difficulties I outlined above, or via a different approach.

2 个评论
显示无隐藏无

David Goodmanson 2020-12-19

在 MATLAB Online 中打开

Hi Adam,

if you replace

B = B + kron(r,r);

with

r = r(:);
BB = BB + r*r';

the loop runs about 5 times faster. (The actual substitution runs faster than that, but the nonchanged steps in the loop still of course have to be included).

Matt J 2020-12-19

@Adam,

It may be important to know what you plan to do with B, once you've computed it.

请先登录，再进行评论。

请先登录，再回答此问题。

采纳的回答

Matt J 2020-12-19

2
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/697645-vectorizing-nonlinear-matrix-operation-on-many-small-matrices#answer_579655

编辑：Matt J 2020-12-19

在 MATLAB Online 中打开

m = 3; % small number in general
n = 2^20; % large power of 2 in general
A = rand(m,n);
tic;
    B = zeros(m^2,m^2);
    for ii = 1:size(A,2)
        a = A(:,ii);
        r = a*a';
        B = B + kron(r,r);
    end
toc;
Elapsed time is 6.800329 seconds.
tic;
    C=reshape(A,m,1,n).*reshape(A,1,m,n);
    
    C=reshape(C,m^2,n);
    
    B=C*C.';
toc;
Elapsed time is 0.081757 seconds.

7 个评论
显示 5更早的评论隐藏 5更早的评论

Adam Shaw 2020-12-19

编辑：Adam Shaw 2020-12-19

在 MATLAB Online 中打开

I normally do

tic; CODE; wait(gpu); toc;

when evaluating gpu speed, which I believe is equivalent to doing gputimeit? I still am seeing about the same speedups using your gputimeit code above (using a 1660ti). I guess more specifically tcpu is ~40 ms, and tgpu is ~5 ms. Which is ~>1000x faster than the ~6 s the original method took.

Matt J 2020-12-19

Ah well, you have a really good GPU...!

请先登录，再进行评论。

类别

Parallel Computing Parallel Computing Toolbox GPU Computing

在 Help Center 和 File Exchange 中查找有关 GPU Computing 的更多信息

产品

版本

R2019b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by

Vectorizing nonlinear matrix operation on many small matrices

2 个评论
显示无隐藏无

采纳的回答

7 个评论
显示 5更早的评论隐藏 5更早的评论

更多回答（0 个）

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

Vectorizing nonlinear matrix operation on many small matrices

2 个评论 显示 无隐藏 无

采纳的回答

7 个评论 显示 5更早的评论隐藏 5更早的评论

更多回答（0 个）

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

2 个评论
显示无隐藏无

7 个评论
显示 5更早的评论隐藏 5更早的评论