Speed up big matrix multiplication (Parallel Processing/GPU)

Question

0 votes

Hello there,

Below is the code I want to run. The rand() calls are only there to keep the example simple; in my actual code the variables hold meaningful content.

For N = 1024 this takes about 2 hours to run on my machine. I've tried many things, e.g. precalculating the cosArgs.

N = 1024;
img = rand(N);
cosArg1 = rand(N^2,1);
cosArg2 = rand(N^2,1);
[q, p] = meshgrid(0:N-1, 0:N-1); % p and q are N-by-N index grids: p varies down the rows, q across the columns
recon = zeros(numel(img),1);
for k = 1:numel(img)
    a = img.*cos(cosArg1(k)*p).*cos(cosArg2(k)*q);
    recon(k) = sum(a(:)); % summing a vector is faster than sum(sum(...)) on a matrix, even though it needs the temporary a
end

Is there any clever way to speed this code up?

_______

I also just bought the Parallel Computing Toolbox to make it work with gpuArrays. This now takes about 17 min with a GTX 1060. The variables ending in GPU are just gpuArray casts of their originals.

EDIT: By first casting to single, I cut it down to 10 min.

Is there something I can do better?

cosArg1GPU = gpuArray(single(cosArg1));
cosArg2GPU = gpuArray(single(cosArg2));
imgDCTGPU = gpuArray(single(img)); % named to match its use in the loop below
reconGPU = gpuArray(single(recon));
pGPU = gpuArray(single(p));
qGPU = gpuArray(single(q));
for k = 1:numel(imgDCTGPU)
    % summing a vector is faster than sum(sum(...)) on a matrix, even though it needs the temporary a
    a = imgDCTGPU.*cos(cosArg1GPU(k)*pGPU).*cos(cosArg2GPU(k)*qGPU);
    reconGPU(k) = sum(a(:));
end
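
Since casting to single changes the arithmetic, it may be worth gauging how much accuracy is lost before relying on it. The following is only a minimal CPU-only sketch: it evaluates one entry of recon in both precisions for a single hypothetical pair of cosine arguments (c1 and c2 are assumptions, not values from the real data) and reports the difference scaled by the summed magnitudes.

```matlab
rng(2);                              % fixed seed so the comparison is repeatable
N = 1024;
img = rand(N);
c1 = rand;  c2 = rand;               % hypothetical cosine arguments for one entry
[q, p] = meshgrid(0:N-1, 0:N-1);     % p varies down rows, q across columns

% Same expression as in the loop body, once in double and once in single
aD = img .* cos(c1 * p) .* cos(c2 * q);
aS = single(img) .* cos(single(c1) * single(p)) .* cos(single(c2) * single(q));

sumD = sum(aD(:));
sumS = double(sum(aS(:)));

% Scale by the sum of magnitudes so cancellation in sumD does not distort the figure
scaledErr = abs(sumD - sumS) / sum(abs(aD(:)));
fprintf('double: %.6f  single: %.6f  scaled error: %.2e\n', sumD, sumS, scaledErr);
```

Whether the resulting error is acceptable depends on the application; the real cosine arguments may behave differently from these random ones.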

Answer 1

Edric Ellis 2019-1-11

Edited: Edric Ellis 2019-1-11

1 vote

You should take advantage of:

- implicit dimension expansion, and
- the new multi-dimension arguments to sum,

and then perform the calculation in chunks. The idea is that instead of looping over single pages, you calculate multiple pages simultaneously. I'm not sure how much better this is than your original case, though.

N = 1024;
img = rand(N, 'gpuArray');
cosArg1 = rand(1,1,N^2, 'gpuArray');
cosArg2 = rand(1,1,N^2, 'gpuArray');
[q, p] = meshgrid(gpuArray(0:N-1), gpuArray(0:N-1));
recon = zeros(numel(img),1, 'gpuArray');
chunk = 128; % Might need to reduce this if it takes too much memory
tic
for k = 1:chunk:numel(img)
    range = k:(k+chunk-1);
    % The following line relies on implicit dimension expansion
    % to calculate "chunk" pages of "a" simultaneously
    a = img .* cos(p .* cosArg1(1,1,range)) .* cos(q .* cosArg2(1,1,range));        
    % Use the vector syntax of SUM to reduce to a 1x1xchunk "vector", 
    % and assign into "recon"
    recon(range) = sum(a, [1 2]);
end
toc
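
To see that the chunked formulation computes the same values as the original loop, a small CPU-only check can help. This is a sketch under assumptions: N is kept small for illustration, the chunk size is deliberately chosen so that it does not divide N^2 evenly, and the min() clamp on the last chunk is an addition that the code above does not need because 128 divides 1024^2. It requires implicit expansion (R2016b or later) and the vector-dimension form of sum (R2018b or later).

```matlab
rng(0);                              % fixed seed so the check is repeatable
N = 16;                              % small size, for illustration only
K = N^2;                             % number of output entries
img = rand(N);
cosArg1 = rand(K, 1);
cosArg2 = rand(K, 1);
[q, p] = meshgrid(0:N-1, 0:N-1);     % p varies down rows, q across columns

% Reference: the original loop, one page per iteration
reconLoop = zeros(K, 1);
for k = 1:K
    a = img .* cos(cosArg1(k) * p) .* cos(cosArg2(k) * q);
    reconLoop(k) = sum(a(:));
end

% Chunked version: move the k index into the third dimension
c1 = reshape(cosArg1, 1, 1, []);
c2 = reshape(cosArg2, 1, 1, []);
chunk = 100;                         % does not divide K = 256 evenly
reconChunk = zeros(K, 1);
for k = 1:chunk:K
    range = k:min(k + chunk - 1, K); % clamp the last chunk
    % Implicit expansion: N-by-N times 1-by-1-by-chunk gives N-by-N-by-chunk
    a = img .* cos(p .* c1(1, 1, range)) .* cos(q .* c2(1, 1, range));
    % Sum over dimensions 1 and 2, then flatten to a column
    reconChunk(range) = reshape(sum(a, [1 2]), [], 1);
end

maxAbsDiff = max(abs(reconLoop - reconChunk));
fprintf('largest absolute difference: %.2e\n', maxAbsDiff);
```

The temporary a holds N^2 * chunk elements, which is why the chunk size has to be balanced against the available GPU memory.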

Andreas Dorner 2019-1-11

That cut it in half. Thank you very much. I also added arrayfun, which shaved off a few more seconds:

% Element-wise kernel passed to arrayfun
% (cosArg1GPU and cosArg2GPU are assumed to be 1-by-1-by-nEntries here, as in the answer above)
reconPixel = @(imgDCTGPU, icosArg1GPU, icosArg2GPU, pGPU, qGPU) imgDCTGPU.*cos(pGPU.*icosArg1GPU).*cos(qGPU.*icosArg2GPU);
for k = 1:chunk:nEntries
    lim = min(k+chunk-1, nEntries); % clamp the last chunk to the number of entries
    range = k:lim;

    % Use the vector syntax of SUM to reduce to a 1x1xchunk "vector",
    % and assign into "recon"
    reconGPU(range) = sum(arrayfun(reconPixel, imgDCTGPU, cosArg1GPU(range), cosArg2GPU(range), pGPU, qGPU), [1,2]);
end
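
One further possibility, which nobody in this thread proposed or timed: p depends only on the row index and q only on the column index, so each entry factors as recon(k) = u' * img * v, with u = cos(cosArg1(k)*(0:N-1)') and v = cos(cosArg2(k)*(0:N-1)'). That needs 2N cosine evaluations per entry instead of 2N^2, and the remaining work becomes a matrix product. The sketch below checks this identity on the CPU with a small N; the names n, C1, C2 and reconSep are hypothetical, and no speed-up is claimed for the real data or for the GPU.

```matlab
rng(1);                              % fixed seed so the check is repeatable
N = 32;                              % small size, for illustration only
K = N^2;
img = rand(N);
cosArg1 = rand(K, 1);
cosArg2 = rand(K, 1);
n = (0:N-1).';                       % the values found along rows of p and columns of q

% Reference: the original loop
[q, p] = meshgrid(0:N-1, 0:N-1);
reconLoop = zeros(K, 1);
for k = 1:K
    a = img .* cos(cosArg1(k) * p) .* cos(cosArg2(k) * q);
    reconLoop(k) = sum(a(:));
end

% Separable form, still chunked so C1 and C2 stay small
chunk = 300;
reconSep = zeros(K, 1);
for k = 1:chunk:K
    range = k:min(k + chunk - 1, K);
    C1 = cos(cosArg1(range) * n.');  % chunk-by-N, row m holds cos(cosArg1(range(m)) * (0:N-1))
    C2 = cos(cosArg2(range) * n.');  % chunk-by-N, same layout for cosArg2
    % (C1*img)(m,j) sums over rows of img; weighting by C2 and summing over j finishes entry m
    reconSep(range) = sum((C1 * img) .* C2, 2);
end

maxAbsDiff = max(abs(reconLoop - reconSep));
fprintf('largest absolute difference: %.2e\n', maxAbsDiff);
```

The same lines should accept gpuArray inputs, since cos, matrix multiplication and sum all support them, but the chunk size would again need tuning to the available memory.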
