Make double loop run on GPU

3 次查看(过去 30 天)
Hi
I have the following code that I want to run somehow on the GPU.
In short, I have a 2D array (an MRI image). Starting from a user-defined centre-point, the the image is divided into radial concentric "rings" of specified width (f.eks. 0.1mm) and total number of pixels and total signal intensity (pixel value) in each ring are recorded in separate output arrays (called Profile and SumBinSignal). The code below does this using 2 for loops.
for X= Xmin:1:Xmax
for Y= Ymin:1:Ymax
if ~isnan(Image(Y,X)) %NaN pixels are just ignored
Dist = Resolution * ( (X-CentX)^2 + (Y-CentY)^2 )^0.5; %distance of point to centre measured in mm
TargetBin= ceil(Dist * BinDensity); %convert distance from mm to "bins/rings"
if (TargetBin>=1) && (TargetBin<=MaxTargetBin)
Profile(TargetBin)= Profile(TargetBin)+1; %record number of pixels in bin/ring
SumBinSignal(TargetBin)= SumBinSignal(TargetBin) + Image(Y, X); % record total signal in ring/bin
end
end
end
end
Now I know that in order to run on the GPU, I either must vectorize the calculations or use "arrayfun". But I really cant manage to do either. I maganged to vectorize the calculations only partly with no great speed advantage. The arrayfun approach I dont know even how to start with.
Help will be highly appreciated
best regards, Rozh

采纳的回答

Bruno Luong
Bruno Luong 2023-9-6
编辑:Bruno Luong 2023-9-6
GPU is the slowest on my PC. I knew for-loop is now away fast, but I'm still amazed by how fast it is. There is really a remarkable progress made by TMW on Excution Engine last few releases.
  • For-loop time = 0.010799
  • Vectorized CPU time = 0.012376
  • Vectorized GPU time = 0.069613
Code:
N = 1000;
Image = peaks(N);
Xmin = 1; Xmax = N;
Ymin = 1; Ymax = N;
CentX = 0;
CentY = 0;
Resolution = 1/N;
BinDensity = 100;
MaxTargetBin = 100;
tic
Profile = zeros(MaxTargetBin,1);
SumBinSignal = zeros(MaxTargetBin,1);
for X=Xmin:1:Xmax
for Y=Ymin:1:Ymax
if ~isnan(Image(Y,X)) %NaN pixels are just ignored
Dist = Resolution * ( (X-CentX)^2 + (Y-CentY)^2 )^0.5; %distance of point to centre measured in mm
TargetBin= ceil(Dist * BinDensity); %convert distance from mm to "bins/rings"
if (TargetBin>=1) && (TargetBin<=MaxTargetBin)
Profile(TargetBin)= Profile(TargetBin)+1; %record number of pixels in bin/ring
SumBinSignal(TargetBin)= SumBinSignal(TargetBin) + Image(Y, X); % record total signal in ring/bin
end
end
end
end
fprintf('For-loop time = %1.6f\n', toc)
For-loop time = 0.071178
% Vectorized code, CPU
tic
X = Xmin:Xmax;
Y = (Ymin:Ymax)';
Dist = Resolution * sqrt( (X-CentX).^2 + (Y-CentY).^2 );
TargetBin= ceil(Dist * BinDensity);
Im = Image;
keep = (TargetBin>=1) & (TargetBin<=MaxTargetBin) & ~isnan(Im);
TargetBin = TargetBin(keep);
Profile2 = accumarray(TargetBin, 1, [MaxTargetBin,1]);
if ~isequal([length(Y) length(X)], size(Im))
Im = Image(Y,X); % indexing is costly
end
SumBinSignal2 = accumarray(TargetBin, Im(keep), [MaxTargetBin,1]);
fprintf('Vectorized CPU time = %1.6f\n', toc)
Vectorized CPU time = 0.031047
if gpuDeviceCount("available")
% Vectorized code, GPU
tic
X = gpuArray(Xmin:Xmax);
Y = gpuArray((Ymin:Ymax))';
Dist = Resolution * sqrt( (X-CentX).^2 + (Y-CentY).^2 );
TargetBin= ceil(Dist * BinDensity);
Im = gpuArray(Image);
keep = (TargetBin>=1) & (TargetBin<=MaxTargetBin) & ~isnan(Im);
TargetBin = TargetBin(keep);
Profile3 = gather(accumarray(TargetBin, 1, [MaxTargetBin,1]));
if ~isequal([length(Y) length(X)], size(Im))
Im = Im(Y,X); % indexing is costly
end
SumBinSignal3 = gather(accumarray(TargetBin, Im(keep), [MaxTargetBin,1]));
fprintf('Vectorized GPU time = %1.6f\n', toc)
end
% Compare results
%norm(Profile-Profile2)
norm(SumBinSignal-SumBinSignal2)
ans = 0
if gpuDeviceCount("available")
%norm(Profile-Profile3)
norm(SumBinSignal-SumBinSignal3)
end
  2 个评论
Sam Marshalik
Sam Marshalik 2023-9-6
To add to Bruno's thorough response: In order to maximize your GPU, you will want to:
  • Run only computationally intensive things on the GPU
  • Run computations that do not require communication with other components on the motherboard (RAM, storage, etc.)
It seems that in this case the problem you are working on is not computationally intensive enough to really make use of the GPU. I imagine if you were to bump up the problem size you would eventually see the benefit of using a GPU.
Rozh Al-Mashhdi
Rozh Al-Mashhdi 2023-9-6
Thanks alot Bruno Luong for the very detailed answer. That has saved me several days of just staring at the problem. Had no idea about "accumarray" which was just what I needed to complete the vectorization. And also, my hat off for your skills and quick understading of the problem (and my code).
On my setup (and size of images) the vectorized GPU is a bit faster than the CPU, so I am very pleased. Thanks again.
Also thanks to Sam Marshalik for the good advice - you are right, the GPU makes more sense for my "hires" images only (i.e. large amout of pixels)

请先登录,再进行评论。

更多回答(0 个)

类别

Help CenterFile Exchange 中查找有关 Get Started with GPU Coder 的更多信息

产品


版本

R2023a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by