How to speed up our code to be implemented on GPU
Hello, I previously built a MEX file from my code to speed it up on the GPU. It ran about 5 times faster, and I would like to know whether there is a way to make it faster still. Here is my code:
function BPmimo2C(Efield) %#codegen
coder.gpu.kernelfun;
image = complex(zeros(17,54,54));
%% creating kaiser window
numT = 16;
numR= 16;
f = 10e9:0.5e9:20e9;
numF = numel(f);
w = ones(numel(f),1);
viq = repmat(w.', [1,numT*numR]);
c = physconst('LightSpeed');
%% grid points
xf = (-8:0.3:8)*0.01;
yf = (-8:0.3:8)*0.01;
[uf , vf] = meshgrid(xf,yf);
x1f = uf(:);
y1f = vf(:);
%% initialization
ArrRadius = 30;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
K = 2*pi*f/c;
z = 0.36:0.003:0.41;
% z = 0.4;
for dep = 1:numel(z)
%% making the matrix of <transmitter-grid point> distance
XYPos = [TX.' TY.' ones(size(TX,2),1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dtXYUV = pdist2( XYPos, UVPos);
dtXYUV2 = zeros(numR,numel(x1f(:)),numel(TX)); % 3-D, to match the dtXYUV2(:,:,i) indexing in the loop below
expTerm1 = bsxfun(@times,dtXYUV(:)' , K');
expT1 = reshape(expTerm1,[numel(K),numel(TX),numel(x1f)]);
expT2 = zeros(numel(K),numR,numel(x1f),numel(TX));
for i = 1:numel(TX)
expT2(:,:,:,i) = repmat(expT1(:,i,:),[1 numR 1]);
dtXYUV2(:,:,i) = repmat(dtXYUV(i,:),[numR,1]);
end
expT = permute(reshape(permute(expT2,[1 3 2 4]),[numel(K),numel(x1f),numR*numel(TX)]),[1 3 2]);
%% making the matrix of <receiver-grid point> distance
XYPos = [real(Efield(1:numR,2,1)) , real(Efield(1:numR,3,1)), ones(numR,1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dXYUV = pdist2( XYPos, UVPos);
expTerm1 = bsxfun(@times,dXYUV(:)' , K');
expR = repmat(reshape(expTerm1,[numel(K),numR,numel(x1f)]),[1 numel(TX) 1]);
%% making the exponential term
EXP = exp(1i*(expT + expR));
EXP2 = reshape(EXP,[numel(K)*numel(TX)*numR,numel(x1f)]);
Efield2 = reshape(permute(Efield(1:numT*numR,:,:),[3 1 2]),[numel(f)*numT*numR,6]);
image2 = reshape(((viq.').*Efield2(:,6)).'*EXP2,[sqrt(numel(x1f)),sqrt(numel(x1f))]);
%% gather to change matrix from GPU-array to normal array
image(dep,:,:) = image2;
end
image = abs(image);
uf = repmat(reshape(uf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
vf = repmat(reshape(vf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
hf = uf;
for j = 1:numel(z)
hf(j,:,:) = z(j);
end
figure(1);
er = squeeze((image(13,:,:)));
h = surf(squeeze(uf(1,:,:)),squeeze(vf(1,:,:)),er);
colormap(jet);
set(h,'LineStyle','none');
view(2);
end
In addition to the speed issue, it sometimes fails with an "out of memory" error because some of the arrays are very large. I could implement it with multiple nested "for" loops; however, I found that it is faster on the CPU when I use MATLAB's matrix multiplication capability. Therefore, I preferred matrix-based code to multiple nested "for" loops.
Any advice, whether it would be general or specific, would be appreciated.
Thank you
2 Comments
Joss Knight
2024-7-8
Can I just check that you are aware that you do not need to use Code Generation to accelerate your code on GPU? You only need to adapt your code to use gpuArray data. GPU Coder can be useful for converting code that must be written as a loop; but if you can vectorize your loops and turn them into matrix, vector or pagewise operations instead, you could get better performance without needing to use coder intrinsics or configure a compiler.
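As a rough illustration of the gpuArray route described above, here is a minimal sketch. It is not the original BPmimo2C function: the sizes, the path lengths in d, and the measurements in E are made up, and it assumes R2019b or later (for canUseGPU) plus Parallel Computing Toolbox when a GPU is actually used. Without a GPU it simply runs on the CPU.

```matlab
% Sketch only: a phase-sum of the same shape as the question's code,
% run on gpuArray data with no code generation. All data is hypothetical.
numK = 21; numCh = 8; numPix = 100;
K = 2*pi*linspace(10e9, 20e9, numK).' / 299792458;        % numK-by-1 wavenumbers
d = 0.3 + 0.1*rand(numCh, numPix);                         % made-up path lengths (m)
E = complex(randn(numK*numCh, 1), randn(numK*numCh, 1));   % made-up measurements

useGPU = canUseGPU();            % true only if a supported GPU is available
if useGPU
    K = gpuArray(K); d = gpuArray(d); E = gpuArray(E);
end

% Implicit expansion: (numK-by-1) .* (1-by-numCh-by-numPix)
% gives a numK-by-numCh-by-numPix phase array without repmat.
phase = K .* reshape(d, [1, numCh, numPix]);
A = reshape(exp(1i*phase), [numK*numCh, numPix]);
img = abs(E.' * A);              % 1-by-numPix, computed wherever the data lives

if useGPU
    img = gather(img);           % bring the result back to host memory
end
```

Once the inputs are gpuArray objects, the element-wise operations and the matrix product run on the GPU; the rest of the script is unchanged.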
Accepted Answer
Umar
2024-7-3
Hi Moh,
Please see my suggestions below to help you out. I did analyze your code to identify any potential bottlenecks or areas for optimization.
1. Your code initializes a complex array, image, with dimensions 17x54x54 to store the results of the calculations.
2. It sets up the window vector w. The comment calls it a Kaiser window, but w is currently all ones.
3. It defines the grid points xf and yf using a range and step size.
4. It initializes the variables and arrays needed for the later calculations.
5. It starts a loop that iterates over the values of z.
6. Within the loop, it calculates the distances between the transmitters and the grid points (dtXYUV) and replicates them into dtXYUV2.
7. It calculates the exponential term expT from those distances and the wave number.
8. It calculates the distances between the receivers and the grid points (dXYUV) and uses them to build expR.
9. It combines expT and expR into the overall exponential term EXP.
10. It reshapes and rearranges the arrays so that the image can be obtained with a matrix multiplication.
11. It stores the result in the image array.
12. It repeats steps 5-11 for each value of z.
13. It takes the absolute value of the image array to obtain the final image.
14. It plots the image using the surf function.
Now, to optimize the code for speed, there are several suggestions to consider. First, preallocate arrays at their final dimensions (for example with zeros) rather than letting them grow inside a loop; resizing an array on every iteration slows the code down. Second, vectorize calculations whenever possible. By using MATLAB's matrix multiplication capability, you can perform calculations more efficiently and avoid explicit loops, which can significantly improve speed. Third, look for redundant calculations or unnecessary operations that can be eliminated, such as quantities that are recomputed on every loop iteration even though they do not change. Finally, if your system has multiple CPU cores, consider using MATLAB's parallel computing capabilities to distribute the workload.
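One concrete case of trimming intermediate work is sketched below, on small made-up sizes. It builds the combined transmitter-plus-receiver phase term twice: once by tiling both distance tables first, and once with implicit expansion (R2016b or later), which skips the tiled copies. The names dT, dR, and the sizes are illustrative and do not come from the original code; the final phase array is the same size either way, so this removes temporaries only.

```matlab
% Small made-up sizes so the comparison runs instantly.
numK = 3; numT = 4; numR = 5; numPix = 6;
K  = (1:numK).';                  % stand-in wavenumbers, numK-by-1
dT = rand(numT, numPix);          % transmitter-to-pixel distances
dR = rand(numR, numPix);          % receiver-to-pixel distances

% Tiled version: expand both tables to numR*numT rows before adding.
dTrep = repelem(dT, numR, 1);     % each transmitter row repeated numR times
dRrep = repmat(dR, numT, 1);      % whole receiver block repeated numT times
phaseRep = K .* reshape(dTrep + dRrep, [1, numR*numT, numPix]);

% Implicit expansion: singleton dimensions are broadcast, no tiled copies.
dSum = reshape(dR, [numR, 1, numPix]) + reshape(dT, [1, numT, numPix]);
phaseExp = K .* reshape(dSum, [1, numR*numT, numPix]);

% Both orderings put the receiver index first (fastest), so they agree.
maxDiff = max(abs(phaseRep(:) - phaseExp(:)));
```

Checking maxDiff on a small case like this is a cheap way to confirm that a rewritten indexing scheme still pairs each transmitter with each receiver in the same order.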
In terms of memory management, reducing array sizes where possible can help address "out of memory" errors. Adjusting step sizes or grid point ranges can help minimize memory usage and prevent these errors from occurring. Using data types with smaller memory footprints, such as single precision instead of double precision, can also help conserve memory. If memory limitations are still a concern, consider splitting calculations into smaller chunks and processing them sequentially to avoid exceeding available memory.
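The chunking and single-precision ideas above can be sketched as follows. This is an illustration under assumptions, not a drop-in replacement: the data is random, and chunk (the number of grid points processed at once) is a hypothetical tuning parameter. Only one chunk of the large exponential matrix exists in memory at a time.

```matlab
% Sketch: process the grid points in chunks, in single precision.
numK = 21; numCh = 16; numPix = 500;
chunk = 128;                                              % hypothetical chunk size
K = single(2*pi*linspace(10e9, 20e9, numK).' / 299792458);
d = single(0.3 + 0.1*rand(numCh, numPix));                % made-up path lengths (m)
E = complex(single(randn(numK*numCh, 1)), single(randn(numK*numCh, 1)));

img = zeros(1, numPix, 'single');                         % preallocate the result once
for first = 1:chunk:numPix
    idx = first:min(first + chunk - 1, numPix);           % grid points in this chunk
    phase = K .* reshape(d(:, idx), [1, numCh, numel(idx)]);
    A = reshape(exp(1i*phase), [numK*numCh, numel(idx)]); % only chunk-sized
    img(idx) = abs(E.' * A);
end
```

A smaller chunk lowers peak memory at the cost of more, smaller matrix products, so the value is worth tuning against the memory actually available.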
By implementing these optimizations and memory management techniques, you can improve both the speed and memory usage of your code significantly.
2 Comments
Umar
2024-7-6
Moved: Walter Roberson
2024-7-8
Hi Moh Mor,
Have you considered reaching out to MathWorks support for further assistance? Provide them with detailed information about your system configuration, MATLAB version, and the steps leading to the internal error.
More Answers (2)
Chao Luo
2024-7-3
The generated code is already quite well optimized for GPU. I tried rewriting the code using explicit for-loops, which results in similar performance. On top of that, I converted the data type from double to single, which speeds up the execution about 10 times. Do the conversion if single precision is good enough for you. Here is the code I rewrote, with the plotting part removed, for your reference:
function image = BPmimo2C4(Efield) %#codegen
coder.gpu.kernelfun;
%% creating kaiser window
numT = 16;
numR= 16;
f = 10e9:0.5e9:20e9;
numF = numel(f);
w = ones(numel(f),1);
viq = repmat(w.', [1,numT*numR]);
c = physconst('LightSpeed');
%% grid points
xf = (-8:0.3:8)*0.01;
yf = (-8:0.3:8)*0.01;
[uf , vf] = meshgrid(xf,yf);
x1f = uf(:);
y1f = vf(:);
%% initialization
ArrRadius = 30;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
K = 2*pi*f/c;
z = 0.36:0.003:0.41;
Efield2 = reshape(permute(Efield(1:numT*numR,:,:),[3 1 2]),[numel(f)*numT*numR,6]); % 5376x6
Efield2_6 = single(Efield2(:,6).');
% z = 0.4;
XYPos1 = single([TX.', TY.']);
UVPos = single([x1f(:), y1f(:)]);
dtXYUV1 = pdist2(XYPos1, UVPos);
XYPos2 = single([real(Efield(1:numR,2,1)) , real(Efield(1:numR,3,1))]);
dtXYUV2 = pdist2(XYPos2, UVPos);
EXP = coder.nullcopy(single((ones(21,16,16,17,2916) * 1i)));
for f_idx = 1:numel(x1f)
for dep = 1:17
for r_idx = 1:numR
for t_idx = 1:numel(TX)
for k_idx = 1:numel(K)
z2 = z(dep) * z(dep);
dt1 = dtXYUV1(r_idx,f_idx) * dtXYUV1(r_idx,f_idx) + z2;
dt1 = sqrt(dt1);
dt2 = dtXYUV2(t_idx,f_idx) * dtXYUV2(t_idx,f_idx) + z2;
dt2 = sqrt(dt2);
expV = exp((dt1 + dt2) * K(k_idx) * 1i);
EXP(k_idx, t_idx, r_idx, dep, f_idx) = expV;
end
end
end
end
end
EXP_resh = reshape(EXP, [21*16*16, 17*2916]);
image = Efield2_6 * EXP_resh;
image = reshape(image, [17,54,54]);
end
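The loop version above calls pdist2 only once per antenna set, on the planar (x, y) coordinates, and then adds the depth with sqrt(d^2 + z^2). The small check below, on made-up points, shows why that matches the full 3-D distance when the antennas sit at height z and the grid points at height 0. It uses implicit expansion instead of pdist2 so that it runs without Statistics and Machine Learning Toolbox; P, Q, and z are illustrative values.

```matlab
% Made-up planar coordinates: 4 antennas, 7 grid points.
P = rand(4, 2);                  % antenna (x, y) positions
Q = rand(7, 2);                  % grid-point (x, y) positions
z = 0.38;                        % antenna height above the grid plane

dx = P(:,1) - Q(:,1).';          % 4-by-7 differences in x
dy = P(:,2) - Q(:,2).';          % 4-by-7 differences in y

d2 = sqrt(dx.^2 + dy.^2);        % planar distances, computed once
d3 = sqrt(dx.^2 + dy.^2 + z^2);  % full 3-D distances for this depth

d3fromd2 = sqrt(d2.^2 + z^2);    % depth added afterwards, as in the loop
err = max(abs(d3(:) - d3fromd2(:)));
```

Because d2 does not depend on z, it can be computed once outside the depth loop and reused for every depth.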
8 Comments
Chao Luo
2024-7-10
R2018b is old enough that I cannot debug it or give you a workaround. Is it possible for you to upgrade MATLAB to at least R2019b?