How to speed up our code to be implemented on GPU

Hello, I previously generated a MEX file from my code to speed up its execution on the GPU. It became about 5 times faster, and I would like to know whether there is a way to make it faster still. Here is my code:
function BPmimo2C(Efield) %#codegen
coder.gpu.kernelfun;
image = complex(zeros(17,54,54));
%% creating kaiser window
numT = 16;
numR= 16;
f = 10e9:0.5e9:20e9;
numF = numel(f);
w = ones(numel(f),1);
viq = repmat(w.', [1,numT*numR]);
c = physconst('LightSpeed');
%% grid points
xf = (-8:0.3:8)*0.01;
yf = (-8:0.3:8)*0.01;
[uf , vf] = meshgrid(xf,yf);
x1f = uf(:);
y1f = vf(:);
%% initialization
ArrRadius = 30;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
K = 2*pi*f/c;
z = 0.36:0.003:0.41;
% z = 0.4;
for dep = 1:numel(z)
%% making the matrix of <transmitter-grid point> distance
XYPos = [TX.' TY.' ones(size(TX,2),1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dtXYUV = pdist2( XYPos, UVPos);
dtXYUV2 = zeros(numR,numel(x1f),numel(TX)); % preallocate all three dimensions written in the loop below
expTerm1 = bsxfun(@times,dtXYUV(:)' , K');
expT1 = reshape(expTerm1,[numel(K),numel(TX),numel(x1f)]);
expT2 = zeros(numel(K),numR,numel(x1f),numel(TX));
for i = 1:numel(TX)
expT2(:,:,:,i) = repmat(expT1(:,i,:),[1 numR 1]);
dtXYUV2(:,:,i) = repmat(dtXYUV(i,:),[numR,1]);
end
expT = permute(reshape(permute(expT2,[1 3 2 4]),[numel(K),numel(x1f),numR*numel(TX)]),[1 3 2]);
%% making the matrix of <receiver-grid point> distance
XYPos = [real(Efield(1:numR,2,1)) , real(Efield(1:numR,3,1)), ones(numR,1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dXYUV = pdist2( XYPos, UVPos);
expTerm1 = bsxfun(@times,dXYUV(:)' , K');
expR = repmat(reshape(expTerm1,[numel(K),numR,numel(x1f)]),[1 numel(TX) 1]);
%% making the exponential term
EXP = exp(1i*(expT + expR));
EXP2 = reshape(EXP,[numel(K)*numel(TX)*numR,numel(x1f)]);
Efield2 = reshape(permute(Efield(1:numT*numR,:,:),[3 1 2]),[numel(f)*numT*numR,6]);
image2 = reshape(((viq.').*Efield2(:,6)).'*EXP2,[sqrt(numel(x1f)),sqrt(numel(x1f))]);
%% gather to change matrix from GPU array to normal array
image(dep,:,:) = image2;
end
image = abs(image);
uf = repmat(reshape(uf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
vf = repmat(reshape(vf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
hf = uf;
for j = 1:numel(z)
hf(j,:,:) = z(j);
end
figure(1);
er = squeeze((image(13,:,:)));
h = surf(squeeze(uf(1,:,:)),squeeze(vf(1,:,:)),er);
colormap(jet);
set(h,'LineStyle','none');
view(2);
end
In addition to the speed issue, the code sometimes fails with an "out of memory" error because some of the arrays are very large. I could implement it with multiple nested "for" loops, but I understood that on the CPU it is faster to use MATLAB's matrix multiplication capability, so I preferred matrix-based code over nested "for" loops.
Any advice, general or specific, would be appreciated.
Thank you
2 Comments
Joss Knight 2024-7-8
Can I just check that you are aware that you do not need code generation to accelerate your code on the GPU? You only need to adapt your code to use gpuArray data. GPU Coder can be useful for converting code that must be written as a loop, but if you can vectorize your loops into matrix, vector, or pagewise operations instead, you could get better performance without needing to use coder intrinsics or configure a compiler.
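As a rough illustration of the gpuArray approach described here, the sketch below creates data on the GPU and uses ordinary vectorized operations, with no code generation involved. It assumes Parallel Computing Toolbox and a supported GPU are available; the sizes, random data, and variable names are hypothetical stand-ins, not the data from this thread.

```matlab
% Hypothetical sizes and random stand-in data (not the thread's measurements).
numPairs = 256;                                  % TX/RX pairs
numPts   = 2916;                                 % grid points at one depth
c = 299792458;                                   % speed of light, m/s
K = gpuArray(single(2*pi*(10e9:0.5e9:20e9)/c)).';           % numK x 1 wavenumbers
numK = numel(K);
d = 0.3 + 0.1*rand(numPairs, numPts, 'single', 'gpuArray'); % path lengths, m
s = complex(randn(numK, numPairs, 'single', 'gpuArray'), ...
            randn(numK, numPairs, 'single', 'gpuArray'));   % measured field stand-in

img = complex(zeros(1, numPts, 'single', 'gpuArray'));
for k = 1:numK
    % One frequency per iteration keeps the temporary at numPairs x numPts.
    img = img + s(k, :) * exp(1i * K(k) * d);
end
img = gather(abs(img));                          % copy the result back to host memory
```

Looping over frequency while vectorizing over antenna pairs and grid points is only one possible split; which dimension to loop over depends on how much GPU memory is available.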
moh mor 2024-7-9
Thank you @Joss Knight ,
Actually, I implemented this computation in Python using a friend's code. I first had to import the "dll" file of my function and then install CUDA and gcc on my computer. It was much faster than my version. Furthermore, it did not have any "out of memory" problem, whereas I cannot increase the size of my arrays as much as I want. I am trying to overcome this problem in my code. I previously implemented my arrays using gpuArray and saw the speed of my function increase, but I think it is not enough.

Accepted Answer

Umar 2024-7-3
Hi Moh,
Please see my suggestions below. I analyzed your code to identify potential bottlenecks and areas for optimization.
1. Your code initializes a complex array image with dimensions 17x54x54, which stores the results of the calculations.
2. A window vector w is created (the code comment calls it a Kaiser window, although w is currently all ones).
3. Grid points xf and yf are defined using a range and step size.
4. The code initializes variables and arrays for further calculations.
5. A loop iterates over the values of z.
6. Within the loop, the code calculates the distance between transmitters and grid points (dtXYUV) and stores it in dtXYUV2.
7. The code then calculates the exponential term expT from the distance and the wave number.
8. Next, the code calculates the distance between receivers and grid points (dXYUV) and stores the resulting phase term in expR.
9. The terms expT and expR are combined to calculate the overall exponential term EXP.
10. The code reshapes and rearranges the arrays to perform a matrix multiplication and obtain the image for the current depth.
11. That image is stored in the image array.
12. The code repeats steps 5-11 for each value of z.
13. The final image is obtained by taking the absolute value of the image array.
14. The code plots the image using the surf function.
To optimize the code for speed, there are several suggestions to consider. First, preallocate each array at its final size (for example with zeros) before the loop, so that it never has to be resized during loop iterations; growing an array inside a loop slows the code down. Second, vectorize calculations wherever possible: expressing the work as matrix operations lets MATLAB use its optimized matrix multiplication and avoids explicit loops, which can significantly improve speed. Third, look for redundant calculations or unnecessary operations that can be eliminated, such as values that are recomputed in every loop iteration even though they do not change between iterations. Finally, if your system has multiple CPU cores, consider using MATLAB's parallel computing capabilities to distribute the workload and take advantage of the available processing power.
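As a small sketch of removing a redundant calculation: in the posted loop, the horizontal part of each antenna-to-grid-point distance does not depend on z, so it could be computed once outside the depth loop. The antenna positions, grid, and variable names (ant, pts, rho2) below are hypothetical and chosen only for illustration; the inner loops exist solely to check the result against a direct computation.

```matlab
% Hypothetical antenna and grid coordinates in metres (illustration only).
ant = [0.30 0.00; 0.00 0.30; -0.30 0.00];   % 3 antennas, columns are [x y]
[u, v] = meshgrid((-8:4:8)*0.01);           % small 5x5 grid
pts = [u(:), v(:)];                         % 25 grid points, columns are [x y]
z = 0.36:0.01:0.40;                         % depths

% Computed once: squared horizontal distance, antennas x grid points.
% Uses implicit expansion (R2016b or later).
rho2 = (ant(:,1) - pts(:,1).').^2 + (ant(:,2) - pts(:,2).').^2;

maxErr = 0;
for dep = 1:numel(z)
    d = sqrt(rho2 + z(dep)^2);              % full 3-D distance for this depth
    % Verification only: compare with a direct point-by-point distance.
    for a = 1:size(ant, 1)
        for p = 1:size(pts, 1)
            dRef = norm([ant(a,:), z(dep)] - [pts(p,:), 0]);
            maxErr = max(maxErr, abs(d(a,p) - dRef));
        end
    end
end
```

Only the cheap sqrt(rho2 + z^2) step remains inside the depth loop, so the pairwise distance computation is no longer repeated for every depth.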
In terms of memory management, reducing array sizes where possible can help address "out of memory" errors. Adjusting step sizes or grid point ranges can help minimize memory usage and prevent these errors from occurring. Using data types with smaller memory footprints, such as single precision instead of double precision, can also help conserve memory. If memory limitations are still a concern, consider splitting calculations into smaller chunks and processing them sequentially to avoid exceeding available memory.
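The chunking idea can be sketched as follows: the grid points are processed in blocks so that only one block of the exponential matrix exists at a time, and everything is kept in single precision. The sizes mirror the ones in this thread, but the random data, the chunk size, and the stand-in phase model (rowTerm * ptTerm) are assumptions made purely for illustration.

```matlab
% Hypothetical data standing in for the real measurements.
numRows = 5376;                 % frequencies x TX x RX (21*16*16)
numPts  = 2916;                 % grid points at one depth (54*54)
chunk   = 512;                  % assumed chunk size; tune to the available memory
sig     = complex(randn(1, numRows, 'single'), randn(1, numRows, 'single'));
rowTerm = rand(numRows, 1, 'single');    % stand-in for the per-row phase factor
ptTerm  = rand(1, numPts, 'single');     % stand-in for the per-point phase factor

img = complex(zeros(1, numPts, 'single'));
for first = 1:chunk:numPts
    idx = first:min(first + chunk - 1, numPts);
    % Only a numRows x numel(idx) block is ever held in memory.
    E = exp(1i * (rowTerm * ptTerm(idx)));
    img(idx) = sig * E;
end
img = abs(img);
```

With these sizes each block holds at most 5376 x 512 complex single values instead of the full 5376 x 2916 matrix. A smaller chunk lowers peak memory at the cost of more loop iterations.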
By implementing these optimizations and memory management techniques, you can improve both the speed and memory usage of your code significantly.
2 Comments
moh mor 2024-7-6
Hi @Umar,
I did what you suggested to increase the computation speed on my system. I also made a few minor changes to make the code dynamic. For example, I define "start", "stop", and "step" values for the grid point coordinates in the "test" script and pass them to the "BPmimo2C" function. As far as I could, I performed the computation in single precision. All functions are GPU compatible. The problem is that, for example, the normal run took 38 s while the MEX file took 26 s. I expected a larger improvement from my implementation.
This is the test file:
clear all
load Efield
%% grid points
xf_str = -8; % up to three decimal points
xf_end = 8; % up to three decimal points
yf_str = -8; % up to three decimal points
yf_end = 8; % up to three decimal points
xy_step = 0.2; % up to three decimal points
z_str = 0.398; % up to three decimal points
z_end = 0.401;% up to three decimal points
z_step = 0.001; % up to three decimal points
xyz_par = [xf_str, xf_end, xy_step, yf_str, yf_end, xy_step, z_str, z_end, z_step]*1000;
f_str = 10e9;
f_end = 20e9;
f_step = 0.2e9;
f_par = [f_str, f_end, f_step];
ArrRadius = 15;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
tic
BPmimo2C_mex( Efield, f_par, xyz_par, TX, TY, numT, numR)
toc
I don't know if this will help, but here are the warnings that appeared in MATLAB after compilation:
Thank you for your help.
Umar 2024-7-6
Moved: Walter Roberson 2024-7-8

Hi Moh Mor,

Have you considered reaching out to MathWorks support for further assistance? Provide them with detailed information about your system configuration, MATLAB version, and the steps leading to the internal error.

More Answers (2)

Chao Luo 2024-7-3
The generated code is already quite well optimized for the GPU. I tried rewriting the code using explicit for-loops, which resulted in similar performance. On top of that, I converted the data type from double to single, which sped up execution about 10 times. Do the conversion if single precision is good enough for you. Here is my rewritten code, with the plotting part removed, for your reference:
function image = BPmimo2C4(Efield) %#codegen
coder.gpu.kernelfun;
%% creating kaiser window
numT = 16;
numR= 16;
f = 10e9:0.5e9:20e9;
numF = numel(f);
w = ones(numel(f),1);
viq = repmat(w.', [1,numT*numR]);
c = physconst('LightSpeed');
%% grid points
xf = (-8:0.3:8)*0.01;
yf = (-8:0.3:8)*0.01;
[uf , vf] = meshgrid(xf,yf);
x1f = uf(:);
y1f = vf(:);
%% initialization
ArrRadius = 30;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
K = 2*pi*f/c;
z = 0.36:0.003:0.41;
Efield2 = reshape(permute(Efield(1:numT*numR,:,:),[3 1 2]),[numel(f)*numT*numR,6]); % 5376x6
Efield2_6 = single(Efield2(:,6).');
% z = 0.4;
XYPos1 = single([TX.', TY.']);
UVPos = single([x1f(:), y1f(:)]);
dtXYUV1 = pdist2(XYPos1, UVPos);
XYPos2 = single([real(Efield(1:numR,2,1)) , real(Efield(1:numR,3,1))]);
dtXYUV2 = pdist2(XYPos2, UVPos);
EXP = coder.nullcopy(single((ones(21,16,16,17,2916) * 1i)));
for f_idx = 1:numel(x1f)
for dep = 1:17
for r_idx = 1:numR
for t_idx = 1:numel(TX)
for k_idx = 1:numel(K)
z2 = z(dep) * z(dep);
dt1 = dtXYUV1(r_idx,f_idx) * dtXYUV1(r_idx,f_idx) + z2;
dt1 = sqrt(dt1);
dt2 = dtXYUV2(t_idx,f_idx) * dtXYUV2(t_idx,f_idx) + z2;
dt2 = sqrt(dt2);
expV = exp((dt1 + dt2) * K(k_idx) * 1i);
EXP(k_idx, t_idx, r_idx, dep, f_idx) = expV;
end
end
end
end
end
EXP_resh = reshape(EXP, [21*16*16, 17*2916]);
image = Efield2_6 * EXP_resh;
image = reshape(image, [17,54,54]);
end
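For a sense of why memory becomes the limit, the size of the EXP array preallocated above can be estimated directly from its dimensions. This is only back-of-the-envelope arithmetic; the variable names are illustrative, and actual GPU usage will also include other temporaries.

```matlab
dims = [21 16 16 17 2916];   % [frequencies, TX, RX, depths, grid points], as in EXP above
bytesComplexSingle = 8;      % 4 bytes real + 4 bytes imaginary
bytesComplexDouble = 16;     % 8 bytes real + 8 bytes imaginary
numElements = prod(dims);    % 266,499,072 elements
gbSingle = numElements * bytesComplexSingle / 1e9;   % about 2.13 GB
gbDouble = numElements * bytesComplexDouble / 1e9;   % about 4.26 GB
fprintf('EXP holds %d elements: about %.2f GB as complex single, %.2f GB as complex double\n', ...
    numElements, gbSingle, gbDouble);
```

By the same arithmetic, a block covering a single depth (5376 x 2916 complex single values) would be 17 times smaller, roughly 0.125 GB, which is one reason processing the depths one at a time can help when memory is tight.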
8 Comments
moh mor 2024-7-9
Thank you again @Chao Luo ,
I ran my code in MATLAB successfully. Moreover, I generated a MEX file from my code using MATLAB Coder, and it works without any error.
Chao Luo 2024-7-10
R2018b is pretty old, so I cannot debug it and give you a workaround. Is it possible for you to upgrade MATLAB to at least R2019b?

Umar 2024-7-6

Hi Moh Mor,

Have you considered reaching out to MathWorks support for further assistance? Provide them with detailed information about your system configuration, MATLAB version, and the steps leading to the internal error.

Release

R2018b
