GPU Coder codegen slow performance

I am running the code shown below with MEX CUDA. Following Chao Luo's suggestion, I prepared a MWE, so that anyone who is interested can run this.
The computational bottleneck is the while loop in this function:
function [V,Policy,Params] = fun_VFI_cuda(p_eqm, a_grid, z_grid, pi_z, Params, vfoptions)
coder.gpu.kernelfun;
Params.r = p_eqm(1);
Params.w = p_eqm(2);
verbose = vfoptions.verbose ;
%% 1 First solve the value function
n_a = length(a_grid) ;
n_z = length(z_grid) ;
if verbose>=1
disp('Start Value Function Iteration')
end
r = p_eqm(1);
w = p_eqm(2);
lambda = Params.lambda;
delta = Params.delta;
alpha = Params.alpha;
upsilon = Params.upsilon;
crra = Params.crra;
beta = Params.beta;
%% 1.1 the return matrix
ReturnMatrix = coder.nullcopy(zeros(n_a,n_a,n_z)) ;
cash = coder.nullcopy(zeros(n_a,n_z)) ;
tic
coder.gpu.kernel()
for z_c = 1:n_z
coder.gpu.constantMemory(z_grid);
z = z_grid(z_c);
for a_c=1:n_a
coder.gpu.constantMemory(a_grid);
a = a_grid(a_c);
profit = solve_entre(a,z,w,r,lambda,delta,alpha,upsilon);
cash(a_c,z_c) = max(w,profit) + (1+r)*a; % cash depends only on (a,z)
end
end
coder.gpu.kernel()
for z_c = 1:n_z
for a_c = 1:n_a
coder.gpu.constantMemory(cash);
cash_is = cash(a_c,z_c) ;
for aprime_c = 1 : n_a % Now introduce a'
coder.gpu.constantMemory(a_grid);
cons = cash_is - a_grid(aprime_c);
if cons > 0
ReturnMatrix(aprime_c,a_c,z_c) = (cons^(1-crra))/(1-crra);
else
ReturnMatrix(aprime_c,a_c,z_c) = - Inf ;
end %end if
end %end aprime
end %end a
end %end z
time_ret = toc;
%% 1.1 the value and policy function
V0 = coder.nullcopy(zeros(n_a,n_z)) ; % Initial guess V0
V = coder.nullcopy(zeros(n_a,n_z)) ;
Policy = coder.nullcopy(zeros(n_a,n_z)) ;
err = vfoptions.tolerance+1;
iter = 1;
pi_z_tranposed = pi_z';
tic
while err > vfoptions.tolerance
EV = V0*pi_z_tranposed; %EV(a',z)
coder.gpu.kernel()
for z_c=1:n_z
for a_c=1:n_a
tmpmax = - Inf ;
maxid = 1 ;
for aprime_c = 1 : n_a
coder.gpu.constantMemory(ReturnMatrix);
coder.gpu.constantMemory(EV);
entireRHS = ReturnMatrix(aprime_c,a_c,z_c) + beta*EV(aprime_c,z_c);
if tmpmax < entireRHS
tmpmax = entireRHS ;
maxid = aprime_c ;
end
V(a_c,z_c) = tmpmax;
Policy(a_c,z_c) = maxid;
end
end
end
% -------------------------- Howard ----------------------------------%
for h_c = 1 : vfoptions.howards
EVh = V*pi_z_tranposed;
coder.gpu.kernel()
for z_c = 1:n_z
for a_c = 1:n_a
coder.gpu.constantMemory(Policy);
coder.gpu.constantMemory(ReturnMatrix);
coder.gpu.constantMemory(EVh);
aprime_opt = Policy(a_c,z_c) ;
V(a_c,z_c) = ReturnMatrix(aprime_opt,a_c,z_c) + beta*EVh(aprime_opt,z_c) ;
end
end
end %end howards
% --------------------------------- ----------------------------------%
% Update
err = max(abs(V(:)-V0(:)));
% if verbose == 2
% fprintf('iter = %4.0f, err = %f \n',iter,err)
% end
V0 = V;
iter = iter+1;
end %end while
time_vfi = toc;
if verbose >= 1
fprintf('Time return matrix: %8.6f \n',time_ret);
fprintf('Time vfi: %8.6f \n',time_vfi);
fprintf('Time return matrix + vfi: %8.6f \n',time_ret+time_vfi);
end
end %end function fun_VFI_cuda
Please note that this function calls the function solve_entre, which I copy and paste here:
function [profit,kstar,lstar] = solve_entre(a,z,w,r,lambda,delta,alpha,upsilon)
coder.gpu.kernelfun;
% Get k1, kstar, lstar
%aux = 1-(1-alpha)*(1-upsilon);
k1a = (1/(max(r+delta,1e-8)))*alpha*(1-upsilon)*z;
k1b = (1/w)*(1-alpha)*(1-upsilon)*z;
inside = k1a^(1-(1-alpha)*(1-upsilon)) * k1b^((1-alpha)*(1-upsilon));
k1 = (inside)^(1/upsilon);
kstar = min(k1,lambda*a);
inside_lab = (1/w)*(1-alpha)*(1-upsilon)*z *kstar^(alpha*(1-upsilon));
lstar = ( inside_lab )^(1/(1-(1-alpha)*(1-upsilon)));
% Evaluate profit if do choose to be entrepreneur
profit = z*((kstar^alpha)*(lstar^(1-alpha)) )^(1-upsilon) -w*lstar -(delta+r)*kstar;
end %end function solve_entre
I compile and run the code with the following script. Note that I want the input argument a_grid to be of variable size.
clear,clc,close all
%% Parameters
Params.crra = 1.5;
Params.beta = 0.904;
Params.delta = 0.06;
Params.alpha = 0.33;
Params.upsilon = 0.21;
Params.psi = 0.894;
Params.eta = 4.15;
Params.lambda = inf;
Params.r = 0.0476;
Params.w = 0.172;
p_eqm(1) = Params.r;
p_eqm(2) = Params.w;
vfoptions = struct();
vfoptions.verbose = 1;
vfoptions.lowmemory = 0;
vfoptions.tolerance = 1e-9;
vfoptions.howards = 80;
%% Grids
a_min = 1e-6;
a_max = 4000;
a_scale = 2;
n_a = 1000;
a_grid = a_min+(a_max-a_min)*linspace(0,1,n_a)'.^a_scale;
% z_grid is 40*1 vector
% pi_z is 40*40 transition matrix with rows summing up to one.
n_z = 40;
z_grid = linspace(0.5,1.5,n_z)'; % 40x1 vector
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
%% Compile to get C mex
% Define variable size inputs
vs_a_grid = coder.typeof(a_grid,[2001,1],1);
cfg = coder.gpuConfig('mex');
cfg.GenerateReport = true;
codegen -config cfg fun_VFI_cuda -args {zeros(1,2), vs_a_grid, z_grid, pi_z, Params, vfoptions} -o fun_VFI_cuda_mex
%% Call mex function
[V,Policy,Params] = fun_VFI_cuda_mex(p_eqm, a_grid, z_grid, pi_z,Params,vfoptions) ;
% THE END
The problem is that the performance of the function fun_VFI_cuda_mex is only slightly faster than Matlab native code. Of course this can be driven by some harware factors such as the Nvidia GPU graphics card I am using; However, I would like to ask if there is something I can do to improve the code above. For example, everything is with loops since I thought this is easier to translate for the gpu coder, but maybe I am wrong.
Any comment/answer is greatly appreciated!

 采纳的回答

Chao Luo
Chao Luo 2025-8-18
Hi Alessandro,
The code you pasted here seems not complete that I cannot try it. But you can try using gpuPerformanceAnalyzer to profile the generated code.
By just looking at the code, I can see the code of line 15-31 can be rewriten using gpucoder.reduce to improve the performance.
Also, if you can give the full reproduction, I can investigate more.

9 个评论

Hi @Chao Luo, thanks for your reply! I updated my question to include a full reproduction. I think you should be able to run the code.
Could you explain a bit more gpucoder.reduce? Thanks!
For 25a, you can replace the for-loop computing the max index with gpucoder.internal.max. The performance on my side is 20X faster.
You can profile the generate code like this:
gpuPerformanceAnalyzer("fun_VFI_cuda.m", {p_eqm, a_grid, z_grid, pi_z,Params,vfoptions}, Config=cfg)
gpuPerformanceAnalyzer("fun_VFI_cuda2.m", {p_eqm, a_grid, z_grid, pi_z,Params,vfoptions}, Config=cfg)
Here is the code:
function [V,Policy,Params] = fun_VFI_cuda2(p_eqm, a_grid, z_grid, pi_z, Params, vfoptions)
coder.gpu.kernelfun;
Params.r = p_eqm(1);
Params.w = p_eqm(2);
verbose = vfoptions.verbose ;
%% 1 First solve the value function
n_a = length(a_grid) ;
n_z = length(z_grid) ;
if verbose>=1
disp('Start Value Function Iteration')
end
r = p_eqm(1);
w = p_eqm(2);
lambda = Params.lambda;
delta = Params.delta;
alpha = Params.alpha;
upsilon = Params.upsilon;
crra = Params.crra;
beta = Params.beta;
%% 1.1 the return matrix
ReturnMatrix = coder.nullcopy(zeros(n_a,n_a,n_z)) ;
cash = coder.nullcopy(zeros(n_a,n_z)) ;
tic
coder.gpu.kernel()
for z_c = 1:n_z
coder.gpu.constantMemory(z_grid);
z = z_grid(z_c);
for a_c=1:n_a
coder.gpu.constantMemory(a_grid);
a = a_grid(a_c);
profit = solve_entre(a,z,w,r,lambda,delta,alpha,upsilon);
cash(a_c,z_c) = max(w,profit) + (1+r)*a; % cash depends only on (a,z)
end
end
coder.gpu.kernel()
for z_c = 1:n_z
for a_c = 1:n_a
coder.gpu.constantMemory(cash);
cash_is = cash(a_c,z_c) ;
for aprime_c = 1 : n_a % Now introduce a'
coder.gpu.constantMemory(a_grid);
cons = cash_is - a_grid(aprime_c);
if cons > 0
ReturnMatrix(aprime_c,a_c,z_c) = (cons^(1-crra))/(1-crra);
else
ReturnMatrix(aprime_c,a_c,z_c) = - Inf ;
end %end if
end %end aprime
end %end a
end %end z
time_ret = toc;
%% 1.1 the value and policy function
V0 = zeros(n_a,n_z) ; % Initial guess V0
err = vfoptions.tolerance+1;
V = coder.nullcopy(zeros(n_a,n_z));
Policy = coder.nullcopy(zeros(n_a,n_z)) ;
iter = 1;
pi_z_tranposed = pi_z';
tic
while err > vfoptions.tolerance
EV = beta*V0*pi_z_tranposed; %EV(a',z)
% Compute ReturnMatrix - EV
entireRHS = coder.nullcopy(ReturnMatrix);
for z_c = 1:n_z
for a_c = 1:n_a
for aprime_c = 1:n_a
entireRHS(aprime_c, a_c, z_c) = ReturnMatrix(aprime_c, a_c, z_c) + EV(aprime_c,z_c);
end
end
end
[V3, Policy3] = gpucoder.internal.max(entireRHS, 'dim', 1);
V = squeeze(V3);
Policy = squeeze(Policy3);
% coder.gpu.kernel()
% for z_c=1:n_z
% for a_c=1:n_a
% tmpmax = - Inf ;
% maxid = 1 ;
% for aprime_c = 1 : n_a
% coder.gpu.constantMemory(ReturnMatrix);
% coder.gpu.constantMemory(EV);
% entireRHS = ReturnMatrix(aprime_c,a_c,z_c) + beta*EV(aprime_c,z_c);
% if tmpmax < entireRHS
% tmpmax = entireRHS ;
% maxid = aprime_c ;
% end
% V(a_c,z_c) = tmpmax;
% Policy(a_c,z_c) = maxid;
% end
% end
% end
% -------------------------- Howard ----------------------------------%
for h_c = 1 : vfoptions.howards
EVh = beta*V*pi_z_tranposed;
coder.gpu.kernel()
for z_c = 1:n_z
for a_c = 1:n_a
aprime_opt = Policy(a_c,z_c) ;
V(a_c,z_c) = ReturnMatrix(aprime_opt,a_c,z_c) + EVh(aprime_opt,z_c) ;
end
end
end %end howards
% --------------------------------- ----------------------------------%
% Update
err = max(abs(V(:)-V0(:)));
V0 = V;
iter = iter+1;
end %end while
time_vfi = toc;
if verbose >= 1
fprintf('Number of iteration: %8.6f \n',iter);
fprintf('Time return matrix: %8.6f \n',time_ret);
fprintf('Time vfi: %8.6f \n',time_vfi);
fprintf('Time return matrix + vfi: %8.6f \n',time_ret+time_vfi);
end
end %end function fun_VFI_cuda

Thanks, this is great!

I was not able to find much documentation for gpucoder.internal.max, though.

There is a more serious problem. I tried to compile fun_VFI_cuda with 2025a but I got this error message:
Left hand side of the assignment has fewer dimensions than the right hand side: [:? x :?] cannot be assigned from [:? x :? x :?].
Error in ==> fun_VFI_cuda Line: 116 Column: 5
It does not like the following line where I update the value function:
V0 = V;
This is strange: if I compile it in 2024b, I do not get this error (but of course it does not compile either because it does not recognize gpucoder.internal.max
Do not use
V = squeeze(V3);
Use reshape() instead.
Can you check the size of V and V0? The squeeze/reshape is required here because gpucoder.interal.max output type is [1000x1x40] instead of [1000x40].
@Walter Roberson Thanks for the tip!
@Chao Luo I ran the M file and I checked these lines
[V3, Policy3] = max(entireRHS, [], 1);
V = reshape(V3,n_a,n_z);
entireRHS is [1000*1000*40], so V3 is [1*1000*40], while V is [1000*40]. This is why in my original code I used squeeze(V3), to eliminate the singleton dimension and make V3 have the same dimension or shape as V. @Walter Roberson suggests to use reshape which I did, but it should be equivalent, at least in native Matlab.
I will compile the GPU code later when I have access to 2025a (now I am with 2024b)
I have an update. Replacing squeeze with reshape fixes the problem. Thanks @Walter Roberson!
@Chao Luo I checked the running time and it turns out that your code with
entireRHS = coder.nullcopy(ReturnMatrix);
coder.gpu.kernel()
for z_c=1:n_z
for a_c=1:n_a
for aprime_c = 1 : n_a
entireRHS(aprime_c, a_c, z_c)=ReturnMatrix(aprime_c,a_c,z_c)+beta*EV(aprime_c,z_c);
end
end
end
[V3, Policy3] = gpucoder.internal.max(entireRHS, 'dim', 1);
is slightly (20%) faster than my original code with the "max" implemented manually via a for loop:
coder.gpu.kernel()
for z_c=1:n_z
for a_c=1:n_a
max_val = -inf;
max_ind = -1;
for aprime_c = 1 : n_a
entireRHS = ReturnMatrix(aprime_c,a_c,z_c)+beta*EV(aprime_c,z_c);
if entireRHS>max_val
max_val = entireRHS;
max_ind = aprime_c;
end
end
V(a_c,z_c) = max_val;
Policy(a_c,z_c) = max_ind; % Store the optimal policy
end
end
So the conclusion is that if one has 2025a, it is better to calculate the array and then its max via gpucoder.internal.max instead of doing the reduction with a loop.
@Chao Luo: For 2024b and earlier version, can I add a directive to the reduction loop to tell the compiler that it is a reduction? Thanks!
gpucoder.internal.max is not supported for releases earlier than 25a.
You can try a different approach:
maxVal = gpucoder.reduce(entireRHS, @mymax, 'dim', 1)
maxIdx = coder.ones(n_a, n_z) * int32(n_a);
coder.gpu.kernel()
for z_c = 1:n_z
coder.gpu.kernel()
for a_c = 1:n_a
coder.gpu.kernel()
for aprime_c = 1:n_a
if entireRHS(aprime_c, a_c, z_c) == maxVal(1, a_c, z_c)
maxIdx(a_c, z_c) = gpucoder.atomicMin(maxIdx(a_c, z_c), int32(aprime_c));
end
end
end
end
function out = mymax(lhs, rhs)
out = max(lhs, rhs);
end

请先登录,再进行评论。

更多回答(0 个)

类别

帮助中心File Exchange 中查找有关 Kernel Creation from MATLAB Code 的更多信息

产品

版本

R2024b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by