GPU optimization of looped vector operations
1 次查看(过去 30 天)
显示 更早的评论
I think I am making a simple mistake. I am comparing a vectrorized integration using the GPU and the CPU. This code takes ~365 sec on the GPU (Nvidia quadro P5200 and 96 sec on the parallel CPU (6 workers, Xeon E-2176M, 2.7 GHz). The integral is a straight forward operation with vectors 90,000 long in this example that repeats 90,000 times in the loop. A test of an array multiplication of two 10,000x10,000 arrays of random numbers takes 0.65 s on my GPU and 10.8 s on my CPU. In the example below the GPU is slower for larger arrays. It seems as though the loop introduces a lot of overhead on the GPU operations.
What is the best strategy to optimize this problem for the GPU?
nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absortion
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
% The integral
tic
try
canUseGPU = parallel.gpu.GPUDevice.isAvailable;
catch ME
canUseGPU = false;
end
%canUseGPU = false;
if canUseGPU
%integral using GPU
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
for i = 1:len
gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
end
dN_KK =gather(gdN_KK);
else
%integral using GPU
parfor i = 1:len
dN_KK(i) = sum(nu_bar([1:i-1, i+1:end]) .* K([1:i-1, i+1:end]) ./ (nu_bar([1:i-1, i+1:end]).^2 - nu_bar(i).^2));
dN_KK(i) = 2*dN_KK(i) + K(i)./(2*nu_bar(i)) + dK(i);
end
end
% Scales data
dN_KK = (1/(pi*p_density))*dN_KK;
% Adds constant for N infinity
N_KK = dN_KK + n_inf;
toc
4 个评论
Joss Knight
2019-8-29
Lloyd, have you tried running your computation in single precision? Your Quadro P5200 has respectable single precision performance of about 8 or 9 teraflops, but like most graphics cards except special ones, its double precision performance is a small fraction of that at about 280 gigaflops (figures from the Wikipedia page where NVIDIA post their specs). This is why you're not getting much better matrix multiply performance out of your GPU than your CPU - this would be dramatically different in single. It is perfectly normal for an algorithm to run faster on the CPU than on one of these graphics-focussed cards, especially if it is an algorithm with a lot of unvectorized loops and a multiprocess parallelization.
采纳的回答
Matt J
2019-8-28
编辑:Matt J
2019-9-6
This modification uses mat2tiles from the File Exchange, to help divide the computation into bigger, vectorized chunks
It runs in about 2 seconds on my graphics card (GeForce GTX 1080 Ti). Aside from increased vectorization, the key is to eliminate all the indexing expressions x([1:i-1, i+1:end]). Those are costly.
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
chunksize=1000;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=gnu_bar.*gK;
c=1;
for k=1:numel(vvchunks)
Q=numer(:)./(vv.'-vvchunks{k});
Q(c:len+1:end)=0;
c=c+size(Q,2);
vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
wait(gd)
toc %Elapsed time is 2.027665 seconds.
15 个评论
Matt J
2019-9-6
编辑:Matt J
2019-9-6
Okay, I did make a few fixes, but now to be sure we're on the same page, I share the test code below, and I see strong agreement between the two versions
nubar_low = 2450;
nubar_high = 2451;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absortion
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
len,
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
%%%% ORIGINAL %%%%%
for i = 1:len
gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
end
version1 = gdN_KK ;
%%%% OPTIMIZED %%%%%%
chunksize=5;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=gnu_bar.*gK;
c=1;
for k=1:numel(vvchunks)
Q=numer(:)./(vv.'-vvchunks{k});
Q(c:len+1:end)=0;
c=c+size(Q,2);
vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
%wait(gd)
toc %Elapsed time is 2.027665 seconds.
version2 = gdN_KK ;
plot(1:len,version1,'-',1:len,+version2,'x'); legend('Lloyd','Matt')
更多回答(1 个)
Lloyd Bumm
2019-9-6
编辑:Lloyd Bumm
2019-9-6
3 个评论
Joss Knight
2019-9-6
Thanks. The only explanation for that is that your cost is all overhead on the GPU, and not computation.
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Parallel Computing Fundamentals 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!