- Combine the scalar operations to reduce the number of matrix operations.
- Move constants out of the loops.
- Crop the data outside the loops instead of applying a:b indexing in each iteration.
- Use auto-expanding (implicit expansion) instead of inflating with repmat. bsxfun is sometimes faster and sometimes slower than auto-expanding - try both.
- Avoid repeated calculations. In your case, e.g., Ipc(:,1)*dt/hb is computed several times; compute it once and reuse the result.
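A minimal sketch of these points, reusing the variable names from the question's code (the values of dt, hb, tau2, nt and I are assumed to be defined as in the question):

```matlab
% Hoist loop-invariant scalars out of the loops and combine them once
c_dt_hb = dt / hb;                 % combined scalar: one matrix scaling instead of two
c1      = 1 - exp(-dt / tau2);     % constant decay factor, computed once

for jt2 = 1:(nt+1)
    Ipc = I(:,jt2) * c_dt_hb;      % scaled column computed once per iteration
    % ... reuse Ipc in every update instead of recomputing I(:,jt2)*dt/hb
end
```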
How to speed up matrix operations with lots of scalar constants involved?
I have tried transferring all arrays to the GPU, but it is much slower.
dCnE=zeros(nr(jz),npmax(jz),'gpuArray');
dCnB=zeros(nr(jz),npmax(jz),'gpuArray');
dCnZ=zeros(nr(jz),npmax(jz)+1,'gpuArray');
Cn=zeros(nr(jz),ns,'gpuArray');
Cn2=zeros(nr(jz),ns,'gpuArray');
Cn0=Cnw(1:nr(jz),3:5:nsw-2); %Cnw is a gpu array.
CnE=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnB=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnZ=zeros(nr(jz),ns,npmax(jz)+1,'gpuArray');
CnZ(:,:,1)=Cnw(1:nr(jz),3:5:nsw-2);
CnZ=permute(CnZ,[1,3,2]);
I=I0*(exp(-((gpuArray.linspace(0,dr*(nr(jz)-1),nr(jz))/rz).^2)))'*exp(-((gpuArray.linspace(-t0,t0,nt+1)/tau).^2));
% I think dI, dCn, dCn2 and dCn0 are completely overwritten in a single statement, so they do not need to be preallocated.
for jl=1:ns
for jt2=1:(nt+1)
dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;
dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
-Cn2(:,jl)*(1-exp(-dt/tau2));
dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
if (npmax(jz)>nphth) %The most time-consuming part
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
end
I(:,jt2)=I(:,jt2)+dI;
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
Cn(:,jl)=Cn(:,jl)+dCn;
Cn0(:,jl)=Cn0(:,jl)+dCn0;
Cn2(:,jl)=Cn2(:,jl)+dCn2;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
end
Compared with my other accelerated code, I think the reason is that many more scalar constants are involved in this case, so I transferred them to the GPU too. But it is still slower than the version without GPU computing. Maybe the reason is that nr(jz) varies from 23 to 171, which is too small. Is my reasoning correct? Is there any way to accelerate this script?
Accepted Answer
Jan
2018-5-20
Edited: Jan
2018-5-20
If you have a modern Matlab version (R2016b or newer), use auto-expanding (implicit expansion) instead of an explicit repmat:
% Ipc = repmat(I(:,jt2),[1,npmax(jz)-1]);
Ipc = I(:,jt2);
... CnE(:,1:npmax(jz)-1,jl)) .* Ipc * dt / hb
I guess that dt and hb are scalars (please do not let the readers guess!). Then remember that Matlab evaluates the code from left to right:
Y = X * dt / hb
is equivalent to:
Temp = X * dt;
Y = Temp / hb;
These are 2 matrix operations. But you can combine the scalar operations:
Y = X * (dt / hb)
This is 1 matrix operation and a cheap scalar division.
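A toy illustration of this grouping, with placeholder values (timings will vary by machine; this only shows that the two groupings agree in value, up to floating-point rounding):

```matlab
X  = rand(1000);  dt = 1e-3;  hb = 2.5;   % placeholder data and scalars

Y1 = X * dt / hb;      % two full-matrix operations (left-to-right evaluation)
Y2 = X * (dt / hb);    % one matrix operation plus one cheap scalar division

max(abs(Y1(:) - Y2(:)))   % ~0, differences only from rounding
```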
As Matt J has explained already, operating on subarrays as in dCnE(:,2:npmax(jz)) is expensive. Better crop the data such that you can omit the ugly indexing.
I cannot run it on my own, because you did not provide the input values, but compare the two versions visually:
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+ ...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
or:
c1 = 1 - exp(-dt / tau2); % Calculate constants outside the loops!
Ipc = I(:,jt2);
% (assumes CnZ, CnE, CnB have been cropped once before the loop, so the
%  shifted slices 2:npmax(jz) and 1:npmax(jz)-1 no longer appear here)
dCnE = (sigma0 * CnZ(:, :, jl) - sigma * CnE(:, :, jl) + ...
       vs2 * sigma * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
       CnB(:, :, jl) * c1 + ...
       (vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
       CnE(:, :, jl) * (dt / tau1);
This can be simplified a little bit by combining the two sigma terms acting on CnE:
dCnE = (sigma0 * CnZ(:, :, jl) + ...
       (vs2 - 1) * sigma * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
       CnB(:, :, jl) * c1 + ...
       (vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
       CnE(:, :, jl) * (dt / tau1);
Another method: precompute the scaled vector once per time step:
C1 = I(:, jt2) * (dt/hb);
The latter includes another idea: you inflate I(:,jt2) first with Ipc = repmat(...), and then waste time cropping out the subvector Ipc(:,1) later on. Better create the subvector only once: Ipc = I(:,jt2).
Finally: Replace
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
by:
Cn0 = max(Cn0, 0);
and move it outside the loops, because you do not use the cleaned values inside the loop.
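For comparison, a minimal sketch showing that the vectorized max call produces the same result as the temporary-variable pattern (toy data, not from the question):

```matlab
Cn0 = [-1 2; 3 -4];          % toy data with negative entries

% original pattern, column by column
A = Cn0;
for jl = 1:size(A, 2)
    non0 = A(:, jl);
    non0(non0 < 0) = 0;
    A(:, jl) = non0;
end

% vectorized replacement: clamp everything in one call
B = max(Cn0, 0);

isequal(A, B)                % true
```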
More Answers (1)
Matt J
2018-5-20
Edited: Matt J
2018-5-20
Statements like 2:npmax(jz), which construct index arrays on the right-hand side, have a cost, and you are incurring that cost repeatedly by re-executing them unnecessarily.
Instead of this,
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
Do things like
indices=2:npmax(jz); %pre-compute
dCnE(:,indices)=(sigma0*CnZ(:,indices,jl)-sigma*CnE(:,indices,jl)+...
Extracting sub-arrays like CnZ(:,2:npmax(jz),jl) is similarly costly. Use temporary variables to avoid this as well, e.g.
tmp=CnZ(:,2:npmax(jz),jl);
dCnZ(:,2:npmax(jz))=-sigma0*tmp.*Ipc*dt/hb-(beta0*tmp.*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
Finally, repmat should be unnecessary if you are using Matlab R2016b or newer.
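A small sketch of the implicit-expansion idea with toy sizes (placeholder names, not the question's data):

```matlab
I_col = (1:5)';                    % 5x1 column, stands in for I(:,jt2)
M     = rand(5, 8);                % stands in for a slice like CnZ(:,2:npmax(jz),jl)

% explicit inflation (pre-R2016b style)
P1 = M .* repmat(I_col, [1, size(M, 2)]);

% implicit expansion (R2016b and newer): the 5x1 column expands automatically
P2 = M .* I_col;

isequal(P1, P2)                    % true
```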
4 Comments
Matt J
2018-5-21
I don't see any temporary arrays in what you posted. It looks identical to your original code.