How to speed up matrix operations with lots of scalar constants involved ?

2 次查看(过去 30 天)
I have tried to transfer all arrays to gpu, but it's much slower.
dCnE=zeros(nr(jz),npmax(jz),'gpuArray');
dCnB=zeros(nr(jz),npmax(jz),'gpuArray');
dCnZ=zeros(nr(jz),npmax(jz)+1,'gpuArray');
Cn=zeros(nr(jz),ns,'gpuArray');
Cn2=zeros(nr(jz),ns,'gpuArray');
Cn0=Cnw(1:nr(jz),3:5:nsw-2); %Cnw is a gpu array.
CnE=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnB=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnZ=zeros(nr(jz),ns,npmax(jz)+1,'gpuArray');
CnZ(:,:,1)=Cnw(1:nr(jz),3:5:nsw-2);
CnZ=permute(CnZ,[1,3,2]);
I=I0*(exp(-((gpuArray.linspace(0,dr*(nr(jz)-1),nr(jz))/rz).^2)))'*exp(-((gpuArray.linspace(-t0,t0,nt+1)/tau).^2));
% i think dI,dCn,dCn2,dCn0 will be totally overwrited in one command, so they don't need to be preallocated.
for jl=1:ns
for jt2=1:(nt+1)
dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;
dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
-Cn2(:,jl)*(1-exp(-dt/tau2));
dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
if (npmax(jz)>nphth) %The most time-consuming part
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
end
I(:,jt2)=I(:,jt2)+dI;
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
Cn(:,jl)=Cn(:,jl)+dCn;
Cn0(:,jl)=Cn0(:,jl)+dCn0;
Cn2(:,jl)=Cn2(:,jl)+dCn2;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
end
Comparing to my other accelerated code, I think the reason is that there are much more scalar constants involved in this case, so I transfer them to gpu too. But it's still slower than the one without gpu computing. Maybe the reason is that nr(jz) varies from 23~171 which is too small. Is my thought correct? Is there any way of accelerating this script ?

采纳的回答

Jan
Jan 2018-5-20
编辑:Jan 2018-5-20
If you have a modern Matlab version >= 2016b, use the auto expanding instead of an explicit repmat:
% Ipc = repmat(I(:,jt2),[1,npmax(jz)-1]);
Ipc = I(:,jt2);
... CnE(:,1:npmax(jz)-1,jl)) .* Ipc * dt / hb
I guess, that dt and hb are scalars (please do not let the readers guess!). Then remember, that Matlab evaluate the code from left to right:
Y = X * dt / hb
is equivalent to:
Temp = X * dt;
Y = Temp / hb;
These are 2 matrix operations. But you can combine the scalar operations:
Y = X * (dt / hb)
This is 1 matrix operation and a cheap scalar division.
As Matt J has explained already, operating on subarrays as in |dCnE(:,2:npmax(jz)) is expensive. Better crop the data, such that you can omit the ugly indexing.
I cannot run it by my own, because you did not provide the input values, but compare it optically:
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+ ...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
or:
c1 = 1 - exp(-dt / tau2); % Calculate constants outside the loops!
Ipc = I(:,jt2);
dCnE = (sigma0 * CnZ(:, :, jl) - sigma * CnE(:, :, jl) + ...
vs2 * sigma * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
CnB(:,:,jl) * c1 + ...
(vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
CnE(:, :, jl) / (tau1 * dt);
This can be simplified a little bit for the CnE term:
dCnE = (sigma0 * CnZ(:, :, jl) + ...
CnE(:, :, jl)) .* (Ipc * ((vs2 * sigma - sigma) * dt / hb)) + ...
CnB(:,:,jl) * c1 + ...
(vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
CnE(:, :, jl) / (tau1 * dt);
Methods:
  1. Combine the scalar operations to reduce the number of matrix operations.
  2. Move constants out of the loop.
  3. Crop the data outside the loops instead of applying an a:b indexing in each iteration.
  4. Use auto-expanding instead of inflating by repmat. bsxfun is sometimes faster or slower than auto-expanding - try it.
  5. Avoid repeated calculations. In your case e.g. Ipc(:,1)*dt/hb is computed several times. Using this avoids this:
C1 = I(:, jt2) * (dt/hb);
The latter includes another idea: You inflate I(:,jt2) at first by Ipc = repmat(...), and waste time with cropping out a subvector Ipc(:,1) later on. Better create the subvector once only: Ipc = I(:,jt2).
Finally: Replace
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
by:
Cn0 = max(Cn0, 0);
and move it outside the loops, because you do not use the cleaned values inside the loop.
  1 个评论
ben huang
ben huang 2018-5-21
编辑:ben huang 2018-5-21
Sorry for unclear explanation, sigma0, sigma, tau1, tau2, dt, hb, vs2, beta0 are all scalars. I have tested all your suggestions without gpu. The results are as follows (In each method, except the parts mentioned, all other parts are the same as the original one)
method 1,2,5
original (6.8328 sec)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
modified (5.8009 sec)
outside the loop
Cexp=1-exp(-dt/tau2);
vsdh=vs2*sigma*dt/hb;
bd2h=beta0*dt/2/hb;
ddt=dt/tau1;
mvsdh=(1-vs2)*sigma*dt/hb;
mvd2hb=(1-vs2)*dt/(2*hb)*beta0;
msdh=-sigma0*dt/hb;
ddh=dt/hb;
in the loop
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vsdh*CnE(:,1:npmax(jz)-1,jl)).*Ipc+CnB(:,2:npmax(jz),jl)*Cexp...
+vs2*(bd2h*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))-ddt*CnE(:,2:npmax(jz),jl);
dCnB(:,2:npmax(jz))=mvsdh*CnE(:,1:npmax(jz)-1,jl).*Ipc+mvd2hb*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2)...
-CnB(:,2:npmax(jz),jl)*Cexp;
dCnZ(:,2:npmax(jz))=msdh*CnZ(:,2:npmax(jz),jl).*Ipc-bd2h*CnZ(:,2:npmax(jz),jl).*Ipc.^2+CnE(:,1:npmax(jz)-1,jl)*ddt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*ddh+...
-CnE(:,1,jl)*ddt;
dCnZ(:,1)=msdh*CnZ(:,1,jl).*Ipc(:,1)-bd2h*CnZ(:,1,jl).*Ipc(:,1).^2;
good method! thanks a lot!
Method 3:
To corp the arrays out of the inner loop and let the temporary arrays take the place of CnE, CnB, CnZ in inner loop, I have to add some commands to connect them. Except for the replacement of CnE, CnB, CnZ, I used comments to mark other changed parts in the modified version.
original (6.7290 sec)
for jl=1:ns
for jt2=1:(nt+1)
dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt; %(change of S1 state population density)
dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
-Cn2(:,jl)*(1-exp(-dt/tau2));
dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
if (npmax(jz)>nphth)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
end
I(:,jt2)=I(:,jt2)+dI;
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
Cn(:,jl)=Cn(:,jl)+dCn;
Cn0(:,jl)=Cn0(:,jl)+dCn0;
Cn2(:,jl)=Cn2(:,jl)+dCn2;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
end
Modified (8.7912 sec)
for jl=1:ns
if (npmax(jz)>nphth)
CnZt2=CnZ(:,2:npmax(jz),jl); % connecting commands
CnEt2=CnE(:,2:npmax(jz),jl);
CnBt2=CnB(:,2:npmax(jz),jl);
CnZt1=CnZ(:,1:npmax(jz)-1,jl);
CnEt1=CnE(:,1:npmax(jz)-1,jl);
end
for jt2=1:(nt+1)
dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;
dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
-Cn2(:,jl)*(1-exp(-dt/tau2));
dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
if (npmax(jz)>nphth)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZt2-sigma*CnEt2+vs2*sigma*CnEt1).*Ipc*dt/hb+CnBt2*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)-CnEt2/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnEt1.*Ipc*dt/hb+(1-vs2)*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)...
-CnBt2*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZt2.*Ipc*dt/hb-(beta0*CnZt2.*Ipc.^2)*dt/(2*hb)+CnEt1/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
CnZt2=CnZt2+dCnZ(:,2:npmax(jz)); % the number of arrays whose changes need to be record is increase from 3 to 5
CnZt1=CnZt1+dCnZ(:,1:npmax(jz)-1);
CnBt2=CnBt2+dCnB(:,2:npmax(jz));
CnEt2=CnEt2+dCnE(:,2:npmax(jz));
CnEt1=CnEt1+dCnE(:,1:npmax(jz)-1);
end
I(:,jt2)=I(:,jt2)+dI;
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
Cn(:,jl)=Cn(:,jl)+dCn;
Cn0(:,jl)=Cn0(:,jl)+dCn0;
Cn2(:,jl)=Cn2(:,jl)+dCn2;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
if (npmax(jz)>nphth) % connecting commands
CnZ(:,2:npmax(jz),jl)=CnZt2;
CnE(:,2:npmax(jz),jl)=CnEt2;
CnB(:,2:npmax(jz),jl)=CnBt2;
CnZ(:,1,jl)=CnZt1(:,1);
CnE(:,1,jl)=CnEt1(:,1);
end
end
The trade-off is too expensive.
Method 4: original (6.7951 sec)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
auto-expanding modified (7.8393 sec)
Ipc=I(:,jt2);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+Cn B(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc.^2))*dt/(2*hb);
bsxfun modified (34.3400 sec)
outside the loop
fun1=@(g,h) g.*h.^2;
inside the loop
Ipc=I(:,jt2);
dCnE(:,2:npmax(jz))=bsxfun(@times,sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl),Ipc)*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*bsxfun(@times,CnE(:,1:npmax(jz)-1,jl),Ipc)*dt/hb+(1-vs2)*(beta0*bsxfun(fun1,CnZ(:,1:npmax(jz)-1,jl),Ipc))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*bsxfun(@times,CnZ(:,2:npmax(jz),jl),Ipc)*dt/hb-(beta0*bsxfun(fun1,CnZ(:,2:npmax(jz),jl),Ipc))*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc.^2))*dt/(2*hb);
Maybe the reason is that in modified version, every time Ipc is used (it's used several times in one loop), it's auto-expanded, so the total consumed time is more than the original one, which expand Ipc only one time in one loop. Maybe the bsxfun case has the same reason.
non0-moved Method:
After testing several choices, only the following modification was found a little bit faster. To move the cleaning command of Cn,Cn0 out of the inner loop, I have to exchange the counter of outer loop and inner loop.
original (6.7660 sec)
for jl=1:ns
for jt2=1:(nt+1)
...
...
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
end
Modified (6.7522 sec)
for jt2=1:(nt+1)
for jl=1:ns
...
...
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
end
Cn0(Cn0<0)=0;
Cn(Cn<0)=0;
end
I appreciate all of your suggestions! But why I can't use gpu to accelerate it? Is my thought in the main post correct ?

请先登录,再进行评论。

更多回答(1 个)

Matt J
Matt J 2018-5-20
编辑:Matt J 2018-5-20
Statements like "2:npmax(jz)" which construct index arrays on the right hand side cost, and you are incurring that cost repeatedly by re-executing them unnecessarily.
Instead of this,
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
Do things like
indices=2:npmax(jz); %pre-compute
dCnE(:,indices)=(sigma0*CnZ(:,indices,jl)-sigma*CnE(:,indices,jl)+...
Extracting sub-arrays like CnZ(:,2:npmax(jz),jl) is similarly costly. Use temporary variables to avoid this as well, e.g.
temp=CnZ(:,2:npmax(jz),jl);
dCnZ(:,2:npmax(jz))=-sigma0*tmp.*Ipc*dt/hb-(beta0*tmp.*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
Finally, repmat should be unnecessary if you are using a Matlab version more recent than R2016b.
  4 个评论
Matt J
Matt J 2018-5-21
I don't see any temporary arrays in what you posted. It looks identical to your original code.
ben huang
ben huang 2018-5-21
编辑:ben huang 2018-5-21
Sorry for not specifying the changed part! Please refer to the following code, which is the changed part only.
original (6.7460 sec)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(: , 2 : npmax(jz)) = ( sigma0 * CnZ(: , 2 : npmax(jz) , jl) - sigma * CnE(: , 2 : npmax(jz) , jl) +...
vs2 * sigma * CnE(: , 1 : npmax(jz) - 1 , jl)) .* Ipc * dt / hb + CnB(: , 2 : npmax(jz) , jl) * ( 1 - exp( -dt / tau2 ))...
+ vs2 * ( beta0 * CnZ(: , 1 : npmax(jz) - 1 , jl) .* ( Ipc .^ 2 )) * dt / ( 2 * hb )- CnE(: , 2 : npmax(jz) , jl) / tau1 * dt ;
dCnB(: , 2 : npmax(jz)) = ( 1 - vs2 ) * sigma * CnE(: , 1 : npmax(jz) - 1 , jl) .* Ipc * dt / hb + ( 1 - vs2 ) * ( beta0 * CnZ(: , 1 : npmax(jz) - 1 , jl) .*( Ipc .^ 2 )) * dt / ( 2 * hb )...
- CnB(: , 2 : npmax(jz) , jl) *( 1 - exp ( -dt / tau2 ));
dCnZ(: , 2 : npmax(jz)) = -sigma0 * CnZ(: , 2 : npmax(jz) , jl) .* Ipc * dt / hb - ( beta0 * CnZ(: , 2 : npmax(jz) , jl) .* Ipc .^ 2 ) * dt / ( 2 * hb ) + CnE(: , 1 : npmax(jz) - 1 , jl) / tau1 * dt ;
Modified (9.6524 sec)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
CnZt2 = CnZ(: , 2 : npmax(jz) , jl); % temporary array
CnEt2 = CnE(: , 2 : npmax(jz) , jl);
CnBt2 = CnB(: , 2 : npmax(jz) , jl);
CnZt1 = CnZ(: , 1 : npmax(jz) - 1 , jl);
CnEt1 = CnE(: , 1 : npmax(jz) - 1 , jl);
dCnE(: , 2 : npmax(jz)) = ( sigma0 * CnZt2 - sigma * CnEt2 + vs2 * sigma * CnEt1 ) .* Ipc * dt / hb + CnBt2 *( 1 - exp( -dt / tau2 ))...
+vs2 * ( beta0 * CnZt1 .* ( Ipc .^ 2 )) * dt / ( 2 * hb )- CnEt2 / tau1 * dt ;
dCnB(: , 2 : npmax(jz)) = ( 1 - vs2 ) * sigma * CnEt1 .* Ipc * dt / hb + ( 1 - vs2 ) * ( beta0 * CnZt1 .* ( Ipc .^ 2 )) * dt / ( 2 * hb )...
- CnBt2 * ( 1 - exp ( -dt / tau2 )) ;
dCnZ(: , 2 : npmax(jz)) = -sigma0 * CnZt2 .* Ipc * dt / hb - ( beta0 * CnZt2 .* Ipc .^ 2 ) * dt / ( 2 * hb ) + CnEt1 / tau1 * dt ;

请先登录,再进行评论。

类别

Help CenterFile Exchange 中查找有关 Logical 的更多信息

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by