How to speed up matrix operations with lots of scalar constants involved ?

Question

ben huang 2018-5-20

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/401642-how-to-speed-up-matrix-operations-with-lots-of-scalar-constants-involved

编辑： ben huang 2018-5-21

I have tried to transfer all arrays to gpu, but it's much slower.

    dCnE=zeros(nr(jz),npmax(jz),'gpuArray');
    dCnB=zeros(nr(jz),npmax(jz),'gpuArray');
    dCnZ=zeros(nr(jz),npmax(jz)+1,'gpuArray');
    Cn=zeros(nr(jz),ns,'gpuArray');
    Cn2=zeros(nr(jz),ns,'gpuArray');
    Cn0=Cnw(1:nr(jz),3:5:nsw-2);  %Cnw is a gpu array.
    CnE=zeros(nr(jz),npmax(jz),ns,'gpuArray');
    CnB=zeros(nr(jz),npmax(jz),ns,'gpuArray');
    CnZ=zeros(nr(jz),ns,npmax(jz)+1,'gpuArray');
    CnZ(:,:,1)=Cnw(1:nr(jz),3:5:nsw-2);
    CnZ=permute(CnZ,[1,3,2]);
    I=I0*(exp(-((gpuArray.linspace(0,dr*(nr(jz)-1),nr(jz))/rz).^2)))'*exp(-((gpuArray.linspace(-t0,t0,nt+1)/tau).^2));
    % i think dI,dCn,dCn2,dCn0 will be totally overwrited in one command, so they don't need to be preallocated.
for jl=1:ns
    for jt2=1:(nt+1)
       dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
       dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
       +vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;
       dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
       -Cn2(:,jl)*(1-exp(-dt/tau2));
       dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
    if (npmax(jz)>nphth)       %The most time-consuming part
        Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
        dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
        vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
        +vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
        dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
        -CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
        dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
        dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
        -CnE(:,1,jl)/tau1*dt;
        dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb); 
        CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
        CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
        CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
    end
        I(:,jt2)=I(:,jt2)+dI;
        non0=I(:,jt2);
        non0(non0<0)=0;
        I(:,jt2)=non0;
        Cn(:,jl)=Cn(:,jl)+dCn;
        Cn0(:,jl)=Cn0(:,jl)+dCn0;
        Cn2(:,jl)=Cn2(:,jl)+dCn2;
        non0=Cn(:,jl);
        non0(non0<0)=0;
        Cn(:,jl)=non0;
        non0=Cn0(:,jl);
        non0(non0<0)=0;
        Cn0(:,jl)=non0;
       end
end

Comparing to my other accelerated code, I think the reason is that there are much more scalar constants involved in this case, so I transfer them to gpu too. But it's still slower than the one without gpu computing. Maybe the reason is that nr(jz) varies from 23~171 which is too small. Is my thought correct? Is there any way of accelerating this script ?

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Jan 2018-5-20

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/401642-how-to-speed-up-matrix-operations-with-lots-of-scalar-constants-involved#answer_321058

编辑：Jan 2018-5-20

在 MATLAB Online 中打开

If you have a modern Matlab version >= 2016b, use the auto expanding instead of an explicit repmat:

% Ipc = repmat(I(:,jt2),[1,npmax(jz)-1]);
Ipc = I(:,jt2);
... CnE(:,1:npmax(jz)-1,jl)) .* Ipc * dt / hb

I guess, that dt and hb are scalars (please do not let the readers guess!). Then remember, that Matlab evaluate the code from left to right:

Y = X * dt / hb

is equivalent to:

Temp = X * dt;
Y    = Temp / hb;

These are 2 matrix operations. But you can combine the scalar operations:

Y = X * (dt / hb)

This is 1 matrix operation and a cheap scalar division.

As Matt J has explained already, operating on subarrays as in |dCnE(:,2:npmax(jz)) is expensive. Better crop the data, such that you can omit the ugly indexing.

I cannot run it by my own, because you did not provide the input values, but compare it optically:

Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+ ...
      vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
      +vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;

or:

c1 = 1 - exp(-dt / tau2);   % Calculate constants outside the loops!
Ipc  = I(:,jt2);
dCnE = (sigma0 * CnZ(:, :, jl) - sigma * CnE(:, :, jl) + ...
       vs2 * sigma * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
       CnB(:,:,jl) * c1 + ...
       (vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
       CnE(:, :, jl) / (tau1 * dt);

This can be simplified a little bit for the CnE term:

dCnE = (sigma0 * CnZ(:, :, jl) + ...
       CnE(:, :, jl)) .* (Ipc * ((vs2 * sigma - sigma) * dt / hb)) + ...
       CnB(:,:,jl) * c1 + ...
       (vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
       CnE(:, :, jl) / (tau1 * dt);

Methods:

Combine the scalar operations to reduce the number of matrix operations.
Move constants out of the loop.
Crop the data outside the loops instead of applying an a:b indexing in each iteration.
Use auto-expanding instead of inflating by repmat. bsxfun is sometimes faster or slower than auto-expanding - try it.
Avoid repeated calculations. In your case e.g. Ipc(:,1)*dt/hb is computed several times. Using this avoids this:

C1 = I(:, jt2) * (dt/hb);

The latter includes another idea: You inflate I(:,jt2) at first by Ipc = repmat(...), and waste time with cropping out a subvector Ipc(:,1) later on. Better create the subvector once only: Ipc = I(:,jt2).

Finally: Replace

        non0=Cn0(:,jl);
        non0(non0<0)=0;
        Cn0(:,jl)=non0;

by:

Cn0 = max(Cn0, 0);

and move it outside the loops, because you do not use the cleaned values inside the loop.

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

ben huang 2018-5-21

编辑：ben huang 2018-5-21

在 MATLAB Online 中打开

Sorry for unclear explanation, sigma0, sigma, tau1, tau2, dt, hb, vs2, beta0 are all scalars. I have tested all your suggestions without gpu. The results are as follows (In each method, except the parts mentioned, all other parts are the same as the original one)

method 1,2,5

original (6.8328 sec)

        Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
        dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
        +vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
        dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
        -CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
        dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
        dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
        -CnE(:,1,jl)/tau1*dt;
        dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);

modified (5.8009 sec)

outside the loop

Cexp=1-exp(-dt/tau2);
vsdh=vs2*sigma*dt/hb;
bd2h=beta0*dt/2/hb;
ddt=dt/tau1;
mvsdh=(1-vs2)*sigma*dt/hb;
mvd2hb=(1-vs2)*dt/(2*hb)*beta0;
msdh=-sigma0*dt/hb;
ddh=dt/hb;

in the loop

        Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
        dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vsdh*CnE(:,1:npmax(jz)-1,jl)).*Ipc+CnB(:,2:npmax(jz),jl)*Cexp...
        +vs2*(bd2h*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))-ddt*CnE(:,2:npmax(jz),jl);
        dCnB(:,2:npmax(jz))=mvsdh*CnE(:,1:npmax(jz)-1,jl).*Ipc+mvd2hb*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2)...
        -CnB(:,2:npmax(jz),jl)*Cexp;
        dCnZ(:,2:npmax(jz))=msdh*CnZ(:,2:npmax(jz),jl).*Ipc-bd2h*CnZ(:,2:npmax(jz),jl).*Ipc.^2+CnE(:,1:npmax(jz)-1,jl)*ddt;
        dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*ddh+...
        -CnE(:,1,jl)*ddt;
        dCnZ(:,1)=msdh*CnZ(:,1,jl).*Ipc(:,1)-bd2h*CnZ(:,1,jl).*Ipc(:,1).^2;

good method! thanks a lot!

Method 3:

To corp the arrays out of the inner loop and let the temporary arrays take the place of CnE, CnB, CnZ in inner loop, I have to add some commands to connect them. Except for the replacement of CnE, CnB, CnZ, I used comments to mark other changed parts in the modified version.

original (6.7290 sec)

for jl=1:ns
     for jt2=1:(nt+1)
     dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
     dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
     +vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;  %(change of S1 state population density) 
     dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
     -Cn2(:,jl)*(1-exp(-dt/tau2));
     dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
    if (npmax(jz)>nphth)
        Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
        dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
        +vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
        dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
        -CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
        dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
        dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
        -CnE(:,1,jl)/tau1*dt;
        dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb); 
        CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
        CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
        CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
    end
        I(:,jt2)=I(:,jt2)+dI;
        non0=I(:,jt2);
        non0(non0<0)=0;
        I(:,jt2)=non0;
        Cn(:,jl)=Cn(:,jl)+dCn;
        Cn0(:,jl)=Cn0(:,jl)+dCn0;
        Cn2(:,jl)=Cn2(:,jl)+dCn2;
        non0=Cn(:,jl);
        non0(non0<0)=0;
        Cn(:,jl)=non0;
        non0=Cn0(:,jl);
        non0(non0<0)=0;
        Cn0(:,jl)=non0;
       end
end

Modified (8.7912 sec)

for jl=1:ns
  if (npmax(jz)>nphth)
      CnZt2=CnZ(:,2:npmax(jz),jl); % connecting commands
      CnEt2=CnE(:,2:npmax(jz),jl);
      CnBt2=CnB(:,2:npmax(jz),jl);
      CnZt1=CnZ(:,1:npmax(jz)-1,jl);
      CnEt1=CnE(:,1:npmax(jz)-1,jl);
  end
     for jt2=1:(nt+1)
     dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
     dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
     +vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt; 
     dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
     -Cn2(:,jl)*(1-exp(-dt/tau2));
     dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
    if (npmax(jz)>nphth)
        Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
        dCnE(:,2:npmax(jz))=(sigma0*CnZt2-sigma*CnEt2+vs2*sigma*CnEt1).*Ipc*dt/hb+CnBt2*(1-exp(-dt/tau2))...
        +vs2*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)-CnEt2/tau1*dt;
        dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnEt1.*Ipc*dt/hb+(1-vs2)*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)...
        -CnBt2*(1-exp(-dt/tau2));
        dCnZ(:,2:npmax(jz))=-sigma0*CnZt2.*Ipc*dt/hb-(beta0*CnZt2.*Ipc.^2)*dt/(2*hb)+CnEt1/tau1*dt;
        dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
        -CnE(:,1,jl)/tau1*dt;
        dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb); 
        CnZt2=CnZt2+dCnZ(:,2:npmax(jz)); % the number of arrays whose changes need to be record is increase from 3 to 5
        CnZt1=CnZt1+dCnZ(:,1:npmax(jz)-1);
        CnBt2=CnBt2+dCnB(:,2:npmax(jz));
        CnEt2=CnEt2+dCnE(:,2:npmax(jz));
        CnEt1=CnEt1+dCnE(:,1:npmax(jz)-1);
    end
        I(:,jt2)=I(:,jt2)+dI;
        non0=I(:,jt2);
        non0(non0<0)=0;
        I(:,jt2)=non0;
        Cn(:,jl)=Cn(:,jl)+dCn;
        Cn0(:,jl)=Cn0(:,jl)+dCn0;
        Cn2(:,jl)=Cn2(:,jl)+dCn2;
        non0=Cn(:,jl);
        non0(non0<0)=0;
        Cn(:,jl)=non0;
        non0=Cn0(:,jl);
        non0(non0<0)=0;
        Cn0(:,jl)=non0;
       end
       if (npmax(jz)>nphth) % connecting commands
           CnZ(:,2:npmax(jz),jl)=CnZt2;
           CnE(:,2:npmax(jz),jl)=CnEt2;
           CnB(:,2:npmax(jz),jl)=CnBt2;
           CnZ(:,1,jl)=CnZt1(:,1);
           CnE(:,1,jl)=CnEt1(:,1);
       end
end

The trade-off is too expensive.

Method 4: original (6.7951 sec)

        Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
        dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
        +vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
        dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
        -CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
        dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
        dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
        -CnE(:,1,jl)/tau1*dt;
        dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);

auto-expanding modified (7.8393 sec)

        Ipc=I(:,jt2);
        dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+Cn B(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
        +vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
        dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
        -CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
        dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
        dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc*dt/hb+...
        -CnE(:,1,jl)/tau1*dt;
        dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc.^2))*dt/(2*hb);

bsxfun modified (34.3400 sec)

outside the loop

fun1=@(g,h) g.*h.^2;

inside the loop

        Ipc=I(:,jt2);
        dCnE(:,2:npmax(jz))=bsxfun(@times,sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl),Ipc)*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
        +vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
        dCnB(:,2:npmax(jz))=(1-vs2)*sigma*bsxfun(@times,CnE(:,1:npmax(jz)-1,jl),Ipc)*dt/hb+(1-vs2)*(beta0*bsxfun(fun1,CnZ(:,1:npmax(jz)-1,jl),Ipc))*dt/(2*hb)...
        -CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
        dCnZ(:,2:npmax(jz))=-sigma0*bsxfun(@times,CnZ(:,2:npmax(jz),jl),Ipc)*dt/hb-(beta0*bsxfun(fun1,CnZ(:,2:npmax(jz),jl),Ipc))*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
        dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc*dt/hb+...
        -CnE(:,1,jl)/tau1*dt;
        dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc.^2))*dt/(2*hb);

Maybe the reason is that in modified version, every time Ipc is used (it's used several times in one loop), it's auto-expanded, so the total consumed time is more than the original one, which expand Ipc only one time in one loop. Maybe the bsxfun case has the same reason.

non0-moved Method:

After testing several choices, only the following modification was found a little bit faster. To move the cleaning command of Cn,Cn0 out of the inner loop, I have to exchange the counter of outer loop and inner loop.

original (6.7660 sec)

for jl=1:ns
  for jt2=1:(nt+1)
      ...
      ...
      non0=I(:,jt2);
      non0(non0<0)=0;
      I(:,jt2)=non0;
        non0=Cn(:,jl);
        non0(non0<0)=0;
        Cn(:,jl)=non0;
        non0=Cn0(:,jl);
        non0(non0<0)=0;
        Cn0(:,jl)=non0;
    end
end

Modified (6.7522 sec)

for jt2=1:(nt+1)
  for jl=1:ns
      ...
      ...
      non0=I(:,jt2);
      non0(non0<0)=0;
      I(:,jt2)=non0;
  end
     Cn0(Cn0<0)=0;
     Cn(Cn<0)=0;
end

I appreciate all of your suggestions! But why I can't use gpu to accelerate it? Is my thought in the main post correct ?

请先登录，再进行评论。

Answer 2

Matt J 2018-5-20

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/401642-how-to-speed-up-matrix-operations-with-lots-of-scalar-constants-involved#answer_321043

编辑：Matt J 2018-5-20

在 MATLAB Online 中打开

Statements like "2:npmax(jz)" which construct index arrays on the right hand side cost, and you are incurring that cost repeatedly by re-executing them unnecessarily.

Instead of this,

   dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...

Do things like

   indices=2:npmax(jz);  %pre-compute
   dCnE(:,indices)=(sigma0*CnZ(:,indices,jl)-sigma*CnE(:,indices,jl)+...

Extracting sub-arrays like CnZ(:,2:npmax(jz),jl) is similarly costly. Use temporary variables to avoid this as well, e.g.

temp=CnZ(:,2:npmax(jz),jl);
dCnZ(:,2:npmax(jz))=-sigma0*tmp.*Ipc*dt/hb-(beta0*tmp.*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;

Finally, repmat should be unnecessary if you are using a Matlab version more recent than R2016b.

4 个评论
显示 2更早的评论隐藏 2更早的评论

ben huang 2018-5-21

编辑：ben huang 2018-5-21

在 MATLAB Online 中打开

I have tried using temporary arrays without gpu as follows, but the execution time of this block is 9.6524 sec, which is 2.9064 sec more than the original one(without gpu too).

toc
for jl=1:ns
     for jt2=1:(nt+1)
     dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
     dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
     +vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;  %(change of S1 state population density) 
     dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
     -Cn2(:,jl)*(1-exp(-dt/tau2));
     dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
    if (npmax(jz)>nphth)
        Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
        CnZt2=CnZ(:,2:npmax(jz),jl); % temporary array
        CnEt2=CnE(:,2:npmax(jz),jl);
        CnBt2=CnB(:,2:npmax(jz),jl);
        CnZt1=CnZ(:,1:npmax(jz)-1,jl);
        CnEt1=CnE(:,1:npmax(jz)-1,jl);
        dCnE(:,2:npmax(jz))=(sigma0*CnZt2-sigma*CnEt2+vs2*sigma*CnEt1).*Ipc*dt/hb+CnBt2*(1-exp(-dt/tau2))...
        +vs2*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)-CnEt2/tau1*dt;
        dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnEt1.*Ipc*dt/hb+(1-vs2)*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)...
        -CnBt2*(1-exp(-dt/tau2));
        dCnZ(:,2:npmax(jz))=-sigma0*CnZt2.*Ipc*dt/hb-(beta0*CnZt2.*Ipc.^2)*dt/(2*hb)+CnEt1/tau1*dt;
        dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
        -CnE(:,1,jl)/tau1*dt;
        dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb); 
        CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
        CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
        CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
    end
        I(:,jt2)=I(:,jt2)+dI;
        non0=I(:,jt2);
        non0(non0<0)=0;
        I(:,jt2)=non0;
        Cn(:,jl)=Cn(:,jl)+dCn;
        Cn0(:,jl)=Cn0(:,jl)+dCn0;
        Cn2(:,jl)=Cn2(:,jl)+dCn2;
        non0=Cn(:,jl);
        non0(non0<0)=0;
        Cn(:,jl)=non0;
        non0=Cn0(:,jl);
        non0(non0<0)=0;
        Cn0(:,jl)=non0;
       end
end
toc

I think maybe the reason is that the cost of producing a temporary array is more than the saved time of using them, because these temporary arrays is used only two or three times.

Matt J 2018-5-21

I don't see any temporary arrays in what you posted. It looks identical to your original code.

ben huang 2018-5-21

编辑：ben huang 2018-5-21

在 MATLAB Online 中打开

Sorry for not specifying the changed part! Please refer to the following code, which is the changed part only.

original (6.7460 sec)

        Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
        dCnE(: , 2 : npmax(jz)) = ( sigma0 * CnZ(: , 2 : npmax(jz) , jl) - sigma * CnE(: , 2 : npmax(jz) , jl) +...
        vs2 * sigma * CnE(: , 1 : npmax(jz) - 1 , jl)) .* Ipc * dt / hb + CnB(: , 2 : npmax(jz) , jl) * ( 1 - exp( -dt / tau2 ))...
        + vs2 * ( beta0 * CnZ(: , 1 : npmax(jz) - 1 , jl) .* ( Ipc .^ 2 )) * dt / ( 2 * hb )- CnE(: , 2 : npmax(jz) , jl) / tau1 * dt ;
        dCnB(: , 2 : npmax(jz)) = ( 1 - vs2 ) * sigma * CnE(: , 1 : npmax(jz) - 1 , jl) .* Ipc * dt / hb + ( 1 - vs2 ) * ( beta0 * CnZ(: , 1 : npmax(jz) - 1 , jl) .*( Ipc .^ 2 )) * dt / ( 2 * hb )...
        - CnB(: , 2 : npmax(jz) , jl) *( 1 - exp ( -dt / tau2 ));
        dCnZ(: , 2 : npmax(jz)) = -sigma0 * CnZ(: , 2 : npmax(jz) , jl) .* Ipc * dt / hb - ( beta0 * CnZ(: , 2 : npmax(jz) , jl) .* Ipc .^ 2 ) * dt / ( 2 * hb ) + CnE(: , 1 : npmax(jz) -  1 , jl) / tau1 * dt ;

Modified (9.6524 sec)

        Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
        CnZt2 = CnZ(: , 2 : npmax(jz) , jl); % temporary array
        CnEt2 = CnE(: , 2 : npmax(jz) , jl);
        CnBt2 = CnB(: , 2 : npmax(jz) , jl);
        CnZt1 = CnZ(: , 1 : npmax(jz) - 1 , jl);
        CnEt1 = CnE(: , 1 : npmax(jz) - 1 , jl);
        dCnE(: , 2 : npmax(jz)) = ( sigma0 * CnZt2 - sigma * CnEt2 + vs2 * sigma * CnEt1 ) .* Ipc * dt / hb + CnBt2 *( 1 - exp( -dt / tau2 ))...
        +vs2 * ( beta0 * CnZt1 .* ( Ipc .^ 2 )) * dt / ( 2 * hb )- CnEt2 / tau1 * dt ;
        dCnB(: , 2 : npmax(jz)) = ( 1 - vs2 ) * sigma * CnEt1 .* Ipc * dt / hb + ( 1 - vs2 ) * ( beta0 * CnZt1 .* ( Ipc .^ 2 )) * dt / ( 2 * hb )...
        - CnBt2 * ( 1 - exp ( -dt / tau2 )) ;
        dCnZ(: , 2 : npmax(jz)) = -sigma0 * CnZt2 .* Ipc * dt / hb - ( beta0 * CnZt2 .* Ipc .^ 2 ) * dt / ( 2 * hb ) + CnEt1 / tau1 * dt ;