Perform specific operations on arrays without a "for" loop

Hi! I'm trying to perform some particular operations between arrays in a more efficient way, possibly without using the "for" loop. Here is the function I would like to optimize:
function [chi,zeta,M,O,Delta,rho] = calc_fun(N,n,x,x_avg,omega,xi)
chi=cumsum(x.*cos((1:N)'*omega));
zeta=cumsum(x.*sin((1:N)'*omega));
M=zeros(floor(N/n),1);
for m=1:floor(N/n)
M_term = (chi(m+1:N)-chi(1:N-m)).^2 + (zeta(m+1:N)-zeta(1:N-m)).^2;
M(m) = sum(M_term)/(N-m);
end
O = (x_avg)^2 * (1-cos(xi.*omega)/(1-cos(omega)));
Delta = M - O;
covariance=cov(xi,Delta,1);
rho = covariance(1,2) / sqrt(var(xi,1)*var(Delta,1));
end
Usually, N is more than 200,000 and n=200; x is the input time series, an array of N values, and omega is an element of an array of 512 values.
This function is executed for each omega value and for each of several time series (at least 15), which brings the for loop inside the function to more than 7.5 million iterations in total (on my PC, it takes around 5 h to complete the job).
It seems to me that most of the time is spent extracting different portions of the vectors chi and zeta for the different m values, but I don't know how to access them without the use of a for loop. Is there a better/faster way to execute this calculation? Any suggestions would be helpful!
Here is my full code (function at the bottom):
close all; clear; clc
%% INPUT
% input parameters
n_series=15; % number of time series to be analyzed
x_0=0.5; % initial value of the sequence, x_0 in (0,1)
f_s=51200; % frequency of acquisition
f_delta=50; % delta frequency of analysis
n=200; % reduction for M
seconds=4;
% parameters definition
time=(0:1/f_s:seconds)';
N=size(time,1);
f=(f_delta:f_delta:f_s/2)'; % frequencies to be analyzed
omega=2*pi*f/f_s;
omegaSize=size(omega,1);
xi=(1:floor(N/n))';
z(1:n_series)=struct('time',time,'x',[],'x_avg',[],'chi',[],'chi_avg',[],...
'zeta',[],'zeta_avg',[],'radius',[],'M',[],'O',[],'Delta',[],'rho',[]);
% test sequence definition (dummy sequence, usually replaced by real time series)
mu=linspace(1.1,4,n_series); % parameter of the sequence, mu in (1,4]
data=zeros(N,n_series);
for i=1:n_series
data(1,i)=x_0;
for j=2:N
data(j,i)=mu(i)*data(j-1,i)*(1-data(j-1,i));
end
end
%% MAIN RUN
tic
for i=1:n_series
i_time=tic;
% initialization
x=data(:,i);
x_avg=mean(x);
chi=zeros(N,omegaSize); zeta=zeros(N,omegaSize);
M=zeros(floor(N/n),omegaSize); O=zeros(floor(N/n),omegaSize);
Delta=zeros(floor(N/n),omegaSize); rho=zeros(1,omegaSize);
% calculation
for w=1:omegaSize
w_time=tic;
[chi(:,w),zeta(:,w),M(:,w),O(:,w),Delta(:,w),rho(1,w)]=...
calc_fun(N,n,x,x_avg,omega(w),xi);
w_time_end=toc(w_time);
fprintf('SERIES %u of %u - END of frequency %u Hz (%u of %u) - time: %.3f\n',...
i,n_series,f(w),w,omegaSize,w_time_end);
end
chi_avg=mean(chi);
zeta_avg=mean(zeta);
radius_term=(chi-chi_avg).^2 + (zeta-zeta_avg).^2;
radius=sqrt(1/(N-1) * sum(radius_term));
% data saving
z(i).x=x; z(i).x_avg=x_avg;
z(i).chi=chi; z(i).chi_avg=chi_avg;
z(i).zeta=zeta; z(i).zeta_avg=zeta_avg; z(i).radius=radius;
z(i).M=M; z(i).O=O; z(i).Delta=Delta; z(i).rho=rho;
i_time_end=toc(i_time);
fprintf('END of SERIES %u of %u - time: %.3f\n',i,n_series,i_time_end);
end
clear x x_avg chi chi_avg zeta zeta_avg radius_term radius M_term M O Delta rho
toc
%% FUNCTION
function [chi,zeta,M,O,Delta,rho] = calc_fun(N,n,x,x_avg,omega,xi)
chi=cumsum(x.*cos((1:N)'*omega));
zeta=cumsum(x.*sin((1:N)'*omega));
M=zeros(floor(N/n),1);
for m=1:floor(N/n)
M_term = (chi(m+1:N)-chi(1:N-m)).^2 + (zeta(m+1:N)-zeta(1:N-m)).^2;
M(m) = sum(M_term)/(N-m);
end
O = (x_avg)^2 * (1-cos(xi.*omega)/(1-cos(omega)));
Delta = M - O;
covariance=cov(xi,Delta,1);
rho = covariance(1,2) / sqrt(var(xi,1)*var(Delta,1));
end
Thanks in advance for any suggestions!

Accepted Answer

You do not want fewer for loops, but more: accessing the RAM is slower than using the CPU cache:
function [chi,zeta,M,O,Delta,rho] = calc_fun2(N,n,x,x_avg,omega,xi)
chi = cumsum(x.*cos((1:N)'*omega));
zeta = cumsum(x.*sin((1:N)'*omega));
q = floor(N / n);
M = zeros(q, 1);
for m = 1:q
M2 = 0;
for k = m+1:N
M2 = M2 + (chi(k) - chi(k - m)) .^ 2 + ...
(zeta(k) - zeta(k - m)) .^ 2;
end
M(m) = M2 / (N-m);
end
O = x_avg^2 * (1 - cos(xi.*omega) / (1-cos(omega)));
Delta = M - O;
C = cov(xi, Delta, 1);
rho = C(1,2) / sqrt(var(xi,1) * var(Delta,1));
end
When I run with n_series=1 and for w=1:5 (instead of 1:omegaSize), the original code needs 16.7 s and the loop version only 6.0 s in Matlab R2018b on my mobile i5.
Using the Parallel Computing Toolbox on the 2 cores, the time is reduced to 4.5 s:
parfor w=1:5 % instead of 1:omegaSize
That is 27% of the original processing time. With more cores the advantage should be larger.
Alternatively run the parallel loop inside the function:
parfor m = 1:q
This has the same speed.
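For reference, the outer parfor variant in the main script would look like this (a sketch: chi(:,w), zeta(:,w), etc. are valid sliced output variables inside parfor, but the per-frequency fprintf lines will print out of order):
parfor w = 1:omegaSize
    % each worker handles one frequency; columns are sliced, so no conflicts
    [chi(:,w),zeta(:,w),M(:,w),O(:,w),Delta(:,w),rho(1,w)] = ...
        calc_fun2(N,n,x,x_avg,omega(w),xi);
end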

4 Comments

It works! Thanks!!
With my i7, using n_series=1 and w=1:32, my original code takes 74.8 s to complete, while with your edits it takes only 22.4 s! And with parfor (6 cores) on the w loop it goes down to 5.3 s!
I always thought that executing operations on whole arrays instead of single elements was faster... Can you be more specific about the distinction between accessing RAM and the CPU cache? I mean, in the for loop with k=m+1:N the code still needs to access (almost) all the elements of the arrays chi and zeta, just as the previous code did. Where is the difference? Can you suggest some documentation to deepen this topic? That would be great!
Thanks again!
74.8 s to 5.3 s is nice! I'm glad to read this. While waiting for 5 hours is stressful, 21 minutes sounds bearable.
This code creates 9 vectors:
M_term = (chi(m+1:N) - chi(1:N-m)).^2 + (zeta(m+1:N) - zeta(1:N-m)).^2;
% 1. a = chi(m+1:N)
% 2. b = chi(1:N-m)
% 3. c = a - b
% 4. d = c .^ 2
% 5. e = zeta(m+1:N)
% 6. f = zeta(1:N-m);
% 7. g = e - f
% 8. g .^ 2
% 9. d + g
For this, the data are fetched from the slow RAM first. If they do not fit into the CPU's 1st-level cache, they are stored in the slower 2nd-level cache or, even worse, in the RAM again.
The total amount of allocated memory is 1.8827e9 elements * 8 bytes = 15.06 GB per call of calc_fun()! Maybe Matlab's JIT applies some smart tricks to re-use the vectors. If not, the allocation requires the OS to deliver cleared sections of memory, so they must be overwritten with zeros before being re-used. In consequence, 30 GB of data are written. Of course this takes time.
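The 1.8827e9 figure can be reproduced directly: each loop pass creates 9 temporary vectors of N-m doubles each, so with N = 204801 (4 s at 51200 Hz plus the initial sample) and q = floor(N/n) = 1024:
N = 204801; q = 1024;
nElems = 9 * sum(N - (1:q))   % 9 temporaries of N-m doubles per iteration -> 1.8827e9
nGB    = nElems * 8 / 1e9     % -> 15.06 GB allocated per call of calc_fun()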
It is at least known that Matlab's JIT avoids creating the index vectors m+1:N and 1:N-m explicitly. You can check this by forcing Matlab to create them:
idx1 = m+1:N;
idx2 = 1:N-m;
M_term = (chi(idx1) - chi(idx2)).^2 + (zeta(idx1) - zeta(idx2)).^2;
This is slower for 2 reasons: 1. the index vectors must be created; 2. Matlab then loses the knowledge that the indices are sorted, so it must check for each index whether it lies inside the existing range.
In the loop version no intermediate vectors are created:
M2 = 0;
for k = m+1:N
M2 = M2 + (chi(k) - chi(k - m)) .^ 2 + ...
(zeta(k) - zeta(k - m)) .^ 2;
end
M(m) = M2 / (N-m);
Even here the input data are obtained from the slow RAM. For this, the CPU imports blocks of 64 bytes (the cache-line size) and stores them locally in a cache, so reading 8 doubles has almost the same cost as reading 1 double. The scalars would fit into the 1st-level cache, but it can be assumed that they are stored directly in CPU registers.
"I always tought that executing operations on arrays instead of elements was faster." - This is an interesting idea. It is true for some cases:
A = rand(1000, 1000);
B = rand(1000, 1000);
C = zeros(1000, 1000);
for i2 = 1:1000
for i1 = 1:1000
C(i1, i2) = A(i1, i2) + B(i1, i2);
end
end
D = A + B;
Here accessing the arrays without a loop is much faster, because the addition runs inside Matlab's highly optimized, multi-threaded library code. And you do not need any intermediate arrays, so there is no additional memory overhead.
Since Matlab R6.5 (2002), loops have been subject to massive accelerations by the JIT. The old rumor "loops are slow" is really outdated, but you still find many teachers telling students what they learned in the last century.
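You can measure this on your own machine with timeit; here loop_add is a hypothetical local function wrapping the explicit double loop from above:
A = rand(1000, 1000);
B = rand(1000, 1000);
t_vec  = timeit(@() A + B);            % vectorized addition
t_loop = timeit(@() loop_add(A, B));   % loop_add: the double loop shown above
fprintf('vectorized: %g s, loop: %g s\n', t_vec, t_loop);
timeit calls the function repeatedly and returns a robust median, so it is more reliable than a single tic/toc measurement.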
"Can you suggest some documentation to deepen this topic?" - I do not know an article about this topic. The strategy is to examine, which tasks Matlab has to perform. Imagine the contents of the memory as a physical object, e.g. a strip of paper. Then perform the operations with these objects. While taking the vector v as input is cheap, v(a:b) is an expensive copy and creation of a new strip. Split your working area in 5 sections: scalars match into registers (it does not matter if there are 8 or 64 - in doubt, there are less than you need), then more memory is split in 1st and 2nd level cache, RAM and the harddrive. From level to level the access times grow by the factor 10 (or 100 - such details do not matter).
The cache line matters if you process data in the caches with multiple cores: a core grabs a block of 64 bytes and blocks it for the other cores. So if core 1 writes to x(1) and core 2 to x(2), they block each other and the processing time is slower than with a single thread. A rule of thumb: parallelize the outer loops to avoid this.
Moving the parfor from the subfunction to for w=1:omegaSize reduces the processing time by further 20%.
Yes, I did so in my previous run, as you suggested. I didn't know about these 5 different levels of storing and accessing variables, really interesting! Thanks for all your support!
