Perform specific operations on arrays without a "for" loop

Hi! I'm trying to perform some particular operations between arrays in a more efficient way, possibly without using the "for" loop. Here is the function I would like to optimize:
function [chi,zeta,M,O,Delta,rho] = calc_fun(N,n,x,x_avg,omega,xi)
chi=cumsum(x.*cos((1:N)'*omega));
zeta=cumsum(x.*sin((1:N)'*omega));
M=zeros(floor(N/n),1);
for m=1:floor(N/n)
M_term = (chi(m+1:N)-chi(1:N-m)).^2 + (zeta(m+1:N)-zeta(1:N-m)).^2;
M(m) = sum(M_term)/(N-m);
end
O = (x_avg)^2 * (1-cos(xi.*omega)/(1-cos(omega)));
Delta = M - O;
covariance=cov(xi,Delta,1);
rho = covariance(1,2) / sqrt(var(xi,1)*var(Delta,1));
end
Usually, N is more than 200,000 and n=200; x is the input time series, an array of N values, and omega is an element of an array of 512 values.
This function is executed for each omega value and for each of several time series (at least 15), which brings the for loop inside the function to more than 7.5 million iterations in total (on my PC, it takes around 5 h to complete the job).
It seems to me that most of the time is spent extracting different portions of the vectors chi and zeta for the different m values, but I don't know how to access them without the use of a for loop. Is there a better/faster way to execute this calculation? Any suggestions would be helpful!
Here is my full code (function at the bottom):
close all; clear; clc
%% INPUT
% input parameters
n_series=15; % number of time series to be analyzed
x_0=0.5; % initial value of the sequence, x_0 in (0,1)
f_s=51200; % frequency of acquisition
f_delta=50; % delta frequency of analysis
n=200; % reduction for M
seconds=4;
% parameters definition
time=(0:1/f_s:seconds)';
N=size(time,1);
f=(f_delta:f_delta:f_s/2)'; % frequencies to be analyzed
omega=2*pi*f/f_s;
omegaSize=size(omega,1);
xi=(1:floor(N/n))';
z(1:n_series)=struct('time',time,'x',[],'x_avg',[],'chi',[],'chi_avg',[],...
'zeta',[],'zeta_avg',[],'radius',[],'M',[],'O',[],'Delta',[],'rho',[]);
% test sequence definition (dummy sequence, usually replaced by real time series)
mu=linspace(1.1,4,n_series); % parameter of the sequence, mu in (1,4]
data=zeros(N,n_series);
for i=1:n_series
data(1,i)=x_0;
for j=2:N
data(j,i)=mu(i)*data(j-1,i)*(1-data(j-1,i));
end
end
%% MAIN RUN
tic
for i=1:n_series
i_time=tic;
% initialization
x=data(:,i);
x_avg=mean(x);
chi=zeros(N,omegaSize); zeta=zeros(N,omegaSize);
M=zeros(floor(N/n),omegaSize); O=zeros(floor(N/n),omegaSize);
Delta=zeros(floor(N/n),omegaSize); rho=zeros(1,omegaSize);
% calculation
for w=1:omegaSize
w_time=tic;
[chi(:,w),zeta(:,w),M(:,w),O(:,w),Delta(:,w),rho(1,w)]=...
calc_fun(N,n,x,x_avg,omega(w),xi);
w_time_end=toc(w_time);
fprintf('SERIES %u of %u - END of frequency %u Hz (%u of %u) - time: %.3f\n',...
i,n_series,f(w),w,omegaSize,w_time_end);
end
chi_avg=mean(chi);
zeta_avg=mean(zeta);
radius_term=(chi-chi_avg).^2 + (zeta-zeta_avg).^2;
radius=sqrt(1/(N-1) * sum(radius_term));
% data saving
z(i).x=x; z(i).x_avg=x_avg;
z(i).chi=chi; z(i).chi_avg=chi_avg;
z(i).zeta=zeta; z(i).zeta_avg=zeta_avg; z(i).radius=radius;
z(i).M=M; z(i).O=O; z(i).Delta=Delta; z(i).rho=rho;
i_time_end=toc(i_time);
fprintf('END of SERIES %u of %u - time: %.3f\n',i,n_series,i_time_end);
end
clear x x_avg chi chi_avg zeta zeta_avg radius_term radius M_term M O Delta rho
toc
%% FUNCTION
function [chi,zeta,M,O,Delta,rho] = calc_fun(N,n,x,x_avg,omega,xi)
chi=cumsum(x.*cos((1:N)'*omega));
zeta=cumsum(x.*sin((1:N)'*omega));
M=zeros(floor(N/n),1);
for m=1:floor(N/n)
M_term = (chi(m+1:N)-chi(1:N-m)).^2 + (zeta(m+1:N)-zeta(1:N-m)).^2;
M(m) = sum(M_term)/(N-m);
end
O = (x_avg)^2 * (1-cos(xi.*omega)/(1-cos(omega)));
Delta = M - O;
covariance=cov(xi,Delta,1);
rho = covariance(1,2) / sqrt(var(xi,1)*var(Delta,1));
end
Thanks in advance for any suggestions!

Accepted Answer

You do not want fewer for loops, but more: accessing the RAM is slower than using the CPU cache:
function [chi,zeta,M,O,Delta,rho] = calc_fun2(N,n,x,x_avg,omega,xi)
chi = cumsum(x.*cos((1:N)'*omega));
zeta = cumsum(x.*sin((1:N)'*omega));
q = floor(N / n);
M = zeros(q, 1);
for m = 1:q
M2 = 0;
for k = m+1:N
M2 = M2 + (chi(k) - chi(k - m)) .^ 2 + ...
(zeta(k) - zeta(k - m)) .^ 2;
end
M(m) = M2 / (N-m);
end
O = x_avg^2 * (1 - cos(xi.*omega) / (1-cos(omega)));
Delta = M - O;
C = cov(xi, Delta, 1);
rho = C(1,2) / sqrt(var(xi,1) * var(Delta,1));
end
When I run with n_series=1 and for w=1:5 (instead of 1:omegaSize), the original code needs 16.7 s and the loop version only 6.0 s in Matlab R2018b on my mobile i5.
Using the Parallel Computing Toolbox on the 2 cores, the time is reduced to 4.5 s:
parfor w=1:5 % instead of 1:omegaSize
That is 27% of the original processing time. With more cores the advantage should be larger.
Alternatively run the parallel loop inside the function:
parfor m = 1:q
This has the same speed.
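For reference, the outer parfor variant in the main script would look like this (a sketch: chi(:,w), zeta(:,w), etc. are valid sliced output variables inside parfor, but the per-frequency fprintf lines will print out of order):
parfor w = 1:omegaSize
    % each worker handles one frequency; columns are sliced, so no conflicts
    [chi(:,w),zeta(:,w),M(:,w),O(:,w),Delta(:,w),rho(1,w)] = ...
        calc_fun2(N,n,x,x_avg,omega(w),xi);
end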

4 Comments

It works! Thanks!!
With my i7, using n_series=1 and w=1:32, my original code takes 74.8 s to complete, while with your edits it takes only 22.4 s! And with parfor (6 cores) on the w loop it goes down to 5.3 s!
I always thought that executing operations on whole arrays instead of single elements was faster... Can you be more specific about the distinction between accessing RAM and the CPU cache? I mean, in the for loop with k=m+1:N the code still needs to access (almost) all the elements of the arrays chi and zeta, just as the previous code did. Where is the difference? Can you suggest some documentation to deepen this topic? That would be great!
Thanks again!
74.8 s to 5.3 s is nice! I'm glad to read this. While waiting for 5 hours is stressful, 21 minutes sounds bearable.
This code creates 9 vectors:
M_term = (chi(m+1:N) - chi(1:N-m)).^2 + (zeta(m+1:N) - zeta(1:N-m)).^2;
% 1. a = chi(m+1:N)
% 2. b = chi(1:N-m)
% 3. c = a - b
% 4. d = c .^ 2
% 5. e = zeta(m+1:N)
% 6. f = zeta(1:N-m);
% 7. g = e - f
% 8. g .^ 2
% 9. d + g
For this, the data are fetched from the slow RAM first. If they do not fit into the CPU's 1st-level cache, they are stored in the slower 2nd-level cache or, even worse, in the RAM again.
The total amount of allocated memory is 1.8827e9 elements * 8 bytes = 15.06 GB per call of calc_fun()! Maybe Matlab's JIT applies some smart tricks to re-use the vectors. If not, the allocation requires the OS to deliver cleared sections of memory, so they must be overwritten with zeros before being re-used. In consequence, 30 GB of data are written. Of course this takes time.
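The 1.8827e9 figure can be reproduced directly: each loop pass creates 9 temporary vectors of N-m doubles each, so with N = 204801 (4 s at 51200 Hz plus the initial sample) and q = floor(N/n) = 1024:
N = 204801; q = 1024;
nElems = 9 * sum(N - (1:q))   % 9 temporaries of N-m doubles per iteration -> 1.8827e9
nGB    = nElems * 8 / 1e9     % -> 15.06 GB allocated per call of calc_fun()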
It is at least known that Matlab's JIT avoids creating the index vectors m+1:N and 1:N-m explicitly. You can check this by forcing Matlab to create them:
idx1 = m+1:N;
idx2 = 1:N-m;
M_term = (chi(idx1) - chi(idx2)).^2 + (zeta(idx1) - zeta(idx2)).^2;
This is slower for 2 reasons: 1. the index vectors must be created; 2. Matlab then loses the knowledge that the indices are sorted, so it must check for each index whether it lies inside the existing range.
In the loop version no intermediate vectors are created:
M2 = 0;
for k = m+1:N
M2 = M2 + (chi(k) - chi(k - m)) .^ 2 + ...
(zeta(k) - zeta(k - m)) .^ 2;
end
M(m) = M2 / (N-m);
Even here the input data are obtained from the slow RAM. For this, the CPU imports blocks of 64 bytes (the cache-line size) and stores them locally in a cache, so reading 8 doubles has almost the same cost as reading 1 double. The scalars would fit into the 1st-level cache, but it can be assumed that they are stored directly in CPU registers.
"I always tought that executing operations on arrays instead of elements was faster." - This is an interesting idea. It is true for some cases:
A = rand(1000, 1000);
B = rand(1000, 1000);
C = zeros(1000, 1000);
for i2 = 1:1000
for i1 = 1:1000
C(i1, i2) = A(i1, i2) + B(i1, i2);
end
end
D = A + B;
Here accessing the arrays without a loop is much faster, because the addition runs inside Matlab's highly optimized, multi-threaded library code. And you do not need any intermediate arrays, so there is no additional memory overhead.
Since Matlab R6.5 (2002), loops have been subject to massive accelerations by the JIT. The old rumor "loops are slow" is really outdated, but you still find many teachers telling students what they learned in the last century.
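You can measure this on your own machine with timeit; here loop_add is a hypothetical local function wrapping the explicit double loop from above:
A = rand(1000, 1000);
B = rand(1000, 1000);
t_vec  = timeit(@() A + B);            % vectorized addition
t_loop = timeit(@() loop_add(A, B));   % loop_add: the double loop shown above
fprintf('vectorized: %g s, loop: %g s\n', t_vec, t_loop);
timeit calls the function repeatedly and returns a robust median, so it is more reliable than a single tic/toc measurement.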
"Can you suggest some documentation to deepen this topic?" - I do not know an article about this topic. The strategy is to examine, which tasks Matlab has to perform. Imagine the contents of the memory as a physical object, e.g. a strip of paper. Then perform the operations with these objects. While taking the vector v as input is cheap, v(a:b) is an expensive copy and creation of a new strip. Split your working area in 5 sections: scalars match into registers (it does not matter if there are 8 or 64 - in doubt, there are less than you need), then more memory is split in 1st and 2nd level cache, RAM and the harddrive. From level to level the access times grow by the factor 10 (or 100 - such details do not matter).
The cache line matters if you process data in the caches with multiple cores: a core grabs a block of 64 bytes and blocks it for the other cores. So if core 1 writes to x(1) and core 2 to x(2), they block each other and the processing time is slower than with a single thread. A rule of thumb: parallelize the outer loops to avoid this.
Moving the parfor from the subfunction to for w=1:omegaSize reduces the processing time by further 20%.
Yes, I did so in my previous run, as you suggested. I didn't know about these 5 different levels of storing and accessing variables, really interesting! Thanks for all your support!
