Why is this loop faster than a vectorised version? Could the vectorised version be made faster than the loop?

Question

Michael 2023-11-22

https://ww2.mathworks.cn/matlabcentral/answers/2050857-why-is-this-loop-faster-than-a-vectorised-version-could-the-vectorised-version-be-made-faster-than

Commented: Alexander 2023-11-22

I'm trying to improve the performance of some code that uses a loop. I've written a vectorised version with matching functionality that avoids costly transposes. However, I've found that the loop version invariably runs ~25% faster. Is there any way to further improve the performance of the vectorised version so that it outperforms the loop?

Of course, this is a tiny sub-function of a much larger, more complex program, but it is called tens of thousands of times in a single run, and is a bottleneck in the run time.

I do have the parallel computing toolbox, so could look into using parfor loops, but these don't always save time, and I was surprised that the vectorised version doesn't perform better!

% Input vectors
x1 = rand(1, 960);
x2 = rand(1, 960);
%% Looped version
tic;
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutLoop = zeros(length(x2), length(Y1)); 
for i=1:length(Y1)
    p1 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1).*ones(1, length(x2));
    p2 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1) - 0.35.*(4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*(x2 - 51);
    p3  = p1.*(x1 >= Y2(i)) + p2 .* (x1 < Y2(i)); 
    g1 = (x1 - Y2(i))./Y2(i); 
    g1 = abs(g1);
    EiLoop = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).* 10.^(x2./10);
    EoutLoop((1:length(x2)), i) = EiLoop(1:length(x2));
end
if (size(EoutLoop,1) > 1) 
    EoutLoop = sum(EoutLoop);
end
EoutLoop = 10 .* log(EoutLoop) ./ log(10);
% end timer
toc;
%% Vectorised version
% transpose input vector for vectorised version
x2 = x2.';
x1 = x1.';
% start timer
tic;
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutVec = zeros(length(x2), length(Y1)); 
p1 = 4.*Y2./24.673.*(0.004368.*Y2 + 1).*ones(length(x2), length(Y2));
p2 = 4.*Y2./24.673.*(0.004368.*Y2 + 1) - 0.35.*(4.*Y2./24.673.*(0.004368.*Y2 + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*repmat((x2 - 51), 1, length(Y2));
p3  = p1.*(x1 >= repmat(Y2, 1, size(x1, 2))) + p2 .* (x1 < repmat(Y2, 1, size(x1, 2))); 
g1 = ((x1 - repmat(Y2, 1, size(x1, 2)))./repmat(Y2, 1, size(x1, 2))); 
g1 = abs(g1);
EVec = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).*repmat(10.^(x2./10), 1, length(Y2));
EoutVec((1:length(x2)), :) = EVec(1:length(x2), :);
if (size(EoutVec,1) > 1) 
    EoutVec = sum(EoutVec);
end
EoutVec = 10.*log(EoutVec)./log(10);
% end timer
toc;

3 Comments (1 earlier comment not shown)

Michael 2023-11-22

Ok, that's helpful, thanks.

Alexander 2023-11-22

I agree. On my old Win7 machine (R2021b) the result is

Loop: Elapsed time is 0.316945 seconds.

Vectorised: Elapsed time is 0.062135 seconds.

Accepted Answer

Dyuman Joshi 2023-11-22

https://ww2.mathworks.cn/matlabcentral/answers/2050857-why-is-this-loop-faster-than-a-vectorised-version-could-the-vectorised-version-be-made-faster-than#answer_1358077

Edited: Dyuman Joshi 2023-11-22

Ideally, use timeit rather than tic/toc to get a more accurate measure of run time. tic/toc is better suited to timing portions of code.

"Use the timeit function for a rigorous measurement of function execution time. Use tic and toc to estimate time for smaller portions of code that are not complete functions." Reference - Measure the Performance of Your Code

While using tic-toc to measure the time of the code, you can either

> Run the same code multiple times via a for loop and average the data - "Sometimes programs run too fast for tic and toc to provide useful data. If your code is faster than 1/10 second, consider measuring it running in a loop, and then average to find the time for a single run." (Reference - https://in.mathworks.com/help/matlab/ref/tic.html#bswc7ww-3)

or

> Take a large(r) dataset.
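
The first option can be sketched as follows. This is a minimal illustration, not code from the thread: `work` is a hypothetical stand-in for the code under test, and `nRuns` is an assumed repetition count. The whole loop is timed once and the total is divided by the number of runs, as the quoted documentation suggests.

```matlab
% Hypothetical workload standing in for the code being measured.
x = rand(1, 960);
work = @() sum(exp(-x) .* 10.^(x ./ 10));

nRuns = 1000;                      % assumed repetition count

y = work();                        % warm-up call, excluded from the timing

tStart = tic;                      % start one timer for the whole loop
for k = 1:nRuns
    y = work();
end
timePerRun = toc(tStart) / nRuns;  % average time for a single run

fprintf('Average time per run: %.3g seconds\n', timePerRun);
```

Passing the handle returned by `tic` to `toc` keeps this measurement independent of any other `tic` calls made elsewhere in the program.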

I have chosen the latter option below -

% Input vectors
%% Large(r) dataset
x1 = rand(1, 100000);
x2 = rand(1, 100000);
%% Looped version
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutLoop = zeros(length(x2), length(Y1)); 
tic;
for i=1:length(Y1)
    p1 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1).*ones(1, length(x2));
    p2 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1) - 0.35.*(4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*(x2 - 51);
    p3  = p1.*(x1 >= Y2(i)) + p2 .* (x1 < Y2(i)); 
    g1 = (x1 - Y2(i))./Y2(i); 
    g1 = abs(g1);
    EiLoop = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).* 10.^(x2./10);
    EoutLoop((1:length(x2)), i) = EiLoop(1:length(x2));
end
if (size(EoutLoop,1) > 1) 
    EoutLoop = sum(EoutLoop);
end
EoutLoop = 10 .* log(EoutLoop) ./ log(10);
% end timer
toc;
Elapsed time is 1.298153 seconds.

%% Vectorised version
% transpose input vector for vectorised version
x2 = x2.';
x1 = x1.';
% start timer
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutVec = zeros(length(x2), length(Y1)); 
tic;
p1 = 4.*Y2./24.673.*(0.004368.*Y2 + 1).*ones(length(x2), length(Y2));
p2 = 4.*Y2./24.673.*(0.004368.*Y2 + 1) - 0.35.*(4.*Y2./24.673.*(0.004368.*Y2 + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*repmat((x2 - 51), 1, length(Y2));
p3  = p1.*(x1 >= repmat(Y2, 1, size(x1, 2))) + p2 .* (x1 < repmat(Y2, 1, size(x1, 2))); 
g1 = ((x1 - repmat(Y2, 1, size(x1, 2)))./repmat(Y2, 1, size(x1, 2))); 
g1 = abs(g1);
EVec = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).*repmat(10.^(x2./10), 1, length(Y2));
EoutVec((1:length(x2)), :) = EVec(1:length(x2), :);
if (size(EoutVec,1) > 1) 
    EoutVec = sum(EoutVec);
end
EoutVec = 10.*log(EoutVec)./log(10);
% end timer
toc;
Elapsed time is 0.621410 seconds.

You can see that the time taken by the vectorized approach is less than half of the time taken by the for loop approach.
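
On the second part of the question, one possible simplification of the vectorised version is sketched below. Once `x1` is a column vector, `size(x1, 2)` is 1, so each `repmat(Y2, 1, size(x1, 2))` returns `Y2` unchanged and the comparisons already rely on implicit expansion (R2016b or later). The sketch drops the `ones` and `repmat` calls and computes the repeated product once. The names `a`, `slope`, `below`, `pg` and `EoutLean`, and the input ranges, are my own assumptions for illustration, and I have not measured whether this is faster.

```matlab
% Inputs as N-by-1 column vectors; the ranges are assumed, for illustration only.
N  = 960;
x1 = 20 + 15980 .* rand(N, 1);
x2 = 40 + 10 .* rand(N, 1);

Y1 = 1.75:0.25:39;                        % 1-by-M row vector
Y2 = (10 .^ (Y1 / 21.366) - 1) / 0.004368;

% Terms that depend only on Y2 stay 1-by-M: no ones() or repmat() needed.
a     = 4.*Y2./24.673.*(0.004368.*Y2 + 1);
slope = 0.35.*(a./4.*1000./24.673.*(0.004368.*1000 + 1));

% Column (N-by-1) combined with row (1-by-M) expands implicitly to N-by-M.
below = x1 < Y2;                          % logical mask, N-by-M
p  = a - below.*(slope.*(x2 - 51));       % p2 where x1 < Y2, p1 elsewhere
pg = p.*min(abs((x1 - Y2)./Y2), 4);       % computed once, used twice below
E  = (1 + pg).*exp(-pg).*10.^(x2./10);

% Sum over rows explicitly, then apply 10*log10 (same as 10*log(x)/log(10)).
EoutLean = 10.*log10(sum(E, 1));
```

Because `a` and `slope` depend only on `Y2`, they could also be computed once outside a function that is called many times. Any speed-up is worth confirming with `timeit` on the target machine and release.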

4 Comments (2 earlier comments not shown)

Dyuman Joshi 2023-11-22

Edited: Dyuman Joshi 2023-11-22

Using timeit -

FYI: timeit() calls the specified function many times and returns the median of the time measurements.

% Input vectors
x1 = rand(1, 10000);
x2 = rand(1, 10000);
F1 = @(a, b) forLoop(a, b);
F2 = @(a, b) vectorized(a, b);
f1 = @() F1(x1, x2);
f2 = @() F2(x1, x2);
%Check whether the outputs are equal or not
isequal(f1(), f2())
ans = logical
   1
fprintf('Time taken by the for loop method is %f seconds', timeit(f1))
Time taken by the for loop method is 0.064619 seconds
fprintf('Time taken by the vectorized method is %f seconds', timeit(f2))
Time taken by the vectorized method is 0.021920 seconds

As you can see from the results here, the vectorized method is roughly 3x faster than the for loop method (0.064619 s / 0.021920 s ≈ 2.9).

%% Function definitions
function EoutLoop = forLoop(x1, x2)
    Y1 = 1.75:0.25:39;
    Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
    EoutLoop = zeros(length(x2), length(Y1)); 
    for i=1:length(Y1)
        p1 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1).*ones(1, length(x2));
        p2 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1) - 0.35.*(4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*(x2 - 51);
        p3  = p1.*(x1 >= Y2(i)) + p2 .* (x1 < Y2(i)); 
        g1 = (x1 - Y2(i))./Y2(i); 
        g1 = abs(g1);
        EiLoop = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).* 10.^(x2./10);
        EoutLoop((1:length(x2)), i) = EiLoop(1:length(x2));
    end
    if (size(EoutLoop,1) > 1) 
        EoutLoop = sum(EoutLoop);
    end
    EoutLoop = 10 .* log(EoutLoop) ./ log(10);
end
%Note that you don't need to preallocate for a vectorized approach
function EoutVec = vectorized(x1, x2)
    % transpose input vector for vectorised version
    x2 = x2.';
    x1 = x1.';
    
    Y1 = 1.75:0.25:39;
    Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
    p1 = 4.*Y2./24.673.*(0.004368.*Y2 + 1).*ones(length(x2), length(Y2));
    p2 = 4.*Y2./24.673.*(0.004368.*Y2 + 1) - 0.35.*(4.*Y2./24.673.*(0.004368.*Y2 + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*repmat((x2 - 51), 1, length(Y2));
    p3  = p1.*(x1 >= repmat(Y2, 1, size(x1, 2))) + p2 .* (x1 < repmat(Y2, 1, size(x1, 2))); 
    g1 = ((x1 - repmat(Y2, 1, size(x1, 2)))./repmat(Y2, 1, size(x1, 2))); 
    g1 = abs(g1);
    EVec = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).*repmat(10.^(x2./10), 1, length(Y2));
    EoutVec((1:length(x2)), :) = EVec(1:length(x2), :);
    if (size(EoutVec,1) > 1) 
        EoutVec = sum(EoutVec);
    end
    EoutVec = 10.*log(EoutVec)./log(10);
end
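
One caveat on the isequal check, offered as my own reading and not something stated in the thread: isequal requires bit-identical outputs, so a refactored version that differs only in rounding would fail it. Also, with rand inputs in [0, 1] the argument of exp appears to be so negative that every term underflows to zero and both functions return -Inf throughout, in which case the check passes without really comparing the arithmetic. A tolerance-based comparison that also accepts matching infinities might look like this; the test values and the tolerance are hypothetical.

```matlab
% Two mathematically identical forms that may differ in the last bits.
v = [0.5, 2, 1e3, 0];              % hypothetical test values (0 gives -Inf)
A = 10 .* log(v) ./ log(10);       % form used in the thread
B = 10 .* log10(v);                % equivalent form

tol = 1e-12;                       % assumed absolute tolerance

% Elements match if they are identical (covers -Inf == -Inf)
% or if they differ by no more than tol.
match    = (A == B) | (abs(A - B) <= tol);
allMatch = all(match);

fprintf('All elements match: %d\n', allMatch);
```

The explicit `A == B` term is needed because `-Inf - (-Inf)` is NaN, which would fail the tolerance test on its own.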

Michael 2023-11-22

Thanks, I re-wrote the test code as you suggested, and over 10000 runs the vectorised version took about 30% of the time taken by the loop (the input vectors are fixed length for the application).

% Input vectors
x1 = rand(1, 960);
x2 = rand(1, 960);
x1T = x1.';
x2T = x2.';
totalRuns = 10000;
%% Looped version
loopTime = 0;
for runs = 1:totalRuns
    % start timer
    tic
    Y1 = 1.75:0.25:39;
    Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
    EoutLoop = zeros(length(x2), length(Y1)); 
    
    for i=1:length(Y1)
    
        p1 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1).*ones(1, length(x2));
        p2 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1) - 0.35.*(4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*(x2 - 51);
        p3  = p1.*(x1 >= Y2(i)) + p2 .* (x1 < Y2(i)); 
        g1 = (x1 - Y2(i))./Y2(i); 
    
        g1 = abs(g1);
        EiLoop = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).* 10.^(x2./10);
    
        EoutLoop((1:length(x2)), i) = EiLoop(1:length(x2));
    end
    
    if (size(EoutLoop,1) > 1) 
        EoutLoop = sum(EoutLoop);
    end
    
    EoutLoop = 10 .* log(EoutLoop) ./ log(10);
    
    % append time
    loopTime = loopTime + toc;
end
loopTime = loopTime/totalRuns;
disp(num2str(loopTime))
%% Vectorised version
vecTime = 0;
for runs = 1:totalRuns
    % start timer
    tic
    Y1 = 1.75:0.25:39;
    Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
    EoutVec = zeros(length(x2T), length(Y1)); 
    
    p1 = 4.*Y2./24.673.*(0.004368.*Y2 + 1).*ones(length(x2T), length(Y2));
    p2 = 4.*Y2./24.673.*(0.004368.*Y2 + 1) - 0.35.*(4.*Y2./24.673.*(0.004368.*Y2 + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*repmat((x2T - 51), 1, length(Y2));
    p3  = p1.*(x1T >= repmat(Y2, 1, size(x1T, 2))) + p2 .* (x1T < repmat(Y2, 1, size(x1T, 2))); 
    g1 = ((x1T - repmat(Y2, 1, size(x1T, 2)))./repmat(Y2, 1, size(x1T, 2))); 
    
    g1 = abs(g1);
    EVec = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).*repmat(10.^(x2T./10), 1, length(Y2));
    
    EoutVec((1:length(x2T)), :) = EVec(1:length(x2T), :);
    
    
    if (size(EoutVec,1) > 1) 
        EoutVec = sum(EoutVec);
    end
    
    EoutVec = 10.*log(EoutVec)./log(10);
    vecTime = vecTime + toc;
end
vecTime = vecTime/totalRuns;
disp(num2str(vecTime))

Thanks for the info. I did use Profiler on the full programme, which is how I identified the bottleneck for further testing and optimisation!

Dyuman Joshi 2023-11-22

You are welcome!

It's good to know that you are utilizing the Profiler, it is an extremely helpful tool!
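
For readers who have not used the Profiler from the command line, a minimal sketch is shown below. The workload is hypothetical and merely stands in for the full programme. `profile on`, `profile off` and `profile('info')` are the standard commands, and `profile viewer` (commented out here) opens the interactive report.

```matlab
profile on                              % start collecting timing data

x = rand(1, 960);                       % hypothetical workload
for k = 1:200
    y = sum(exp(-x) .* 10.^(x ./ 10));
end

profile off                             % stop collecting
info = profile('info');                 % results as a struct

% info.FunctionTable holds one entry per profiled function.
fprintf('Profiled %d function(s)\n', numel(info.FunctionTable));

% profile viewer                        % uncomment to open the report UI
```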
