Vectorized code slower than loops

121 views (last 30 days)
Alessandro
Alessandro 2024-12-4,23:56
Commented: Joss Knight, 2024-12-7, 16:01
I am trying to speed up the code reported below (it is an MWE; the actual code is more complicated). I vectorized the inner loop (see Method 2) and, while the vectorized code is correct, it is slower than the original code with a nested loop.
I would be grateful for any ideas on how to improve the performance, given that my attempt at vectorization made it worse.
clear
clc
close all
n_d = 51;
n_a = 601;
n_aprime = n_a;
n_z = 7;
beta = 0.98;
rng("default")
% Fake data
RetMat = rand(n_d,n_aprime,n_a,n_z);
V_next = rand(n_aprime,n_z);
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
RetMat = reshape(RetMat,[n_d*n_aprime,n_a,n_z]); %(d*a',a,z)
disp('Method 1')
Method 1
tic
% Initialize output
V1 = zeros(n_a,n_z);
Policy1 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    for a_c=1:n_a
        entireRHS = RetMat(:,a_c,z_c)+beta*EVz_big; %(d*a',1)
        [max_val,max_ind] = max(entireRHS); %scalar
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V1(a_c,z_c) = max_val;
        Policy1(a_c,z_c)=max_ind;
    end
end
Policy1_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
toc
Elapsed time is 0.321922 seconds.
disp('Method 2 (vectorized)')
Method 2 (vectorized)
tic
V2 = zeros(n_a,n_z);
Policy2 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    entireRHS = RetMat(:,:,z_c)+beta*EVz_big; %(d*a',a)
    [max_val,max_ind] = max(entireRHS,[],1); %(1,a)
    V2(:,z_c) = max_val;
    Policy2(:,z_c)=max_ind;
end
Policy2_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
toc
Elapsed time is 0.488226 seconds.
err = max(abs(V2-V1),[],"all");
disp(err)
0
err = max(abs(Policy2_all-Policy1_all),[],"all");
disp(err)
0

Accepted Answer

Epsilon
Epsilon 2024-12-5,15:16
Hi Alessandro,
Method 1 handles a single column vector at a time and is faster than Method 2, which computes entireRHS all at once by adding the matrix RetMat(:,:,z_c), of size 30651 x 601, to the expanded vector EVz_big. Vectorization is generally faster because it reduces loop overhead, but in this case the sheer size of the temporary data being created and processed can slow the computation down.
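As a rough back-of-the-envelope check (reusing the sizes from the MWE), the temporary created on every z iteration by Method 2 is on the order of 150 MB:
tmpBytes = (n_d*n_aprime) * n_a * 8;   % entireRHS is a (n_d*n_aprime)-by-n_a matrix of doubles
fprintf("entireRHS temporary: %.0f MB per z slice\n", tmpBytes/1e6)   % about 147 MB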
To optimize further, try the method below:
clear
clc
close all
n_d = 51;
n_a = 601;
n_aprime = n_a;
n_z = 7;
beta = 0.98;
rng("default")
% Fake data
RetMat = rand(n_d,n_aprime,n_a,n_z);
V_next = rand(n_aprime,n_z);
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
RetMat = reshape(RetMat,[n_d*n_aprime,n_a,n_z]); %(d*a',a,z)
disp('Method 1')
Method 1
tic
% Initialize output
V1 = zeros(n_a,n_z);
Policy1 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    for a_c=1:n_a
        entireRHS = RetMat(:,a_c,z_c)+beta*EVz_big; %(d*a',1)
        [max_val,max_ind] = max(entireRHS); %scalar
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V1(a_c,z_c) = max_val;
        Policy1(a_c,z_c)=max_ind;
    end
end
Policy1_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
toc
Elapsed time is 0.392334 seconds.
disp('Method 2 (suggested)')
Method 2 (suggested)
tic
% Initialize output
V2 = zeros(n_a, n_z);
Policy2 = ones(n_a, n_z);
Policy2_all = zeros(2, n_a, n_z);
% Precompute
RetMat_reshaped = reshape(RetMat, [n_d * n_aprime, n_a, n_z]);
EV = V_next * pi_z';
for z_c = 1:n_z
    EVz_big = repelem(EV(:, z_c), n_d);
    for a_c = 1:n_a
        entireRHS = RetMat_reshaped(:, a_c, z_c) + beta * EVz_big;
        [max_val, max_ind] = max(entireRHS);
        V2(a_c, z_c) = max_val;
        Policy2(a_c, z_c) = max_ind;
    end
end
[d_opt, aprime_opt] = ind2sub([n_d, n_aprime], Policy2);
Policy2_all(1, :, :) = d_opt;
Policy2_all(2, :, :) = aprime_opt;
toc
Elapsed time is 0.310808 seconds.
err = max(abs(V2-V1),[],"all");
disp(err)
0
err = max(abs(Policy2_all-Policy1_all),[],"all");
disp(err)
0
Hope it helps!
1 Comment
Alessandro
Alessandro 2024-12-5,18:18
Thanks for your help, it is a bit faster.
One small comment: I think you don't need the line
RetMat_reshaped = reshape(RetMat, [n_d * n_aprime, n_a, n_z]);
since the reshape is already done at the very beginning (and I exclude it from the timing because it is part of the data generation). In your code you are doing it twice.


More Answers (5)

Steven Lord
Steven Lord 2024-12-5,16:52
Vectorization is one tool in your MATLAB toolbox, but so is the for loop. Each has its own pros and cons.
I think Mike Croucher is planning on writing a blog post about this, but at the risk of stealing his thunder consider the analogy of a bookshelf. If you want to move all the books from one bookshelf to another (or to packing boxes if you're moving), there are multiple approaches you could take.
If there aren't very many books and they're all light (say paperbacks) you could use a vectorized approach, carrying the whole pile at once from the first bookshelf to the next. This could save time by eliminating some or all of the round-trips between the two bookshelves.
If there are a lot of books, or some/all of them are heavy (like my bookshelf of roleplaying books, probably about 250 hard-cover books) the vectorized approach may not be possible or may be less feasible than a semi-vectorized or iterative approach. I'm no Hafþór Júlíus Björnsson, able to lift half a ton of weight. So instead of transporting all the books at once, I carried one at a time (for loop) or a small group at a time (semi-vectorized approach.)
In this case, your vectorized approach tells MATLAB "carry all the books on one shelf of the bookshelf at once" and MATLAB has to take its time each trip carrying hundreds of pounds of books. In the iterative approach it's carrying one book at a time and the round-trip travel time between bookshelves is small enough that it actually saves time over staggering under the weight of those hundreds of pounds of books.
I'm not saying abandon vectorization, of course. But just realize that neither vectorization nor for loops are going to be the best in every single scenario you encounter.
1 Comment
Alessandro
Alessandro 2024-12-5,18:20
Thanks for the very nice explanation. Looking forward to Mike Croucher's blog post :)



John D'Errico
John D'Errico 2024-12-5,17:29
Edited: John D'Errico, 2024-12-5, 17:32
Read the response by @Steven Lord again, in case you missed it. (Actually, that is always good advice for any of his responses.)
Here is my take. Vectorization is a good thing much of the time. But the way I think of most vectorized code is that it employs internal, implicit loops. As such, they are going to be quite fast, more efficient than explicit loops written in MATLAB. So you CAN gain there. (Over the years, the MATLAB parser has grown greatly in capability, making some of those explicitly written loops quite efficient. Implicit loops will pretty much always win, though.) But remember that the cost of many vectorizations is in memory. They may force your code to generate a huge number of potential solutions, and then throw away almost all of them.
That trade-off between implicit, fast loops and memory is important, and you need to understand it.
For example, there are many ways we can see vectorization employed. A simple matrix multiply is one. Could I write this:
A = rand(1000);
B = rand(1000);
C = A*B;
Surely you would see that as a vectorized form of the alternative of triply nested loops? In this case, there is no additional work being done. The same adds and multiplies are being done in either case. It is all in the form of implicit loops, and all done by the BLAS. A great form of vectorization.
Another common instance where we might see vectorization is to ask for the set of 20-bit binary numbers with exactly 5 bits set to 1. The simple vectorized solution might be to generate all 20-bit numbers, something like dec2bin(0:2^20-1), and then discard those with an unacceptable number of bits. And that would work nicely enough, unless you decided to try the same thing with 100-bit or 1000-bit numbers.
That is, vectorization can fail miserably when it is abused. You as the user need to understand the difference, and how the vectorization was written, what it is doing internally, and when it will fail.
Finally, I've seen examples of solutions to problems that were described as vectorization, but really fall into neither of the above categories. Instead, these are cases where an intelligent application of mathematics solves a problem that would otherwise have been attacked by brute force. Feel free to call that not vectorization.
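To make the bit-counting example concrete, here is a minimal sketch for n = 20 and k = 5 (nchoosek is just one of many possible "mathematical" alternatives) contrasting brute-force enumeration with building only the valid patterns:
n = 20; k = 5;
% Brute force: enumerate all 2^n binary numbers, then throw away almost all of them.
allBits = dec2bin(0:2^n-1) - '0';          % 2^n-by-n numeric matrix (about a million rows)
brute   = allBits(sum(allBits,2) == k, :); % keeps only 15504 of the 1048576 rows
% Direct construction: generate only the valid patterns.
pos    = nchoosek(1:n, k);                 % each row lists the positions of the k ones
direct = zeros(size(pos,1), n);
direct(sub2ind(size(direct), repmat((1:size(pos,1))', 1, k), pos)) = 1;
Both give the same 15504 patterns, but the brute-force version materializes all 2^n candidates first and will not survive n = 100.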

Mike Croucher
Mike Croucher 2024-12-5,22:50
Edited: Mike Croucher, 2024-12-5, 22:58
I made a couple more modifications and came up with a double for-loop that is fastest. One thing that I'd like to point out here is that all of your versions are 'vectorised' in some sense. For example, in your original you have
entireRHS = RetMat(:,a_c,z_c)+ beta * EVz_big;
EVz_big is a vector, so that line has got vectorised multiplication and addition. Every one of your examples has got a vectorised component and a loop component. What we've all been doing in this thread as we try to improve speed is change how we chop up the vectors. With your original 'Method 2 (vectorised)' you had this line
entireRHS = RetMat(:,:,z_c)+beta*EVz_big;
and, as suggested by @Epsilon, that line is creating some large temporary matrices. It takes time to do this and you are not doing enough compute on them to justify the bother.
n_d = 51;
n_a = 601;
n_aprime = n_a;
n_z = 7;
beta = 0.98;
rng("default")
% Fake data
RetMat = rand(n_d,n_aprime,n_a,n_z);
V_next = rand(n_aprime,n_z);
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
RetMat = reshape(RetMat,[n_d*n_aprime,n_a,n_z]); %(d*a',a,z)
tic
% Initialize output
V1 = zeros(n_a,n_z);
Policy1 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    for a_c=1:n_a
        entireRHS = RetMat(:,a_c,z_c)+beta*EVz_big; %(d*a',1)
        [max_val,max_ind] = max(entireRHS); %scalar
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V1(a_c,z_c) = max_val;
        Policy1(a_c,z_c)=max_ind;
    end
end
Policy1_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
method1 = toc;
tic
V2 = zeros(n_a,n_z);
Policy2 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    entireRHS = RetMat(:,:,z_c)+beta*EVz_big; %(d*a',a)
    [max_val,max_ind] = max(entireRHS,[],1); %(1,a)
    V2(:,z_c) = max_val;
    Policy2(:,z_c)=max_ind;
end
Policy2_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
method2 = toc;
tic
% Initialize output
V2 = zeros(n_a, n_z);
Policy2 = ones(n_a, n_z);
Policy2_all = zeros(2, n_a, n_z);
% Precompute
%RetMat_reshaped = reshape(RetMat, [n_d * n_aprime, n_a, n_z]);
EV = V_next * pi_z';
for z_c = 1:n_z
    EVz_big = repelem(EV(:, z_c), n_d);
    for a_c = 1:n_a
        entireRHS = RetMat(:, a_c, z_c) + beta * EVz_big;
        [max_val, max_ind] = max(entireRHS);
        V2(a_c, z_c) = max_val;
        Policy2(a_c, z_c) = max_ind;
    end
end
[d_opt, aprime_opt] = ind2sub([n_d, n_aprime], Policy2);
Policy2_all(1, :, :) = d_opt;
Policy2_all(2, :, :) = aprime_opt;
method3 = toc;
tic
% Initialize output
V2 = zeros(n_a, n_z);
Policy2 = ones(n_a, n_z);
Policy2_all = zeros(2, n_a, n_z);
% Precompute
%RetMat_reshaped = reshape(RetMat, [n_d * n_aprime, n_a, n_z]);
EV = V_next * pi_z';
for z_c = 1:n_z
    EVz_big = repelem(beta*EV(:, z_c), n_d);
    for a_c = 1:n_a
        entireRHS = RetMat(:, a_c, z_c) + EVz_big;
        [max_val, max_ind] = max(entireRHS);
        V2(a_c, z_c) = max_val;
        Policy2(a_c, z_c) = max_ind;
    end
end
[d_opt, aprime_opt] = ind2sub([n_d, n_aprime], Policy2);
Policy2_all(1, :, :) = d_opt;
Policy2_all(2, :, :) = aprime_opt;
method4 = toc;
tic
% Initialize output
V2 = zeros(n_a,n_z);
Policy2 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(beta*EVz,n_d);
    for a_c=1:n_a
        [max_val,max_ind] = max(RetMat(:,a_c,z_c) + EVz_big); %(d*a',1)
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V2(a_c,z_c) = max_val;
        Policy2(a_c,z_c)=max_ind;
    end
end
Policy2_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
method5 = toc;
fprintf("Method 1 : %f seconds\n",method1);fprintf("Method 2 (vectorized) : %f seconds\n",method2);fprintf("Method 2 (suggested) with Alessandro fix: %f seconds\n",method3);fprintf("Method 2 (suggested) with Croucher mod: %f seconds\n",method4);fprintf("Croucher loop : %f seconds\n",method5)
Method 1 : 0.344052 seconds
Method 2 (vectorized) : 0.568002 seconds
Method 2 (suggested) with Alessandro fix: 0.361932 seconds
Method 2 (suggested) with Croucher mod: 0.333267 seconds
Croucher loop : 0.303311 seconds
So, here my final version wins, but it turns out that this seems to be machine-dependent. I ran this code on my M2 Mac and got the following:
Method 1 : 0.170735 seconds
Method 2 (vectorized) : 0.164271 seconds
Method 2 (suggested) with Alessandro fix: 0.152090 seconds
Method 2 (suggested) with Croucher mod: 0.140408 seconds
Croucher loop : 0.142356 seconds
The final vectorised version is very slightly, but consistently, faster than 'Croucher loop'. The Mac is also faster than any of my other personal machines for this problem.

Mike Croucher
Mike Croucher 2024-12-6,0:13
Another option, if you have the hardware, is to use a GPU. Here, you want to do as much vectorisation as you can. I'm using my desktop now, which has a slower CPU than whatever is used when we run code in the forums, but it has quite a nice GPU (an NVIDIA RTX 3070).
Here's the code
n_d = 51;
n_a = 601;
n_aprime = n_a;
n_z = 7;
beta = 0.98;
rng("default")
% Fake data
RetMat = rand(n_d,n_aprime,n_a,n_z);
V_next = rand(n_aprime,n_z);
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
RetMat = reshape(RetMat,[n_d*n_aprime,n_a,n_z]); %(d*a',a,z)
%% CPU version
tic
% Initialize output
V0 = zeros(n_a,n_z);
Policy1 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(beta*EVz,n_d);
    for a_c=1:n_a
        [max_val,max_ind] = max(RetMat(:,a_c,z_c) + EVz_big); %(d*a',1)
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V0(a_c,z_c) = max_val;
        Policy1(a_c,z_c)=max_ind;
    end
end
Policy1_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
cpuSpeed = toc;
%% GPU version including transfers
dev = gpuDevice();
tic
V_nextGPU = gpuArray(V_next);
pi_zGPU = gpuArray(pi_z);
RetMatGPU = gpuArray(RetMat);
% Initialize output
GPU_V1 = zeros(n_a,n_z,"gpuArray");
Policy1 = ones(n_a,n_z,"gpuArray");
for z_c=1:n_z
    EVz = V_nextGPU*pi_zGPU(z_c,:)';
    EVz_big = repelem(EVz,n_d);
    entireRHS = RetMatGPU(:,:,z_c) + beta*EVz_big;
    [max_val,max_ind] = max(entireRHS,[],1); %(1,a)
    GPU_V1(:,z_c) = max_val;
    Policy1(:,z_c)=max_ind;
end
Policy1_all = zeros(2,n_a,n_z,"gpuArray");
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
wait(dev); % Need to do this to make sure the GPU has finished
gpuWithTransfer = toc;
%% GPU version excluding transfers
dev = gpuDevice();
V_nextGPU = gpuArray(V_next);
pi_zGPU = gpuArray(pi_z);
RetMatGPU = gpuArray(RetMat);
tic
% Initialize output
GPU_V1 = zeros(n_a,n_z,"gpuArray");
Policy1 = ones(n_a,n_z,"gpuArray");
for z_c=1:n_z
    EVz = V_nextGPU*pi_zGPU(z_c,:)'; %(aprime,1)
    EVz_big = repelem(EVz,n_d);
    entireRHS = RetMatGPU(:,:,z_c) + beta*EVz_big;
    [max_val,max_ind] = max(entireRHS,[],1); %(1,a)
    GPU_V1(:,z_c) = max_val;
    Policy1(:,z_c)=max_ind;
end
Policy1_all = zeros(2,n_a,n_z,"gpuArray");
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
wait(dev); % Need to do this to make sure the GPU has finished
gpuNoTransfer = toc;
fprintf("CPU version : %f seconds\n",cpuSpeed)
fprintf("GPU including transfer time to the GPU: %f seconds \n",gpuWithTransfer)
fprintf("GPU assuming data starts on GPU: %f seconds \n",gpuNoTransfer)
err = max(abs(GPU_V1-V0),[],"all");
fprintf("Maximum error is %e\n",err)
On my machine I get
>> speedThingGPU
CPU version : 0.395375 seconds
GPU including transfer time to the GPU: 0.349120 seconds
GPU assuming data starts on GPU: 0.010866 seconds
Maximum error is 2.220446e-16
On my machine, the CPU and GPU versions are about the same speed if you have to include the time taken to transfer the data to the GPU. I do this in the most boneheaded way possible:
V_nextGPU = gpuArray(V_next);
pi_zGPU = gpuArray(pi_z);
RetMatGPU = gpuArray(RetMat);
This is what takes most of the time for this calculation. However, this is being deliberately unfair to the GPU in order to demonstrate the point. You don't include the creation time of those matrices in any of your other versions, and we could easily have created everything directly on the GPU if we wanted. For example,
RetMatGPU = rand(n_d,n_aprime,n_a,n_z,"gpuArray");
Random number generation is much faster on the GPU than on the CPU. If we did include creation time in BOTH the CPU and GPU versions, the GPU would win overall.
So, if you can construct your starting matrices on the GPU, you will probably win. If you have to transfer large matrices to the GPU, things don't look so good.
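As a minimal sketch (assuming Parallel Computing Toolbox, a supported NVIDIA GPU, and the sizes defined above), the creation-time comparison could look like this:
dev = gpuDevice();
tic
RetMatCPU = rand(n_d,n_aprime,n_a,n_z);            % created in host memory
cpuCreate = toc;
tic
RetMatGPU = rand(n_d,n_aprime,n_a,n_z,"gpuArray"); % generated directly on the device, no transfer
wait(dev);                                         % make sure the GPU has finished
gpuCreate = toc;
fprintf("Create on CPU: %f s, create on GPU: %f s\n",cpuCreate,gpuCreate)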
2 Comments
Alessandro
Alessandro 2024-12-6,17:32
Edited: Alessandro, 2024-12-6, 17:32
Thanks a lot for your detailed answer! The method on the GPU works very well on my machine too. An interesting point that I have learnt from your answer and the others' is that on the CPU, vectorizing (i.e., eliminating loops with array-based instructions) does not always improve speed, while on the GPU the more you vectorize the better (subject to total memory limitations, of course). Is it because MATLAB has become good with loops thanks to the JIT (and the JIT happens only on the CPU)?
Joss Knight
Joss Knight 2024-12-7,16:01
The JIT is almost certainly part of the answer, as well as some of the other things mentioned in the thread. The GPU also does some JITting, but it doesn't (yet) know how to vectorize loops. Remember that all of the GPU's performance comes from data parallelism, so all a JIT would be doing is working out how to write your loops as a single vectorized expression. So it's unlikely it would be able to do better than you can manually.



Umar
Umar 2024-12-5,4:01

Hi @Alessandro,

Please see “Recommendations for Performance Improvement” below

1. Avoid Large Temporary Arrays: Instead of using `repelem`, consider avoiding temporary arrays altogether. You can directly compute values without expanding them into larger matrices.

2. Use Logical Indexing: If applicable, replace some array operations with logical indexing or direct computations that avoid unnecessary replication of data.

3. Preallocate Arrays Wisely: Ensure that all arrays are preallocated appropriately without excessive resizing during iterations.

4. Parallel Processing: If your MATLAB version supports it, consider using parallel processing capabilities (like `parfor`) to distribute computations across multiple cores; see the sketch after this list.

5. Profile the Code: Use MATLAB's built-in profiler (`profile on`, `profile viewer`) to identify bottlenecks in your code execution and focus optimization efforts on those areas.

6. Consider Built-in Functions: Some operations might have optimized built-in functions that can perform better than custom implementations.
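A minimal sketch of point 4, assuming Parallel Computing Toolbox and reusing the variables from the MWE (with only n_z = 7 outer iterations and RetMat being broadcast to every worker, this may not pay off at this problem size):

V2 = zeros(n_a,n_z);
Policy2 = zeros(n_a,n_z);
parfor z_c = 1:n_z
    EVz_big = repelem(beta*(V_next*pi_z(z_c,:)'), n_d); % (d*a',1)
    Vcol = zeros(n_a,1);
    Pcol = zeros(n_a,1);
    for a_c = 1:n_a
        [Vcol(a_c), Pcol(a_c)] = max(RetMat(:,a_c,z_c) + EVz_big);
    end
    V2(:,z_c) = Vcol;      % sliced output
    Policy2(:,z_c) = Pcol; % sliced output
end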

Here's a modified version of Method 2 that addresses some of these issues:

disp('Optimized Method 2')
tic
V2 = zeros(n_a, n_z);
Policy2 = zeros(n_a, n_z); % Changed to zeros instead of ones
for z_c = 1:n_z
  EVz = V_next * pi_z(z_c,:)'; 
  % Directly calculate entireRHS without expanding
  for a_c = 1:n_a
      entireRHS = RetMat(:, a_c, z_c) + beta * EVz; 
      [max_val, max_ind] = max(entireRHS); 
      V2(a_c, z_c) = max_val;
      Policy2(a_c, z_c) = max_ind;
  end
end
% Reshape policy outputs as before
Policy2_all = zeros(2, n_a, n_z);
[d_opt, aprime_opt] = ind2sub([n_d, n_aprime], Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
toc

After implementing changes, it's crucial to profile the new version against both original methods to quantify the performance improvements. Also investigate other forms of vectorization, such as implicit expansion (which has largely superseded `bsxfun` since R2016b), that might offer better performance with less memory overhead.
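As a minimal sketch of the implicit-expansion idea applied to the MWE (reshape each (d*a') column to n_d-by-n_aprime and add beta*EVz as a row vector; the column-major linear index returned by max matches the original d*a' ordering, and whether this beats repelem is machine-dependent):

V2 = zeros(n_a,n_z);
Policy2 = ones(n_a,n_z);
for z_c = 1:n_z
    EVz = V_next*pi_z(z_c,:)'; % (n_aprime x 1)
    for a_c = 1:n_a
        M = reshape(RetMat(:,a_c,z_c),n_d,n_aprime) + beta*EVz.'; % implicit expansion, no repelem
        [max_val,max_ind] = max(M(:));                            % linear index = d*a' index
        V2(a_c,z_c) = max_val;
        Policy2(a_c,z_c) = max_ind;
    end
end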

By following these strategies and continually iterating on your approach based on profiling results, you should be able to enhance the performance of your MATLAB code effectively.

2 Comments
Alessandro
Alessandro 2024-12-5,13:18
Edited: Alessandro, 2024-12-5, 13:24
Thanks for your answer, but your code generates an error:
```
Optimized Method 2
Arrays have incompatible sizes for this operation.
Error in idea (line 75)
entireRHS = RetMat(:, a_c, z_c) + beta * EVz;
^^^^^^^^^
Related documentation
```
The reason is that the array EVz has size [n_a,1] while RetMat(:, a_c, z_c) has size [n_d*n_a,1]. Your suggestion is to avoid using repelem, but this is precisely why I was using it: you have to make EVz conformable with RetMat.
By the way, I put a bit of effort into generating a minimal working example to test modifications and improvements to the code. Whenever you come up with a modification, you can run it and see (1) whether it gives an error and (2) whether the results are still correct (the two err variables at the end of the MWE must both be equal to zero).
Umar
Umar 2024-12-6,3:00

Hi @Alessandro ,

To address the issues you're facing with your MATLAB code, I took care of the errors you were seeing and created a full, updated example that incorporates the optimization strategies while ensuring that the array sizes are compatible. The goal was to enhance performance without generating errors. Below, I define all the necessary parameters, using synthetic data to ensure the code is functional and correct.

% Define synthetic data
n_a = 5;       % Number of actions
n_z = 3;       % Number of states
n_d = 4;       % Number of dimensions in RetMat
beta = 0.95;   % Discount factor
% Generate synthetic RetMat
RetMat = rand(n_d * n_a, n_z); % Random returns matrix
pi_z = rand(n_z, n_z);         % Random policy matrix (transition probabilities)
V_next = rand(n_a, n_z);       % Expected value for next period
% Initialize value and policy matrices
V2 = zeros(n_a, n_z);
Policy2 = zeros(n_a, n_z);
% Optimization Process
disp('Optimized Method 2')
tic
for z_c = 1:n_z
    EVz = V_next(:, z_c);  % Keep EVz as [n_a, 1]
    for a_c = 1:n_a
        entireRHS = RetMat((a_c-1)*n_d + (1:n_d), z_c) + beta * EVz(a_c);
        [max_val, max_ind] = max(entireRHS);
        V2(a_c, z_c) = max_val;
        Policy2(a_c, z_c) = max_ind;
    end
end
% Reshape policy outputs
Policy2_all = zeros(2, n_a, n_z);
[d_opt, aprime_opt] = ind2sub([n_d, n_a], Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
toc
% Display results
disp('Value Function:');
disp(V2);
disp('Policy Indices:');
disp(Policy2);

Please see attached.

The following explains the modifications in the code above.

1. Array Size Compatibility: The original issue arose because `EVz` was not conformable with `RetMat`. In this update, I ensured that `EVz` is extracted correctly as a column vector corresponding to the current action `a_c`. The indexing of `RetMat` was adjusted to ensure it aligns properly with the number of dimensions.

2. Avoiding Large Temporary Arrays: By directly computing values without creating unnecessarily large intermediate matrices (as `repelem` would), we minimize memory usage and improve computational efficiency.

3. Logical Indexing: The revised indexing within the loops ensures that we avoid unnecessary data replication.

4. Preallocation: We maintain preallocation of `V2` and `Policy2`, which helps prevent dynamic resizing during iterations.

5. Profiling and Performance Monitoring: After implementing these changes, it is recommended to use MATLAB's profiler (`profile on`, `profile viewer`) to assess performance improvements and identify any remaining bottlenecks.

By following these modifications and employing profiling tools, you should be able to significantly enhance the performance of your MATLAB code while avoiding errors associated with array size mismatches.

