Loops in parfor are unexpectedly slow in very simple code
Hi,
I'm struggling to understand why a code that I wrote scales so poorly when run in parallel using parpool.
The operation that I want to run in parallel is in the function unitLinear
function unitLinear(L,N)
% Decoy workload: repeatedly multiply two LxLxL arrays element-wise.
a = rand(L,L,L);
b = rand(L,L,L);
for i = 1:N
    for j = 1:N
        c = a.*b; % result is discarded; we only care about execution time
    end
end
end
which does nothing useful; it is just a decoy to measure execution performance. (If you are curious, it models the first step of a Principal Component Analysis of a set of N volumes, each with LxLxL pixels, by computing all pairwise correlations.)
My approach to run this function in parallel is with the following script testUnitLinear:
nTasks = 40; % number of tasks to be executed in parallel
L = 128; % cube sidelength in pixels (inside testing units)
N = 100; % length of interior loop inside testing unit
fprintf('Results for L:%d N:%d \n',L,N);
% test apart
disp('Computing testing unit in single core');
tInitialUnit= clock();
unitLinear(L,N);
timeUnitSingle= etime(clock(),tInitialUnit);
fprintf('Testing unit in single core: %5.2f \n',timeUnitSingle);
tUnitArray = zeros(nTasks,1); % to store the time seen inside the loop
%tUnitArray = distributed(tUnitArray);
t1 = clock();
parfor i = 1:nTasks
    tInitialUnit = clock();
    unitLinear(L,N);
    timeUnit = etime(clock(),tInitialUnit);
    tUnitArray(i) = timeUnit;
    fprintf('Testing unit %d time %5.2f \n',i,timeUnit);
end
tTotal = etime(clock(),t1);
fprintf('Total time: %5.2f \n',tTotal);
fprintf('Sum process time: %5.2f \n',sum(tUnitArray));
fprintf('Average process time: %5.2f \n',sum(tUnitArray)/nTasks);
fprintf('Unit in single core: %5.2f \n',timeUnitSingle);
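(As an aside, tic/toc would probably be the more idiomatic way to take the same time measurements; this is just a rough sketch, not the code I actually ran:)
tAll = tic;                  % overall wall-clock timer, started on the client
parfor i = 1:nTasks
    tLoop = tic;             % per-iteration timer, created and read on the same worker
    unitLinear(L,N);
    tUnitArray(i) = toc(tLoop);
end
tTotal = toc(tAll);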
When run outside the parfor, the first execution of unitLinear took about 10 seconds on my system... and inside the parfor loop (using a parpool opened with the local profile), each execution was indeed reporting about 10 seconds. So far so good. Since I was working with an open pool of 16 workers (as seen here)
>> a = gcp
a =
Pool with properties:
Connected: true
NumWorkers: 16
Cluster: local
AttachedFiles: {}
AutoAddClientPath: true
IdleTimeout: 30 minutes
SpmdEnabled: false
.... I was expecting that the execution of the parfor loop would amount to (10 seconds per task X 40 tasks) / 16 workers = approx 25 seconds. However, the wall-clock time was around 400 seconds, as if no parallelization were taking place at all! Even though htop was reporting all requested cores working (I have no other processes running on this machine)... and the fans were in fact busy as hell.
This is something that I didn't expect. I know that a clean 16x speedup would be asking too much, but a speedup of 1x on 16 local cores is too unexpected, as the parfor loop couldn't be simpler. No files, no shared variables... nothing I can think of... or am I missing something too evident? My main problem is that I cannot tell whether this is somehow expected behavior or a symptom of something going terribly wrong on my system...
Any help is welcome!
Here are the results of executing the code:
>> tryUnitLinear
Results for L:128 N:100
Computing testing unit in single core
Testing unit in single core: 9.83
Testing unit 40 time 9.77
Testing unit 39 time 9.68
Testing unit 38 time 9.47
Testing unit 37 time 9.57
Testing unit 36 time 9.83
Testing unit 35 time 9.67
Testing unit 34 time 10.01
Testing unit 33 time 9.62
Testing unit 32 time 10.54
Testing unit 31 time 10.40
Testing unit 30 time 11.69
Testing unit 29 time 9.54
Testing unit 28 time 9.51
Testing unit 27 time 9.59
Testing unit 26 time 9.80
Testing unit 25 time 10.23
Testing unit 24 time 10.14
Testing unit 23 time 10.17
Testing unit 22 time 9.37
Testing unit 21 time 9.33
Testing unit 20 time 10.05
Testing unit 19 time 11.25
Testing unit 18 time 10.37
Testing unit 17 time 9.73
Testing unit 16 time 9.95
Testing unit 15 time 10.30
Testing unit 14 time 9.41
Testing unit 13 time 10.47
Testing unit 12 time 10.27
Testing unit 11 time 9.52
Testing unit 10 time 9.67
Testing unit 9 time 9.73
Testing unit 8 time 12.17
Testing unit 7 time 9.96
Testing unit 6 time 10.17
Testing unit 5 time 9.81
Testing unit 4 time 9.92
Testing unit 3 time 9.33
Testing unit 2 time 9.35
Testing unit 1 time 9.70
Total time: 469.00
Sum process time: 399.08
Average process time: 9.98
Unit in single core: 9.83
(By the way, I find it rather weird that the parfor visits i in exactly the reverse order of the integers -I was expecting a totally random access pattern- but I cannot tell whether it has any relationship with the problem I described.)
thanks in advance!
Daniel
Accepted Answer
Jacob Wood
2020-2-18
MATLAB actually multithreads element-wise multiplication, so all available cores are already in use in the "single core" case, and the parfor implementation therefore gives no additional performance. See this link for more information:
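One way to check this (a minimal sketch; it assumes maxNumCompThreads is available on your release and that unitLinear is on the path) is to cap the MATLAB client at a single computational thread and re-time the baseline:
nThreadsOld = maxNumCompThreads(1); % force single-threaded element-wise ops; returns the previous setting
tSingle = tic;
unitLinear(128,100);                % same decoy workload as in the question
fprintf('Truly single-threaded unit: %5.2f s\n', toc(tSingle));
maxNumCompThreads(nThreadsOld);     % restore the original thread count
If implicit multithreading is the explanation, this single-threaded time should come out substantially longer than the ~10 s baseline reported above, since that baseline was already spreading the work over all cores.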