When running my parfor loop on a remote cluster (16 c5.xlarge workers with 2 cores each, plus a dedicated m5.xlarge head node, also 2 cores) I get the following error:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
> In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 395)
In parallel_function>distributed_execution (line 746)
In parallel_function (line 578)
In FUN_CLUSTER_FORECASTING (line 54)
In parallel.internal.cluster.executeFunction (line 29)
In parallel.internal.evaluator.evaluateWithNoErrors (line 14)
In parallel.internal.evaluator/MJSStreamingEvaluator/evaluate (line 40)
In dctEvaluateTask>iEvaluateTask/nEvaluateTask (line 354)
In dctEvaluateTask>iEvaluateTask (line 175)
In dctEvaluateTask (line 81)
In distcomp_evaluate_task>iDoTask (line 152)
In distcomp_evaluate_task (line 74)
In distcomp_evaluate_task_mvm (line 39)
Sending a stop signal to all the labs...
Y is of size 226 × 440 when the error occurs. The parfor loop runs without problems on smaller specifications (it does not fail when Y is of size 226 × 120).
A simplified version of the parfor loop:
forecast = zeros(T-T_thres+1,h,length(series_to_eval),1);
parfor ij = 1:length(irep)
    % irep, ii, h, and series_to_eval are defined outside the loop;
    % Y is broadcast in full to every worker
    fun = BCTRVAR(Y(1:irep(ij),:), h, series_to_eval);
    forecast(ij,ii,:,:) = fun(:,ii,series_to_eval);
end
I am not sure whether it is related, but I also get a warning about the variable Y:
'The entire array or structure Y is a broadcast variable. This might result in unnecessary communication overhead.'
Could that communication overhead be the cause of the failure?
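In case the repeated broadcast of Y matters, one mitigation I have seen suggested is to wrap Y in a parallel.pool.Constant, so it is transferred to each worker once instead of being re-sent with every batch of parfor iterations. A minimal sketch, assuming the same Y, irep, h, series_to_eval, BCTRVAR, and an enclosing loop that defines ii, as in my loop above:

```matlab
% Copy Y to every worker once; workers then read Yc.Value locally
% instead of receiving Y as a broadcast variable on each batch.
Yc = parallel.pool.Constant(Y);

forecast = zeros(T-T_thres+1,h,length(series_to_eval),1);
parfor ij = 1:length(irep)
    fun = BCTRVAR(Yc.Value(1:irep(ij),:), h, series_to_eval);
    forecast(ij,ii,:,:) = fun(:,ii,series_to_eval);
end
```

I don't know if this would fix the worker abort itself, but it should at least silence the broadcast-variable warning and reduce communication.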