When running my parfor loop on a remote cluster (16 c5.xlarge workers with 2 cores each, plus a dedicated m5.xlarge head node, also 2 cores) I get the following error:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
> In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 395)
In parallel_function>distributed_execution (line 746)
In parallel_function (line 578)
In FUN_CLUSTER_FORECASTING (line 54)
In parallel.internal.cluster.executeFunction (line 29)
In parallel.internal.evaluator.evaluateWithNoErrors (line 14)
In parallel.internal.evaluator/MJSStreamingEvaluator/evaluate (line 40)
In dctEvaluateTask>iEvaluateTask/nEvaluateTask (line 354)
In dctEvaluateTask>iEvaluateTask (line 175)
In dctEvaluateTask (line 81)
In distcomp_evaluate_task>iDoTask (line 152)
In distcomp_evaluate_task (line 74)
In distcomp_evaluate_task_mvm (line 39)
Sending a stop signal to all the labs...
Y is of size 226 × 440 when the error occurs. The parfor loop runs without problems on smaller specifications (it does not fail when Y is of size 226 × 120).
A simplified version of the parfor loop:
forecast = zeros(T-T_thres+1,h,length(series_to_eval),1);
parfor ij = 1:length(irep)
    % irep, ii, h, and series_to_eval are defined outside the loop;
    % Y is broadcast in full to every worker
    fun = BCTRVAR(Y(1:irep(ij),:), h, series_to_eval);
    forecast(ij,ii,:,:) = fun(:,ii,series_to_eval);
end
I am not sure whether it is related, but I also get a warning about the variable Y:
'The entire array or structure Y is a broadcast variable. This might result in unnecessary communication overhead.'
Could that communication overhead be the cause of the failure?
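In case the repeated broadcast of Y matters, one mitigation I have seen suggested is to wrap Y in a parallel.pool.Constant, so it is transferred to each worker once instead of being re-sent with every batch of parfor iterations. A minimal sketch, assuming the same Y, irep, h, series_to_eval, BCTRVAR, and an enclosing loop that defines ii, as in my loop above:

```matlab
% Copy Y to every worker once; workers then read Yc.Value locally
% instead of receiving Y as a broadcast variable on each batch.
Yc = parallel.pool.Constant(Y);

forecast = zeros(T-T_thres+1,h,length(series_to_eval),1);
parfor ij = 1:length(irep)
    fun = BCTRVAR(Yc.Value(1:irep(ij),:), h, series_to_eval);
    forecast(ij,ii,:,:) = fun(:,ii,series_to_eval);
end
```

I don't know if this would fix the worker abort itself, but it should at least silence the broadcast-variable warning and reduce communication.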