Parfor loop with mex-file call crashes all workers on one computer, but runs fine on others
Hello!
We have code that runs in either serial or parallel mode. In the part under inspection, one of our mex-files is run either on the complete set of data (serial operation) or on a part of the data determined by the current number of workers (parallel operation). In serial mode the code works fine, but in parallel mode on a 6-core computer (with 2, 4 or 6 workers) the workers crash with messages like:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining
workers.
> In distcomp.remoteparfor/handleIntervalErrorResult (line 240)
In distcomp.remoteparfor/getCompleteIntervals (line 387)
In parallel_function>distributed_execution (line 745)
In parallel_function (line 577)
In foo3 (line 138)
In foo2 (line 142)
In foo1 (line 61)
In foo (line 166)
A little later sometimes we get:
Error using distcomp.remoteparfor/rebuildParforController (line 194)
All workers aborted during execution of the parfor loop.
Error in distcomp.remoteparfor/handleIntervalErrorResult (line 253)
obj.rebuildParforController();
Error in distcomp.remoteparfor/getCompleteIntervals (line 387)
[r, err] = obj.handleIntervalErrorResult(r);
...
The client lost connection to worker 3. This might be due to network problems, or the interactive communicating job might
have errored.
Warning: 4 worker(s) crashed while executing code in the current parallel pool. MATLAB will attempt to run the code again
on the remaining workers of the pool. View the crash dump files to determine what caused the workers to crash.
The crash dumps don't say a lot, but conclude with:
This error was detected while a MEX-file was running. If the MEX-file
is not an official MathWorks function, please examine its source code
for errors. Please consult the External Interfaces Guide for information
on debugging MEX-files.
On three other computers the code works fine in both serial and parallel mode: two computers with 6-core CPUs and a notebook with a 2-core CPU. It is possible to create pools of different sizes, and the resulting output is always as expected. I have tried the code on all computers using the same number of workers (4) where possible.
This leads me to believe the mex-file is correct.
I am at a loss concerning what to try next, and would appreciate any hints on how to move forward.
7 comments
Christopher Grose
2019-12-3
Edited: Christopher Grose
2019-12-3
I'm having this problem as well.
Could it occur because of uninitialized variables in the mex C function (I don't think I have any uninitialized pointers)?
For example, my C code looks something like
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /* DECLARATIONS, INPUTS */
    double *Gphase = mxGetPr(prhs[0]);
    double *dGdC = mxGetPr(prhs[2]);
    double *StrainEM = mxGetPr(prhs[4]);
    double R = mxGetScalar(prhs[5]);
    int32_t p, c, ci, j;
    for (j = 0; j < 1000; j++) {
        somefunc(Gphase, dGdC, StrainEM, R, p, c, ci);
    }
}
Could the p, c, ci, j variables be screwing things up?
On a 28-core CPU I quickly lose half my workers, followed by a random but gradual elimination of further workers.
I get the error
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
The worker that takes over completes the task without problems, and the functions are all fully deterministic, so the workers must be dying for some other reason.
Jan
2019-12-4
It depends on what happens inside somefunc(). The output of mxGetPr() should be treated as const pointer, so do you modify the contents? p, c, and ci are declared, but not initialized - do you use them correctly? Maybe "something like" conceals the actual problem. Please post the relevant part of the real code.
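To make the two points above concrete, here is a minimal plain-C sketch. It deliberately leaves out mex.h so that it compiles standalone, and it is not the real code: `somefunc_sketch` and `run_on_copy` are hypothetical names, and the arithmetic inside is a placeholder, since the actual somefunc() was not posted. It only illustrates taking the input through a pointer to const, writing results to a separate buffer, and giving p, c and ci defined values before they are passed on.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for somefunc(); the real one was not posted.
 * The input is declared const, so the compiler rejects writes to it. */
static void somefunc_sketch(const double *in, double *out, int32_t n,
                            double R, int32_t p, int32_t c, int32_t ci)
{
    int32_t j = 0;
    assert(in != NULL && out != NULL && n > 0);
    for (j = 0; j < n; j++) {
        /* p, c and ci are well defined here because the caller set them. */
        out[j] = R * in[j] + (double)(p + c + ci);
    }
}

/* Returns a freshly allocated result, or NULL on bad input or a failed
 * allocation. The caller frees it. */
double *run_on_copy(const double *in, int32_t n, double R)
{
    int32_t p = 0, c = 0, ci = 0;   /* initialized before use */
    double *out = NULL;

    if (in == NULL || n <= 0) {
        return NULL;
    }
    out = malloc((size_t)n * sizeof *out);
    if (out == NULL) {
        return NULL;
    }
    somefunc_sketch(in, out, n, R, p, c, ci);
    return out;
}
```

In an actual MEX-file the output buffer would normally come from mxCreateDoubleMatrix rather than malloc, and it is worth checking nrhs and mxIsDouble before calling mxGetPr on each input. Reading an uninitialized variable can behave differently from machine to machine, so code that happens to work on three computers may still be at fault.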
Answers (0)