Standalone MATLAB Parallel Server Randomly Fails/Succeeds

I'm trying to get an application to run consistently on a MATLAB Parallel Server, which is set up on a rack of servers in my office that are not connected to the internet. It randomly succeeds or fails, and the failures produce errors like:
"Expected one output from a curly brace or dot indexing expression, but there were 0 results."
or
"Unexpected failure to indicate all intervals added."
I'm using parfor for my parallelization. I have AutoAddClientPath on, but I still need to manually attach MATLAB scripts and data files to the parallel pool to get the application to work. I do not need to manually attach any files when running locally. My MATLAB Parallel Server runs an MJS and has 8 nodes, all running Windows Server.
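For readers following along, the setup described above might look roughly like the sketch below. The profile name and the attached paths are hypothetical placeholders, not values from the original post; the pool size of 64 matches the worker count mentioned later in the question.

```matlab
% Hypothetical cluster profile name and file list; substitute your own.
profileName   = 'MyMJSProfile';
filesToAttach = {'C:\work\myapp\src', 'C:\work\myapp\data\lookup.mat'};

pool = gcp('nocreate');            % reuse an existing pool if one is open
if isempty(pool)
    pool = parpool(profileName, 64, ...
        'AutoAddClientPath', true, ...
        'AttachedFiles', filesToAttach);
end

% Re-send any attached files that changed on the client since the pool opened.
updateAttachedFiles(pool);

% Show what the pool currently lists as attached.
disp(pool.AttachedFiles);
```

Files can also be added to an already running pool with addAttachedFiles(pool, filesToAttach).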
What I've learned so far:
  • This error always involves an object that has been transferred to the parallel pool (usually one I require to be a broadcast variable). I've gotten this error in more than one application, but it always relates to an object copied to the parallel pool.
  • When I run "disp" on the object in question while debugging, it shows an empty array, I think of class double (i.e. [ ]).
  • The failures are inconsistent, but seem to become more likely as the number of workers increases. Fewer than 20 workers always succeeds; 64 workers almost always fails, but has worked at least once.
  • I'm attaching a large number of MATLAB script files and a large amount of data to the parallel pool. The broadcast variables I'm using can also be large.
  • My application ALWAYS works on the local cluster.
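One way to narrow down the empty-object symptom described in the list above is to wrap the large broadcast variable in a parallel.pool.Constant and assert on the worker that the copy is not empty. This is only a sketch: bigTable is a stand-in for the real object, and whether this changes the failure rate on this cluster is unknown.

```matlab
% Stand-in for the large broadcast object (hypothetical data).
bigTable = rand(2000, 50);

% Wrap it once; each worker receives one copy that persists across parfor loops.
C = parallel.pool.Constant(bigTable);

N = 100;
result = zeros(1, N);
parfor i = 1:N
    data = C.Value;
    % Fail loudly, with the iteration number, if the worker-side copy is empty.
    assert(~isempty(data), 'Broadcast data is empty on worker (iteration %d).', i);
    row = mod(i - 1, size(data, 1)) + 1;
    result(i) = sum(data(row, :));
end
```

If the assertion fires, the data arrived empty before any application code touched it, which points at the transfer rather than at the loop body.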
I'm running out of ideas to try to get this up and running, any help would be greatly appreciated.
  2 Comments
Edric Ellis 2021-3-18
It does sound like there is some sort of problem transferring some of your data to the workers. The fact that the failures are more likely with more workers suggests that perhaps it is a memory problem. Are you able to track resource usage on the worker machines while this is happening? When the first error message occurs, is there any error stack showing? (At a guess, this is coming from within the worker-side implementation of parfor when it is trying to unpack the data that has been sent from the client). Finally, it might be worth contacting MathWorks support directly to address this.
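A rough way to check the memory hypothesis from inside MATLAB is to ask each worker for its host's free physical memory. The sketch below assumes a pool is already open (gcp opens one with the default profile otherwise) and relies on the memory function, which is only available on Windows; that fits the Windows Server nodes described in the question.

```matlab
pool = gcp;   % current pool, or a new one with the default profile

spmd
    [~, sysView] = memory;                        % Windows-only function
    freeGB = sysView.PhysicalMemory.Available / 2^30;
    host   = getenv('COMPUTERNAME');              % Windows host name
end

% freeGB and host are Composite objects; index them per worker on the client.
for k = 1:pool.NumWorkers
    fprintf('worker %2d on %-15s: %.1f GB physical memory free\n', ...
        k, host{k}, freeGB{k});
end
```

Running this immediately before the failing parfor loop, and again after a failure, would show whether any node is close to exhausting memory.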
Nathan Ellingson 2021-4-7
Thanks Edric,
There are stacks for both errors. The stack for the first error is entirely in my code, not MathWorks code, but most of the stack for the second error is MathWorks code. This is the MathWorks part of that stack:
Error using distcomp.remoteparfor/rebuildParforController (line 217)
Unexpected failure to indicate all intervals added.
Error in distcomp.remoteparfor/handleIntervalErrorResult (line 253)
obj.rebuildParforController();
Error in distcomp.remoteparfor/getCompleteIntervals (line 387)
[r, err] = obj.handleIntervalErrorResult(r);
Error in ap.AnalysisPointsTable/Calculate (line 39)
parfor i = 1:N
I set some breakpoints in the MathWorks code and saw that the following error is generated first; the error above is then generated while handling it:
Returned in line 238 of remoteparfor:
ParallelException with properties:
identifier: 'parallel:lang:parfor:SourceCodeNotAvailable'
message: 'Worker unable to find file.'
cause: {[1×1 MException]}
remotecause: {[1×1 MException]}
stack: [8×1 struct]
Correction: []
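Since the underlying exception reports that a worker is unable to find a file, it may help to ask every worker whether it can resolve the code in question. The sketch below uses the class name from the stack trace above as the target; that this class is the missing file is an assumption, so substitute whichever function or class the nested MException names.

```matlab
pool = gcp;   % current pool, or a new one with the default profile

% Class named in the stack trace; assumed, not confirmed, to be the missing code.
target = 'ap.AnalysisPointsTable';

% Run the check once on every worker; each returns true if `which` finds the target.
f = parfevalOnAll(pool, @(name) ~isempty(which(name)), 1, target);
found = fetchOutputs(f);          % one logical value per worker

fprintf('%d of %d workers can resolve %s\n', nnz(found), numel(found), target);
```

If only some workers report false, the attached files did not reach every node, which would be consistent with failures becoming more likely as the worker count grows.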


Answers (0)
