Handling errors in parfeval processes

I am running a conventional parallel computing arrangement with a client and a number of workers. The client distributes jobs to the workers using parfeval, and then retrieves solutions using fetchNext.
In rare instances, a worker process will fail, usually due to a computation that consumes too much memory. I am not able to fully inspect this failure, nor am I able to construct a simple example of it. I do observe that the solution from the failed process is missing from my output log, and the remaining jobs continue to be successfully processed.
I have yet to find any documentation about how MATLAB handles process failures associated with parfeval. Nor have I found a listing of the error messages that can be reported in the Futures object (i.e., futures.Error.messages).
At present, I am thinking about the following questions:
  1. Is the output argument in the Futures object for the failed job set to a specific value?
  2. Does the worker with the failed process continue to operate as part of the parpool, or is it compromised by the failure?
  3. Does the error message in the Futures object for the failed job provide information about a memory failure?
Best, Mark
  1 Comment
Bruno Luong 2023-4-2
+1 for the question.
I find that the way MATLAB handles errors in parallel computing is not very convenient for debugging.


Accepted Answer

Walter Roberson 2023-4-2
You can potentially use try/catch to control errors on the workers.
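As a rough illustration of the try/catch idea, the task can be wrapped so that a MATLAB-level error comes back to the client as an ordinary output instead of failing the future. This is only a sketch: safeTask and riskyComputation are hypothetical names, not toolbox functions, and a try/catch on the worker would presumably not help if the worker process itself is killed (for example by the operating system when memory runs out).

```matlab
% Sketch: trap MATLAB-level errors on the worker so every future returns.
% safeTask and riskyComputation are hypothetical names for illustration.
inputs = [1, -1, 4];
futures(1:numel(inputs)) = parallel.FevalFuture;   % preallocate
for k = 1:numel(inputs)
    % Request two outputs: the value and the caught exception (if any).
    futures(k) = parfeval(@safeTask, 2, @riskyComputation, inputs(k));
end

results = cell(1, numel(inputs));
failures = cell(1, numel(inputs));
for k = 1:numel(inputs)
    [idx, value, errInfo] = fetchNext(futures);
    results{idx} = value;
    failures{idx} = errInfo;   % empty when the task succeeded
end

function [value, errInfo] = safeTask(fcn, varargin)
    % Run fcn(varargin{:}) and return any exception instead of throwing.
    value = [];
    errInfo = [];
    try
        value = fcn(varargin{:});
    catch err
        errInfo = err;         % MException, sent back to the client
    end
end

function y = riskyComputation(x)
    % Stand-in for the real computation; fails on negative input.
    if x < 0
        error('demo:negativeInput', 'Input must be non-negative.');
    end
    y = sqrt(x);
end
```

With this pattern the client can log failures{idx} next to the input that caused it, so a failed task no longer shows up only as a missing entry in the output log.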
If there is an error then the hidden property OutputArguments of the future will be {} -- same as if there had been no outputs in normal circumstances.
There is an Error property for future objects. Once the State is 'finished (unread)', the Error property will be empty if there was no execution error. If the Error property is non-empty then it will have a field remotecause that contains an exception object.
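A minimal sketch of checking the Error property from the client, assuming a pool is available. failingTask is a hypothetical stand-in for a real computation, and the remotecause access is guarded because its presence is taken from the description above rather than confirmed here.

```matlab
% Sketch: inspect a failed future from the client.
% failingTask is a hypothetical stand-in for a real computation.
f = parfeval(@failingTask, 1, 3);
wait(f);                      % block until the future has stopped running

err = f.Error;                % empty when the task completed without error
taskFailed = ~isempty(err);
if taskFailed
    fprintf('Future %d failed: %s\n', f.ID, err.message);
    % Guarded access: remotecause is assumed from the answer above.
    if isprop(err, 'remotecause') && ~isempty(err.remotecause)
        disp(err.remotecause);
    end
else
    value = fetchOutputs(f);  % fetch only when no error occurred
end

function y = failingTask(x)
    % Always throws, to imitate a task that errors on the worker.
    y = x;
    error('demo:fail', 'Simulated failure for input %d.', x);
end
```

Checking Error before fetching avoids having fetchOutputs rethrow the worker's error on the client.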
The worker itself will have recovery operations done on it automatically. It will not, however, clean up all state, so if you assigned a bunch of large variables then they might still exist in the workers.
  19 Comments
Walter Roberson 2023-4-12
I have never personally had to worry about crashing futures, but I can see that it could be useful in general to have some kind of configurable retry limit on any technology that automatically retries on failure. I would imagine that the most commonly used values would be 0 (no retries), 1 (one retry), inf (keep retrying), but I can imagine that in some cases people might want (for example) 5 or 10 retries.
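To make the retry-limit idea concrete, here is a hedged sketch of a cap enforced on the client side. Nothing in this thread identifies a built-in setting for it, so maxRetries and flakyTask are hypothetical, and the loop only handles errors that surface as exceptions when the outputs are fetched.

```matlab
% Sketch of a client-side retry cap. maxRetries and flakyTask are
% hypothetical; this is not a toolbox setting.
maxRetries = 2;               % 0 = no retries, inf = keep retrying
taskInput = 5;
attempt = 0;
succeeded = false;
while ~succeeded && attempt <= maxRetries
    attempt = attempt + 1;
    f = parfeval(@flakyTask, 1, taskInput, attempt);
    try
        value = fetchOutputs(f);   % rethrows the worker error, if any
        succeeded = true;
    catch err
        fprintf('Attempt %d failed: %s\n', attempt, err.message);
    end
end

function y = flakyTask(x, attempt)
    % Fails on the first attempt to imitate a transient fault.
    if attempt < 2
        error('demo:transient', 'Simulated transient failure.');
    end
    y = 2 * x;
end
```

With maxRetries = 2 the task is submitted at most three times; here it succeeds on the second submission.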
Mark Brandon
Mark Brandon 2023-4-12
@Sam Marshalik @Walter Roberson Thanks for the comments. I can confirm Sam's description of how parfeval works on a local system. I set up a parpool on a single node at your HPC system, and I defined a low limit for the maximum memory for the job. I started with 27 workers and a total of 50 GB of memory for all workers combined. The futures for the job would have moments where they needed to use ~15 GB each. As the parfeval job ran, the number of active workers quickly dropped to about 3. That said, the submitted futures all ran successfully, despite the chaos of the crashing (failing) workers.


More Answers (0)

Release

R2023a
