Handling errors in parfeval processes
I am running a conventional parallel computing arrangement with a client and a number of workers. The client distributes jobs to the workers using parfeval, and then retrieves solutions using fetchNext.
In rare instances, a worker process will fail, usually due to a computation that consumes too much memory. I am not able to fully inspect this failure, nor am I able to construct a simple example of it. I do observe that the solution from the failed process is missing from my output log, and that the remaining jobs continue to be processed successfully.
I have yet to find any documentation about how MATLAB handles process failures associated with parfeval. Nor have I found a listing of the error messages that can be reported in the Futures object (i.e., futures.Error.message).
At present, I am thinking about the following questions:
- Is the output argument in the Futures object for the failed job set to a specific value?
- Does the worker with the failed process continue to operate as part of the parpool, or is it compromised by the failure?
- Does the error message in the Futures object for the failed job provide information about a memory failure?
Best, Mark
1 Comment
Bruno Luong
2023-4-2
+1 for the question.
I find that the way MATLAB handles errors in parallel computing is not very convenient for debugging.
Accepted Answer
Walter Roberson
2023-4-2
You can potentially use try/catch to control errors on the workers.
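As a minimal sketch of the try/catch idea, the task can be wrapped so that an error raised on the worker comes back to the client as ordinary output instead of as a failed future. The name `safeCall` and its three outputs are hypothetical, chosen only for illustration. This catches MATLAB-level exceptions only; it is an assumption that the failure surfaces as an exception at all, and a worker process that is terminated outright would not reach the catch block.

```matlab
% Save as safeCall.m (hypothetical helper, not part of any toolbox).
function [ok, result, errReport] = safeCall(fcn, varargin)
% Run fcn(varargin{:}) and report any error as data instead of throwing.
%   ok        - true if fcn completed, false if it threw an exception
%   result    - first output of fcn, or [] on failure
%   errReport - text report of the exception, or '' on success
    ok = true;
    result = [];
    errReport = '';
    try
        result = fcn(varargin{:});
    catch err
        ok = false;
        % Plain-text report (no hyperlinks) so it is easy to write to a log.
        errReport = getReport(err, 'extended', 'hyperlinks', 'off');
    end
end
```

On the client, this could be submitted as `f = parfeval(@safeCall, 3, @myTask, x)` and collected with `[idx, ok, result, errReport] = fetchNext(f)`, where `myTask` and `x` stand in for the real task and its input. Every job then produces a log entry, including the ones that failed.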
If there is an error, then the hidden property OutputArguments of the future will be {}, the same as if there had been no outputs in normal circumstances.
Future objects have an Error property. Once the State is 'finished (unread)', the Error property will be empty if there was no execution error. If the Error property is non-empty, it will have a field remotecause that contains an exception object.
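The Error property can be checked on the client once the futures have finished. The sketch below assumes a parallel pool is available and uses two stand-in tasks (`magic`, and a call to `error` that fails on purpose); neither comes from the original question.

```matlab
% Submit one task that succeeds and one that throws on the worker.
futures(1) = parfeval(@magic, 1, 4);
futures(2) = parfeval(@error, 0, 'demo:fail', 'Deliberate failure');

% Block until both futures have finished running.
wait(futures);

% Inspect each future without fetching its outputs.
for k = 1:numel(futures)
    f = futures(k);
    if isempty(f.Error)
        fprintf('Future %d finished without error.\n', f.ID);
    else
        % f.Error is an exception object; its message is the usual
        % starting point for diagnosis.
        fprintf('Future %d failed: %s\n', f.ID, f.Error.message);
    end
end
```

Checking Error this way avoids calling fetchNext or fetchOutputs on a failed future, both of which throw in that case. If the collection loop uses fetchNext, wrapping that call in try/catch on the client is one way to log the failure and keep going.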
The worker itself will have recovery operations done on it automatically. It will not, however, clean up all state, so if you assigned a number of large variables, they might still exist on the worker.
19 Comments
Walter Roberson
2023-4-12
I have never personally had to worry about crashing futures, but I can see that it could be useful in general to have some kind of configurable retry limit on any technology that automatically retries on failure. I would imagine that the most commonly used values would be 0 (no retries), 1 (one retry), and inf (keep retrying), but in some cases people might want, for example, 5 or 10 retries.
More Answers (0)
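A configurable retry limit of this kind can be approximated on the client side. The following is only a sketch under stated assumptions: `runWithRetries` is a hypothetical helper, not an existing MATLAB option, and it runs one task at a time for clarity rather than keeping the whole pool busy.

```matlab
% Save as runWithRetries.m (hypothetical helper, not a built-in feature).
function result = runWithRetries(fcn, maxRetries, varargin)
% Run fcn(varargin{:}) on a worker, resubmitting after a failure.
%   maxRetries = 0   -> no retries
%   maxRetries = 1   -> one retry
%   maxRetries = inf -> keep retrying until the task succeeds
    attempt = 0;
    while true
        f = parfeval(fcn, 1, varargin{:});
        try
            % fetchOutputs waits for the future and throws if it failed.
            result = fetchOutputs(f);
            return
        catch err
            attempt = attempt + 1;
            if attempt > maxRetries
                rethrow(err);   % retry budget used up: report the failure
            end
            warning('runWithRetries:retry', ...
                'Attempt %d failed: %s', attempt, err.message);
        end
    end
end
```

For example, `r = runWithRetries(@myTask, 5, x)` would allow up to five resubmissions, with `myTask` and `x` as placeholders. Retrying only helps when the failure is transient; a task that always exhausts memory will fail on every attempt.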