Handling errors in parfeval processes

Question

Mark Brandon 2023-4-2

https://ww2.mathworks.cn/matlabcentral/answers/1939879-handling-errors-in-parfeval-processes

Commented: Mark Brandon 2023-4-12

I am running a conventional parallel computing arrangement with a client and a number of workers. The client distributes jobs to the workers using parfeval, and then retrieves solutions using fetchnext.

In rare instances, a worker process will fail, usually due to a computation that consumes too much memory. I am not able to fully inspect this failure, nor am I able to construct a simple example of the failure. I do observe that the solution from this failed process is missing in my output log, and the remaining jobs continue to be successfully processed.

I have yet to find any documentation about how Matlab handles process failures associated with parfeval. Nor have I found a listing of the error messages that can be reported in the Futures object (i.e., futures.Error.message).

At present, I am thinking about the following questions:

Is the output argument in the Futures object for the failed job set to a specific value?
Does the worker with the failed process continue to operate as part of the parpool, or is it compromised by the failure?
Does the error message in the Futures object for the failed job provide information about a memory failure?

Best, Mark

Bruno Luong 2023-4-2

+1 for the question.

I find the way MATLAB handles errors in case of parallel computing is not very convenient for debugging.

Answer 1

Walter Roberson 2023-4-2

https://ww2.mathworks.cn/matlabcentral/answers/1939879-handling-errors-in-parfeval-processes#answer_1207039

You can potentially use try/catch to control errors on the workers.

If there is an error then the hidden property OutputArguments of the future will be {} -- same as if there had been no outputs in normal circumstances.

There is an Error property for future objects. Once the State is 'finished (unread)', the Error property will be empty if there was no execution error. If the Error property is non-empty, it will have a field remotecause that contains an exception object.

The worker itself will have recovery operations done on it automatically. It will not, however, clean up all state, so if you assigned a bunch of large variables then they might still exist in the workers.
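
To make the two options above concrete, here is a minimal sketch, not a definitive recipe. It assumes an open parallel pool (Parallel Computing Toolbox) and that the code is saved as a script file the workers can see (local functions in scripts need R2016b or later). The names guardedTask and riskyTask are made up for illustration. The first variant catches the error on the worker with try/catch and returns it as ordinary data; the second lets the error propagate and inspects the Error property of the future on the client.

```matlab
% Sketch only: guardedTask and riskyTask are hypothetical example functions.
pool = gcp;   % reuse the current pool, or start one with default settings

% Variant 1: catch the error on the worker and return it as data.
f1 = parfeval(pool, @guardedTask, 2, -1);
[value, errText] = fetchOutputs(f1);   % no throw: the worker caught the error

% Variant 2: let the error propagate, then inspect the future on the client.
f2 = parfeval(pool, @riskyTask, 1, -1);
wait(f2);                              % wait does not throw on a failed future
if ~isempty(f2.Error)
    disp(f2.Error.message);
end

function [value, errText] = guardedTask(x)
    % Wrap the real work so an error comes back as a message string.
    try
        value = riskyTask(x);
        errText = '';
    catch err
        value = NaN;
        errText = err.message;
    end
end

function y = riskyTask(x)
    % Stand-in for the real computation; fails for negative input.
    if x < 0
        error('demo:negativeInput', 'Input must be non-negative.');
    end
    y = sqrt(x);
end
```

Note that try/catch only helps when the function raises a MATLAB error. It cannot catch a worker process that is killed outright.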

Sam Marshalik 2023-4-6

@Walter Roberson Apologies for the mixed signals, I will clarify. Depending on the cluster environment AND the kind of job the user is running, the behavior may be different. Edric (Dev) responded to Raymond's comment with the following:

"Note that on "local" and MJS clusters, the parallel pool will not necessarily immediately terminate when a single worker crashes. On those clusters, pools that have not yet used spmd can survive losing workers."

As an example, when using MATLAB Parallel Server with a 3rd party scheduler, like Slurm, we rely on MPI to start the worker processes. In that instance, if one of the workers crashes, it terminates the entire MPI ring. However, if someone is running a local parpool or MATLAB Parallel Server on a MATLAB Job Scheduler cluster and is not using any kind of MPI workflows (e.g., spmd) then a worker crash will not terminate the pool.

Sounds like we need to create some sort of flow chart or diagram in our documentation that will dive into this more. I will try to work on something that I can share in this Answer that will help address this.

Sam Marshalik 2023-4-6

@Mark Brandon sorry for the late reply. If you don't mind, tag my name in the response so I get a notification when you respond.

I believe that you are notified when a worker goes down in a parallel pool (don't recall the exact error). There should be a warning message in MATLAB that informs you that a worker has gone down or the parallel pool size has been reduced.
I will clarify what I meant by "The erroring future will run on a different worker.": Let's say you have a queue of futures that you kicked off and one of them causes some error. In this example, you have 8 cores on your machine, so you start a parallel pool of 8 workers. Your futures go off and are running on your workers. At some point, the problematic future is kicked off and lands on one of the workers. The future runs and at some point in the computation causes the worker to crash (e.g., out of memory). In this instance, the pool of workers is reduced by 1 and you now have 7 workers in your parallel pool. The future is moved back into the queue and it attempts to re-run on a different worker in the future (I am not sure what state it has at that point, but can look into it, if it matters to you). The worker that crashed is permanently removed from the parallel pool and can't be added back in, but, your parallel pool stays open with the remaining workers.

Your problem description is a great fit for parfeval, as you can analyze the results as they come in and terminate the remaining work (futures) if you find your result. Like Walter mentioned above, you can dig into the future to understand if it has an error associated with it. You can index into the specific future and see the error message associated with it:

>> fut(i).Error.message
ans = 
    'Requested 1x1000000000000 (931.3GB) array exceeds maximum array size preference (47.9GB). This might cause MATLAB to become unresponsive.'

If you are running work and a future does fail without killing the worker, then it will have a state of "Finished" with an error associated with it. When gathering results from your futures, keep an eye on the Error.message and see if any of the futures have one. You can keep track of all the failed futures, analyze why they are failing, and then re-run them.
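
As a rough illustration of that bookkeeping, the sketch below waits for a batch of futures, fetches outputs only from the ones whose Error property is empty, and records the indices of the rest so they can be analyzed or resubmitted. The inputs and the function sqrtStrict are invented for the example; this covers futures that finish with an error, not futures lost to a crashed worker.

```matlab
% Sketch only: sqrtStrict and the inputs are hypothetical.
pool = gcp;
inputs = {4, -1, 9};                  % the second input is designed to fail
nJobs = numel(inputs);

futures(1:nJobs) = parallel.FevalFuture;   % preallocate the future array
for k = 1:nJobs
    futures(k) = parfeval(pool, @sqrtStrict, 1, inputs{k});
end
wait(futures);                        % block until every future has finished

results = cell(1, nJobs);
failed = [];                          % indices of futures that errored
for k = 1:nJobs
    if isempty(futures(k).Error)
        results{k} = fetchOutputs(futures(k));
    else
        failed(end+1) = k; %#ok<SAGROW>
        fprintf('Future %d failed: %s\n', k, futures(k).Error.message);
    end
end

function y = sqrtStrict(x)
    % Stand-in computation that errors on negative input.
    if x < 0
        error('demo:negativeInput', 'Input must be non-negative.');
    end
    y = sqrt(x);
end
```

Checking Error before calling fetchOutputs avoids the exception that fetchOutputs would otherwise throw for a failed future.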

Hopefully that clarifies things. Feel free to ask any other question you may have.

Mark Brandon 2023-4-9

@Sam Marshalik Thanks to you and Walter Roberson for the detailed answers. And thanks for the heads-up about the "tag my name" capability in Matlab Answers.

I have a followup question:

I found it really helpful to read your description of what happens when a worker fails versus when it returns a future with a finished/error status. In the first case, the parallel pool continues to operate, but with a smaller size (given that the failed worker is no longer available), and the future that caused the failure is run again on another worker. In the second case, all workers remain available to run additional futures, but some code is needed to account for the future with the error.

Regarding the first case: It would be very useful to have a simple code that simulates the case of a failed worker. I tried sending a future with the function @exit, but the worker did not "fail" but rather returned the future with a finished/error status. Are there simple ways to induce an out-of-memory error or a segmentation-error?

Regarding out-of-memory errors: Could one start a parallel pool with a bunch of workers, and then allow them to fail until the size of the pool was consistent with the amount of available memory?

Sam Marshalik 2023-4-10

Edited: Sam Marshalik 2023-4-10

Hey @Mark Brandon, crashing the worker will require some effort. You need to find a variable small enough to load into your machine's memory, but big enough so that when we are doing operations on it, it causes a failure. Here is an example of something I tried:

function causeFailure
% Note: the memory function is only available on Windows.
[resources,~] = memory;
disp(['Max variable size in GB: ' num2str(resources.MaxPossibleArrayBytes/1e9)]);
% Generate a variable that is small enough to fit into your memory, but
% will crash MATLAB if duplicated. I have 48 GB of memory; this is about 24 GB:
aaa = rand(300000,10000);
% Look up the test variable by name rather than by its position in the list.
info = whos('aaa');
disp(['Test variable size in GB: ' num2str(info.bytes/1e9)]);
if info.bytes > resources.MaxPossibleArrayBytes
    error("This example will not run, reduce the test variable size")
end
p = gcp('nocreate');
if isempty(p)
    p = parpool("Processes", 1);
end
% Request 0 outputs; the goal is only to stress the worker.
f = parfeval(p, @failRun, 0, aaa);
end
function out = failRun(inputVar)
% The inner dimensions must agree, so multiply by the transpose. The result
% would be 300000x300000 (about 720 GB). Whether this crashes the worker or
% only raises an out-of-memory error depends on the machine and on the
% maximum array size preference.
out = inputVar*inputVar.';
end

In my case, my computer froze and I had to force quit MATLAB. I think if I waited long enough MATLAB would have crashed on its own. You will need to play around with the variable size, but that could perhaps work. If you do not have luck with this, let me know and I can have a chat with Dev to see if they have suggestions on a test case.

On a side note, you mentioned:

"I tried sending a future with the function @exit, but the worker did not "fail" but rather returned the future with a finished/error status. Are there simple ways to induce an out-of-memory error or a segmentation-error?"

Did the fact that the future failed, but the task did not have a state of "Failed" cause confusion? Would love to hear your feedback/input on that.

Walter Roberson 2023-4-12

@Sam Marshalik

I have never personally had to worry about crashing futures, but I can see that it could be useful in general to have some kind of configurable retry limit on any technology that automatically retries on failure. I would imagine that the most commonly used values would be 0 (no retries), 1 (one retry), inf (keep retrying), but I can imagine that in some cases people might want (for example) 5 or 10 retries.
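
Until something like that exists, a retry limit can be approximated on the client for futures that finish with an error. The sketch below is only an illustration of the idea: maxRetries and flakyTask are made-up names, and it does not change how the pool itself re-queues a future after a worker crash.

```matlab
% Sketch only: maxRetries and flakyTask are hypothetical.
pool = gcp;
maxRetries = 2;     % 0 = no retries, 1 = one retry, Inf = keep retrying
attempt = 0;        % number of retries used so far

while true
    f = parfeval(pool, @flakyTask, 1, attempt);
    wait(f);
    if isempty(f.Error)
        result = fetchOutputs(f);     % success: keep the result and stop
        break
    elseif attempt >= maxRetries
        error('demo:retriesExhausted', ...
              'Giving up after %d retries: %s', maxRetries, f.Error.message);
    end
    attempt = attempt + 1;            % resubmit the same work
end

function y = flakyTask(attempt)
    % Stand-in that fails on the first two attempts, then succeeds.
    if attempt < 2
        error('demo:transient', 'Simulated failure on attempt %d.', attempt);
    end
    y = 42;
end
```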

Mark Brandon 2023-4-12

@Sam Marshalik @Walter Roberson. Thanks for the comments. I can confirm Sam's description of how parfeval works on a local system. I set up a parpool on a single node at our HPC system, and I defined a low limit for maximum memory for the job. I started with 27 workers and a total of 50 Gb of memory for all workers combined. The futures for the job would have moments where they needed to use ~15 Gb each. As the parfeval job ran, the number of active workers quickly dropped down to about 3. That said, the submitted futures all ran successfully, despite the chaos of the crashing (failing) workers.
