Handling errors in parfeval processes
I am running a conventional parallel computing arrangement with a client and a number of workers. The client distributes jobs to the workers using parfeval, and then retrieves solutions using fetchnext.
In rare instances, a worker process will fail, usually due to a computation that consumes too much memory. I am not able to fully inspect this failure, nor am I able to construct a simple example of the failure. I do observe that the solution from this failed process is missing in my output log, and the remaining jobs continue to be successfully processed.
I have yet to find any documentation about how MATLAB handles process failures associated with parfeval. Nor have I found a listing of the error messages that can be reported in the Futures object (i.e., futures.Error.message).
At present, I am thinking about the following questions:
- Is the output argument in the Futures object for the failed job set to a specific value?
- Does the worker with the failed process continue to operate as part of the parpool, or is it compromised by the failure?
- Does the error message in the Futures object for the failed job provide information about a memory failure?
Best, Mark
1 Comment
Bruno Luong
2023-4-2
+1 for the question.
I find the way MATLAB handles errors in case of parallel computing is not very convenient for debugging.
Accepted Answer
Walter Roberson
2023-4-2
You can potentially use try/catch to control errors on the workers.
If there is an error then the hidden property OutputArguments of the future will be {} -- same as if there had been no outputs in normal circumstances.
There is an Error property for future objects. Once the State is 'finished (unread)', the Error property is empty if there was no execution error. If the Error property is non-empty, it has a field remotecause that contains an exception object.
The worker itself will have recovery operations done on it automatically. It will not, however, clean up all state, so if you assigned a bunch of large variables then they might still exist in the workers.
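A minimal sketch of that check, assuming a thread-based backgroundPool and using chol on scalar inputs as a stand-in task (chol throws for a negative scalar). The inputs and variable names are hypothetical. It reads the Error property of each finished future before deciding whether to call fetchOutputs:

```matlab
% Sketch: decide per future whether to fetch, based on the Error property.
pool   = backgroundPool();            % thread-based pool
inputs = {4, -1, 9};                  % hypothetical inputs; chol(-1) throws, standing in for a failing task
futs(1:numel(inputs)) = parallel.FevalFuture;
for k = 1:numel(inputs)
    futs(k) = parfeval(pool, @chol, 1, inputs{k});
end
wait(futs);                           % block until every future has finished

results = nan(1, numel(inputs));
failed  = false(1, numel(inputs));
for k = 1:numel(futs)
    if isempty(futs(k).Error)
        results(k) = fetchOutputs(futs(k));   % safe: this future has no recorded error
    else
        failed(k) = true;                     % skip fetching; it would rethrow the error
        fprintf('Future %d failed: %s\n', k, futs(k).Error.message);
    end
end
```

This only covers futures that finish with a MATLAB error; a future whose worker process crashes is a different case.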
19 Comments
Walter Roberson
2023-4-2
bg = backgroundPool();
fut = parfeval(bg, @() ones(1,1e12, 'uint8'), 1);
wait(fut)
fut
fut =
FevalFuture with properties:
ID: 1
Function: @()ones(1,1e12,'uint8')
CreateDateTime: 02-Apr-2023 21:25:10
StartDateTime: 02-Apr-2023 21:25:11
RunningDuration: 0 days 0h 0m 0s
State: finished (unread)
Error: Requested 1x1000000000000 (931.3GB) array exceeds maximum array size preference (30.9GB). This might cause MATLAB to become unresponsive.
LiveEditorEvaluationHelperEeditorId>@()ones(1,1e12,'uint8') (line 2)
fut.Error.remotecause{1}
ans =
MException with properties:
identifier: 'MATLAB:array:SizeLimitExceeded'
message: 'Requested 1x1000000000000 (931.3GB) array exceeds maximum array size preference (30.9GB). This might cause MATLAB to become unresponsive.'
cause: {}
stack: [1×1 struct]
Correction: []
Walter Roberson
2023-4-2
Note that if you fetch the outputs of a future that has an error status, then you will get an error in the fetchNext call -- the recorded exception will be thrown.
You can also get an error when you go to fetch outputs if you ask for more outputs than were recorded in the future.
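One way to keep collecting results despite such a rethrown exception is to wrap each fetchNext call in try/catch. The sketch below assumes, as I understand the fetchNext documentation, that a future whose error has been thrown is then marked as read, so later calls move on to the remaining futures. The inputs are hypothetical and chol again stands in for a task that can fail:

```matlab
% Sketch: collect results with fetchNext and survive futures that errored.
pool   = backgroundPool();
inputs = {4, -1, 9};                      % hypothetical inputs; chol(-1) throws
futs(1:numel(inputs)) = parallel.FevalFuture;
for k = 1:numel(inputs)
    futs(k) = parfeval(pool, @chol, 1, inputs{k});
end

collected = [];                           % outputs of the futures that succeeded
messages  = {};                           % messages of the errors that were rethrown
for n = 1:numel(futs)                     % bounded loop: one fetch attempt per future
    try
        [~, value] = fetchNext(futs);     % rethrows if the next finished future errored
        collected(end+1) = value;         %#ok<SAGROW>
    catch err
        messages{end+1} = err.message;    %#ok<SAGROW>
    end
end
```

Results arrive in completion order, so the order of collected is not guaranteed to match the order of inputs.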
Sam Marshalik
2023-4-3
Hello @Mark Brandon, I am the Product Manager for MATLAB Parallel Server and would be glad to chat with you offline to help answer any additional questions that you may have. Please feel free to reach out to me at smarshal@mathworks.com.
The one question that I think was not answered is:
- Does the worker with the failed process continue to operate as part of the parpool, or is it compromised by the failure?
I am assuming you are using the Local/Process profile or running MATLAB Parallel Server with MATLAB Job Scheduler (let me know if the case is otherwise). In that scenario, if a worker crashes due to something (like out of memory), the worker will shut down and attempt to restart. If/When it comes back online, it will not be able to re-join the running parallel pool as the MPI ring has already been formed at that point and it can't add additional workers (even those that were previously part of it). The erroring future will run on a different worker.
Mark Brandon
2023-4-3
Sam, thanks for your very helpful explanation. I am using the Local/Process profile.
I have two questions about your description of error handling for a crashed worker.
1) Will I see the parpool reduced by one after a worker crashes?
2) You said that "The erroring future will run on a different worker." I find this sentence hard to follow. Do you mean that a new worker is introduced? That would conflict with what you said earlier, that the MPI ring cannot be modified. Or do you mean that the future that was running on the crashed worker is kept in the set of futures but flagged with an error status of some sort?
Best,
Mark
Mark Brandon
2023-4-3
My overall objective is to find a way to respond to the rare case when a worker crashes. My computation involves a direct search for a best-fit solution for a nonlinear inverse problem. I am using a random controlled search algorithm, which means that the search is able to continue even when a job crashes. I would like to handle the crash in a more graceful fashion, so I am looking for a way to identify in the future when a job has crashed, and also be able to adjust for a change in the number of workers in the parpool.
Walter Roberson
2023-4-6
Recently in https://www.mathworks.com/matlabcentral/answers/1923895-how-to-shut-down-all-running-workers-of-paarpools#answer_1186410 @Raymond Norris said about parpools,
"If even if a single worker crashes, all workers will terminate"
I thought I had seen in the past Mathworks people say that there is recovery done to allow parfor to continue if workers crash, and you discussed above some of the recovery steps... but Raymond is a key developer for Parallel Computing Toolbox and appears to be saying something different.
Could we get clarification on this matter?
Sam Marshalik
2023-4-6
@Walter Roberson Apologies for the mixed signals, I will clarify. Depending on the cluster environment AND the kind of job the user is running, the behavior may be different. Edric (Dev) responded to Raymond's comment with the following:
"Note that on "local" and MJS clusters, the parallel pool will not necessarily immediately terminate when a single worker crashes. On those clusters, pools that have not yet used spmd can survive losing workers."
As an example, when using MATLAB Parallel Server with a 3rd party scheduler, like Slurm, we rely on MPI to start the worker processes. In that instance, if one of the workers crashes, it terminates the entire MPI ring. However, if someone is running a local parpool or MATLAB Parallel Server on a MATLAB Job Scheduler cluster and is not using any kind of MPI workflows (e.g., spmd), then a worker crash will not terminate the pool.
Sounds like we need to create some sort of flow chart or diagram in our documentation that will dive into this more. I will try to work on something that I can share in this Answer that will help address this.
Sam Marshalik
2023-4-6
@Mark Brandon sorry for the late reply. If you don't mind, tag my name in the response so I get a notification when you respond.
- I believe that you are notified when a worker goes down in a parallel pool (don't recall the exact error). There should be a warning message in MATLAB that informs you that a worker has gone down or the parallel pool size has been reduced.
- I will clarify what I meant by "The erroring future will run on a different worker.": Let's say you have a queue of futures that you kicked off and one of them causes some error. In this example, you have 8 cores on your machine, so you start a parallel pool of 8 workers. Your futures go off and are running on your workers. At some point, the problematic future is kicked off and lands on one of the workers. The future runs and at some point in the computation causes the worker to crash (e.g., out of memory). In this instance, the pool of workers is reduced by 1 and you now have 7 workers in your parallel pool. The future is moved back into the queue and it attempts to re-run on a different worker in the future (I am not sure what state it has at that point, but can look into it, if it matters to you). The worker that crashed is permanently removed from the parallel pool and can't be added back in, but, your parallel pool stays open with the remaining workers.
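To notice a shrinking pool from code, one option is to poll the pool's NumWorkers property between fetches. This is only a sketch: the helper name currentPoolSize is hypothetical, and it assumes NumWorkers reflects the loss of a crashed worker on a local or MJS pool.

```matlab
function n = currentPoolSize()
    % Return the number of workers in the current parallel pool,
    % or 0 if no pool is open.
    pool = gcp('nocreate');      % 'nocreate' avoids starting a pool as a side effect
    if isempty(pool)
        n = 0;
    else
        n = pool.NumWorkers;
    end
end
```

A client loop could record currentPoolSize() before submitting futures and compare it after each fetchNext call, for example to limit how many futures it keeps in flight.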
Your problem description is a great fit for parfeval, as you can analyze the results as they come in and terminate the remaining work (futures) if you find your result. Like Walter mentioned above, you can dig into the future to understand if it has an error associated with it. You can index into the specific future and see the error message associated with it:
>> fut(i).Error.message
ans =
'Requested 1x1000000000000 (931.3GB) array exceeds maximum array size preference (47.9GB). This might cause MATLAB to become unresponsive.'
If you are running work and a future does fail without killing the worker, then it will have a state of "Finished" with an error associated with it. When gathering results from your futures, keep an eye on the Error.message and see if any of the futures have one. You can keep track of all the failed futures, analyze why they are failing, and then re-run them.
Hopefully that clarifies things. Feel free to ask any other question you may have.
Mark Brandon
2023-4-9
@Sam Marshalik Thanks to you and Walter Roberson for the detailed answers. And thanks for the heads-up about the "tag my name" capability in Matlab Answers.
I have a followup question:
I found it really helpful to read your description of what happens when a worker fails versus when it returns a future with a finished/error status. In the first case, the parallel pool continues to operate, but with a smaller size (given that the failed worker is no longer available), and the future that caused the failure is run again on another worker. In the second case, all workers remain available to run additional futures, but some code is needed to account for the future with the error.
Regarding the first case: It would be very useful to have a simple code that simulates the case of a failed worker. I tried sending a future with the function @exit, but the worker did not "fail" but rather returned the future with a finished/error status. Are there simple ways to induce an out-of-memory error or a segmentation-error?
Regarding out-of-memory errors: Could one start a parallel pool with a bunch of workers, and then allow them to fail until the size of the pool was consistent with the amount of available memory?
Walter Roberson
2023-4-9
Test case:
p = gcp()
output = parfevalOnAll(p, @makebig, 0)
p
output2 = parfevalOnAll(p, @whome, 1)
p
output3 = fetchOutputs(output2)
arrayfun(@disp, output3)
function makebig
    assignin('base', 'small', 5);
    assignin('base', 'test', ones(1,1e12, 'uint8'));
end
function out = whome
    out = evalin('base', 'whos()');
end
On my system, 'small' is created on each worker, but 'test' is not because memory is exceeded. But the workers continue to exist and 'small' can be examined on each of them.
Mark Brandon
2023-4-9
Thanks for the test example. The failed assignment of "test" is caught by MATLAB itself, so it never crashes the worker. When I have an out-of-memory error on my HPC system, it is caused by a worker that requires more memory from the local node than is available. The worker fails in this case, and the size of the parallel pool is reduced by the number of failed (crashed) workers.
So I am looking for a way to crash a worker.
Walter Roberson
2023-4-9
I haven't had much chance to test this. Running locally, it did indeed result in a message about a worker being disconnected. But I'm also having difficulty getting back to normal -- after restarting MATLAB I got java messages about fatal error connecting to message service. I suspect I am going to need to reboot.
p = gcp()
output = parfevalOnAll(p, @makebig, 0)
p
wait(output)
output2 = parfevalOnAll(p, @whome, 1)
p
wait(output2)
output3 = fetchOutputs(output2)
arrayfun(@disp, output3)
function makebig
    assignin('base', 'small', 5);
    newtest = cell(0,1);
    assignin('base', 'test', newtest);
    while true
        newtest{end+1} = ones(1,1e10, 'uint8');
        assignin('base', 'test', newtest);
    end
end
function out = whome
    out = evalin('base', 'whos()');
end
Sam Marshalik
2023-4-10
Edited: Sam Marshalik
2023-4-10
Hey @Mark Brandon, crashing the worker will require some effort. You need to find a variable small enough to load into your machine's memory, but big enough so that when we are doing operations on it, it causes a failure. Here is an example of something I tried:
function causeFailure
    % Note: the memory function is only available on Windows.
    [resources,~] = memory;
    disp(['Max variable size in GB: ' num2str(resources.MaxPossibleArrayBytes/1000/1000/1000)]);
    % Generate some variable that is small enough to fit into your memory, but
    % will crash MATLAB if duplicated. I have 48 GB of memory:
    aaa = rand(300000,10000);
    whosResult = whos;
    % Assuming that the aaa variable will be first in the list. There are more
    % refined ways to do this, but doing something quick and dirty.
    disp(['Test variable size in GB: ' num2str(whosResult(1).bytes/1000/1000/1000)]);
    if whosResult(1).bytes > resources.MaxPossibleArrayBytes
        error("This example will not run, reduce the test variable size")
    end
    p = gcp('nocreate');
    if isempty(p)
        p = parpool("Processes", 1);
    end
    f = parfeval(p, @failRun, 0, aaa);
end
function out = failRun(inputVar)
    % Element-wise product allocates a second array of the same size.
    % (The matrix product inputVar*inputVar would instead fail with a
    % dimension error, because inputVar is not square.)
    out = inputVar.*inputVar;
end
In my case, my computer froze and I had to force quit MATLAB. I think if I had waited long enough, MATLAB would have crashed on its own. You will need to play around with the variable size, but that could perhaps work. If you do not have luck with this, let me know and I can have a chat with Dev to see if they have suggestions on a test case.
On a side note, you mentioned:
"I tried sending a future with the function @exit, but the worker did not "fail" but rather returned the future with a finished/error status. Are there simple ways to induce an out-of-memory error or a segmentation-error?"
Did the fact that the future failed, but the task did not have a state of "Failed" cause confusion? Would love to hear your feedback/input on that.
Walter Roberson
2023-4-10
There are obviously some transient error conditions; for example running out of overall system memory might not happen once a pool member is killed.
But what happens in the case that the error is repeatable, such as a future that triggers a request for more memory than the system has? The future would be requeued after it dies the first time... and then it dies the second time. Does it keep getting requeued until eventually it has killed all of the workers, or does it only get one second chance?
Sam Marshalik
2023-4-12
Hey @Walter Roberson currently, a parfeval future will continue to run on the remaining workers even after it causes a previous worker to crash. The future will go from worker to worker until all of the workers have crashed. At the end of the run, the user should have insight as to what has caused the issue via the Error message associated with the queue. They can make adjustments to their code at that time to deal with the issue.
We are investigating adding a retry mechanism, similar to parfor, to help with this.
Walter Roberson
2023-4-12
I have never personally had to worry about crashing futures, but I can see that it could be useful in general to have some kind of configurable retry limit on any technology that automatically retries on failure. I would imagine that the most commonly used values would be 0 (no retries), 1 (one retry), inf (keep retrying), but I can imagine that in some cases people might want (for example) 5 or 10 retries.
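Until something like that exists, a bounded retry can be approximated inside the task itself. The helper below is a hypothetical sketch, not part of Parallel Computing Toolbox: it wraps a task in try/catch on the worker and gives up after maxRetries retries. It only handles MATLAB errors; if the worker process crashes, the wrapper dies with it.

```matlab
function [out, attempts, lastError] = runWithRetries(taskFcn, maxRetries, varargin)
    % Call taskFcn(varargin{:}) up to 1 + maxRetries times.
    % out       - task output, or [] if every attempt failed
    % attempts  - number of attempts actually made
    % lastError - MException from the final failed attempt, or [] on success
    out = [];
    lastError = [];
    attempts = 0;
    while attempts <= maxRetries
        attempts = attempts + 1;
        try
            out = taskFcn(varargin{:});
            lastError = [];          % success: clear any earlier failure
            return
        catch err
            lastError = err;         % remember the failure and try again
        end
    end
end
```

It could be submitted as, for example, `fut = parfeval(pool, @runWithRetries, 3, @myTask, 2, x)`, where myTask and x are placeholders for the real task and its input; maxRetries of 0 means a single attempt, and inf means retry until success.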
Mark Brandon
2023-4-12
@Sam Marshalik @Walter Roberson Thanks for the comments. I can confirm Sam's description of how parfeval works on a local system. I set up a parpool on a single node on our HPC system, and I defined a low limit for the maximum memory for the job. I started with 27 workers and a total of 50 GB of memory for all workers combined. The futures for the job would have moments where they needed to use ~15 GB each. As the parfeval job ran, the number of active workers quickly dropped down to about 3. That said, the submitted futures all ran successfully, despite the chaos of the crashing (failing) workers.