Parfor Freezing during computation

Hi,
I'm running an optimization in which the function being optimized uses a parfor loop to speed up its calculation.
The function looks something like this:
% DATA is a 10-by-10 matrix indexed as DATA(x,y), with x = 1:10 and y = 1:10 (shown only for the data format)
parfor x = 1:10
    for y = 1:10
        dosomething(DATA(x,y)); % uses quad and fzero, but I don't think that matters
    end
end
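For anyone who wants to try the same loop structure, here is a minimal self-contained sketch. It is only an illustration under stated assumptions: the real dosomething is not shown in this post, so the anonymous function below is a hypothetical stand-in that calls quad and fzero, and the random DATA and the result matrix are placeholders.

```matlab
% Self-contained sketch of the loop structure described above.
% DATA, result and dosomething are placeholders, not the original code.
DATA = rand(10, 10);          % placeholder data, 10-by-10 as in the post
result = zeros(10, 10);       % sliced output, one row per parfor iteration

% Hypothetical stand-in: integrate t.^2 + a over [0, 1] with quad, then
% solve z^3 + z - area = 0 with fzero, starting the search at z = 0.
dosomething = @(a) fzero(@(z) z.^3 + z - quad(@(t) t.^2 + a, 0, 1), 0);

parfor x = 1:10
    rowResult = zeros(1, 10); % temporary variable, reset on every iteration
    for y = 1:10
        rowResult(y) = dosomething(DATA(x, y));
    end
    result(x, :) = rowResult; % sliced assignment indexed by the loop variable
end
```

Collecting each row in a temporary variable and writing it back with a single sliced assignment keeps the parfor variable classification simple.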
The problem is this: the whole program takes 3-4 days to compute, and while the code runs (on a 16-core Xeon server) it will sometimes stall and stop iterating. This can happen after 15 minutes or after 40+ hours. CPU usage drops to zero, but there is no error message. (For reference, I have managed to run the entire program a couple of times without any issues, but I need it to be extremely reliable.) I also see a couple of workers popping in and out of the command list, but all of them are at 0.1% load. At first I thought it was a problem with the optimization routine, but I accidentally discovered that when I kill some workers from the command prompt, an error message pops up saying a worker was aborted, and then the program resumes iterating! However, it continues only on the remaining workers that I didn't kill. I found this by trial and error and have not managed to identify the cause.
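A minimal sketch of one way to narrow down where a stall like this occurs, assuming a release that provides parallel.pool.DataQueue: each worker reports the outer index it has just finished, and the client prints it with a timestamp. As before, dosomething and DATA are hypothetical placeholders, not the original code.

```matlab
% Sketch: report each finished outer iteration to the client, so the last
% message printed before a stall shows how far the loop got.
DATA = rand(10, 10);                             % placeholder data
dosomething = @(a) fzero(@(z) z.^3 + z - a, 0);  % hypothetical stand-in

q = parallel.pool.DataQueue;                     % queue from workers to client
afterEach(q, @(x) fprintf('%s  finished x = %d\n', ...
    datestr(now, 'HH:MM:SS'), x));               % callback runs on the client

result = zeros(10, 10);
parfor x = 1:10
    rowResult = zeros(1, 10);                    % temporary variable
    for y = 1:10
        rowResult(y) = dosomething(DATA(x, y));
    end
    result(x, :) = rowResult;                    % sliced output
    send(q, x);                                  % report completion of row x
end
```

If the messages stop while some indices are still missing, those indices are the iterations that never completed. This only localizes the stall; it does not explain or fix it.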
Any advice? I tried passing the dataset through a parallel.pool.Constant and reading only the values used in each specific parfor iteration. Referring to the example above:
C = parallel.pool.Constant(DATA);
parfor x = 1:10
    row = C.Value(x,:); % read only the row used by this iteration
    for y = 1:10
        dosomething(row(y)); % uses quad and fzero, but I don't think that matters
    end
end
But this approach produced the same stalls, and this time even sooner than usual (which might be random).
As I said, the problem seems completely random and sometimes does not happen at all in a given test run. I tried both simulated (random) data and a deterministic time series, and the issue occurred with both. Each time it happened, I stopped the program and restarted it, and it did not stall at the same place as the previous run.
PS: it also happened on my personal laptop (an old 2-core machine), so I'm pretty sure the problem isn't caused by the server I use. In the MATLAB window the code stalls and the Run button stays on Pause, but there is no CPU load and no error message.
Thanks
1 Comment
John Meluso, 2020-3-17
Hi Samuel, I'm curious if you ever found a reliable solution to this problem? I'm running into the same issue running a simulation on a computing cluster and -- despite the plethora of people who seem to have the same problem -- I haven't seen anyone else offer a solution. Thanks!

Answers (0)
