Why are most workers idle after some time when using parfeval?

Hi,
I'm currently running some simulations which use parfeval to evaluate some functions many times in parallel. I noticed that when I submit a large number of jobs with parfeval, after some time most workers go idle, even though not even half of the jobs have finished.
I should mention that I run this function as a standalone application (compiled using mcc). In my case N = 1e5, maxtime = inf, and myfunction takes anywhere from a few seconds to several minutes to complete (there is some randomness involved, so every call is different). Here's roughly what my function looks like:
function mySimulation(parameter, N, maxtime)
load('data.mat', 'largematrix');
c = parcluster('local');
pool = parpool(c.NumWorkers);
parpoolconstant = parallel.pool.Constant(largematrix); % make largematrix available on every worker
f(1:10, 1:N) = parallel.FevalFuture; %initialize future array f
%submit parfeval jobs
for i=1:10
for idx = 1:N
f(i,idx) = parfeval(@myfunction, 2, i, parameter, parpoolconstant);
end
end
% counter
numcompleted = 0;
afterEach(f, @(out1,out2) update_progress(), 0); % bump the progress counter as each future finishes
% save results once all futures in f(i,:) are finished for some i
doneFuture(1:10) = parallel.FevalFuture;
for i = 1:10
doneFuture(i) = afterAll(f(i,:), @(out1,out2) saveresults(out1,out2, i), 0);
end
% check if all futures in f are completed and saved after maxtime seconds; if not, save unfinished results.
check = wait(doneFuture, 'finished', maxtime);
if check
disp('all done in time!')
else
disp('not all results were finished in time')
for i=1:10
if ~strcmp(doneFuture(i).State, 'finished')
saveUnfinished(f(i,:), i);
end
end
end
function update_progress %display number of finished futures
numcompleted = numcompleted +1;
if ~mod(numcompleted, 1000)
disp(['completed ', num2str(numcompleted), '/', num2str(10*N)])
end
end
function saveresults(res1,res2, i)
filename = sprintf('results_%d_%s_%d.mat', i, parameter, N); % i and N are integers; parameter is assumed to be a char/string
save(filename, 'res1', 'res2', 'i', 'parameter', 'N')
end
function saveUnfinished(F,i) % function to save the finished entries of f(i,:)
finishedF = F(strcmp({F.State}, 'finished'));
if ~isempty(finishedF)
[res1, res2] = fetchOutputs(finishedF);
filename = sprintf('unfinished_results_%d_%s_%d.mat', i, parameter, N); % same format assumption as above
save(filename, 'res1', 'res2', 'i', 'parameter', 'N')
else
disp(['there are no results to save for i = ', num2str(i)])
end
end
end
The maxtime parameter is there to save all the unfinished results after maxtime seconds.
I am running my code on 42 cores, and I noticed that after an initial period in which all of them are used at 100% (as seen in htop), only 2 of them are still being used (the load average in htop is now only 2.28), while only half of the jobs have been completed. (I don't know for how long all cores were fully used, though; not necessarily until half of the jobs were done.)
Can someone provide me with some insight into what is happening here?
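For reference, here is the kind of check I could run from the client to see what state the futures are in and whether any of them captured an error (just a rough sketch, assuming f is still in scope; I haven't verified this on the cluster):
states = {f.State};
fprintf('queued: %d, running: %d, finished: %d\n', ...
    nnz(strcmp(states, 'queued')), nnz(strcmp(states, 'running')), nnz(strcmp(states, 'finished')));
% futures whose function call errored have a non-empty Error property
failed = f(~cellfun(@isempty, {f.Error}));
if ~isempty(failed)
    disp(failed(1).Error.message) % look at the first captured error
end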
2 Comments
Edric Ellis 2020-5-18
There's nothing obvious going wrong here. Does the problem appear worse when you have a large number of parfeval requests? Is there any chance that you're running low on memory on your system? Does running with only a few workers make any difference? Are your parfeval requests transferring large amounts of data either in or out? (Use ticBytes and tocBytes to check)
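For example, something along these lines around a small test batch would show how much data goes back and forth (just a sketch; pool is the pool object from your code, and the batch size of 100 is arbitrary):
g(1:100) = parallel.FevalFuture; % small test batch of requests
ticBytes(pool);
for idx = 1:100
    g(idx) = parfeval(@myfunction, 2, 1, parameter, parpoolconstant);
end
wait(g);
tocBytes(pool) % displays bytes transferred to and from each worker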
AS 2020-5-18
Edited: AS 2020-5-18
The problem only occurred with a very high number of parfeval requests. I'm certain that the system was not low on memory. Only a very small amount of data is transferred to the workers (apart from the one-time transfer via the parallel.pool.Constant), as expected, since the input of myfunction is just a few doubles and strings and the output is two small arrays of doubles.
I forgot to mention that I am running this as a standalone application (compiled using mcc).
I did not manage to reliably reproduce this behavior, however, since the problem did not occur again in subsequent runs with only a few minor changes to my function:
  • the order of the dimensions of f (to get a nicer ordering of the futures in f)
  • the way saveresults is called and works:
for i=1:10
doneFuture(i) = afterAll(f(:,i), @(F) saveresults(F, i), 0, 'PassFuture',true);
end
function saveresults(F, i)
[res1, res2] = fetchOutputs(F);
filename = sprintf('results_%d_%s_%d.mat', i, parameter, N); % same format assumption as above
save(filename, 'res1', 'res2', 'i', 'parameter', 'N')
end
I don't think these changes should affect anything, but things work fine on the same machine now.
I think maybe something went wrong in the saveresults function, causing the futures in doneFuture to never reach the 'finished' state; my function would then just keep waiting forever after all the futures in f are finished (if maxtime = inf). That doesn't explain why the counter stopped working, though.
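One way I could rule that out is to wrap the body of saveresults in try/catch and write any exception to a log file, so a failing callback cannot fail silently (just a sketch; the log file name is made up, and the sprintf format assumes parameter is a char/string):
function saveresults(F, i)
    try
        [res1, res2] = fetchOutputs(F);
        filename = sprintf('results_%d_%s_%d.mat', i, parameter, N);
        save(filename, 'res1', 'res2', 'i', 'parameter', 'N')
    catch err
        % record the error so I can see why the callback (and hence doneFuture) never finished
        fid = fopen(sprintf('saveresults_error_%d.log', i), 'w');
        fprintf(fid, '%s\n', getReport(err));
        fclose(fid);
    end
end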
To make things even more confusing, some (possibly related) problems occur only on certain machines. I ran the updated version several times on computing nodes of my university's cluster, with the maxtime parameter set to a (sufficiently high) finite value, and the following happened:
  • The counter did not function properly (nothing was displayed)
  • the saveresults function was not called or did not perform its intended role (no files were saved)
  • however, all the futures in f were finished, since the 'unfinished results' that were saved all contained N rows.
This might be a long shot (since I am very new to this stuff), but could this possibly be related to Java memory issues? I've had Java heap space problems with high values of N, until I increased the Java heap size in the MATLAB settings before compiling my code. I have no access to the Java memory settings on the clusters, however, and there it is set to only 246 MB (compared to 1969 MB on my own machines).
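For what it's worth, the heap limit the deployed application actually gets can be checked from inside MATLAB via the JVM's Runtime object (a quick sketch I'd run on the cluster node; it assumes the application was compiled with the JVM enabled):
rt = java.lang.Runtime.getRuntime();
fprintf('max Java heap:  %.0f MB\n', double(rt.maxMemory) / 1e6);
fprintf('used Java heap: %.0f MB\n', double(rt.totalMemory - rt.freeMemory) / 1e6);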
I will try to reproduce the problematic behavior(s) as soon as I have time, but since it only seems to occur when the number of parfeval requests is very high and only after ~10 h or so, that's pretty time-consuming.


Answers (0)
