Why are most workers idle after some time when using parfeval?
Hi,
I'm currently running some simulations which use parfeval to evaluate some functions many times in parallel. I noticed that when I submit a large number of jobs with parfeval, most workers go idle after some time, even though fewer than half of the jobs have finished.
I should mention that I run this function as a standalone application (compiled using mcc). Here's what my function roughly looks like:
In my case N=1e5, maxtime=inf, and myfunction takes a few seconds to several minutes to complete (there is some randomness involved, so every call is different).
function mySimulation(parameter, N, maxtime)
load('data.mat', 'largematrix');
c = parcluster('local');
pool = parpool(c.NumWorkers);
parpoolconstant = parallel.pool.Constant(largematrix);
f(1:10, 1:N) = parallel.FevalFuture; %initialize future array f
%submit parfeval jobs
for i=1:10
for idx = 1:N
f(i,idx) = parfeval(@myfunction, 2, i, parameter, parpoolconstant);
end
end
% counter
numcompleted = 0;
afterEach(f, @(out1,out2) update_progress, 0);
% save results once all futures in f(i,:) are finished for some i
doneFuture(1:10) = parallel.FevalFuture;
for i = 1:10
doneFuture(i) = afterAll(f(i,:), @(out1,out2) saveresults(out1,out2, i), 0);
end
% check whether all futures in f are completed and saved after maxtime seconds; if not, save the unfinished results.
check = wait(doneFuture, 'finished', maxtime);
if check
disp('all done in time!')
else
disp('not all results were finished on time')
for i=1:10
if ~strcmp(doneFuture(i).State, 'finished')
saveUnfinished(f(i,:), i);
end
end
end
function update_progress %display number of finished futures
numcompleted = numcompleted +1;
if ~mod(numcompleted,1000)
disp(['completed ', num2str(numcompleted), '/', num2str(10*N)])
end
end
function saveresults(res1,res2, i)
filename = sprintf('results_%d_%s_%d.mat', i, num2str(parameter), N); % %d for integers; num2str handles a numeric or char parameter
save(filename, 'res1', 'res2', 'i', 'parameter', 'N')
end
function saveUnfinished(F,i) % function to save the finished entries of f(i,:)
if ~isempty(F(strcmp('finished', {F.State})))
[res1, res2] = fetchOutputs(F(strcmp('finished', {F.State})));
filename = sprintf('unfinished_results_%d_%s_%d.mat', i, num2str(parameter), N); % %d for integers; num2str handles a numeric or char parameter
save(filename, 'res1', 'res2', 'i', 'parameter', 'N')
else
disp(['there are no results to save for i = ', num2str(i)])
end
end
end
The maxtime parameter is there to save all the unfinished results after maxtime seconds.
I am running my code on 42 cores. After an initial period in which all of them are at 100% usage (as seen in htop), only 2 of them are still in use (the load average in htop is now only 2.28), even though only half of the jobs have been completed. (I don't know for how long all cores are fully used, though; not necessarily until half of the jobs are done.)
Can someone provide me with some insight into what is happening here?
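To narrow down when the workers go idle, it may help to count how many futures are in each state at a given moment. The following is only a sketch: countFutureStates is a hypothetical helper that is not part of the code above, and it assumes the input is an array whose elements have a State property, such as the future array f.

```matlab
function counts = countFutureStates(f)
% Hypothetical diagnostic helper (an assumption, not from the original post).
% Returns a struct with the number of elements of f in each State value.
    states = {f.State};  % one state string per element of f
    names  = {'pending', 'queued', 'running', 'finished', 'failed', 'unavailable'};
    counts = struct();
    for k = 1:numel(names)
        % Count the elements whose State matches this name exactly
        counts.(names{k}) = nnz(strcmp(states, names{k}));
    end
end
```

Calling disp(countFutureStates(f)) from time to time, for example inside update_progress, would show whether futures are still queued while only a few are running.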
2 Comments
Edric Ellis
2020-5-18
There's nothing obvious going wrong here. Does the problem appear worse when you have a large number of parfeval requests? Is there any chance that you're running low on memory on your system? Does running with only a few workers make any difference? Are your parfeval requests transferring large amounts of data either in or out? (Use ticBytes and tocBytes to check)
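The data-transfer check mentioned in the comment above could look roughly like the sketch below. It assumes a parallel pool is available, and it uses @magic as a placeholder task; the real call to myfunction with its arguments would take its place.

```matlab
pool = gcp();                        % current pool (starts one if none exists)
ticBytes(pool);                      % start counting bytes transferred to and from the workers
g = parfeval(pool, @magic, 1, 200);  % placeholder task (assumption); substitute the real request
A = fetchOutputs(g);                 % blocks until the task has finished
bytes = tocBytes(pool);              % one row per worker: [sent to worker, received from worker]
disp(sum(bytes, 1))                  % total bytes sent and received across all workers
```

Large totals relative to the run time of each task would suggest that communication, rather than computation, is limiting the workers.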
Answers (0)