getAllOutputArguments only returns one result per node, not per core

I'm running a parallel job on an SGE cluster, asking for 48 workers. I've set procsPerNode = 8 inside parallelSubmitFcn.m, so the job should use 6 nodes, and indeed it does. I can see that while it's running.
The problem is that the result returned by getAllOutputArguments only contains values in 6 of the 48 entries, as though there were only 1 worker per node.
My code simply returns 'labindex', and so the result should just be the integers 1..48. Below is the parallel job object, followed by the results, after the run. As you can see, it claims to have run all 48 tasks. However, the result only contains the first 6.
What's going on?
Thanks
-Don
--------------------------------------
pjob =
Parallel Job ID 144 Information
===============================
UserName : don
State : finished
SubmitTime : Tue May 08 15:32:20 EDT 2012
StartTime : Tue May 08 15:32:21 EDT 2012
Running Duration : 0 days 0h 0m 3s
- Data Dependencies
FileDependencies : /Users/don/math/MVPA/donsPause.m
PathDependencies : {}
- Associated Task(s)
Number Pending : 0
Number Running : 0
Number Finished : 48
TaskID of errors :
- Scheduler Dependent (Parallel Job)
MaximumNumberOfWorkers : 48
MinimumNumberOfWorkers : 48
>> getAllOutputArguments(pjob)
ans =
[1] [2] [3] [4] [5] [6] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] []
5 Comments
Don 2012-5-14
Here are my functions:
function results = DonSubmitParallelJob (minWorkers, maxWorkers)
% Locate the scheduler via the named configuration and build a parallel job
sched = findResource('scheduler', 'configuration', 'KIT Cluster Configuration');
pjob = createParallelJob(sched);
set(pjob, 'FileDependencies', {'donsPause.m'});
set(pjob, 'MaximumNumberOfWorkers', maxWorkers);
set(pjob, 'MinimumNumberOfWorkers', minWorkers);
t = createTask(pjob, @donsPause, 1, {});  % one task, run on every worker
submit(pjob);
waitForState(pjob);  % block until the job finishes
results = getAllOutputArguments(pjob);
pjob  % display the job summary
end

function y = donsPause ()
y = labindex;  % each worker returns its own lab index
end
Don 2012-5-14
By the way, if I set procsPerNode = 1, the job never executes because it seems to be asking for 48 nodes, and there are only 10 in the cluster.


Accepted Answer

Don 2012-5-18
Solved! Running "qconf -ssconf" on the cluster showed that a parameter called "maxujobs" was set to 10. This is supposed to be the maximum number of jobs a user can run concurrently, but in practice it applied to TASKS: it capped the total number of tasks at 10. I had to ask my administrator to increase it to 100.
procsPerNode should indeed be set to 1. Then, a distributed or parallel 'job' actually submits as many jobs as there are 'tasks'. As long as 'maxujobs' is large enough, all these jobs will run concurrently, and then getAllOutputArguments will return one result per job (i.e., task).
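For anyone hitting the same limit, the scheduler configuration can be inspected directly on an SGE submit host. This is a sketch, assuming SGE's `qconf` tool is on the PATH; raising `maxujobs` requires administrator rights:

```shell
# Show the SGE scheduler configuration and filter for the per-user
# concurrent job cap mentioned in the answer above
qconf -ssconf | grep maxujobs

# Administrators can raise the limit by editing the scheduler
# configuration (opens it in an editor):
#   qconf -msconf
```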
Thanks for your help!

More Answers (1)

Thomas 2012-5-14
I just checked my parallelSubmitFcn file for our SGE cluster. We keep
procsPerNode = 1;
SGE has a different way of thinking about nodes. I run as many as 128 processes and get the results back correctly. We have a mixture of hardware, with some generations having 8 processors per node and others having 12. Setting procsPerNode = 1 makes everything work correctly with SGE, and it lets SGE use the processors remaining on each node after other applications have taken a few. Your systems may vary, but this works for us and allows us to backfill jobs. :)
4 Comments
Thomas 2012-5-14
Don, I'm not sure how you define nodes/processors in your cluster. We define a node as a physical node taking 1U of rack space, usually consisting of two quad-core or hex-core processors, giving 8-12 processors per node.
Don 2012-5-15
Ours is the same. Each node is a physical computer with two 6-core processors, so I have access to 10 nodes with 12 cores each. I don't manage the cluster and didn't configure it, which is part of my problem.
Thanks

