About spmd and GPU calculation on a cluster
Hi there, I am using MATLAB 2012b on a multi-node, multi-processor cluster with GPU-capable video cards installed. The cluster provides a task manager (qsub) to run code across multiple nodes and/or processors. I set up a MATLAB program to run on 1 node with 8 processors using spmd, and it runs without any problem. But since I have to run a batch of tasks with different parameters, I set up a script to qsub all my programs (100 MATLAB programs in total, each running spmd on 8 processors). The script loads the tasks quite fast, but I found that in this case it reports that the worker pool fails to open. If I run the programs one by one manually, it doesn't report the same error. Is anything wrong in this case?
By the way, I am wondering whether it would work if I call GPU processing with gpuArray within an spmd block. Is there any way for MATLAB to avoid conflicts and/or I/O errors in parallelism? Thanks
2 Comments
Jason Ross
2012-9-27
Edited: Jason Ross
2012-9-27
It would probably help if you could clarify somewhat. Given that you are using qsub, I'm guessing you are using PBS Pro or Torque. Are you submitting your work using the direct integration (on the Parallel menu), or via some other means? Are you using the generic interface?
When you say you wrote a script, are you talking about a MATLAB script, shell script, or something else? Also, when you say "batch" are you talking about the MATLAB batch command, or do you mean something else? You might want to look at batch to see if it might do what you need.
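If the MATLAB batch command fits your workflow, a minimal sketch looks like this (the function name and parameter here are placeholders, not from your code):

```matlab
% Sketch: submit a function as a batch job that opens its own pool.
% 'myTask' and paramValue are hypothetical stand-ins for your program.
paramValue = 42;
job = batch('myTask', 1, {paramValue}, 'matlabpool', 7);  % 7 pool workers + 1 running the task = 8

wait(job);                    % block until the job finishes
results = fetchOutputs(job);  % collect the function's return values
delete(job);                  % clean up the job from the scheduler
```

With a cluster profile configured for PBS/Torque, batch hands the job to qsub for you, so you would not need your own submission script.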
When you say "the work pool failed to open", do you mean that MATLAB says "the matlabpool failed to open", or is this an error from somewhere else?
When you do "run by run manually", how are you doing that? Through MATLAB? At the command line on your system?
For your spmd questions, you might be able to use "labindex" to do what you want, it would allow you to select the GPU you want the code to run on.
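As a rough sketch of that idea (assuming one GPU per worker on the node; sizes are arbitrary):

```matlab
% Sketch: pin each spmd worker to its own GPU via labindex.
matlabpool open 2        % assumes the node has 2 GPUs, one per worker
spmd
    gpuDevice(labindex);     % worker 1 -> GPU 1, worker 2 -> GPU 2
    A = gpuArray(rand(1000));% each worker computes on its selected GPU
    s = gather(sum(A(:)));   % bring the result back to the worker's CPU
end
matlabpool close
```

Selecting the device once at the top of the spmd block keeps the workers from contending for the same GPU.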
It might also help if you posted some example code snippets to show what you are doing.
Answers (1)
Jason Ross
2012-9-28
Some of the things make sense now.
For the "matlabpool failed to open" problems, I'd suspect that the scheduler has more compute resources than you have MATLAB licenses. So when you open a pool, you consume X licenses, and when you put 10X jobs on the scheduler, they try to check out 10X licenses when you only have 5X -- although the scheduler itself might have the resources to run the 10X jobs. If you wait for the first 5X jobs to complete, can you run the next batch without incident? I know some schedulers can be configured to check a license count and hold off jobs if the required number of licenses is not available -- you might want to ask about that.
I believe this solution addresses your GPU question. It sounds like it should be possible, although the access will be serialized:
You might also want to look into using the direct integration with PBS/Torque if possible, as it sounds like you might be able to avoid the step of writing the batch file to submit the jobs to the scheduler.
2 Comments
Jason Ross
2012-10-2
That is very puzzling. Could it be that the pool is already open on some of the workers and then the follow-up jobs get placed on the same one, in effect trying to open the same pool a second time?