r2011b remote file mirroring issues (non-shared file system with generic scheduler interface)
Hello,
I am using a local Matlab client (r2011b) connected to our remote cluster with Torque/Moab as the resource manager/scheduler. I can successfully submit and run jobs using the generic scheduler interface, which is great, especially after reading the helpful example files for PBS in a nonshared file system included in the toolbox.
However, when I'm submitting a lot of jobs [hundreds], all with very small data sets (say, 10k each), the file mirroring seems to bog my Matlab client down and keep it in a near perpetual 'busy' state. It is also very, very slow, which I can't understand given the small amount of data and results. Maybe it has trouble mirroring when there are lots of files?
Essentially, I have a script that does some preprocessing and then goes through and submits all my jobs. When that's done, I close the Matlab client (since the jobs will be running for a while!). When I reopen it, I set up the scheduler like usual:
jm = torque_scheduler;
(where "torque_scheduler" is a script I modified from the Matlab example to return the interface) and then it immediately goes "busy" and just sits there. I can see it updating files in my working folder by refreshing the explorer window and checking timestamps, but it is INCREDIBLY slow-- it will be stuck doing this for hours.
My question is: is there some way to manually shut off file mirroring so that the client doesn't get so slow, and then either copy the data myself when the jobs are done or have the mirroring wait until I do a 'findJob' command? When it hits the crazy busy state, it becomes unresponsive for hours. Also of note: when I submit these jobs, my Matlab client begins consuming VERY large amounts of memory. It was up to 8 GB at one point towards the end of my submission routine. What's up with that? Is that all the zipping/sending it's doing?
One thing that makes me wary of the file mirroring is that I keep getting errors like this every so often:
Error using distcomp.genericscheduler/pSubmitJobCommon (line 64)
Job submission did not occur because the user supplied SubmitFcn (parallelSubmitFcn) errored.
Error using parallel.cluster.RemoteClusterAccess/startMirrorForJob (line 364)
Failed to start mirror for job with ID 20.
Error in parallelSubmitFcn (line 132)
remoteConnection.startMirrorForJob(job);
Error in distcomp.genericscheduler/pSubmitJobCommon (line 48)
feval(submitFcn, scheduler, job, setprop, args{:});
Error in distcomp.genericscheduler/pSubmitParallelJob (line 24)
scheduler.pSubmitJobCommon( job, scheduler.ParallelSubmitFcn );
Error in distcomp.simpleparalleljob/submit (line 47)
scheduler.pSubmitParallelJob(job);
Error in getsubnetworks (line 44)
submit(job);
Caused by:
Error using parallel.cluster.RemoteClusterAccess/waitForChoreToFinishOrError (line 919)
which makes me think I should just turn it off. It seems great for small workloads or small amounts of data, but it seems to get very overwhelmed when you throw a lot of jobs at it. Any suggestions? Is there an easy way to turn this off and do it manually that anybody has had success with?
Thanks-- Nick
PS: Is there any way to add some kind of taskbar or progress bar so that I can see what Matlab is doing in the background during its massive 'busy' binge?
Answers (7)
Konrad Malkowski
2012-3-15
Hi Nick,
Take a look at the getJobStateFcn.m for your scheduler.
Try commenting out the code inside the if ... else block. This should disable mirroring for jobs that are currently running but not finished.
If that doesn't improve the performance, try commenting out the whole if ... else ... end block, and then execute:
remoteConnection.doLastMirrorForJob(job);
% Store the fact that we have done the last mirror so we can shortcut in the future
data.HasDoneLastMirror = true;
scheduler.setJobSchedulerData(job, data);
on a job whose status you are interested in, once it completes.
Thomas
2012-3-7
You might be seeing residual effects of the upgrade to r2011b.
If the command window lists a number of old jobs here, there was probably some issue when you upgraded: MATLAB is trying to find metadata files of old jobs on the cluster and in your working directory, so it cannot complete the validation (and it does not time out either, since it is still working). If so, try the following procedure:
1. Clear the local_scheduler_data folder (you can also rename the folder): /Users/USERNAME/.matlab/local_scheduler_data/R2011a
2. Empty all the metadata files and job directories in the DataLocation (on your desktop): Parallel > Manage Configurations > select your configuration and find the Data Location, the folder where the job directories are stored.
3. Remove the files that MATLAB writes on the cluster for each job, i.e., Job#.lockstate, Job#.in.mat, Job#.out.mat, Job#.common.mat, Job#.jobout.mat, and Job#.state.mat
Hope this helps.
Konrad Malkowski
2012-3-12
Hi Nick,
Does the issue occur when you leave your MATLAB on, while the jobs are running on the cluster?
How many jobs do you have running at a time, and how many jobs are on your scheduler in finished state?
What do you mean by 10k data size? Could you provide a bit more detail?
Have you tried running a single job with multiple tasks, instead of running multiple single task jobs? This should reduce the file system load by at least a factor of 2.
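As a rough illustration of the single-job, multiple-task approach, here is a minimal sketch. It assumes the work items are independent of each other (so a distributed job fits), that torque_scheduler is the setup script from the question, and that processChunk and numChunks are hypothetical placeholders for your own function and number of data sets:

```matlab
% Sketch only: one job with many tasks instead of many one-task jobs.
jm = torque_scheduler;       % setup script from the question; returns the scheduler object
numChunks = 200;             % hypothetical number of small data sets

job = createJob(jm);
for k = 1:numChunks
    % processChunk is a hypothetical user function that takes a chunk
    % index and returns one output argument.
    createTask(job, @processChunk, 1, {k});
end
submit(job);                 % a single submission, so a single job to mirror

% Later, once the job has finished:
% results = getAllOutputArguments(job);
```

With this layout the client only has to track and mirror one job directory rather than hundreds, which is where the reduction in file system load would come from.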
Konrad Malkowski
2012-3-19
No problem :-) That is my background as well :-)
Regarding your question: to force these diagnostic messages to print in your MATLAB command window, use:
setSchedulerMessageHandler(@disp)
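If the plain output is hard to follow during a long mirroring run, a variation like the one below might help. It is only a sketch, and it assumes the handler is called with a single message string, the same way @disp would be:

```matlab
% Sketch: prefix each scheduler diagnostic message with the current time,
% so that long pauses between mirroring steps are easy to spot.
% Assumption: the handler receives one string argument.
setSchedulerMessageHandler(@(msg) fprintf('[%s] %s\n', datestr(now, 'HH:MM:SS'), msg));
```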
Nick
2012-3-19
1 comment
Konrad Malkowski
2012-3-20
I would recommend contacting support at this point. When you create the ticket, please include your current submission scripts and point technical support to this thread.