parpool use existing slurm job
16 views (last 30 days)
If I've already started an interactive Slurm job with 5 nodes, how can I start a parpool using the resources allocated to that existing job?
0 comments
Accepted Answer
Damian Pietrus
2024-1-16
Hello Frank,
Unfortunately, once you've already started a Slurm job across 5 nodes, there is currently no way for MATLAB to attach to those resources. Instead, there are a few options depending on whether MATLAB Parallel Server is installed on the cluster. You can run the ver command to list all of the products installed on the cluster.
If you do not see MATLAB Parallel Server listed in the ver output, you will only be able to use the resources of one node. You can call c = parcluster('local') and start as many workers in your pool as there are cores on the machine. If you do see it listed, you can set up a Slurm cluster profile as mentioned above.
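The single-node fallback can be sketched like this (the 'local' profile ships with Parallel Computing Toolbox, so no Parallel Server is needed; sizing the pool to NumWorkers is one reasonable default):

```matlab
% Single-node fallback: use the 'local' profile shipped with
% Parallel Computing Toolbox (no MATLAB Parallel Server required).
c = parcluster('local');

% NumWorkers defaults to the number of cores on this machine,
% so this fills the node without oversubscribing it.
pool = c.parpool(c.NumWorkers);

% ... run parfor / parfeval work here ...

delete(pool);  % shut the pool down when finished
```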
Once you have a Slurm cluster profile set up, you can either submit a batch job as Edric mentioned above, or use a Slurm batch script that calls MATLAB code which opens a parpool across multiple nodes.
Take the following code as an example. The first script starts a session on one machine and asks for just enough resources to launch the MATLAB client. Notice that we are only asking for one machine and one core: the "my_parallel_code" function will request the bulk of the resources in a separate call to the scheduler. Also note that the total time for this "outer" job must be long enough for the "inner" parallel job to queue up and finish as well.
#!/bin/sh
#SBATCH -n 1 # 1 instance of MATLAB
#SBATCH --cpus-per-task=1 # 1 core per instance
#SBATCH --mem-per-cpu=4gb # 4 GB RAM per core
#SBATCH --time=2:30:00 # 2 hours, 30 minutes
# Add MATLAB to system path
# Modify as needed to load your MATLAB module/version
module load matlab/R2023b
# Run code
matlab -batch my_parallel_code
Here in "my_parallel_code", you can see that we first load the Slurm cluster profile we created earlier. Next, we call parpool with the total number of workers that we want. This submits a separate job request to Slurm, and it is where you should request the bulk of your resources. Finally, we run our parallel code across the multiple workers/nodes that we requested.
function my_parallel_code
% Bring cluster profile into the workspace
c = parcluster('Slurm_profile_name_here');
% Specify total number of workers here. This can span multiple nodes
if isempty(gcp('nocreate')), c.parpool(50); end
% Actual parallel code here
parfor idx = 1:100  % replace 100 with your iteration count
    % ... loop body here ...
end
% Function end
end
When you look at the squeue output, you should see two jobs: the "outer" sbatch job that launched MATLAB and is using one core, and the "inner" parpool job (with a name of Job#) that spans multiple nodes.
Let me know if you have any questions!
2 comments
Damian Pietrus
2024-1-17
Hey Frank,
As you noticed, the walltime of the outer job needs to be long enough for the inner job to sit in the queue, get resources, start up, and then finish. I've found that increasing that value to include a decent buffer is usually adequate, though this may vary depending on how busy the cluster is.
Right now there isn't a way of "assembling" the pool in the way that you've mentioned. However, we can work around this issue by using batch jobs like Edric mentioned. We have two options here.
If you'd like to keep everything on the cluster, your Slurm batch script can call an .m file that calls the MATLAB batch command and exits shortly after. With this option, the "outer" job stays open only long enough to submit the other jobs to the queue. Once they are successfully submitted, the script/job can exit and leave the other jobs in the queue to start on their own. Once they finish, you can start MATLAB and fetch the results.
% Example batch script
c=parcluster('Slurm_Profile_Here');
job1 = c.batch(@my_parallel_function, num_outputs, {function_inputs}, 'Pool', num_pool_workers);
...
jobN = c.batch(@my_parallel_function, num_outputs, {function_inputs}, 'Pool', num_pool_workers);
disp('Jobs Submitted')
exit
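Later, you can reconnect and collect the outputs of those function-based batch jobs. A sketch of that step (the job ID of 1 and the single-output assumption are placeholders):

```matlab
% Reconnect to the same cluster profile used for submission
c = parcluster('Slurm_Profile_Here');

% Look up a previously submitted job by its ID (1 is a placeholder;
% c.Jobs lists everything this profile knows about)
job = c.findJob('ID', 1);

% Block until the job finishes, then collect its outputs
wait(job);
out = fetchOutputs(job);  % cell array, one cell per requested output

delete(job);  % optionally remove the job's data from the cluster
```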
The other option is to set up your own machine for remote job submission. As long as your cluster accepts SSH and SFTP connections, you can submit essentially the same batch jobs as in the previous example from your own machine, avoiding the "outer" job entirely -- it submits the Parallel Server job to the queue directly. If you're interested in this, let me know and I can include some more info.
More Answers (2)
Venkat Siddarth Reddy
2024-1-15
Edited: Venkat Siddarth Reddy
2024-1-15
I understand that you are trying to create a parallel pool using the resources of a running Slurm job.
To achieve this, you will need to set up a Slurm profile in MATLAB, as this will enable the "parpool" function to access the Slurm cluster.
Additionally, you will need to use the "MATLAB Parallel Server" since you're aiming to utilize multiple nodes of the cluster.
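If a cluster profile has already been exported for you (for example by a cluster administrator, via the Cluster Profile Manager), one way to make it available programmatically is parallel.importProfile; the .mlsettings file name below is a placeholder:

```matlab
% Import a previously exported cluster profile file
% ('mySlurmProfile.mlsettings' is a placeholder path/name)
profileName = parallel.importProfile('mySlurmProfile.mlsettings');

% The imported profile can now be used like any other
c = parcluster(profileName);
```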
For more information on Slurm profiles and using the MATLAB Parallel Server with Slurm jobs, please refer to the following documentation:
- https://www.mathworks.com/help/matlab-parallel-server/install-and-configure-matlab-parallel-server-for-slurm-1.html
- https://it.stonybrook.edu/help/kb/using-matlabs-parallel-server-for-multi-core-and-multi-node-jobs
- https://www.mathworks.com/matlabcentral/answers/1749950-running-parfor-on-multiple-nodes-using-slurm
I hope this helps!
Edric Ellis
2024-1-16
clus = parcluster('mySlurmProfile'); % set up as per Venkat's post
job = batch(clus, 'myScriptThatUsesParfor', Pool=20);
- Launch a job on your SLURM cluster with 21 workers
- When the job starts, the first worker becomes a non-interactive "client", and the remaining 20 are connected up as a parpool for use by that client
- That first worker executes your program myScriptThatUsesParfor
- Any parfor loops etc. inside that script use the 20 workers
Is that the sort of thing you want to achieve?
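To pick up the results once that job completes, something like the following should work (this assumes myScriptThatUsesParfor leaves its results in workspace variables; the variable name result is a placeholder):

```matlab
% Block until the batch job finishes
wait(job);

% For a script-based batch job, load copies the job's workspace
% variables into the current session ('result' is a placeholder name)
load(job, 'result');

% Alternatively, inspect everything the script produced:
% vars = load(job);  % returns a struct of all workspace variables

delete(job);  % clean up the job's data on the cluster when done
```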
0 comments