parpool use existing slurm job

If I've already started an interactive Slurm job with 5 nodes, how can I start a parpool using the resources allocated to that existing job?

Accepted Answer

Damian Pietrus 2024-1-16
Hello Frank,
Unfortunately, once you've already started a Slurm job across 5 nodes, there's currently no way for a parpool to use all of those resources. Instead, we have a few options, depending on whether MATLAB Parallel Server is installed on the cluster. You can run the ver command to list all of the products installed on the cluster.
If you do not see MATLAB Parallel Server listed in the ver output, you will only be able to use the resources of one node. You can call c=parcluster('local') and start as many workers in your pool as there are cores on the machine. If you do see it listed, you can set up a Slurm cluster profile as was mentioned above.
Once you have a Slurm cluster profile set up, you can either submit a batch job like Edric mentioned above or use a Slurm batch script that calls some MATLAB code which opens up a parpool across multiple nodes.
Take the following code as an example. The first script starts up a session on one machine, and asks for just enough resources to launch the MATLAB client. Notice that we are only asking for one machine and one core. This is because the "my_parallel_code" file will be asking for the bulk of the resources in a separate call to the scheduler. Also note that the total time for this "outer" job needs to be long enough for the "inner" parallel job to also queue up and finish.
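As a rough sketch of that check (not a definitive recipe), the snippet below looks for MATLAB Parallel Server in the ver output and, if it is missing, opens a single-node pool. Sizing the pool from the SLURM_CPUS_PER_TASK environment variable is an assumption added here for illustration: Slurm only sets it when the job requested --cpus-per-task, so the code falls back to the profile's default worker count.

```matlab
% Sketch only: check for MATLAB Parallel Server, else fall back to one node.
v = ver;                                   % struct array, one entry per installed product
hasMPS = any(strcmp({v.Name}, 'MATLAB Parallel Server'));

if ~hasMPS
    c = parcluster('local');               % single-node profile, as in the answer
    % Assumption: SLURM_CPUS_PER_TASK is only set if --cpus-per-task was requested.
    n = str2double(getenv('SLURM_CPUS_PER_TASK'));
    if isnan(n)
        n = c.NumWorkers;                  % profile default (typically the core count)
    end
    n = min(n, c.NumWorkers);              % pool size may not exceed NumWorkers
    if isempty(gcp('nocreate'))
        parpool(c, n);                     % open the pool only if none exists yet
    end
end
```

Capping the pool at the allocated core count keeps the workers inside what Slurm actually granted on that node.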
#!/bin/sh
#SBATCH -n 1 # 1 instance of MATLAB
#SBATCH --cpus-per-task=1 # 1 core per instance
#SBATCH --mem-per-cpu=4gb # 4 GB RAM per core
#SBATCH --time=2:30:00 # 2 hours, 30 minutes
# Add MATLAB to system path
# Modify as needed to load your MATLAB module/version
module load matlab/R2023b
# Run code
matlab -batch my_parallel_code
Here in "my_parallel_code", you can see that we first load the Slurm cluster profile we set up earlier. Next, we call parpool with the total number of workers that we want. This submits a separate job request to Slurm and should be where you request the bulk of your resources. Finally, we run our parallel code across the multiple workers/nodes that we requested.
function my_parallel_code
% Bring cluster profile into the workspace
c = parcluster('Slurm_profile_name_here');
% Specify total number of workers here. This can span multiple nodes
if isempty(gcp('nocreate')), c.parpool(50); end
% Actual parallel code here
parfor
....
end
% Function end
end
When you look at the squeue output, you should see two jobs -- the outer sbatch job that called MATLAB initially and is using one core and then the inner parpool job (with a name of Job#) that spans multiple nodes.
Let me know if you have any questions!
2 Comments
Frank 2024-1-16
TY, @Damian Pietrus. We do have Parallel Server installed and we're able to launch multi-node Matlab jobs using parpool. =)
In your example, since parpool launches a separate job, the batch job would then have to wait for the parpool-requested resources to become available before it could proceed. Depending on how long it takes for the parpool resources to be allocated, the batch job could time out before the parpool job is allocated resources.
Is there maybe a more manual way to launch parallel services on all of the nodes allocated to a running job that could then be "assembled" as a parpool?
Damian Pietrus 2024-1-17
Hey Frank,
As you noticed, the walltime of the outer job needs to be long enough for the inner job to sit in the queue, get resources, start up, and then finish. I've found that increasing that value to include a decent buffer is usually adequate, though this may vary depending on how busy the cluster is.
Right now there isn't a way of "assembling" the pool in the way that you've mentioned. However, we can work around this issue by using batch jobs like Edric mentioned. We have two options here.
If you'd like to keep everything on the cluster, your Slurm batch script can call an .m file that then calls the MATLAB batch command and exits shortly after. With this option, the "outer" job only stays open long enough to submit the other jobs to the queue. Once they are successfully submitted, the script/job can exit and leave the other jobs in the queue to start on their own. Once they are finished, you can start MATLAB to fetch the results.
% Example batch script
c=parcluster('Slurm_Profile_Here');
job1 = c.batch(@my_parallel_function, num_outputs, {function_inputs}, 'Pool', num_pool_workers);
...
jobN = c.batch(@my_parallel_function, num_outputs, {function_inputs}, 'Pool', num_pool_workers);
disp('Jobs Submitted')
exit
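For the "fetch the results" step, a minimal sketch might look like the following. It assumes the same profile name as the submission script and that the jobs were submitted with at least one output argument. Note that a job can reach the 'finished' state even if its task errored, in which case fetchOutputs raises that error.

```matlab
% Sketch only: run in a later MATLAB session, after the jobs have finished.
c = parcluster('Slurm_Profile_Here');          % same profile used at submission
finished = findJob(c, 'State', 'finished');    % jobs stored under this profile
for k = 1:numel(finished)
    job = finished(k);
    results = fetchOutputs(job);               % cell array of the function's outputs
    fprintf('Job %d returned %d output(s)\n', job.ID, numel(results));
end
```

Jobs stay in the profile's job storage location until they are deleted, so calling delete(job) once the results are saved keeps that folder from growing.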
The other option is to set up your own machine for remote job submission. As long as your cluster accepts SSH and SFTP connections, you can basically submit the same batch jobs in the previous example from your own machine, avoiding the "outer" job entirely -- it submits the Parallel Server job to the queue directly. If you're interested in this let me know and I can include some more info.


More Answers (2)

Venkat Siddarth Reddy
Edited: Venkat Siddarth Reddy 2024-1-15
Hi @Frank,
I understand that you are trying to create a parallel pool using the resources of a running Slurm job.
To achieve this, you will need to set up a Slurm profile in MATLAB, as this will enable the "parpool" function to access the Slurm cluster.
Additionally, you will need to use the "MATLAB Parallel Server" since you're aiming to utilize multiple nodes of the cluster.
For more information on Slurm profiles and using the MATLAB Parallel Server with Slurm jobs, please refer to the following documentation:
I hope this helps!
1 Comment
Frank 2024-1-15
I only see instructions for profiles that would start a new Slurm job. How can I use the resources allocated to my running Job?



Edric Ellis 2024-1-16
Further to @Venkat Siddarth Reddy's suggestion, what you might want to do is something like this:
clus = parcluster('mySlurmProfile'); % set up as per Venkat's post
job = batch(clus, 'myScriptThatUsesParfor', Pool=20);
What that batch command does is:
  1. Launch a job on your SLURM cluster with 21 workers
  2. When the job starts, the first worker becomes a non-interactive "client", and the remaining 20 are connected up as a parpool for use by that client
  3. That first worker executes your program myScriptThatUsesParfor
  4. Any parfor loops etc. inside that script use the 20 workers
Is that the sort of thing you want to achieve?
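To follow up on a job submitted this way, something like the sketch below could be used. It assumes the job variable from the batch call above is still in the workspace and that the script leaves its results as workspace variables.

```matlab
% Sketch only: follow up on the batch job submitted above.
wait(job);      % block until the job finishes (or fails)
diary(job);     % show the command-window output captured on the cluster
load(job);      % copy the script's workspace variables into this session
delete(job);    % optional: remove the job's stored data once you are done
```

Because batch returns immediately, the client session is free between submission and the wait call, so the wait can be deferred or run from a later session.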

Release

R2023a
