Plugin Scripts for Generic Schedulers
The generic scheduler interface provides complete flexibility to configure the interaction of the MATLAB® client, MATLAB workers, and a third-party scheduler. The plugin scripts define how MATLAB interacts with your setup.
This table lists the supported plugin script functions and the stage at which they are evaluated:
File Name | Stage |
---|---|
independentSubmitFcn.m | Submitting an independent job |
communicatingSubmitFcn.m | Submitting a communicating job |
getJobStateFcn.m | Querying the state of a job |
cancelJobFcn.m | Canceling a job |
cancelTaskFcn.m | Canceling a task |
deleteJobFcn.m | Deleting a job |
deleteTaskFcn.m | Deleting a task |
postConstructFcn.m | After creating a parallel.cluster.Generic instance |
These plugin scripts are evaluated only if they have the expected file name and are located in the folder specified by the PluginScriptsLocation property of the cluster.
Note

The independentSubmitFcn.m file must exist to submit an independent job, and the communicatingSubmitFcn.m file must exist to submit a communicating job.
Sample Plugin Scripts
Download Sample Plugin Scripts
To support usage of the generic scheduler interface, MathWorks® provides add-ons, or plugins, for the following third-party schedulers, which you can download from GitHub® repositories or the Add-On Manager and edit to meet your requirements. Choose the sample plugin scripts that most closely match your setup.
Plugin | GitHub Repository |
---|---|
Parallel Computing Toolbox™ plugin for MATLAB Parallel Server™ with Slurm | https://github.com/mathworks/matlab-parallel-slurm-plugin |
Parallel Computing Toolbox plugin for MATLAB Parallel Server with IBM Spectrum® LSF® | https://github.com/mathworks/matlab-parallel-lsf-plugin |
Parallel Computing Toolbox plugin for MATLAB Parallel Server with Grid Engine | https://github.com/mathworks/matlab-parallel-gridengine-plugin |
Parallel Computing Toolbox plugin for MATLAB Parallel Server with PBS | https://github.com/mathworks/matlab-parallel-pbs-plugin |
Parallel Computing Toolbox plugin for MATLAB Parallel Server with HTCondor | https://github.com/mathworks/matlab-parallel-htcondor-plugin |
Use one of these workflows to download the appropriate plugin scripts for your scheduler.
You can download the plugins from a GitHub repository.
Clone the GitHub repository from a command window on your machine. For example, to clone the repository for the Parallel Computing Toolbox plugin for MATLAB Parallel Server with Slurm, use:
git clone https://github.com/mathworks/matlab-parallel-slurm-plugin
Visit the GitHub page in a browser and download the plugin as a ZIP archive.
Alternatively, to install the add-ons from the MATLAB Add-On manager, go to the Home tab and, in the Environment section, click the Add-Ons icon. In the Add-On Explorer, search for the add-on and install it.
You can also download the plugins from MATLAB Central™ File Exchange.
If the MATLAB client is unable to directly submit jobs to the scheduler, MATLAB supports the use of the ssh protocol to submit commands to a remote cluster. If the client and the cluster nodes do not have a shared file system, MATLAB supports the use of sftp (SSH File Transfer Protocol) to copy job and task files between your computer and the cluster.
Modify Sample Scripts
You can set additional properties to customize how the client interacts with the cluster without modifying the plugin scripts. For more information, see Customize Behavior of Sample Plugin Scripts.
If your scheduler or cluster configuration is not fully supported by one of the repositories, you can modify the scripts of one of these packages to meet your requirements. For more information on how to write a set of plugin scripts for generic schedulers, see Write Custom Plugin Scripts.
Wrapper Scripts
The sample plugin scripts use wrapper scripts to simplify the implementation of independentSubmitFcn.m and communicatingSubmitFcn.m. These scripts are not required; however, using them is a good practice that makes your code more readable. This table describes these scripts:
File name | Description |
---|---|
independentJobWrapper.sh | Used in independentSubmitFcn.m to embed a call to the MATLAB executable with the appropriate arguments. It uses environment variables for the location of the executable and its arguments. For an example of its use, see Sample script for a SLURM scheduler. |
communicatingJobWrapper.sh | Used in communicatingSubmitFcn.m to distribute a communicating job in your cluster. This script implements the steps in Submit scheduler job to launch MPI process. For an example of its use, see Sample script for a SLURM scheduler. |
Write Custom Plugin Scripts
Note
When writing your own plugin scripts, it is a good practice to start by modifying one of the sample plugin scripts that most closely matches your setup. For a list of sample plugin scripts, see Sample Plugin Scripts.
independentSubmitFcn
When you submit an independent job to a generic cluster, the independentSubmitFcn.m function executes in the MATLAB client session.
The declaration line of this function must be:
function independentSubmitFcn(cluster,job,environmentProperties)
Each task in a MATLAB independent job corresponds to a single job on your scheduler. The purpose of this function is to submit N jobs to your third-party scheduler, where N is the number of tasks in the independent job. Each of these jobs must:
Set the five environment variables required by the worker MATLAB to identify the individual task to run. For more information, see Configure the worker environment.
Call the appropriate MATLAB executable to start the MATLAB worker and run the task. For more information, see Submit scheduler jobs to run MATLAB workers.
Configure the worker environment. This table identifies the five environment variables and values that must be set on the worker MATLAB to run an individual task:
Environment Variable Name | Environment Variable Value |
---|---|
PARALLEL_SERVER_DECODE_FUNCTION | 'parallel.cluster.generic.independentDecodeFcn' |
PARALLEL_SERVER_STORAGE_CONSTRUCTOR | environmentProperties.StorageConstructor |
PARALLEL_SERVER_STORAGE_LOCATION | environmentProperties.StorageLocation |
PARALLEL_SERVER_JOB_LOCATION | environmentProperties.JobLocation |
PARALLEL_SERVER_TASK_LOCATION | environmentProperties.TaskLocations{n} for the nth task |
Many schedulers support copying the client environment as part of the submission command. If so, you can set the previous environment variables in the client, so the scheduler can copy them to the worker environment. If not, you must modify your submission command to forward these variables.
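If your scheduler does not copy the client environment automatically, the submission command can name the variables explicitly. As a sketch, with SLURM you can pass a comma-separated variable list to the real `--export` option of `sbatch`; the command below is only built and printed, not executed, and the wrapper script name follows the examples in this section.

```shell
#!/bin/sh
# Sketch: forward the five required variables by name with SLURM's
# --export option instead of relying on the scheduler to copy the
# whole client environment. The command is printed, not executed.
exportVars="PARALLEL_SERVER_DECODE_FUNCTION,PARALLEL_SERVER_STORAGE_CONSTRUCTOR"
exportVars="${exportVars},PARALLEL_SERVER_STORAGE_LOCATION"
exportVars="${exportVars},PARALLEL_SERVER_JOB_LOCATION,PARALLEL_SERVER_TASK_LOCATION"

commandToRun="sbatch --ntasks=1 --export=${exportVars} independentJobWrapper.sh"
echo "${commandToRun}"
```

Note that `--export` with an explicit list replaces, rather than extends, the default of exporting the full environment, so list every variable the wrapper needs.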
Submit scheduler jobs to run MATLAB workers. Once the five required parameters for a given job and task are defined on a worker, the task is run by calling the MATLAB executable with suitable arguments. The MATLAB executable to call is defined in environmentProperties.MatlabExecutable. The arguments to pass are defined in environmentProperties.MatlabArguments.
Note

If you cannot submit directly to your scheduler from the client machine, see Submit from a Remote Host for instructions on how to submit using ssh.
Sample script for a SLURM scheduler. This script shows a basic submit function for a SLURM scheduler with a shared file system. For a more complete example, see Sample Plugin Scripts.
function independentSubmitFcn(cluster,job,environmentProperties)
% Specify the required environment variables.
setenv('PARALLEL_SERVER_DECODE_FUNCTION', 'parallel.cluster.generic.independentDecodeFcn');
setenv('PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor);
setenv('PARALLEL_SERVER_STORAGE_LOCATION', environmentProperties.StorageLocation);
setenv('PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation);
% Specify the MATLAB executable and arguments to run on the worker.
% These are used in the independentJobWrapper.sh script.
setenv('PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable);
setenv('PARALLEL_SERVER_MATLAB_ARGS', environmentProperties.MatlabArguments);
for ii = 1:environmentProperties.NumberOfTasks
    % Specify the environment variable required to identify which task to run.
    setenv('PARALLEL_SERVER_TASK_LOCATION', environmentProperties.TaskLocations{ii});
    % Specify the command to submit the job to the SLURM scheduler.
    % SLURM will automatically copy environment variables to workers.
    commandToRun = 'sbatch --ntasks=1 independentJobWrapper.sh';
    [cmdFailed, cmdOut] = system(commandToRun);
end
end
The previous example submits a simple bash script, independentJobWrapper.sh, to the scheduler. The independentJobWrapper.sh script embeds the MATLAB executable and arguments using environment variables:
#!/bin/sh
# PARALLEL_SERVER_MATLAB_EXE - the MATLAB executable to use
# PARALLEL_SERVER_MATLAB_ARGS - the MATLAB args to use
exec "${PARALLEL_SERVER_MATLAB_EXE}" ${PARALLEL_SERVER_MATLAB_ARGS}
communicatingSubmitFcn
When you submit a communicating job to a generic cluster, the communicatingSubmitFcn.m function executes in the MATLAB client session.
The declaration line of this function must be:
function communicatingSubmitFcn(cluster,job,environmentProperties)
The purpose of this function is to submit a single job to your scheduler. This job must:
Set the four environment variables required by the MATLAB workers to identify the job to run. For more information, see Configure the worker environment.
Call MPI to distribute your job to N MATLAB workers. N corresponds to the maximum value specified in the NumWorkersRange property of the MATLAB job. For more information, see Submit scheduler job to launch MPI process.
Configure the worker environment. This table identifies the four environment variables and values that must be set on the worker MATLAB to run a task of a communicating job:
Environment Variable Name | Environment Variable Value |
---|---|
PARALLEL_SERVER_DECODE_FUNCTION | 'parallel.cluster.generic.communicatingDecodeFcn' |
PARALLEL_SERVER_STORAGE_CONSTRUCTOR | environmentProperties.StorageConstructor |
PARALLEL_SERVER_STORAGE_LOCATION | environmentProperties.StorageLocation |
PARALLEL_SERVER_JOB_LOCATION | environmentProperties.JobLocation |
Many schedulers support copying the client environment as part of the submission command. If so, you can set the previous environment variables in the client, so the scheduler can copy them to the worker environment. If not, you must modify your submission command to forward these variables.
Submit scheduler job to launch MPI process. After you define the four required parameters for a given job, run your job by launching N worker MATLAB processes using mpiexec. mpiexec is software shipped with the Parallel Computing Toolbox that implements the Message Passing Interface (MPI) standard to allow communication between the worker MATLAB processes. For more information about mpiexec, see the MPICH home page.
To run your job, you must submit a job to your scheduler, which executes the following steps. Note that matlabroot refers to the MATLAB installation location on your worker nodes.

Request N processes from the scheduler. N corresponds to the maximum value specified in the NumWorkersRange property of the MATLAB job.

Call mpiexec to start worker MATLAB processes. The number of worker MATLAB processes to start on each host should match the number of processes allocated by your scheduler. The mpiexec executable is located at matlabroot/bin/mw_mpiexec.

The mpiexec command automatically forwards environment variables to the launched processes. Therefore, ensure the environment variables listed in Configure the worker environment are set before running mpiexec.

To learn more about options for mpiexec, see Using the Hydra Process Manager.
Note

For a complete example of the previous steps, see the communicatingJobWrapper.sh script provided with any of the sample plugin scripts in Sample Plugin Scripts. Use this script as a starting point if you need to write your own script.
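The steps above can be condensed into a rough, runnable sketch. This is not the shipped communicatingJobWrapper.sh: the default values below are placeholders, the worker count falls back to SLURM's SLURM_NTASKS, and the launch command is printed rather than executed so the sketch runs without a cluster.

```shell
#!/bin/sh
# Rough sketch of a communicating job wrapper (not the shipped
# communicatingJobWrapper.sh). Assumes the variables set by the sample
# communicatingSubmitFcn; the defaults below are placeholders.
matlabRoot="${PARALLEL_SERVER_CMR:-/usr/local/MATLAB}"
matlabExe="${PARALLEL_SERVER_MATLAB_EXE:-matlab}"
matlabArgs="${PARALLEL_SERVER_MATLAB_ARGS:-}"   # from environmentProperties.MatlabArguments
numWorkers="${SLURM_NTASKS:-4}"                 # processes allocated by the scheduler

# mw_mpiexec ships with MATLAB Parallel Server under matlabroot/bin.
launchCommand="${matlabRoot}/bin/mw_mpiexec -n ${numWorkers} ${matlabExe} ${matlabArgs}"

# Print instead of exec so the sketch is runnable anywhere.
echo "${launchCommand}"
```

A production wrapper would exec the command instead of printing it, and would rely on mpiexec forwarding the PARALLEL_SERVER_* environment variables to each worker process.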
Sample script for a SLURM scheduler. The following script shows a basic submit function for a SLURM scheduler with a shared file system. The submitted job is contained in a bash script, communicatingJobWrapper.sh. This script implements the relevant steps in Submit scheduler job to launch MPI process for a SLURM scheduler. For a more complete example, see Sample Plugin Scripts.
function communicatingSubmitFcn(cluster,job,environmentProperties)
% Specify the four required environment variables.
setenv('PARALLEL_SERVER_DECODE_FUNCTION', 'parallel.cluster.generic.communicatingDecodeFcn');
setenv('PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor);
setenv('PARALLEL_SERVER_STORAGE_LOCATION', environmentProperties.StorageLocation);
setenv('PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation);
% Specify the MATLAB executable and arguments to run on the worker.
% Specify the location of the MATLAB install on the cluster nodes.
% These are used in the communicatingJobWrapper.sh script.
setenv('PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable);
setenv('PARALLEL_SERVER_MATLAB_ARGS', environmentProperties.MatlabArguments);
setenv('PARALLEL_SERVER_CMR', cluster.ClusterMatlabRoot);
numberOfTasks = environmentProperties.NumberOfTasks;
% Specify the command to submit a job to the SLURM scheduler which
% requests as many processes as tasks in the job.
% SLURM will automatically copy environment variables to workers.
commandToRun = sprintf('sbatch --ntasks=%d communicatingJobWrapper.sh', numberOfTasks);
[cmdFailed, cmdOut] = system(commandToRun);
end
getJobStateFcn
When you query the state of a job created with a generic cluster, the getJobStateFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function state = getJobStateFcn(cluster,job,state)
When using a third-party scheduler, it is possible that the scheduler can have more up-to-date information about your jobs than what is available to the toolbox from the local job storage location. This situation is especially true if your cluster does not share a file system with the client machine, where the remote file system could be slow in propagating large data files back to your local data location.
To retrieve that information from the scheduler, add a function called getJobStateFcn.m to the PluginScriptsLocation of your cluster.
The state passed into this function is the state derived from the local job storage. The body of this function can then query the scheduler to determine a more accurate state for the job and return it in place of the stored state. The function you write for this purpose must return a valid value for the state of a job object. Allowed values are 'pending', 'queued', 'running', 'finished', or 'failed'.
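To make the translation concrete, here is an illustrative mapping from SLURM state names (as reported by squeue or sacct) to the allowed values. The grouping of scheduler states is an assumption for this sketch; in a real plugin, the equivalent logic lives inside getJobStateFcn.m and is written in MATLAB.

```shell
#!/bin/sh
# Illustrative mapping from SLURM job states to the state values a
# getJobStateFcn implementation must return. The grouping below is an
# assumption for this sketch, not the shipped plugin logic.
map_state() {
    case "$1" in
        PENDING|CONFIGURING)      echo "queued" ;;
        RUNNING|COMPLETING)       echo "running" ;;
        COMPLETED)                echo "finished" ;;
        FAILED|CANCELLED|TIMEOUT) echo "failed" ;;
        # Unknown states: fall back to pending for this sketch.
        *)                        echo "pending" ;;
    esac
}

map_state RUNNING     # prints "running"
map_state COMPLETED   # prints "finished"
```

In getJobStateFcn.m itself, a safer fallback for unknown scheduler states is to return the stored state that MATLAB passed in.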
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Manage Jobs with Generic Scheduler.
cancelJobFcn
When you cancel a job created with a generic cluster, the cancelJobFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function OK = cancelJobFcn(cluster,job)
When you cancel a job created using the generic scheduler interface, by default this action affects only the job data in storage. To cancel the corresponding jobs on your scheduler, you must provide the scheduler with instructions on what to do and when to do it. To achieve this, add a function called cancelJobFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding jobs from the queue. The function must return a logical scalar indicating the success or failure of canceling the jobs on the scheduler.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Manage Jobs with Generic Scheduler.
cancelTaskFcn
When you cancel a task created with a generic cluster, the cancelTaskFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function OK = cancelTaskFcn(cluster,task)
When you cancel a task created using the generic scheduler interface, by default this action affects only the task data in storage. To cancel the corresponding job on your scheduler, you must provide the scheduler with instructions on what to do and when to do it. To achieve this, add a function called cancelTaskFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding job from the scheduler queue. The function must return a logical scalar indicating the success or failure of canceling the job on the scheduler.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Manage Jobs with Generic Scheduler.
deleteJobFcn
When you delete a job created with a generic cluster, the deleteJobFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function deleteJobFcn(cluster,job)
When you delete a job created using the generic scheduler interface, by default this action affects only the job data in storage. To remove the corresponding jobs on your scheduler, you must provide the scheduler with instructions on what to do and when to do it. To achieve this, add a function called deleteJobFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding jobs from the scheduler queue.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Manage Jobs with Generic Scheduler.
deleteTaskFcn
When you delete a task created with a generic cluster, the deleteTaskFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function deleteTaskFcn(cluster,task)
When you delete a task created using the generic scheduler interface, by default this action affects only the task data in storage. To remove the corresponding job on your scheduler, you must provide the scheduler with instructions on what to do and when to do it. To achieve this, add a function called deleteTaskFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding job from the scheduler queue.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Manage Jobs with Generic Scheduler.
postConstructFcn
After you create an instance of your cluster in MATLAB, the postConstructFcn.m function executes in the MATLAB client session. For example, the following line of code creates an instance of your cluster and runs the postConstructFcn function associated with the 'myProfile' cluster profile:
c = parcluster('myProfile');
The declaration line of the postConstructFcn function must be:
function postConstructFcn(cluster)
If you need to perform custom configuration of your cluster before its use, add a function called postConstructFcn.m to the PluginScriptsLocation of your cluster. The body of this function can contain any extra setup steps you require.
Create Cluster Profile and Validate Plugin Scripts
To verify that your custom plugin scripts work well and that the parallel computing products are installed and configured correctly on your cluster, you must set up a cluster profile on a MATLAB client using the generic scheduler interface. For detailed instructions, see Create a Generic Cluster Profile.
Add User Customization
If you need to modify the functionality of your plugin scripts at run time, use the AdditionalProperties property of the generic scheduler interface.
As an example, consider the SLURM scheduler. The submit command for SLURM accepts a --nodelist argument that allows you to specify the nodes you want to run on. You can change the value of this argument without having to modify your plugin scripts. To add this functionality, include the following code pattern in either your independentSubmitFcn.m or communicatingSubmitFcn.m script:
% Basic SLURM submit command
submitCommand = 'sbatch';
% Check if property is defined
if isprop(cluster.AdditionalProperties, 'NodeList')
    % Add appropriate argument and value to submit string
    submitCommand = [submitCommand ' --nodelist=' cluster.AdditionalProperties.NodeList];
end
For an example of how to use this coding pattern, see the submit functions of the scripts in Sample Plugin Scripts.
Alternatively, to modify the submit command for both independent and communicating jobs, include the code pattern above in your getCommonSubmitArgs function. The getCommonSubmitArgs function is a helper function included in the sample plugin scripts that you can use to modify the submit command for both types of jobs.
Set Additional Properties from Cluster Profile Manager
With the modification to your scripts in the previous example, you can add an AdditionalProperties entry to your generic cluster profile to specify a list of nodes to use. This provides a method of documenting customization added to your plugin scripts for anyone you share the cluster profile with.
To add the NodeList property to your cluster profile:
Start the Cluster Profile Manager from the MATLAB desktop by selecting Parallel > Create and Manage Cluster Profiles.
Select the profile for your generic cluster, and click Edit.
Navigate to the AdditionalProperties table, and click Add.
Enter NodeList as the Name.
Set String as the Type.
Set the Value to the list of nodes.
Set Additional Properties from MATLAB Command Line
With the modification to your scripts in Add User Customization, you can edit the list of nodes from the MATLAB command line by setting the appropriate property of the cluster object before submitting a job:
c = parcluster;
c.AdditionalProperties.NodeList = 'gpuNodeName';
j = c.batch('myScript');
Display the AdditionalProperties object to see all currently defined properties and their values:
>> c.AdditionalProperties

ans =

  AdditionalProperties with properties:

                 ClusterHost: 'myClusterHost'
                    NodeList: 'gpuNodeName'
    RemoteJobStorageLocation: '/tmp/jobs'
Manage Jobs with Generic Scheduler
The first requirement for job management is to identify the jobs on the scheduler corresponding to a MATLAB job object. When you submit a job to the scheduler, the command that does the submission in your submit function can return some data about the job from the scheduler. This data typically includes a job ID. By storing that scheduler job ID with the MATLAB job object, you can later refer to the scheduler job by this job ID when you send management commands to the scheduler. Similarly, you can store a map of MATLAB task IDs to scheduler job IDs to help manage individual tasks. You can use the setJobClusterData (Parallel Computing Toolbox) function to store this cluster data.
Save Job Scheduler Data
This example shows how to modify the independentSubmitFcn.m function to parse the output of each command submitted to a SLURM scheduler. You can use regular expressions to extract the scheduler job ID for each task and then store it using setJobClusterData.
% Pattern to extract scheduler job ID from SLURM sbatch output
searchPattern = '.*Submitted batch job ([0-9]+).*';
jobIDs = cell(numberOfTasks, 1);
for ii = 1:numberOfTasks
    setenv('PARALLEL_SERVER_TASK_LOCATION', environmentProperties.TaskLocations{ii});
    commandToRun = 'sbatch --ntasks=1 independentJobWrapper.sh';
    [cmdFailed, cmdOut] = system(commandToRun);
    jobIDs{ii} = regexp(cmdOut, searchPattern, 'tokens', 'once');
end
% Set the job IDs on the job cluster data.
cluster.setJobClusterData(job, struct('ClusterJobIDs', {jobIDs}));
Retrieve Job Scheduler Data
This example modifies the cancelJobFcn.m to cancel the corresponding jobs on the SLURM scheduler. The example uses the getJobClusterData (Parallel Computing Toolbox) function to retrieve job scheduler data.
function OK = cancelJobFcn(cluster, job)
% Get the scheduler information for this job.
data = cluster.getJobClusterData(job);
jobIDs = data.ClusterJobIDs;
for ii = 1:length(jobIDs)
    % Tell the SLURM scheduler to cancel the job.
    commandToRun = sprintf('scancel ''%s''', jobIDs{ii});
    [cmdFailed, cmdOut] = system(commandToRun);
end
OK = true;
end
Submit from a Remote Host
If the MATLAB client is unable to submit directly to your scheduler, use parallel.cluster.RemoteClusterAccess (Parallel Computing Toolbox) to establish a connection and run commands on a remote host.

The following code executes a command on a remote host, remoteHostname, as the user, user.
% This will prompt for the password of user
access = parallel.cluster.RemoteClusterAccess.getConnectedAccess('remoteHostname', 'user');
% Execute a command on remoteHostname
[cmdFailed, cmdOut] = access.runCommand(commandToRun);
For an example of plugin scripts using remote host submission, see the remote submission mode in Sample Plugin Scripts.
Submit Without a Shared File System
If the MATLAB client does not have a shared file system with the cluster nodes, use parallel.cluster.RemoteClusterAccess (Parallel Computing Toolbox) to establish a connection and copy job and task files between the client and cluster nodes.

The parallel.cluster.RemoteClusterAccess object uses the ssh protocol, and hence requires an ssh daemon service running on the remote host. To establish a connection, you must either have an ssh agent running on your machine, or provide one of the following:
A user name and password
A valid identity file
Proper responses for multifactor authentication
When the client does not have a shared file system with the cluster nodes, you must specify both a local job storage location to use on the client and a remote job storage location to use on the cluster. The remote job storage location must be available to all nodes of the cluster.
parallel.cluster.RemoteClusterAccess uses file mirroring to continuously synchronize the local job and task files with those on the cluster. When file mirroring first starts, local job and task files are uploaded to the remote job storage location. As the job executes, the file mirroring continuously checks the remote job storage location for new files and updates, and copies the files to the local storage on the client. This procedure ensures the MATLAB client always has an up-to-date view of the jobs and tasks executing on the scheduler.
This example connects to the remote host, remoteHostname, as the user, user, and establishes /remote/storage as the remote cluster storage location to synchronize with. It then starts file mirroring for a job, copying the local files of the job to /remote/storage on the cluster, and then syncing any changes back to the local machine.
% This will prompt for the password of user
access = parallel.cluster.RemoteClusterAccess.getConnectedAccessWithMirror('remoteHostname', '/remote/storage', 'user');
% Start file mirroring for job
access.startMirrorForJob(job);