Why does MATLAB Parallel Server validation fail or stall at the Job test/createJob stage (independent jobs)?


Accepted Answer

MathWorks Support Team, 2024-4-12, 0:00
Edited: MathWorks Support Team, 2024-4-12, 18:53
This can be caused by a number of issues, including, but not limited to, the following:
  • Misconfigured cluster profile settings
  • Installation and licensing errors
  • Mismatching, outdated, or corrupted integration scripts
  • The scheduler has rejected your submission
  • Insufficient resources on the cluster
  • Connection issues
  • Invalid user credentials
  • The headnode or client does not have necessary third-party scheduler utilities installed
  • The job was cancelled by the cluster or a cluster admin
  • Interference from user-made functions or path definitions

 

Misconfigured cluster profile settings

Make sure that your cluster profile settings are set up correctly. Otherwise, you may be submitting a job that the cluster cannot accept. If you're unsure what your cluster profile settings should look like, view the link below that matches your cluster's scheduler.
For MATLAB Job Scheduler
For HPC Pack
For other third-party Schedulers

 

Installation and licensing errors

Make sure that every node on the cluster has at least MATLAB and MATLAB Parallel Server installed. MATLAB Parallel Server is neither forwards nor backwards compatible, so the MATLAB release on the client must match a release of MATLAB Parallel Server that is installed on the cluster. For example, if MATLAB Parallel Server R2024a is installed on the cluster, clients can submit jobs from MATLAB R2024a. Multiple releases of MATLAB Parallel Server can be installed on the same cluster, so that clients on each of those MATLAB releases can submit jobs.
The nodes on the cluster need access only to a MATLAB Parallel Server license. To check whether a node can check out a license, follow these steps.
Linux clusters
  1. Open a terminal.
  2. Change to the bin directory of the MATLAB installation (for example: cd /usr/local/MATLAB/R2024a/bin)
  3. Run the command ./matlab -dmlworker -r "ver -support, exit" | grep Parallel
  4. The command above should list MATLAB Parallel Server with a license number. If MATLAB Parallel Server is not listed, it is not installed. If the license number is missing or unknown, it is not licensed properly.
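The Linux steps above can be wrapped in a small shell helper. This is a minimal sketch, not an official MathWorks tool: the function names are hypothetical, and it assumes the product line looks like "MATLAB Parallel Server ... License <number>", which may differ between releases. It matches the full product name, so a Parallel Computing Toolbox line is not mistaken for MATLAB Parallel Server.

```shell
# classify_mps_line: read `ver -support` output on stdin and print one of
# ok, not-licensed, or not-installed for MATLAB Parallel Server.
# ASSUMPTION: a licensed product line contains "License" followed by digits.
classify_mps_line() {
    line=$(grep -m 1 'MATLAB Parallel Server' || true)
    if [ -z "$line" ]; then
        echo "not-installed"
    elif printf '%s\n' "$line" | grep -Eq 'License[[:space:]]+[0-9]+'; then
        echo "ok"
    else
        echo "not-licensed"
    fi
}

# check_mps_license: run the check from the steps above on this node.
# $1 is the MATLAB bin directory; the default is the example path only.
check_mps_license() {
    matlab_bin=${1:-/usr/local/MATLAB/R2024a/bin}
    "$matlab_bin/matlab" -dmlworker -r "ver -support, exit" | classify_mps_line
}
```

Running check_mps_license /usr/local/MATLAB/R2024a/bin on each node gives a one-word status that is easy to compare across the cluster.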
Windows clusters
  1. Open Command Prompt. You can search for it in the Windows Start Menu.
  2. Change to the bin directory of the MATLAB installation (for example: cd "C:\Program Files\MATLAB\R2024a\bin")
  3. Run the command .\matlab.exe -dmlworker -batch "ver -support, exit" | findstr Parallel
  4. The command above should list MATLAB Parallel Server with a license number. If MATLAB Parallel Server is not listed, it is not installed. If the license number is missing or unknown, it is not licensed properly.

 

Mismatching, outdated, or corrupted integration scripts

If the cluster you are submitting to uses a third-party scheduler, make sure that your integration scripts match the cluster's scheduler and are the latest version. For example, if the cluster uses Slurm, use the latest version of the Slurm integration scripts. To download the latest version of the integration scripts, see the link below and then select the desired scheduler.
MATLAB Parallel Server integration scripts for third party schedulers
If you're worried that your integration scripts are corrupted (which may show up as "file or directory not found" errors), try downloading the integration scripts manually from our website rather than using git clone.

 

The scheduler has rejected your submission

Clusters may be set up with certain restrictions or rules. If you're unsure which of them your submission ran into, review the error message in your validation report. For example, your validation report may show a warning stating that your job submission was rejected because you requested more resources than are available to you. If you're unsure what restrictions or rules apply on the cluster, please reach out to your cluster's administrator(s).

 

Insufficient resources on the cluster

If you are attempting to submit a job that uses more resources than what's available on the cluster's hardware, then your job may fail or be stuck in a "queued" state. Make sure that you know what the cluster's maximum resources are and that you are requesting an amount that can be honored by its hardware. If you're unsure, you may use scheduler commands or reach out to your cluster administrator(s) for more information.
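As one example of a scheduler command, the sketch below totals the CPU cores that Slurm reports. It assumes a Slurm cluster. The function names are hypothetical, and other schedulers need their own equivalents. Because sinfo -N lists a node once per partition, the helper removes duplicate lines before summing.

```shell
# sum_node_cpus: read "hostname cpus" lines on stdin, drop duplicate lines,
# and print the total CPU count (0 for empty input).
sum_node_cpus() {
    sort -u | awk '{ total += $2 } END { print total + 0 }'
}

# cluster_cpu_total: ask Slurm for every node's hostname (%n) and CPU
# count (%c), without a header, and total them.
# ASSUMPTION: the cluster runs Slurm and sinfo is on the PATH.
cluster_cpu_total() {
    sinfo -N -h -o '%n %c' | sum_node_cpus
}
```

Compare the total with the number of workers your cluster profile requests. Limits set by the administrator may be lower than the hardware total.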

 

Connection issues

If the client submitting the job is unable to connect to the headnode or the headnode is unable to connect to the worker node(s), then the independent batch job will fail. Please make sure that the client can connect to the headnode and the headnode can connect to the worker node(s). If you're unsure if the correct ports are opened for communicating, please visit the links below, based on the cluster's scheduler.
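A quick way to test reachability outside MATLAB is to try opening a TCP connection from the client to the headnode, and from the headnode to a worker node. The sketch below uses bash's /dev/tcp redirection with a 3-second timeout. The function name, the host name, and the port in the usage note are placeholders, not values from your cluster.

```shell
# port_status: print "open" if a TCP connection to host $1 on port $2
# succeeds within 3 seconds, otherwise print "closed".
# Requires bash and the coreutils `timeout` command.
port_status() {
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}
```

For example, port_status headnode.example.com 22 checks SSH access to a hypothetical headnode. The ports that actually matter depend on your scheduler and cluster profile.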

 

Invalid user credentials

If the user provides invalid credentials to connect to the headnode, the job will fail. In some cases, the user may also need to authenticate to the worker nodes from the headnode, and the job will fail if that authentication fails. If your cluster is using a third-party scheduler, confirm outside of MATLAB that you can connect to the headnode with the same credentials and can submit basic jobs that don't use MATLAB.

 

The headnode or client does not have necessary third-party scheduler utilities installed

If you are using a third-party scheduler, you may need to make sure the scheduler's utilities are installed on the cluster and/or the client, depending on the scheduler.
For HPC Pack
For other third-party schedulers
Please make sure the scheduler's utilities are installed on the headnode.
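To see which utilities are missing, you can check the PATH on the machine that submits to the scheduler. This is a minimal sketch with a hypothetical function name. The Slurm commands in the usage note are only an example, and the exact commands your integration scripts call are an assumption.

```shell
# missing_utilities: print the name of each given command that cannot be
# found on the PATH. Prints nothing when all of them are available.
missing_utilities() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

On a Slurm headnode, missing_utilities sbatch squeue scancel sacct should print nothing. Any name it prints points to a utility that is not installed or not on the PATH.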

 

The job was cancelled by the cluster or a cluster admin

With a third-party scheduler, the job fails if the cluster or a cluster administrator cancels it for any reason. Check your scheduler's queue, and contact your cluster administrator if there are indications that the job was intentionally cancelled by somebody other than yourself.
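If the cluster runs Slurm with job accounting enabled, the recorded job state shows whether a job was cancelled. The sketch below is an illustration under those assumptions. The function names are hypothetical, and the job ID is the scheduler's own ID, not the MATLAB job ID.

```shell
# job_was_cancelled: succeed if a Slurm state string starts with CANCELLED,
# which also covers the "CANCELLED by <uid>" form.
job_was_cancelled() {
    case "$1" in
        CANCELLED*) return 0 ;;
        *) return 1 ;;
    esac
}

# report_job_state: look up the state of Slurm job $1 and report it.
# ASSUMPTION: sacct is on the PATH and accounting data is available.
report_job_state() {
    state=$(sacct -j "$1" -n -X -P -o State | head -n 1)
    if job_was_cancelled "$state"; then
        echo "job $1 was cancelled: $state"
    else
        echo "job $1 state: $state"
    fi
}
```

A state such as "CANCELLED by 1000" includes the numeric ID of the user who cancelled the job, which is useful to pass on to the cluster administrator.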

 

Interference from user-made functions or path definitions

If a user has a function on their path with the same name as a MATLAB Parallel Server function, it can shadow that function. This usually results in error messages that refer specifically to the function, such as "The arguments should be all cell arrays or not." Try temporarily removing any custom pathdef.m files or functions that share names with MATLAB Parallel Server functions.
