Troubleshooting and Debugging
Attached Files Size Limitations
The combined size of all attached files for a job is limited to 4 GB.
File Access and Permissions
Ensuring That Workers on Windows Operating Systems Can Access Files
By default, a worker on a Windows® operating system is installed as a service running as LocalSystem, so it does not have access to mapped network drives.
Often a network is configured to not allow services running as LocalSystem to access UNC or mapped network shares. In this case, you must run the mjs service under a different user with rights to log on as a service. See the section Set the User (MATLAB Parallel Server) in the MATLAB® Parallel Server™ System Administrator's Guide.
Task Function Is Unavailable
If a worker cannot find the task function, it returns the error message
    Error using ==> feval
    Undefined command/function 'function_name'.
The worker that ran the task did not have access to the function function_name. One solution is to make sure the location of the function's file, function_name.m, is included in the job's AdditionalPaths property. Another solution is to transfer the function file to the worker by adding function_name.m to the AttachedFiles property of the job.
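As a minimal sketch of both solutions, the following assumes the default cluster profile, a hypothetical shared folder /shared/project/functions that every worker can reach, and a task function that returns one output. Substitute your own profile, folder, and file names.

```matlab
% Sketch only: the profile, folder, and output count are assumptions.
c = parcluster();              % default cluster profile
job = createJob(c);

% Solution 1: the file already lives in a folder the workers can reach,
% so add that folder to the workers' search path for this job.
job.AdditionalPaths = {'/shared/project/functions'};   % hypothetical folder

% Solution 2: send the file from the client to the workers with the job.
% The file must be findable on the client when you set this property.
job.AttachedFiles = {'function_name.m'};

createTask(job, @function_name, 1, {});   % one output, no inputs (assumed)
submit(job);
```

Usually one of the two settings is enough. AttachedFiles counts toward the combined attached-file size limit for the job, so AdditionalPaths is the lighter choice when a shared folder exists.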
Load and Save Errors
If a worker cannot save or load a file, you might see the error messages
    ??? Error using ==> save
    Unable to write file myfile.mat: permission denied.
    ??? Error using ==> load
    Unable to read file myfile.mat: No such file or directory.
In determining the cause of this error, consider the following questions:
What is the worker's current folder?
Can the worker find the file or folder?
What user is the worker running as?
Does the worker have permission to read or write the file in question?
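One way to answer these questions is to run a small diagnostic function as a task and inspect what it returns. The helper below, describeFileAccess, is a hypothetical name and not part of the toolbox; it is a sketch that reports the current folder, the user name taken from environment variables, and basic read and write checks. The write check relies on file attributes and might not reflect access control lists on every file system.

```matlab
function info = describeFileAccess(filename)
% Hypothetical helper: report what the running MATLAB session can see
% for FILENAME. Save as describeFileAccess.m.
info.CurrentFolder = pwd;

% USERNAME is typically set on Windows, USER on Linux and macOS.
info.User = getenv('USERNAME');
if isempty(info.User)
    info.User = getenv('USER');
end

% Can the session find the file?
info.Exists = (exist(filename, 'file') == 2);

% fopen returns -1 when the file cannot be opened for reading.
fid = fopen(filename, 'r');
info.CanRead = (fid ~= -1);
if info.CanRead
    fclose(fid);
end

% Check the write attribute of the containing folder.
folder = fileparts(filename);
if isempty(folder)
    folder = pwd;
end
[ok, attrs] = fileattrib(folder);
info.FolderWritable = ok && attrs.UserWrite == 1;
end
```

To run it on a worker, attach the file to a job and fetch the result; the default profile and the file name myfile.mat are again assumptions:

```matlab
c = parcluster();
job = createJob(c, 'AttachedFiles', {'describeFileAccess.m'});
createTask(job, @describeFileAccess, 1, {'myfile.mat'});
submit(job);
wait(job);
out = fetchOutputs(job);
disp(out{1})
```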
Tasks or Jobs Remain in Queued State
A job or task might get stuck in the queued state. To investigate the cause of this problem, look for the scheduler's logs:
Spectrum LSF® schedulers might send emails with error messages.
Microsoft® Windows HPC Server (including CCS), LSF®, PBS Pro®, and TORQUE save output messages in a debug log. See the getDebugLog reference page.
If using a generic scheduler, make sure the submit function redirects error messages to a log file.
Possible causes of the problem are:
The MATLAB worker failed to start because of licensing errors, because the executable is not on the default path on the worker machine, or because it is not installed in the location where the scheduler expected it to be.
MATLAB could not read/write the job input/output files in the scheduler's job storage location. The storage location might not be accessible to all the worker nodes, or the user that MATLAB runs as does not have permission to read/write the job files.
If using a generic scheduler:
The environment variable PARALLEL_SERVER_DECODE_FUNCTION was not defined before the MATLAB worker started.
The decode function was not on the worker's path.
No Results or Failed Job
Task Errors
If your job returned no results (i.e., fetchOutputs(job) returns an empty cell array), it is probable that the job failed and some of its tasks have their Error properties set.
You can use the following code to identify tasks with error messages:
    errmsgs = get(yourjob.Tasks, {'ErrorMessage'});
    nonempty = ~cellfun(@isempty, errmsgs);
    celldisp(errmsgs(nonempty));
This code displays the nonempty error messages of the tasks found in the job object yourjob.
Debug Logs
If you are using a supported third-party scheduler, you can use the getDebugLog function to read the debug log from the scheduler for a particular job or task.
For example, find the failed job on your LSF scheduler, and read its debug log:
    c = parcluster('my_lsf_profile')
    failedjob = findJob(c, 'State', 'failed');
    message = getDebugLog(c, failedjob(1))
Connection Problems Between the Client and MATLAB Job Scheduler
For testing connectivity between the client machine and the machines of your compute cluster, you can use Admin Center. For more information about Admin Center, including how to start it and how to test connectivity, see Start Admin Center (MATLAB Parallel Server) and Test MATLAB Job Scheduler Cluster Connectivity in Admin Center (MATLAB Parallel Server).
Detailed instructions for other methods of diagnosing connection problems between the client and MATLAB Job Scheduler can be found in some of the Bug Reports listed on the MathWorks Web site.
The following sections can help you identify the general nature of some connection problems.
Client Cannot See the MATLAB Job Scheduler
If you cannot locate or connect to your MATLAB Job Scheduler with parcluster, the most likely reasons for this failure are:
The MATLAB Job Scheduler is currently not running.
Firewalls do not allow traffic from the client to the MATLAB Job Scheduler.
The client and the MATLAB Job Scheduler are not running the same version of the software.
The client and the MATLAB Job Scheduler cannot resolve each other's short hostnames.
The MATLAB Job Scheduler is using a nondefault BASE_PORT setting as defined in the mjs_def file, and the Host property in the cluster profile does not specify this port.
MATLAB Job Scheduler Cannot See the Client
If a warning message says that the MATLAB Job Scheduler cannot open a TCP connection to the client computer, the most likely reasons for this are:
Firewalls do not allow traffic from the MATLAB Job Scheduler to the client.
The MATLAB Job Scheduler cannot resolve the short hostname of the client computer. Use pctconfig to change the hostname that the MATLAB Job Scheduler will use for contacting the client.
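For the hostname case, a minimal sketch looks like the following. The hostname myclient.example.com and the profile name my_mjs_profile are placeholders; use a name for your client that the cluster machines can actually resolve.

```matlab
% Placeholder hostname: replace with a name the cluster can resolve.
cfg = pctconfig('hostname', 'myclient.example.com');
disp(cfg.hostname)

% Set the hostname before creating the cluster object in this session.
c = parcluster('my_mjs_profile');   % placeholder profile name
```

The setting applies to the current MATLAB session, so it needs to be repeated (for example, in a startup script) in each new session that connects to the cluster.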
"One of your shell's init files contains a command that is writing to stdout..."
The example code for generic schedulers with non-shared file systems contacts an sftp server to handle the file transfer to and from the cluster's file system. This use of sftp is subject to all the normal sftp vulnerabilities. One problem that can occur results in an error message similar to this:
    One of your shell's init files contains a command that is writing to stdout, interfering with RemoteClusterAccess.
    The stdout read was: <some output>
    Find and wrap the command with a conditional test, such as
        if ($?TERM != 0) then
            if ("$TERM" != "dumb") then
                <your command>
            endif
        endif
The sftp server starts a shell, usually bash or tcsh, to set your standard read and write permissions appropriately before transferring files. The server initializes the shell in the standard way, calling files like .bashrc and .cshrc. The problem occurs if your shell emits text to standard out when it starts. That text is transferred back to the sftp client running inside MATLAB, and is interpreted as the size of the sftp server's response message.
To work around this error, locate the shell startup file code that is emitting the text, and either remove it or bracket it within if statements to see if the sftp server is starting the shell:
    if ($?TERM != 0) then
        if ("$TERM" != "dumb") then
            /your command/
        endif
    endif
You can test this outside of MATLAB with a standard UNIX or Windows sftp command-line client before trying again in MATLAB. If the problem is not fixed, an error message persists:
> sftp yourSubmitMachine
Connecting to yourSubmitMachine...
Received message too long 1718579042
If the problem is fixed, you should see:
> sftp yourSubmitMachine
Connecting to yourSubmitMachine...