Failing to run a reinforcement learning job on the cluster

I have a custom reinforcement learning environment in which I train an agent using the SAC algorithm. The training runs smoothly on my desktop with four cores, but attempting to speed up the process on the university cluster has been unsuccessful. Below is some information about the job. Can this issue be resolved?
>> (jobRL5)
jobRL5 =
Job
Properties:
ID: 103
Type: pool
Username: amomani1
State: failed
SubmitDateTime: 13-May-2024 18:25:52
StartDateTime: 13-May-2024 18:27:12
RunningDuration: 0 days 13h 39m 28s
NumWorkersRange: [11 11]
NumThreads: 2
AutoAttachFiles: true
Auto Attached Files: List files
AttachedFiles: R:\amomani1\matlabcodes_SI_2023a\talbot_inversion.m
R:\amomani1\matlabcodes_SI_2023a\talbot_inversion2.m
R:\amomani1\matlabcodes_SI_2023a\talbotcode.m
AutoAddClientPath: true
AdditionalPaths: \\lightning.bu.binghamton.edu\matlab\nonshared\23a\IntegrationScripts\spiedie
\\lightning.bu.binghamton.edu\matlab\nonshared\23a
C:\Users\amomani1\Documents\MATLAB
C:\Users\amomani1\AppData\Local\Temp\8\Editor_retgg
FileStore: [1x1 parallel.FileStore]
ValueStore: [1x1 parallel.ValueStore]
EnvironmentVariables: {}
Associated Tasks:
Number Pending: 0
Number Running: 0
Number Finished: 11
Task ID of Errors: []
Task ID of Warnings: []
Task Scheduler IDs: 4857192
>> c.getDebugLog(jobRL5)
LOG FILE OUTPUT:
Node file: compute[078,162]
Starting SMPD on compute078 compute162 ...
srun --ntasks-per-node=1 --ntasks=2 /cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -debug 0 &
Checking that SMPD processes are running (Attempt 1 of 60)
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute078 > /dev/null 2>&1
No SMPD process running on compute078
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute162 > /dev/null 2>&1
No SMPD process running on compute162
Checking that SMPD processes are running (Attempt 2 of 60)
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute078 > /dev/null 2>&1
SMPD process found running on compute078
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute162 > /dev/null 2>&1
SMPD process found running on compute162
All SMPDs launched
Machine args: -hosts 2 compute078 6 compute162 6
"/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_mpiexec" -smpd -phrase MATLAB -port 27192 -l -hosts 2 compute078 6 compute162 6 -genvlist PARALLEL_SERVER_DECODE_FUNCTION,PARALLEL_SERVER_STORAGE_LOCATION,PARALLEL_SERVER_STORAGE_CONSTRUCTOR,PARALLEL_SERVER_JOB_LOCATION,PARALLEL_SERVER_DEBUG,PARALLEL_SERVER_LICENSE_NUMBER,MLM_WEB_LICENSE,MLM_WEB_USER_CRED,MLM_WEB_ID,TZ,MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,MDCE_DEBUG,MDCE_LICENSE_NUMBER "/cm/shared/apps/Mathworks-MPS/2023a/bin/worker" -parallel
job aborted:
rank: node: exit code[: error message]
0: compute078: -2
1: compute078: -2
2: compute078: -2
3: compute078: -2
4: compute078: -2
5: compute078: -2
6: compute162: -2
7: compute162: -2
8: compute162: -2
9: compute162: -2
10: compute162: -2
11: compute162: 1: process 11 exited without calling finalize
Stopping SMPD ...
srun --ntasks-per-node=1 --ntasks=2 /cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -shutdown -phrase MATLAB -port 27192
Exiting with code: 123
1 Comment
Edric Ellis on 15 May 2024
It looks like you aren't getting as far as running any sort of job on the cluster. Contact MathWorks support; they can help sort out this kind of problem.
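One way to check whether the failure is independent of the reinforcement learning code is to submit a trivial pool job through the same cluster profile. The sketch below is only an illustration under stated assumptions: it assumes the default cluster profile is the university cluster, and the pool size of 11 is a guess meant to be comparable to the failing job, not a value confirmed by the log. It uses only `parcluster`, `batch`, `wait`, `fetchOutputs`, and `getDebugLog`.

```matlab
% Minimal smoke test: can any pool job start on this cluster?
% Assumption: the default profile points at the university cluster;
% otherwise pass the profile name to parcluster.
c = parcluster;

% Assumption: a pool size comparable to the failing job.
poolSize = 11;

% Run a built-in function on the cluster so no user files are involved.
% 'Pool' requests additional workers on top of the one running the function.
smokeJob = batch(c, @version, 1, {}, 'Pool', poolSize);

% Block until the job reaches a final state, then report it.
wait(smokeJob);
disp(smokeJob.State)

if strcmp(smokeJob.State, 'finished')
    % The workers started; show the MATLAB version reported by the worker.
    out = fetchOutputs(smokeJob);
    disp(out{1})
else
    % The workers never started cleanly; show the scheduler-side log.
    getDebugLog(c, smokeJob)
end
```

If this trivial job also fails with the same exit codes, the problem likely lies in the cluster integration rather than in the training script, and the two debug logs together are useful material to send to MathWorks support or the cluster administrators.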


Answers (0)
