Issues running parallel MATLAB jobs on a cluster

I am trying to run parallel MATLAB jobs on a cluster.
I launch the MATLAB engine from Python using
matlab.engine.start_matlab()
and submit the Python jobs with Slurm.
Some parallel jobs work throughout, or work for the first several function evaluations.
Other parallel jobs do not work, and show the following messages:
Warning: The cluster failed to cancel the job execution. The error was: Unable to read file '/rhome/chong009/.matlab/local_cluster_jobs/R2018a/Job12.in.mat'. No such file or directory.
> In parallel.internal.cluster.CJSJobMethods.cancelOneJob (line 53)
In parallel.job.CJSConcurrentJob/cancelOneJob (line 57)
In parallel.Job/cancel (line 1348)
In parallel.Cluster/hDeleteOneJob (line 911)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster reported an error while deleting an unavailable job. This job may already have been deleted.
> In parallel.internal.cluster.CJSJobMethods.destroyOneJob (line 77)
In parallel.job.CJSConcurrentJob/destroyOneJob (line 52)
In parallel.Job/delete (line 1279)
In parallel.Cluster/hDeleteOneJob (line 926)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster failed to cancel the job execution. The error was: Unable to read file '/rhome/chong009/.matlab/local_cluster_jobs/R2018a/Job44.in.mat'. No such file or directory.
> In parallel.internal.cluster.CJSJobMethods.cancelOneJob (line 53)
In parallel.job.CJSConcurrentJob/cancelOneJob (line 57)
In parallel.Job/cancel (line 1348)
In parallel.Cluster/hDeleteOneJob (line 911)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster reported an error while deleting an unavailable job. This job may already have been deleted.
> In parallel.internal.cluster.CJSJobMethods.destroyOneJob (line 77)
In parallel.job.CJSConcurrentJob/destroyOneJob (line 52)
In parallel.Job/delete (line 1279)
In parallel.Cluster/hDeleteOneJob (line 926)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster failed to cancel the job execution. The error was: Unable to read file '/rhome/chong009/.matlab/local_cluster_jobs/R2018a/Job92.in.mat'. No such file or directory.
> In parallel.internal.cluster.CJSJobMethods.cancelOneJob (line 53)
In parallel.job.CJSConcurrentJob/cancelOneJob (line 57)
In parallel.Job/cancel (line 1348)
In parallel.Cluster/hDeleteOneJob (line 911)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster reported an error while deleting an unavailable job. This job may already have been deleted.
> In parallel.internal.cluster.CJSJobMethods.destroyOneJob (line 77)
In parallel.job.CJSConcurrentJob/destroyOneJob (line 52)
In parallel.Job/delete (line 1279)
In parallel.Cluster/hDeleteOneJob (line 926)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Preserving jobs with IDs: 5 21 22 68 69 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
Starting parallel pool (parpool) using the 'local' profile ...
After this, no message like "connected to 4 workers" appears, and the running time is approximately 4 times longer.
Some jobs work perfectly at first, but after some MATLAB evaluations the parallel pool no longer seems to work. They show the following output:
Starting parallel pool (parpool) using the 'local' profile ...
Preserving jobs with IDs: 5 21 22 68 69 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
Starting parallel pool (parpool) using the 'local' profile ...
Preserving jobs with IDs: 5 21 22 68 69 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
connected to 4 workers.
best so far in the initial data -0.366481187141
EI, 1th job, 0th iteration, func=overlap, q=4
EI takes 0.589984893799 seconds
EI suggests points:
[[ 5.62950456e-33 3.93339029e-32 1.00000000e-01 1.00000000e-01]
[ 4.47912054e-02 1.10230209e-30 6.65211795e-02 1.00000000e-01]
[ 1.39665061e-33 7.55063601e-33 5.47105991e-31 8.32740650e-33]
[ 2.35956500e-32 1.00000000e-01 1.00000000e-01 1.00000000e-01]]
evaluating takes 6.06351319949 mins
evaluating takes capital 1.0 so far
retraining the model takes 6.38158202171 seconds
But after some MATLAB evaluations, only "Starting parallel pool (parpool) using the 'local' profile ..." appears, without "connected to 4 workers", and the running time is almost 4 times longer.
EI, VOI 0.0, best so far -0.920065378916
EI, 1th job, 44th iteration, func=overlap, q=4
EI takes 1.0830039978 seconds
EI suggests points:
[[ 0.03138878 0.09221666 0.06362966 0.00471924]
[ 0.09320626 0.08688824 0.05820342 0.06846487]
[ 0.04571523 0.00076893 0.01246081 0.00150557]
[ 0.00917403 0.06416236 0.07597119 0.09062772]]
Starting parallel pool (parpool) using the 'local' profile ...
evaluating takes 26.8635237853 mins
evaluating takes capital 45.0 so far
retraining the model takes 1807.75688004 seconds

Accepted Answer

Michael 2019-2-3
Edited: Walter Roberson 2019-2-3
4 comments
Guilherme Salvador Vieira
Solved my problem as well! Thanks for sharing this link; I was not aware of this overwriting problem between different pools.
Xiaoyang Guo 2020-4-24
Seems promising... I need two or three days to see whether this works.
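
The link from the accepted answer is not reproduced in this thread, so the following is only a sketch of one common workaround for the "overwriting between different pools" problem mentioned in the comments, not necessarily what the link recommends. The warnings above show every job reading and deleting files under the same ~/.matlab/local_cluster_jobs folder, so the idea is to give each Slurm job its own JobStorageLocation before opening the pool. The helper names (job_storage_dir, pool_commands, start_engine_with_private_pool) and the "matlab_jobs_" folder prefix are made up for illustration; SLURM_JOB_ID is the environment variable Slurm sets inside a job.

```python
import os
import tempfile


def job_storage_dir(base_dir, env=None):
    """Return a storage folder path unique to this Slurm job (or process)."""
    env = os.environ if env is None else env
    # SLURM_JOB_ID is set by Slurm inside a job; fall back to the process ID
    # so the function also works when run outside Slurm.
    job_id = env.get("SLURM_JOB_ID") or "pid%d" % os.getpid()
    return os.path.join(base_dir, "matlab_jobs_%s" % job_id)


def pool_commands(storage_dir, workers):
    """Build the MATLAB statements that open a pool using storage_dir."""
    quoted = storage_dir.replace("'", "''")  # escape quotes for a MATLAB char literal
    return [
        "c = parcluster('local');",
        "c.JobStorageLocation = '%s';" % quoted,
        "parpool(c, %d);" % workers,
    ]


def start_engine_with_private_pool(workers=4, base_dir=None):
    """Start MATLAB and open a pool whose job files live in a private folder."""
    import matlab.engine  # imported here so the helpers above work without MATLAB

    storage = job_storage_dir(base_dir or tempfile.gettempdir())
    os.makedirs(storage, exist_ok=True)  # the folder must exist before MATLAB uses it
    eng = matlab.engine.start_matlab()
    for command in pool_commands(storage, workers):
        eng.eval(command, nargout=0)
    return eng
```

Calling start_engine_with_private_pool() in place of matlab.engine.start_matlab() means the pool is already open when crank_nicolson reaches its parfor loop, so no second pool is auto-created against the shared folder. Whether the temporary directory is the right base folder depends on the cluster; a per-job scratch directory may be a better choice where one is provided.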

Release

R2018a
