Issues running parallel MATLAB jobs on a cluster
I am trying to run parallel MATLAB jobs on a cluster.
I launch the MATLAB engine from Python using
matlab.engine.start_matlab()
and submit the Python jobs with SLURM.
Some parallel jobs work, or at least work for the first several function evaluations.
Other parallel jobs do not work and show the following messages:
Warning: The cluster failed to cancel the job execution. The error was: Unable to read file '/rhome/chong009/.matlab/local_cluster_jobs/R2018a/Job12.in.mat'. No such file or directory.
> In parallel.internal.cluster.CJSJobMethods.cancelOneJob (line 53)
In parallel.job.CJSConcurrentJob/cancelOneJob (line 57)
In parallel.Job/cancel (line 1348)
In parallel.Cluster/hDeleteOneJob (line 911)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster reported an error while deleting an unavailable job. This job may already have been deleted.
> In parallel.internal.cluster.CJSJobMethods.destroyOneJob (line 77)
In parallel.job.CJSConcurrentJob/destroyOneJob (line 52)
In parallel.Job/delete (line 1279)
In parallel.Cluster/hDeleteOneJob (line 926)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster failed to cancel the job execution. The error was: Unable to read file '/rhome/chong009/.matlab/local_cluster_jobs/R2018a/Job44.in.mat'. No such file or directory.
> In parallel.internal.cluster.CJSJobMethods.cancelOneJob (line 53)
In parallel.job.CJSConcurrentJob/cancelOneJob (line 57)
In parallel.Job/cancel (line 1348)
In parallel.Cluster/hDeleteOneJob (line 911)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster reported an error while deleting an unavailable job. This job may already have been deleted.
> In parallel.internal.cluster.CJSJobMethods.destroyOneJob (line 77)
In parallel.job.CJSConcurrentJob/destroyOneJob (line 52)
In parallel.Job/delete (line 1279)
In parallel.Cluster/hDeleteOneJob (line 926)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster failed to cancel the job execution. The error was: Unable to read file '/rhome/chong009/.matlab/local_cluster_jobs/R2018a/Job92.in.mat'. No such file or directory.
> In parallel.internal.cluster.CJSJobMethods.cancelOneJob (line 53)
In parallel.job.CJSConcurrentJob/cancelOneJob (line 57)
In parallel.Job/cancel (line 1348)
In parallel.Cluster/hDeleteOneJob (line 911)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Warning: The cluster reported an error while deleting an unavailable job. This job may already have been deleted.
> In parallel.internal.cluster.CJSJobMethods.destroyOneJob (line 77)
In parallel.job.CJSConcurrentJob/destroyOneJob (line 52)
In parallel.Job/delete (line 1279)
In parallel.Cluster/hDeleteOneJob (line 926)
In parallel.internal.pool.InteractiveClient>iDeleteJobs (line 873)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 481)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 791)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parpool (line 98)
In parallel.internal.pool.PoolArrayManager.getOrAutoCreateWithCleanup (line 60)
In pctTryCreatePoolIfNecessary (line 23)
In distcomp.remoteparfor.tryRemoteParfor
In parallel_function (line 433)
In crank_nicolson (line 38)
Preserving jobs with IDs: 5 21 22 68 69 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
Starting parallel pool (parpool) using the 'local' profile ...
After that, no message like "connected to 4 workers" appears, and the running time is approximately 4 times longer.
Other jobs work perfectly at first, but after some MATLAB evaluations the parallel pool seems to stop working. They show the following output:
Starting parallel pool (parpool) using the 'local' profile ...
Preserving jobs with IDs: 5 21 22 68 69 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
Starting parallel pool (parpool) using the 'local' profile ...
Preserving jobs with IDs: 5 21 22 68 69 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
connected to 4 workers.
best so far in the initial data -0.366481187141
EI, 1th job, 0th iteration, func=overlap, q=4
EI takes 0.589984893799 seconds
EI suggests points:
[[ 5.62950456e-33 3.93339029e-32 1.00000000e-01 1.00000000e-01]
[ 4.47912054e-02 1.10230209e-30 6.65211795e-02 1.00000000e-01]
[ 1.39665061e-33 7.55063601e-33 5.47105991e-31 8.32740650e-33]
[ 2.35956500e-32 1.00000000e-01 1.00000000e-01 1.00000000e-01]]
evaluating takes 6.06351319949 mins
evaluating takes capital 1.0 so far
retraining the model takes 6.38158202171 seconds
But after some MATLAB evaluations, only "Starting parallel pool (parpool) using the 'local' profile ..." is printed, without "connected to 4 workers", and the running time is almost 4 times longer:
EI, VOI 0.0, best so far -0.920065378916
EI, 1th job, 44th iteration, func=overlap, q=4
EI takes 1.0830039978 seconds
EI suggests points:
[[ 0.03138878 0.09221666 0.06362966 0.00471924]
[ 0.09320626 0.08688824 0.05820342 0.06846487]
[ 0.04571523 0.00076893 0.01246081 0.00150557]
[ 0.00917403 0.06416236 0.07597119 0.09062772]]
Starting parallel pool (parpool) using the 'local' profile ...
evaluating takes 26.8635237853 mins
evaluating takes capital 45.0 so far
retraining the model takes 1807.75688004 seconds
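A pool that fails to start, as in the output above, can also be detected from the Python side instead of being inferred from the missing "connected to 4 workers" line. The following is a minimal sketch, assuming `eng` is the object returned by `matlab.engine.start_matlab()`; the helper names `pool_size` and `require_pool` and the temporary workspace variables `p__` and `n__` are hypothetical and not part of the original code.

```python
def pool_size(eng):
    """Return the number of workers in the current MATLAB pool, or 0 if none.

    gcp('nocreate') returns the current pool without starting a new one.
    """
    eng.eval(
        "p__ = gcp('nocreate'); "
        "if isempty(p__), n__ = 0; else, n__ = p__.NumWorkers; end",
        nargout=0,
    )
    # MATLAB doubles arrive in Python as floats, so convert to int.
    return int(eng.workspace["n__"])


def require_pool(eng, expected=4):
    """Raise RuntimeError if the pool has fewer workers than expected."""
    n = pool_size(eng)
    if n < expected:
        raise RuntimeError(
            "parallel pool has {} workers, expected {}".format(n, expected)
        )
    return n
```

Calling `require_pool(eng, 4)` before each expensive evaluation would make the job stop with a clear error rather than silently running about 4 times slower.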
Accepted Answer
Michael
2019-2-3
Edited: Walter Roberson
2019-2-3
4 Comments
Guilherme Salvador Vieira
2020-4-5
Solved my problem as well! Thanks for sharing this link. I was not aware of this overwriting problem between different pools.
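The link from the accepted answer is not reproduced in this thread, so the following is not a copy of it. The comment above describes pools overwriting each other's files, which matches the warnings about missing files under `.matlab/local_cluster_jobs`. One common way to avoid that kind of collision is to give each SLURM job its own `JobStorageLocation` before opening the pool. A minimal sketch, assuming the script runs inside a SLURM job (so `SLURM_JOB_ID` is set) and that a temporary directory is an acceptable place for job files; the helper names and the `matlab_jobs_` directory prefix are hypothetical.

```python
import os
import tempfile


def job_storage_dir(base_dir=None, env=None):
    """Return a directory path unique to this SLURM job (assumed naming scheme)."""
    env = os.environ if env is None else env
    base_dir = base_dir or tempfile.gettempdir()
    # SLURM sets SLURM_JOB_ID inside a job; fall back to the process ID otherwise.
    job_id = env.get("SLURM_JOB_ID", str(os.getpid()))
    return os.path.join(base_dir, "matlab_jobs_" + job_id)


def pool_setup_command(storage_dir, workers=4):
    """Build MATLAB code that redirects the 'local' profile and opens a pool."""
    quoted = storage_dir.replace("'", "''")  # escape single quotes for MATLAB
    return (
        "c = parcluster('local'); "
        "c.JobStorageLocation = '{}'; "
        "parpool(c, {});"
    ).format(quoted, int(workers))


def start_engine_with_private_pool(workers=4):
    """Start MATLAB and open a pool whose job files live in a per-job folder."""
    import matlab.engine  # imported lazily so the helpers above work without MATLAB

    storage = job_storage_dir()
    os.makedirs(storage, exist_ok=True)
    eng = matlab.engine.start_matlab()
    eng.eval(pool_setup_command(storage, workers), nargout=0)
    return eng
```

With this in place, `eng = start_engine_with_private_pool(4)` would replace the bare `matlab.engine.start_matlab()` call, and later `parfor` loops would reuse the pool that is already open instead of creating one in the shared default folder.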