MATLAB Parallel Computing on Cluster - File not found (Task8-32.in.mat)
Hi,
I am trying to use MATLAB on a cluster with multiple nodes.
For now, I'm testing with 2 nodes of 16 cores each.
I have generated a new Generic Cluster profile using the plugin scripts for Sun Grid Engine (SGE).
The independent job validation works fine, while the SPMD, pool, and parpool tests fail (but only if I use more than 1 node!).
Looking at the job logs, I saw that the problem was related to mw_mpiexec (MPI was crashing).
I tried a different MPI (MPICH 4.1.1), and MPI no longer crashes; however, the MATLAB instances on the different nodes cannot find the files automatically generated by the validation cases.
I have attached the validation log file.
Could you please help me solve this issue?
Thank you,
Antonio
Answers (1)
Raymond Norris
2023-5-16
Hi @Antonio Cioffi. I'm not sure why mpiexec is crashing, but I can tell you why you're getting validation issues. When you switch MPI libraries, you need to point MATLAB to the correct libmpi.so. When you say you've tried a different MPI, how did you go about it? You'll need to create your own mpiLibConf.m file to point to your libmpi.so (see the documentation for more information).
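For reference, here's a minimal mpiLibConf.m sketch, assuming MPICH 4.1.1 is installed under /opt/mpich-4.1.1 (adjust the path for your cluster). Place the file on the workers' path ahead of the default one that ships with MATLAB:
function [primaryLib, extras] = mpiLibConf
% Tell MATLAB workers which MPI library to load instead of the bundled one.
% Assumed install location -- adjust to wherever your libmpi.so lives.
primaryLib = '/opt/mpich-4.1.1/lib/libmpi.so';
% No extra libraries need to be preloaded alongside libmpi.so in this sketch.
extras = {};
end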
I can tell MATLAB is not loading the correct library because of the following:
[28] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
[31] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
[30] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
[29] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
The [number] is the MPI rank. This tells you that each worker is creating a file in the folder Job25 with the filename Task1. They're all "Task1" because they haven't started up properly -- they're not aware that there are other MPI ranks. What it should show is
[28] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task28"
[31] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task31"
[30] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task30"
[29] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task29"
Output like this would indicate that each worker had started up correctly. Since yours doesn't, MATLAB must not be finding the correct libmpi.so.
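Since your independent jobs validate fine, a quick way to see which mpiLibConf.m the workers actually pick up is to have an independent job call which on it. This is only a sketch; the profile name 'mySGEProfile' is a placeholder for your Generic Cluster profile:
c = parcluster('mySGEProfile');            % placeholder -- use your profile name
j = createJob(c);                          % independent job: these validate OK for you
createTask(j, @which, 1, {'mpiLibConf'});  % ask a worker which file it would load
submit(j); wait(j);
fetchOutputs(j)                            % should show your custom mpiLibConf.m
If this returns the default file under matlabroot instead of your own, the workers never see your MPICH settings.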
I would suggest you contact support@mathworks.com; they can help you figure out why SGE can't run multi-node (you have passwordless SSH between the compute nodes, right?).
2 Comments
Raymond Norris
2023-5-18
I want to clarify what's happening, though. Notice the following
./mpiexec -info | grep device
will display
--with-device=ch3:nemesis
Therefore, you get shared memory for intranode communication and TCP for internode communication (the default for nemesis, per https://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1-README.txt). If it had instead been built with one of
--with-device=ch3:nemesis:mxm (Mellanox InfiniBand)
--with-device=ch3:nemesis:ofi
--with-device=ch4:ucx
then traffic would go natively over InfiniBand. Instead, I believe what you are getting is IPoIB (IP over InfiniBand).