Problems with training a network on multiple GPUs

I am using the MATLAB Neural Network Toolbox to train my own neural network. It runs without problems on both my personal computer and my university's HPC (high-performance computing) cluster when I set the 'ExecutionEnvironment' option in trainingOptions to 'gpu'. Since the HPC cluster can provide me with more than one GPU, I modified my program and set 'ExecutionEnvironment' to 'parallel'. I then tested this code on my personal computer and on the HPC cluster. It works well on my personal computer, but on the university cluster it throws the following error:
Error using trainNetwork (line 154)
The parallel pool that SPMD was using has been shut down.
Error in TrainMyUnet (line 19)
[net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 10)
TrainMyUnet;
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 67)
The parallel pool that SPMD was using has been shut down.
The client lost connection to worker 2. This might be due to network problems, or the interactive communicating job might have errored.
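For reference, the setup described above might look roughly like the sketch below. The solver ('sgdm') and the mini-batch and epoch values are placeholders chosen for illustration, not the settings actually used; only the 'ExecutionEnvironment' values and the trainNetwork call come from the post.

```matlab
% Sketch only: the solver and hyperparameter values are assumptions.
options = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'parallel', ... % was 'gpu' in the working single-GPU runs
    'MiniBatchSize', 16, ...                % placeholder value
    'MaxEpochs', 10, ...                    % placeholder value
    'Verbose', true);

% The training call as it appears in the error trace (TrainMyUnet, line 19).
% trainSet, trainLabel and myUnet must already exist in the workspace:
% [net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
```

With 'parallel', training uses a parallel pool based on the default cluster profile, so the worker and GPU setup on the cluster matters here in a way it does not for 'gpu'.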
Furthermore, my university has several HPC clusters, so I tested the code on another cluster equipped with GPUs. This time the error is different, but the code still does not work. The error is:
trainNet
Starting parallel pool (parpool) using the 'local' profile ...
connected to 2 workers.
========================================================================================
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     RMSE     |     Loss     |      Rate       |
========================================================================================
Error using trainNetwork (line 154)
The NCCL library failed to initialize, with error 'unhandled cuda error'.
Error in TrainMyUnet (line 19)
[net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 15)
TrainMyUnet;
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 67)
Error detected on workers 1 2.
Error using gpuArray/gop>iNcclReduce (line 305)
The NCCL library failed to initialize, with error 'unhandled cuda error'.
Can anyone help with this problem? P.S. The code runs perfectly on my personal computer. The MATLAB version I am using is R2018a (matlab/2018a). I also checked the code at nnet.internal.cnn.ParallelTrainer/train (line 67); it is just a simple spmd block: spmd, some code, end.
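One way to narrow this down might be to check whether every pool worker can select and use its own GPU, independently of trainNetwork. The following is a minimal diagnostic sketch, assuming the 'local' profile shown in the log and one worker per GPU; it does not exercise NCCL itself, so passing it only rules out basic device-access problems.

```matlab
% Diagnostic sketch (assumptions: 'local' profile, one worker per GPU).
nGpus = gpuDeviceCount;
if nGpus < 1
    error('No supported GPU is visible to this MATLAB session.');
end

% Reuse an existing pool if there is one, otherwise start one.
pool = gcp('nocreate');
if isempty(pool)
    pool = parpool('local', nGpus);
end

spmd
    % labindex is the worker index in R2018a and R2020b
    % (newer releases call it spmdIndex).
    if labindex <= nGpus
        d = gpuDevice(labindex);            % select a distinct GPU on this worker
        x = rand(1000, 'gpuArray');         % small computation on the device
        s = gather(sum(x(:)));              % bring the result back to confirm it ran
        fprintf('GPU %d: %s, compute capability %s, driver %g, toolkit %g, sum = %g\n', ...
            d.Index, d.Name, d.ComputeCapability, d.DriverVersion, d.ToolkitVersion, s);
    end
end
```

If a worker errors or drops out of the pool here, the problem likely lies in the cluster's GPU or driver configuration rather than in the training code.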
12 Comments
Yang Gao, 2018-8-1
Many thanks for your response, and sorry for replying so late; I was sick for the last two days. I will try your code. Thanks.
Gökalp, 2020-11-16
I also got a similar error (only the line number differs: 96).
A few weeks ago I ran similar code with multiple GPUs, but this time I got this error:
Error in unetDalak_gt (line 456)
net = trainNetwork(ds,lgraph,options);
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 96)
The parallel pool that SPMD was using has been shut down.
I also tried the code on my colleague's computer (with a single GPU and Windows 10) and it worked perfectly. We are both using MATLAB R2020b.
The details of my computer's GPU are given below.
I tried gop.m without any modification (/usr/local/MATLAB/R2020b/toolbox/parallel/gpu/@gpuArray),
and I also followed the steps from @Joss Knight (/home/medi/Desktop/MATLAB/@gpuArray), but nothing changed.
Can you help me solve this error, please?


Answers (0)
