Problems with training Network with multi-gpus

Question

Yang Gao 2018-7-26

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/412175-problems-with-training-network-with-multi-gpus

Commented: Gökalp 2020-11-16

I am using the MATLAB Neural Network Toolbox to train my own neural network. It runs perfectly on both my personal computer and my university's HPC (high-performance computing) cluster when I set the 'ExecutionEnvironment' property in trainingOptions to 'gpu'. Since the HPC cluster can provide more than one GPU, I modified my program and set 'ExecutionEnvironment' to 'parallel'. I then tested this code on my personal computer and on the HPC cluster. It works well on my personal computer, but on the university cluster the following error is thrown:

Error using trainNetwork (line 154)
The parallel pool that SPMD was using has been shut down.

Error in TrainMyUnet (line 19)
[net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);

Error in tarinTask (line 10)
TrainMyUnet;

Caused by:
    Error using nnet.internal.cnn.ParallelTrainer/train (line 67)
    The parallel pool that SPMD was using has been shut down.

The client lost connection to worker 2. This might be due to network problems, or the interactive communicating job might have errored.

My university has several HPC clusters, so I also tested this code on another cluster equipped with GPUs. This time the error is different, but the code still does not work:

Error in TrainMyUnet (line 19)
[net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);

Error in tarinTask (line 15)
TrainMyUnet;

Caused by:
    Error using nnet.internal.cnn.ParallelTrainer/train (line 67)
    Error detected on workers 1 2.
        Error using gpuArray/gop>iNcclReduce (line 305)
        The NCCL library failed to initialize, with error 'unhandled cuda error'.

Can anyone help with this problem? P.S. The code runs perfectly on my personal computer. I am using MATLAB R2018a. I also checked the code at nnet.internal.cnn.ParallelTrainer/train (line 67); it is just a simple construct: spmd, some code, end.
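For readers comparing the configurations mentioned in the question, here is a minimal sketch of the three relevant 'ExecutionEnvironment' settings. The solver name and the hyperparameter values are placeholders chosen for illustration; they are assumptions, not the settings used in the original code.

```matlab
% Sketch only: 'sgdm', MaxEpochs and MiniBatchSize are placeholder choices,
% not the values from the original question.
commonArgs = {'MaxEpochs', 10, 'MiniBatchSize', 16};

% Single GPU: the configuration reported to work on every machine.
optsGpu = trainingOptions('sgdm', commonArgs{:}, ...
    'ExecutionEnvironment', 'gpu');

% Multiple GPUs on one machine, using a local parallel pool.
optsMultiGpu = trainingOptions('sgdm', commonArgs{:}, ...
    'ExecutionEnvironment', 'multi-gpu');

% A local or remote parallel pool based on the default cluster profile.
optsParallel = trainingOptions('sgdm', commonArgs{:}, ...
    'ExecutionEnvironment', 'parallel');

% Show which environment each options object will request.
disp(optsGpu.ExecutionEnvironment)
disp(optsMultiGpu.ExecutionEnvironment)
disp(optsParallel.ExecutionEnvironment)
```

Switching back to 'gpu' is a quick way to confirm that a failure is specific to the multi-worker path rather than to the network or the data.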

12 Comments

Yang Gao 2018-7-26

Edited: Walter Roberson 2018-7-27

The complete error output from the first cluster:

Starting parallel pool (parpool) using the 'local' profile ...
Preserving jobs with IDs: 1 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
connected to 4 workers.
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     RMSE     |     Loss     |      Rate       |
|========================================================================================|
Error using trainNetwork (line 154)
The parallel pool that SPMD was using has been shut down.
Error in TrainMyUnet (line 19)
[net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 10)
TrainMyUnet;
Caused by:
    Error using nnet.internal.cnn.ParallelTrainer/train (line 67)
    The parallel pool that SPMD was using has been shut down.
The client lost connection to worker 2. This might be due to network problems,
or the interactive communicating job might have errored.

The GPUs in this cluster are Tesla K80s.

Yang Gao 2018-7-26

Edited: Walter Roberson 2018-7-27

The complete error output from the second cluster:

Starting parallel pool (parpool) using the 'local' profile ...
connected to 2 workers.
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     RMSE     |     Loss     |      Rate       |
|========================================================================================|
Error using trainNetwork (line 154)
The NCCL library failed to initialize, with error 'unhandled cuda error'.
Error in TrainMyUnet (line 19)
[net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 15)
TrainMyUnet;
Caused by:
    Error using nnet.internal.cnn.ParallelTrainer/train (line 67)
    Error detected on workers 1 2.
        Error using gpuArray/gop>iNcclReduce (line 305)
        The NCCL library failed to initialize, with error 'unhandled cuda
        error'.

The GPUs in this cluster are Tesla V100s.

Yang Gao 2018-7-27

Thanks for your response. I have tried your code, and the results are below:

Lab 1:

ans =
    CUDADevice with properties:
                        Name: 'Tesla K80'
                       Index: 1
           ComputeCapability: '3.7'
              SupportsDouble: 1
               DriverVersion: 9.2000
              ToolkitVersion: 9
          MaxThreadsPerBlock: 1024
            MaxShmemPerBlock: 49152
          MaxThreadBlockSize: [1024 1024 64]
                 MaxGridSize: [2.1475e+09 65535 65535]
                   SIMDWidth: 32
                 TotalMemory: 1.2800e+10
             AvailableMemory: 1.2662e+10
         MultiprocessorCount: 13
                ClockRateKHz: 823500
                 ComputeMode: 'Default'
        GPUOverlapsTransfers: 1
      KernelExecutionTimeout: 0
            CanMapHostMemory: 1
             DeviceSupported: 1
              DeviceSelected: 1

Lab 2:

ans =
    CUDADevice with properties:
                        Name: 'Tesla K80'
                       Index: 2
           ComputeCapability: '3.7'
              SupportsDouble: 1
               DriverVersion: 9.2000
              ToolkitVersion: 9
          MaxThreadsPerBlock: 1024
            MaxShmemPerBlock: 49152
          MaxThreadBlockSize: [1024 1024 64]
                 MaxGridSize: [2.1475e+09 65535 65535]
                   SIMDWidth: 32
                 TotalMemory: 1.2800e+10
             AvailableMemory: 1.2662e+10
         MultiprocessorCount: 13
                ClockRateKHz: 823500
                 ComputeMode: 'Default'
        GPUOverlapsTransfers: 1
      KernelExecutionTimeout: 0
            CanMapHostMemory: 1
             DeviceSupported: 1
              DeviceSelected: 1

Error using gpuHelp (line 5)
The parallel pool that SPMD was using has been shut down.

The client lost connection to worker 1. This might be due to network problems, or the interactive communicating job might have errored.

It seems that the first statement works while the second one does not; it throws the error that the parallel pool SPMD was using has been shut down.

Yang Gao 2018-7-27

Edited: Walter Roberson 2018-7-27

I have also tested the code on another HPC cluster equipped with Tesla V100 GPUs, and the result is below:

Starting parallel pool (parpool) using the 'local' profile ...
connected to 2 workers.
ans =
   Pool with properties:
              Connected: true
             NumWorkers: 2
                Cluster: local
          AttachedFiles: {}
      AutoAddClientPath: true
            IdleTimeout: 30 minutes (30 minutes remaining)
            SpmdEnabled: true
Lab 1:
    ans =
      CUDADevice with properties:
                          Name: 'Tesla V100-PCIE-16GB'
                         Index: 1
             ComputeCapability: '7.0'
                SupportsDouble: 1
                 DriverVersion: 9.1000
                ToolkitVersion: 9
            MaxThreadsPerBlock: 1024
              MaxShmemPerBlock: 49152
            MaxThreadBlockSize: [1024 1024 64]
                   MaxGridSize: [2.1475e+09 65535 65535]
                     SIMDWidth: 32
                   TotalMemory: 1.6946e+10
               AvailableMemory: 1.6371e+10
           MultiprocessorCount: 80
                  ClockRateKHz: 1380000
                   ComputeMode: 'Default'
          GPUOverlapsTransfers: 1
        KernelExecutionTimeout: 0
              CanMapHostMemory: 1
               DeviceSupported: 1
                DeviceSelected: 1
Lab 2:
    ans =
      CUDADevice with properties:
                          Name: 'Tesla V100-PCIE-16GB'
                         Index: 2
             ComputeCapability: '7.0'
                SupportsDouble: 1
                 DriverVersion: 9.1000
                ToolkitVersion: 9
            MaxThreadsPerBlock: 1024
              MaxShmemPerBlock: 49152
            MaxThreadBlockSize: [1024 1024 64]
                   MaxGridSize: [2.1475e+09 65535 65535]
                     SIMDWidth: 32
                   TotalMemory: 1.6946e+10
               AvailableMemory: 1.6371e+10
           MultiprocessorCount: 80
                  ClockRateKHz: 1380000
                   ComputeMode: 'Default'
          GPUOverlapsTransfers: 1
        KernelExecutionTimeout: 0
              CanMapHostMemory: 1
               DeviceSupported: 1
                DeviceSelected: 1
Lab 1:
    ans =
         2     2
         2     2
Lab 2:
    ans =
         2     2
         2     2
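The diagnostic code itself is not shown in this thread, so the following is only a reconstruction under assumptions: a two-statement check whose output would look like the results above. The first statement reports each worker's GPU, and the second sums a small gpuArray across workers, which exercises GPU-to-GPU communication. It uses the R2018a-era names labindex and gplus; newer releases offer spmdIndex and spmdPlus instead.

```matlab
% Reconstruction under assumptions; not the original diagnostic script.
pool = gcp;   % reuse the current pool, or start one with the default profile

spmd
    % Statement 1: each worker reports the GPU it has selected.
    dev = gpuDevice;
    fprintf('Worker %d: %s (device index %d)\n', labindex, dev.Name, dev.Index);

    % Statement 2: sum a 2-by-2 gpuArray of ones across all workers, then
    % bring the result back to host memory on each worker.
    total = gather(gplus(ones(2, 'gpuArray')));
end

% total is a Composite; fetch worker 1's copy to the client.
result = total{1};
disp(result)
```

With N workers in the pool, every element of result should equal N, which matches the 2-by-2 matrix of twos reported for the two-worker V100 pool. A crash or shutdown at the second statement points at inter-GPU communication rather than at device selection.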

Joss Knight 2018-7-27

There's not much to suggest that isn't quite a lot of work. It seems the NVIDIA library we use to communicate between GPUs is either crashing or erroring on this cluster. This could be an OS configuration issue (perhaps it's an unsupported operating system, or some sort of virtualized system), or a GPU configuration issue (a broken driver, hardware setup or virtualization issue). The only way to tackle this is to go through MathWorks tech support, and you will need some system information about the cluster. A good place to start would be to try some older NVIDIA drivers on that system (perhaps starting with the 388 drivers and working upwards).

We can talk about how to work around your problem, by disabling NCCL. This should fix the issue but it will make parallel training slower.

A way that should work is to shadow the gpuArray method gop with a version that removes the classname input. Start by opening gpuArray/gop

edit gpuArray/gop

Then use Save As to save it to a new location; somewhere on your path (or local folder) in a directory called @gpuArray.

Now modify line 47 where it says

if ~strcmp(classname, 'gpuArray')

replacing it with

if true

To check MATLAB sees your new version, try

clear classes
which gpuArray/gop

Now start the pool in your cluster, and add this new version of gop:

addAttachedFiles(gcp, 'gpuArray/gop');

Now run, and you should find everything works. If not, we can try shadowing the ParallelTrainer class in a similar way to stop it from trying to use NCCL.
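The shadowing steps above can be sketched as a short script. This is an illustration under assumptions, not a supported procedure: shadowDir is a hypothetical location (an @gpuArray folder in the current directory), the edit to the copied file is still made by hand, and the line number of the condition may differ between releases.

```matlab
% Sketch of the shadowing steps; shadowDir is a hypothetical location.
shadowDir  = fullfile(pwd, '@gpuArray');
shadowFile = fullfile(shadowDir, 'gop.m');

if ~exist(shadowFile, 'file')
    original = which('gpuArray/gop');   % path of the shipped method
    if ~exist(shadowDir, 'dir')
        mkdir(shadowDir);
    end
    copyfile(original, shadowFile);
    fileattrib(shadowFile, '+w');       % installed files are often read-only
end

% Manual step: in the copied gop.m, replace the condition
%     if ~strcmp(classname, 'gpuArray')
% with
%     if true
% (reported as line 47 in this thread; it may be elsewhere in other releases).

clear classes           % note: this also clears all workspace variables
which gpuArray/gop      % should now print the path of the shadow copy

% Make the modified method available to the pool workers, as in the
% step described above.
addAttachedFiles(gcp, 'gpuArray/gop');
```

If which still prints the path under the MATLAB installation folder, the shadow copy is not being picked up and the attached file will not change the workers' behavior.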

Of course, a better solution than this is to use your V100 cluster instead - this will have much better performance anyway.

Sorry about this. We actually test internally on K80 multi-gpu devices, and we are running on the 9.2 drivers ourselves, so something unusual or non-standard is going on.

Yang Gao 2018-8-1

Many thanks for your response, and sorry for replying so late; I was sick for the last two days. I will try your code. Thanks.

Gökalp 2020-11-16

I also got a similar error (only the line number differs: 96).

A few weeks ago I ran similar code with multiple GPUs, but this time I got this error:

Error in unetDalak_gt (line 456)
net = trainNetwork(ds,lgraph,options);
Caused by:
    Error using nnet.internal.cnn.ParallelTrainer/train (line 96)
    The parallel pool that SPMD was using has been shut down.

I also tried the code on my colleague's computer (with a single GPU and Windows 10) and it worked perfectly. We are both using MATLAB R2020b.

The details of my computer's GPU are given below.

I tried gop.m without any modification (/usr/local/MATLAB/R2020b/toolbox/parallel/gpu/@gpuArray),

and I also followed the steps from @Joss Knight (/home/medi/Desktop/MATLAB/@gpuArray), but nothing changed.

Can you help me solve this error, please?
