Multiple GPU training slower than single GPU?

I just got a machine with four GTX 1080 Ti GPUs and I want to see how fast it can go. So I ran the demo DeepLearningRCNNObjectDetectionExample.m with various ExecutionEnvironment settings in trainingOptions. With 'gpu' it runs at about 0.6 sec per mini-batch. With 'multi-gpu' it created four workers (one per GPU), but each mini-batch then took about 4 sec. Why is the 'multi-gpu' option about 7x slower than the single-GPU option? Is it a bug, or something else? BTW, I use MATLAB R2017a on Windows Server 2016 with the CUDA 8.0 toolkit.

Answers (2)

Birju Patel on 8 May 2017
Edited: Birju Patel on 8 May 2017

1 vote

Hi,
This is due to a limitation with NVIDIA's GPU-to-GPU communication on Windows. If you have the option, you should consider using Linux for multi-GPU training instead. You should also increase the batch size from 128 to something like 1024. This gives each GPU more work per iteration relative to the communication cost, yielding better overall utilization.
See the following reference page for additional details about the MiniBatchSize:
https://www.mathworks.com/help/nnet/ref/trainingoptions.html#namevaluepairarguments
You can read more about getting the best performance from multi-gpu training here:
https://www.mathworks.com/content/dam/mathworks/tag-team/Objects/d/Deep_Learning_in_Cloud_Whitepaper.pdf
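For reference, the change suggested above might look like this in the demo's training options (a minimal sketch; the solver and the other values shown are illustrative assumptions, not taken from the demo itself):

```matlab
% Illustrative training options: larger mini-batch plus multi-GPU execution.
% 1024 is the value suggested in the answer; other settings are placeholders.
opts = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'multi-gpu', ... % one parallel worker per GPU
    'MiniBatchSize', 1024, ...               % up from the default of 128
    'InitialLearnRate', 0.001, ...
    'MaxEpochs', 30);
```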

6 comments

Joss Knight on 8 May 2017
Edited: Joss Knight on 8 May 2017
There are no hard-and-fast rules for scaling up to multiple GPUs, but generally you should maximize the mini-batch size on each GPU based on its memory capacity, and then scale both the mini-batch size and the learning rate by the number of GPUs. This should give you the maximum possible improvement in throughput and convergence.
However, as pointed out, Windows does suffer from higher communication overhead than Linux. That does not necessarily mean you can't get a benefit; it just means you need to be doing more work on each iteration.
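The scaling rule above can be sketched as follows (the baseline batch size and learning rate here are illustrative assumptions, not values from the thread):

```matlab
% Scale mini-batch size and learning rate with the number of available GPUs.
numGPUs       = gpuDeviceCount;  % e.g. 4 on the asker's machine
baseBatch     = 256;             % largest batch that fits on ONE GPU (assumed)
baseLearnRate = 0.001;           % single-GPU learning rate (assumed)

opts = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'multi-gpu', ...
    'MiniBatchSize', baseBatch * numGPUs, ...
    'InitialLearnRate', baseLearnRate * numGPUs);
```

The idea is that each GPU still processes `baseBatch` images per iteration, so per-GPU work stays constant while the effective batch, and hence the learning rate, grows with the GPU count.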
Thanks a lot for the answers. I missed that the learning rate should increase when the batch size increases. That's helpful.
However, after increasing the batch size to 1024, I still see GPU utilization around 10% on each of the 4 GPUs, which is pretty low. If I use only 1 GPU, increasing the batch size raises utilization to at most 50%. I guess it's an I/O bottleneck when running 4 GPUs. Any way to fix this other than switching to Linux? Thanks.
I believe this issue is particular to R-CNN which has to spend a lot of time processing the images to crop the ROIs. For a single GPU, you should use the parallel option (in your Computer Vision preferences) to improve that. For multi-gpu training, this is something that we're currently working on. Still, the utilisation sounds low, so we'll look into it. How is your data stored? Is your file system particularly slow, perhaps?
Hi Tai-Wu,
There are two network training calls in the demo. The first one calls trainNetwork; the second one calls trainRCNNObjectDetector.
Which one are you reporting utilization numbers for? For example, do you see 50% usage for 1 GPU with trainNetwork or with trainRCNNObjectDetector?
You can comment out the call to trainRCNNObjectDetector and re-run the demo to figure this out.
Hi Birju Patel.
I am setting up a new machine with multiple GPUs for CNN training. Would you suggest installing Windows or Linux as the OS? Do you know if there have been any improvements in Windows since 2017 that address this issue?

Sign in to comment.

Marco Francini on 4 Sep 2017

0 votes

I also have this issue on a system with 2x GTX 1080 Ti GPUs. I use transfer learning (AlexNet) for my application, following https://www.mathworks.com/content/dam/mathworks/tag-team/Objects/d/Deep_Learning_in_Cloud_Whitepaper.pdf, with an ImageDatastore on an SSD drive.
The number of images per second the system can process during training with 2 GPUs is half of what it can do with 1 GPU! Looking at GPU load with GPU-Z, I see that with 2 GPUs the utilization jumps between 40% and 0% continuously, while with one GPU the utilization stays above 50%.
I use Windows 10 Enterprise with Nvidia driver 385.41-desktop-win10-64bit-international-whql.exe installed.
