Why are multiple GPUs slower than one GPU?


Mantas Vaitonis 2018-10-4

https://ww2.mathworks.cn/matlabcentral/answers/422123-why-multiple-gpus-slower-than-one-gpu

Commented: Mantas Vaitonis 2018-10-7

Dear All,

My machine has 2 GPUs. Why is moving data to multiple GPUs about 5x slower in my case than working with just one GPU? Environment: Windows 10, MATLAB R2017b. Here is the code:

clear;
dd1=rand(100000,200,10 );
cc1=rand(100000,200,10 );
tic
dd=gpuArray(dd1);
cc=gpuArray(cc1);
wait (gpuDevice);
toc
nGPUs = gpuDeviceCount();
parpool('local', nGPUs );
d1=rand(100000,200,10 );
d2(1)={d1(1:50000,:,:)};
d2(2)={d1(50001:100000,:,:)};
c1(1:nGPUs) = {zeros(50000,200,10)};
tic
parfor i = 1:nGPUs
  gpuDevice(i);
  c=gpuArray(c1{i});
  d=gpuArray(d2{i}); 
end
toc

6 comments

Mantas Vaitonis 2018-10-5

Sorry if I was not clear at the beginning. Here is the code with comments marking the part that uses only one GPU and the part that uses two:

clear;
%with default 1 GPU
dd1=rand(100000,200,10 );
cc1=rand(100000,200,10 );
tic
dd=gpuArray(dd1);
cc=gpuArray(cc1);
wait (gpuDevice);
toc %end of 1 GPU
nGPUs = gpuDeviceCount(); %start pool of 2 GPUS
parpool('local', nGPUs );
d1=rand(100000,200,10 );
d2(1)={d1(1:50000,:,:)};
d2(2)={d1(50001:100000,:,:)};
c1(1:nGPUs) = {zeros(50000,200,10)};
tic
parfor i = 1:nGPUs
  gpuDevice(i);
  c=gpuArray(c1{i});
  d=gpuArray(d2{i});
end
toc

After I run the code I get this result:

Elapsed time is 1.672483 seconds.
Starting parallel pool (parpool) using the 'local' profile ...
connected to 2 workers.
Elapsed time is 22.065303 seconds.

Thus, moving data to one GPU takes 1.672483 seconds, and moving the divided data to two GPUs takes 22.065303 seconds. Or am I understanding this wrong? I have GPU code, written in a linear fashion, that runs thousands of times faster than on the CPU. I work with big data, and to be able to pass more data given the memory limits, I thought that adding a second GPU and dividing the data across both GPUs would speed up the algorithm, but it became slower.

Joss Knight 2018-10-6

You're not just moving data to two GPUs, you're moving it from the client to the pool, and then onto the GPUs. Communicating between processes takes time. Also, you don't call wait(gp) before you call tic which means the copy-to-device hasn't finished when you start timing.
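As an illustration of the timing point, here is a minimal sketch of timing a host-to-device copy on a single GPU with a `wait` on both sides of the measured region. The array size and the variable names (`hostData`, `devData`, `tTransfer`) are assumptions chosen for the example, not taken from the code above.

```matlab
gpu = gpuDevice;                  % handle to the currently selected GPU
hostData = rand(10000, 200, 10);  % assumed size, smaller than in the question

wait(gpu);                        % drain any pending GPU work before starting the clock
tic
devData = gpuArray(hostData);     % host-to-device copy
wait(gpu);                        % block until the copy has actually finished
tTransfer = toc;

fprintf('Host-to-device copy: %.4f s\n', tTransfer);
```

Without the second `wait`, `toc` may return before the GPU has finished, so the measured time can be too short.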

In a real multi-GPU example you would be doing significant computation and constructing data on the pool, rather than on the client. This example is all overhead and so isn't very representative. You would see a similar issue if you opened a pool of only one worker.

Also, you don't need to select the gpuDevice since selecting a different GPU on each worker is done automatically for communicating jobs.
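A rough sketch of what "constructing data on the pool" could look like, assuming one worker per GPU and relying on the automatic GPU assignment described above. The computation `c + 2*d` is only a placeholder for real work, and the variable names are made up for this example.

```matlab
% Open a pool with one worker per GPU if none is running yet.
if isempty(gcp('nocreate'))
    parpool('local', gpuDeviceCount);
end

spmd
    gpu = gpuDevice;        % GPU already assigned to this worker, no gpuDevice(i) call
    gpuIndex = gpu.Index;   % recorded only to check which device each worker got
    wait(gpu);
    tic
    % Data is created directly on the GPU, so nothing is sent from the client.
    d = rand(50000, 200, 10, 'gpuArray');
    c = zeros(50000, 200, 10, 'gpuArray');
    c = c + 2*d;            % placeholder computation (assumption)
    wait(gpu);              % make sure the GPU has finished before stopping the clock
    t = toc;
end

% t and gpuIndex are Composite objects with one entry per worker.
for k = 1:length(t)
    fprintf('Worker %d on GPU %d: %.4f s\n', k, gpuIndex{k}, t{k});
end
```

Only the small scalars `t` and `gpuIndex` travel back to the client here, which keeps the inter-process communication out of the measurement.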

Mantas Vaitonis 2018-10-7

Yes, you are right. When I did not select gpuDevice and constructed the data on the pool, the speed improved significantly, and it is now faster than one GPU. But that only works if the data is constructed on the pool. If the data is already defined on the client, is there no way to overcome the overhead? Maybe you could help me a bit more? In my experiment I would load data from a file of size (5000000x300x50). How should I move the data to the pool? And how should I divide this data between both GPUs?
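One possible approach, sketched here under several assumptions, is to let each worker read its own block of rows straight from the file with `matfile`, so the data never passes through the client. The file name `bigdata_demo.mat`, the variable name `X`, and the small demo array are hypothetical stand-ins. Partial reads with `matfile` are only efficient for MAT-files saved with `-v7.3`, and splitting by rows assumes the algorithm can treat row blocks independently.

```matlab
% Create a small demo file (stand-in for the real data file).
fileName = fullfile(tempdir, 'bigdata_demo.mat');   % hypothetical file name
X = rand(1000, 30, 5);                              % hypothetical variable
save(fileName, 'X', '-v7.3');                       % v7.3 allows partial loading
clear X

if isempty(gcp('nocreate'))
    parpool('local', gpuDeviceCount);
end

spmd
    m = matfile(fileName);       % opens the file without loading X
    sz = size(m, 'X');           % dimensions read from the file header
    nRows = sz(1);

    % Contiguous block of rows for this worker.
    edges = round(linspace(0, nRows, numlabs + 1));
    rows = (edges(labindex) + 1):edges(labindex + 1);

    Xpart = gpuArray(m.X(rows, :, :));   % read only this block, then copy to the GPU
    nLocal = size(Xpart, 1);             % rows held by this worker
    % ... computation on Xpart would go here ...
end
```

At the size mentioned (5000000x300x50 in double precision, roughly 600 GB), a worker's share would not fit in GPU memory at once, so each worker would presumably have to loop over smaller row blocks in the same way.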
