Why are multiple GPUs slower than one GPU?

Dear All,
My machine has 2 GPUs. Why is moving data to multiple GPUs about 5x slower in my case than working with just one GPU? Environment: Windows 10, MATLAB R2017b. Here is code and example:
clear;
dd1=rand(100000,200,10 );
cc1=rand(100000,200,10 );
tic
dd=gpuArray(dd1);
cc=gpuArray(cc1);
wait (gpuDevice);
toc
nGPUs = gpuDeviceCount();
parpool('local', nGPUs );
d1=rand(100000,200,10 );
d2(1)={d1(1:50000,:,:)};
d2(2)={d1(50001:100000,:,:)};
c1(1:nGPUs) = {zeros(50000,200,10)};
tic
parfor i = 1:nGPUs
gpuDevice(i);
c=gpuArray(c1{i});
d=gpuArray(d2{i});
end
toc

6 Comments

Does anyone know why this is so? I found one thread that said it could be due to the Windows environment. Is that the cause? And will multiple GPUs not be faster than one?
"Here is code and example:"
But we don't know what you see when you run it, and we don't know what code you are using to compare with 1 GPU.
Sorry if I was not clear at the beginning. Here is the code with comments marking where only one GPU is used and where two are used:
clear;
%with default 1 GPU
dd1=rand(100000,200,10 );
cc1=rand(100000,200,10 );
tic
dd=gpuArray(dd1);
cc=gpuArray(cc1);
wait (gpuDevice);
toc %end of 1 GPU
nGPUs = gpuDeviceCount(); %start pool of 2 GPUS
parpool('local', nGPUs );
d1=rand(100000,200,10 );
d2(1)={d1(1:50000,:,:)};
d2(2)={d1(50001:100000,:,:)};
c1(1:nGPUs) = {zeros(50000,200,10)};
tic
parfor i = 1:nGPUs
gpuDevice(i);
c=gpuArray(c1{i});
d=gpuArray(d2{i});
end
toc
After I run the code I get this result:
Elapsed time is 1.672483 seconds.
Starting parallel pool (parpool) using the 'local' profile ...
connected to 2 workers.
Elapsed time is 22.065303 seconds.
Thus, moving data to one GPU takes 1.672483 seconds, and moving the divided data to two GPUs takes 22.065303 seconds. Or am I understanding this wrong? I have GPU code written in a linear fashion which runs thousands of times faster than on the CPU. I work with big data, and in order to pass more data given the memory limits, I thought that adding a second GPU and dividing the data across both GPUs would increase the speed of the algorithm, but it became slower.
What I managed to do is implement some calculations on one GPU and then on both, using parfor and dividing the data, and the calculations proved to be faster on two GPUs. However, moving data to two GPUs still takes more time than moving it to just one. Is there some way to improve this? The code I use now is below:
clear;
%with default 1 GPU
dd1=rand(10000,4000,10 );
cc1=zeros(10000,4000,10 );
tic%time to move data to one GPU and make calculations on it
dd=gpuArray(dd1);
cc=dd*2;
wait (gpuDevice);
toc %end of 1 GPU
nGPUs = gpuDeviceCount(); %start pool of 2 GPUS
parpool('local', nGPUs );
d1=rand(10000,4000,10 );
d2(1)={d1(1:5000,:,:)};
d2(2)={d1(5001:10000,:,:)};
c1(1:nGPUs) = {zeros(5000,4000,10)};
tstart = tic;% calculate time for moving data to GPU and calculations
parfor i = 1:nGPUs
gd=gpuDevice(i);
c=gpuArray(c1{i});
d=gpuArray(d2{i});
tic% time for calculation on GPU
c=d*2;
wait(gd);
time=toc;
fprintf('Time on GPU: %f\n',time);
end
toc(tstart)
You're not just moving data to two GPUs; you're moving it from the client to the pool, and then onto the GPUs. Communicating between processes takes time. Also, you don't call wait(gd) before you call tic, which means the copy-to-device hasn't finished when you start timing.
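To illustrate the timing point, here is a minimal sketch on a single GPU that waits for the host-to-device copy to finish before starting the compute timer. The array size and the variable names (h, transferTime, computeTime) are assumptions chosen for illustration, not taken from the code above.

```matlab
% Sketch: separate the transfer time from the compute time on one GPU.
gd = gpuDevice;              % handle to the currently selected GPU
h  = rand(1000, 4000, 10);   % host data (illustrative size, an assumption)

tTransfer = tic;
d = gpuArray(h);             % host-to-device copy
wait(gd);                    % block until the copy has finished
transferTime = toc(tTransfer);

tCompute = tic;
c = d * 2;                   % the operation being measured
wait(gd);                    % block until the computation has finished
computeTime = toc(tCompute);

fprintf('Transfer: %f s, compute: %f s\n', transferTime, computeTime);
```

Without the first wait(gd), part of the copy may still be in flight when the second timer starts, so the "compute" figure can include transfer time.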
In a real multi-GPU example you would be doing significant computation and constructing data on the pool, rather than on the client. This example is all overhead and so isn't very representative. You would see a similar issue if you opened a pool of only one worker.
Also, you don't need to select the gpuDevice since selecting a different GPU on each worker is done automatically for communicating jobs.
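A rough sketch of what these suggestions might look like: the data is created directly on each worker's GPU, no gpuDevice(i) call is made, and only a scalar is sent back to the client. The chunk size and the checksum variable are assumptions added for illustration.

```matlab
% Sketch: build the data on the workers so that nothing large crosses
% the client-to-worker boundary.
nGPUs = gpuDeviceCount();
if isempty(gcp('nocreate'))
    parpool('local', nGPUs);     % one worker per GPU
end

rowsPerGPU = 5000;               % illustrative chunk size (assumption)
checksum = zeros(1, nGPUs);

tstart = tic;
parfor i = 1:nGPUs
    % No gpuDevice(i) call: each worker keeps the GPU the pool gave it.
    d = rand(rowsPerGPU, 4000, 10, 'gpuArray');  % created directly on the GPU
    c = d * 2;
    checksum(i) = gather(sum(c(:)));             % return only a scalar
end
toc(tstart)
```

Because gather blocks until the result is available, the elapsed time covers the computation on each worker. The first parfor run after opening a pool also carries some start-up cost, so timing a second run is likely to be more representative.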
Yes, you are right. I did not select gpuDevice and constructed the data on the pool; the speed then improved significantly, and it is faster than one GPU. But that is achieved only if the data is constructed on the pool. If the data is already defined on the client, is there no way to overcome the overhead? Maybe you could help me a bit more? In my experiment I would load data from a file of size (5000000x300x50). How should I move the data to the pool? And how should I divide this data between both GPUs?


Answers (0)
