gather takes really long after using ptx file /CUDA
I am trying to implement matrix multiplication with CUDA via a PTX file to gain an advantage over MATLAB's built-in functions. My .cu kernel computes the matrix product faster than MATLAB's built-in function, but the gather command after running the kernel takes much longer than it does after using MATLAB commands on gpuArrays:
Here is my MATLAB file to compare both:
g=gpuDevice
reset(g)
clear all
N=1024;
A=rand(N,N);
B=rand(N,N);
%gpuDevice using Matlab
A_gpu=gpuArray(A);%Create array on GPU
B_gpu=gpuArray(B);%Create array on GPU
tic
C_gpu=A_gpu*B_gpu;
toc
tic
C=gather(C_gpu);%copy array from GPU to CPU
toc
%now using CUDA
A=A';
a_gpu=gpuArray(A(:)');%Create array on GPU make vector
b_gpu=gpuArray(B(:)');%Create array on GPU make vector
c_gpu=gpuArray(zeros(N*N,1));
k = parallel.gpu.CUDAKernel('matrixmul.ptx', 'matrixmul.cu');
k.ThreadBlockSize = [N,1,1];
k.GridSize=[N,N];
tic
[o] = feval(k, c_gpu,a_gpu,b_gpu);
o=reshape(o,N,N);
toc
tic
c2=gather(o);%back to host
toc
%check
max(max(abs(C-c2)))
My .cu file looks like this:
__global__ void matrixmul(double *c, double *a, double *b)
{
    __shared__ double cache[1024];
    int cacheIndex = threadIdx.x;
    int Aind = threadIdx.x + blockIdx.x * gridDim.y;
    int Bind = threadIdx.x + blockIdx.y * gridDim.x;
    cache[cacheIndex] = a[Aind] * b[Bind];   // one product per thread
    __syncthreads();
    // parallel reduction in shared memory to sum the products of this block
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }
    if (cacheIndex == 0)
        c[blockIdx.y * gridDim.y + blockIdx.x] = cache[0];
}
In my version I work directly with vectors instead of matrices, and I transpose A before starting the calculation to get a better element order inside the vector for the memory accesses (see the small layout sketch below).
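To make the layout argument concrete, here is a minimal sketch (a hypothetical 3-by-3 example, not part of the original script). MATLAB stores matrices column-major, so linearizing A' places each row of A contiguously in memory, while B(:) keeps each column of B contiguous; consecutive threads of a block therefore read neighbouring elements.
A = reshape(1:9, 3, 3);  % small example matrices
B = reshape(1:9, 3, 3);
At = A';                 % transpose first ...
a_vec = At(:)';          % ... then linearize: [row 1 of A, row 2 of A, row 3 of A]
b_vec = B(:)';           % column-major:       [column 1 of B, column 2 of B, column 3 of B]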
That's what I get back:
Elapsed time is 0.110911 seconds.
Elapsed time is 0.007010 seconds.
Elapsed time is 0.001937 seconds.
Elapsed time is 3.651635 seconds.
ans =
1.0800e-12
As you can see, the first gather command takes only 0.007 seconds, while the second one needs more than 3 seconds. Also, if I put all my calling code into a function, the call of that function alone takes a long time (without even reading the gpuArray).
Any suggestions as to what is going wrong here?
Thanks
Robert
Answers (1)
James Lebak
2013-1-24
In MATLAB R2012a and later, GPU functions execute asynchronously: the call returns as soon as the work has been queued on the device. Your toc after feval therefore measures little more than the kernel launch, and the following gather has to wait for the kernel to finish before it can copy the result back, which is why it appears to be so slow. To get accurate timings, you need to call the wait function to make sure that GPU execution is finished. To accurately measure the time taken by MATLAB's multiply or by your kernel, rewrite your code as follows:
tic
C_gpu=A_gpu*B_gpu;
wait(g); % g is the value returned by gpuDevice, above
toc
tic
[o] = feval(k, c_gpu,a_gpu,b_gpu);
o=reshape(o,N,N);
wait(g); % g is the value returned by gpuDevice, above
toc
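A side note not in the original answer: newer MATLAB releases provide gputimeit, which takes care of this synchronization automatically, so a sketch of the same measurements (reusing k, A_gpu, B_gpu, c_gpu, a_gpu, b_gpu and o from the question) could look like this:
tMatlab = gputimeit(@() A_gpu*B_gpu);                   % built-in multiply
tKernel = gputimeit(@() feval(k, c_gpu, a_gpu, b_gpu)); % PTX kernel
tGather = gputimeit(@() gather(o));                     % device-to-host copy only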