Hello, I made a simple CUDA kernel to measure the global memory transfer speed to the CUDA processors:
__global__ void SR2add(float* dataout, const float* datain, int size) {
    int mindex = blockIdx.x * blockDim.x + threadIdx.x;
    if (mindex >= size)
        return;
    dataout[mindex] = datain[mindex];
}
The MATLAB function I wrote for it:
function GPU_MemBandTest()
    import parallel.gpu.GPUArray
    xsize=1024;
    ysize=768;
    vectorsize=xsize*ysize;
    threadpblock=1024;
    k=parallel.gpu.CUDAKernel('MemBandTest.ptx', 'MemBandTest.cu');
    k.ThreadBlockSize=[threadpblock,1,1];
    k.GridSize=[ceil(vectorsize/threadpblock),1];
    ddatain=parallel.gpu.GPUArray.zeros(vectorsize,1,'single');
    dataout=rand(vectorsize,1,'single');
    ddataout=GPUArray(dataout);
    tic
    for i=1:1000
        [ddataout]=feval(k,ddataout,ddatain,vectorsize);
    end
    time=toc;
    disp(['ms time= ' num2str(time)])
    disp([num2str(vectorsize*4/(time*10^6)) 'GB/s'])
end
Running it, I got ms time= 0.73629 and 4.2724GB/s. I would like to ask:
1. Am I doing the measurement correctly?
2. Is there anything I can do to speed up this simple code, or is this an expected result for this kernel in MATLAB?
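To make the arithmetic behind that figure explicit, here is a sketch of my own reasoning. It assumes that toc returns the total time in seconds for all 1000 launches, so the same number can be read as milliseconds per launch, and that N = 1024 × 768 = 786432 single-precision values of 4 bytes each are copied per launch. The printed value counts only the bytes written. Counting both the read and the write would double it, but whether that is the right definition of bandwidth here is an assumption on my part.

```latex
\begin{align*}
N &= 1024 \times 768 = 786432, \qquad t \approx 0.73629\ \text{ms per launch} \\
B_{\text{one-way}} &= \frac{4N}{t}
  = \frac{3145728\ \text{B}}{0.73629 \times 10^{-3}\ \text{s}}
  \approx 4.27\ \text{GB/s} \\
B_{\text{read+write}} &= \frac{2 \cdot 4N}{t} \approx 8.54\ \text{GB/s}
\end{align*}
```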
I have MATLAB R2011a, CUDA Toolkit 3.2, and a GT 425M device with the newest driver installed.
If I use float* datain instead of const float* datain, the execution time goes up to 2.4 ms.
3. What could be the explanation for this?
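To cross-check the numbers outside MATLAB, a standalone harness along these lines might help. This is only a minimal sketch: the kernel name copyKernel, the warm-up launch, the CUDA event timing and the choice to count read plus write traffic are my own assumptions, and it is not what MATLAB does internally.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same element-wise copy as the kernel above (copyKernel is a hypothetical name).
__global__ void copyKernel(float* dataout, const float* datain, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
        dataout[idx] = datain[idx];
}

int main() {
    const int size = 1024 * 768;             // same vector length as the MATLAB test
    const int threads = 1024;                // same block size as the MATLAB test
    const int blocks = (size + threads - 1) / threads;
    const int iterations = 1000;
    const size_t bytes = size * sizeof(float);

    float* din = 0;
    float* dout = 0;
    if (cudaMalloc((void**)&din, bytes) != cudaSuccess ||
        cudaMalloc((void**)&dout, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(din, 0, bytes);
    cudaMemset(dout, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not included in the timing.
    copyKernel<<<blocks, threads>>>(dout, din, size);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iterations; ++i)
        copyKernel<<<blocks, threads>>>(dout, din, size);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait until all launches have finished

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    // Assumption: count bytes read plus bytes written per launch.
    double msPerLaunch = ms / iterations;
    double gbPerSec = 2.0 * bytes / (msPerLaunch * 1.0e6);
    printf("ms per launch = %f, bandwidth = %f GB/s\n", msPerLaunch, gbPerSec);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(din);
    cudaFree(dout);
    return 0;
}
```

Assuming the file is saved as membandtest.cu (a hypothetical name), it can be built with nvcc membandtest.cu -o membandtest. Because the events are recorded on the GPU timeline, the measured interval covers the kernel launches themselves and not any work the host does afterwards.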
Thanks to anyone who helps,
Gaszton