Hello, I wrote a simple CUDA kernel to measure how fast data moves between global memory and the CUDA cores:
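
    // Copy kernel: each thread copies one float from datain to dataout.
    __global__ void SR2add(float* dataout, const float* datain, int size) {
        int mindex = blockIdx.x * blockDim.x + threadIdx.x;
        if (mindex >= size)   // guard threads past the end of the array
            return;
        dataout[mindex] = datain[mindex];
    }
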
The MATLAB function I wrote for it:

    function GPU_MemBandTest()
        import parallel.gpu.GPUArray
        xsize = 1024;
        ysize = 768;
        vectorsize = xsize*ysize;          % number of single-precision elements
        threadpblock = 1024;
        k = parallel.gpu.CUDAKernel('MemBandTest.ptx', 'MemBandTest.cu');
        k.ThreadBlockSize = [threadpblock,1,1];
        k.GridSize = [ceil(vectorsize/threadpblock),1];
        ddatain = parallel.gpu.GPUArray.zeros(vectorsize,1,'single');
        dataout = rand(vectorsize,1,'single');
        ddataout = GPUArray(dataout);      % copy the host data to the device
        tic
        for i = 1:1000
            [ddataout] = feval(k,ddataout,ddatain,vectorsize);
        end
        time = toc;                        % seconds for 1000 calls = ms per call
        disp(['ms time= ' num2str(time)])
        % bytes of one vector per call, divided by the time per call
        disp([num2str(vectorsize*4/(time*10^6)) 'GB/s'])
    end

I got ms time= 0.73629 and 4.2724GB/s as the result. I would like to ask: 1. Am I doing the measurement correctly? 2. Is there anything I can do to speed up this simple code, or is this an expected result for this kernel in MATLAB?
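
To show where the GB/s figure comes from, here is the arithmetic behind the last disp line as I understand it. The loop runs 1000 calls, so the toc value in seconds is also the time per call in milliseconds, and the byte count is that of a single vector:

```latex
\[
\frac{1024 \times 768 \times 4\ \text{B}}{0.73629 \times 10^{-3}\ \text{s}}
= \frac{3\,145\,728\ \text{B}}{0.73629 \times 10^{-3}\ \text{s}}
\approx 4.27 \times 10^{9}\ \text{B/s}
= 4.27\ \text{GB/s}
\]
```

This counts each element only once. The kernel reads one float and writes one float per element, so the total traffic per call would be twice this if both directions are counted; I am not sure which figure is the right one to quote.
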
I have MATLAB R2011a, CUDA Toolkit 3.2, and a GT 425M device with the newest driver installed.
If I use float* datain instead of const float* datain, the execution time goes up to 2.4 ms.
3. What could explain this?
Thanks to anyone who helps,
Gaszton