very simple cuda ptx code memory speed

Question

Gaszton 2011-5-19

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/7803-very-simple-cuda-ptx-code-memory-speed

Hello, I wrote a simple CUDA kernel to measure how fast the CUDA device can move data through global memory:

    __global__ void SR2add(float* dataout, const float* datain, int size) {
        int mindex = blockIdx.x * blockDim.x + threadIdx.x;
        if (mindex >= size)
            return;
        dataout[mindex] = datain[mindex];
    }

The MATLAB function I wrote for it:

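    function GPU_MemBandTest()
        import parallel.gpu.GPUArray
        % Problem size: one single-precision element per pixel
        xsize = 1024;
        ysize = 768;
        vectorsize = xsize * ysize;
        threadpblock = 1024;
        % Load the compiled kernel and set its launch configuration
        k = parallel.gpu.CUDAKernel('MemBandTest.ptx', 'MemBandTest.cu');
        k.ThreadBlockSize = [threadpblock, 1, 1];
        k.GridSize = [ceil(vectorsize / threadpblock), 1];
        % Device input (zeros) and device output (initialized from host data)
        ddatain = parallel.gpu.GPUArray.zeros(vectorsize, 1, 'single');
        dataout = rand(vectorsize, 1, 'single');
        ddataout = GPUArray(dataout);
        tic
        for i = 1:1000
            [ddataout] = feval(k, ddataout, ddatain, vectorsize);
        end
        % toc is the total in seconds for 1000 launches,
        % which is numerically the time per launch in milliseconds
        time = toc;
        disp(['ms time= ' num2str(time)])
        % bytes / (ms * 10^6) gives GB/s; this counts vectorsize*4 bytes only
        disp([num2str(vectorsize*4/(time*10^6)) 'GB/s'])
    end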

I got ms time= 0.73629 and 4.2724GB/s for that. I would like to ask:

1. Am I doing the measurement correctly?
2. Is there anything I can do to speed up this simple code, or is this an expected result for this kernel in MATLAB?

I have MATLAB R2011a, CUDA Toolkit 3.2, and a GT 425M device with the newest driver installed.

If I use float* datain instead of const float* datain, the execution time goes up to 2.4 ms.

3. What could explain this?

Thanks to anyone who helps,

Gaszton


Answer 1

Edric Ellis 2011-5-20

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/7803-very-simple-cuda-ptx-code-memory-speed#answer_10776

To answer your questions:

1. Your bandwidth calculation is off by a factor of 2: the device both reads and writes the data, so the total data transferred is twice the size of the array.
2. On my machine, with that factor of 2, I get 40 GB/s on a C2070, which is fairly reasonable. The rate increases as the size of the data increases.
3. If you make datain non-const, we treat it as an input-output variable (you can see this from the properties of the CUDAKernel) and allocate space to store the result. So, for CUDAKernel, const-correctness is actually very important for performance!
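
As a rough sketch of the factor-of-2 correction, the snippet below redoes the bandwidth arithmetic with the numbers reported in the question. The variable names are made up for illustration, and it assumes the timing of 0.73629 ms per launch is taken at face value; it does not launch the kernel or need a GPU.

```matlab
% Effective bandwidth of a copy kernel: each element is read once and
% written once, so the bytes moved are twice the array size.
% All variable names here are illustrative, not from the original post.
numElements  = 1024 * 768;   % vectorsize from the question
bytesPerElem = 4;            % single precision
msPerLaunch  = 0.73629;      % per-launch time reported in the question

bytesMoved = 2 * numElements * bytesPerElem;          % read + write
gbPerSec   = bytesMoved / (msPerLaunch * 1e-3) / 1e9; % bytes/s -> GB/s

fprintf('%.4f GB/s\n', gbPerSec);
```

With these inputs the result is about 8.54 GB/s, twice the 4.2724 GB/s printed by the original function.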
