kernel cuda gpu slow, why?

Question

jorge 2014-4-27

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/127411-kernel-cuda-gpu-slow-why

评论： Joss Knight 2014-5-14

I have a code that performs matrix multiplications, I decided implemeatar on gpu with kernel.cu kernel.ptx the matrix multiplications.

the kernel is generic for matrix multiplication of size (DIMX*DIMY)*(DIMY*DIMZ)

kernel:

__global__ void MatMultNoShared(float* A, float* B, float* C, 
                                int ARows, int ACols, int BRows, 
                                int BCols, int CRows, int CCols, 
                                int TILE_DIM)
{
 float CValue = 0;
 //indice of threads
 int Col = blockIdx.y*TILE_DIM + threadIdx.y;
 int Row = blockIdx.x*TILE_DIM + threadIdx.x;
 for(int k=0; k<(TILE_DIM + ACols-1)/TILE_DIM; k++)
 {
    for(int n=0;n<TILE_DIM; ++n)
    {
    if((k*TILE_DIM + n<ACols && Row < ARows) && 
            (k*TILE_DIM + n < BRows && Col < BCols ))
        CValue += A[Row*ACols + k*TILE_DIM + n]*B[(k*TILE_DIM + n)*BCols + Col];
     }
  }
     if (Row < CRows && Col < CCols) 
       {
     C[((blockIdx.y * blockDim.y + threadIdx.y)*CCols)+(blockIdx.x * blockDim.x) +           threadIdx.x] = CValue;
  }
}

result:

matrix    |    time             |      GFLOP
------------------------------------------------------
          | gpuArray | kernel   | gpuArray   | kernel 
1024*1024 | 0.009165 | 0.131942 | 234.195459 | 16.268014 
2048*2048 | 0.049744 | 0.918414 | 345.279053 | 18.701454 
3072*3072 | 0.160027 | 3.105348 | 362.267296 | 18.668641 
4096*4096 | 0.375305 | 7.339200 | 366.161794 | 18.724408

this kernel(MatMultNoShared) is too slow.

kernel slow, why? I expected that the kernel was 10 times faster than the gpuArray

The second kernel is only to square matrix

#define TILE_DIM 16
__global__ void simpleMultiply(float *a, float* b, float *c, int N)
{
    int ACols=N;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
      for(int k=0; k<(TILE_DIM + ACols-1)/TILE_DIM; k++)
      {
          for (int i = 0; i < TILE_DIM; i++) 
          {
              sum += a[row*TILE_DIM+i] * b[i*N+col];
          }
      }
      c[row*N+col] = tanh(sum);
  }

result:

   matrix    |    time             |      GFLOP
  ------------------------------------------------------
            | gpuArray | kernel   | gpuArray   | kernel 
  1024*1024  0.009609   0.006878   223.384168   312.054593 
  2048*2048  0.050874   0.039116   337.615311   439.097961 
  3072*2048  0.160703   0.123678   360.742749   468.736544 
  4096*2048  0.375426   0.290781   366.043310   472.596134

similar time, similar GFLOP kernel slow, why? I expected that the kernel was 10 times faster than the gpuArray

the last kernel, with shared memory

#define TILE_DIM 16
__global__ void simpleMultiply(float *a, float* b, float *c, int N)
{
    int ACols=N;
    __shared__ float aTile[TILE_DIM][TILE_DIM];
                     bTile[TILE_DIM][TILE_DIM+1];
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      float sum = 0.0f;
      aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x]; 
      bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y*N+col];
      for(int k=0; k<(TILE_DIM + ACols-1)/TILE_DIM; k++)
      {
            aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
            bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y*N+col];
          for (int i = 0; i < TILE_DIM; i++) 
          {
              sum += aTile[threadIdx.y][i]* b[i*N+col];
          }
      }
      c[row*N+col] = (sum);
  }

result:

matrix    |    time             |      GFLOP
  ------------------------------------------------------
            | gpuArray | kernel   | gpuArray   | kernel 
1024*1024   0.008948   0.008453    239.872186 253.939580 
2048*2048   0.051383   0.051315    334.269557 334.712226 
3072*3072   0.162450   0.167174    356.863683 346.779796 
4096*4096   0.375368   0.393502    366.099592 349.228523

I don't understand, why the last kernel is slower than the previous kernel?

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

Joss Knight 2014-5-14

This question is confusing on many levels. Can you clarify the following?

Why are you expecting your kernels to run faster than the gpuArray implementation?
What is your launch configuration? Please show the MATLAB code you used to launch your kernel.
How are you collecting your timings?
You ask why your 2nd and 3rd kernels are slower than either the gpuArray version or the previous kernels, when actually they are faster. What is slower than you expect and why were you not expecting that?
Why does your 2nd kernel call tanh on the result?
Have you checked that your kernels give the right answers? I don't understand how the 2nd and 3rd kernels could be giving the right answer: they seem to be accessing the wrong elements of A; the 3rd kernel doesn't sync threads before using data that it has copied into shared memory; it doesn't actually use the bTile array; and I may be wrong but since each thread is writing a single output element, it can't be right to be indexing the shared memory with the thread index since that is fixed for this thread and you need each thread to traverse the input arrays. Also, MATLAB defines and accesses matrices in column-major order, while you appear to be accessing the data in row major order in all your kernels. So do confirm that you've checked the output is correct for all these kernels.