coder.gpu.kernel

Pragma that maps for-loops to GPU kernels

Description

coder.gpu.kernel() is a loop-level pragma that you must place immediately before a for-loop. This pragma generates a kernel with the dimensions computed from the loop parameters.
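
For example, the following sketch (the function, loop body, and variable names are illustrative and not part of the shipped example) shows the required placement of the pragma:

function y = scaleVector(x)  %#codegen
    y = zeros(size(x));
    % The pragma must directly precede the for-loop it applies to.
    coder.gpu.kernel();
    for i = 1:numel(x)
        y(i) = 2*x(i);   % each iteration maps to one GPU thread
    end
end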

Note

The coder.gpu.kernel pragma overrides all parallel loop analysis checks that the software performs. Use coder.gpu.kernelfun before resorting to the coder.gpu.kernel pragma.

Note

Using the coder.gpu.kernel pragma for for-loops that contain reductions is not recommended.

coder.gpu.kernel(B,T) generates a kernel with the dimensions specified by B and T. B is an array of the form [Bx,By,1] that defines the number of blocks in the grid along the x and y dimensions (the z dimension is not used). T is an array of the form [Tx,Ty,Tz] that defines the number of threads in the block along the x, y, and z dimensions.

A value of -1 for B and T indicates that GPU Coder™ must infer the grid and block dimensions automatically. The coder.gpu.kernel pragma generates errors for invalid grid and block dimensions.
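
For example, the following sketch (block and thread counts chosen for illustration) requests a grid of 8 blocks with 256 threads per block along the x dimension; the loop has 8*256 = 2048 iterations, one per thread:

function c = addVectors(a,b)  %#codegen
    % Sketch assuming a and b each have at least 2048 elements.
    c = zeros(size(a));
    coder.gpu.kernel([8 1 1],[256 1 1]);
    for i = 1:2048
        c(i) = a(i) + b(i);
    end
end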

coder.gpu.kernel(B,T,M,name) specifies optional arguments M and name. M is a positive integer that specifies the minimum number of blocks per streaming multiprocessor. Increasing M can reduce the register usage within a kernel and improve kernel occupancy. A value of -1 for M indicates that GPU Coder must use the default value of 1. name is a character array that allows you to customize the name of the generated kernel.
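
For example, this sketch (the kernel name and the value of M are illustrative) lets GPU Coder infer the grid and block dimensions, requests at least two blocks per streaming multiprocessor, and names the generated kernel addKernel:

function c = namedAdd(a,b)  %#codegen
    c = zeros(size(a));
    % -1 for B and T lets GPU Coder infer the dimensions; M = 2 requests at
    % least two blocks per streaming multiprocessor; 'addKernel' names the kernel.
    coder.gpu.kernel(-1,-1,2,'addKernel');
    for i = 1:4096
        c(i) = a(i) + b(i);
    end
end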

Specifying the kernel pragma overrides all parallel loop analysis checks. This override allows loops to be parallelized in situations where parallel loop analysis cannot prove that all iterations are independent of each other. Ensure that the loop is safe to parallelize.

This function is a code generation function. It has no effect in MATLAB®.

Examples

This example shows how to use the kernel pragma in a function and generate CUDA® code.

In one file, write the entry-point function scalars that accepts two vector inputs, x and y, of size 1-by-4096 and one scalar input, scale. The function has two for-loops of different iteration lengths, one for vector addition and one for computing the cumulative sum. Place the coder.gpu.kernel(-1,128) pragma immediately before the vector addition loop. This pragma creates a kernel in which GPU Coder infers the number of blocks and allocates 128 threads per block. Place the coder.gpu.kernel() pragma immediately before the cumulative summation loop to generate a kernel with the dimensions computed from the loop parameters.

function [vout, sout1] = scalars(x,y,scale)
    sout1 = 0;
    vout = zeros(size(x));
    
    coder.gpu.kernel(-1,128);
    for i=1:1024
        vout(i) = x(i) + y(i);
    end
    
    coder.gpu.kernel();
    for i=1:4096
        sout1 = (x(i)*scale) + sout1;    
    end
end

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex')...
 -args {ones(1,4096,'double'),ones(1,4096,'double'),coder.typeof(0)}...
 -report scalars

GPU Coder creates two kernels: scalars_kernel1 for the vector addition and scalars_kernel2 for the cumulative sum. No kernel is needed to initialize sout1 = 0.

  scalars_kernel1<<<dim3(8U, 1U, 1U), dim3(128U, 1U, 1U)>>>(*gpu_y, *gpu_x,
                                                            *gpu_vout);
  scalars_kernel2<<<dim3(4U, 1U, 1U), dim3(1024U, 1U, 1U)>>>(scale, *gpu_x,
                                                             gpu_sout1);

scalars_kernel1 has 8 blocks with 128 threads per block, for a total of 1024 threads, one for adding each element. scalars_kernel2 has 4 blocks with 1024 threads per block, for a total of 4096 threads.

You can use variables or expressions when specifying the kernel dimensions. For example, you can rewrite the scalars entry-point function so that the grid and block dimensions are specified at compile time.

function [vout, sout1] = scalars(x,y,scale, a)
    sout1 = 0;
    vout = zeros(size(x));
    
    coder.gpu.kernel(1,a);
    for i=1:1024
        vout(i) = x(i) + y(i);
    end
    
    coder.gpu.kernelfun();
    for i=1:length(x)
        sout1 = (x(i)*scale) + sout1;    
    end
end

Use the codegen function to generate the CUDA MEX function.

codegen -config coder.gpuConfig('mex')...
 -args {ones(1,4096,'double'),ones(1,4096,'double'),20,1024}...
 -report scalars

Version History

Introduced in R2017b