
parallel.gpu.CUDAKernel

Kernel executable on GPU

Description

A CUDAKernel object represents a CUDA kernel that can execute on a GPU. You create the kernel from CU and PTX files. For an example of how to create and use a CUDAKernel object, see Run CUDA or PTX Code on GPU.

Note

You cannot save or load CUDAKernel objects.

Creation

Description

kern = parallel.gpu.CUDAKernel(ptxFile,cuFile) creates a CUDAKernel object using the PTX code ptxFile and the CUDA® source file cuFile. The PTX file must contain only a single entry point.

Use feval with kern as an input to execute the CUDA kernel on the GPU. For information on executing your kernel object, see Run a CUDAKernel.
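For instance, a minimal sketch of the execution pattern, assuming kern was created from a kernel with the C prototype "float *,float,int" (such as the addToVector kernel in the Examples section):

N = 1000;                                % number of elements to process
kern.ThreadBlockSize = [256 1 1];        % threads per block
kern.GridSize = [ceil(N/256) 1 1];       % enough blocks to cover all N elements
in = gpuArray.ones(N,1,"single");        % input vector on the GPU
out = feval(kern,in,10,N);               % out is a gpuArray equal to in + 10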


kern = parallel.gpu.CUDAKernel(ptxFile,cuFile,func) creates a CUDAKernel for the function entry point defined by func. func must unambiguously define the appropriate kernel entry point in the PTX file.


kern = parallel.gpu.CUDAKernel(ptxFile,cProto) creates a CUDAKernel object using the PTX file ptxFile and the C prototype cProto. cProto is the C function prototype for the kernel call that kern represents. The PTX file must contain only a single entry point.


kern = parallel.gpu.CUDAKernel(ptxFile,cProto,func) creates a CUDAKernel object from a PTX file and C prototype for the function entry point defined by func. func must unambiguously define the appropriate kernel entry point in the PTX file.

Input Arguments


ptxFile - Name of PTX file or PTX code

Name of a PTX file or the contents of a PTX file, specified as a character vector or string scalar. You can provide either the name of a PTX file or pass its contents as a string.

Example: "simpleEx.ptx"

Data Types: char | string

cuFile - Name of CUDA source file

Name of the CUDA source file, specified as a character vector or string scalar.

The function examines the CUDA source file to find the function prototype for the CUDA kernel that is defined in the PTX code. The CUDA source file must contain a kernel definition starting with '__global__'.

Example: "simpleEx.cu"

Data Types: char | string

func - Function entry point

Function entry point, specified as a character vector or string scalar. func must unambiguously define the appropriate entry point in the PTX file.

Note

The parallel.gpu.CUDAKernel function searches for the specified entry point in the PTX file, and matches on any substring occurrences. Therefore, you should not name any of your entry points as substrings of any others.

Example: "add1"

Data Types: char | string

cProto - C prototype for kernel call

C prototype for the kernel call, specified as a character vector or string scalar. Specify multiple input arguments separated by commas.

Example: "float *,float,int"

Data Types: char | string

Properties


ThreadBlockSize - Size of block of threads

Size of a block of threads on the kernel, specified as a vector of positive integers of length 1, 2, or 3 (thread blocks can be up to 3-dimensional). The product of the elements of ThreadBlockSize must not exceed the MaxThreadsPerBlock value for this kernel, and no element of ThreadBlockSize can exceed the corresponding element of the MaxThreadBlockSize property of the GPUDevice object.

Example: [8 8 8]

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

MaxThreadsPerBlock - Maximum number of threads per block

This property is read-only.

Maximum number of threads permissible in a single block for this CUDA kernel. The product of the elements of ThreadBlockSize must not exceed this value.

Example: 1024

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

GridSize - Size of grid of thread blocks

Size of the grid of thread blocks, specified as an integer vector of length 3. This is effectively the number of thread blocks that the GPU launches independently. No element of this vector can exceed the corresponding element of the MaxGridSize property of the GPUDevice object.

Example: [977 1 1]

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
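In practice, you typically derive ThreadBlockSize and GridSize together from the size of your data. A minimal sketch for a 1-D problem, assuming an existing kernel object kern and N elements to process:

N = 250000;                              % number of elements to process
blockSize = 256;                         % must not exceed kern.MaxThreadsPerBlock
kern.ThreadBlockSize = [blockSize 1 1];
kern.GridSize = [ceil(N/blockSize) 1 1]; % 977 blocks cover all 250,000 elements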

SharedMemorySize - Size of dynamic shared memory

The amount of dynamic shared memory (in bytes) that each thread block can use. Each thread block has an available shared memory region. This memory is shared with registers on the multiprocessors. SharedMemorySize must not exceed the MaxShmemPerBlock property of the GPUDevice object.

As with all memory, this must be allocated before the kernel is launched. It is common for the size of this shared memory region to be tied to the size of the thread block. Setting this value on the kernel ensures that each thread in a block can access this available shared memory region.

Example: 16000

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
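For example, to give each thread in a 256-thread block one single-precision element of dynamically allocated shared memory, you can tie SharedMemorySize to the block size. A sketch, assuming the kernel's CUDA source declares extern __shared__ float s[];:

blockSize = 256;
kern.ThreadBlockSize = [blockSize 1 1];
kern.SharedMemorySize = 4*blockSize;     % 4 bytes per single-precision element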

EntryPoint - Entry point name in PTX code

This property is read-only.

The entry point name in the PTX code that this kernel calls.

Example: "_Z13returnPointerPKfPy"

Data Types: char | string

MaxNumLHSArguments - Maximum number of left-hand side arguments

This property is read-only.

The maximum number of left-hand side arguments that the kernel supports. It cannot be greater than the number of right-hand side arguments, and it is smaller if any of the inputs are constant or scalar.

Example: 1

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical

NumRHSArguments - Required number of right-hand side arguments

This property is read-only.

The number of right-hand side arguments required to call this kernel. All inputs must define either the scalar value of an input, the elements of a vector input or input/output, or the size of an output argument.

Example: 5

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

ArgumentTypes - Kernel argument types

This property is read-only.

Cell array of character vectors of length NumRHSArguments. Each character vector indicates the expected MATLAB® data type for that input: a numeric type such as uint8, single, or double, followed by the word scalar or vector to indicate whether the argument is passed by value (scalar) or by reference (vector). Arguments that are only inputs to the kernel are prefixed by in, and input/output arguments are prefixed by inout. This information lets you decide how to call the kernel efficiently with both MATLAB arrays and gpuArray objects, and shows which kernel inputs are treated as outputs.

Example: {'inout double vector'} {'in double vector'} {'in double vector'} {'in uint32 scalar'} {'in uint32 scalar'}

Data Types: cell
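For instance, for the addToVector kernel in the Examples section (C prototype "float *,float,int"), inspecting this property should report one inout vector and two in scalars; the exact display depends on your MATLAB version:

kern = parallel.gpu.CUDAKernel("simpleEx.ptx","simpleEx.cu");
kern.ArgumentTypes
% expected: {'inout single vector'}  {'in single scalar'}  {'in int32 scalar'}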

Object Functions

feval - Evaluate kernel on GPU
setConstantMemory - Set some constant memory on GPU
existsOnGPU - Determine if gpuArray or CUDAKernel is available on GPU
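As an illustration of setConstantMemory, the following sketch assumes that the CUDA source for kern declares a constant memory symbol, for example __constant__ float offset; (a hypothetical name):

% Copy the value 2.5 into the constant memory symbol "offset".
setConstantMemory(kern,"offset",single(2.5));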

Examples


This example shows how to create a CUDAKernel object using a PTX file and a CU file, or using a PTX file and the function prototype.

The CUDA source file simpleEx.cu contains the following code:

/*
* Add a constant to a vector.
*/
__global__ void addToVector(float * pi, float c, int vecLen) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < vecLen) {
        pi[idx] += c;
    }
}

Compile the CU file into a PTX file using mexcuda with the -ptx option.

mexcuda -ptx simpleEx.cu
Building with 'NVIDIA CUDA Compiler'.
MEX completed successfully.

Create a CUDA kernel using the PTX file and the CU file.

kern = parallel.gpu.CUDAKernel("simpleEx.ptx","simpleEx.cu");

Create a CUDA kernel using the PTX file and the function prototype of the addToVector function.

kern = parallel.gpu.CUDAKernel("simpleEx.ptx","float *,float,int");

Both of the preceding statements return a kernel object that you can use to call the addToVector CUDA kernel.
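You can then set the launch configuration and run the kernel. A minimal sketch using one thread per element:

N = 8;
kern.ThreadBlockSize = [N 1 1];          % one block of N threads
in = zeros(N,1,"single","gpuArray");     % vector of zeros on the GPU
out = feval(kern,in,10,N);               % add the constant 10 to every element
gather(out)                              % every element is now 10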

This example shows how to create a CUDAKernel object from a PTX file with more than one entry point.

Suppose your CU file, myfun.cu, contains a function add1 for adding two doubles together and a function add2 for adding two vectors together.

__global__ void add1( double * a, double b ) 
{
    *a += b;
}

__global__ void add2( double * v1, const double * v2 ) 
{
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}

Compile the CU file into a PTX file using mexcuda with the -ptx option.

mexcuda -ptx myfun.cu
Building with 'NVIDIA CUDA Compiler'.
MEX completed successfully.

The PTX file contains two entry points corresponding to the add1 and add2 functions. When your PTX code contains multiple entry points, you must specify an entry point when creating your kernel.

Create a kernel for adding two doubles together and specify the entry point add1.

k = parallel.gpu.CUDAKernel("myfun.ptx","myfun.cu","add1");
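To check the kernel, you can run it with scalar inputs; MATLAB converts both arguments to gpuArray data as needed. A minimal sketch:

out = feval(k,0,42);                     % computes 0 + 42 on the GPU
gather(out)                              % returns 42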

Version History

Introduced in R2010b