Pass GPU Inputs to Entry-Point Functions

Since R2024a

This example shows how to configure GPU Coder™ to pass GPU inputs to entry-point functions and produce GPU outputs. When the caller of the entry-point function creates the inputs on the GPU and the entry-point function accesses them there, you avoid unnecessary copies of inputs and outputs between CPU and GPU memory. This approach can improve the performance of generated code when you integrate it with code that produces and consumes data on a GPU. This example also demonstrates how to generate code for functions that accept GPU inputs of unknown size by using the emxArray data type.

Third-Party Prerequisites

  • CUDA-enabled NVIDIA® GPU and compatible driver.

Verify GPU Environment

To verify that the compilers and libraries necessary for running this example are set up correctly, use the coder.checkGpuInstall function.

envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);

The sobelEdgeDetection Entry-Point Function

The sobelEdgeDetection entry-point function implements a Sobel edge detection algorithm. It takes an input image and produces an output image that shows the edges.

type sobelEdgeDetection.m
function outputImg = sobelEdgeDetection(inputImg)
%

% Copyright 2023 The MathWorks, Inc.
    coder.gpu.kernelfun();
    inputSize = size(inputImg);
    outputSize = inputSize - 2;
    outputImg = zeros(outputSize, 'like', inputImg);
    inputImg = double(inputImg);
    for colIdx = 1:outputSize(2)
        for rowIdx = 1:outputSize(1)
            hDiff = inputImg(rowIdx, colIdx) + 2*inputImg(rowIdx, colIdx + 1) + inputImg(rowIdx, colIdx + 2) - ...
                inputImg(rowIdx + 2, colIdx) - 2*inputImg(rowIdx + 2, colIdx + 1) - inputImg(rowIdx + 2, colIdx + 2);
            vDiff = inputImg(rowIdx, colIdx) + 2*inputImg(rowIdx + 1, colIdx) + inputImg(rowIdx + 2, colIdx) - ...
                inputImg(rowIdx, colIdx + 2) - 2*inputImg(rowIdx + 1, colIdx + 2) - inputImg(rowIdx + 2, colIdx + 2);
            diff = hDiff*hDiff + vDiff*vDiff;
            if diff > 3600
                outputImg(rowIdx, colIdx) = 255;
            else
                outputImg(rowIdx, colIdx) = 0;
            end
        end
    end
end
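
For checking results outside MATLAB, the same computation can be sketched as a plain C++ CPU reference. This is an illustrative sketch, not code that GPU Coder generates: the function name sobelReference is hypothetical, and the sketch assumes a uint8 image stored column-major, which is the MATLAB default layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for the sobelEdgeDetection listing (illustrative, not generated code).
// The image is stored column-major: element (r, c) lives at index r + rows * c.
std::vector<std::uint8_t> sobelReference(const std::vector<std::uint8_t>& in,
                                         std::size_t rows, std::size_t cols) {
    assert(rows >= 3 && cols >= 3 && in.size() == rows * cols);
    const std::size_t outRows = rows - 2;
    const std::size_t outCols = cols - 2;
    std::vector<std::uint8_t> out(outRows * outCols, 0);

    // Zero-based pixel access, widened to double like double(inputImg) in the listing.
    auto px = [&](std::size_t r, std::size_t c) {
        return static_cast<double>(in[r + rows * c]);
    };

    for (std::size_t c = 0; c < outCols; ++c) {
        for (std::size_t r = 0; r < outRows; ++r) {
            // Top row minus bottom row of the 3-by-3 window (hDiff in the listing).
            const double hDiff = px(r, c) + 2 * px(r, c + 1) + px(r, c + 2)
                               - px(r + 2, c) - 2 * px(r + 2, c + 1) - px(r + 2, c + 2);
            // Left column minus right column of the window (vDiff in the listing).
            const double vDiff = px(r, c) + 2 * px(r + 1, c) + px(r + 2, c)
                               - px(r, c + 2) - 2 * px(r + 1, c + 2) - px(r + 2, c + 2);
            const double diff = hDiff * hDiff + vDiff * vDiff;
            // Same threshold as the listing: mark the pixel as an edge above 3600.
            out[r + outRows * c] = (diff > 3600.0) ? 255 : 0;
        }
    }
    return out;
}
```

A uniform image yields an all-zero output, while a sharp step between neighboring columns yields 255 at every output pixel whose window spans the step.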

Generate GPU Code and Run gpuPerformanceAnalyzer on CPU

Use coder.gpuConfig to create a GPU code configuration object and use the codegen command to generate a MEX function.

cfg = coder.gpuConfig("mex");
imRGB = imread("peppers.png");
imGray = rgb2gray(imRGB);
codegen -config cfg -args {imGray} sobelEdgeDetection
Code generation successful.
gpuPerformanceAnalyzer("sobelEdgeDetection",{imGray},Config=cfg,OutFolder="sobelEdgeWithCPUIO");
### Starting GPU code generation
Code generation successful: View report

### GPU code generation finished
### Starting application profiling
### Application profiling finished
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data

By default, GPU Coder expects the inputs to come from the CPU and produces the outputs on the CPU. The generated code copies the input data from the CPU to the GPU before running the computation on the GPU, then copies the results back to the CPU.
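
To get a feel for how much data these copies move, the following C++ sketch computes the bytes transferred per call. It assumes one copy of the input to the GPU and one copy of the output back, and it uses the (rows - 2)-by-(cols - 2) output size from sobelEdgeDetection. The names TransferBytes and sobelTransferBytes are hypothetical helpers for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved across the CPU-GPU boundary for one call with CPU inputs and outputs.
// Assumes a rows-by-cols input and the (rows - 2)-by-(cols - 2) output of
// sobelEdgeDetection, both with bytesPerPixel bytes per element.
struct TransferBytes {
    std::size_t hostToDevice;  // input copied to the GPU
    std::size_t deviceToHost;  // output copied back to the CPU
};

TransferBytes sobelTransferBytes(std::size_t rows, std::size_t cols,
                                 std::size_t bytesPerPixel) {
    assert(rows >= 2 && cols >= 2);
    return {rows * cols * bytesPerPixel,
            (rows - 2) * (cols - 2) * bytesPerPixel};
}
```

Assuming the grayscale image is 384-by-512 uint8, calling sobelTransferBytes(384, 512, 1) gives 196,608 bytes copied to the GPU and 194,820 bytes copied back. The kernel itself does only a few arithmetic operations per pixel, which is consistent with the copies dominating the profile.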

The GPU Performance Analyzer report shows that copying memory takes most of the time.

The GPU Performance Analyzer results for sobelEdgeDetection showing CPU2GPUCopy and GPU2CPUCopy events taking most of the time

Generate GPU Code and Run gpuPerformanceAnalyzer on GPU

The Sobel edge detection algorithm moves its input to the GPU immediately, computes the edges there, and produces the final result on the GPU. When the caller already holds the inputs on the GPU and consumes the outputs there, the generated code does not need to copy memory to or from the CPU. GPU Coder can produce outputs on the GPU only when the output types are supported as GPU types.

Pass the inputs to the GPU by using the gpuArray function.

imGrayGpu = gpuArray(imGray);
codegen -config cfg -args {imGrayGpu} sobelEdgeDetection
Code generation successful.

You can also use coder.typeof to represent the inputs on the GPU.

inputImg = coder.typeof(imGray, "Gpu", true);
codegen -config cfg -args {inputImg} sobelEdgeDetection
Code generation successful.

Run gpuPerformanceAnalyzer with the inputs and outputs on the GPU.

gpuPerformanceAnalyzer("sobelEdgeDetection",{imGrayGpu},Config=cfg,OutFolder="sobelEdgeWithGPUIO");
### Starting GPU code generation
Code generation successful: View report

### GPU code generation finished
### Starting application profiling
### Application profiling finished
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data

With the inputs and outputs on the GPU, the entry-point function does not contain memory copy events between the CPU and GPU.

The performance analyzer results for sobelEdgeDetection showing no CPU2GPUCopy or GPU2CPUCopy events

Use emxArray Data Types as Unbounded GPU Inputs in Generated Function Interfaces

emxArray is a flexible data type that generated code uses to handle arrays whose size varies at run time. When you specify unbounded GPU inputs as the input types in GPU Coder, the generated function interface uses emxArray types, and you construct emxArray data to pass inputs to and receive outputs from the generated code.

Examine MATLAB Function

The doubleElements entry-point function doubles the elements of the input array.

type doubleElements.m
function Y = doubleElements(X)
%

% Copyright 2024 The MathWorks, Inc.
    coder.gpu.kernelfun();
    Y = 2 * X;
end

Generate Initial Source Code for Entry-Point Function Using Unknown Size GPU Inputs

Use coder.typeof to create a GPU input type of unknown size.

cfg = coder.gpuConfig('lib');
inputs = {coder.typeof(int32(0), [inf inf], 'Gpu', true)};
codegen -config cfg -args inputs doubleElements.m -report
Code generation successful: View report

The function prototype for doubleElements in the generated code is:

extern void doubleElements(const emxArray_int32_T *gpu_X,
                           emxArray_int32_T *gpu_Y);

Write a Customized Main File to Initialize the emxArray Data

The file doubleElements_main.cu demonstrates how to create emxArray types on the GPU. To initialize an emxArray type on the GPU, allocate and initialize the GPU memory for the input, then construct the emxArray_int32_T structure with its data field pointing to the GPU buffer. Allocating memory for the GPU output is optional, especially when the output size is unknown at compile time, because the entry-point function handles the memory allocation.

This example also shows how you can pass differently sized inputs to the same entry-point function.
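
The construction described above can be sketched on the host with a stand-in structure. The type EmxArrayInt32Sketch and the helpers numElements, wrapMatrix, and emptyMatrix are hypothetical names for this illustration. The field names follow doubleElements_main.cu, but the real emxArray_int32_T definition comes from the generated headers. The sketch also assumes that allocatedSize counts elements rather than bytes, as MATLAB Coder documents for emxArray types.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for the generated emxArray_int32_T type (not the real definition).
struct EmxArrayInt32Sketch {
    std::int32_t* data;          // element buffer (a device pointer in the GPU case)
    std::int32_t* size;          // one extent per dimension
    std::int32_t allocatedSize;  // capacity of data, assumed to be in elements
    std::int32_t numDimensions;  // number of entries in size
    bool canFreeData;            // false: the caller keeps ownership of data
};

// Number of elements described by the size vector.
std::int32_t numElements(const EmxArrayInt32Sketch& a) {
    std::int32_t n = 1;
    for (std::int32_t d = 0; d < a.numDimensions; ++d) {
        n *= a.size[d];
    }
    return n;
}

// Wrap a caller-owned 2-D buffer without copying it.
EmxArrayInt32Sketch wrapMatrix(std::int32_t* buffer, std::int32_t* size2) {
    assert(buffer != nullptr && size2 != nullptr);
    EmxArrayInt32Sketch a;
    a.data = buffer;
    a.size = size2;
    a.numDimensions = 2;
    a.allocatedSize = size2[0] * size2[1];
    a.canFreeData = false;
    return a;
}

// Describe an empty 2-D output and leave allocation to the callee.
EmxArrayInt32Sketch emptyMatrix(std::int32_t* size2) {
    assert(size2 != nullptr);
    size2[0] = 0;
    size2[1] = 0;
    EmxArrayInt32Sketch a;
    a.data = nullptr;
    a.size = size2;
    a.numDimensions = 2;
    a.allocatedSize = 0;
    a.canFreeData = false;
    return a;
}
```

Because the size array is passed by pointer, the same wrapper works for any two-element size, which is what lets one entry-point function serve differently sized inputs.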

type doubleElements_main.cu
#include "doubleElements.h"
#include "doubleElements_initialize.h"

#include "MWCudaDimUtility.hpp"
#include "MWCudaMemoryFunctions.hpp"

#include <iostream>
#include <iomanip>

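// Fill array[k] with its own linear index k, one element per GPU thread.
__global__ void initializeArray(int *array, int size) {
    uint64_T gThreadId = mwGetGlobalThreadIndex();
    // Skip threads past the end of the buffer; the cast avoids a signed/unsigned comparison.
    if (gThreadId < static_cast<uint64_T>(size)) {
        array[gThreadId] = static_cast<int>(gThreadId);
    }
}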

void processAndPrintArray(int32_T* size_in) {
    int totalSize = size_in[0] * size_in[1];

    int *deviceBuffer;
    mwCudaMalloc(&deviceBuffer, totalSize * sizeof(int32_T));

    int32_T threadsPerBlock = 256;
    // Round up so that a partially filled final block covers the remaining elements.
    int32_T blocksPerGrid = (totalSize + threadsPerBlock - 1) / threadsPerBlock;

    initializeArray<<<blocksPerGrid, threadsPerBlock>>>(deviceBuffer, totalSize);

    emxArray_int32_T gpu_in;
    emxArray_int32_T gpu_out;

    gpu_in.data = deviceBuffer;
    gpu_in.size = size_in;
    gpu_in.allocatedSize = totalSize; // capacity in elements, not bytes
    gpu_in.numDimensions = 2;
    gpu_in.canFreeData = false;

    int32_T size_out[2] = {0, 0};
    gpu_out.data = static_cast<int32_T *>(nullptr);
    gpu_out.size = size_out;
    gpu_out.allocatedSize = 0;
    gpu_out.numDimensions = 2;
    gpu_out.canFreeData = false;

    doubleElements(&gpu_in, &gpu_out);

    int32_T* hostBuffer = new int32_T[totalSize];
    cudaMemcpy(hostBuffer, gpu_out.data, totalSize * sizeof(int32_T), cudaMemcpyDeviceToHost);

    int dim1 = size_in[0];
    int dim2 = size_in[1];
    std::cout << "Doubled each element in a " << dim1 << " x " << dim2 << " matrix:" << std::endl;
    for (int i = 0; i < dim1; i++) {
        for (int j = 0; j < dim2; j++) {
            std::cout << std::setw(4) << hostBuffer[j + dim2 * i];
        }
        std::cout << std::endl;
    }
    std::cout << std::endl;

    mwCudaFree(gpu_in.data);
    mwCudaFree(gpu_out.data);
    delete[] hostBuffer;
}

int main(int argc, char *argv[])
{
    doubleElements_initialize();

    int32_T size_in_1[2] = {5, 5};
    processAndPrintArray(size_in_1);

    int32_T size_in_2[2] = {10, 10};
    processAndPrintArray(size_in_2);

    return 0;
}
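
The print loop in the listing reads hostBuffer in memory order, using the index j + dim2 * i. Generated code stores arrays column-major by default, so element (i, j) of the MATLAB matrix sits at i + dim1 * j, and the printed grid is the transpose of the MATLAB matrix. For the square sizes used here, the shape of the printed grid is unaffected. For non-square inputs, a conversion along the following lines may be useful. The helper names columnMajorIndex and toRowMajor are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear index of element (row, col) in a column-major array with numRows rows.
std::size_t columnMajorIndex(std::size_t row, std::size_t col, std::size_t numRows) {
    return row + numRows * col;
}

// Copy a column-major rows-by-cols buffer into row-major order,
// so that a simple row-by-row print shows the matrix as MATLAB displays it.
std::vector<int> toRowMajor(const std::vector<int>& colMajor,
                            std::size_t rows, std::size_t cols) {
    assert(colMajor.size() == rows * cols);
    std::vector<int> rowMajor(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            rowMajor[c + cols * r] = colMajor[columnMajorIndex(r, c, rows)];
        }
    }
    return rowMajor;
}
```

For example, the 2-by-3 column-major buffer {0, 2, 4, 6, 8, 10} represents the matrix with rows [0 4 8] and [2 6 10], and toRowMajor returns {0, 4, 8, 2, 6, 10}.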

Generate an Executable

Configure code generation to compile the customized main file with the generated code. Generate the executable file.

cfg = coder.gpuConfig("exe");
cfg.CustomSource = "doubleElements_main.cu";
cfg.CustomInclude = ".";
codegen -config cfg -args inputs doubleElements_main.cu doubleElements.m -report
Code generation successful: View report

The code generator produces an executable file named doubleElements in the current working folder. Run the executable using these commands:

if ispc
  !doubleElements.exe
else
  !./doubleElements
end
Doubled each element in a 5 x 5 matrix:
   0   2   4   6   8
  10  12  14  16  18
  20  22  24  26  28
  30  32  34  36  38
  40  42  44  46  48

Doubled each element in a 10 x 10 matrix:
   0   2   4   6   8  10  12  14  16  18
  20  22  24  26  28  30  32  34  36  38
  40  42  44  46  48  50  52  54  56  58
  60  62  64  66  68  70  72  74  76  78
  80  82  84  86  88  90  92  94  96  98
 100 102 104 106 108 110 112 114 116 118
 120 122 124 126 128 130 132 134 136 138
 140 142 144 146 148 150 152 154 156 158
 160 162 164 166 168 170 172 174 176 178
 180 182 184 186 188 190 192 194 196 198
