Analyze Performance of Generated CUDA Code

This example shows how to analyze and optimize the performance of generated CUDA® code by using the gpuPerformanceAnalyzer function.

The gpuPerformanceAnalyzer function generates code and collects metrics on CPU and GPU activities in the generated code. The function generates a report containing a chronological timeline plot that you can use to visualize, identify, and mitigate performance bottlenecks in the generated CUDA code.

This example generates the performance analyzer report for a fog rectification algorithm. For more information, see Fog Rectification.

Third-Party Prerequisites

CUDA-enabled NVIDIA® GPU.
NVIDIA CUDA toolkit and driver. For information on the supported versions of the compilers and libraries, see Third-Party Hardware.
Environment variables for the compilers and libraries. For information on setting up the environment variables, see Setting Up the Prerequisite Products.
Permissions to access GPU performance counters. Starting with CUDA toolkit v10.1, NVIDIA restricts access to performance counters to admin users only. To enable GPU performance counters for all users, see Permission issue with Performance Counters (NVIDIA).

Verify GPU Environment

To verify that the compilers and libraries necessary for this example are set up correctly, use the coder.checkGpuInstall function.

envCfg = coder.gpuEnvConfig("host");
envCfg.BasicCodegen = 1;
envCfg.Profiling = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);

When the Quiet property of the coder.gpuEnvConfig object is true, the coder.checkGpuInstall function returns only warning or error messages.
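
With Quiet enabled, a successful check prints nothing, so it can help to keep the value that coder.checkGpuInstall returns. The following is a minimal sketch that assumes the function returns a structure of check results. The exact fields in that structure are not listed in this example and may vary by release, so the sketch only displays the structure instead of reading specific fields.

```matlab
% Sketch: same configuration as above, but keep the returned results.
envCfg = coder.gpuEnvConfig("host");
envCfg.BasicCodegen = 1;
envCfg.Profiling = 1;
envCfg.Quiet = 1;

% Assumption: the output is a struct with one entry per environment check.
results = coder.checkGpuInstall(envCfg);

% Display every check and its outcome without assuming field names.
disp(results)
```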

Examine Fog Rectification Algorithm

This example takes a foggy RGB image as input. To improve the foggy input image, the algorithm performs fog removal and contrast enhancement. The diagram shows the steps for both these operations.

The steps in the fog rectification algorithm separated into fog removal and contrast enhancement stages

To perform fog removal, the algorithm estimates the dark channel of the image, calculates the airlight map based on the dark channel, and refines the airlight map by using filters. The restoration stage creates a defogged image by subtracting the refined airlight map from the input image.

To enhance the contrast, the algorithm then converts the restored image to grayscale and creates a histogram of intensity values in the image. It normalizes the histogram and then calculates the cumulative distribution function (CDF) of intensity values. The algorithm uses contrast stretching to expand the range of values and make the features stand out more clearly in the output image.

type fog_rectification.m

function [out] = fog_rectification(input) %#codegen
%

%   Copyright 2017-2023 The MathWorks, Inc.

coder.gpu.kernelfun;

% restoreOut is used to store the output of restoration
restoreOut = zeros(size(input),"double");

% Changing the precision level of input image to double
input = double(input)./255;

%% Dark channel Estimation from input
darkChannel = min(input,[],3);

% diff_im is used as input and output variable for anisotropic 
% diffusion
diff_im = 0.9*darkChannel;
num_iter = 3;

% 2D convolution mask for Anisotropic diffusion
hN = [0.0625 0.1250 0.0625; 0.1250 0.2500 0.1250;
 0.0625 0.1250 0.0625];
hN = double(hN);

%% Refine dark channel using Anisotropic diffusion.
for t = 1:num_iter
    diff_im = conv2(diff_im,hN,"same");
end

%% Reduction with min
diff_im = min(darkChannel,diff_im);

diff_im = 0.6*diff_im ;

%% Parallel element-wise math to compute
%  Restoration with inverse Koschmieder's law
factor = 1.0./(1.0-(diff_im));
restoreOut(:,:,1) = (input(:,:,1)-diff_im).*factor;
restoreOut(:,:,2) = (input(:,:,2)-diff_im).*factor;
restoreOut(:,:,3) = (input(:,:,3)-diff_im).*factor;
restoreOut = uint8(255.*restoreOut);

%%
% Stretching performs the histogram stretching of the image.
% im is the input color image and p is cdf limit.
% out is the contrast stretched image and cdf is the cumulative
% prob. density function and T is the stretching function.

% RGB to grayscale conversion
im_gray = im2gray(restoreOut);
[row,col] = size(im_gray);

% histogram calculation
[count,~] = imhist(im_gray);
prob = count'/(row*col);

% cumulative Sum calculation
cdf = cumsum(prob(:));

% Utilize gpucoder.reduce to find less than particular probability.
% This is equal to "i1 = length(find(cdf <= (p/100)));", but is 
% more GPU friendly.

% lessThanP is the preprocess function that returns 1 if the input
% value from cdf is less than the defined threshold and returns 0 
% otherwise. gpucoder.reduce then sums up the returned values to get 
% the final count.
i1 = gpucoder.reduce(cdf,@plus,"preprocess", @lessThanP);
i2 = 255 - gpucoder.reduce(cdf,@plus,"preprocess", @greaterThanP);

o1 = floor(255*.10);
o2 = floor(255*.90);

t1 = (o1/i1)*[0:i1];
t2 = (((o2-o1)/(i2-i1))*[i1+1:i2])-(((o2-o1)/(i2-i1))*i1)+o1;
t3 = (((255-o2)/(255-i2))*[i2+1:255])-(((255-o2)/(255-i2))*i2)+o2;

T = (floor([t1 t2 t3]));

restoreOut(restoreOut == 0) = 1;

u1 = (restoreOut(:,:,1));
u2 = (restoreOut(:,:,2));
u3 = (restoreOut(:,:,3));

% replacing the value from look up table
out1 = T(u1);
out2 = T(u2);
out3 = T(u3);

out = zeros([size(out1),3], "uint8");
out(:,:,1) = uint8(out1);
out(:,:,2) = uint8(out2);
out(:,:,3) = uint8(out3);
end

function out = lessThanP(input)
p = 5/100;
out = uint32(0);
if input <= p
    out = uint32(1);
end
end

function out = greaterThanP(input)
p = 5/100;
out = uint32(0);
if input >= 1 - p
    out = uint32(1);
end
end
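
The comments in the listing state that the gpucoder.reduce call with the lessThanP preprocess function is equivalent to counting the CDF entries at or below the threshold. The following CPU-only sketch checks that counting logic on a small hypothetical image. It uses accumarray in place of imhist so that it runs in base MATLAB, and nnz in place of gpucoder.reduce. The pixel values are made up for illustration and are not from foggyInput.png.

```matlab
% Hypothetical 20-pixel grayscale image (illustrative values only)
im_gray = uint8([50*ones(1,5), 100*ones(1,10), 200*ones(1,5)]);

% 256-bin histogram in base MATLAB; bin k counts pixels with value k-1
count = accumarray(double(im_gray(:)) + 1, 1, [256 1]);

% Normalize the histogram and accumulate it into a CDF
prob = count / numel(im_gray);
cdf = cumsum(prob);

% Same thresholds as lessThanP and greaterThanP in the listing
p = 5/100;
i1 = nnz(cdf <= p);            % same count as length(find(cdf <= p))
i2 = 255 - nnz(cdf >= 1 - p);

fprintf("i1 = %d, i2 = %d\n", i1, i2);
```

For this input, the CDF stays at 0 for intensity values 0 through 49 and reaches 1 at intensity value 200, so the sketch prints i1 = 50, i2 = 199.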

Generate GPU Performance Analyzer Report

To analyze the performance of the generated code by using the gpuPerformanceAnalyzer function, create a code configuration object with a dynamic library build type by using the dll input argument. Set the ecoder option to true to create a coder.EmbeddedCodeConfig configuration object.

cfg = coder.gpuConfig("dll","ecoder",true);

Run gpuPerformanceAnalyzer with the default iteration count of 2.

inputImage = imread("foggyInput.png");
inputs  = {inputImage};
designFileName = "fog_rectification";

gpuPerformanceAnalyzer(designFileName,inputs, ...
    Config=cfg,NumIterations=2);

### Starting GPU code generation
Code generation successful: View report

### GPU code generation finished
### Starting application profiling
### Starting SIL execution for 'fog_rectification'
    To terminate execution: clear fog_rectification_sil
### Application stopped
### Stopping SIL execution for 'fog_rectification'
### Application profiling finished
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data

Alternatively, you can use the -gpuprofile option of the codegen command to enable GPU profiling and create the GPU performance analyzer report.

codegen -config cfg -gpuprofile fog_rectification.m -args inputs

When code generation completes, the software generates a fog_rectification_sil executable. To run the software-in-the-loop (SIL) executable, call it with the input image:

fog_rectification_sil(inputImage);

To terminate the SIL executable, click the clear fog_rectification_sil link in the MATLAB Command Window or enter clear fog_rectification_sil at the command line. The GPU performance analyzer report is available after the SIL executable terminates.

### Application stopped
### Stopping SIL execution for 'fog_rectification'
### Starting profiling data processing
### Profiling data processing finished
    Open GPU Performance Analyzer report: open('/home/test/gpucoder-ex87489778/codegen/dll/fog_rectification/html/gpuProfiler.mldatx')

GPU Performance Analyzer Report

The GPU performance analyzer report lists GPU and CPU activities, events, and performance metrics in a chronological timeline plot that you can use to visualize, identify, and address performance bottlenecks in the generated CUDA code.

The GPU Performance Analyzer results for fog_rectification.

These numbers are representative. The actual values depend on your hardware setup. The profiling in this example was performed using MATLAB® R2024b on a machine with a 12-core, 3.6 GHz Intel® Xeon® CPU and an NVIDIA Quadro RTX 6000 GPU.

Profiling Timeline

The profiling timeline shows the complete trace of all events that have a runtime higher than the threshold value. This image shows a snippet of the profiling trace when the threshold value is 0.0 ms.

The Profiling Timeline for fog_rectification. It is zoomed in between 0.1ms and 0.3ms.

You can use the mouse wheel or the equivalent touchpad option to control the zoom level of the timeline. Alternatively, you can use the timeline summary at the top of the panel to control the zoom level and navigate the timeline plot.

The tooltips on each event indicate the start time, end time, and duration of the selected event on the CPU and the GPU. The tooltips also indicate the time elapsed between the kernel launch on the CPU and the actual execution of the kernel on the GPU.

The tooltip for the entry-point function.

Right-click an event to open the context menu and add a trace between the CPU and corresponding GPU events. You can also use the context menu to view the generated CUDA code that corresponds to an event on the code pane.

Event Statistics

The Event Statistics pane shows additional information for the selected event. For example, selecting fog_rectification_kernel1 shows kernel information such as the start time, end time, launch parameters, shared memory, and registers per thread.

The Event Statistics pane for fog_rectification_kernel1

Profiling Summary

The Profiling Summary pane includes bar charts that provide an overview of the GPU and CPU activities. The bar chart changes according to the zoom level of the profiling timeline. This image shows a snippet of the profiling summary for the region selected on the timeline.

The Profiling Summary. It shows the range is between 0.159ms and 0.302ms.

Trace Code

You can use the Code pane to trace from the MATLAB code to the CUDA code or from the CUDA code to the MATLAB code. Traceable code is blue on the side that you are tracing from and orange on the side that you are tracing to. As you move your pointer to the traceable code, the pane highlights the code in purple and traces the corresponding code on the other side. When you select a code section, the pane highlights the code in yellow. The code remains selected until you press Esc or select different code. To change the side that you are tracing from, select code on the other side.

The code pane with fog_rectification.cu and fog_rectification.m.

To explore the tracing, right-click the fog_rectification_loop0 event on the Loops row of the profiling timeline and select View generated code. This action highlights the for-loop and its corresponding MATLAB code.

The code pane shows the CUDA and MATLAB code for fog_rectification_loop_0.

In the MATLAB code pane, scroll to the top of the MATLAB function so the trace is out of view. The trace view icon tells you that the highlighted CUDA code has one trace that is not in view.

The code pane with an icon pointing to one code trace that is offscreen

To clear the selection, press Esc or select different code.

When code traces to more than one place in the corresponding source or generated code:

If you pause over the code that you are tracing, you can see the number of traces at the top of the code pane.
If some traces are not in view, the trace view icons indicate how many traces are out of view.
If you select the code that you want to trace in the code pane, you can select the trace that you want to see at the top of the code pane.

The top of the code pane showing a trace from a line of fog_rectification.m to lines of code in fog_rectification.cu.

Call Tree

This section lists the GPU events called from the CPU. Each event in the call tree lists the execution times as percentages of the caller function. This metric can help you to identify performance bottlenecks in the generated code. You can also navigate to specific events on the profiling timeline by clicking on the corresponding events in the call tree.
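
As a worked illustration of the percentage metric, the following sketch computes each event's share of its caller's execution time and sorts the events so that the largest contributor comes first. The event names and durations are hypothetical and are not taken from the fog_rectification report.

```matlab
% Hypothetical durations in milliseconds (not measured values)
callerTime = 2.0;                                   % execution time of the caller
childNames = ["kernel_a"; "memcpy_h2d"; "kernel_b"]; % made-up event names
childTimes = [0.8; 0.5; 0.2];                       % execution time of each event

% Each event's time as a percentage of the caller's time
pct = 100 * childTimes / callerTime;

% Sort so that the most expensive event is listed first
[pctSorted, order] = sort(pct, "descend");
disp(table(childNames(order), pctSorted, ...
    VariableNames=["Event", "PercentOfCaller"]));
```

In this made-up case the three events account for 40, 25, and 10 percent of the caller, so the remaining 25 percent is spent in the caller itself or in events that are not listed.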

Filters

This section provides filtering options for the report. Select Show entire profiling session to view profiling results for the entire application, including initialize and terminate. Alternatively, select Show single run to view results from an individual run of the design function.

Under Filter Events, you can also filter by:

Event Threshold — Skip events shorter than the given threshold.
Memory Allocation/Free — Show GPU device memory allocation and deallocation related events on the CPU activities bar.
Memory Transfers — Show host-to-device and device-to-host memory transfers.
Kernels — Show CPU kernel launches and GPU kernel activities.
Other Event — Show other GPU related events such as synchronization and waiting for GPU.
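
The effect of the Event Threshold filter can be pictured with a small sketch that drops events shorter than a threshold. This only illustrates the filtering rule described in the list above. It does not show how the report implements the filter, and the event names and durations are hypothetical.

```matlab
% Hypothetical events and durations in milliseconds (not measured values)
names = ["cudaMalloc"; "kernel launch"; "memcpy"; "cudaFree"];
durationMs = [0.120; 0.004; 0.035; 0.002];

% Skip events shorter than the threshold
thresholdMs = 0.01;
keep = durationMs >= thresholdMs;

disp(names(keep));
```

With a threshold of 0.01 ms, only the cudaMalloc and memcpy entries remain. A threshold of 0.0 ms, as used for the timeline image in this example, keeps every event.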