Analyze Performance of Code Generated for Deep Learning Networks
This example shows you how to analyze and optimize the performance of the generated CUDA® code for deep learning networks by using the gpuPerformanceAnalyzer function.
The GPU Coder Performance Analyzer runs a software-in-the-loop (SIL) execution that collects metrics on CPU/GPU activities in the generated code and provides a chronological timeline plot to visualize, identify, and mitigate performance bottlenecks in the generated CUDA code. This example generates the performance analysis report for the Generate Digit Images on NVIDIA GPU Using Variational Autoencoder example from GPU Coder.
Third-Party Prerequisites
- CUDA enabled NVIDIA® GPU.
- NVIDIA CUDA toolkit and driver.
- NVIDIA Nsight™ Systems. For information on the supported versions of the compilers and libraries, see Third-Party Hardware.
- Environment variables for the compilers and libraries. To set up the environment variables, see Setting Up the Prerequisite Products.
The profiling workflow of this example depends on profiling tools from NVIDIA that access GPU performance counters. Starting with CUDA Toolkit v10.1, NVIDIA restricts access to performance counters to admin users only. To enable GPU performance counters for all users, follow the instructions in Permission issue with Performance Counters (NVIDIA).
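On Linux, the fix NVIDIA describes is to pass the NVreg_RestrictProfilingToAdminUsers=0 option to the nvidia kernel module. A minimal sketch follows; the .conf file name is illustrative, and you should follow the NVIDIA instructions for your specific system:

```shell
# Allow non-admin users to access GPU performance counters (Linux).
# The file name below is an example; any .conf file in /etc/modprobe.d works.
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | \
    sudo tee /etc/modprobe.d/nvidia-profiling.conf

# Reboot for the module option to take effect. If your distribution loads
# the nvidia module from the initramfs, rebuild the initramfs first.
```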
Verify GPU Environment
To verify that the compilers and libraries for running this example are set up correctly, use the coder.checkGpuInstall function.
envCfg = coder.gpuEnvConfig('host');
envCfg.DeepLibTarget = 'cudnn';
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
Pretrained Variational Autoencoder Network
Autoencoders have two parts: the encoder and the decoder. The encoder takes an image input and outputs a compressed representation (the encoding), which is a vector of size latent_dim, equal to 20 in this example. The decoder takes the compressed representation, decodes it, and recreates the original image.
VAEs differ from regular autoencoders in that they do not use the encoding-decoding process to reconstruct an input. Instead, they impose a probability distribution on the latent space, and learn the distribution so that the distribution of outputs from the decoder matches that of the observed data. Then, they sample from this distribution to generate new data.
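The generation step described above reduces to sampling latent vectors from the standard normal prior and decoding them. A minimal MATLAB sketch, assuming a trained decoder dlnetwork object named decoderNet is already in the workspace:

```matlab
% Sample 25 latent vectors from the standard normal prior N(0, I).
latentDim = 20;
z = dlarray(randn(1,1,latentDim,25,'single'),'SSCB');

% Decode the samples; sigmoid maps the outputs to the [0,1] image range.
img = sigmoid(predict(decoderNet, z));
img = extractdata(img);   % numeric array of generated images
```

This is the same pattern the generateVAE entry-point function below uses, minus the persistent-variable caching needed for code generation.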
This example uses the decoder network trained in the Train Variational Autoencoder (VAE) to Generate Images example. To train the network yourself, see Train Variational Autoencoder (VAE) to Generate Images (Deep Learning Toolbox).
The generateVAE Entry-Point Function
The generateVAE entry-point function loads the dlnetwork object from the trainedDecoderVAENet MAT-file into a persistent variable and reuses the persistent object for subsequent prediction calls. It initializes a dlarray object containing 25 randomly generated encodings, passes them through the decoder network, and extracts the numeric data of the generated image from the deep learning array object.
type('generateVAE.m')
function generatedImage = generateVAE(decoderNetFileName,latentDim,Environment) %#codegen
% Copyright 2020-2021 The MathWorks, Inc.

persistent decoderNet;
if isempty(decoderNet)
    decoderNet = coder.loadDeepLearningNetwork(decoderNetFileName);
end

% Generate random noise
randomNoise = dlarray(randn(1,1,latentDim,25,'single'),'SSCB');

if coder.target('MATLAB') && strcmp(Environment,'gpu')
    randomNoise = gpuArray(randomNoise);
end

% Generate new image from noise
generatedImage = sigmoid(predict(decoderNet,randomNoise));

% Extract numeric data from dlarray
generatedImage = extractdata(generatedImage);
end
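Before profiling, you can run the entry-point function directly in MATLAB to check its output. The calls below are a sketch; imtile assumes Image Processing Toolbox is available:

```matlab
% Run the entry-point function in MATLAB and display the 25 generated digits.
images = generateVAE('trainedDecoderVAENet.mat', 20, 'gpu');
imshow(imtile(images), []);
```

Pass 'cpu' instead of 'gpu' as the environment argument if no GPU is available in MATLAB.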
Generate Performance Analyzer Report
To analyze the performance of the generated code using gpuPerformanceAnalyzer, create a code configuration object with a dynamic library ('dll') build type. Because the gpuPerformanceAnalyzer function accepts only an Embedded Coder™ configuration object, enable the option to create a coder.EmbeddedCodeConfig configuration object.
cfg = coder.gpuConfig('dll','ecoder',true);
Use the coder.DeepLearningConfig function to create a CuDNN deep learning configuration object and assign it to the DeepLearningConfig property of the GPU code configuration object.
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
Run gpuPerformanceAnalyzer with the default iteration count of 2.
latentDim = 20;
Env = 'gpu';
matfile = 'trainedDecoderVAENet.mat';
inputs = {coder.Constant(matfile), coder.Constant(latentDim), coder.Constant(Env)};
designFileName = 'generateVAE';
gpuPerformanceAnalyzer(designFileName, inputs, ...
    'Config', cfg, 'NumIterations', 2);
### Starting GPU code generation
Code generation successful: View report
### GPU code generation finished
### Starting SIL execution for 'generateVAE'
    To terminate execution: clear generateVAE_sil
### Host application produced the following standard output (stdout) messages:
Generating '/tmp/nsys-report-fc93.qdstrm'
[1/1] [========================100%] mw_nsysData.nsys-rep
Generated: /home/lnarasim/Documents/MATLAB/ExampleManager/lnarasim.Bdoc23a.j2174901/gpucoder-ex10368834/mw_nsysData.nsys-rep
### Stopping SIL execution for 'generateVAE'
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data
GPU Performance Analyzer
The GPU Performance Analyzer exposes GPU and CPU activities, events, and performance metrics in a chronological timeline plot so that you can accurately visualize, identify, and address performance bottlenecks in the generated CUDA® code.
These numbers are representative. The actual values depend on your hardware setup. This profiling was done using MATLAB R2023a on a machine with a 6-core, 3.5 GHz Intel® Xeon® CPU and an NVIDIA TITAN XP GPU.
Profiling Timeline
The profiling timeline shows the complete trace of all events that have a runtime higher than the threshold value. A snippet of the profiling trace is shown.
You can use the mouse wheel (or an equivalent touchpad option) to zoom into and out of the timeline. Alternatively, you can use the timeline summary at the top of the panel to zoom and navigate the timeline plot.
The tooltip on each event indicates the start time, end time, and duration of the selected event on the CPU and the GPU. It also indicates the time elapsed between the kernel launch on the CPU and the actual execution of the kernel on the GPU.
Event Statistics
The event statistics panel shows additional information for the selected event. For example, the dlnetwork_predict_kernel2 kernel shows the following statistics:
Insights
The insights panel gives a pie-chart overview of the GPU and CPU activities. The pie chart changes according to the zoom level of the profiling timeline. A snippet of the insights panel is shown. Within the region selected on the timeline, the GPU utilization is only 33%.
Call Tree
This section lists the GPU events called from the CPU. Each event in the call tree lists its execution time as a percentage of the caller function's time. This metric can help you identify performance bottlenecks in the generated code. You can also navigate to specific events on the profiling timeline by clicking the corresponding events in the call tree.
Filters
This section provides filtering options for the report.
- View Mode - Use this option to view profiling results for the entire application (including initialization and terminate) or for the design function only (without initialization and terminate).
- Event Threshold - Skip events shorter than the given threshold.
- Memory Allocation/Free - Show GPU device memory allocation and deallocation related events on the CPU activities bar.
- Memory Transfers - Show host-to-device and device-to-host memory transfers.
- Kernels - Show CPU kernel launches and GPU kernel activities.
- Others - Show other GPU related events such as synchronization and waiting for the GPU.
Improving the Performance of the generateVAE Function
The performance analyzer report shows that a significant portion of the execution time is spent on memory allocation and deallocation. To improve the performance, turn on the GPU memory manager and run the analysis again.
cfg = coder.gpuConfig('dll','ecoder',true);
cfg.GpuConfig.EnableMemoryManager = true;
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
gpuPerformanceAnalyzer(designFileName, inputs, ...
    'Config', cfg, 'NumIterations', 2);
### Starting GPU code generation
Code generation successful: View report
### GPU code generation finished
### Starting SIL execution for 'generateVAE'
    To terminate execution: clear generateVAE_sil
### Host application produced the following standard output (stdout) messages:
Generating '/tmp/nsys-report-3847.qdstrm'
[1/1] [========================100%] mw_nsysData.nsys-rep
Generated: /home/lnarasim/Documents/MATLAB/ExampleManager/lnarasim.Bdoc23a.j2174901/gpucoder-ex10368834/mw_nsysData.nsys-rep
### Stopping SIL execution for 'generateVAE'
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data
With GPU memory manager enabled, the GPU utilization increases to 43%.
See Also
Related Topics
- GPU Programming Paradigm
- Code Generation by Using the GPU Coder App
- Code Generation Using the Command Line Interface
- Code Generation for Deep Learning Networks by Using cuDNN
- Code Generation for Deep Learning Networks by Using TensorRT
- Analyze Performance of the Generated CUDA Code
- GPU Profiling on NVIDIA Jetson Platforms