GPU Memory Allocation and Minimization
Discrete and Managed Modes
GPU Coder™ provides access to two different memory allocation (malloc) modes available in the CUDA® programming model: cudaMalloc and cudaMallocManaged. The cudaMalloc API is applicable to the traditionally separate CPU and GPU global memories. cudaMallocManaged is applicable to unified memory.
From a programmer's point of view, a traditional computer architecture requires that data be allocated and shared between the CPU and GPU memory spaces. The need for applications to manage data transfers between these two memory spaces adds complexity. Unified memory creates a pool of managed memory that is shared between the CPU and the GPU and is accessible to both through a single pointer. Unified memory attempts to optimize memory performance by migrating data to the device that needs it, while hiding the migration details from the program. Though unified memory simplifies the programming model, it requires device-synchronization calls when data written on the GPU is accessed on the CPU. GPU Coder inserts these synchronization calls. According to NVIDIA®, unified memory can provide significant performance benefits when targeting embedded hardware such as the NVIDIA Tegra®.
To change the memory allocation mode in the GPU Coder app, use the Malloc Mode drop-down box under More Settings > GPU Coder. When using the command-line interface, use the MallocMode build configuration property and set it to either 'discrete' or 'unified'.
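For example, a minimal command-line sketch, assuming a hypothetical entry-point function myEntryPoint and the usual coder.gpuConfig workflow in which GPU-specific settings live on the GpuConfig property of the configuration object:

cfg = coder.gpuConfig('mex');         % GPU code configuration object
cfg.GpuConfig.MallocMode = 'unified'; % use cudaMallocManaged; 'discrete' selects cudaMalloc
codegen -config cfg myEntryPoint -args {ones(256,256,'single')} % example input specification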
Note
In a future release, the unified memory allocation (cudaMallocManaged) mode will be removed when targeting NVIDIA GPU devices on the host development computer. You can continue to use unified memory allocation mode when targeting NVIDIA embedded platforms.
GPU Memory Manager
You can use the GPU memory manager for efficient memory allocation and management and to improve run-time performance. The GPU memory manager creates a collection of large GPU memory pools and manages allocation and deallocation of chunks of memory blocks within these pools. By creating large memory pools, the memory manager reduces the number of calls to the CUDA memory APIs, improving run-time performance. You can use the GPU memory manager for MEX and standalone CUDA code generation.
To enable the GPU memory manager, use one of these methods:
In a GPU code configuration object (coder.gpuConfig), set the EnableMemoryManager property to true (see the sketch after this list).
In the GPU Coder app, on the GPU Code tab, select GPU Memory Manager.
In the Simulink® Configuration Parameters dialog box, Code Generation > GPU Code pane, select the Memory manager parameter.
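For example, a minimal sketch of the first method, assuming a hypothetical entry-point function and that EnableMemoryManager is set through the GpuConfig property of the configuration object:

cfg = coder.gpuConfig('mex');             % GPU code configuration object
cfg.GpuConfig.EnableMemoryManager = true; % turn on the GPU memory manager
codegen -config cfg myEntryPoint -args {ones(1024,1,'single')} % hypothetical entry point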
For CUDA code that uses NVIDIA CUDA libraries, such as cuFFT, cuBLAS, and cuSOLVER, you can also enable the GPU memory manager for efficient memory allocation and management.
To use memory pools with CUDA libraries, enable the memory manager using one of the methods above and:
In the GPU code configuration object (coder.gpuConfig), enable the EnableCUFFT, EnableCUBLAS, or EnableCUSOLVER properties (see the sketch after this list).
In the GPU Coder app, on the GPU Code tab, select Enable cuFFT, Enable cuBLAS, or Enable cuSOLVER.
In the Simulink Configuration Parameters dialog box, Code Generation > GPU Code pane, select the cuFFT, cuBLAS, or cuSOLVER parameters.
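For example, a sketch that combines the memory manager with cuBLAS, under the same assumptions as above (hypothetical entry-point function, GPU settings on the GpuConfig property):

cfg = coder.gpuConfig('dll');             % standalone CUDA code generation
cfg.GpuConfig.EnableMemoryManager = true; % pool allocations, including library workspaces
cfg.GpuConfig.EnableCUBLAS = true;        % map supported operations to cuBLAS calls
codegen -config cfg myMatrixOps -args {ones(512,512,'single')} % hypothetical entry point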
Memory Minimization
GPU Coder analyzes the data dependency between CPU and GPU partitions and performs optimizations to minimize the number of cudaMemcpy function calls in the generated code. The analysis also determines the minimum set of locations where data must be copied between the CPU and the GPU by using cudaMemcpy.
For example, the function foo has sections of code that process data sequentially on the CPU and in parallel on the GPU.
function [out] = foo(input1,input2)
    …
    % CPU work
    input1 = …
    input2 = …
    tmp1 = …
    tmp2 = …
    …
    % GPU work
    kernel1(gpuInput1, gpuTmp1);
    kernel2(gpuInput2, gpuTmp1, gpuTmp2);
    kernel3(gpuTmp1, gpuTmp2, gpuOut);
    …
    % CPU work
    … = out
end
An unoptimized CUDA implementation can potentially have multiple cudaMemcpy function calls to transfer the inputs gpuInput1 and gpuInput2 and the temporary results gpuTmp1 and gpuTmp2 between kernel calls. Because the intermediate results gpuTmp1 and gpuTmp2 are not used outside the GPU, they can be stored within the GPU memory, resulting in fewer cudaMemcpy function calls. These optimizations improve the overall performance of the generated code. The optimized implementation is:
gpuInput1 = input1;
gpuInput2 = input2;
kernel1<<< >>>(gpuInput1, gpuTmp1);
kernel2<<< >>>(gpuInput2, gpuTmp1, gpuTmp2);
kernel3<<< >>>(gpuTmp1, gpuTmp2, gpuOut);
out = gpuOut;
To eliminate redundant cudaMemcpy calls, GPU Coder analyzes all uses and definitions of a given variable and uses status flags to perform minimization. This example shows the original code and the corresponding optimized generated code.
Original Code:

A(:) = …
…
for i = 1:N
    gB = kernel1(gA);
    gA = kernel2(gB);
    if (somecondition)
        gC = kernel3(gA, gB);
    end
    …
end
…
… = C;

Optimized Generated Code:

A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
    if (A_isDirtyOnCpu)
        gA = A;
        A_isDirtyOnCpu = false;
    end
    gB = kernel1(gA);
    gA = kernel2(gB);
    if (somecondition)
        gC = kernel3(gA, gB);
        C_isDirtyOnGpu = true;
    end
    …
end
…
if (C_isDirtyOnGpu)
    C = gC;
    C_isDirtyOnGpu = false;
end
… = C;
The _isDirtyOnCpu flag tells the GPU Coder memory optimization about routines where the given variable is declared and used either on the CPU or on the GPU.
Shared Memory Manager for CUDA MEX
When you generate CUDA MEX code, GPU Coder creates a single universal memory manager that handles memory management for all running CUDA MEX functions, further improving their performance.
To view the shared MEX memory manager properties and manage allocation, create a gpucoder.MemoryManager object by using the cudaMemoryManager function. To free GPU memory that is not in use, call the freeUnusedMemory function. See How Shared GPU Memory Manager Improves Performance of Generated MEX.
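For example, a minimal sketch, assuming cudaMemoryManager takes no arguments, freeUnusedMemory accepts the manager object, and a hypothetical generated MEX function myEntryPoint_mex:

mm = cudaMemoryManager;     % handle to the shared gpucoder.MemoryManager object
out = myEntryPoint_mex(in); % run one or more generated CUDA MEX functions
freeUnusedMemory(mm);       % release pooled GPU memory that is not currently in use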