cuSOLVER Example
This example solves the system of linear equations Ax = B for x by using the cuSOLVER library. The matrices A and B must have the same number of rows. If A is a scalar, then A\B is equivalent to A.\B. If A is a square n-by-n matrix and B is a matrix with n rows, then x = A\B is a solution to the equation A*x = B, if it exists. The MATLAB® implementation of backslash is:
function [x] = backslash(A,b)
if (isscalar(A))
    x = coder.nullcopy(zeros(size(b)));
else
    x = coder.nullcopy(zeros(size(A,2),size(b,2)));
end
x = A\b;
end
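Before generating code, you can exercise backslash directly in MATLAB and compare its result against mldivide. The 12-by-12 system below is an illustrative assumption; any A and b with matching row counts work.

% Illustrative test of the backslash function in MATLAB (the 12-by-12
% system is an assumption; any A and b with matching row counts work).
A = rand(12,12);
b = rand(12,1);
x = backslash(A,b);
norm(A*x - b)   % residual should be close to zero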
Prepare backslash for Kernel Creation
GPU Coder™ requires no special pragma to generate calls to libraries. Just as before, there are two ways to generate CUDA® kernels: coder.gpu.kernelfun and coder.gpu.kernel. This example uses the coder.gpu.kernelfun pragma to generate CUDA kernels. The modified backslash function is:
function [x] = backslash(A,b) %#codegen
if (isscalar(A))
    x = coder.nullcopy(zeros(size(b)));
else
    x = coder.nullcopy(zeros(size(A,2),size(b,2)));
end
coder.gpu.kernelfun();
x = A\b;
end
Note
A minimum size is required on the input data for replacing math operators and functions with cuSOLVER library implementations. The minimum threshold is 128 elements.
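For example, to generate a CUDA MEX function for backslash, you might use a command like the following. The 12-by-12 input size is an illustrative assumption chosen to exceed the 128-element threshold; any sufficiently large inputs with matching row counts work.

% Illustrative code generation command (input sizes are an assumption
% chosen to exceed the 128-element cuSOLVER threshold).
A = rand(12,12);
b = rand(12,1);
cfg = coder.gpuConfig('mex');
codegen -config cfg backslash -args {A,b}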
Generated CUDA Code
When you generate CUDA code, GPU Coder creates function calls to initialize the cuSOLVER library, perform mldivide operations, and release hardware resources that the cuSOLVER library uses. A snippet of the generated CUDA code is:
cusolverEnsureInitialization();

/* Copyright 2017 The MathWorks, Inc. */
cudaMemcpy(b_gpu_A, A, 1152UL, cudaMemcpyHostToDevice);
backslash_kernel1<<<dim3(1U, 1U, 1U), dim3(160U, 1U, 1U)>>>(b_gpu_A, gpu_A);
cudaMemcpy(b_A, gpu_A, 1152UL, cudaMemcpyDeviceToHost);
cusolverDnDgetrf_bufferSize(cusolverGlobalHandle, 12, 12, &gpu_A[0], 12,
                            &cusolverWorkspaceReq);
cusolverWorkspaceTypeSize = 8;
cusolverInitWorkspace();
cudaMemcpy(gpu_A, b_A, 1152UL, cudaMemcpyHostToDevice);
cusolverDnDgetrf(cusolverGlobalHandle, 12, 12, &gpu_A[0], 12,
                 (real_T *)cusolverWorkspaceBuff, &gpu_ipiv_t[0], gpu_info_t);
A_dirtyOnGpu = true;
cudaMemcpy(&info_t, gpu_info_t, 4UL, cudaMemcpyDeviceToHost);
To initialize the cuSOLVER library and create a handle to the cuSOLVER library context, the function cusolverEnsureInitialization() calls the cusolverDnCreate() cuSOLVER API. It allocates hardware resources on the host and device.
static void cusolverEnsureInitialization(void)
{
  if (cusolverGlobalHandle == NULL) {
    cusolverDnCreate(&cusolverGlobalHandle);
  }
}
backslash_kernel1 zero pads the matrix A. This kernel is launched with a single block of 160 threads.
static __global__ __launch_bounds__(160, 1) void backslash_kernel1(const real_T *A, real_T *b_A)
{
  int32_T threadId;
  threadId = (int32_T)(((gridDim.x * gridDim.y * blockIdx.z + gridDim.x *
    blockIdx.y) + blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) +
    (int32_T)((threadIdx.z * blockDim.x * blockDim.y + threadIdx.y *
    blockDim.x) + threadIdx.x));
  if (!(threadId >= 144)) {
    /* Copyright 2017 The MathWorks, Inc. */
    b_A[threadId] = A[threadId];
  }
}
Calls to cudaMemcpy transfer the matrix A from the host to the device. The function cusolverDnDgetrf computes the LU factorization of an m×n matrix:

P*A = L*U

where A is an m×n matrix, P is a permutation matrix, L is a lower triangular matrix with unit diagonal, and U is an upper triangular matrix.
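For reference, this is the same factorization that the MATLAB lu function returns in its permutation-matrix form; a quick check on an illustrative matrix:

% MATLAB equivalent of the factorization computed by cusolverDnDgetrf
% (the matrix size is an illustrative assumption).
A = rand(12,12);
[L,U,P] = lu(A);
norm(P*A - L*U)   % close to zero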
cuSOLVER Standalone Code
For functions like qr that only have partial support in cuSOLVER, GPU Coder uses the LAPACK library where necessary. For MEX functions, the code generator uses the LAPACK library that is included with MATLAB. For standalone code, the code generator uses the LAPACK library that you specify. To specify the LAPACK library:
At the command line, define your own coder.LAPACKCallback class containing the LAPACK library information and assign it to the CustomLAPACKCallback property of the code configuration object.

In the GPU Coder app, set Custom LAPACK library callback to your LAPACK library.
For example, to generate a standalone executable, you can use the following code generation script. Here myLAPACK is the name of the custom coder.LAPACKCallback class containing the LAPACK library information.
cfg = coder.gpuConfig('exe');
cfg.CustomLAPACKCallback = 'myLAPACK';
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';

classdef myLAPACK < coder.LAPACKCallback
    methods (Static)
        function hn = getHeaderFilename()
            hn = 'lapacke.h';
        end
        function updateBuildInfo(buildInfo, buildctx)
            [~,linkLibExt] = buildctx.getStdLibInfo();
            cudaPath = getenv('CUDA_PATH');
            libPath = 'lib\x64';
            buildInfo.addIncludePaths(fullfile(cudaPath,'include'));
            libName = 'cusolver';
            libPath = fullfile(cudaPath,libPath);
            buildInfo.addLinkObjects([libName linkLibExt], libPath, ...
                '', true, true);
            lapackLocation = 'C:\LAPACK\win64'; % specify path to LAPACK libraries
            includePath = fullfile(lapackLocation,'include');
            buildInfo.addIncludePaths(includePath);
            libPath = fullfile(lapackLocation,'lib');
            libName = 'mllapack';
            buildInfo.addLinkObjects([libName linkLibExt], libPath, ...
                '', true, true);
            buildInfo.addDefines('HAVE_LAPACK_CONFIG_H');
            buildInfo.addDefines('LAPACK_COMPLEX_STRUCTURE');
        end
    end
end
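With this configuration object, a standalone code generation command might look like the following sketch. The entry-point function is the backslash function from this example; the input sizes are an illustrative assumption.

% Possible standalone code generation command using the configuration above.
% Input sizes are an assumption for illustration.
A = rand(12,12);
b = rand(12,1);
codegen -config cfg backslash -args {A,b}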