cuSOLVER Example

This example solves the systems of linear equations Ax = B for x by using the cuSOLVER library. The matrices A and B must have the same number of rows. If A is a scalar, then A\B is equivalent to A.\B. If A is a square n-by-n matrix and B is a matrix with n rows, then x = A\B is a solution to the equation A*x = B, if it exists. The MATLAB^® implementation of backslash is:

function [x] = backslash(A,b)
if (isscalar(A))
    x = coder.nullcopy(zeros(size(b)));
else
    x = coder.nullcopy(zeros(size(A,2),size(b,2)));
end

x = A\b;

end

Prepare `backslash` for Kernel Creation

GPU Coder™ requires no special pragma to generate calls to libraries. Just as before, there are two ways to generate CUDA^® kernels — coder.gpu.kernelfun and coder.gpu.kernel. In this example, we utilize the coder.gpu.kernelfun pragma to generate CUDA kernels. The modified backslash function is:

function [x] = backslash(A,b) %#codegen

if (isscalar(A))
    x = coder.nullcopy(zeros(size(b)));
else
    x = coder.nullcopy(zeros(size(A,2),size(b,2)));
end

coder.gpu.kernelfun()
x = A\b;

end

Note

A minimum size is required on the input data for replacing math operators and functions with cuSOLVER library implementations. The minimum threshold is 128 elements.

Generated CUDA Code

When you generate CUDA code, GPU Coder creates function calls to initialize the cuSOLVER library, perform mldivide operations, and release hardware resources that the cuSOLVER library uses. A snippet of the generated CUDA code is:

  cusolverEnsureInitialization();

  /*    Copyright 2017 The MathWorks, Inc. */
  cudaMemcpy(b_gpu_A, A, 1152UL, cudaMemcpyHostToDevice);  
  blackslash_kernel1<<<dim3(1U, 1U, 1U), dim3(160U, 1U, 1U)>>>(b_gpu_A,gpu_A);
  cudaMemcpy(b_A, gpu_A, 1152UL, cudaMemcpyDeviceToHost);
  cusolverDnDgetrf_bufferSize(cusolverGlobalHandle, 12, 12, &gpu_A[0], 12,
    &cusolverWorkspaceReq);
  cusolverWorkspaceTypeSize = 8;
  cusolverInitWorkspace();
  cudaMemcpy(gpu_A, b_A, 1152UL, cudaMemcpyHostToDevice);
  cusolverDnDgetrf(cusolverGlobalHandle, 12, 12, &gpu_A[0], 12, (real_T *)
                   cusolverWorkspaceBuff, &gpu_ipiv_t[0], gpu_info_t);
  A_dirtyOnGpu = true;
  cudaMemcpy(&info_t, gpu_info_t, 4UL, cudaMemcpyDeviceToHost);

To initialize the cuSOLVER library and create a handle to the cuSOLVER library context, the function cusolversEnsureInitialization() calls cusolverDnCreate() cuSOLVER API. It allocates hardware resources on the host and device.

static void cusolverEnsureInitialization(void)
{
  if (cusolverGlobalHandle == NULL) {
    cusolverDnCreate(&cuSolverGlobalHandle);
  }
}

backslash_kernel1 zero pads the matrix A. This kernel is launched with a single block of 512 threads.

static __global__ __launch_bounds__(160, 1) void backslash_kernel1(const real_T *
  A, real_T *b_A)
{
  int32_T threadId;
  ;
  ;
  threadId = (int32_T)(((gridDim.x * gridDim.y * blockIdx.z + gridDim.x *
    blockIdx.y) + blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) +
                       (int32_T)((threadIdx.z * blockDim.x * blockDim.y +
    threadIdx.y * blockDim.x) + threadIdx.x));
  if (!(threadId >= 144)) {
    /*    Copyright 2017 The MathWorks, Inc. */
    b_A[threadId] = A[threadId];
  }
}

Calls to cudaMemcpy transfer the matrix A from the host to the device. The function cusolverDnDgetrf computes the LU factorization of an m×n matrix:

P*A = L*U

where A is an m×n matrix, P is a permutation matrix, L is a lower triangular matrix with unit diagonal, and U is an upper triangular matrix.

cuSOLVER Standalone Code

For functions like qr that only have partial support in cuSOLVER, GPU Coder uses LAPACK library where necessary. For MEX functions, the code generator uses the LAPACK library that is included with MATLAB. For standalone code, the code generator uses the LAPACK library that you specify. To specify the LAPACK library:

At the command line, define your own coder.LAPACKCallback class containing the LAPACK library information and assign it to the CustomLAPACKCallback property of the code configuration object.
In the GPU Coder app, set Custom LAPACK library callback to your LAPACK library.

For example, to generate a standalone executable, you can use the following code generation script. Here myLAPACK is the name of the custom coder.LAPACKCallback class containing the LAPACK library information.

cfg = coder.gpuConfig('exe');
cfg.CustomLAPACKCallback = 'myLAPACK';
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';

classdef myLAPACK < coder.LAPACKCallback
    methods (Static)
        function hn = getHeaderFilename()
            hn = 'lapacke.h';
        end
        function updateBuildInfo(buildInfo, buildctx)
            [~,linkLibExt] = buildctx.getStdLibInfo();
            cudaPath = getenv('CUDA_PATH'); 
            libPath = 'lib\x64';            
            
            buildInfo.addIncludePaths(fullfile(cudaPath,'include'));
            libName = 'cusolver';
            libPath = fullfile(cudaPath,libPath);
            buildInfo.addLinkObjects([libName linkLibExt], libPath, ...
                '', true, true);
            
            lapackLocation = 'C:\LAPACK\win64'; % specify path to LAPACK libraries 
            
            includePath = fullfile(lapackLocation,'include');
            buildInfo.addIncludePaths(includePath);
            libPath = fullfile(lapackLocation,'lib');
            libName = 'mllapack';
            
            buildInfo.addLinkObjects([libName linkLibExt], libPath, ...
                '', true, true);
            buildInfo.addDefines('HAVE_LAPACK_CONFIG_H');
            buildInfo.addDefines('LAPACK_COMPLEX_STRUCTURE');
        end
    end
end

For more information, see Speed Up Linear Algebra in Generated Standalone Code by Using LAPACK Calls.