Analyze and Model Data on GPU
This example shows how to improve code performance by executing on a graphics processing unit (GPU). Execution on a GPU can improve performance if:
Your code is computationally expensive, where computing time significantly exceeds the time spent transferring data to and from GPU memory.
Your workflow uses functions with gpuArray (Parallel Computing Toolbox) support and large array inputs.
When writing code for a GPU, start with code that already performs well on a CPU. Vectorization is usually critical for achieving high performance on a GPU. Convert code to use functions that support GPU array arguments and transfer the input data to the GPU. For more information about MATLAB functions with GPU array inputs, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Many functions in Statistics and Machine Learning Toolbox™ automatically execute on a GPU when you use GPU array input data. For example, you can create a probability distribution object on a GPU, where the output is a GPU array.
pd = fitdist(gpuArray(x),"Normal")
Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information about supported devices, see GPU Support by Release (Parallel Computing Toolbox). For the complete list of Statistics and Machine Learning Toolbox™ functions that accept GPU arrays, see Functions and then, in the left navigation bar, scroll to the Extended Capability section and select GPU Arrays.
Examine Properties of GPU
You can query and select your GPU device using the gpuDevice function. If you have multiple GPUs, you can examine the properties of all GPUs detected in your system by using the gpuDeviceTable function. Then, you can select a specific GPU for single-GPU execution by passing its index to gpuDevice.
D = gpuDevice
D = CUDADevice with properties:
    Name: 'Tesla V100-PCIE-32GB'
    Index: 1
    ComputeCapability: '7.0'
    SupportsDouble: 1
    DriverVersion: 11.2000
    ToolkitVersion: 11
    MaxThreadsPerBlock: 1024
    MaxShmemPerBlock: 49152
    MaxThreadBlockSize: [1024 1024 64]
    MaxGridSize: [2.1475e+09 65535 65535]
    SIMDWidth: 32
    TotalMemory: 3.4090e+10
    AvailableMemory: 3.3552e+10
    MultiprocessorCount: 80
    ClockRateKHz: 1380000
    ComputeMode: 'Default'
    GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
    CanMapHostMemory: 1
    DeviceSupported: 1
    DeviceAvailable: 1
    DeviceSelected: 1
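The multi-GPU workflow described above can be sketched as follows. This is a minimal illustration, not part of the original example: the index value 1 is an assumption, and the code falls back to a message when no supported GPU is detected. gpuDeviceTable requires R2021a or later.

```matlab
% Sketch: list detected GPUs and select one by index.
% Requires Parallel Computing Toolbox for the GPU branch.
if canUseGPU
    gpuTable = gpuDeviceTable;   % one row per detected GPU
    disp(gpuTable)

    idx = 1;                     % assumed index; pick a row from gpuTable
    D = gpuDevice(idx);          % select the GPU with that index
    fprintf("Selected GPU %d: %s (%.1f GB total memory)\n", ...
        D.Index, D.Name, D.TotalMemory/1e9);
else
    disp("No supported GPU is available.")
end
```

Selecting a device with gpuDevice(idx) resets that device and clears its memory, so select the device before you create any GPU arrays.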
Execute Function on GPU
Explore a data distribution on a GPU using descriptive statistics.
Generate a data set of normally distributed random numbers on a GPU.
dist = randn(1e5,1e4,"gpuArray");
dist is a GPU array.
TF = isgpuarray(dist)
TF = logical 1
Execute a function with a GPU array input argument. For example, calculate the sample skewness for each column in dist. Because dist is a GPU array, the skewness function executes on the GPU and returns the result as a GPU array.
skew = skewness(dist);
Verify that the output skew is a GPU array.
TF = isgpuarray(skew)
TF = logical 1
Evaluate Speedup of GPU Execution
Evaluate function execution time on the GPU and compare performance with execution on a CPU.
Comparing how long code takes to run on a CPU and on a GPU helps you choose the appropriate execution environment. For example, when you compute descriptive statistics from sample data, both the execution time and the data transfer time contribute to the overall performance. For functions with GPU array support, the advantage of the GPU over the CPU generally grows as the number of observations increases.
Measure the function run time in seconds by using the gputimeit (Parallel Computing Toolbox) function. gputimeit is preferable to timeit for functions that use a GPU, because it ensures that all GPU operations have finished before it records the time and it compensates for the timing overhead.
skew = @() skewness(dist);
t = gputimeit(skew)
t = 0.6485
Evaluate the performance difference between the GPU and CPU by independently measuring the CPU execution time. In this case, execution of the code is faster on the GPU than on the CPU.
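One way to make this comparison is sketched below. This is not the measurement used for the timings reported in this example: the matrix is deliberately smaller than dist to limit memory use, and the variable names X, XGPU, tCPU, and tGPU are introduced here for illustration only.

```matlab
% Sketch: compare CPU and GPU run times for skewness on the same data.
X = randn(1e4,1e3);                        % data in CPU memory (assumed size)
tCPU = timeit(@() skewness(X));            % CPU run time in seconds

if canUseGPU
    XGPU = gpuArray(X);                    % one-time transfer to GPU memory
    tGPU = gputimeit(@() skewness(XGPU));  % GPU run time in seconds
    fprintf("CPU: %.4f s, GPU: %.4f s, speedup: %.1fx\n", ...
        tCPU, tGPU, tCPU/tGPU);
else
    fprintf("CPU: %.4f s (no supported GPU available)\n", tCPU);
end
```

The GPU timing here excludes the transfer performed by gpuArray(X). If your workflow moves data to the GPU for every call, include the transfer inside the timed function handle to get a fairer comparison.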
The performance of code on a GPU is heavily dependent on the GPU used. For additional information about measuring and improving GPU performance, see Measure and Improve GPU Performance (Parallel Computing Toolbox).
Single Precision on GPU
You can improve the performance of your code by calculating in single precision instead of double precision.
Determine the execution time of the skewness function using an input argument of the dist data set in single precision.
dist_single = single(dist);
skew_single = @() skewness(dist_single);
t_single = gputimeit(skew_single)
t_single = 0.2244
In this case, execution of the code with single precision data is faster than execution with double precision data.
The performance improvement is dependent on the GPU card and total number of cores. For more information about using single precision with a GPU, see Measure and Improve GPU Performance (Parallel Computing Toolbox).
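Single precision trades some numerical accuracy for speed, so it can help to check both at once. The following sketch assumes a smaller data set than dist and uses hypothetical variable names. It times both precisions and reports the largest difference between the two results, and it falls back to timeit on the CPU when no supported GPU is available.

```matlab
% Sketch: time skewness in double and single precision and compare results.
X = randn(1e4,1e3);                 % double-precision data (assumed size)
if canUseGPU
    X = gpuArray(X);                % run on the GPU when one is available
    timeFcn = @gputimeit;
else
    timeFcn = @timeit;              % CPU fallback
end
XSingle = single(X);                % same data in single precision

tDouble = timeFcn(@() skewness(X));
tSingle = timeFcn(@() skewness(XSingle));

% Largest absolute difference between the two results, returned to the CPU
maxDiff = gather(max(abs(skewness(X) - double(skewness(XSingle)))));
fprintf("double: %.4f s, single: %.4f s, max abs difference: %.2g\n", ...
    tDouble, tSingle, maxDiff);
```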
Dimensionality Reduction and Model Fitting on GPU
Implement dimensionality reduction and classification workflows on a GPU.
The pca (principal component analysis) function reduces data dimensionality by replacing several correlated variables with a new set of variables that are linear combinations of the original variables.
The fitcensemble function fits many classification learners to form an ensemble model that can make better predictions than a single learner.
Both functions are computationally intensive and can be significantly accelerated using a GPU.
For example, consider the humanactivity data set. The data set contains 24,075 observations of five physical human activities: sitting, standing, walking, running, and dancing. Each observation has 60 features extracted from acceleration data measured by smartphone accelerometer sensors. The data set contains the following variables:
actid - Response vector containing the activity IDs in integers: 1, 2, 3, 4, and 5 representing sitting, standing, walking, running, and dancing, respectively
actnames - Activity names corresponding to the integer activity IDs
feat - Feature matrix of 60 features for 24,075 observations
featlabels - Labels of the 60 features
Use 90% of the observations to train a model that classifies the five types of human activities, and use 10% of the observations to validate the trained model. Specify a 10% holdout for the test set by using cvpartition.
Partition = cvpartition(actid,"Holdout",0.10);
trainingInds = training(Partition); % Indices for the training set
testInds = test(Partition); % Indices for the test set
Transfer the training and test data to the GPU.
XTrain = gpuArray(feat(trainingInds,:));
YTrain = gpuArray(actid(trainingInds));
XTest = gpuArray(feat(testInds,:));
YTest = gpuArray(actid(testInds));
Find the principal components for the training data set XTrain.
[coeff,score,~,~,explained,mu] = pca(XTrain);
Find the number of components required to explain at least 99% of variability.
idx = find(cumsum(explained)>99,1);
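To see how this selection works, consider a small worked example. The explained values below are hypothetical percentages chosen for illustration. They are not output from the humanactivity data.

```matlab
% Sketch: choose the number of components from cumulative explained variance.
explained = [60; 25; 10; 4.5; 0.5];   % hypothetical percentages that sum to 100
cumulative = cumsum(explained);       % running total: 60, 85, 95, 99.5, 100
idx = find(cumulative > 99, 1);       % first component count exceeding 99%
fprintf("Components needed: %d\n", idx);
```

Here the first three components explain 95% of the variability, so the fourth is the first to push the total above 99. The comparison is strict, so a cumulative value of exactly 99 would not be selected. Use >= if you want to include that case.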
Determine the principal component scores that represent XTrain in the principal component space.
XTrainPCA = score(:,1:idx);
Fit an ensemble of learners for classification.
template = templateTree("MaxNumSplits",20,"Reproducible",true);
classificationEnsemble = fitcensemble(XTrainPCA,YTrain, ...
    "Method","AdaBoostM2", ...
    "NumLearningCycles",30, ...
    "Learners",template, ...
    "LearnRate",0.1, ...
    "ClassNames",[1; 2; 3; 4; 5]);
To use the trained model for the test set, you need to transform the test data set by using the PCA obtained from the training data set.
XTestPCA = (XTest-mu)*coeff(:,1:idx);
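The projection works because, with the default centering, pca computes the training scores in the same way: it subtracts the column means mu and multiplies by coeff. The following sketch checks this on small random matrices on the CPU. The data, the variable names, and the choice of k = 3 components are assumptions made for illustration.

```matlab
% Sketch: verify that (X - mu)*coeff reproduces the PCA scores, then
% apply the same transformation to new data.
rng(0)                                   % for reproducibility
XTrainDemo = randn(200,6);               % hypothetical training data
XTestDemo = randn(20,6);                 % hypothetical test data

[coeff,score,~,~,~,mu] = pca(XTrainDemo);
k = 3;                                   % assumed number of retained components

% Centering and projecting the training data reproduces its scores
trainProjected = (XTrainDemo - mu)*coeff(:,1:k);
maxErr = max(abs(trainProjected - score(:,1:k)),[],"all");
fprintf("Largest deviation from score: %.2g\n", maxErr);

% Apply the training means and coefficients to the test data
XTestDemoPCA = (XTestDemo - mu)*coeff(:,1:k);
```

Reusing mu and coeff from the training set, rather than running pca again on the test set, keeps the test data in the same coordinate system that the classifier was trained on.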
Evaluate the accuracy of the trained classifier with the test data.
classificationError = loss(classificationEnsemble,XTestPCA,YTest);
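By default, loss returns the misclassification rate for a classification ensemble, so the accuracy is one minus that value. The following sketch shows the conversion on the CPU with the fisheriris sample data set instead of humanactivity. The data set and variable names are assumptions made so that the example is self-contained.

```matlab
% Sketch: convert classification loss to accuracy on a small data set.
load fisheriris                          % meas (150-by-4), species (150-by-1)
rng(1)                                   % for a reproducible partition
part = cvpartition(species,"Holdout",0.10);
XTr = meas(training(part),:);
YTr = species(training(part));
XTe = meas(test(part),:);
YTe = species(test(part));

tree = templateTree("MaxNumSplits",20,"Reproducible",true);
ens = fitcensemble(XTr,YTr, ...
    "Method","AdaBoostM2", ...
    "NumLearningCycles",30, ...
    "Learners",tree, ...
    "LearnRate",0.1);

err = loss(ens,XTe,YTe);                 % fraction of misclassified observations
acc = 1 - err;                           % fraction classified correctly
fprintf("Test error: %.3f, accuracy: %.3f\n", err, acc);
```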
Transfer to Local Workspace
Transfer data or model properties from a GPU to the local workspace for use with a function that does not support GPU arrays.
Transferring GPU arrays can be costly and is generally not necessary unless you need to use the results with functions that do not support GPU arrays, or use the results in another workspace where a GPU is unavailable.
The gather (Parallel Computing Toolbox) function transfers data from the GPU into the local workspace. Gather the dist data, and then confirm that the data is no longer a GPU array.
dist = gather(dist);
TF = isgpuarray(dist)
TF = logical 0
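Because gather returns an ordinary in-memory array unchanged, you can write code that runs with or without a GPU and gathers only the final result. The following is a minimal sketch of that pattern. The data size and variable names are assumptions.

```matlab
% Sketch: device-agnostic code that gathers only the small final result.
X = randn(1000,10);            % hypothetical data in CPU memory
if canUseGPU
    X = gpuArray(X);           % move the data to the GPU when one is available
end

s = skewness(X);               % runs on the GPU if X is a GPU array
s = gather(s);                 % transfers the 1-by-10 result, not the full data
TF = isgpuarray(s)             % false in both cases
```

Gathering the small result instead of the full input keeps the costly transfer to a minimum.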
The gather function also transfers properties of a machine learning model from a GPU into the local workspace. Gather the classificationEnsemble model, and then confirm that the model properties that were previously GPU arrays, such as X, are no longer GPU arrays.
classificationEnsemble = gather(classificationEnsemble);
TF = isgpuarray(classificationEnsemble.X)
TF = logical 0