Deep Learning with MATLAB on Multiple GPUs
MATLAB® supports training a single deep neural network using multiple GPUs in parallel. By using parallel workers with GPUs, you can train with multiple GPUs on your local machine, on a cluster, or in the cloud. Using multiple GPUs can speed up training significantly. To decide whether you can expect multi-GPU training to deliver a performance gain, consider the following factors:
How long does each iteration take on each GPU? If each GPU iteration is short, then the added overhead of communication between GPUs can dominate. Try increasing the computation per iteration by using a larger mini-batch size.
Are all the GPUs on a single machine? Communication between GPUs on different machines introduces a significant communication delay. You can mitigate this if you have suitable hardware. For more information, see Advanced Support for Fast Multi-Node GPU Communication.
Tip
To train a single network using multiple GPUs on your local machine, you can simply specify the ExecutionEnvironment option as "multi-gpu" without changing the rest of your code. The trainnet function automatically uses your available GPUs for training computations. For an example showing how to train a network using multiple local GPUs, see Train Network Using Automatic Multi-GPU Support.
When you train on a remote cluster, specify the ExecutionEnvironment option as "parallel-auto". If the cluster has access to one or more GPUs, then trainnet uses only the GPUs for training. Workers without a unique GPU are never used for training computation.
If you want to use more resources, you can scale up deep learning training to clusters or the cloud. To learn more about parallel options, see Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud. To try an example, see Train Network in the Cloud Using Automatic Parallel Support.
Using a GPU or parallel options requires Parallel Computing Toolbox™. Using a GPU also requires a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox). Using a remote cluster also requires MATLAB Parallel Server™.
Use Multiple GPUs in Local Machine
Note
If you run MATLAB on a single machine in the cloud that you connect to via SSH or remote desktop protocol (RDP), then network execution and training use the same code as if you were running on your local machine.
If you have access to a machine with multiple GPUs, you can train a network using the trainnet function by setting the ExecutionEnvironment training option to "multi-gpu" using the trainingOptions function.
The "multi-gpu"
option allows you to use multiple GPUs in a
local parallel pool. If there is no current parallel pool,
trainnet
automatically starts a local parallel pool using
your default cluster profile settings. The pool has as many workers as the number of
available GPUs.
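For example, the following sketch shows this workflow; here XTrain, TTrain, and layers are placeholders for your own training data and network architecture:
% Sketch: XTrain, TTrain, and layers are placeholders for your own
% training data and network architecture.
options = trainingOptions("sgdm", ...
    ExecutionEnvironment="multi-gpu");  % use all available local GPUs
net = trainnet(XTrain,TTrain,layers,"crossentropy",options);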
For information on how to perform custom training using multiple GPUs in your local machine, see Run Custom Training Loops on a GPU and in Parallel.
Use Multiple GPUs in Cluster
For training with multiple GPUs in a remote cluster, set the ExecutionEnvironment training option to "parallel-auto" or "parallel-gpu" using the trainingOptions function.
If there is no current parallel pool, trainnet automatically starts a parallel pool using your default cluster profile settings. If the pool has access to GPUs, then only workers with a unique GPU perform training computation. If the pool does not have GPUs, then training takes place on all available CPU workers instead.
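For example, a minimal sketch; the profile name "MyCluster" is a hypothetical example, and XTrain, TTrain, and layers are placeholders for your own data and network:
% Sketch: "MyCluster" is a hypothetical cluster profile name.
parpool("MyCluster");                       % pool on the remote cluster
options = trainingOptions("sgdm", ...
    ExecutionEnvironment="parallel-auto");  % GPU workers train if present
net = trainnet(XTrain,TTrain,layers,"crossentropy",options);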
For information on how to perform custom training using multiple GPUs in a remote cluster, see Run Custom Training Loops on a GPU and in Parallel.
Optimize Mini-Batch Size and Learning Rate
Convolutional neural networks are typically trained iteratively using mini-batches of images, because the whole data set is usually too large to fit into GPU memory. For optimum performance, you can experiment with the mini-batch size by changing the MiniBatchSize option using the trainingOptions function.
The optimal mini-batch size depends on your exact network, data set, and GPU hardware. When training with multiple GPUs, each image batch is distributed between the GPUs. This effectively increases the total GPU memory available, allowing larger batch sizes. A recommended practice is to scale up the mini-batch size linearly with the number of GPUs, in order to keep the workload on each GPU constant. For example, if you are training on a single GPU using a mini-batch size of 64, and you want to scale up to training with four GPUs of the same type, you can increase the mini-batch size to 256 so that each GPU processes 64 observations per iteration.
Because increasing the mini-batch size makes each iteration's gradient estimate more significant, you can increase the learning rate as well. A good general guideline is to increase the learning rate proportionally to the increase in mini-batch size. Depending on your application, a larger mini-batch size and learning rate can speed up training without a decrease in accuracy, up to some limit.
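For example, a sketch of this linear scaling rule, assuming an illustrative single-GPU baseline of a mini-batch size of 64 and a learning rate of 0.001:
% Sketch: scale mini-batch size and learning rate with the GPU count.
% The baseline values 64 and 0.001 are illustrative assumptions.
numGPUs = gpuDeviceCount("available");
options = trainingOptions("sgdm", ...
    ExecutionEnvironment="multi-gpu", ...
    MiniBatchSize=64*numGPUs, ...       % 64 observations per GPU
    InitialLearnRate=0.001*numGPUs);    % scale learning rate to match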
You can use the Experiment Manager app to find optimal training options by sweeping through a range of hyperparameter values or by using Bayesian optimization. For more information on how to use Experiment Manager, see Create a Deep Learning Experiment for Classification.
Select Particular GPUs to Use for Training
If you do not want to use all of your GPUs, you can directly select the GPUs to use for training and inference. Doing so can be useful, for example, to avoid training on a low-performance GPU, such as the GPU that drives your display.
If your GPUs are in your local machine, you can use the gpuDeviceTable (Parallel Computing Toolbox) and gpuDeviceCount (Parallel Computing Toolbox) functions to examine your GPU resources and determine the indices of the GPUs you want to use.
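For example, to list your devices and their indices:
gpuDeviceTable               % table of detected GPUs and their indices
gpuDeviceCount("available")  % number of GPUs available for computation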
For single GPU training with the "auto" or "gpu" options, by default, MATLAB uses the GPU device with index 1. You can use a different GPU by selecting the device before you start training. Use the gpuDevice (Parallel Computing Toolbox) function to select the desired GPU using its index:
gpuDevice(index)
trainnet automatically uses the selected GPU when you set the ExecutionEnvironment option to "auto" or "gpu".

For multiple GPU training with the "multi-gpu" option, by default, MATLAB uses all available GPUs in your local machine. If you want to exclude GPUs, you can start the parallel pool in advance and select the devices manually.
For example, suppose you have three GPUs but you only want to use the devices with indices 1 and 3. You can use the following code to start a parallel pool with two workers and select one GPU on each worker.
useGPUs = [1 3];
parpool("Processes",numel(useGPUs));
spmd
    gpuDevice(useGPUs(spmdIndex));
end
trainnet automatically uses the current parallel pool when you set the ExecutionEnvironment option to "multi-gpu" (or to "parallel-auto" or "parallel-gpu" for the same result).
Train Multiple Networks on Multiple GPUs
If you want to train multiple models in parallel with one GPU each, start a parallel pool with one worker per available GPU, and train each network on a different worker. Use parfor or parfeval to train a network on each worker simultaneously. Use the trainingOptions function to set the ExecutionEnvironment name-value option to "gpu" on each worker.
For example, use code of the following form to train multiple networks in parallel on all available GPUs:
options = trainingOptions("sgdm",ExecutionEnvironment="gpu");
parfor i=1:gpuDeviceCount("available")
    trainnet(…,options);
end
To run in the background without blocking your local MATLAB, use parfeval. For examples showing how to train multiple networks using parfor and parfeval, see Train Deep Learning Networks in Parallel and Use parfeval to Train Multiple Deep Learning Networks.
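As a sketch of the parfeval approach, again with XTrain, TTrain, and layers standing in for your own data and network:
% Sketch: train one network per GPU in the background with parfeval.
% XTrain, TTrain, and layers are placeholders; in practice, vary the
% data or training options on each worker.
options = trainingOptions("sgdm",ExecutionEnvironment="gpu");
numGPUs = gpuDeviceCount("available");
parpool("Processes",numGPUs);
for i = numGPUs:-1:1
    f(i) = parfeval(@trainnet,1,XTrain,TTrain,layers,"crossentropy",options);
end
nets = cell(numGPUs,1);
for i = 1:numGPUs
    nets{i} = fetchOutputs(f(i));  % blocks until worker i finishes
end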
Advanced Support for Fast Multi-Node GPU Communication
Some multi-GPU features in MATLAB, including the trainnet function, are optimized for direct communication via fast interconnects for improved performance.
If you have appropriate hardware connections, then data transfer between multiple GPUs uses fast peer-to-peer communication, including NVLink, if available.
If you are using a Linux® compute cluster with fast interconnects between machines, such as InfiniBand, or fast interconnects between GPUs on different machines, such as GPUDirect RDMA, you might be able to take advantage of fast multi-node support in MATLAB. Enable this support on all the workers in your pool by setting the environment variable PARALLEL_SERVER_FAST_MULTINODE_GPU_COMMUNICATION to 1. Set this environment variable in the Cluster Profile Manager.
This feature is part of the NVIDIA NCCL library for GPU communication. To configure it, you must set additional environment variables to define the network interface protocol, especially NCCL_SOCKET_IFNAME. For more information, see the NCCL documentation and in particular the section on NCCL Environment Variables.
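Alternatively, if your MATLAB release supports the EnvironmentVariables option of parpool, you can copy the variables from your client session to the workers when the pool starts. In this sketch, the "MyCluster" profile name and "eth0" interface are assumed example values:
% Sketch: "MyCluster" and "eth0" are assumed example values.
setenv("PARALLEL_SERVER_FAST_MULTINODE_GPU_COMMUNICATION","1");
setenv("NCCL_SOCKET_IFNAME","eth0");   % network interface for NCCL
parpool("MyCluster",EnvironmentVariables= ...
    ["PARALLEL_SERVER_FAST_MULTINODE_GPU_COMMUNICATION", ...
     "NCCL_SOCKET_IFNAME"]);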
See Also
trainnet | trainingOptions | dlnetwork | gpuDevice (Parallel Computing Toolbox) | spmd (Parallel Computing Toolbox) | imageDatastore