Deep Learning with Big Data
Typically, training deep neural networks requires large amounts of data that often do not fit in memory. You do not need multiple computers to solve problems using data sets too large to fit in memory. Instead, you can divide your training data into mini-batches that contain a portion of the data set. By iterating over the mini-batches, networks can learn from large data sets without needing to load all data into memory at once.
If your data is too large to fit in memory, use a datastore to work with mini-batches of data for training and inference. MATLAB® provides many different types of datastore tailored for different applications. For more information about datastores for different applications, see Datastores for Deep Learning.
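As a rough illustration of the idea above, the following sketch reads an image collection one mini-batch at a time instead of loading it all at once. The folder name is hypothetical, and the batch size of 64 is an arbitrary choice for illustration.

```matlab
% Hypothetical folder containing one subfolder of images per class.
dataFolder = fullfile(tempdir, 'myImages');

% Creating the datastore records file locations only; no images are loaded yet.
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Ask the datastore to return 64 images per call to read.
imds.ReadSize = 64;

reset(imds);
while hasdata(imds)
    batch = read(imds);   % cell array holding up to 64 images
    % ... process this mini-batch here ...
end
```

Only one mini-batch is held in memory at any time, so the size of the data set is limited by disk space rather than by RAM.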
augmentedImageDatastore is specifically designed to preprocess and augment batches of image data for machine learning and computer vision applications. For an example showing how to use augmentedImageDatastore to manage image data during training, see Train Network with Augmented Images.
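A minimal sketch of wrapping an image datastore in an augmentedImageDatastore might look like the following. The folder name, the output size of 224-by-224, and the augmentation ranges are all assumptions chosen for illustration.

```matlab
% Hypothetical folder containing one subfolder of images per class.
dataFolder = fullfile(tempdir, 'myImages');
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Random augmentations applied to each mini-batch as it is read.
augmenter = imageDataAugmenter( ...
    'RandXReflection', true, ...
    'RandXTranslation', [-5 5], ...
    'RandYTranslation', [-5 5]);

% Resize every image to 224-by-224 and apply the augmentations on the fly.
augimds = augmentedImageDatastore([224 224], imds, ...
    'DataAugmentation', augmenter);

% Read one mini-batch; the result is a table of images and labels.
augimds.MiniBatchSize = 64;
batch = read(augimds);
```

The augmented images are generated as each mini-batch is read and are not saved to disk or kept in memory.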
Work with Big Data in Parallel
If you want to use large amounts of data to train a network, it can be helpful to train in parallel. Doing so can reduce the time it takes to train a network, because you can train using multiple mini-batches at the same time.
Train using a GPU or multiple GPUs where possible. Use a single CPU or multiple CPUs only if you do not have a GPU. CPUs are normally much slower than GPUs for both training and inference, and running on a single GPU typically offers much better performance than running on multiple CPU cores.
For more information about training in parallel, see Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud.
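One possible way to pick the hardware for training is sketched below. It assumes that Parallel Computing Toolbox is installed (for gpuDeviceCount); the solver and mini-batch size are placeholder choices.

```matlab
% Count the GPUs visible to MATLAB (requires Parallel Computing Toolbox).
numGPUs = gpuDeviceCount;

if numGPUs > 1
    env = 'multi-gpu';   % train on several local GPUs
elseif numGPUs == 1
    env = 'gpu';         % train on the single GPU
else
    env = 'cpu';         % fall back to the CPU only when no GPU is available
end

options = trainingOptions('sgdm', ...
    'MiniBatchSize', 128, ...
    'ExecutionEnvironment', env);

% Assuming a datastore ds and a layer array layers are already defined:
% net = trainNetwork(ds, layers, options);
```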
Preprocess Data in Background
When you train in parallel, you can fetch and preprocess your data in the background. This can be particularly useful if you want to preprocess your mini-batches during training, such as when using the transform function to apply a mini-batch preprocessing function to your datastore.
When you train a network using the trainNetwork function, you can fetch and preprocess data in the background by enabling background dispatch:
- Set the DispatchInBackground property of the datastore to true.
- Set the DispatchInBackground training option to true using the trainingOptions function.
During training, some workers are used for preprocessing data instead of network training computations. You can fine-tune the training computation and data dispatch loads between workers by specifying the WorkerLoad option using the trainingOptions function. For advanced options, you can try modifying the number of workers of the parallel pool.
You can use a built-in mini-batch datastore, such as augmentedImageDatastore, denoisingImageDatastore (Image Processing Toolbox), or pixelLabelImageDatastore (Computer Vision Toolbox). You can also use a custom mini-batch datastore with background dispatch enabled. For more information on creating custom mini-batch datastores, see Develop Custom Mini-Batch Datastore.
For more information about datastore requirements for background dispatching, see Use Datastore for Parallel Training and Background Dispatching.
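The settings described in this section could be combined as in the following sketch. The folder name is hypothetical, and the WorkerLoad value of 0.75 (the fraction of workers used for training computations, with the rest handling data) is only an illustrative starting point, not a recommendation.

```matlab
% Hypothetical folder containing one subfolder of images per class.
dataFolder = fullfile(tempdir, 'myImages');
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Enable background dispatch on the mini-batch datastore.
augimds = augmentedImageDatastore([224 224], imds, ...
    'DispatchInBackground', true);

% Enable background dispatch in the training options as well, and
% split the pool between training and data preprocessing.
options = trainingOptions('sgdm', ...
    'MiniBatchSize', 128, ...
    'ExecutionEnvironment', 'parallel', ...
    'DispatchInBackground', true, ...
    'WorkerLoad', 0.75);   % illustrative value; tune for your hardware

% Assuming a layer array layers is already defined:
% net = trainNetwork(augimds, layers, options);
```

If the GPUs are idle while waiting for data, lowering WorkerLoad frees more workers for preprocessing; if preprocessing is light, raising it gives more workers to training.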
Work with Big Data in the Cloud
Storing data in the cloud can make it easier to access from cloud applications, without needing to upload or download large amounts of data each time you create cloud resources. Both AWS® and Azure® offer data storage services, such as AWS S3 and Azure Blob Storage, respectively.
To avoid the time and cost associated with transferring large quantities of data, it is recommended that you set up cloud resources for your deep learning applications using the same cloud provider and region that you use to store your data in the cloud.
To access data stored in the cloud from MATLAB, you must configure your machine with your access credentials. You can configure access from inside MATLAB using environment variables. For more information on how to set environment variables to access cloud data from your client MATLAB, see Work with Remote Data. For more information on how to set environment variables on parallel workers in a remote cluster, see Set Environment Variables on Workers (Parallel Computing Toolbox).
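As a sketch of what configuring access from inside MATLAB can look like for AWS S3, the snippet below sets credentials through environment variables and then points a datastore at a bucket. The credential strings, region, and bucket path are placeholders, and the exact set of variables you need may differ depending on your MATLAB release and how your account is configured.

```matlab
% Placeholder credentials; substitute your own values.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_SECRET_ACCESS_KEY');
setenv('AWS_DEFAULT_REGION', 'us-east-1');   % assumed region of the bucket

% Hypothetical bucket and folder. The datastore reads files directly
% from S3, so nothing is downloaded until a mini-batch is requested.
imds = imageDatastore('s3://my-hypothetical-bucket/images', ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');
```

Avoid saving real credentials in scripts that you share or commit to source control.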
For examples showing how to upload data to the cloud and how to access that data from MATLAB, see Work with Deep Learning Data in AWS and Work with Deep Learning Data in Azure Blob Storage.
For more information about deep learning in the cloud, see Deep Learning in the Cloud.
Preprocess Data for Custom Training Loops
When you train a network using a custom training loop, you can process your data in the background by using minibatchqueue and enabling background dispatch. A minibatchqueue object iterates over a datastore to prepare mini-batches for custom training loops. Enable background dispatch when your mini-batches require heavy preprocessing.
To enable background dispatch, you must:
- Set the DispatchInBackground property of the datastore to true.
- Set the DispatchInBackground property of the minibatchqueue to true.
When you use this option, MATLAB opens a local parallel pool to use for preprocessing your data. Data preprocessing for custom training loops is supported when training using local resources only. For example, use this option when training using a single GPU in your local machine.
For more information about datastore requirements for background dispatching, see Use Datastore for Parallel Training and Background Dispatching.
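The following sketch shows one way these pieces might fit together in a script. The folder name, image size, and mini-batch size are assumptions, and preprocessMiniBatch is a hypothetical helper written for this example rather than a built-in function.

```matlab
% Hypothetical folder containing one subfolder of images per class.
dataFolder = fullfile(tempdir, 'myImages');
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Enable background dispatch on the datastore.
augimds = augmentedImageDatastore([224 224], imds, ...
    'DispatchInBackground', true);

% Enable background dispatch on the queue as well.
mbq = minibatchqueue(augimds, ...
    'MiniBatchSize', 64, ...
    'MiniBatchFcn', @preprocessMiniBatch, ...
    'MiniBatchFormat', ["SSCB", ""], ...
    'DispatchInBackground', true);

% Skeleton of one epoch of a custom training loop.
shuffle(mbq);
while hasdata(mbq)
    [X, T] = next(mbq);   % X: images as a dlarray, T: one-hot labels
    % ... evaluate the model loss and update the network here ...
end

% Hypothetical helper: stack images along the 4th dimension and
% one-hot encode the categorical labels.
function [X, T] = preprocessMiniBatch(dataX, dataT)
    X = cat(4, dataX{:});
    T = cat(2, dataT{:});
    T = onehotencode(T, 1);
end
```

While the loop body processes one mini-batch, workers in the local parallel pool prepare the next ones, which helps most when the preprocessing function is expensive.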
- Datastores for Deep Learning
- Data Sets for Deep Learning
- Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud
- Deep Learning in the Cloud
- Deep Learning with MATLAB on Multiple GPUs
- Train Deep Learning Networks in Parallel
- Work with Deep Learning Data in AWS
- Work with Deep Learning Data in Azure Blob Storage