Use Experiment Manager in the Cloud with MathWorks Cloud Center
This example shows how to fine-tune your deep learning network by using Experiment Manager in the cloud. Use multiple high-performance NVIDIA® GPUs on an Amazon EC2® instance in MathWorks® Cloud Center to run multiple experiments in parallel. Tune the hyperparameters of your network and try different network architectures. You can sweep through a range of hyperparameter values automatically and save the results of each variation. Compare the results of your experiments to find the best network.
To train this model on AWS® using MathWorks Cloud Center, you must:
Ensure that you have the necessary toolboxes in your MathWorks account. For this example, you need the Deep Learning Toolbox™ and Parallel Computing Toolbox™.
Ensure that your MATLAB® license is configured for cloud use. For more details, see Requirements for Using Cloud Center.
Link your AWS account to Cloud Center. For details, see Link Cloud Account to Cloud Center.
Note that training is always faster with locally hosted training data. Accessing remote data adds overhead, especially if the data consists of many small files, as in the digit classification example. Training time depends on network speed and the proximity of the Amazon S3™ bucket to the machine running MATLAB. Larger data files (greater than 200 kB per file) make more efficient use of bandwidth on Amazon EC2. If you have sufficient memory, copy the data locally for the best training speed.
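For example, if your training data lives in an S3 bucket, you can copy it to the instance once and train from local disk. The following is a minimal sketch: the bucket path s3://mybucket/cifar10/train is hypothetical, and the sketch assumes your AWS credentials are set as environment variables and that your MATLAB release supports remote locations in copyfile.
localDir = fullfile(tempdir,"cifar10train");            % local staging folder
copyfile("s3://mybucket/cifar10/train",localDir);       % one-time copy from the hypothetical bucket
imdsLocal = imageDatastore(localDir, ...
    IncludeSubfolders=true,LabelSource="foldernames");  % then train from local data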
Classification of CIFAR-10 Image Data with Experiment Manager in the Cloud
Start and open MATLAB on an Amazon EC2 instance using MathWorks Cloud Center. For details, see Start MATLAB on Amazon Web Services (AWS) Using Cloud Center. For deep learning applications, choose a machine with GPUs, such as a P3, G4dn, or G5 instance. P3 instances have GPUs suited to general high-performance computing workloads that require good double-precision performance. G4dn and G5 instances have GPUs that provide better single-precision performance and are better suited for deep learning, image processing, computer vision, and automated driving simulations.
This example shows training on a g4dn.12xlarge GPU-enabled instance. The g4dn.12xlarge instance has four NVIDIA T4 Tensor Core GPUs with a total of 64 GB of GPU memory. If this instance type is not available in your region, pick another multi-GPU instance.
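To confirm that MATLAB sees the GPUs on your instance before you start, you can query them from the command window:
gpuDeviceCount("available") % should return 4 on a g4dn.12xlarge instance
gpuDeviceTable              % lists the name and memory of each detected GPU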
To get started with Experiment Manager for a classification network example, first download the CIFAR-10 training data to MATLAB on your EC2 instance. A simple way to do so is to use the downloadCIFARToFolders function, attached to this example as a supporting file. To access this file, open the example as a live script. The following code downloads the data set to your current directory.
directory = pwd;
[locationCifar10Train,locationCifar10Test] = downloadCIFARToFolders(directory);
Downloading CIFAR-10 data set...done.
Copying CIFAR-10 to folders...done.
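As an optional sanity check, you can confirm the download by counting the images per class; the CIFAR-10 training set has 50,000 images split evenly across 10 classes.
imdsCheck = imageDatastore(locationCifar10Train, ...
    IncludeSubfolders=true,LabelSource="foldernames");
countEachLabel(imdsCheck) % expect 10 classes with 5000 images each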
Next, open Experiment Manager by running experimentManager in the MATLAB Command Window or by opening the Experiment Manager app from the Apps tab.
experimentManager
In Experiment Manager, select New and then Project. After a new window opens, select Blank Project and then Built-In Training (trainnet).
Hyperparameters
In the hyperparameters section, add two new parameters. Name the first Momentum with values [0.01,0.1] and the second InitialLearningRate with values [1e-3,4e-3]. You can optionally add a description for your experiment.
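With two values for each of the two hyperparameters, an exhaustive sweep produces four trials. For each trial, Experiment Manager passes the selected values to the setup function as a struct; conceptually, one trial receives a struct like the following (illustration only, Experiment Manager builds this for you):
params = struct("Momentum",0.01,"InitialLearningRate",1e-3); % one of the four combinations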
Setup Function
Click Edit on the setup function Experiment1_setup1 and delete the contents. Copy the setup function Experiment1_setup1 and the supporting function convolutionalBlock provided at the end of this example and paste them into Experiment1_setup1.m.
Set the path to the training and test data in the setup function. Check the workspace variables locationCifar10Train and locationCifar10Test created when you downloaded the data, and replace the paths in the Experiment1_setup1 function with the values of these variables.
locationCifar10Train = "/path/to/train/data"; % replace with the path to the CIFAR-10 training data, see the locationCifar10Train workspace variable
locationCifar10Test = "/path/to/test/data"; % replace with the path to the CIFAR-10 test data, see the locationCifar10Test workspace variable
The function written in Experiment1_setup1.m is an adaptation of the Train Network Using Automatic Multi-GPU Support example. The setup of the deep learning network is copied. The training options are modified to (see the excerpt after this list):
Set ExecutionEnvironment to "gpu".
Replace InitialLearnRate with params.InitialLearningRate, which takes the values specified in the hyperparameters section of Experiment Manager.
Add a Momentum training option set to params.Momentum, also specified in the hyperparameters table.
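In the setup function, these modifications look like the following excerpt (the full call in the appendix sets additional options):
options = trainingOptions("sgdm", ...
    ExecutionEnvironment="gpu", ...                  % train each trial on a single GPU
    InitialLearnRate=params.InitialLearningRate, ... % value chosen by Experiment Manager
    Momentum=params.Momentum);                       % value chosen by Experiment Manager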
Run in Parallel
You are now ready to run the experiments. Determine the number of available GPUs and start a parallel pool with one worker per GPU by running:
Ngpus = gpuDeviceCount("available");
p = parpool(Ngpus);
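Note that parpool errors if a parallel pool is already running. A defensive variant of this step reuses an existing pool (a minimal sketch):
p = gcp("nocreate");                          % get the current pool without creating one
if isempty(p)
    p = parpool(gpuDeviceCount("available")); % otherwise start one worker per GPU
end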
In Experiment Manager, set Mode to Simultaneous and then click Run to run the trials in parallel on one GPU each (you cannot select the "multi-gpu" training option when running trials in parallel). You can see your experiments running concurrently on the Experiment Manager results tab. This example ran on four NVIDIA T4 Tensor Core GPUs, so four trials ran concurrently. Training took about 25 minutes.
Export Trial and Save to Cloud
Once the trials complete, compare the results to choose your preferred network. You can view the training plot and confusion matrix for each trial to help with your comparisons.
After you have selected your preferred trained network, export it to the MATLAB workspace by clicking Export. Doing so creates a dlnetwork object named trainedNetwork in the MATLAB workspace. To save trainedNetwork to Amazon S3™, follow the procedure in Transfer Data to Amazon S3 Buckets and Access Data Using MATLAB.
setenv("AWS_ACCESS_KEY_ID","YOUR_AWS_ACCESS_KEY_ID"); setenv("AWS_SECRET_ACCESS_KEY","YOUR_AWS_SECRET_ACCESS_KEY"); setenv("AWS_SESSION_TOKEN","YOUR_AWS_SESSION_TOKEN"); % optional setenv("AWS_DEFAULT_REGION","YOUR_AWS_DEFAULT_REGION"); % optional save("s3://mynewbucket/trainedNetwork.mat","trainedNetwork","-v7.3");
Appendix - Setup Function for CIFAR-10 Classification Network
function [augmentedImdsTrain,layers,lossFcn,options] = Experiment1_setup1(params)

locationCifar10Train = "/path/to/train/data"; % Replace with the path to the CIFAR-10 training data, see the locationCifar10Train workspace variable
locationCifar10Test = "/path/to/test/data"; % Replace with the path to the CIFAR-10 test data, see the locationCifar10Test workspace variable

imdsTrain = imageDatastore(locationCifar10Train, ...
    IncludeSubfolders=true, ...
    LabelSource="foldernames");
imdsTest = imageDatastore(locationCifar10Test, ...
    IncludeSubfolders=true, ...
    LabelSource="foldernames");

imageSize = [32 32 3];
pixelRange = [-4 4];
imageAugmenter = imageDataAugmenter( ...
    RandXReflection=true, ...
    RandXTranslation=pixelRange, ...
    RandYTranslation=pixelRange);
augmentedImdsTrain = augmentedImageDatastore(imageSize,imdsTrain, ...
    DataAugmentation=imageAugmenter, ...
    OutputSizeMode="randcrop");

blockDepth = 4; % blockDepth controls the depth of a convolutional block
netWidth = 32; % netWidth controls the number of filters in a convolutional block

layers = [
    imageInputLayer(imageSize)
    convolutionalBlock(netWidth,blockDepth)
    maxPooling2dLayer(2,Stride=2)
    convolutionalBlock(2*netWidth,blockDepth)
    maxPooling2dLayer(2,Stride=2)
    convolutionalBlock(4*netWidth,blockDepth)
    averagePooling2dLayer(8)
    fullyConnectedLayer(10)
    softmaxLayer];

miniBatchSize = 256;

lossFcn = "crossentropy";

options = trainingOptions("sgdm", ...
    ExecutionEnvironment="gpu", ...
    InitialLearnRate=params.InitialLearningRate, ... % hyperparameter 'InitialLearningRate'
    Momentum=params.Momentum, ... % hyperparameter 'Momentum'
    Metrics="accuracy", ...
    MiniBatchSize=miniBatchSize, ...
    Verbose=false, ...
    Plots="training-progress", ...
    L2Regularization=1e-10, ...
    MaxEpochs=50, ...
    Shuffle="every-epoch", ...
    ValidationData=imdsTest, ...
    ValidationFrequency=floor(numel(imdsTrain.Files)/miniBatchSize), ...
    LearnRateSchedule="piecewise", ...
    LearnRateDropFactor=0.1, ...
    LearnRateDropPeriod=45);
end

function layers = convolutionalBlock(numFilters,numConvLayers)
layers = [
    convolution2dLayer(3,numFilters,Padding="same")
    batchNormalizationLayer
    reluLayer];
layers = repmat(layers,numConvLayers,1);
end
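As an optional check before launching the full experiment, you can call the setup function directly with one hyperparameter combination and train a single configuration with trainnet (a quick smoke test outside the Experiment Manager workflow):
params = struct("Momentum",0.1,"InitialLearningRate",1e-3); % one combination, built by hand
[ds,layers,lossFcn,options] = Experiment1_setup1(params);
net = trainnet(ds,layers,lossFcn,options); % trains one configuration on one GPU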