Reproduce Network Training on a GPU

Since R2024b

This example shows how to train a network several times on a GPU and get identical results.

Ensuring the reproducibility of model training and inference on the GPU can be beneficial for experimentation and debugging. Reproducing model training on the GPU is particularly important in the verification of deep learning systems.

Prepare Training Data and Network

Use the supporting functions prepareDigitsData and prepareAutoencoderLayers to prepare the training data and the network architecture. These functions prepare the data and build the autoencoder network as described in the Prepare Datastore for Image-to-Image Regression example, and are attached to this example as supporting files.

[dsTrain,dsVal] = prepareDigitsData;
layers = prepareAutoencoderLayers;

Define Training Options

Specify the training options. The options are the same as those in the Prepare Datastore for Image-to-Image Regression example, with these exceptions.

  • Train for 5 epochs. Five epochs are not sufficient for the network to converge, but are sufficient to demonstrate whether training is exactly reproducible.

  • Return the network corresponding to the last training iteration. Doing so ensures a fair comparison when you compare the trained networks.

  • Train the network on a GPU. By default, the trainnet function uses a GPU if one is available. Training on a GPU requires a Parallel Computing Toolbox™ license and a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox).

  • Disable all visualizations.

options = trainingOptions("adam", ...
    MaxEpochs=5, ...
    MiniBatchSize=500, ...
    ValidationData=dsVal, ...
    ValidationPatience=5, ...
    OutputNetwork="last-iteration", ...
    ExecutionEnvironment="gpu", ...
    Verbose=false);

Check whether a GPU is selected and is available for training.

gpu = gpuDevice;
disp(gpu.Name + " selected.")
NVIDIA RTX A5000 selected.
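If you are not sure whether a supported GPU is present, you can check before selecting one. This is a sketch, not part of the original example; it uses the Parallel Computing Toolbox function gpuDeviceCount with the "available" option.

```matlab
% Check for a supported GPU before selecting one (optional sketch).
if gpuDeviceCount("available") > 0
    gpu = gpuDevice;                 % select the default GPU
    disp(gpu.Name + " selected.")
else
    disp("No supported GPU detected; training will run on the CPU.")
end
```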

Train Network Twice and Compare Results

Train the network twice using the trainnet function. To ensure that random number generation does not affect the training, set the random number generator and seed on the CPU and the GPU before each training run using the rng and gpurng (Parallel Computing Toolbox) functions, respectively.

rng("default")
gpurng("default")
net1 = trainnet(dsTrain,layers,"mse",options);

rng("default")
gpurng("default")
net2 = trainnet(dsTrain,layers,"mse",options);

Check whether the learnable parameters of the trained networks are equal. Because the training uses nondeterministic algorithms by default, the learnable parameters of the two networks differ.

isequal(net1.Learnables.Value,net2.Learnables.Value)
ans = logical
   0

Plot the difference between the weights of the first convolution layer in the first and second training runs. The plot shows a small difference between the weights of the two networks.

learnablesDiff = net1.Learnables.Value{1}(:) - net2.Learnables.Value{1}(:);
learnablesDiff = extractdata(learnablesDiff);

figure
bar(learnablesDiff)
ylabel("Difference in Weight Value")
xlabel("Learnable Parameter Number")
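To quantify the discrepancy numerically rather than visually, you can also, for example, report the largest absolute difference between corresponding weights. This short sketch reuses the learnablesDiff vector computed above; the variable name maxDiff is illustrative only.

```matlab
% Largest absolute difference between corresponding weights (sketch).
maxDiff = max(abs(learnablesDiff));
fprintf("Maximum absolute weight difference: %g\n", maxDiff)
```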

Set Determinism Option and Train Networks

Use the deep.gpu.deterministicAlgorithms function to set the GPU determinism state to true, and capture the previous state of the GPU determinism so that you can restore it later. All subsequent calls to GPU deep learning operations use only deterministic algorithms.

previousState = deep.gpu.deterministicAlgorithms(true);

Train the network twice using the trainnet function, setting the CPU and GPU random number generator and seed each time. Using only deterministic algorithms can slow down training and inference.

rng("default")
gpurng("default")
net3 = trainnet(dsTrain,layers,"mse",options);

rng("default")
gpurng("default")
net4 = trainnet(dsTrain,layers,"mse",options);

Check whether the learnable parameters of the trained networks are equal. Because the training uses only deterministic algorithms, the learnable parameters of the two networks are equal.

isequal(net3.Learnables.Value,net4.Learnables.Value)
ans = logical
   1

Plot the difference between the weights of the first convolution layer in the first and second training runs. The plot shows that there is no difference between the weights of the two networks.

learnablesDiff = net3.Learnables.Value{1}(:) - net4.Learnables.Value{1}(:);
learnablesDiff = extractdata(learnablesDiff);

figure
bar(learnablesDiff)
ylabel("Difference in Weight Value")
xlabel("Learnable Parameter Number")

Restore the GPU determinism state to its original value.

deep.gpu.deterministicAlgorithms(previousState);
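As an alternative to restoring the state manually, one pattern (a sketch, not part of the original example) is to create an onCleanup object immediately after enabling determinism, so the previous state is restored even if the training code errors:

```matlab
% Enable deterministic algorithms and guarantee restoration with onCleanup.
previousState = deep.gpu.deterministicAlgorithms(true);
restoreState = onCleanup(@() deep.gpu.deterministicAlgorithms(previousState));

% ... training code runs here; the previous state is restored when
% restoreState goes out of scope, even if an error occurs.
```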
