Automatically Save Checkpoints During Neural Network Training
During neural network training, intermediate results can be periodically saved to a MAT file for recovery in case the computer fails or you kill the training process. This protects the value of long training runs, which would otherwise have to be restarted from the beginning if interrupted. Checkpoint saves are especially useful for long parallel training sessions, which are more likely to be interrupted by computing resource failures.
Checkpoint saves are enabled with the optional 'CheckpointFile' training argument, followed by the checkpoint file name or path. If you specify only a file name, the file is placed in the working directory by default. The file must have the .mat file extension; if the extension is not specified, it is appended automatically. In this example, checkpoint saves are made to a file called MyCheckpoint.mat in the current working directory.
[x,t] = bodyfat_dataset;
net = feedforwardnet(10);
net2 = train(net,x,t,'CheckpointFile','MyCheckpoint.mat');
22-Mar-2013 04:49:05 First Checkpoint #1: /WorkingDir/MyCheckpoint.mat
22-Mar-2013 04:49:06 Final Checkpoint #2: /WorkingDir/MyCheckpoint.mat
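If you want the checkpoint file somewhere other than the working directory, you can pass a path instead of a bare file name, and the .mat extension is appended automatically if you omit it. The following is a minimal sketch; the use of tempdir as the save location is only an illustration, not a recommendation of the feature itself.

% Illustrative only: save checkpoints under the system temporary directory.
% The '.mat' extension is appended automatically because it is omitted here.
[x,t] = bodyfat_dataset;
net = feedforwardnet(10);
checkpointPath = fullfile(tempdir,'MyCheckpoint');
net2 = train(net,x,t,'CheckpointFile',checkpointPath);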
By default, checkpoint saves occur at most once every 60 seconds. For a short training run like the bodyfat example above, this results in only two checkpoint saves: one at the beginning and one at the end of training.
The optional training argument 'CheckpointDelay' can change the frequency of saves. For example, here the minimum checkpoint delay is set to 10 seconds for a time-series problem, where a neural network is trained to model a levitated magnet.
[x,t] = maglev_dataset;
net = narxnet(1:2,1:2,10);
[X,Xi,Ai,T] = preparets(net,x,{},t);
net2 = train(net,X,T,Xi,Ai,'CheckpointFile','MyCheckpoint.mat','CheckpointDelay',10);
22-Mar-2013 04:59:28 First Checkpoint #1: /WorkingDir/MyCheckpoint.mat
22-Mar-2013 04:59:38 Write Checkpoint #2: /WorkingDir/MyCheckpoint.mat
22-Mar-2013 04:59:48 Write Checkpoint #3: /WorkingDir/MyCheckpoint.mat
22-Mar-2013 04:59:58 Write Checkpoint #4: /WorkingDir/MyCheckpoint.mat
22-Mar-2013 05:00:08 Write Checkpoint #5: /WorkingDir/MyCheckpoint.mat
22-Mar-2013 05:00:09 Final Checkpoint #6: /WorkingDir/MyCheckpoint.mat
After a computer failure or training interruption, you can reload the checkpoint structure, which contains the best neural network obtained before the interruption and the training record. In this case, the stage field value is 'Final', indicating that the last save occurred at the final epoch because training completed successfully. The first-epoch checkpoint is indicated by 'First', and intermediate checkpoints by 'Write'.
load('MyCheckpoint.mat')
checkpoint = 

      file: '/WorkingDir/MyCheckpoint.mat'
      time: [2013 3 22 5 0 9.0712]
    number: 6
     stage: 'Final'
       net: [1x1 network]
        tr: [1x1 struct]
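Once the checkpoint is loaded, you can inspect the recovered network and training record before deciding how to proceed. The following is a minimal sketch; checking the stage field and plotting the saved training record are illustrative uses of the loaded structure, not part of the checkpoint feature itself.

load('MyCheckpoint.mat');              % restores the variable 'checkpoint'

% The saved structure bundles the best network so far and its training record.
net = checkpoint.net;                  % recovered network
tr  = checkpoint.tr;                   % training record (epochs, performance, ...)

% Check whether training had already finished when the last save occurred.
if strcmp(checkpoint.stage,'Final')
    disp('Training completed; no need to resume.')
else
    fprintf('Recovered intermediate checkpoint #%d.\n',checkpoint.number)
end

% Review the recorded training performance up to the interruption.
plotperform(tr)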
You can resume training from the last checkpoint by reloading the dataset (if necessary), then calling train with the recovered network.
[x,t] = maglev_dataset;
load('MyCheckpoint.mat');
net = checkpoint.net;
[X,Xi,Ai,T] = preparets(net,x,{},t);
net2 = train(net,X,T,Xi,Ai,'CheckpointFile','MyCheckpoint.mat','CheckpointDelay',10);
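For long runs, it can be convenient to wrap this logic in a script that resumes from a checkpoint when one exists and starts fresh otherwise. The following is a minimal sketch under that assumption; the file-existence check and script structure are not part of the checkpoint feature itself.

% Resume-or-start sketch for the maglev example (illustrative only).
[x,t] = maglev_dataset;

if exist('MyCheckpoint.mat','file') == 2
    % A checkpoint exists: continue training the recovered network.
    load('MyCheckpoint.mat');          % restores 'checkpoint'
    net = checkpoint.net;
else
    % No checkpoint yet: start from a newly configured network.
    net = narxnet(1:2,1:2,10);
end

[X,Xi,Ai,T] = preparets(net,x,{},t);
net2 = train(net,X,T,Xi,Ai, ...
    'CheckpointFile','MyCheckpoint.mat','CheckpointDelay',10);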