Train Reinforcement Learning Agents
Once you have created an environment and reinforcement learning agent, you can train the
agent in the environment using the
train function. To
configure your training, use an rlTrainingOptions object. For example, create a training option set
opt, and train agent
agent in environment
env.
opt = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode',1000,...
    'StopTrainingCriteria',"AverageReward",...
    'StopTrainingValue',480);
trainResults = train(agent,env,opt);
For more information on creating agents, see Reinforcement Learning Agents. For more information on creating environments, see Create MATLAB Reinforcement Learning Environments and Create Simulink Reinforcement Learning Environments.
train updates the agent as training progresses. This is possible because
agent is a handle object. To preserve the original agent
parameters for later use, save the agent to a MAT-file. For more information about handle
objects, see Handle Object Behavior.
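For example, the following is a minimal sketch that keeps a copy of the untrained agent before calling train; the file name initialAgent.mat is only an illustrative choice.
% Keep a copy of the agent parameters before training modifies them in place
save("initialAgent.mat",'agent')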
Training terminates automatically when the conditions you specify in the
StopTrainingCriteria and StopTrainingValue options of the
rlTrainingOptions object are satisfied. You can also terminate
training before any termination condition is reached by clicking Stop
Training in the Reinforcement Learning Episode Manager.
When training terminates, the training statistics and results are stored in the
trainResults object. Because
train updates the agent at the end of each episode, and because
trainResults stores the last training results, along with the data needed to correctly
recreate the training scenario and update the Episode Manager, you can later resume training
from the exact point at which it stopped. To do so, at the command line, type:
trainResults = train(agent,env,trainResults);
The trainResults object contains, as one of its properties, the
rlTrainingOptions object
opt specifying the training
option set. Therefore, to restart the training with updated training options, first change the
training options in
trainResults using dot notation. If the maximum number of
episodes was already reached in the previous training session, you must increase the maximum
number of episodes.
For example, disable displaying the training progress on Episode Manager, enable the
Verbose option to display training progress at the command line, change
the maximum number of episodes to 2000, and then restart the training, returning a new
trainResults object as output.
trainResults.TrainingOptions.MaxEpisodes = 2000;
trainResults.TrainingOptions.Plots = "none";
trainResults.TrainingOptions.Verbose = 1;
trainResultsNew = train(agent,env,trainResults);
In general, training performs the following steps.
Initialize the agent.
For each episode:
Reset the environment.
Get the initial observation s0 from the environment.
Compute the initial action a0 = μ(s0), where μ(s) is the current policy.
Set the current action to the initial action (a←a0), and set the current observation to the initial observation (s←s0).
While the episode is not finished or terminated, perform the following steps.
Apply action a to the environment and obtain the next observation s' and the reward r.
Learn from the experience set (s,a,r,s').
Compute the next action a' = μ(s').
Update the current action with the next action (a←a') and update the current observation with the next observation (s←s').
Terminate the episode if the termination conditions defined in the environment are met.
If the training termination condition is met, terminate training. Otherwise, begin the next episode.
The specifics of how the software performs these steps depend on the configuration of the agent and environment. For instance, resetting the environment at the start of each episode can include randomizing initial state values, if you configure your environment to do so. For more information on agents and their training algorithms, see Reinforcement Learning Agents. To use parallel processing and GPUs to speed up training, see Train Agents Using Parallel Computing and GPUs.
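The following MATLAB-style pseudocode only illustrates these steps; it is not the toolbox implementation, and the helpers policy and learn, the step and reset signatures, and the stopConditionMet flag are placeholders for whatever your specific agent and environment provide.
% Illustrative training loop (pseudocode, not the actual train internals)
for episode = 1:maxEpisodes
    s = reset(env);             % reset the environment, get initial observation s0
    a = policy(s);              % initial action a0 = mu(s0) from the current policy (placeholder helper)
    isDone = false;
    while ~isDone
        [sNext,r,isDone] = step(env,a);   % apply a, get next observation s' and reward r (placeholder signature)
        learn(agent,s,a,r,sNext);         % learn from the experience (s,a,r,s') (placeholder helper)
        a = policy(sNext);                % next action a' = mu(s')
        s = sNext;                        % advance the current observation
    end
    if stopConditionMet                   % placeholder for the training termination condition
        break
    end
end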
By default, calling the
train function opens the Reinforcement
Learning Episode Manager, which lets you visualize the training progress.
The Episode Manager plot shows the reward for each episode (EpisodeReward) and a running average reward value (AverageReward).
For agents with a critic, Episode Q0 is the estimate of the
discounted long-term reward at the start of each episode, given the initial observation of
the environment. As training progresses, if the critic is well designed and learns
successfully, Episode Q0 approaches the true discounted
long-term reward on average, which may be offset from the EpisodeReward value
because of discounting. For a well-designed critic using an undiscounted reward
(that is, DiscountFactor is equal to 1),
Episode Q0 approaches the true episode reward on average.
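To make the relationship explicit (this is a general reinforcement-learning identity, not a toolbox-specific formula): with discount factor γ (the DiscountFactor option) and episode rewards r_1, ..., r_T,
\[ \text{Episode Q0} \approx \mathbb{E}\!\left[\sum_{k=0}^{T-1} \gamma^{k}\, r_{k+1}\right], \qquad \text{EpisodeReward} = \sum_{k=0}^{T-1} r_{k+1}, \]
so the two quantities coincide on average only when γ = 1.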
The Episode Manager also displays various episode and training statistics. You can also
configure the train function to return episode and training information. To
turn off the Reinforcement Learning Episode Manager, set the Plots option of the rlTrainingOptions object to "none".
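For example, the following training options (an illustrative combination, not a required one) disable the Episode Manager and print training progress at the command line instead.
% Disable the Episode Manager plot and report progress at the command line
opt = rlTrainingOptions('Plots',"none",'Verbose',true);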
Save Candidate Agents
During training, you can save candidate agents that meet conditions you specify in the
SaveAgentCriteria and SaveAgentValue options of your
rlTrainingOptions object. For instance, you can save any agent whose
episode reward exceeds a certain value, even if the overall condition for terminating
training is not yet satisfied. For example, save agents when the episode reward is greater than 100.
opt = rlTrainingOptions('SaveAgentCriteria',"EpisodeReward",'SaveAgentValue',100);
train stores saved agents in a MAT-file in the folder you specify using the
SaveAgentDirectory option of
rlTrainingOptions. Saved agents can be useful, for instance, to test
candidate agents generated during a long-running training process. For details about saving
criteria and saving location, see rlTrainingOptions.
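For example, the following options save qualifying agents to a specific folder; the folder name "savedAgents" is only an illustrative choice.
% Save agents whose episode reward exceeds 100 to the savedAgents folder
opt = rlTrainingOptions(...
    'SaveAgentCriteria',"EpisodeReward",...
    'SaveAgentValue',100,...
    'SaveAgentDirectory',"savedAgents");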
After training is complete, you can save the final trained agent from the MATLAB® workspace using the
save function. For example, save the agent
agent to the file
finalAgent.mat in the folder specified by
opt.SaveAgentDirectory.
save(opt.SaveAgentDirectory + "/finalAgent.mat",'agent')
By default, when DDPG and DQN agents are saved, the experience buffer data is not saved.
If you plan to further train your saved agent, you can start training with the previous
experience buffer as a starting point. In this case, set the
SaveExperienceBufferWithAgent option to true. For
some agents, such as those with large experience buffers and image-based observations, the
memory required for saving the experience buffer is large. In these cases, you must ensure
that enough memory is available for the saved agents.
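As a minimal sketch, assuming you create a DQN agent from an rlDQNAgentOptions object, you can enable this behavior when you configure the agent.
% Save the experience buffer together with the agent (illustrative option set)
agentOpts = rlDQNAgentOptions('SaveExperienceBufferWithAgent',true);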
Validate Trained Policy
When validating your agent, consider checking how your agent handles the following:
Changes to simulation initial conditions — To change the model initial conditions, modify the reset function for the environment. For example reset functions, see Create MATLAB Environment Using Custom Functions, Create Custom MATLAB Environment from Template, and Create Simulink Reinforcement Learning Environments.
Mismatches between the training and simulation environment dynamics — To check such mismatches, create test environments in the same way that you created the training environment, modifying the environment behavior.
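One way to run these checks is to simulate the trained agent in the (possibly modified) environment. A minimal sketch, assuming the trained agent agent and an environment env as in the earlier examples:
% Simulate the trained agent for up to 500 steps per episode
simOpts = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOpts);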
As with parallel training, if you have Parallel Computing Toolbox™ software, you can run multiple parallel simulations on multicore computers. If
you have MATLAB
Parallel Server™ software, you can run multiple parallel simulations on computer clusters or
cloud resources. For more information on configuring your simulation to use parallel computing, see rlSimulationOptions.
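A minimal sketch of a parallel simulation configuration, assuming Parallel Computing Toolbox is available; the specific values are only illustrative.
% Run four independent simulations in parallel
simOpts = rlSimulationOptions('MaxSteps',500,'NumSimulations',4,'UseParallel',true);
experiences = sim(env,agent,simOpts);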
If your training environment implements the
plot method, you can
visualize the environment behavior during training and simulation. If you call
plot(env) before training or simulation, where env
is your environment object, then the visualization updates during training to allow you to
visualize the progress of each episode or simulation.
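For example, assuming an environment env that implements plot and the training options opt from earlier:
% Open the environment visualization; it updates as training runs
plot(env)
trainResults = train(agent,env,opt);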
Environment visualization is not supported when training or simulating your agent using parallel computing.
For custom environments, you must implement your own plot function.
For more information on creating a custom environment with a plot
function, see Create Custom MATLAB Environment from Template.