Configure Exploration For Reinforcement Learning Agents
This example shows how to use visualization to configure exploration settings for reinforcement learning agents.
Fix Random Seed Generator to Improve Reproducibility
The example code may involve computation of random numbers at various stages, such as initialization of the agent, creation of the actor and critic, resetting the environment during simulations, generating observations (for stochastic environments), generating exploration actions, and sampling mini-batches of experiences for learning. Fixing the random number stream preserves the sequence of random numbers every time you run the code and improves the reproducibility of results. You will fix the random number stream at various locations in the example.
Fix the random number stream with the seed 0 and the random number algorithm Mersenne Twister. For more information on random number generation, see rng.
previousRngState = rng(0,"twister");
The output previousRngState is a structure that contains information about the previous state of the stream. You will restore the state at the end of the example.
Overview
Exploration in reinforcement learning refers to the strategy that an agent uses to discover new knowledge about its environment. Configuring exploration involves adjusting the parameters that govern how the agent explores the environment, and it typically takes numerous iterations before a satisfactory training performance is achieved. Visualizing the logged data can reduce the overhead in such scenarios by helping you configure exploration.
In this example, you will visualize and configure exploration metrics for the following reinforcement learning agents:
Deep Q-network (DQN) agent.
Deep deterministic policy gradient (DDPG) agent.
Create two cart-pole reinforcement learning environments, one with a continuous action space and the other with a discrete action space.
cEnv = rlPredefinedEnv("cartpole-continuous");
dEnv = rlPredefinedEnv("cartpole-discrete");
For more information about these environments, see rlPredefinedEnv.
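To confirm the difference between the two environments, you can compare their action specifications. The following sketch uses getActionInfo, which is also used later in this example; the exact display of the specification objects depends on your toolbox version.
% compare the action specifications of the two environments
dActInfo = getActionInfo(dEnv)   % discrete action specification
cActInfo = getActionInfo(cEnv)   % continuous action specification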
Create a variable doTraining to enable or disable training of the agents in the example. Since training can be computationally intensive, the value is set to false. You can enable training by setting the value to true.
doTraining = false;
Epsilon Greedy Exploration
A deep Q-network (DQN) agent performs exploration with an epsilon-greedy policy. The parameters of interest are:
Initial epsilon value (default 1.0).
Minimum epsilon value (default 0.01).
Epsilon decay rate (default 0.005).
When you create a DQN agent, the above default values are assigned to the parameters. First, you will train the agent with the default values.
For more information, see Deep Q-Network (DQN) Agent.
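The sketch below illustrates how these parameters shape the exploration schedule, assuming a multiplicative update in which epsilon is reduced by a factor of (1 - EpsilonDecay) at each agent step until it reaches EpsilonMin. This is a simplified illustration, not the agent's internal implementation.
% approximate epsilon schedule under the multiplicative-decay assumption
epsilon = 1.0;         % initial epsilon value
epsilonMin = 0.01;     % minimum epsilon value
epsilonDecay = 0.005;  % epsilon decay rate
numSteps = 2000;
epsilonTrace = zeros(1,numSteps);
for k = 1:numSteps
    epsilonTrace(k) = epsilon;
    epsilon = max(epsilonMin, epsilon*(1 - epsilonDecay));
end
plot(1:numSteps, epsilonTrace)
xlabel("Agent step")
ylabel("Epsilon")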
Create the agent object using the observation and action input specifications of the dEnv environment. The agent has the following options:
A learning rate of 1e-4 and 20 hidden units for the critic neural network.
The double DQN algorithm is not used for learning.
A mini-batch size of 256 is used for learning.
The target critic network is updated using a smoothing factor of 1.0 every 4 learning iterations.
% fix the random seed for reproducibility
rng(0,"twister");

% observation and action input specifications
dEnvObsInfo = getObservationInfo(dEnv);
dEnvActInfo = getActionInfo(dEnv);

% agent initialization options
dqnInitOpts = rlAgentInitializationOptions(NumHiddenUnit=20);

% DQN agent options
criticOpts = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1);
dqnOpts = rlDQNAgentOptions( ...
    CriticOptimizerOptions=criticOpts, ...
    MiniBatchSize=256, ...
    TargetSmoothFactor=1, ...
    TargetUpdateFrequency=4, ...
    UseDoubleDQN=false);

% create the agent
dqnAgent = rlDQNAgent(dEnvObsInfo, dEnvActInfo, ...
    dqnInitOpts, dqnOpts);
For more information, see rlDQNAgent.
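You can inspect the current epsilon value of the agent's exploration policy directly. The sketch below uses getExplorationPolicy and getState, the same functions used by the logEpsilon callback at the end of this example.
% inspect the epsilon value of the exploration policy
dqnPolicy = getExplorationPolicy(dqnAgent);
dqnPolicyState = getState(dqnPolicy);
dqnPolicyState.Epsilon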
Create a data logger object to log data during training. The callback function logEpsilon (provided at the end of the example) logs the epsilon values from the training. The logged data is saved in the current directory under the folder named dqn.
dqnLogger = rlDataLogger();
dqnLogger.LoggingOptions.LoggingDirectory = "dqn";
dqnLogger.AgentStepFinishedFcn = @logEpsilon;
Train the agent for a maximum of 500 episodes. Training stops early if the average reward reaches 480.
dqnTrainOpts = rlTrainingOptions( ...
    MaxEpisodes=500, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);
if doTraining
    dqnResult = train(dqnAgent, dEnv, dqnTrainOpts, ...
        Logger=dqnLogger);
end
An example of the training is shown in the Reinforcement Learning Training Monitor window. Depending on your system configuration, you may get a different training result.
To visualize exploration, first click View Logged Data in the Reinforcement Learning Training Monitor window.
In the Reinforcement Learning Data Viewer window, select Epsilon and choose the Line plot type from the toolstrip.
As seen in the plots:
The average reward received over 500 episodes did not reach the desired value of 480.
The epsilon value decayed to the minimum value after around 1000 iterations, after which the agent did not perform further exploration.
Close the Reinforcement Learning Data Viewer and Reinforcement Learning Training Monitor windows.
Inspect the exploration parameters.
dqnOpts.EpsilonGreedyExploration
ans = 
  EpsilonGreedyExploration with properties:

    EpsilonDecay: 0.0050
         Epsilon: 1
      EpsilonMin: 0.0100
The default value of the epsilon decay rate is 0.005. Specify a smaller decay rate so that the agent performs more exploration.
dqnOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;
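Under the same multiplicative-decay assumption as the earlier sketch, you can estimate how many agent steps it takes for epsilon to decay from its initial value to the minimum. With the default rate of 0.005 the estimate is roughly 900 steps, consistent with the roughly 1000 iterations observed in the previous training, while a rate of 1e-3 stretches the schedule to several thousand steps.
% rough estimate of steps until epsilon reaches EpsilonMin, assuming
% epsilon is multiplied by (1 - EpsilonDecay) at every agent step
stepsToMin = @(decay) ceil(log(0.01/1.0)/log(1 - decay));
stepsToMin(0.005)  % approximately 900 steps
stepsToMin(1e-3)   % approximately 4600 steps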
Configure and train the agent with new exploration parameters.
% fix the random seed for reproducibility
rng(0,"twister");

% create the agent
dqnAgent = rlDQNAgent(dEnvObsInfo, dEnvActInfo, ...
    dqnInitOpts, dqnOpts);

% train the agent
dqnLogger.LoggingOptions.LoggingDirectory = "dqnTuned";
if doTraining
    dqnResult = train(dqnAgent, dEnv, dqnTrainOpts, ...
        Logger=dqnLogger);
end
An example of the training with the new exploration parameters is shown in the Reinforcement Learning Training Monitor window.
Open the Reinforcement Learning Data Viewer window and plot the Epsilon values again.
As seen in the plots:
This time, the average reward reached the desired value of 480.
The epsilon value decayed more slowly than in the previous training. Increasing the exploration helped improve the training performance.
Close the Reinforcement Learning Data Viewer and Reinforcement Learning Training Monitor windows.
Ornstein-Uhlenbeck (OU) Noise
A deep deterministic policy gradient (DDPG) agent uses the Ornstein-Uhlenbeck (OU) noise model for exploration.
The parameters of interest for the noise model are:
Mean of the noise (default 0).
Mean attraction constant (default 0.15).
Initial standard deviation (default 0.3).
Standard deviation decay rate (default 0).
Minimum standard deviation (default 0).
When you create a DDPG agent, the above default values are assigned to the parameters. First, you will train the agent with the default values.
For more information, see Deep Deterministic Policy Gradient (DDPG) Agent.
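The sketch below simulates a simple OU noise process to show how these parameters affect exploration, assuming the discretized update x(k+1) = x(k) + MeanAttractionConstant*(Mean - x(k))*Ts + StandardDeviation*sqrt(Ts)*randn and a multiplicative decay of the standard deviation towards its minimum. This is an illustrative approximation, not the agent's exact noise implementation.
% simulate an approximate OU noise process with the default parameters
Ts = 0.02;          % sample time
mu = 0;             % mean of the noise
theta = 0.15;       % mean attraction constant
sigma = 0.3;        % initial standard deviation
sigmaMin = 0;       % minimum standard deviation
sigmaDecay = 0;     % standard deviation decay rate
numSteps = 1000;
x = 0;
noiseTrace = zeros(1,numSteps);
for k = 1:numSteps
    noiseTrace(k) = x;
    x = x + theta*(mu - x)*Ts + sigma*sqrt(Ts)*randn;
    sigma = max(sigmaMin, sigma*(1 - sigmaDecay));
end
plot((0:numSteps-1)*Ts, noiseTrace)
xlabel("Time (s)")
ylabel("OU noise")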
Create the agent object using the observation and action input specifications of the cEnv environment. The agent has the following options:
A learning rate of 1e-4 and 200 hidden units for the actor neural network.
A learning rate of 1e-3 and 200 hidden units for the critic neural network.
A mini-batch size of 64 is used for learning.
A sample time of 0.02 seconds.
% fix the random stream for reproducibility
rng(0,"twister");

% observation and action input specifications
cEnvObsInfo = getObservationInfo(cEnv);
cEnvActInfo = getActionInfo(cEnv);

% agent initialization options
ddpgInitOpts = rlAgentInitializationOptions(NumHiddenUnit=200);

% DDPG agent options
actorOpts = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1);
criticOpts = rlOptimizerOptions( ...
    LearnRate=1e-3, ...
    GradientThreshold=1)
criticOpts = 
  rlOptimizerOptions with properties:

                  LearnRate: 1.0000e-03
          GradientThreshold: 1
    GradientThresholdMethod: "l2norm"
     L2RegularizationFactor: 1.0000e-04
                  Algorithm: "adam"
        OptimizerParameters: [1x1 rl.option.OptimizerParameters]
ddpgOpts = rlDDPGAgentOptions( ...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts, ...
    MiniBatchSize=64, ...
    SampleTime=cEnv.Ts);

% create the agent
ddpgAgent = rlDDPGAgent(cEnvObsInfo, cEnvActInfo, ...
    ddpgInitOpts, ddpgOpts);
For more information, see rlDDPGAgent.
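As with the DQN agent, you can inspect the state of the DDPG agent's exploration policy. The field names below match the ones logged by the logOUNoise callback at the end of this example.
% inspect the noise and standard deviation of the exploration policy
ddpgPolicy = getExplorationPolicy(ddpgAgent);
ddpgPolicyState = getState(ddpgPolicy);
ddpgPolicyState.Noise{1}
ddpgPolicyState.StandardDeviation{1}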
Create a data logger object to log data during training. The callback function logOUNoise (provided at the end of the example) logs the noise and standard deviation values from the training. Save the logged data in the folder named ddpg.
ddpgLogger = rlDataLogger();
ddpgLogger.LoggingOptions.LoggingDirectory = "ddpg";
ddpgLogger.AgentStepFinishedFcn = @logOUNoise;
Train the agent for a maximum of 500 episodes. Training stops early if the average reward reaches 480.
ddpgTrainOpts = rlTrainingOptions( ...
    MaxEpisodes=500, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);
if doTraining
    ddpgResult = train(ddpgAgent, cEnv, ddpgTrainOpts, ...
        Logger=ddpgLogger);
end
An example of the training is shown in the Reinforcement Learning Training Monitor window. Depending on your system configuration, you may get a different training result.
Click the View Logged Data button in the Reinforcement Learning Training Monitor window.
In the Reinforcement Learning Data Viewer window:
Select OUNoise and choose the Line plot type from the toolstrip.
Select StandardDeviation and choose the Line plot type from the toolstrip.
As seen in the plots:
The agent did not achieve the desired average reward of 480.
The noise value generally remained within +/-1.
The standard deviation value remained constant throughout the training.
Close the Reinforcement Learning Data Viewer and Reinforcement Learning Training Monitor windows.
Inspect the default exploration parameters.
ddpgOpts.NoiseOptions
ans = 
  OrnsteinUhlenbeckActionNoise with properties:

                 InitialAction: 0
                          Mean: 0
        MeanAttractionConstant: 0.1500
    StandardDeviationDecayRate: 0
             StandardDeviation: 0.3000
          StandardDeviationMin: 0
It may be useful to decay the exploration and gradually shift the agent's behavior from exploration to exploitation as it learns more about the environment. Early on, more exploration ensures a diverse range of experiences, which is crucial for the agent to learn a robust policy. However, as learning progresses, too much exploration can introduce unnecessary variance and instability into the learning process. Decaying exploration helps to stabilize learning by gradually reducing this variance.
Specify a mean attraction constant of 0.1. A smaller value reduces the attraction of the noise process towards the mean value.
Specify an initial standard deviation of 0.3.
Specify a standard deviation decay rate of 1e-4.
ddpgOpts.NoiseOptions.MeanAttractionConstant = 0.1;
ddpgOpts.NoiseOptions.StandardDeviation = 0.3;
ddpgOpts.NoiseOptions.StandardDeviationDecayRate = 1e-4;
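To get a feel for the new decay rate, you can estimate the time scale of the standard deviation decay, assuming the standard deviation is multiplied by (1 - StandardDeviationDecayRate) at every agent step, the same assumption as in the OU noise sketch above.
% rough estimate of the number of agent steps for the standard
% deviation to halve, under the multiplicative-decay assumption
halvingSteps = ceil(log(0.5)/log(1 - 1e-4))  % approximately 6900 steps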
Train the agent with the new exploration options.
% fix the random seed for reproducibility
rng(0,"twister");

% create the agent
ddpgAgent = rlDDPGAgent(cEnvObsInfo, cEnvActInfo, ...
    ddpgInitOpts, ddpgOpts);

% train the agent
ddpgLogger.LoggingOptions.LoggingDirectory = "ddpgTuned";
if doTraining
    ddpgResult = train(ddpgAgent, cEnv, ddpgTrainOpts, ...
        Logger=ddpgLogger);
end
An example of the training with the new exploration parameters is shown in the Reinforcement Learning Training Monitor window.
Open the Reinforcement Learning Data Viewer window and plot the OUNoise and StandardDeviation values again.
As seen in the plots:
This time, the average reward reached the desired value of 480.
The standard deviation value decayed during training. Consequently, the noise values were larger towards the beginning and smaller towards the end of the training.
Close the Reinforcement Learning Data Viewer and Reinforcement Learning Training Monitor windows.
Restore the random number stream using the information stored in previousRngState.
rng(previousRngState);
Logging Functions
function dataToLog = logEpsilon(data)
    policy = getExplorationPolicy(data.Agent);
    pstate = getState(policy);
    dataToLog.Epsilon = pstate.Epsilon;
end

function dataToLog = logOUNoise(data)
    policy = getExplorationPolicy(data.Agent);
    pstate = getState(policy);
    dataToLog.OUNoise = pstate.Noise{1};
    dataToLog.StandardDeviation = pstate.StandardDeviation{1};
end
See Also
Functions
rlPredefinedEnv | train | sim
Objects
rlTrainingOptions | rlSimulationOptions | rlDQNAgent | rlDQNAgentOptions | rlDDPGAgent | rlDDPGAgentOptions
Related Examples
- Train DQN Agent to Balance Discrete Cart-Pole System
- Train Agent or Tune Environment Parameters Using Parameter Sweeping
- Tune Hyperparameters Using Bayesian Optimization