Deep Deterministic Policy Gradient (DDPG) Agent
The deep deterministic policy gradient (DDPG) algorithm [1] is an off-policy actor-critic method for environments with a continuous action space. A DDPG agent learns a deterministic policy while also using a Q-value function critic to estimate the value of the optimal policy. It uses a target actor, a target critic, and an experience buffer. DDPG agents support offline training (training from saved data, without an environment). For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
In Reinforcement Learning Toolbox™, a deep deterministic policy gradient agent is implemented by an rlDDPGAgent object.
DDPG agents can be trained in environments with the following observation and action spaces.
Observation Space | Action Space |
---|---|
Continuous or discrete | Continuous |
DDPG agents use the following actor and critic.
Critic | Actor |
---|---|
Q-value function critic Q(S,A), which you create using rlQValueFunction | Deterministic policy actor π(S), which you create using rlContinuousDeterministicActor |
During training, a DDPG agent:

- Updates the actor and critic learnable parameters at each time step during learning.
- Stores past experiences using a circular experience buffer. The agent updates the actor and critic using a mini-batch of experiences randomly sampled from the buffer.
- Perturbs the action chosen by the policy using a stochastic noise model at each training step.
Actor and Critic Used by the DDPG Agent
To estimate the policy and value function, a DDPG agent maintains four function approximators:
- Actor π(S;θ) — The actor, with parameters θ, takes observation S and returns the corresponding action that maximizes the long-term reward. Note that π here does not represent a probability distribution, but a function that returns an action.
- Target actor πt(S;θt) — To improve the stability of the optimization, the agent periodically updates the target actor learnable parameters θt using the latest actor parameter values.
- Critic Q(S,A;ϕ) — The critic, with parameters ϕ, takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.
- Target critic Qt(S,A;ϕt) — To improve the stability of the optimization, the agent periodically updates the target critic learnable parameters ϕt using the latest critic parameter values.
Both Q(S,A;ϕ) and Qt(S,A;ϕt) have the same structure and parameterization, and both π(S;θ) and πt(S;θt) have the same structure and parameterization.
During training, the actor tunes the parameter values in θ to improve the policy. Similarly, during training, the critic tunes the parameter values in ϕ to improve its action-value function estimation. After training, the parameters remain at their tuned values in the actor and critic internal to the trained agent.
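For example, you can retrieve, inspect, and replace the approximators of an existing agent. The following minimal sketch assumes that agent is an rlDDPGAgent object you have already created.

```matlab
% Assuming "agent" is an existing rlDDPGAgent object (hypothetical here),
% retrieve its actor and critic approximators.
actor  = getActor(agent);      % deterministic actor pi(S;theta)
critic = getCritic(agent);     % Q-value critic Q(S,A;phi)

% View the current learnable parameter values.
actorParams  = getLearnableParameters(actor);
criticParams = getLearnableParameters(critic);

% After modifying an approximator, set it back into the agent.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
```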
For more information on actors and critics, see Create Policies and Value Functions.
DDPG Agent Creation
You can create and train DDPG agents at the MATLAB® command line or using the Reinforcement Learning Designer app. For more information on creating agents using Reinforcement Learning Designer, see Create Agents Using Reinforcement Learning Designer.
At the command line, you can create a DDPG agent with a default actor and critic based on the observation and action specifications from the environment. To do so, perform the following steps; a code sketch illustrating them follows the list.

1. Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using getObservationInfo.
2. Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using getActionInfo.
3. If needed, specify the number of neurons in each learnable layer of the default network or whether to use an LSTM layer. To do so, create an agent initialization options object using rlAgentInitializationOptions.
4. If needed, specify agent options using an rlDDPGAgentOptions object. Alternatively, you can skip this step and modify the agent options later using dot notation.
5. Create the agent using rlDDPGAgent.
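The following minimal sketch illustrates these steps. The environment, layer size, and mini-batch size are illustrative choices, not recommendations.

```matlab
% Create a DDPG agent with a default actor and critic for a predefined
% continuous-action environment (used here only for illustration).
env = rlPredefinedEnv("DoubleIntegrator-Continuous");

% Steps 1-2: get observation and action specifications from the environment.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Step 3 (optional): configure the default networks, for example the number
% of neurons in each hidden layer.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);

% Step 4 (optional): specify agent options.
agentOpts = rlDDPGAgentOptions(MiniBatchSize=64);

% Step 5: create the agent with a default actor and critic.
agent = rlDDPGAgent(obsInfo,actInfo,initOpts,agentOpts);
```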
Alternatively, you can create a custom actor and critic and use these objects to create your agent. In this case, ensure that the input and output dimensions of the actor and critic match the corresponding observation and action specifications of the environment. A code sketch illustrating these steps follows the list.

1. Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using getObservationInfo.
2. Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using getActionInfo.
3. Create an approximation model for your actor. You can use a custom basis function with initial parameter values or a neural network object.
4. Create the actor using rlContinuousDeterministicActor. Use the model you created in the previous step as the first input argument.
5. Create an approximation model for your critic. You can use a custom basis function with initial parameter values or a neural network object.
6. Create the critic using rlQValueFunction. Use the model you created in the previous step as the first input argument.
7. Specify agent options using an rlDDPGAgentOptions object. Alternatively, you can skip this step and modify the agent options later using dot notation.
8. Create the agent using rlDDPGAgent.
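The following sketch illustrates these steps using small, arbitrary neural networks as approximation models; the environment and layer sizes are illustrative only.

```matlab
% Create a DDPG agent from a custom actor and critic.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Critic model: observation and action paths merged into a scalar Q-value.
obsPath = [featureInputLayer(prod(obsInfo.Dimension),Name="obsIn")
           fullyConnectedLayer(32,Name="obsFC")];
actPath = [featureInputLayer(prod(actInfo.Dimension),Name="actIn")
           fullyConnectedLayer(32,Name="actFC")];
commonPath = [additionLayer(2,Name="add")
              reluLayer
              fullyConnectedLayer(1)];
criticLG = layerGraph(obsPath);
criticLG = addLayers(criticLG,actPath);
criticLG = addLayers(criticLG,commonPath);
criticLG = connectLayers(criticLG,"obsFC","add/in1");
criticLG = connectLayers(criticLG,"actFC","add/in2");

% Create the critic, passing the model as the first input argument.
critic = rlQValueFunction(dlnetwork(criticLG),obsInfo,actInfo, ...
    ObservationInputNames="obsIn",ActionInputNames="actIn");

% Actor model: maps observations to a continuous action.
actorNet = dlnetwork([
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(32)
    reluLayer
    fullyConnectedLayer(prod(actInfo.Dimension))]);
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);

% Create the agent from the custom actor and critic.
agentOpts = rlDDPGAgentOptions(MiniBatchSize=64);
agent = rlDDPGAgent(actor,critic,agentOpts);
```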
For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.
DDPG Training Algorithm
DDPG agents use the following training algorithm, in which they update their actor and critic models at each time step. To configure the training algorithm, specify options using an rlDDPGAgentOptions object.
1. Initialize the critic Q(S,A;ϕ) with random parameter values ϕ, and initialize the target critic parameters ϕt with the same values: ϕt = ϕ.
2. Initialize the actor π(S;θ) with random parameter values θ, and initialize the target actor parameters θt with the same values: θt = θ.
3. Perform a warm start by taking a sequence of actions following the initial policy represented by π(S;θ).
4. At the beginning of each episode, get the initial observation S from the environment.
5. For the current observation S, select the action A = π(S;θ) + N, where N is stochastic noise from the noise model. To configure the noise model, use the NoiseOptions option.
6. Execute action A. Observe the reward R and the next observation S'.
7. Store the experience (S,A,R,S') in the experience buffer. To specify the size of the experience buffer, use the ExperienceBufferLength option in the rlDDPGAgentOptions object. To specify the number of warm-start actions, use the NumWarmStartSteps option.
8. After the warm start procedure, for each training time step:
   - Execute the four operations described in the warm start procedure.
   - Every DC time steps (to specify DC, use the LearningFrequency option), perform the following two operations NumEpoch times:
     a. Using all the collected experiences, create at most B different mini-batches. To specify B, use the MaxMiniBatchPerEpoch option. Each mini-batch contains M different (typically nonconsecutive) experiences (Si,Ai,Ri,S'i) that are randomly sampled from the experience buffer (each experience can be part of only one mini-batch). To specify M, use the MiniBatchSize option. If the agent contains recurrent neural networks, each mini-batch instead contains M different sequences, each of which contains K consecutive experiences starting from a randomly sampled experience. To specify K, use the SequenceLength option.
     b. For each (randomly selected) mini-batch, perform the learning operations described in Mini-Batch Learning Operations.

   When LearningFrequency has its default value of -1, the mini-batch creation (operation a) and the learning operations (operation b) are executed after each episode finishes.

A sketch showing how to set these sampling and noise options follows this list.
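In the following minimal sketch, the numeric values are illustrative only, and the noise-object property name (StandardDeviation) is an assumption that can differ across releases.

```matlab
% Configure experience sampling and learning frequency (illustrative values).
agentOpts = rlDDPGAgentOptions( ...
    ExperienceBufferLength=1e6, ...   % capacity of the circular experience buffer
    NumWarmStartSteps=1000, ...       % warm-start actions taken before learning
    MiniBatchSize=128, ...            % M, experiences per mini-batch
    LearningFrequency=4, ...          % DC, time steps between learning phases
    NumEpoch=2, ...                   % epochs per learning phase
    MaxMiniBatchPerEpoch=100);        % B, mini-batches per epoch

% The exploration noise model is configured through the NoiseOptions property,
% for example by setting its standard deviation using dot notation
% (property name assumed here).
agentOpts.NoiseOptions.StandardDeviation = 0.3;
```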
For simplicity, the actor and critic updates in this algorithm show a gradient update using basic stochastic gradient descent. The actual gradient update method depends on the optimizer you specify in the rlOptimizerOptions objects assigned to the CriticOptimizerOptions and ActorOptimizerOptions properties, as in the sketch below.
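For example, a minimal sketch that assigns separate optimizer settings to the critic and actor (the learning rates are illustrative, not recommendations):

```matlab
% Specify the optimizers used for the critic and actor updates.
criticOptimizerOpts = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
actorOptimizerOpts  = rlOptimizerOptions(LearnRate=1e-4,GradientThreshold=1);

agentOpts = rlDDPGAgentOptions( ...
    CriticOptimizerOptions=criticOptimizerOpts, ...
    ActorOptimizerOptions=actorOptimizerOpts);
```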
Mini-Batch Learning Operations
Operations performed for each mini-batch:

1. If S'i is a terminal state, set the value function target yi to Ri. Otherwise, set it to

   yi = Ri + γ·Qt(S'i, πt(S'i;θt); ϕt)

   The value function target is the sum of the experience reward Ri and the discounted future reward. To specify the discount factor γ, use the DiscountFactor option. To compute the cumulative reward, the agent first computes a next action by passing the next observation S'i from the sampled experience to the target actor. The agent finds the cumulative reward by passing that next action to the target critic. If you specify a value of NumStepsToLookAhead equal to N, then the N-step return (which adds the rewards of the following N steps and the discounted estimated value of the state that caused the N-th reward) is used to calculate the target yi.

2. Update the critic parameters by minimizing the loss Lk across all sampled experiences,

   Lk = 1/(2M) Σi (yi − Q(Si,Ai;ϕ))²

   where the sum runs over the M experiences in the mini-batch.

3. Every DA critic updates (to set DA, use the PolicyUpdateFrequency option), update the actor parameters using the following sampled policy gradient to maximize the expected discounted cumulative long-term reward:

   ∇θJ ≈ 1/M Σi Gai·Gπi

   Here, Gai is the gradient of the critic output with respect to the action computed by the actor network, and Gπi is the gradient of the actor output with respect to the actor parameters. Both gradients are evaluated for observation Si:

   Gai = ∇A Q(Si,A;ϕ), evaluated at A = π(Si;θ)
   Gπi = ∇θ π(Si;θ)

4. At every TargetUpdateFrequency critic updates, update the target actor and target critic, depending on the target update method. For more information, see Target Update Methods.

A sketch of the options that control these operations follows this list.
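In the following minimal sketch, the values are illustrative only.

```matlab
% Configure the return calculation and the actor update frequency.
agentOpts = rlDDPGAgentOptions( ...
    DiscountFactor=0.99, ...       % gamma in the value function target
    NumStepsToLookAhead=1, ...     % N, use N-step returns when greater than 1
    PolicyUpdateFrequency=1);      % DA, critic updates per actor update
```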
Target Update Methods
DDPG agents update their target actor and critic parameters using one of the following target update methods.
- Smoothing — Update the target parameters at every time step using smoothing factor τ. To specify the smoothing factor, use the TargetSmoothFactor option. For example, for the critic parameters, ϕt = τϕ + (1−τ)ϕt; the target actor parameters θt are updated in the same way.
- Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the TargetUpdateFrequency parameter.
- Periodic smoothing — Update the target parameters periodically with smoothing.

To configure the target update method, create an rlDDPGAgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table. A configuration sketch follows the table.
Update Method | TargetUpdateFrequency | TargetSmoothFactor |
---|---|---|
Smoothing (default) | 1 | Less than 1 |
Periodic | Greater than 1 | 1 |
Periodic smoothing | Greater than 1 | Less than 1 |
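For example, the following sketch configures each target update method; the specific values are illustrative only.

```matlab
% Smoothing (default): smooth update at every critic update with factor tau.
smoothingOpts = rlDDPGAgentOptions( ...
    TargetUpdateFrequency=1,TargetSmoothFactor=1e-3);

% Periodic: copy the parameters every 10 critic updates without smoothing.
periodicOpts = rlDDPGAgentOptions( ...
    TargetUpdateFrequency=10,TargetSmoothFactor=1);

% Periodic smoothing: smooth update every 10 critic updates.
periodicSmoothingOpts = rlDDPGAgentOptions( ...
    TargetUpdateFrequency=10,TargetSmoothFactor=1e-3);
```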
References
[1] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. “Continuous Control with Deep Reinforcement Learning.” ArXiv:1509.02971 [Cs, Stat], September 9, 2015. https://arxiv.org/abs/1509.02971.