Regularizer options object to train DQN and SAC agents
Use an rlConservativeQLearningOptions object to specify conservative Q-learning regularizer options to train a DQN or SAC agent. The options you can specify are the minimum weight and the number of random actions used for Q-value compensation. These options are mostly useful when training agents offline, specifically to deal with possible differences between the probability distribution of the dataset and the one generated by the environment.
To enable the conservative Q-learning regularizer when training an agent, set the BatchDataRegularizerOptions property of the agent options object to an rlConservativeQLearningOptions object (with your preferred minimum weight and number of sampled actions).
cqOpts = rlConservativeQLearningOptions returns a default conservative Q-learning regularizer options set.
cqOpts = rlConservativeQLearningOptions(Name=Value) creates the conservative Q-learning regularizer option set cqOpts and sets its properties using one or more name-value arguments.
MinQValueWeight — Weight used for Q-value compensation
1 (default) | positive scalar
Weight used for Q-value compensation, specified as a positive scalar. For more information, see Algorithms.
NumSampledActions — Number of sampled actions used for Q-value compensation
10 (default) | positive integer
Number of sampled actions used for Q-value compensation, specified as a positive integer. This is the number of random actions used to estimate the logarithm of the sum of Q-values for the SAC agent. For more information, see Continuous Actions Regularizer (SAC).
Create Conservative Q-Learning Options Object
Create an rlConservativeQLearningOptions object, specifying the weight to be used for Q-value compensation.
opt = rlConservativeQLearningOptions( ...
    MinQValueWeight=5)

opt = 

  rlConservativeQLearningOptions with properties:

      MinQValueWeight: 5
    NumSampledActions: 10
You can modify options using dot notation. For example, set the number of sampled actions used for Q-value compensation to 20.
opt.NumSampledActions = 20;
To specify this conservative Q-learning option set for an agent, first create the agent options object. For this example, create a default
rlDQNAgentOptions object for a DQN agent.
agentOpts = rlDQNAgentOptions;
Then, assign the rlConservativeQLearningOptions object to the BatchDataRegularizerOptions property of agentOpts.
agentOpts.BatchDataRegularizerOptions = opt;
When you create the agent, use agentOpts as the last input argument for the agent constructor function.
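For example, a minimal sketch, assuming that obsInfo and actInfo hold the observation and action specifications of your environment (they are not defined in this example):

% obsInfo and actInfo are assumed to come from your environment, for
% example using getObservationInfo(env) and getActionInfo(env).
% Pass agentOpts as the last input argument.
agent = rlDQNAgent(obsInfo,actInfo,agentOpts);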
In conservative Q-learning, the regularizer added to the critic loss relies on the difference between the expected Q-values of the actions from the current policy and the Q-values of the actions from the data set.
Discrete Actions Regularizer (DQN)
For an agent with a discrete action space, the resulting loss function that the agent minimizes is the following:
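Following the conservative Q-learning formulation in [1] and the notation defined below, a sketch of this loss (the exact form used by the software may differ slightly) is:

$$ L = \frac{1}{M}\sum_{i=1}^{M}\left[\,W_{cq}\left(\ln\sum_{a\in A}\exp\bigl(Q(s_i,a)\bigr)-Q(s_i,a_i)\right)+\frac{1}{2}\bigl(y_i-Q(s_i,a_i)\bigr)^{2}\right] $$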
Here, A is the set of all possible actions, M is the number of experiences in the minibatch, si and ai are an observation and the corresponding action stored in the minibatch, and yi is the target Q-value corresponding to Q(si,ai).
To set Wcq, assign a value to the MinQValueWeight property of the rlConservativeQLearningOptions object. For more information, see [1].
Continuous Actions Regularizer (SAC)
Similar to the discrete action case, for an agent with a continuous action space, the resulting loss function that the agent minimizes is the following:
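A sketch of the continuous-action version of this loss, again following [1], with the sum over the discrete action set replaced by an integral over the continuous action set (the exact form used by the software may differ slightly):

$$ L = \frac{1}{M}\sum_{i=1}^{M}\left[\,W_{cq}\left(\ln\int_{A}\exp\bigl(Q(s_i,a)\bigr)\,da-Q(s_i,a_i)\right)+\frac{1}{2}\bigl(y_i-Q(s_i,a_i)\bigr)^{2}\right] $$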
Here, A is the (continuous) action set, M is the number of experiences in the minibatch, si and ai are an observation and the corresponding action stored in the minibatch, and yi is the target Q-value corresponding to Q(si,ai). The first, logarithmic, term in the sum is properly defined for a continuous action space in [1] and is approximated as follows:
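A sketch of this importance-sampling approximation, as described in [1], where N actions are drawn from each of the two distributions discussed below:

$$ \ln\int_{A}\exp\bigl(Q(s_i,a)\bigr)\,da \;\approx\; \ln\left(\frac{1}{2N}\sum_{\substack{k=1\\ a_k\sim\mathrm{Unif}(A_{mn},A_{mx})}}^{N}\frac{\exp\bigl(Q(s_i,a_k)\bigr)}{\mathrm{Unif}(a_k)}\;+\;\frac{1}{2N}\sum_{\substack{k=1\\ a_k\sim\pi(\cdot\mid s_i)}}^{N}\frac{\exp\bigl(Q(s_i,a_k)\bigr)}{\pi(a_k\mid s_i)}\right) $$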
In this second equation, Unif(Amn,Amx) is a uniform distribution of action values from Amn to Amx, which are the lower and upper limits of the action range. These limits are taken from the action specifications (or are otherwise estimated if unavailable). The probability density function of the distribution, evaluated at ak, is:
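For a scalar action range (for a multidimensional action, the density is the product of the per-dimension densities), this density is constant over the range:

$$ \mathrm{Unif}(a_k)=\frac{1}{A_{mx}-A_{mn}}, \qquad A_{mn}\le a_k\le A_{mx} $$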
Finally, π(∙|si) is the distribution of the current policy given si.
To set Wcq in the first equation, assign a value to the MinQValueWeight property of the rlConservativeQLearningOptions object.
To set N (the number of actions to be sampled to estimate the logarithm term in the second equation), assign a value to the NumSampledActions property of the rlConservativeQLearningOptions object.
For more information, see [1].
[1] Kumar, Aviral, Aurick Zhou, George Tucker, and Sergey Levine. "Conservative Q-Learning for Offline Reinforcement Learning." Advances in Neural Information Processing Systems 33 (2020): 1179–1191.
Introduced in R2023a