rlEvolutionStrategyTrainingOptions

Options for training off-policy reinforcement learning agents using an evolutionary strategy

Since R2023b

Description

Use an rlEvolutionStrategyTrainingOptions object to specify options for training a DDPG, TD3, or SAC agent within an environment. Evolution strategy training options include the population size and its update method, the number of training epochs, and the criteria for stopping training and saving agents. After setting the options, use this object as an input argument for trainWithEvolutionStrategy.

For more information on the training algorithm, see Train agent with evolution strategy. For more information on training agents, see Train Reinforcement Learning Agents.

Creation

Syntax

trainOpts = rlEvolutionStrategyTrainingOptions

trainOpts = rlEvolutionStrategyTrainingOptions(Name=Value)

Description

trainOpts = rlEvolutionStrategyTrainingOptions returns the default options for training a DDPG, TD3 or SAC agent using an evolutionary strategy.

trainOpts = rlEvolutionStrategyTrainingOptions(Name=Value) creates the training option set trainOpts and sets its Properties using one or more name-value arguments.

Properties

`PopulationSize` — Number of individuals in the population
25 (default) | positive integer

Number of individuals in the population, specified as a positive integer. Every individual corresponds to an actor.

Example: PopulationSize=50

`PercentageEliteSize` — Percentage of surviving individuals
50 (default) | positive integer

Percentage of individuals surviving to form the next population, specified as an integer between 1 and 100.

Example: PercentageEliteSize=30
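
As a rough illustration of how PopulationSize and PercentageEliteSize interact, the following sketch computes the number of surviving individuals. The rounding rule (round up, keep at least one individual) is an assumption made for this example; the source does not state how the software rounds.

```matlab
% Hypothetical illustration: how many individuals survive each generation.
populationSize      = 25;   % default PopulationSize
percentageEliteSize = 50;   % default PercentageEliteSize

% Assumption: round up and keep at least one survivor.
numElite = max(1, ceil(populationSize*percentageEliteSize/100));

fprintf("%d of %d individuals form the elite set\n", numElite, populationSize);
```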

`EvaluationsPerIndividual` — Maximum number of episodes run per individual
1 (default) | positive integer

Maximum number of episodes run per individual, specified as a positive integer. Here, an episode run means all the environment steps that a given individual performs from the beginning of a generation until the end of that same generation.

Example: EvaluationsPerIndividual=2

`TrainEpochs` — Number of training epochs
10 (default) | nonnegative integer

Number of training epochs used to update the gradient-based agent, specified as a nonnegative integer. If you set TrainEpochs to 0, then the population is updated without using a gradient-based agent (that is, using a pure evolutionary search strategy). For more information on the training algorithm, see Train agent with evolution strategy.

Example: TrainEpochs=5

`PopulationUpdateOptions` — Population update options
`GaussianUpdateOptions` object

Population update options, specified as a GaussianUpdateOptions object. For more information on the training algorithm, see Train agent with evolution strategy.

The properties of the GaussianUpdateOptions object, which determine how the evolution algorithm updates the distribution, and which you can modify using dot notation after creating the rlEvolutionStrategyTrainingOptions object, are as follows.

`UpdateMethod` — Update method for the population distribution
`"WeightedMixing"` (default) | `"UniformMixing"`

Update method for the population distribution, specified as either:

"WeightedMixing" — When calculating the sum used to calculate the mean and standard deviation of the population distribution, weights each actor according to its fitness index (that is, better actors are weighted more).
"UniformMixing" — When calculating the sum used to calculate the mean and standard deviation of the population distribution, weights each actor equally.

Example: UpdateMethod="UniformMixing"
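
The difference between the two update methods can be sketched with plain MATLAB arithmetic on a toy elite set. The log-rank weights used here for "WeightedMixing" are an assumption borrowed from common cross-entropy method implementations; the source only states that better actors are weighted more. The variable names are hypothetical.

```matlab
% Toy elite set: each row is one actor's parameter vector, best actor first.
eliteParams = [1.0 2.0;
               0.8 1.6;
               0.2 0.4];
numElite = size(eliteParams,1);

% "UniformMixing": every elite actor contributes equally.
wUniform = ones(numElite,1)/numElite;

% "WeightedMixing": better-ranked actors contribute more.
% Assumption: log-rank weights (not confirmed by the documentation).
ranks     = (1:numElite)';
wWeighted = log(1 + numElite) - log(ranks);
wWeighted = wWeighted/sum(wWeighted);

% Weighted mean of the elite parameters (one value per parameter).
meanUniform  = wUniform'  * eliteParams;
meanWeighted = wWeighted' * eliteParams;

% Weighted standard deviation around each mean (one value per parameter).
stdUniform  = sqrt(wUniform'  * (eliteParams - meanUniform).^2);
stdWeighted = sqrt(wWeighted' * (eliteParams - meanWeighted).^2);
```

In this toy case the weighted mean lies closer to the best-ranked actor than the uniform mean does.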

`InitialMean` — Initial mean of the population distribution
`0` (default) | scalar

Initial mean of the population distribution, specified as a scalar.

Example: InitialMean=-0.5

`InitialStandardDeviation` — Initial standard deviation of the population distribution
`0.1` (default) | positive scalar

Initial standard deviation of the population distribution, specified as a positive scalar.

Example: InitialStandardDeviation=0.5

`InitialStandardDeviationBias` — Initial bias of the standard deviation of the population distribution
`0.1` (default) | positive scalar

Initial bias of the standard deviation of the population distribution, specified as a scalar. A larger value promotes exploration.

Example: InitialStandardDeviationBias=0.2

`FinalStandardDeviationBias` — Final bias of the standard deviation of the population distribution
`0.001` (default) | positive scalar

Final bias of the standard deviation of the population distribution, specified as a nonnegative scalar.

Example: FinalStandardDeviationBias=0.002

`StandardDeviationBiasDecayRate` — Decay rate of the bias of the standard deviation of the population distribution
`0.95` (default) | positive scalar less than one

Decay rate of the bias of the standard deviation of the population distribution, specified as a positive scalar less than one.

At the end of each training time step, the bias of the population standard deviation, StdBias, is updated as follows.

StdBias = (1-StandardDeviationBiasDecayRate)*StdBias + ...
          StandardDeviationBiasDecayRate*FinalStandardDeviationBias

Note that StdBias is conserved between the end of an episode and the start of the next one. Therefore, it keeps on uniformly evolving over multiple episodes until it reaches FinalStandardDeviationBias.

Example: StandardDeviationBiasDecayRate=0.99
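
To get a feel for how the bias evolves, the following sketch applies the update rule exactly as written above, starting from the default property values. The loop length and variable names are arbitrary choices for this example.

```matlab
stdBias   = 0.1;     % InitialStandardDeviationBias (default)
finalBias = 0.001;   % FinalStandardDeviationBias (default)
decayRate = 0.95;    % StandardDeviationBiasDecayRate (default)

numUpdates = 5;                  % arbitrary number of updates to display
history = zeros(1,numUpdates);
for k = 1:numUpdates
    % Update rule as stated in the property description.
    stdBias = (1-decayRate)*stdBias + decayRate*finalBias;
    history(k) = stdBias;
end
disp(history)
```

With this rule the bias decreases monotonically toward FinalStandardDeviationBias without going below it.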

`ReturnedPolicy` — Type of the policy returned once training is terminated
`"AveragedPolicy"` (default) | `"BestPolicy"`

Type of the policy returned once training is terminated, specified as either "AveragedPolicy" or "BestPolicy".

Example: ReturnedPolicy="BestPolicy"

`MaxGenerations` — Maximum number of generations
500 (default) | positive integer

Maximum number of generations that the population is updated, specified as a positive integer.

Example: MaxGenerations=1000

`MaxStepsPerEpisode` — Maximum number of environment steps to run per episode
`500` (default) | positive integer

Maximum number of environment steps to run per episode, specified as a positive integer. In general, you define episode termination conditions in the environment. This value is the maximum number of steps to run in the episode if other termination conditions are not met.

Example: MaxStepsPerEpisode=1000

`ScoreAveragingWindowLength` — Window length for averaging
`5` (default) | positive integer

Window length for averaging the scores, rewards, and number of steps, specified as a positive integer.

For options expressed in terms of averages, ScoreAveragingWindowLength is the number of episodes included in the average. For instance, if StopTrainingCriteria is "AverageReward", and StopTrainingValue is 500, training terminates when the average reward over the number of episodes specified in ScoreAveragingWindowLength equals or exceeds 500.

Example: ScoreAveragingWindowLength=10
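
The interaction between the averaging window and an "AverageReward" stopping criterion can be sketched as follows. The reward values are made up, and the handling of the first few episodes (averaging over fewer than ScoreAveragingWindowLength values) is an assumption of this example.

```matlab
episodeRewards = [420 480 510 530 495 540 560];  % made-up episode rewards
windowLength   = 5;      % ScoreAveragingWindowLength
stopValue      = 500;    % StopTrainingValue

% Trailing average over at most windowLength episodes.
averageReward = movmean(episodeRewards, [windowLength-1 0]);

% First episode at which the "AverageReward" criterion would be met.
stopEpisode = find(averageReward >= stopValue, 1);
```

Here individual episodes exceed 500 as early as episode 3, but the windowed average only reaches 500 at episode 6.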

`StopTrainingCriteria` — Training termination condition
`"AverageReward"` (default) | `"EpisodeReward"` | ...

Training termination condition, specified as one of the following strings:

"AverageReward" — Stop training when the running average reward equals or exceeds the critical value.
"EpisodeReward" — Stop training when the reward in the current episode equals or exceeds the critical value.

Example: StopTrainingCriteria="AverageReward"

`StopTrainingValue` — Critical value of training termination condition
`500` (default) | scalar

Critical value of the training termination condition, specified as a scalar.

Training ends when the termination condition specified by the StopTrainingCriteria option equals or exceeds this value.

For instance, if StopTrainingCriteria is "AverageReward", and StopTrainingValue is 100, training terminates when the average reward over the number of episodes specified in ScoreAveragingWindowLength equals or exceeds 100.

Example: StopTrainingValue=100

`SaveAgentCriteria` — Condition for saving the agent during training
`"none"` (default) | `"AverageReward"` | `"EpisodeReward"` | ...

Condition for saving agents during training, specified as one of the following strings:

"none" — Do not save any agents during training.
"AverageReward" — Save the agent when the running average reward over all episodes equals or exceeds the critical value.
"EpisodeReward" — Save the agent when the reward in the current episode equals or exceeds the critical value.

Set this option to store candidate agents that perform well according to the criteria you specify. When you set this option to a value other than "none", the software sets the SaveAgentValue option to 500. You can change that value to specify the condition for saving the agent.

For instance, suppose you want to store for further testing any agent that yields an episode reward that equals or exceeds 100. To do so, set SaveAgentCriteria to "EpisodeReward" and set the SaveAgentValue option to 100. When an episode reward equals or exceeds 100, the training function saves the current agent in a MAT-file in the folder specified by the SaveAgentDirectory option. The MAT-file is called AgentK.mat, where K is the number of the corresponding episode. The agent is stored within that MAT-file as saved_agent.

Example: SaveAgentCriteria="EpisodeReward"

`SaveAgentValue` — Critical value of condition for saving agents
`"none"` (default) | 500 | scalar

Critical value of the condition for saving agents, specified as a scalar.

When you specify a condition for saving candidate agents using SaveAgentCriteria, the software sets this value to 500. Change the value to specify the condition for saving the agents. See the SaveAgentCriteria option for more details.

Example: SaveAgentValue=100

`SaveAgentDirectory` — Folder name for saved agents
`"savedAgents"` (default) | string | character vector

Folder name for saved agents, specified as a string or character vector. The folder name can contain a full or relative path. When an episode occurs in which the conditions specified by the SaveAgentCriteria and SaveAgentValue options are satisfied, the software saves the current agent in a MAT-file in this folder. If the folder does not exist, the training function creates it. When SaveAgentCriteria is "none", this option is ignored and no folder is created.

Example: SaveAgentDirectory = pwd + "\run1\Agents"
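
The three save-related options above are typically set together. The following is a minimal sketch; the folder name is a hypothetical choice, and SaveAgentCriteria is set first because, as described above, setting it resets SaveAgentValue to 500.

```matlab
esOpts = rlEvolutionStrategyTrainingOptions;

% Set the criterion first, then override the value it resets to.
esOpts.SaveAgentCriteria  = "EpisodeReward";
esOpts.SaveAgentValue     = 100;

% Hypothetical folder; it is created during training if it does not exist.
esOpts.SaveAgentDirectory = fullfile(pwd,"run1","Agents");
```

Using fullfile instead of a hard-coded "\" separator keeps the path portable across operating systems.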

`Verbose` — Option to display training progress at the command line
`false` (`0`) (default) | `true` (`1`)

Option to display training progress at the command line, specified as the logical values false (0) or true (1). Set to true to write information from each training episode to the MATLAB® command line during training.

Example: Verbose=true

`Plots` — Option to display training progress with Reinforcement Learning Training Monitor
`"training-progress"` (default) | `"none"`

Option to display training progress with Reinforcement Learning Training Monitor, specified as "training-progress" or "none". By default, calling train opens Reinforcement Learning Training Monitor, which graphically and numerically displays information about the training progress, such as the reward for each episode, average reward, number of episodes, and total number of steps. For more information, see train. To turn off this display, set this option to "none".

Example: Plots="none"

`UseParallel` — Option to use parallel training
`false` (default) | `true`

Option to use parallel training, specified as a logical. Setting this option to true configures training to use multiple processes (which can run on different cores, processors, computer clusters or cloud resources) to simulate the environment. This option scales up the number of simulations with the environment, and can speed up the generation of data for learning.

To specify options for parallel training, use the ParallelizationOptions property.

Note that if you want to speed up deep neural network calculations (such as gradient computation, parameter update and prediction) using a local GPU, you do not need to set UseParallel to true. Instead, when creating your actor or critic, set its UseDevice option to "gpu" instead of "cpu".

Using parallel computing or the GPU requires Parallel Computing Toolbox™ software. Using computer clusters or cloud resources additionally requires MATLAB Parallel Server™. For more information about training using multicore processors and GPUs, see Train Agents Using Parallel Computing and GPUs.

Example: UseParallel=true

`ParallelizationOptions` — Options for parallel training
`ParallelTraining` object

Options for parallel training, specified as a ParallelTraining object. For more information about training using parallel computing, see Train Agents Using Parallel Computing and GPUs.

The ParallelTraining object has the following properties, which you can modify using dot notation after creating the rlEvolutionStrategyTrainingOptions object.

`Mode` — Parallel computing mode
`"sync"` (default) | `"async"`

Parallel computing mode, specified as one of the following:

"sync" — Use parpool to run synchronous training on the available workers. In this case, each worker pauses execution until all workers are finished. The parallel pool client updates the actor and critic parameters based on the results from all the workers and sends the updated parameters to all workers. When training a PG agent using gradient-based parallelization Mode must be set to "sync".
"async" — Use parpool to run asynchronous training on the available workers. In this case, each worker sends its data back to the parallel pool client as soon as it finishes and then receives updated parameters from the client. The worker then continues with its task.

Example: Mode="async"

`WorkerRandomSeeds` — Randomizer initialization for workers
`-1` (default) | `-2` | vector

Randomizer initialization for workers, specified as one of the following:

-1 — Assign a unique random seed to each worker. The value of the seed is the worker ID.
-2 — Do not assign a random seed to the workers.
Vector — Manually specify the random seed for each worker. The number of elements in the vector must match the number of workers.

Example: WorkerRandomSeeds=[1 2 3 4]

`TransferBaseWorkspaceVariables` — Option to send model and workspace variables to parallel workers
`"on"` (default) | `"off"`

Option to send model and workspace variables to parallel workers, specified as "on" or "off". When the option is "on", the client sends to the workers the variables defined in the base MATLAB workspace and used in the approximation models.

Example: TransferBaseWorkspaceVariables="off"

`AttachedFiles` — Additional files to attach to the parallel pool
`[]` (default) | string | string array

Additional files to attach to the parallel pool, specified as a string or string array.

Example: AttachedFiles="myInitFile.m"

`SetupFcn` — Function to run before training starts
`[]` (default) | function handle

Function to run before training starts, specified as a handle to a function having no input arguments. This function is run once per worker before training begins. Write this function to perform any processing that you need prior to training.

Example: SetupFcn=@mySetupFcn

`CleanupFcn` — Function to run after training ends
`[]` (default) | function handle

Function to run after training ends, specified as a handle to a function having no input arguments. You can write this function to clean up the workspace or perform other processing after training terminates.

Example: CleanupFcn=@myCleanupFcn
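
As a sketch of how these nested properties are set, the following code enables parallel training and adjusts a few parallelization options using dot notation. It assumes a parallel pool with four workers (so that the seed vector has one element per worker); the setup function is a hypothetical placeholder.

```matlab
esOpts = rlEvolutionStrategyTrainingOptions(UseParallel=true);

% One seed per worker; this assumes a pool of exactly four workers.
esOpts.ParallelizationOptions.WorkerRandomSeeds = [1 2 3 4];

% Do not send base workspace variables to the workers.
esOpts.ParallelizationOptions.TransferBaseWorkspaceVariables = "off";

% Hypothetical setup function with no input arguments, run once per worker.
esOpts.ParallelizationOptions.SetupFcn = @() disp("worker ready");
```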

`StopOnError` — Option to stop training when error occurs
`"on"` (default) | `"off"`

Option to stop training when an error occurs during an episode, specified as "on" or "off". When this option is "off", errors are captured and returned in the SimulationInfo output of train, and training continues to the next episode.

Example: StopOnError="off"

`SimulationStorageType` — Storage type for environment data
`"memory"` (default) | `"file"` | `"none"`

Storage type for environment data, specified as "memory", "file", or "none". This option specifies the type of storage used for data generated during training or simulation by a Simulink® environment. Specifically, the software saves anything that appears as the output of a sim (Simulink) command.

Note that this option does not affect (and is not affected by) any option to save agents during training specified within a training option object, or any data logged by a FileLogger or MonitorLogger object.

The default value is "memory", indicating that data is stored in an internal memory variable. When you set this option to "file", data is stored to disk, in MAT-files in the directory specified by the SaveSimulationDirectory property, and using the MAT-file version specified by the SaveFileVersion property. When you set this option to "none", simulation data is not stored.

You can use this option to prevent out-of-memory issues during training or simulation.

Example: SimulationStorageType="none"

`SaveSimulationDirectory` — Folder used to save environment data
`"savedSims"` (default) | string | character vector

Folder used to save environment data, specified as a string or character vector. The folder name can contain a full or relative path. When you set the SimulationStorageType property to "file", the software saves data generated during training or simulation by a Simulink environment in MAT-files in this folder, using the MAT-file version specified by the SaveFileVersion property. If the folder does not exist, the software creates it.

Example: SaveSimulationDirectory="envSimData"

`SaveFileVersion` — MAT-file version used to save environment data
`"-v7"` (default) | `"-v7.3"` | `"-v6"`

MAT-file version used to save environment data, specified as a string or character vector. When you set the SimulationStorageType property to "file", the software saves data generated by a Simulink environment in MAT-files in the version specified by SaveFileVersion, in the folder specified by the SaveSimulationDirectory property. For more information, see MAT-File Versions.

Example: SaveFileVersion="-v7.3"
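
The three storage-related properties above can be combined, for example, to write Simulink environment data to disk instead of keeping it in memory. This is a minimal sketch; the folder name is a hypothetical choice.

```matlab
esOpts = rlEvolutionStrategyTrainingOptions( ...
    SimulationStorageType="file", ...          % store data on disk, not in memory
    SaveSimulationDirectory="envSimData", ...  % hypothetical folder name
    SaveFileVersion="-v7.3");                  % MAT-file version for the saved data
```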

Object Functions

trainWithEvolutionStrategy Train DDPG, TD3 or SAC agent using an evolutionary strategy within a specified environment

Examples

Configure Options for Training with Evolutionary Strategy

Create an options set for training a DDPG, TD3 or SAC agent using an evolutionary strategy. Set the population size, the number of train epochs, and the maximum number of steps per episode. You can set the options using name-value pair arguments when you create the options set. Any options that you do not explicitly set have their default values.

esOpts = rlEvolutionStrategyTrainingOptions(...
    PopulationSize=50, ...
    TrainEpochs=10, ...
    MaxStepsPerEpisode=500)

esOpts = 
  rlEvolutionStrategyTrainingOptions with properties:

                PopulationSize: 50
           PercentageEliteSize: 50
      EvaluationsPerIndividual: 1
                   TrainEpochs: 10
       PopulationUpdateOptions: [1x1 rl.option.GaussianUpdateOptions]
                ReturnedPolicy: "AveragedPolicy"
                MaxGenerations: 500
            MaxStepsPerEpisode: 500
    ScoreAveragingWindowLength: 5
          StopTrainingCriteria: "AverageReward"
             StopTrainingValue: 500
             SaveAgentCriteria: "none"
                SaveAgentValue: 500
            SaveAgentDirectory: "savedAgents"
                       Verbose: 0
                         Plots: "training-progress"
                   UseParallel: 0
        ParallelizationOptions: [1x1 rl.option.ParallelSimulation]
                   StopOnError: "on"
         SimulationStorageType: "memory"
       SaveSimulationDirectory: "savedSims"
               SaveFileVersion: "-v7"

Alternatively, create a default options set and use dot notation to change some of the values.

esOpts = rlEvolutionStrategyTrainingOptions;
esOpts.PopulationSize=30;
esOpts.TrainEpochs=15;
esOpts.MaxStepsPerEpisode=500;

Set the population update method and the initial standard deviation in the PopulationUpdateOptions property.

esOpts.PopulationUpdateOptions.UpdateMethod = "UniformMixing";
esOpts.PopulationUpdateOptions.InitialStandardDeviation = 0.2;

To train a supported off-policy agent with an evolutionary strategy, you can now use esOpts as an input argument to trainWithEvolutionStrategy.
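
For instance, a training call could look like the following sketch. The predefined environment, the default DDPG agent, and the small population and generation counts are choices made only for this illustration; they are not prescribed by this page, and the resulting agent is not expected to perform well after so few generations.

```matlab
% Assumed setup: a predefined continuous-action environment and a default agent.
env     = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
agent   = rlDDPGAgent(obsInfo,actInfo);

% Keep the run short and suppress the training monitor.
esOpts = rlEvolutionStrategyTrainingOptions( ...
    PopulationSize=10, ...
    MaxGenerations=5, ...
    Plots="none");

% Train the agent using the evolution strategy options.
results = trainWithEvolutionStrategy(agent,env,esOpts);
```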

Algorithms

Train agent with evolution strategy

Each individual in the population is an actor identified by a vector of learnable parameters, which is sampled from a multivariate Gaussian distribution. Specifically, the training algorithm uses the InitialMean and InitialStandardDeviation properties to establish the initial Gaussian distribution for the population, and then samples a population of actors from that distribution. The algorithm also maintains a gradient-based actor, whose parameters are updated independently using a policy-gradient-based rule (in which the gradient is calculated using experience data from all the actors).

After interacting with the environment for the number of episodes specified by EvaluationsPerIndividual, each actor (including the gradient-based one) is assigned a fitness index, which corresponds to the reward accumulated during those episodes. New mean and standard deviation values are then calculated from the elite population, selected according to PercentageEliteSize, using a sum weighted according to UpdateMethod.

A standard deviation bias factor, which evolves independently according to the properties InitialStandardDeviationBias, FinalStandardDeviationBias and StandardDeviationBiasDecayRate, is scalar-expanded (that is, applied equally to every element) and then added to the standard deviation. The training algorithm then instantiates a new population of actors by sampling the new Gaussian distribution specified by the new mean and standard deviation, and the cycle resumes.
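
The sample, evaluate, select, and update cycle described above can be sketched in plain MATLAB on a toy problem. This is a simplified illustration under stated assumptions, not the toolbox implementation: the fitness function is a made-up stand-in for episode reward, the mixing is uniform, the gradient-based actor is omitted, and the elite count is rounded up.

```matlab
rng(0);                              % reproducible sampling
numParams           = 3;             % size of the toy parameter vector
populationSize      = 25;            % PopulationSize
percentageEliteSize = 50;            % PercentageEliteSize
mu      = zeros(1,numParams);        % InitialMean, expanded to all parameters
sigma   = 0.1*ones(1,numParams);     % InitialStandardDeviation
stdBias = 0.1;                       % InitialStandardDeviationBias

% Hypothetical fitness: higher is better, maximal at the target parameters.
target     = [0.3 -0.2 0.1];
fitnessFcn = @(p) -sum((p - target).^2, 2);

numElite = ceil(populationSize*percentageEliteSize/100);  % assumed rounding
for generation = 1:50
    % Sample a population of parameter vectors from the current Gaussian.
    population = mu + sigma.*randn(populationSize,numParams);

    % Evaluate every individual and keep the best ones.
    fitness   = fitnessFcn(population);
    [~,order] = sort(fitness,"descend");
    elite     = population(order(1:numElite),:);

    % Uniform mixing: unweighted mean and standard deviation of the elite,
    % with the scalar bias added to every element of the standard deviation.
    mu    = mean(elite,1);
    sigma = std(elite,1,1) + stdBias;

    % Bias update, using the default decay rate and final bias.
    stdBias = (1-0.95)*stdBias + 0.95*0.001;
end
```

After the loop, mu holds the mean of the final distribution, which in this toy setting plays the role of an averaged policy.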

Version History

Introduced in R2023b
