PPO agent - Experience Horizon in MATLAB

24 views (last 30 days)
In the PPO examples it states that the agent interacts with the environment for the experience horizon number of steps, at which point it splits the experiences into mini-batches, each of which is put through the neural network 'epoch' times. For example:
If the experience horizon were 512, the mini-batch size 128, and the number of epochs 3, the agent would interact with the environment for 512 steps (512 experiences), split these into 4 mini-batches of 128 each, and put each batch through the neural network 3 times. Is this correct? I ask because elsewhere the terms buffer size, mini-batch size, and time horizon are used.
I also saw in the example that uses an experience horizon of 200 and a mini-batch size of 64 that the agent collects experiences until it reaches 200 or the episode terminates, at which point it learns. So is it correct that if only 100 experiences were collected, these would be split into two batches, one of 64 and one of 36, rather than waiting for the next episode for the experience horizon to reach 200? In other words, does the experience horizon reset each episode, with the experiences put through the neural network at the end of each episode regardless of whether the experience horizon has been reached?
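For concreteness, this is roughly how I understand these values would be set, assuming the ExperienceHorizon, MiniBatchSize, and NumEpoch properties of rlPPOAgentOptions (property names may differ slightly between releases):

% Sketch of the PPO hyperparameters described above.
agentOpts = rlPPOAgentOptions( ...
    'ExperienceHorizon', 512, ...  % collect 512 steps before learning
    'MiniBatchSize', 128, ...      % 512/128 = 4 mini-batches
    'NumEpoch', 3);                % the collected set is reused for 3 epochs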

Answers (1)

Piyush Dubey on 29 May 2023
Edited: Piyush Dubey on 29 May 2023
Hi Harry,
The experience horizon is the maximum number of time steps for which an agent collects experience before learning from it. It is a hyperparameter that the user sets before training the agent, and a suitable value depends on the specific task and environment in which the agent operates. If the agent reaches the experience horizon before the episode terminates, it learns from the experiences collected so far and then continues collecting within the same episode.
Time horizon refers to the maximum number of time steps that an agent can take in a single episode of interaction with the environment. This is a common concept when dealing with sequential decision-making problems in which the agent must choose actions over a certain period of time to achieve some goal. For example, in a game of chess, the time horizon is the maximum number of moves that a player can make before the game ends.
In many cases, the time horizon and experience horizon can be set to the same value, especially for simple problems or when collecting experience is relatively fast. However, for more complex problems, it may not be feasible to set the time horizon and experience horizon to the same value, due to computational or memory limitations.
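As a rough sketch, the two values are configured in different places, assuming the MaxStepsPerEpisode training option and the ExperienceHorizon agent option (check the documentation for your release):

% The "time horizon" (maximum steps per episode) is a training option,
% while the experience horizon is an agent option.
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 1000, ...
    'MaxStepsPerEpisode', 500);    % episode length limit (time horizon)

agentOpts = rlPPOAgentOptions( ...
    'ExperienceHorizon', 200, ...  % learn after at most 200 new steps
    'MiniBatchSize', 64);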
If the experience horizon is 200 and the mini-batch size is 64, there will be three full mini-batches per epoch, with 8 experiences left over that do not fill a complete mini-batch, because 200 (the experience horizon) divided by 64 (the mini-batch size) is 3 with a remainder of 8.
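The same arithmetic in MATLAB:

experienceHorizon = 200;
miniBatchSize     = 64;
numFullBatches = floor(experienceHorizon/miniBatchSize)   % 3 full mini-batches
leftover       = mod(experienceHorizon, miniBatchSize)    % 8 left-over experiences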
Epoch-based training in PPO with mini-batches is typically done by first collecting a batch of experience samples by interacting with the environment using the current policy. The experience samples are then randomly sampled into mini-batches, and each mini-batch is used to update the policy; this is repeated for the specified number of epochs.
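Schematically, the update loop looks something like the following. This is illustrative pseudocode only, not the toolbox implementation; shuffleRows and updatePolicy are hypothetical placeholders:

% Illustrative pseudocode for epoch-based PPO updates.
% "experiences" is the set collected with the current policy (one row per step).
for epoch = 1:numEpoch
    shuffled = shuffleRows(experiences);              % re-shuffle each epoch
    for k = 1:miniBatchSize:size(shuffled, 1)
        idx = k : min(k + miniBatchSize - 1, size(shuffled, 1));
        miniBatch = shuffled(idx, :);                 % one mini-batch
        updatePolicy(miniBatch);                      % clipped PPO gradient step
    end
end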
You can refer to the rlPPOAgentOptions documentation for more information.
Hope this helps!

Release

R2022a
