Hi Harry,
The experience horizon is the maximum number of time steps over which an agent collects experience during a single episode. It is a hyperparameter that you set before training, and it depends on the specific task/environment in which the agent will operate. If the agent reaches the experience horizon without the episode terminating, the episode is reset and a new one begins.
Time horizon refers to the maximum number of time steps that an agent can take in a single episode of interaction with the environment. This is a common concept when dealing with sequential decision-making problems in which the agent must choose actions over a certain period of time to achieve some goal. For example, in a game of chess, the time horizon is the maximum number of moves that a player can make before the game ends.
In many cases, the time horizon and experience horizon can be set to the same value, especially for simple problems or when collecting experience is relatively fast. However, for more complex problems, it may not be feasible to set the time horizon and experience horizon to the same value, due to computational or memory limitations.
If the experience horizon is 200 and the mini-batch size is 64, there will be three full mini-batches per epoch, with a remainder of 8 experiences that cannot fill a complete mini-batch. This is because 200 (the experience horizon) divided by 64 (the mini-batch size) is 3 with a remainder of 8.
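If you want to double-check that arithmetic, here is a quick sketch in plain Python (the variable names are just illustrative):

```python
# Hypothetical values from the example above
experience_horizon = 200  # experiences collected before an update
mini_batch_size = 64      # experiences per gradient step

full_batches, leftover = divmod(experience_horizon, mini_batch_size)
print(full_batches, leftover)  # -> 3 full mini-batches, 8 leftover experiences
```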
Epoch-based training in PPO with mini-batches is typically done by first collecting a batch of experience samples by interacting with the environment using the current policy. Then, for each epoch, the collected experiences are shuffled and split into mini-batches, and each mini-batch is used to update the policy, as in the sketch below.
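As a rough illustration, here is a minimal sketch of that loop in Python. The names `buffer` and `update_policy` are assumptions made for the example, not part of any specific library:

```python
import numpy as np

def ppo_epoch_updates(buffer, update_policy, n_epochs=3, mini_batch_size=64):
    """Run several epochs of mini-batch updates over one collected rollout.

    `buffer` is assumed to be a list of experience tuples collected with
    the current policy; `update_policy` stands in for one gradient step.
    """
    n = len(buffer)
    for _ in range(n_epochs):
        # Shuffle the whole rollout once per epoch, then walk it in chunks.
        indices = np.random.permutation(n)
        for start in range(0, n, mini_batch_size):
            batch_idx = indices[start:start + mini_batch_size]
            mini_batch = [buffer[i] for i in batch_idx]
            update_policy(mini_batch)  # policy/value gradient step on this chunk
```

Note that with 200 experiences and a mini-batch size of 64, the last chunk in each epoch holds only 8 experiences; implementations differ on whether they train on that short batch or drop it.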
You can refer to the following documentation for more information:
Hope this helps!