Episode reward is too high to be true

Tan 2025-1-16
Commented: Sivsankar 2025-1-24
My problem is that my reward function is well defined and expected to stay in the range of -80 to 80. I have run my environment's step function many times and the per-step reward works fine. Below is a sample graph of the reward at each step when the action input at each step is fully randomised.
However, whenever I train it with my PPO agent, the episode reward becomes absurdly high, as shown in the table below.
That is an absurd episode reward for training with 45 steps per episode: it would mean an average step reward of 187717.36/45 = 4171.5, which is impossible.
Another problem is that I can't terminate the training with the 'Stop Training' button in the Reinforcement Learning Training Monitor. I can stop the training by pausing MATLAB, and the window then shows that the training has stopped. However, the environment keeps running for no reason even though the window says it has stopped. My environment does pop up a cmd window at every step because it links to external software, and I use taskkill to close the cmd; I'm not sure whether this is the cause. I've spent days on this and still don't know what causes it.
Any help is appreciated.
1 comment
Tan 2025-1-23
I found the mistake now. The action given by the agent is not normalized, despite the fact that I specified the action info as rlNumericSpec with lower limit = -1 and upper limit = 1 in the agent properties. May I know how to normalize the action?


Answer (1)

Sivsankar 2025-1-20
Hi @Tan
PPO agents use at least 10 times the MiniBatchSize experiences per update (with the default setting LearningFrequency = -1). Consequently, PPO agents often need to gather experiences from multiple episodes, depending on the length of each episode. This implies that PPO may not train effectively if each training session consists of only a few episodes. All collected experiences are discarded after each update because PPO is an on-policy agent. I believe this is the reason why you are observing an unusually high reward value.
In general, PPO agents tend to be more stable when they use experiences from numerous episodes. To modify the number of experiences used for updates, adjust the LearningFrequency in rlPPOAgentOptions.
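As a minimal sketch (assuming a release where rlPPOAgentOptions exposes the LearningFrequency property, such as the R2024b you are on; the specific values and the obsInfo/actInfo variables are illustrative, not a recommendation):
agentOpts = rlPPOAgentOptions( ...
    MiniBatchSize=45, ...        % experiences per gradient update
    LearningFrequency=450);      % learn every 450 steps, i.e. 10 x MiniBatchSize,
                                 % roughly 10 of your 45-step episodes
                                 % (default -1: learn only at the end of the episode set)
agent = rlPPOAgent(obsInfo, actInfo, agentOpts);   % obsInfo/actInfo are your environment's spec objects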
The reason you might be unable to stop the training with the 'Stop Training' button could be that, for PPO agents, training cannot be halted in the middle of an episode. Therefore, after pressing 'Stop Training' you may need to wait for the current episode to conclude before the program stops.
I hope this information is helpful!
2 comments
Tan 2025-1-23
I found the mistake now. The action given by the agent is not normalized, despite the fact that I specified the action info as rlNumericSpec with lower limit = -1 and upper limit = 1 in the agent properties. May I know how to normalize the action?
Sivsankar 2025-1-24
The documentation for the PPO agent states: "For continuous action spaces, this PPO agent does not enforce the constraints set by the action specification. In this case, you must enforce action space constraints within the environment." Please refer to the rlPPOAgent documentation.
Did you normalize it in the environment?
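As a minimal sketch of what "enforce it within the environment" can look like, assuming a custom step function following the rlFunctionEnv convention (the function name and the logged-signal field here are illustrative), saturate the action at the top of the step function before passing it to the external software:
function [nextObs, reward, isDone, loggedSignals] = myStepFcn(action, loggedSignals)
    % PPO does not enforce the rlNumericSpec limits, so saturate the action here
    action = max(min(action, 1), -1);

    % ... pass the clipped action (rescaled if needed) to the external software,
    % then compute nextObs and reward from its output ...
    nextObs = loggedSignals.State;   % placeholder: reuse the stored state
    reward  = 0;                     % placeholder reward
    isDone  = false;
end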

