Episode reward is too high to be true

Tan 2025-1-16
Commented: Sivsankar 2025-1-24
My problem is that my reward function is well defined and expected to stay in the range of -80 to 80. I have run my environment's step function many times and the per-step reward works fine. Below is a sample graph of the reward at each step when the action input at each step is fully randomised.
However, whenever I train it with my PPO agent, the episode reward becomes absurdly high, as shown in the table below.
That is an absurd episode reward for training with 45 steps per episode: it would mean an average step reward of 187717.36/45 = 4171.5, which is impossible.
Another problem is that I can't terminate the training with the 'Stop Training' button in the Reinforcement Learning Training Monitor. I can stop the training by pausing MATLAB, and the window then shows that the training has stopped. However, the environment keeps running for no reason even though the window says it has stopped. My environment does pop up a cmd window at every step because it links to external software, and I use taskkill to close the cmd; I'm not sure whether this is the cause. I've spent days on this and still don't know what causes it.
Any help is appreciated.
1 comment
Tan 2025-1-23
I found the mistake now. The action given by the agent is not normalized, despite the fact that I specified the action info as rlNumericSpec with lower limit = -1 and upper limit = 1 in the agent properties. May I know how to normalize the action?


Answer (1)

Sivsankar 2025-1-20
Hi @Tan
PPO agents use at least 10 times the MiniBatchSize experiences per update (with the default setting LearningFrequency = -1). Consequently, PPO agents often need to gather experiences from multiple episodes, depending on the length of each episode. This implies that PPO may not train effectively if each training session consists of only a few episodes. All collected experiences are discarded after each update because PPO is an on-policy agent. I believe this is the reason why you are observing an unusually high reward value.
In general, PPO agents tend to be more stable when they use experiences from numerous episodes. To modify the number of experiences used for updates, adjust the LearningFrequency in rlPPOAgentOptions.
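As a minimal sketch (assuming a release where rlPPOAgentOptions exposes the LearningFrequency property, such as the R2024b you are on; the specific values and the obsInfo/actInfo variables are illustrative, not a recommendation):
agentOpts = rlPPOAgentOptions( ...
    MiniBatchSize=45, ...        % experiences per gradient update
    LearningFrequency=450);      % learn every 450 steps, i.e. 10 x MiniBatchSize,
                                 % roughly 10 of your 45-step episodes
                                 % (default -1: learn only at the end of the episode set)
agent = rlPPOAgent(obsInfo, actInfo, agentOpts);   % obsInfo/actInfo are your environment's spec objects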
The reason you might be unable to stop the training with the 'Stop Training' button could be that, for PPO agents, training cannot be halted in the middle of an episode. Therefore, after pressing 'Stop Training' you may need to wait for the current episode to conclude before the program stops.
I hope this information is helpful!
2 comments
Tan 2025-1-23
I found the mistake now. The action given by the agent is not normalized, despite the fact that I specified the action info as rlNumericSpec with lower limit = -1 and upper limit = 1 in the agent properties. May I know how to normalize the action?
Sivsankar 2025-1-24
The documentation for the PPO agent states: "For continuous action spaces, this PPO agent does not enforce the constraints set by the action specification. In this case, you must enforce action space constraints within the environment." Please refer to the rlPPOAgent documentation.
Did you normalize it in the environment?
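As a minimal sketch of what "enforce it within the environment" can look like, assuming a custom step function following the rlFunctionEnv convention (the function name and the logged-signal field here are illustrative), saturate the action at the top of the step function before passing it to the external software:
function [nextObs, reward, isDone, loggedSignals] = myStepFcn(action, loggedSignals)
    % PPO does not enforce the rlNumericSpec limits, so saturate the action here
    action = max(min(action, 1), -1);

    % ... pass the clipped action (rescaled if needed) to the external software,
    % then compute nextObs and reward from its output ...
    nextObs = loggedSignals.State;   % placeholder: reuse the stored state
    reward  = 0;                     % placeholder reward
    isDone  = false;
end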

