RL DDPG, reward should be negative however episode Q0 reward is becoming positive

Question

Muhammad Nadeem 2023-10-18

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/2035389-rl-ddpg-reward-should-be-negative-however-episode-q0-reward-is-becoming-positive

Answered: Muhammad Nadeem 2023-10-31

Hello Everyone,

I am building an LQR-type controller. My reward is the negative of the LQR quadratic cost, x'Qx + u'Ru. When I train the DDPG agent, the Episode Q0 value becomes positive. As I understand it, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. So how is this possible? Why does Episode Q0 go positive when the reward function is designed to be negative?
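The sign argument in the question can be checked numerically. The following is a minimal sketch in Python, not the poster's MATLAB code: the scalar system `x[t+1] = a*x[t] + b*u[t]`, the feedback gain, and the weights `q` and `r` are all hypothetical values chosen for illustration. It shows that with non-negative weights every reward is at most zero, so the true discounted return is also at most zero, and a positive Episode Q0 can only come from critic estimation error.

```python
def lqr_reward(x, u, q=1.0, r=0.1):
    """Negative quadratic cost -(q*x^2 + r*u^2); never positive for q, r >= 0."""
    return -(q * x * x + r * u * u)


def rollout_rewards(x0, gain, steps=50, a=1.05, b=0.5):
    """Simulate x[t+1] = a*x[t] + b*u[t] with u[t] = -gain*x[t]; collect rewards."""
    x, rewards = x0, []
    for _ in range(steps):
        u = -gain * x
        rewards.append(lqr_reward(x, u))
        x = a * x + b * u
    return rewards


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r[t] over the episode, measured from the first step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical initial state and gain, for illustration only.
rewards = rollout_rewards(x0=2.0, gain=1.0)
g0 = discounted_return(rewards)
print("largest single-step reward:", max(rewards))
print("discounted return from the initial state:", g0)
```

Since each term of the sum is a non-negative discount factor times a non-positive reward, this holds for any policy and any initial state, not only for the values used here.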


Answer 1

UDAYA PEDDIRAJU 2023-10-26

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2035389-rl-ddpg-reward-should-be-negative-however-episode-q0-reward-is-becoming-positive#answer_1340911

Hello Muhammad,

I understand that the Episode Q0 value is becoming positive even though the reward is designed to be negative. To address this, I suggest considering the following:

Scale between Q0 and episode reward: There may be a significant difference in scale between the Q0 estimate and the actual episode reward. This disparity can distort the training plot and make the results look unexpected. To check this, try unchecking the "Show Episode Q0" option and see whether the episode reward values then look as expected.
Another possibility is an issue with the implementation of the DDPG algorithm itself. The algorithm should handle both positive and negative rewards. Make sure you are working with the return, that is, the discounted sum of the rewards collected from a specific state-action pair until the end of the trajectory.
Simplify the critic network: Simplifying the critic network may help it output values on a scale similar to the episode reward. This brings the Q0 estimates closer to the actual rewards and gives the agent more accurate feedback during learning.
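The return mentioned in the second point can be computed from a logged reward sequence with a backward pass. This is a minimal sketch in Python with a hypothetical helper name and made-up reward values, intended only as a sanity check. DDPG itself trains the critic on one-step bootstrapped targets rather than on these full-trajectory returns, so the values below are a reference to compare the critic against, not what the algorithm uses internally.

```python
def returns_to_go(rewards, gamma=0.99):
    """Discounted return from each step until the end of the episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it:
    # G[t] = r[t] + gamma * G[t+1], with G = 0 past the final step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Made-up rewards, all non-positive as in the LQR setting.
rewards = [-4.0, -2.0, -1.0]
g = returns_to_go(rewards, gamma=0.5)
print(g)
```

The first entry of the result is the quantity that Episode Q0 is meant to estimate for that episode.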

For more background, see this related MATLAB Answers thread:

https://www.mathworks.com/matlabcentral/answers/532933-why-is-the-ddpg-episode-rewards-never-change-during-the-whole-training-process

I hope this helps!


Answer 2

Muhammad Nadeem 2023-10-31

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2035389-rl-ddpg-reward-should-be-negative-however-episode-q0-reward-is-becoming-positive#answer_1343871

Matlab codes.zip

Hello UDAYA,

Thank you for the details. I have tried all of your suggestions, but the problem persists. The Episode Q0 value just doesn't make sense: it becomes extremely large and positive. What does Episode Q0 signify, and can it be ignored? As far as I know, it is a metric that tells how good the critic is, given the initial observation of the environment.

I am also attaching my code in case you want to take a look; please find it in the attachments.
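One way to judge whether Episode Q0 is informative is to compare it, episode by episode, with the return actually realized. The sketch below is in Python with a hypothetical function name and made-up numbers, and it assumes the per-step rewards and the critic's initial estimate have been logged for each episode. It separates the undiscounted episode reward from the discounted return, since the two differ by design, and flags a Q0 whose sign is impossible for a non-positive reward.

```python
def summarize_episode(rewards, q0_estimate, gamma=0.99):
    """Compare a logged critic estimate Q0 with what the episode delivered."""
    episode_reward = sum(rewards)  # undiscounted total over the episode
    discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return {
        "episode_reward": episode_reward,
        "discounted_return": discounted,
        # Positive error means the critic is overestimating.
        "q0_error": q0_estimate - discounted,
        # With rewards that are never positive, Q0 > 0 cannot be correct.
        "q0_sign_impossible": q0_estimate > 0 and max(rewards) <= 0,
    }


# Made-up episode: two steps of reward -1 and a critic estimate of +5.
summary = summarize_episode([-1.0, -1.0], q0_estimate=5.0, gamma=0.5)
print(summary)
```

A large, persistent `q0_error` suggests the critic has not converged, so Q0 should be read as a symptom of critic quality rather than as a measure of how well the controller performs.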
