RL DDPG, reward should be negative however episode Q0 reward is becoming positive

Hello Everyone,
I am building an LQR-type controller. My reward is the negative of the LQR quadratic cost, x'Qx + u'Ru. When I train the DDPG agent, the Episode Q0 value becomes positive. As I understand it, Episode Q0 is the critic's estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. So how is this possible? Why does Episode Q0 go positive when the reward function is designed to be negative?
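To make the premise concrete, here is a minimal sketch of the reward and the discounted return it produces. The matrices, trajectory, and discount factor are hypothetical values chosen for illustration, and Q and R are assumed to be positive semidefinite. Under that assumption every step reward is at most zero, so the true discounted return from any initial state cannot be positive; a positive Episode Q0 can only come from the critic's approximation error, not from the reward itself.

```python
def lqr_reward(x, u, Q, R):
    """Negative quadratic cost -(x'Qx + u'Ru) for list-based vectors and matrices."""
    def quad(v, M):
        # v' M v for a vector v and a square matrix M
        n = len(v)
        return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))
    return -(quad(x, Q) + quad(u, R))


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the episode, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical 2-state, 1-input example (not the poster's actual system).
Q = [[1.0, 0.0], [0.0, 0.5]]   # assumed positive semidefinite
R = [[0.1]]                    # assumed positive semidefinite
states = [[1.0, -2.0], [0.5, -1.0], [0.1, 0.0]]
inputs = [[0.3], [-0.2], [0.0]]

rewards = [lqr_reward(x, u, Q, R) for x, u in zip(states, inputs)]
G0 = discounted_return(rewards, gamma=0.99)
print(rewards, G0)  # every reward and the return G0 are non-positive
```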

Answers (2)

UDAYA PEDDIRAJU 2023-10-26
Hello Muhammad,
I understand that the Episode Q0 value is becoming positive even though the reward was designed to be negative. To address this, I suggest considering the following:
  1. Scale between Q0 and episode reward: There may be a significant difference in scale between the Q0 estimate and the actual episode reward. This disparity can produce unexpected results and affect the training process. To investigate, try unchecking the "Show Episode Q0" option to see whether it affects the episode reward values.
  2. Check the DDPG implementation: There might be an issue with the implementation of the algorithm itself. DDPG should be able to handle both positive and negative rewards. Make sure you are using the return, that is, the discounted sum of the rewards collected from a given state-action pair until the end of the trajectory.
  3. Simplify the critic network: Simplifying the critic network may help it output values on a scale similar to the episode reward. This can bring the Q0 estimates in line with the actual rewards and give the agent more accurate feedback during learning.
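One way to act on the points above is to compare the logged Q0 estimates with the discounted returns actually realized in the same episodes. The sketch below assumes the per-episode Q0 values and per-step rewards have been exported into plain lists; the function name `check_q0`, the sample logs, and the discount factor are all hypothetical. The realized return of a single episode is only a noisy sample (it includes exploration noise and may be truncated), so the comparison is a rough diagnostic, not an exact test.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the episode, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def check_q0(q0_estimates, episode_rewards, gamma):
    """Compare each logged Q0 estimate with the realized discounted return."""
    report = []
    for q0, rewards in zip(q0_estimates, episode_rewards):
        g0 = discounted_return(rewards, gamma)
        report.append({
            "q0": q0,
            "return": g0,
            "error": q0 - g0,
            # A positive estimate cannot be right if no reward is ever positive.
            "impossible_sign": q0 > 0 and all(r <= 0 for r in rewards),
        })
    return report


# Hypothetical logs: one plausible Q0 estimate and one that has diverged.
q0_log = [-5.0, 120.0]
reward_log = [[-3.0, -1.0, -0.5], [-2.0, -0.5, -0.1]]

report = check_q0(q0_log, reward_log, gamma=0.9)
for row in report:
    print(row)
```

If the error grows steadily over training, or `impossible_sign` is set for many episodes, that points to critic divergence or a scale mismatch rather than a problem with the reward definition.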
For further details, you can refer to the MathWorks documentation.
I hope this helps!

Muhammad Nadeem 2023-10-31
Hello UDAYA,
Thank you for the details. I have tried all of your suggestions, but the problem persists. The Episode Q0 value just doesn't make sense: it becomes extremely large and positive. What does Episode Q0 signify, and can it be ignored? As far as I know, it is a metric that tells how good the critic is, given the initial observation of the environment.
I am also attaching my code in case you want to have a look; please find it in the attachments.

Release

R2021a
