Can observation and reward be the same signal in an RL system?

When I tried to train an RL system, I created a Simulink model with only one action and one observation, and the observation is also the reward. When I started training, I got an error saying the model contains an algebraic loop. So I wonder whether the way I defined the observation and reward caused this problem.
I defined the reward and observation as the same signal because they play the same role in this system. I want the agent to receive only this one signal from the environment, so I defined a single signal that serves as both observation and reward to avoid redundancy.

Accepted Answer

Poorna 2024-3-31
Hi Jize,
I see that you want to use the same signal as both observation and reward in your reinforcement learning setup. Note that the observation and the reward do not occur at the same time.
In a reinforcement learning setting, you first make an observation, i.e., the current state of the system, and then pick an action and execute it. Your system then moves to a new state. The reward you get at the end of this transition is a function of the initial state, the action, and the resulting next state. So when you use the same signal as reward and observation, it means that the reward you get at time step 't' will be the observation at time step 't+1'.
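The ordering described above can be sketched in plain Python. This is only an illustration of the timing, not Simulink or Reinforcement Learning Toolbox code; the toy dynamics in `step`, the placeholder `policy`, and the `rollout` helper are all hypothetical names invented for this example.

```python
def step(state, action):
    """Hypothetical toy plant: return (next_state, reward) for one transition."""
    next_state = 0.9 * state + action
    # The reward is computed only after the transition has happened.
    reward = -abs(next_state)
    return next_state, reward


def policy(observation):
    """Hypothetical placeholder agent: a simple proportional rule."""
    return -0.5 * observation


def rollout(initial_state, num_steps):
    """Run the observe -> act -> transition -> reward cycle num_steps times."""
    state = initial_state
    log = []
    for t in range(num_steps):
        observation = state                   # 1. observe the current state at step t
        action = policy(observation)          # 2. pick an action from that observation
        state, reward = step(state, action)   # 3. move to the new state, 4. get the reward
        log.append((t, observation, action, reward))
    return log


if __name__ == "__main__":
    for entry in rollout(1.0, 3):
        print(entry)
```

In each logged row, the reward is produced after the observation and action of the same step, so it can only influence what the agent sees at the next step.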
The algebraic loop error you're encountering arises from using the reward at time step t directly as the observation at the same time step t, which creates a circular dependency: the system is being asked to observe a signal that has not yet been generated.
So, try adding a "Unit Delay" block where you pass the reward to the agent as the observation. By doing this, you are essentially sending the reward of the previous transition as the observation for the current transition.
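As a rough illustration of what the delay does, here is a minimal Python sketch, not a Simulink model. The `UnitDelay` class only mimics the idea of holding a value for one step; `reward_fn`, `agent`, `simulate`, and the initial condition of 0.0 are hypothetical choices made for this example.

```python
class UnitDelay:
    """Holds a value for one step: the output at step t is the input from step t-1."""

    def __init__(self, initial_condition=0.0):
        self.stored = initial_condition

    def output(self):
        return self.stored

    def update(self, value):
        self.stored = value


def reward_fn(action):
    """Hypothetical reward: largest (zero) when the action equals 1."""
    return -(action - 1.0) ** 2


def agent(observation):
    """Hypothetical placeholder policy standing in for the trained agent."""
    return 0.5 * observation + 0.5


def simulate(num_steps):
    delay = UnitDelay(initial_condition=0.0)
    history = []
    for _ in range(num_steps):
        observation = delay.output()   # reward of the previous step (initial condition at step 0)
        action = agent(observation)    # the action depends only on an already-known value
        reward = reward_fn(action)     # the reward for this step is computed afterwards
        delay.update(reward)           # stored for use as the next step's observation
        history.append((observation, action, reward))
    return history


if __name__ == "__main__":
    for row in simulate(3):
        print(row)
```

Because the observation is read from the delay before the current reward is computed, nothing at a given step depends on itself, which is the circularity the delay is meant to remove.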
To learn more about the "Unit Delay" block, refer to the Simulink documentation for that block.
Hope this helps!
1 Comment
Jize Liu 2024-4-6
Thank you for your reply. This should help. I have one point I want to confirm: in one cycle (t), the system starts by receiving an observation and ends with a reward, and in the next cycle (t+1), the new observation, which could be the reward from the last cycle, is input to the system and starts a new cycle. Is that right?

Release

R2020b
