freezing layers of actor and critic of RL agent
After training, I froze every layer of the actor and critic networks of my RL agent (using setLearnRateFactor(neuralnet,'layers','parameters',0);) and then retrained the agent in the same environment. I am getting rewards like those shown in the attached image.
My question is: is it normal to get rewards like this? (Shouldn't there be no variation, or only very little variation, in the rewards?)
My reward function is 10 - e^2, where e is the error.
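For reference, a minimal sketch of one way to zero the learn-rate factors of every learnable parameter of the actor is shown below. It assumes the agent's actor wraps a dlnetwork (the default in recent releases); agent is a placeholder for your trained agent object, and the same loop can be repeated for the critic via getCritic/setCritic.

% Minimal sketch: freeze every learnable parameter of the actor.
% "agent" is a placeholder for your own trained agent object.
actor = getActor(agent);
net   = getModel(actor);                      % underlying dlnetwork
for k = 1:height(net.Learnables)
    net = setLearnRateFactor(net, ...
        net.Learnables.Layer(k), ...          % layer name
        net.Learnables.Parameter(k), 0);      % zero learn-rate factor
end
actor = setModel(actor, net);
agent = setActor(agent, actor);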
Answers (1)
Karanjot on 30 Jan 2024
Edited: Karanjot on 30 Jan 2024
Observing fluctuations in rewards is common when retraining a reinforcement learning (RL) agent, even after freezing the parameters of both the actor and critic networks. The training loop continues to apply its exploration strategy in the environment, and the reward function you specified strongly influences the reward outcomes.
The variation in rewards can be influenced by several factors, such as the exploration-exploitation trade-off, the complexity of the environment, and the learning rate of the agent. Even with frozen weights, exploration can lead the agent to visit different states or take different actions that yield different rewards.
The environment's inherent stochasticity can lead to different state transitions and rewards for similar actions. Additionally, any unfrozen parameters or noise processes involved in action selection could contribute to the observed variation.
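One quick way to separate exploration noise from everything else is to simulate the frozen agent outside of training: for many agent types (e.g., DQN, DDPG, TD3), sim uses the deterministic greedy policy by default, unlike train, which always explores. A minimal sketch, assuming env and agent are your existing environment and agent:

% Minimal sketch: run one greedy episode with the frozen agent.
% "env" and "agent" are placeholders for your own objects.
simOpts = rlSimulationOptions('MaxSteps', 500);   % adjust to your task
experience = sim(env, agent, simOpts);
totalReward = sum(experience.Reward.Data);        % episode return without exploration

If the reward is nearly constant across repeated greedy episodes, the variation you see during retraining is most likely coming from exploration rather than from the (frozen) networks.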
You may consider the following steps (a plotting sketch follows the list):
- Plot the rewards over time during retraining to observe the trend. This can help you see whether the rewards are converging.
- Experiment with different learning rates for the agent. A higher learning rate may lead to faster convergence but can also cause more variation early in training.
- Try modifying the reward function to see whether it reduces the variation in rewards.
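As a minimal sketch of the first suggestion, the training statistics returned by train contain per-episode rewards that can be plotted directly; agent, env, and trainOpts are placeholders for your own objects.

% Minimal sketch: plot episode rewards during/after retraining.
% "agent", "env", and "trainOpts" are placeholders for your own objects.
trainingStats = train(agent, env, trainOpts);

figure
plot(trainingStats.EpisodeReward)     % reward of each episode
hold on
plot(trainingStats.AverageReward)     % windowed average reward
hold off
xlabel('Episode')
ylabel('Reward')
legend('Episode reward', 'Average reward', 'Location', 'best')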
Keep in mind that RL training is inherently iterative, and achieving an optimal policy often requires multiple iterations. While some degree of reward variation is to be expected, large or persistent swings may indicate a need for further investigation or adjustments to your training setup.