Trying to train PPO RL agent

Question

Nicolas CRETIN 2024-7-18

Direct link to this question: https://ww2.mathworks.cn/matlabcentral/answers/2138466-trying-to-train-ppo-rl-agent

Edited: Nicolas CRETIN 2024-8-2

Hello,

I'm trying to train a PPO agent, but I'm encountering the following issue:

After a certain point, the agent stops learning, even though it has only reached a local maximum. Say that for the first ten episodes the agent gets a very poor reward, because it is in fact performing badly. Then, in the 11th episode (see graph below), the agent finds a local maximum by updating its action values to 30 and -30 (these are the gain coefficients of a PI controller). Finally, from the 12th episode onward (i.e. the next one), the agent no longer updates its action values.

As a solution, I have already tried increasing EntropyLossWeight from 0.02 up to 1. I tried many values in this range, and none of them seems to help.
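For context, here is a minimal sketch of how an entropy bonus typically enters a PPO-style actor loss for a Gaussian policy. The function names and numbers are illustrative assumptions, not the toolbox's internal implementation.

```python
import math

def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian with standard deviation std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)

def actor_loss(surrogate, std, entropy_weight):
    """Hypothetical actor loss: negative surrogate minus a weighted entropy bonus."""
    return -surrogate - entropy_weight * gaussian_entropy(std)

# With the same surrogate value, a wider policy (larger std) has a lower loss,
# and the gap grows with the entropy weight.
for w in (0.02, 1.0):
    narrow = actor_loss(surrogate=1.0, std=0.1, entropy_weight=w)
    wide = actor_loss(surrogate=1.0, std=2.0, entropy_weight=w)
    print(f"weight={w}: narrow={narrow:.3f}, wide={wide:.3f}")
```

In this sketch the entropy term only pushes the standard deviation up; it does not by itself tell the policy in which direction to move the mean.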

Another factor may influence the result: over a very wide range of action values (e.g. [1; ∞] for the first action), no variation in the system output is perceptible, so no variation in the reward can be seen whatever action value is taken within this range. In other words, in the picture below, the agent tried three different gain values, but all three produced the same result, so maybe the agent cannot learn from them.
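The effect of such a plateau can be sketched with a toy example. The `reward` function below is a made-up saturating plant, not the actual system: every gain at or above 1 gives the same reward, so advantages computed against a mean baseline are all zero and a policy-gradient update has nothing to work with.

```python
import random

def reward(gain):
    """Hypothetical saturating plant: any gain >= 1 yields the same reward."""
    return -1.0 / min(max(gain, 1e-6), 1.0)

def advantages(gains):
    """Advantage of each sample relative to the batch mean reward (a simple baseline)."""
    rewards = [reward(g) for g in gains]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

random.seed(0)
# Gains sampled on the plateau (all >= 1): every advantage is zero.
plateau = [random.uniform(1.0, 100.0) for _ in range(5)]
# Gains sampled below the plateau: advantages differ, so there is a learning signal.
slope = [random.uniform(0.1, 0.9) for _ in range(5)]

print(advantages(plateau))
print(advantages(slope))
```

If the real system behaves like this, exploring more widely inside the plateau would not help; the samples would need to reach the region where the output actually changes.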

So I would like it to keep exploring even though it has just obtained a better reward, since this is still not the best it can achieve.

Link to PPO agent options, including EntropyLossWeight: Options for PPO agent - MATLAB - MathWorks Switzerland

Any help would be much appreciated!

Thanks a lot in advance!

Nicolas


Answer 1

Karan Singh 2024-7-23

Direct link to this answer: https://ww2.mathworks.cn/matlabcentral/answers/2138466-trying-to-train-ppo-rl-agent#answer_1489216

Hi Nicolas,

The problem you are facing is a common one, and in my view the only way forward is trial and error with various parameters. Here are some that I have tried in my personal projects and that may be useful for you:

Modify the "learning rate" to see if it helps the agent to escape the local maximum.
The "clip factor" in PPO controls how much the new policy can deviate from the old policy. Adjusting this parameter could help the agent explore more.
Adding noise to the actions can encourage exploration. For example, you can add Gaussian noise to the action values. Here is an interesting article by OpenAI on the same: https://openai.com/index/better-exploration-with-parameter-noise/
Lastly, consider the "Entropy Loss Weight," which you have already tried.
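To make the role of the clip factor concrete, here is a minimal sketch of the standard PPO clipped surrogate for a single sample. The function name and the numbers are illustrative assumptions, and this is not the toolbox's own code.

```python
def clipped_surrogate(ratio, advantage, clip_factor):
    """PPO clipped objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - clip_factor, min(ratio, 1.0 + clip_factor))
    return min(ratio * advantage, clipped_ratio * advantage)

# A positive advantage and a new policy that makes the action 1.5x more likely.
# A larger clip factor lets more of that change count toward the objective.
for eps in (0.1, 0.2, 0.5):
    print(eps, clipped_surrogate(ratio=1.5, advantage=2.0, clip_factor=eps))
```

A larger clip factor allows bigger policy changes per update, which may help the agent leave a local maximum but can also make training less stable.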


Nicolas CRETIN 2024-7-23

But what exactly does parameter noise refer to?

Is it noise applied to the observation signal, which then goes into the actor/critic network?

Is it noise applied to the output of the actor network?

Or does it refer to widening the range of values of the standard deviation branch of the actor network?

Actually, I don't really understand the picture below (which comes from the article above):

Nicolas CRETIN 2024-8-1

Edited: Nicolas CRETIN 2024-8-2

Okay, now, I think I get it:

There are several ways to add noise to improve performance:

As in the left picture, i.e. after the network has output the best action value. This technique is used by agents such as PPO (which I used above), since the PPO actor network outputs the mean and the standard deviation of the action to be taken.

Or, as in the right picture, i.e. noise is added to the network itself, maybe with something like a dropout layer, I guess. According to the OpenAI article above, this method tends to perform better.

Finally, adding noise to the input data improves the ability to generalize and reduces overfitting. Among other things, this technique makes it possible to increase the effective size of the input data set.
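The first two options can be contrasted with a toy sketch. The linear `policy` and all names below are hypothetical; as far as I understand the article, its parameter noise perturbs the network weights directly with Gaussian noise, which is what `perturb_weights` imitates here.

```python
import random

def policy(weights, observation):
    """Toy linear policy: action = sum(w_i * obs_i)."""
    return sum(w * o for w, o in zip(weights, observation))

def act_with_action_noise(weights, observation, sigma, rng):
    """Action-space noise: perturb the output of the unchanged policy."""
    return policy(weights, observation) + rng.gauss(0.0, sigma)

def perturb_weights(weights, sigma, rng):
    """Parameter-space noise: perturb the weights once, e.g. at the start of an episode."""
    return [w + rng.gauss(0.0, sigma) for w in weights]

rng = random.Random(0)
weights = [0.5, -1.0]
obs = [1.0, 2.0]

# Action noise: the same observation gives a different action on each call.
print([act_with_action_noise(weights, obs, 0.1, rng) for _ in range(3)])

# Parameter noise: within an episode the perturbed policy is deterministic.
noisy = perturb_weights(weights, 0.1, rng)
print([policy(noisy, obs) for _ in range(3)])
```

With parameter noise the exploration is consistent across an episode: the same observation always maps to the same action until the weights are perturbed again.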
