Constant output of Reinforcement learning on optimal control problem

Question

YU WENG 2023-10-13

https://ww2.mathworks.cn/matlabcentral/answers/2032904-constant-output-of-reinforcement-learning-on-optimal-control-problem

I am using reinforcement learning for voltage control under varying renewable generation. However, after training, the agent keeps giving me a constant control action regardless of the input. Where could the problem be? Is something wrong in the environment, in the reward function or the reset function?

My understanding is that the training process converges to a single action that achieves the highest average reward across all observations, so the trained agent gives a constant output. My objective, however, is for the trained agent to provide the optimal action for each observation. How should I fix it?

My code is very simple. The main points are:

Observation = [power injections, uncertainties, voltages]; Action = [control injection] on selected buses.

Reset function: apply a random uncertainty, drawn from a set range, to the power injections, then run a power flow to get the voltages.

Step function: take this.State and the Action, apply the Action to change the power injections, then run a power flow to get the voltages and update the system state.

Reward function: voltage improvement minus the control effort needed. It can be simplified as follows:

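P_inj = this.State(:,1);   % power injections (column 1 of the state)
Ctr = sum(P_inj(this.Ctr_bus)-this.PowerInjections(this.Ctr_bus));   % signed sum of injection changes at the controlled buses
Vol = this.State(:,3)-this.GoalVoltage;   % signed voltage deviation from the goal (column 3 of the state)
Reward = sum(Vol)-Ctr;   % summed signed deviation minus summed signed control change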

Thank you for your time and help. I can provide more details if needed.

Answer 1

Nihal 2024-7-23

Hi Yu,

I understand from your query that you want to know why the actor keeps giving a constant control action irrespective of the input.

Based on the information provided, the issue most likely lies in the reward function. If the agent consistently receives a high reward for a certain action, it will learn to stick to that action regardless of the input. The current reward sums the signed voltage deviations over all buses and then subtracts the signed sum of the control changes, so positive and negative terms can cancel and the reward says little about how well each individual bus is regulated. This formulation can lead the agent to favour a constant control action that achieves a high average reward.
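To illustrate the cancellation, here is a minimal sketch in base MATLAB. The 4-bus voltages and the variable names are made up for illustration and are not taken from your environment; the sketch only compares a signed sum of deviations with a sum of absolute deviations.

```matlab
% Hypothetical 4-bus example; all numbers are invented for illustration.
goalVoltage = [1.00; 1.00; 1.00; 1.00];   % per-unit voltage targets
vOnTarget   = [1.00; 1.00; 1.00; 1.00];   % every bus exactly on target
vSkewed     = [1.05; 0.95; 1.04; 0.96];   % large errors of opposite sign

% Signed sum, as in the posted reward: opposite-sign deviations cancel.
signedOnTarget = sum(vOnTarget - goalVoltage);
signedSkewed   = sum(vSkewed   - goalVoltage);

% Absolute sum: every deviation counts, whatever its sign.
absOnTarget = sum(abs(vOnTarget - goalVoltage));
absSkewed   = sum(abs(vSkewed   - goalVoltage));

fprintf('signed sum:   on target = %.3f, skewed = %.3f\n', signedOnTarget, signedSkewed);
fprintf('absolute sum: on target = %.3f, skewed = %.3f\n', absOnTarget, absSkewed);
```

With these numbers the signed sum is zero (to within rounding) for both voltage profiles, so the reward cannot tell them apart, whereas the absolute sum separates them clearly (0 versus 0.18).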

Here are a few things you can try to overcome this:

Consider adding a penalty for the deviation of the agent's control action from the optimal action.
Penalize the agent for deviating from the optimal value for each observation: instead of summing the signed voltage differences, calculate the absolute difference at each bus and penalize the agent based on these differences.
Try other reward-shaping techniques.
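The second suggestion could be sketched roughly as follows. This is only one possible shaping, not a tested fix for your environment: the weights wVoltage and wControl, the function handle shapedReward, and the 3-bus numbers are all hypothetical, and the sketch assumes the goal is to keep every bus close to its target voltage with as little control effort as possible.

```matlab
% Hypothetical weights; these are tuning knobs, not recommended values.
wVoltage = 1.0;
wControl = 0.1;

% Reward as a negative cost: per-bus absolute voltage error plus
% absolute control effort at the controlled buses. The maximum is 0.
shapedReward = @(v, vGoal, pInj, pBase, ctrBus) ...
    -wVoltage * sum(abs(v - vGoal)) ...
    -wControl * sum(abs(pInj(ctrBus) - pBase(ctrBus)));

% Made-up 3-bus data for illustration.
vGoal  = [1.00; 1.00; 1.00];
pBase  = [0.5; -0.2; 0.3];    % injections before the control action
ctrBus = [1 3];               % indices of the controlled buses

% Case A: on target, no control action applied.
rA = shapedReward(vGoal, vGoal, pBase, pBase, ctrBus);

% Case B: voltage errors of opposite sign, control changes of opposite sign.
vB = [1.04; 0.97; 1.00];
pB = pBase + [0.1; 0; -0.1];
rB = shapedReward(vB, vGoal, pB, pBase, ctrBus);

fprintf('reward A = %.3f, reward B = %.3f\n', rA, rB);
```

In case B the two control changes sum to zero, so a signed-sum control term would see no effort at all, while the absolute-value form charges for both. Because the reward now depends on the error at each bus, a single constant action is less likely to score well across all observations.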