Deep reinforcement learning and the TD3 algorithm for PID control

44 views (last 30 days)
I'm relatively new to reinforcement learning. I have a project where I need to use the TD3 algorithm to tune the parameters of a PID controller; in my case it is a continuous controller. I have read some articles that describe the use of RL specifically for PID control, but they do not describe the hyperparameters, diagrams, or other details, and I am having trouble applying the TD3 algorithm to my cartpole system. Perhaps someone can guide me on using TD3 for parameter tuning. For example, my algorithm trains, but at the end the pole does not reach the target position of 180°. I hope someone can guide me, thank you!

Accepted Answer

Emmanouil Tzorakoleftherakis
Have you seen this example?
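In case it helps in the meantime, here is a minimal sketch of how a TD3 agent for PID-gain tuning might be set up with the Reinforcement Learning Toolbox. The model name cartpolePIDModel, the RL Agent block path, the observation/action dimensions, the gain limits, and all numeric option values below are placeholders for illustration, not values taken from that example:

% Observations: for example [error; integral of error; pendulum angular velocity]
obsInfo = rlNumericSpec([3 1]);

% Actions: the PID gains [Kp; Ki; Kd], bounded so exploration stays in a sensible range
actInfo = rlNumericSpec([3 1], ...
    'LowerLimit', [0; 0; 0], ...
    'UpperLimit', [50; 10; 10]);   % placeholder gain ranges

% Environment defined by a Simulink model containing an RL Agent block
env = rlSimulinkEnv('cartpolePIDModel', 'cartpolePIDModel/RL Agent', obsInfo, actInfo);

% TD3 agent with default actor/critic networks and a few basic options
agentOpts = rlTD3AgentOptions( ...
    'SampleTime', 0.02, ...
    'DiscountFactor', 0.99, ...
    'MiniBatchSize', 128, ...
    'ExperienceBufferLength', 1e6);
agent = rlTD3Agent(obsInfo, actInfo, agentOpts);

% Train, stopping once the moving-average reward reaches a chosen threshold
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 1000, ...
    'MaxStepsPerEpisode', 500, ...
    'ScoreAveragingWindowLength', 20, ...
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', 800);     % placeholder threshold
trainingStats = train(agent, env, trainOpts);

In this arrangement the agent's action is the vector of PID gains rather than the control force itself; the PID controller inside the Simulink model then computes the force applied to the cart.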

More Answers (1)

Sam Chak 2023-10-16
For the cartpole system, any angle that deviates from exactly ±180° (upright) will, if left uncorrected, result in the pendulum falling under gravity. Therefore, maintaining the pendulum at exactly 180° is the desired behavior. In an ideal scenario, your reward function might resemble the following:
  • If the pendulum is precisely at ±180°, a positive reward is provided.
  • For any deviation from ±180°, a negative reward or no reward is assigned.
However, be aware that training an RL agent to perform this swing-up task with such sparse rewards can be exceedingly challenging: the agent receives positive feedback only when it hits the exact target state, and it may need an impractically long time to discover the right actions, since it is highly improbable that exploration will randomly stumble upon the precise 180° position.
To address this challenge, many papers in the field suggest adding intermediate rewards for approaching the target state. From the perspective of a control engineer, it can be advantageous to build the reward function from three key components, covering transient behavior, steady-state behavior, and error behavior (a sketch combining the three appears after the plot code below):
  1. Swing-up reward (transient): A substantial positive reward is given when the agent effectively swings the pendulum from the bottom position to an angle close to 180°.
  2. Balance reward (steady-state): A modest positive reward is awarded for maintaining the pendulum within a predefined range of ±2% around 180°. This encourages the agent to maintain the pendulum close to the upright position.
  3. Failure penalty (error): A negative penalty is imposed when the pendulum falls or deviates significantly from the upright position, discouraging undesirable behavior.
% Illustrative second-order step response: an analogy for the desired swing-up
% behavior, rising toward 180 deg and settling within a +/-2% band
Gp = tf(10^2, [1 sqrt(2)*10 10^2])

% Output:
%   Gp =
%            100
%     -------------------
%     s^2 + 14.14 s + 100
%
%   Continuous-time transfer function.
step(Gp, 1.2)
ylabel('Pendulum angular position')
title('Response of a pendulum')
% Relabel the amplitude axis in degrees (an amplitude of 1.0 corresponds to 180 deg)
yt = 0:0.2:1.2;
yticks(yt);
yticklabels({'0', '36', '72', '108', '144', '180', '216'})
% +/-2% band around 180 deg, i.e. the balance (steady-state) region
yline(1+0.02, '--', 'color', '#D90e11')
yline(1-0.02, '--', 'color', '#D90e11')
% Approximate boundary between the swing-up (transient) and balance phases
xline(0.6, '-.', 'color', '#f1a45c')
text(0.1, 1.1, 'Transient behavior')
text(0.7, 1.1, 'Steady-state behavior')
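Putting the three components together, a reward function along the following lines could serve as a starting point. This is only a sketch: the signal names (theta in degrees measured from the downward position, thetaDot, cart position x), the weights, the ±2% band, and the track limit are placeholder choices to be adapted to your own model.

function r = swingUpReward(theta, thetaDot, x)
% Shaped reward combining swing-up, balance, and failure terms (placeholder weights)
    thetaErr = abs(180 - abs(theta));      % deviation from upright, in degrees

    % 1) Swing-up (transient): dense term that grows as the pendulum approaches 180 deg
    rSwing = 1 - thetaErr/180;

    % 2) Balance (steady-state): bonus for holding the pendulum within about +/-2% of 180 deg
    rBalance = 0;
    if thetaErr <= 0.02*180 && abs(thetaDot) < 1
        rBalance = 5;
    end

    % 3) Failure (error): penalty if the cart runs off its track (placeholder limit of +/-2 m)
    rFail = 0;
    if abs(x) > 2
        rFail = -10;
    end

    r = rSwing + rBalance + rFail;
end

The dense swing-up term keeps the feedback from being sparse, while the balance bonus and the failure penalty shape the steady-state and error behavior described above.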
