Singularity problems with RL designer for control moment gyro control

5 次查看(过去 30 天)
Hello,
I am trying to find a RL solution to control a control moment gyroscope using the reinforcement learning designer. The control moment gyroscope is in the inner loop. The outer loop consists of a 1DOF spacecraft attitude dynamics model (but is not very relevant to the RL control problem). To give you a general idea of the dynamic system:
Inside the 'SGCMG RL control and dynamics' block, I implemented the control as well as the dynamics of the control moment gyro. In the 'SGCMG RL control law block' is the RL agent. The action is the angular velocity of the control moment gyro's gimbal angle. The observation consists of the angular velocity of the spacecraft (s/c) and the difference between commanded torque and total torque acting on the s/c. Finally, the reward function consists of the error state, cos(delta) which is a measure of proximity to singularity of the system and the use of the actuator. These are all penalties and it is therefore more of a cost function than a reward function.
As for the RL process, I initiated the environment as follows:
%% Clear vars
close all
clear
clc
%% Define the system and input
Md=0.01; %disturbance moment Nm
J=400; %intertia kgm2
%% Initial conditions
delta0=deg2rad(0); %initial gimbal angle of control moment gyro rad
omega0=0; %initiate angular velocity of the spacecraft
theta0=deg2rad(20); %initial value for theta1
%% Create environment object from simulink model
mdl = 'RL_SGCMG_simp';
open_system(mdl)
obsInfo=rlNumericSpec([1 2]);
obsInfo.Name='observation';
actInfo = rlNumericSpec([1 1],'LowerLimit',-10^20,'UpperLimit',10^20);
actInfo.Name='action';
agentBlk=[mdl '/SGCMG RL control and dynamics/SGCMG RL control law/RL Agent'];
env=rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
%% Simulatation with trained agent --> run after agent is trained in RL designer
sim('RL_SGCMG_simpel');
%%
figure(1)
plot(tout,rad2deg(delta(1,:)))
xlabel('Time [s]')
ylabel('delta [deg]')
figure(2)
plot(tout,theta)
xlabel('Time [s]')
ylabel('theta [deg]')
Next, I imported the environment into the reinforcement learning designer and used a default DDPG agent. My main problem at this moment is that the action will go to its maximum value, after 1 or 2 timesteps into the episode and won't change (by a lot). For instance, with a lowerlimit, upperlimit of -10^20; 10^20, the penalty for the actuator will go straight to the maximum value (deltadot*10^-5=10^15 for the current reward function). Note that this value will be subtracted before going into the reward port of the RL block. If I do not implement a lower and upper limit on the action, it will go to infinity after some timesteps and singularity problems kick in. I would say, that with this severe penalty for the actuator (action) its value wouldn't go all the way up. It's almost like my RL algorithm is trying to minimize the reward... However, if I delete the reward function alltogether (no input for the RL agent at all), the same thing happens. So it cannot be the reward function I'd say.
Would that mean there's a problem with the default DDPG agent? The default actor and critic NNs are, respectively:
Allthough I'm pretty experienced in Matlab and Simulink, RL and NNs are relatively new to me. I already tried to play with the exploration setting (stand. dev.) but this won't solve the problem. I was hoping anybody with a bit more experience in RL could spot the problem and help me on my way.
With kind regards,
Noah
  2 个评论
Sam Chak
Sam Chak 2022-7-17
编辑:Sam Chak 2022-7-17
Although I don't know what your RL does when you have the circled block takes care of the attitude system, if the governing mathematics in the circled block is rigorously proven to 100% stable, does your SGCMG output () go to the infinity and beyond as well?
Or is this a momentum dumping problem for the CMG? When it gets saturated (cannot spin faster than the hardware limit), RL gives an infinite value (NaN) at the Output?
Noah Stam
Noah Stam 2022-7-18
编辑:Noah Stam 2022-7-18
Hello Sam,
Thanks alot for your reply to my post! To clarify: the circled block does indeed take care of the attitude system in a way that connects the error in attitude angle to a required torque to get that error to zero. The SGCMG RL control takes this required torque and translates this into a required gimbal angle rate for the single gimbal control moment gyroscope. So this is really control of the actuator itself. This gimbal angle rate then yields an actual actuator torque Mcmg.
I followed up on your question on the stability of the red circled block by first deleting the 'SGCMG RL control and dynamics' block alltogether. I found out that the PID controller in it made the system unstable. I copied this part from a 3DOF model that I made earlier with a different actuator controller. Apparently the change to 1DOF changed the dynamics of the system in such a way that the original PID controller didn't do the trick anymore. I therefore implemented a new PID controller. The stable respons of the PID controller to some reference:
Also, I saw that I fed back omega to the theta input and vice versa... undoing this and implementing the new PID made me think I solved the problem. However, this was not the case. Action still goes to its maximum value (as in scope graph in my previous post) even when I only penalize this action without any other terms in the reward function, or for no reward function at all. Maybe the abs() I use in the reward function are a problem as they make it discontinuous? (but then again, also a problem for no reward function at all). Maybe the training process can give you any clues on what's going on. Q0 does not tend to be in line with the reward either, this seems to me like something is going wrong with the critic?
With kind regards,
Noah
Update: another weird thing is that for no limits on the action output, the action value does go to infinity, but only after awhile (can be after 15 episodes or already in the first...). While if I do add limits, it tends to go to these limits right away (at the second or third timestep or so). This seems weird as the limits should not affect the system when the output is within these bounds right? Apart from this, it does seem to be the dynamics in the system itself reaching a state of singularity during exploration of the state space.
An example of the results i got for not putting any limits on the action:
The training algorithm seems to minimize the penalty up until the 8th episode, where it reaches a singularity. The action in this final episode behaves as follows:
(when you zoom in it actually looks reasonable). I thought putting a constraint on the action output would limit it and prevent it from becoming singular, but then again: with these limitis, the action ouput goes straight to the maximum value and won't bother to explore any further.

请先登录,再进行评论。

回答(0 个)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by