Reinforcement Learning algorithm for adaptive proportional controller of a first order system

Hello everyone,
I have a simple plant model described by the first-order transfer function

G(s) = K / (T*s + 1),

where T is the time constant and the gain K can change from time to time.
To control the output x of this plant, a proportional controller with gain Kp is employed. I want to use an RL agent that observes the parameter K and whose action is the gain Kp, such that the system output has the desired steady-state error. Here is the Simulink model:
The desired steady-state error is chosen so that the agent outputs different values of Kp instead of always the maximum Kp, which would be the case if we wanted a minimum steady-state error. Because the plant is a first-order (type 0) system, the steady-state error under proportional control is e_ss = x_ref / (1 + Kp*K), so the proportional gain should be

Kp = (x_ref - e_ss) / (e_ss * K),

where e_ss is the desired steady-state error. For x_ref = 1 and e_ss = 0.2, the policy should then approximate the function

Kp(K) = 4 / K.
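As a small numeric check of this mapping (using x_ref = 1 and e_ss = 0.2, the values set in the script below):

% Target gain from the steady-state relation e_ss = xref/(1 + Kp*K)
xref = 1;  e_ss = 0.2;
K  = [2 4 8];                        % a few example plant gains
Kp = (xref - e_ss) ./ (e_ss * K)     % = 4./K  ->  2.0   1.0   0.5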
Of course, the agent does not know this function and should find a neural network that approximates it through the training process. My question is:
How should one approach the agent training process in cases where the policy should output controller gains in response to a changing environment?
Let us suppose that one training episode lasts 2 s and that x_ref is always constant and equal to 1; only K changes from episode to episode.
I think the agent should not change its parameters during a single episode, so that it gets a true picture of the response with a fixed gain. In reality the controller will work with one fixed gain (the agent's action) for as long as K is fixed, so I think it is correct to change this gain during training only after an individual episode has finished and the reward for that particular episode has been received. Does this mean that the agent's sample time should be equal to the duration of one episode, 2 s? If not, the agent will change Kp even though it has not waited to receive a reward for the system's response with that Kp.
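One way to make that structure explicit, shown here only as a rough sketch and not as the Simulink setup above, is to cast the problem as a custom function environment in which each step is one whole 2 s response: the agent picks Kp once, gets one cumulative reward, and the episode ends. The plant K/(s+1) (unit time constant) and the reward used inside the step function are my assumptions for illustration:

% Sketch of a one-step-per-episode environment: the agent picks Kp once,
% the whole 2 s response is evaluated, and a single cumulative reward is
% returned before the episode terminates.
obsInfo = rlNumericSpec([1 1]);   obsInfo.Name = 'K';
actInfo = rlNumericSpec([1 1], "LowerLimit", 0.1, "UpperLimit", 1);   actInfo.Name = 'Kp';
env = rlFunctionEnv(obsInfo, actInfo, @oneShotStep, @oneShotReset);

function [obs, logged] = oneShotReset()
% New episode: draw a random plant gain and expose it as the observation
K = 5 + 9*(rand - 0.5);
obs = K;
logged.K = K;
end

function [nextObs, reward, isDone, logged] = oneShotStep(action, logged)
% One "step" = one full 2 s closed-loop response with the chosen Kp
K = logged.K;   Kp = action;
xref = 1;   e_ss = 0.2;   Ts0 = 0.01;   t = 0:Ts0:2;
% Unit-step response of the closed loop Kp*K/(s + 1 + Kp*K) (assumed plant K/(s+1))
x = xref * (Kp*K/(1 + Kp*K)) * (1 - exp(-(1 + Kp*K)*t));
e = xref - x;
reward = sum(-e - (e < e_ss));    % same per-step reward as described below, summed over the episode
nextObs = K;
isDone = true;                    % the episode ends after this single step
end

With this framing, MaxStepsPerEpisode is simply 1 and the agent's sample time no longer has to encode the "one action per episode" structure.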
Since I chose the agent's sample time to be equal to the duration of one episode, the reward it receives is the cumulative reward over that episode,

R = Σ r(t)  (summed over all simulation steps of the episode),

where the per-step reward r is:
function r = fcn(x, xref, e_ss, t)
% Per-step reward: penalize the tracking error, and add an extra penalty of 1
% whenever the instantaneous error drops below the desired steady-state error
e = xref - x;
r = -e - (e < e_ss);
end
and the cumulative reward is accumulated as:
function R = fcn(r, R_old)
% Running sum of the per-step reward over the episode
R = R_old + r;
end
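As a sanity check of this reward design, the cumulative reward can also be computed offline as a function of Kp for a fixed K. A rough sketch, assuming a plant K/(s+1) (the unit time constant is my assumption, not given above):

K    = 6;                  % example plant gain (target Kp = 4/6 ~ 0.67)
xref = 1;  e_ss = 0.2;
Ts0  = 0.01;  t = 0:Ts0:2;
KpGrid = linspace(0.1, 1, 50);          % sweep over the allowed action range
R = zeros(size(KpGrid));
for i = 1:numel(KpGrid)
    Kp = KpGrid(i);
    % unit-step response of the closed loop Kp*K/(s + 1 + Kp*K)  (assumed plant K/(s+1))
    x = xref * (Kp*K/(1 + Kp*K)) * (1 - exp(-(1 + Kp*K)*t));
    e = xref - x;
    R(i) = sum(-e - (e < e_ss));        % cumulative reward for this Kp
end
figure
plot(KpGrid, R, 'LineWidth', 1.5), grid on
xlabel('K_p'), ylabel('Cumulative reward R over one episode')
title(sprintf('Reward vs. K_p for K = %g', K))

If the curve does not peak near the analytic Kp = 4/K, the reward itself may need reshaping before any agent can be expected to learn the mapping.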
Here is a file for the agent initialization:
clear; clc; close all;
%%
K = 3; % Initial value for plant parameter K
e_ss = 0.2; % desired steady-state error
Ts0 = 0.01; % Simulation sample time [s]
Tsim = 2; % Simulation time [s]
xref = 1; % Reference signal
%%
doTraining = true;
mdl = 'FirstOrderSys_1';
open_system(mdl)
%% Create RL environment
obsInfo = rlNumericSpec([1 1]);
obsInfo.Name = 'observations';
obsInfo.Description = 'K';
actInfo = rlNumericSpec([1 1], "UpperLimit", 1, "LowerLimit", 0.1);
actInfo.Name = 'Kp';
% Build the environment interface object.
env = rlSimulinkEnv(mdl, [mdl '/RL_Agent'], obsInfo, actInfo);
% Set a custom reset function that randomizes K for the model.
env.ResetFcn = @(in)localResetFcn(in,mdl);
% Fix the random generator seed for reproducibility
rng(0)
%% Define Agent
learnopts = rlOptimizerOptions;
agentopts = rlDDPGAgentOptions("SampleTime", Tsim, "ExperienceBufferLength", 100, "MiniBatchSize", 50);
agentopts.ActorOptimizerOptions = learnopts;
agentopts.CriticOptimizerOptions = learnopts;
agent = rlDDPGAgent(obsInfo, actInfo, agentopts);
%% Train Agent
maxepisodes = 1000;
maxsteps = ceil(Tsim/Ts0);
trainOpts = rlTrainingOptions(...
MaxEpisodes = maxepisodes, ...
MaxStepsPerEpisode = maxsteps, ...
ScoreAveragingWindowLength = 100, ...
Verbose = false, ...
Plots = "training-progress",...
StopTrainingCriteria = "AverageReward",...
StopTrainingValue = -100);
if doTraining
% Train the agent.
info = train(agent, env, trainOpts);
else
% Load pretrained agent.
end
%% Validate the learned agent
simopts = rlSimulationOptions("NumSimulations", 5, "MaxSteps", maxsteps);
simout = sim(agent, env, simopts);
%% Local functions
% Function to randomize the K at the beginning of each episode.
function in = localResetFcn(in,mdl)
% Randomize K
blk = [mdl '/K'];
K = 5 + 9*(rand-0.5);
in = setBlockParameter(in,blk,'Value',num2str(K));
end
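Once training has produced something, the learned mapping K -> Kp could also be compared directly against the analytic target. A sketch, assuming the trained agent and the variables from the script above are still in the workspace:

% Compare the learned policy with the analytic target Kp = (xref - e_ss)/(e_ss*K)
Kgrid   = linspace(0.5, 9.5, 50);       % range produced by localResetFcn
KpAgent = zeros(size(Kgrid));
for i = 1:numel(Kgrid)
    a = getAction(agent, {Kgrid(i)});   % evaluate the policy for observation K
    KpAgent(i) = a{1};
end
figure
plot(Kgrid, (xref - e_ss)./(e_ss*Kgrid), 'k--', Kgrid, KpAgent, 'b', 'LineWidth', 1.5)
legend('Analytic K_p = 4/K', 'Learned policy'), grid on
xlabel('Plant gain K'), ylabel('K_p')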
The StopTrainingValue of -100 is chosen arbitrarily and can be changed; the point is that even after 1000 training episodes no convergence is seen. The agent can also be of any type; I adopted DDPG as a first attempt.
Although the example I gave is simple and refers to a first-order system with a proportional controller, my question is more general and refers to cases where we want the RL agent to change controller gains in response to a changing environment it can observe.
Thank you for considering my question.

Answers (0)
