RL Agent learns a constant trajectory instead of actual trajectory

9 views (last 30 days)
Hi,
I have a conceptual question about my problem. I am trying to learn an engine control model with a DDPG agent, where I have an LSTM model of my engine as the plant. I simulate the engine for a given random trajectory and use the engine outputs, along with the engine states (the LSTM states) and the load trajectory, as the observation model for my agent.
I am trying to train the DDPG agent by asking it to follow a reference load trajectory as below (dashed line in the top-left graph). I have observed that, despite trying various network architectures, noise options, and learning rates, the learnt agent chooses to just deliver a constant load of around 6 (orange line in the top-left graph) rather than follow the given reference trajectory. The outputs seem to vary reasonably (shown here in blue), but the learning is still not acceptable.
I am tweaking the trajectory every episode to aid learning, so that the agent can see various load profiles.
Could you kindly advise what might be going on here?
Additional information: the same effect happens if I ask the controller to match a constant load trajectory (constant within an episode, then changing to another random constant for the next episode). I have attached my code here.
Thanks in advance :)
Code:
%% H2DF DDPG Trainer
%
% clc
% clear all
% close all
ObsInfo.Name = "Engine Outputs";
ObsInfo.Description = ' IMEP, NOX, SOOT, MPRR ';
%% Creating environment
obsInfo = rlNumericSpec([16 1],...
'LowerLimit',[-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf]',...
'UpperLimit',[inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]');
obsInfo.Name = "Engine Outputs";
obsInfo.Description = ' IMEP, NOX, SOOT, MPRR, IMEP_t-1,IMEP_ref,IMEP_ref_t-1, IMEP_error, states';
numObservations = obsInfo.Dimension(1);
actInfo = rlNumericSpec([4 1],'LowerLimit',[0.17e-3;440;-1;1e-3],'UpperLimit',[0.5e-3;440;-1;5.5e-3]);
actInfo.Name = "Engine Inputs";
actInfo.Description = 'DOI, P2M, SOI, DOI_H2';
numActions = actInfo.Dimension(1);
env = rlSimulinkEnv('MPC_RL_H2DF','MPC_RL_H2DF/RL Agent',...
obsInfo,actInfo);
env.ResetFcn = @(in)localResetFcn(in);
Ts = 0.08;
Tf = 20;
% 375 engine cycle results
rng(0)
% 1200 - 0.1| 1900: 0.06
%% Creating Agent
L = 60; % number of neurons
statePath = [
sequenceInputLayer(numObservations, 'Normalization', 'none', 'Name', 'observation')
fullyConnectedLayer(L, 'Name', 'fc1')
reluLayer('Name', 'relu1')
fullyConnectedLayer(L, 'Name', 'fc11')
reluLayer('Name', 'relu11')
fullyConnectedLayer(2*L, 'Name', 'fc12')
reluLayer('Name', 'relu12')
fullyConnectedLayer(4*L, 'Name', 'fc15')
reluLayer('Name', 'relu15')
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(2*L)
reluLayer
fullyConnectedLayer(L, 'Name', 'fc2')
concatenationLayer(1,2,'Name','add')
reluLayer('Name','relu2')
fullyConnectedLayer(L, 'Name', 'fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
gruLayer(4)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(L, 'Name', 'fc7')
reluLayer('Name','relu7')
fullyConnectedLayer(1, 'Name', 'fc4','BiasInitializer','ones','WeightsInitializer','he')];
actionPath = [
sequenceInputLayer(numActions, 'Normalization', 'none', 'Name', 'action')
fullyConnectedLayer(2*L, 'Name', 'fc6')
reluLayer('Name','relu6')
fullyConnectedLayer(4*L, 'Name', 'fc13')
reluLayer('Name','relu13')
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(2*L, 'Name', 'fc14')
reluLayer('Name','relu14')
fullyConnectedLayer(L, 'Name', 'fc5','BiasInitializer','ones','WeightsInitializer','he')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');
figure
plot(criticNetwork)
criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
'Observation',{'observation'},'Action',{'action'},criticOptions);
%%
actorNetwork = [
sequenceInputLayer(numObservations, 'Normalization', 'none', 'Name', 'observation')
fullyConnectedLayer(L, 'Name', 'fc1')
reluLayer('Name', 'relu1')
fullyConnectedLayer(2*L, 'Name', 'fc2')
reluLayer('Name', 'relu2')
fullyConnectedLayer(2*L, 'Name', 'fc3')
reluLayer('Name', 'relu3')
fullyConnectedLayer(4*L, 'Name', 'fc8')
reluLayer('Name', 'relu8')
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
gruLayer(4)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(2*L, 'Name', 'fc9')
reluLayer('Name', 'relu9')
fullyConnectedLayer(L, 'Name', 'fc10')
reluLayer('Name', 'relu10')
fullyConnectedLayer(numActions, 'Name', 'fc4')
tanhLayer('Name','tanh1')
scalingLayer('Name','ActorScaling1','Scale',-(actInfo.UpperLimit-actInfo.LowerLimit)/2,'Bias',(actInfo.UpperLimit+actInfo.LowerLimit)/2)];
actorOptions = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation',{'observation'},'Action',{'ActorScaling1'},actorOptions);
%% Deep Deterministic Policy Gradient (DDPG) agent
agentOpts = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'TargetSmoothFactor',1e-3,...
'DiscountFactor',0.99, ...
'MiniBatchSize',128, ...
'SequenceLength',8,...
'ExperienceBufferLength',1e5, ...
'TargetUpdateFrequency', 10);
% agentOpts.NoiseOptions.Variance =
% [0.005*(70/sqrt(Ts));0.005*(12/sqrt(Ts));0.005*(0.4/sqrt(Ts))] v01
% agentOpts.NoiseOptions.Variance = [1e-5*0.0025*(80/sqrt(Ts));1e2*0.003*(12/sqrt(Ts));30*0.003*(0.4/sqrt(Ts));1e-1*0.00025*(0.4/sqrt(Ts))]; % superseded by the assignment below
agentOpts.NoiseOptions.Variance = 10*[1.65e-05;0;0;2.25e-04];
agentOpts.NoiseOptions.VarianceDecayRate = [1e-5;1e-5;1e-5;1e-5];
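% Note: setting UseDevice on these options objects below has no effect on the critic and
% actor that were already created above; to train on the GPU, set UseDevice before
% calling rlQValueRepresentation/rlDeterministicActorRepresentation.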
criticOptions.UseDevice = "gpu";
actorOptions.UseDevice = "gpu";
% agent = rlDDPGAgent(actor,critic,agentOpts);
% variance*ts^2 = (0.01 - 0.1)*(action range)
% At each sample time step, the noise model is updated using the following formula, where Ts is the agent sample time.
%
% x(k) = x(k-1) + MeanAttractionConstant.*(Mean - x(k-1)).*Ts
% + Variance.*randn(size(Mean)).*sqrt(Ts)
% At each sample time step, the variance decays as shown in the following code.
%
% decayedVariance = Variance.*(1 - VarianceDecayRate);
% Variance = max(decayedVariance,VarianceMin);
% For continuous action signals, it is important to set the noise variance appropriately to encourage exploration. It is common to have Variance*sqrt(Ts) be between 1% and 10% of your action range.
%
% If your agent converges on local optima too quickly, promote agent exploration by increasing the amount of noise; that is, by increasing the variance. Also, to increase exploration, you can reduce the VarianceDecayRate.
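% Example (assumption, not part of the original tuning): scale the noise so that
% Variance*sqrt(Ts) is roughly 5% of each action's range, per the guideline above.
% actRange = actInfo.UpperLimit - actInfo.LowerLimit;
% agentOpts.NoiseOptions.Variance = 0.05*actRange/sqrt(Ts);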
%% Training agent
maxepisodes = 10000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes, ...
'MaxStepsPerEpisode',maxsteps, ...
'ScoreAveragingWindowLength',100, ...
'Verbose',true, ...
'UseParallel',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',0,...
'SaveAgentCriteria','EpisodeReward','SaveAgentValue',-0.05);
%%
% Set to true to resume training from a saved agent
resumeTraining = false;
% Set ResetExperienceBufferBeforeTraining to false to keep experience from the previous session
agentOpts.ResetExperienceBufferBeforeTraining = ~(resumeTraining);
if resumeTraining
% Load the agent from the previous session
fprintf('- Resume training of: %s\n', 'agentV04.mat');
trained_agent = load('D:\Masters\HiWi\h2dfannbasedmpc\acados_implementation\rl\savedAgents\Agent253.mat');
agent = trained_agent.saved_agent;
else
% Create a fresh new agent
agent = rlDDPGAgent(actor, critic, agentOpts);
end
%% Train the agent
trainingStats = train(agent, env, trainOpts);
% get the agent's actor, which predicts next action given the current observation
actor = getActor(agent);
% get the actor's parameters (neural network weights)
%actorParams = getLearnableParameterValues(actor);
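
The localResetFcn referenced by env.ResetFcn above is not included in the listing. A hypothetical sketch of a reset function that randomizes the reference load every episode might look like the following (the variable name and load range are assumptions, not the actual implementation):
function in = localResetFcn(in)
% Hypothetical sketch only: pick a new random reference load level for this episode
refLoad = 2 + 6*rand;                      % assumed load range
in = setVariable(in,'IMEP_ref',refLoad);   % assumed name of the model's reference variable
end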

Answers (1)

Emmanouil Tzorakoleftherakis
Edited: Emmanouil Tzorakoleftherakis on 31 Jan 2024
Thanks for adding all the details. The first thing I will say is that the average reward on the Episode Manager is moving in the right direction. So from the point of view of the RL algorithm, it's learning... something. If that something is not what you would expect, maybe you should revisit the reward signal you have put together and make sure it makes sense. That would be my first instinct.
Another point: I noticed that the upper and lower limits for a couple of your actions are the same (e.g. 440 and -1). Is that expected? You can see this is the case in the respective blue curves as well.
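You can verify this directly from the action spec you posted; the range is zero for the second and third actions:
actInfo.UpperLimit - actInfo.LowerLimit   % returns [3.3e-04; 0; 0; 4.5e-03]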
2 Comments
Vasu Sharma on 10 Feb 2024
Thanks for your response.
I revisited my reward formulation and there doesn't seem to be anything obviously wrong there. The reward is simply the negation of a squared-error cost term for the trajectory. Could you suggest some other things to check or tweak?
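Roughly, the per-step reward is computed along these lines (a simplified sketch with illustrative variable names):
% Simplified sketch of the reward (names illustrative): negated squared tracking error
e = IMEP_ref - IMEP;   % load tracking error at the current engine cycle
reward = -e^2;         % penalizes deviation from the reference trajectory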
There is, however, something odd that I have noticed. The reward signal goes into the RL Agent block in Simulink as an input, and on a scope I see that it does not show anything for the first time step. See the attached image below.
However, when the episode ends it usually takes on a value, and this one is quite high compared to the other reward values in the episode. I have tried my best to capture it, but since this is instantaneous, it's tricky to capture. I have attached the images, but you can also refer to the image below (look at the initial time step).
I am wondering if this is normal. Is this somehow related to Q0, or is it just the normal way the scopes work? What would be your opinion?
Thanks in advance,
Vasu
Emmanouil Tzorakoleftherakis
What you mention seems normal since the agent needs to take a step first to be able to collect a reward. Two things I would look at next are:
1) The upper and lower limits:
actInfo = rlNumericSpec([4 1],'LowerLimit',[0.17e-3;440;-1;1e-3],'UpperLimit',[0.5e-3;440;-1;5.5e-3]);
This line implies that the second and third actions have the same upper and lower limits, so their values are essentially always constrained to 440 and -1, respectively. There is no reason to use those as actions if that's the case.
2) Your neural network architectures seem unnecessarily large. Also, I wouldn't use a sequenceInputLayer since these are not LSTM networks; a featureInputLayer would work. In fact, I would let the software come up with an initial architecture and modify it afterwards if needed. Take a look here to see how this can be accomplished.
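For example, something along these lines would drop the two fixed actions and let the toolbox generate default actor and critic networks from the specs (a minimal sketch, assuming R2020b or later and that the two fixed inputs are handled as constants inside the Simulink model):
% Minimal sketch (assumption): keep only the two actions with a non-zero range
actInfo = rlNumericSpec([2 1],'LowerLimit',[0.17e-3;1e-3],'UpperLimit',[0.5e-3;5.5e-3]);
% Let the toolbox build default actor/critic networks, then tweak them later if needed
initOpts = rlAgentInitializationOptions('NumHiddenUnit',64);
agent = rlDDPGAgent(obsInfo,actInfo,initOpts);
% Inspect the generated networks before modifying them
criticNet = getModel(getCritic(agent));
actorNet = getModel(getActor(agent));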
