RL Agent learns a constant trajectory instead of actual trajectory
Hi,
I have a conceptual question about my problem. I am trying to learn an engine control model with a DDPG agent, where an LSTM model of my engine serves as the plant. I simulate the engine for a given random trajectory and use the engine outputs, along with the engine states (LSTM states) and the load trajectory, as the observations for my agent.
I am trying to train the DDPG agent to follow a reference load trajectory as shown below (dashed line in the top left graph). Despite trying various network architectures, noise options and learning rates, the trained agent just delivers a constant load of around 6 (orange line in the top left graph) rather than following the given reference trajectory. The outputs seem to vary reasonably (here in blue), but the learning is still not acceptable.
I change the trajectory every episode to aid learning, so that the agent sees various load profiles.
Could you kindly advise what might be going on here?
Additional information: the same effect happens if I ask the controller to match a constant load trajectory (constant per episode, then changed to another random constant for the next episode). I have attached my code here.
Thanks in advance :)
Code:
%% H2DF DDPG Trainer
%
% clc
% clear all
% close all
ObsInfo.Name = "Engine Outputs";
ObsInfo.Description = ' IMEP, NOX, SOOT, MPRR ';
%% Creating environment
obsInfo = rlNumericSpec([16 1],...
'LowerLimit',[-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf]',...
'UpperLimit',[inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]');
obsInfo.Name = "Engine Outputs";
obsInfo.Description = ' IMEP, NOX, SOOT, MPRR, IMEP_t-1,IMEP_ref,IMEP_ref_t-1, IMEP_error, states';
numObservations = obsInfo.Dimension(1);
actInfo = rlNumericSpec([4 1],'LowerLimit',[0.17e-3;440;-1;1e-3],'UpperLimit',[0.5e-3;440;-1;5.5e-3]);
actInfo.Name = "Engine Inputs";
actInfo.Description = 'DOI, P2M, SOI, DOI_H2';
numActions = actInfo.Dimension(1);
env = rlSimulinkEnv('MPC_RL_H2DF','MPC_RL_H2DF/RL Agent',...
obsInfo,actInfo);
env.ResetFcn = @(in)localResetFcn(in);
Ts = 0.08;
Tf = 20;
% 375 engine cycle results
rng(0)
% 1200 - 0.1| 1900: 0.06
%% Creating agent
L = 60; % number of neurons
statePath = [
sequenceInputLayer(numObservations, 'Normalization', 'none', 'Name', 'observation')
fullyConnectedLayer(L, 'Name', 'fc1')
reluLayer('Name', 'relu1')
fullyConnectedLayer(L, 'Name', 'fc11')
reluLayer('Name', 'relu11')
fullyConnectedLayer(2*L, 'Name', 'fc12')
reluLayer('Name', 'relu12')
fullyConnectedLayer(4*L, 'Name', 'fc15')
reluLayer('Name', 'relu15')
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(2*L)
reluLayer
fullyConnectedLayer(L, 'Name', 'fc2')
concatenationLayer(1,2,'Name','add')
reluLayer('Name','relu2')
fullyConnectedLayer(L, 'Name', 'fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
gruLayer(4)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(L, 'Name', 'fc7')
reluLayer('Name','relu7')
fullyConnectedLayer(1, 'Name', 'fc4','BiasInitializer','ones','WeightsInitializer','he')];
actionPath = [
sequenceInputLayer(numActions, 'Normalization', 'none', 'Name', 'action')
fullyConnectedLayer(2*L, 'Name', 'fc6')
reluLayer('Name','relu6')
fullyConnectedLayer(4*L, 'Name', 'fc13')
reluLayer('Name','relu13')
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(2*L, 'Name', 'fc14')
reluLayer('Name','relu14')
fullyConnectedLayer(L, 'Name', 'fc5','BiasInitializer','ones','WeightsInitializer','he')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');
figure
plot(criticNetwork)
criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
'Observation',{'observation'},'Action',{'action'},criticOptions);
%%
actorNetwork = [
sequenceInputLayer(numObservations, 'Normalization', 'none', 'Name', 'observation')
fullyConnectedLayer(L, 'Name', 'fc1')
reluLayer('Name', 'relu1')
fullyConnectedLayer(2*L, 'Name', 'fc2')
reluLayer('Name', 'relu2')
fullyConnectedLayer(2*L, 'Name', 'fc3')
reluLayer('Name', 'relu3')
fullyConnectedLayer(4*L, 'Name', 'fc8')
reluLayer('Name', 'relu8')
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
gruLayer(4)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(2*L, 'Name', 'fc9')
reluLayer('Name', 'relu9')
fullyConnectedLayer(L, 'Name', 'fc10')
reluLayer('Name', 'relu10')
fullyConnectedLayer(numActions, 'Name', 'fc4')
tanhLayer('Name','tanh1')
scalingLayer('Name','ActorScaling1','Scale',-(actInfo.UpperLimit-actInfo.LowerLimit)/2,'Bias',(actInfo.UpperLimit+actInfo.LowerLimit)/2)];
actorOptions = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation',{'observation'},'Action',{'ActorScaling1'},actorOptions);
%% Deep Deterministic Policy Gradient (DDPG) agent
agentOpts = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'TargetSmoothFactor',1e-3,...
'DiscountFactor',0.99, ...
'MiniBatchSize',128, ...
'SequenceLength',8,...
'ExperienceBufferLength',1e5, ...
'TargetUpdateFrequency', 10);
% agentOpts.NoiseOptions.Variance =
% [0.005*(70/sqrt(Ts));0.005*(12/sqrt(Ts));0.005*(0.4/sqrt(Ts))] v01
agentOpts.NoiseOptions.Variance = [1e-5*0.0025*(80/sqrt(Ts));1e2*0.003*(12/sqrt(Ts));30*0.003*(0.4/sqrt(Ts));1e-1*0.00025*(0.4/sqrt(Ts))];
agentOpts.NoiseOptions.Variance =10*[1.65000000000000e-05;0;0;0.000225000000000000];
agentOpts.NoiseOptions.VarianceDecayRate = [1e-5;1e-5;1e-5;1e-5];
criticOptions.UseDevice = "gpu";
actorOptions.UseDevice = "gpu";
% agent = rlDDPGAgent(actor,critic,agentOpts);
% variance*ts^2 = (0.01 - 0.1)*(action range)
% At each sample time step, the noise model is updated using the following formula, where Ts is the agent sample time.
%
% x(k) = x(k-1) + MeanAttractionConstant.*(Mean - x(k-1)).*Ts
% + Variance.*randn(size(Mean)).*sqrt(Ts)
% At each sample time step, the variance decays as shown in the following code.
%
% decayedVariance = Variance.*(1 - VarianceDecayRate);
% Variance = max(decayedVariance,VarianceMin);
% For continuous action signals, it is important to set the noise variance appropriately to encourage exploration. It is common to have Variance*sqrt(Ts) be between 1% and 10% of your action range.
%
% If your agent converges on local optima too quickly, promote agent exploration by increasing the amount of noise; that is, by increasing the variance. Also, to increase exploration, you can reduce the VarianceDecayRate.
%% Training agent
maxepisodes = 10000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes, ...
'MaxStepsPerEpisode',maxsteps, ...
'ScoreAveragingWindowLength',100, ...
'Verbose',true, ...
'UseParallel',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',0,...
'SaveAgentCriteria','EpisodeReward','SaveAgentValue',-0.05');
%%
% % Set to true, to resume training from a saved agent
resumeTraining = false;
% % Set ResetExperienceBufferBeforeTraining to false to keep experience from the previous session
agentOpts.ResetExperienceBufferBeforeTraining = ~(resumeTraining);
if resumeTraining
% Load the agent from the previous session
sprintf('- Resume training of: %s', 'agentV04.mat');
trained_agent = load('D:\Masters\HiWi\h2dfannbasedmpc\acados_implementation\rl\savedAgents\Agent253.mat');
agent = trained_agent.saved_agent ;
else
% Create a fresh new agent
agent = rlDDPGAgent(actor, critic, agentOpts);
end
% agent = rlDDPGAgent(actor, critic, agentOpts);
% agent = rlDDPGAgent(actor,critic,agentOpts);
%% Train the agent
trainingStats = train(agent, env, trainOpts);
%trainingStats = train(agent,env,trainOpts);
% get the agent's actor, which predicts next action given the current observation
actor = getActor(agent);
% get the actor's parameters (neural network weights)
%actorParams = getLearnableParameterValues(actor);
Answers (1)
Emmanouil Tzorakoleftherakis
2024-1-31
Edited: Emmanouil Tzorakoleftherakis
2024-1-31
Thanks for adding all the details. The first thing I will say is that the average reward in the Episode Manager is moving in the right direction. So from the point of view of the RL algorithm, it's learning... something. If that something is not what you expect, you should revisit the reward signal you have put together and make sure it makes sense. That would be my first instinct.
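One way to sanity-check a reward signal, as a rough sketch only: compare the episode return of perfect tracking with the return of a constant output. The reward in your model was not posted, so the squared-error reward and the stepwise reference below are both hypothetical stand-ins.

```matlab
% Hypothetical stepwise load reference (illustrative values only,
% not taken from the posted model).
ref = [3*ones(100,1); 7*ones(100,1); 5*ones(50,1)];

% Hypothetical tracking reward, summed over one episode:
% negative squared error between output y and reference r.
episodeReturn = @(y, r) -sum((y - r).^2);

% Return when the output follows the reference exactly.
trackReturn = episodeReturn(ref, ref);

% Return when the output sits at the mean of the reference.
constOutput = mean(ref) * ones(size(ref));
constReturn = episodeReturn(constOutput, ref);

fprintf('Perfect tracking return: %g\n', trackReturn);
fprintf('Constant output return:  %g\n', constReturn);
```

If the gap between the two returns is small compared with the other terms in the actual reward, a constant output is a plausible local optimum for the agent.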
Another point, I noticed that the upper and lower limits for a couple of your actions are the same (e.g. 440 and -1). Is that expected? You can see this is the case in the respective blue curves as well.
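A quick way to confirm this, as a minimal sketch using only the limits from your rlNumericSpec call and the Scale and Bias expressions from your scalingLayer (the variable names are mine):

```matlab
% Limits and action names copied from the posted rlNumericSpec call.
lowerLimit  = [0.17e-3; 440; -1; 1e-3];
upperLimit  = [0.5e-3;  440; -1; 5.5e-3];
actionNames = ["DOI"; "P2M"; "SOI"; "DOI_H2"];

% An action whose limits coincide has zero range.
isFixed = upperLimit == lowerLimit;

% Scale and Bias as computed in the posted scalingLayer.
scale = -(upperLimit - lowerLimit)/2;
bias  =  (upperLimit + lowerLimit)/2;

fprintf('Fixed actions: %s\n', strjoin(actionNames(isFixed), ', '));
fprintf('Free actions:  %s\n', strjoin(actionNames(~isFixed), ', '));
```

For the fixed actions the scale is zero, so the actor output for them is just the bias (440 and -1), whatever the network computes.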
2 Comments
Emmanouil Tzorakoleftherakis
2024-2-11
What you mention seems normal since the agent needs to take a step first to be able to collect a reward. Two things I would look at next are:
1) The upper and lower limits:
actInfo = rlNumericSpec([4 1],'LowerLimit',[0.17e-3;440;-1;1e-3],'UpperLimit',[0.5e-3;440;-1;5.5e-3]);
This line implies that the second and third actions have identical upper and lower limits, so their values are always constrained to 440 and -1. There is no reason to use those as actions if that's the case.
2) Your neural network architectures seem unnecessarily large. Plus, I wouldn't use a sequenceInputLayer since these are not LSTM networks. A featureInputLayer would work. In fact, I would let the software come up with an initial architecture and modify it afterwards if needed. Take a look here to see how this can be accomplished.
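Putting the two points together, a minimal sketch could look like the following. It assumes Reinforcement Learning Toolbox R2020b or later, and it assumes P2M = 440 and SOI = -1 are fed to the plant as constants inside the Simulink model instead of coming from the agent. The hidden unit count of 64 is an arbitrary starting value.

```matlab
% Observation spec as in the question: 16 unbounded signals.
obsInfo = rlNumericSpec([16 1]);
obsInfo.Name = "Engine Outputs";

% Keep only the two actions that have a range (DOI and DOI_H2).
% Assumption: P2M and SOI become constants in the Simulink model.
actInfo = rlNumericSpec([2 1], ...
    'LowerLimit', [0.17e-3; 1e-3], ...
    'UpperLimit', [0.5e-3; 5.5e-3]);
actInfo.Name = "Engine Inputs";

% Let the toolbox generate default actor and critic networks.
initOpts = rlAgentInitializationOptions('NumHiddenUnit', 64);

% Agent options reduced to the non-recurrent settings from the question.
agentOpts = rlDDPGAgentOptions( ...
    'SampleTime', 0.08, ...
    'DiscountFactor', 0.99, ...
    'MiniBatchSize', 128);

agent = rlDDPGAgent(obsInfo, actInfo, initOpts, agentOpts);

% Inspect the generated networks before changing anything.
actorNet  = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
disp(actorNet.Layers)
disp(criticNet.Layers)
```

The environment would need the same two-element action spec, and the RL Agent block output in the model would have to be rewired accordingly.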