rlDDPGAgent learns to generate extreme, low-reward outputs during training.

3 views (last 30 days)
I have been working on an RL project for data center cooling, and after setting up the environment the agent is giving me some problems. When I run the agent with the default function values it gives varied outputs between 0.1 and 1, as expected, but as soon as I start training it quickly diverges to either extreme.
Here's a run without training:
Here's what the first training episode looks like:
After this, running the agent normally shows its outputs are either 1 or 0.1, with none of the jumping around we observe during training.
I have seen many responses to similar problems saying the cause is too large a standard deviation for the exploration noise, but I've gone as far as setting it to zero and still get the same results.
Here's my initialization code:
% Observation: 15 values in [0.01, 1]; action: 15 values in [0.1, 1]
obsInfo = rlNumericSpec([15 1],"LowerLimit",0.01,"UpperLimit",1);
actInfo = rlNumericSpec([15 1],"LowerLimit",0.1,"UpperLimit",1);
% DDPG options: 1 s sample time, 900-step lookahead, keep the experience buffer
opt = rlDDPGAgentOptions("SampleTime",1,"NumStepsToLookAhead",900, ...
    "ResetExperienceBufferBeforeTraining",false,"SaveExperienceBufferWithAgent",true);
opt.NoiseOptions.StandardDeviation = 0.003;   % exploration noise
agent = rlDDPGAgent(obsInfo,actInfo,opt);
% Move the actor and critic to the GPU and set their learning rates
actor = getActor(agent);
actor.Options.UseDevice = 'gpu';
actor.Options.LearnRate = 0.05;
critic = getCritic(agent);
critic.Options.UseDevice = 'gpu';
critic.Options.LearnRate = 0.0005;
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
% Simulink environment and training options
env = rlSimulinkEnv("collagenSim","collagenSim/Collagen",obsInfo,actInfo);
opts = rlTrainingOptions("MaxStepsPerEpisode",1800,"MaxEpisodes",2000,"StopTrainingValue",3600);
train(agent,env,opts);
This is all quite troubling since I have to turn in the project soon and training already takes long enough. If anyone could help me solve this it would be greatly appreciated.

Answers (1)

Alan on 14 Mar 2024
Hi Genis,
This seems like an interesting application of RL. This answer may be late, since you mentioned you have to turn in the project soon, but let me give it a shot anyway.
It would be easier to find the root cause if you could share your Simulink model, or some more information about the behaviour of the environment.
I am assuming that the rewards are proportional to the error between the observed and target temperatures, and there are 15 sensors and 15 air conditioners that can be mapped to the observation and action spaces.
There are a few things you can try out:
  1. Reduce the SampleTime parameter of the DDPG agent: it is possible that the agent's update rate is too low. By the time a cooler is switched on or off, the temperature has already deviated far from the target, which pushes the agent toward the extreme actions. A smaller SampleTime lets the agent act and update more often. Also, experiment with the TargetSmoothFactor parameter, which dictates the rate at which the target networks are updated (see the option-tuning sketch after this list).
  2. Bring the learning rates closer together: currently, the actor and critic learning rates differ by a factor of 100, which can cause instability during training. Consider bringing them closer together (maybe both to 0.05).
  3. Better modelling of the reward function: the reward function dictates the way the agent learns, so try modifying it to better suit the environment. For example, you could also penalize the cooler for over-cooling and the resulting higher power consumption (see the reward sketch below).
  4. Try out other agents: if tuning parameters does not help, there are various other agents that support a continuous action space which you could try out: https://www.mathworks.com/help/deeplearning/ug/reinforcement-learning-using-deep-neural-networks.html
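As a concrete starting point for points 1 and 2, here is a minimal sketch of the option and learning-rate changes, written against the same calls as your initialization code. The specific values (SampleTime of 0.5, TargetSmoothFactor of 1e-3) are illustrative assumptions, not values validated on your model:
% Faster agent updates and slower target-network updates (illustrative values)
opt = rlDDPGAgentOptions("SampleTime",0.5, ...
    "TargetSmoothFactor",1e-3, ...
    "NumStepsToLookAhead",900, ...
    "ResetExperienceBufferBeforeTraining",false, ...
    "SaveExperienceBufferWithAgent",true);
opt.NoiseOptions.StandardDeviation = 0.003;
agent = rlDDPGAgent(obsInfo,actInfo,opt);
% Equal actor and critic learning rates, per point 2 (tune as needed)
actor = getActor(agent);
critic = getCritic(agent);
actor.Options.LearnRate = 0.05;
critic.Options.LearnRate = 0.05;
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
% If DDPG keeps saturating, a default TD3 agent accepts the same
% continuous observation and action specs (point 4):
% agent = rlTD3Agent(obsInfo,actInfo);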
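And for point 3, a hedged sketch of a shaped reward, assuming the 15 observations are normalized temperatures with a common setpoint and the 15 actions are normalized cooler commands. The function name, signature, and weights below are hypothetical; something like it would sit in a MATLAB Function block (or equivalent) feeding the reward port of your model:
function r = coolingReward(Tobs,Tset,u)
% Hypothetical shaped reward: penalize temperature error and cooling effort.
    trackingPenalty = sum((Tobs - Tset).^2);   % deviation from setpoint
    powerPenalty    = 0.1*sum(u);              % effort term discourages over-cooling
    r = -(trackingPenalty + powerPenalty);     % maximized when both penalties are small
end
The relative weight (0.1 here) trades off tracking accuracy against power use and would need tuning for your plant.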
Also, the exact shape of the action plot should not matter as long as the rewards are maximized.
I hope this helped.
