Reinforcement Learning Agents: PG & AC output only NaN actions, DDPG and TD3 work with the same environment

For context:
I have implemented an environment that models the energy supply of a single-family house. The system consists of PV modules, a battery, an electric heater, a hot water tank and a gas heater. A market situation is assumed in which there are variable electricity prices (both for purchase and for feed-in). The idea is to use RL to control the system. The observations are the price forecasts, feed-in forecasts (PV) and load forecasts (electricity and heat). All energy flows are possible, e.g. PV direct supply (electricity), PV direct supply (heat, via the electric heater), charging the battery (from PV/grid), charging the heat storage (from PV/grid, via the electric heater), electricity feed-in to the grid (PV direct and from the battery), ...
The actions are numeric, between 0 and 1. The idea is that, in sequence, the agent always chooses between 0 and 100% of the power available for a given purpose.
Example:
%% Step function (excerpt):
% First line: PV power used for load coverage: minimum of the product
% Action(1) (between 0 and 1) times the PV generation, and the load demand.
PV_Load = min([(Action(1)*PV_gen) Load_sys]);
% Second line: coverage of the remaining load by available battery power:
% minimum of Action(2) times the available battery charge, the remaining
% load, and the maximum battery power.
Batt_Load = min([Action(2)*Batt_stor (Load_sys-PV_Load) this.Batt_P]);
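For reference, a minimal sketch of how the action specification for such an environment might be declared; the number of actions and the name used here are assumptions for illustration, not taken from the original post:
% Hypothetical action specification: numActions continuous actions in [0, 1].
numActions = 6;                                  % assumed number of energy-flow actions
actInfo = rlNumericSpec([numActions 1], ...
    'LowerLimit', 0, 'UpperLimit', 1);
actInfo.Name = 'energy flow fractions';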
I hope the description is sufficient for a first impression.
Now I have tried to train different agents.
With DDPG and TD3, the first results are basically plausible. With PG and AC, all actions are output as NaN.
Can anyone give me a clue on this basis?

Answers (1)

Aiswarya on 26 Oct 2023
Hi,
I understand that you are trying to set the action output for your model using different RL agents. You observe that DDPG and TD3 set the action output correctly, whereas PG and AC do not. The cause of the actions being output as NaN is specific to your code and cannot be determined without the data. However, the different behaviour of the agents can be explained as follows:
Different agents handle action output bounds differently. DDPG and TD3 are off-policy agents, and they clip all actions to the limits; the limits can simply be set using rlNumericSpec. AC and PG, on the other hand, are on-policy agents. These agents do not enforce the constraints set in the action specification. If you want to enforce these limits, you have to do it explicitly on the environment side.
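As an illustration, one way to enforce the limits on the environment side is to saturate the incoming actions at the top of the step function. This is a minimal sketch; the variable names follow the code excerpt above and are otherwise assumptions:
% Clip the agent's actions to the valid range [0, 1] before using them,
% because PG and AC do not enforce the limits from the action specification.
Action = min(max(Action, 0), 1);
PV_Load = min([(Action(1)*PV_gen) Load_sys]);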
One alternative is to set agent.UseExplorationPolicy = false after training, so that the agent uses only the mean of its action distribution and the actions stay within the limits. You may refer to the documentation of the PG agent for more information: https://www.mathworks.com/help/reinforcement-learning/ug/pg-agents.html
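For example, a minimal sketch assuming a trained PG agent in the variable agent and the environment in env (the episode length is an assumption):
% Use the deterministic (mean) action instead of sampling during simulation.
agent.UseExplorationPolicy = false;
simOptions = rlSimulationOptions('MaxSteps', 500);   % assumed episode length
experience = sim(env, agent, simOptions);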

Release

R2021b
