TD3 algorithm: actions always output at the boundary values during training

泽宇 2024-2-29
Commented: 泽宇 2024-4-23
After training with the TD3 algorithm, the actions always come out at the boundary values or are completely wrong, regardless of whether the reward curve converged during training. My state values are in the range 0-20000 and the action bounds are 0-15000. Where is the problem: is my custom environment created incorrectly, or is it something else? Do I need to normalize the inputs and outputs?

Answer (1)

UDAYA PEDDIRAJU 2024-3-14
Hi 泽宇,
Regarding your issue with the TD3 algorithm, where the actions always output at the boundary values regardless of whether the reward curve converges, it's essential to investigate a few potential factors:
  1. Action Bounds: Ensure that the action bounds are correctly defined. If the boundaries are too restrictive, the agent might struggle to learn effective actions.
  2. Normalization: Normalizing the inputs and outputs can significantly impact training stability. Consider normalizing both state and action values to a common range (e.g., [0, 1]); see the sketch after this list.
  3. Custom Environment: Verify that your custom environment is correctly implemented. Double-check the reward function, state representation, and action space.
  4. Exploration Noise: TD3 relies on exploration noise to encourage exploration. Ensure that the noise level is appropriate during training.
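For points 2 and 4, here is a minimal sketch of one way to set this up with the Reinforcement Learning Toolbox. The scalar signal sizes, the normalized ranges, and the noise values are assumptions for illustration, not taken from your environment:

% Define normalized observation and action specs (assumed scalar signals).
obsInfo = rlNumericSpec([1 1], 'LowerLimit', 0, 'UpperLimit', 1);
actInfo = rlNumericSpec([1 1], 'LowerLimit', -1, 'UpperLimit', 1);

% Configure exploration noise on the normalized [-1, 1] action scale.
agentOpts = rlTD3AgentOptions;
agentOpts.ExplorationModel.StandardDeviation = 0.1;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-4;

% Inside the environment step function, map the normalized action back to
% engineering units, for example: physicalAction = (Action + 1)/2 * 15000;

With this kind of setup the agent only ever sees values on the order of 1, which tends to keep the actor from saturating at its bounds; the scaling back to the 0-15000 range happens inside the environment.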
  1 Comment
泽宇 2024-4-23
Thank you very much for your answer. My problem is still unsolved. I am using the MATLAB Reinforcement Learning Toolbox to train an agent in a custom environment. In the first training episode (before any reward has been obtained), the agent outputs actions within the action constraint range; but from the second episode onwards (after receiving the reward from the first episode), the agent outputs actions at the boundary of the constraint range, and it keeps doing so in every later episode. Why is this? I can share my simplified code; could you help me see where the problem is?

function [Observation, Reward, IsDone, NextState] = newgoushi(Action, State)
E = State;
%% Reward
GT = 1000*Action(1);
NextState = E - GT;
if GT - E < 0.1
    Reward = 0;
else
    Reward = -1;
end
IsDone = Reward >= 0;
Observation = NextState;
end
My action is continuous, constrained to the range 0-12000, and my state is also continuous, constrained to the range 5000-10000.
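A hedged sketch of how the step function above could return a normalized observation and use a symmetric tolerance in the reward test; the scaling constant and the abs() tolerance are assumptions, not part of the original code. Note that, as written, the one-sided test GT - E < 0.1 also returns the maximum reward (and ends the episode) whenever GT is far below E, which may be related to the behavior described:

function [Observation, Reward, IsDone, NextState] = newgoushi2(Action, State)
% Hypothetical variant of the step function above: the observation is
% rescaled to roughly [0, 1] and the reward uses a symmetric tolerance.
stateMax = 10000;                  % upper bound of the state range (assumption)
E = State;
GT = 1000*Action(1);
NextState = E - GT;
if abs(GT - E) < 0.1               % symmetric tolerance instead of a one-sided test
    Reward = 0;
else
    Reward = -1;
end
IsDone = Reward >= 0;
Observation = NextState/stateMax;  % normalized observation returned to the agent
end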


Release

R2023a
