function agents = createDDPGAgents(N)
% createDDPGAgents builds two identical DDPG agents for a 2-state plant
% with an N-dimensional continuous action space.
obsInfo = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1));
actInfo = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
% Critic network: separate observation and action input paths joined by a
% common path that outputs a scalar Q-value (hidden-layer sizes are assumed).
obsPath = featureInputLayer(prod(obsInfo.Dimension), Name="obsInLyr");
actPath = featureInputLayer(prod(actInfo.Dimension), Name="actInLyr");
commonPath = [
    concatenationLayer(1, 2, Name="concat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1)
    ];
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, commonPath);
criticNet = connectLayers(criticNet, "obsInLyr", "concat/in1");
criticNet = connectLayers(criticNet, "actInLyr", "concat/in2");
criticNet = dlnetwork(criticNet);
critic = rlQValueFunction(criticNet, obsInfo, actInfo, ...
    ObservationInputNames="obsInLyr", ...
    ActionInputNames="actInLyr");
% Sanity-check the critic with a random observation and action
getValue(critic, {rand(obsInfo.Dimension)}, {rand(actInfo.Dimension)})
% Actor network maps observations to actions (hidden-layer size is assumed)
actorNet = [
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(prod(actInfo.Dimension))];
actorNet = dlnetwork(actorNet);
actor = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);
agentOptions = rlDDPGAgentOptions( ...
    'DiscountFactor', 0.98, ...
    'MiniBatchSize', 128, ...
    'TargetSmoothFactor', 1e-3, ...
    'ExperienceBufferLength', 1e6);

% Ornstein-Uhlenbeck exploration noise; set before the agents are created so
% that the options take effect.
agentOptions.NoiseOptions.MeanAttractionConstant = 0.1;
agentOptions.NoiseOptions.StandardDeviation = 0.3;
agentOptions.NoiseOptions.StandardDeviationDecayRate = 8e-4;
agentOptions.NoiseOptions

agent1 = rlDDPGAgent(actor, critic, agentOptions);
agent2 = rlDDPGAgent(actor, critic, agentOptions);
agents = [agent1, agent2];
end
function [nextObs, reward, isDone, loggedSignals] = myStepFunction(action, loggedSignals, S)
% Advance the discrete-time plant one step using the first action element.
nextObs = S.A1d*loggedSignals.State + S.B1d*action(1);
loggedSignals.State = nextObs;
% Quadratic penalty on state deviation and control effort.
reward = -1*(1.01*nextObs(1)^2 + 1.01*nextObs(2)^2 + action(1)^2);
if abs(loggedSignals.State(1)) <= 0.05 && abs(loggedSignals.State(2)) <= 0.05
    reward = reward + 10;   % bonus near the origin (value assumed; not given in the original)
end
% End the episode once the state has settled close to the origin.
isDone = abs(loggedSignals.State(1)) <= 0.02 && abs(loggedSignals.State(2)) <= 0.02;
end
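
% The environment also needs a reset function. myResetFunction is referenced
% below but not shown in the original; a minimal sketch, assuming pos1 holds
% the initial 2x1 state vector:
function [initialObs, loggedSignals] = myResetFunction(pos1)
% Reset the logged state to the supplied initial condition.
loggedSignals.State = pos1;
initialObs = loggedSignals.State;
end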
% Environment specifications, function-handle environment, and training.
obsInfo1 = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1));
actInfo1 = rlNumericSpec([S.N 1],'LowerLimit',-100*ones(S.N,1),'UpperLimit',100*ones(S.N,1));
stepFn1 = @(action, loggedSignals) myStepFunction(action, loggedSignals, S);
resetFn1 = @() myResetFunction(pos1);
env = rlFunctionEnv(obsInfo1, actInfo1, stepFn1, resetFn1);

agents = createDDPGAgents(S.N);
trainOpts = rlTrainingOptions( ...
    MaxStepsPerEpisode=1000, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480, ...
    Plots="training-progress");
train(agents, env, trainOpts);
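
% A quick post-training check could look like the following sketch, reusing
% env and the first trained agent with the standard sim workflow:
simOpts = rlSimulationOptions(MaxSteps=1000);
experience = sim(env, agents(1), simOpts);
totalReward = sum(experience.Reward.Data);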