Can the reset function of the environment for deep reinforcement learning incorporate feedback from the agent's state and reward?

Currently, I am designing a control system using deep reinforcement learning (DDPG) with the Reinforcement Learning Toolbox in MATLAB/Simulink. Since there are uncertainties in the plant parameters (for example, the mass), I apply domain randomization to improve the agent's robustness to these uncertainties. Specifically, I randomly vary the plant mass at the beginning of each training episode, which is implemented in the environment's reset function.
At present, the range of parameter variation is fixed, and the mass is randomly sampled from a uniform distribution within this range. In the future, I would like to adaptively adjust this range of variation between episodes. More concretely, I aim to design an adaptation algorithm in which the range of variation for a given episode is determined based on the episode reward (or its moving average) achieved by the agent in the previous episode.
For example, if the episode reward is below a certain threshold, the mass variation range could be ±10%; if it is above the threshold, the range could be increased to ±20% or more.
However, I am not sure how to feed the agent's performance (e.g., the episode reward) or state (e.g., the control output) back into the reset function. How can I implement this kind of adaptive domain randomization in MATLAB/Simulink? I am currently training the agent with the train function. Is it possible to implement the adaptive domain randomization described above within this standard training framework?
obsInfo = rlNumericSpec([6 1]);
obsInfo.Name = "observations";
actInfo = rlNumericSpec([1 1]);
actInfo.Name = "control input";

mdl = 'SIM_RL'; % Plant + RL Agent block
env = rlSimulinkEnv(mdl, "SIM_RL/Agent/RL Agent", obsInfo, actInfo);

% Domain randomization: reset function
env.ResetFcn = @(in)localResetFcn(in);

function in = localResetFcn(in)
    % Fixed variation range of the plant parameter
    M_min = Nominal_value*(1 - 0.5); % -50% of nominal mass
    M_max = Nominal_value*(1 + 0.5); % +50% of nominal mass
    % <------ We would like to adaptively change this variation range.

    % Randomize the mass within [M_min, M_max]
    randomValue_M = M_min + (M_max - M_min)*rand;
    in = setBlockParameter(in, ...
        "SIM_RL/Plant/Mass", ...
        Value=num2str(randomValue_M));
end
% The construction of the critic Network is omitted here.
% ....
criticNet = initialize(criticNet);
critic = rlQValueFunction(criticNet,obsInfo,actInfo);
% The construction of the actor Network is omitted here.
% ....
actorNet = initialize(actorNet);
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
% Agent
criticOpts = rlOptimizerOptions(LearnRate=1e-04,GradientThreshold=1);
actorOpts  = rlOptimizerOptions(LearnRate=1e-04,GradientThreshold=1);
agentOpts  = rlDDPGAgentOptions( ...
    SampleTime=0.01, ...
    CriticOptimizerOptions=criticOpts, ...
    ActorOptimizerOptions=actorOpts, ...
    ExperienceBufferLength=1e5, ...
    DiscountFactor=0.99, ...
    MiniBatchSize=128, ...
    TargetSmoothFactor=1e-3);
agent = rlDDPGAgent(actor,critic,agentOpts);

maxepisodes = 5000;
maxsteps = ceil(Simulation_End_Time/0.01);
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=maxepisodes, ...
    MaxStepsPerEpisode=maxsteps, ...
    ScoreAveragingWindowLength=5, ...
    Verbose=true, ...
    Plots="training-progress", ...
    StopTrainingCriteria="EpisodeCount", ...
    SaveAgentCriteria="EpisodeReward", ...
    SaveAgentValue=-1.0);
doTraining = true;
if doTraining
    evaluator = rlEvaluator( ...
        NumEpisodes=1, ...
        EvaluationFrequency=5);
    % Train the agent.
    trainingStats = train(agent,env,trainOpts,Evaluator=evaluator);
else
    % Load the pretrained agent.
    load("agent.mat","agent")
end

Answers (1)

Nithin on 11 Jun 2025
Edited: Nithin on 11 Jun 2025
Hi @平成,
The Reinforcement Learning Toolbox does not natively provide a mechanism for passing data from one episode to the next through the environment's "ResetFcn", since the "ResetFcn" is stateless by default. A workaround is to implement the adaptive domain randomization with a persistent variable or a global stateful mechanism outside the environment: store the reward of the previous episode there and read it inside your "ResetFcn" to adjust the mass range.
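For illustration, here is a minimal sketch of such a stateful store. The helper function previousEpisodeReward, the reward threshold, and the range percentages are placeholders introduced for this example; they are not part of the toolbox API.
function r = previousEpisodeReward(newValue)
    % Minimal persistent store for the most recent episode reward.
    % previousEpisodeReward(value) updates it; r = previousEpisodeReward() reads it.
    persistent lastReward
    if isempty(lastReward)
        lastReward = -inf;   % nothing stored yet (first episode)
    end
    if nargin > 0
        lastReward = newValue;
    end
    r = lastReward;
end
Inside the "ResetFcn", the stored reward can then drive the variation range, for example:
% Inside localResetFcn: widen the range once the stored reward passes a threshold
rewardThreshold = 100;                        % example threshold value
if previousEpisodeReward() < rewardThreshold
    rangePercent = 0.1;                       % +/-10% of nominal mass
else
    rangePercent = 0.2;                       % +/-20% of nominal mass
end
M_min = Nominal_value*(1 - rangePercent);
M_max = Nominal_value*(1 + rangePercent);
The open question is where previousEpisodeReward(totalReward) gets called after each episode; the built-in "train" function does not expose an obvious per-episode hook for this, which is why the steps below use a tracker class together with a custom training loop that updates the stored reward explicitly.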
Additionally, a better approach would be to create a custom class that tracks performance between episodes. Refer to the following steps to understand the general approach:
  • Create a Custom Tracker Class
classdef DomainRandomizationManager < handle
    properties
        rewardHistory = [];
        NominalValue  = 10;   % example nominal mass
    end
    methods
        function updateReward(obj, reward)
            % Store the latest episode reward
            obj.rewardHistory(end+1) = reward;
        end
        function [Mmin, Mmax] = getRandomizationRange(obj)
            if isempty(obj.rewardHistory)
                rangePercent = 0.1;
            else
                % Moving average over the last 5 episodes
                avgReward = mean(obj.rewardHistory(max(end-4,1):end));
                if avgReward < 100        % example reward threshold
                    rangePercent = 0.1;   % +/-10%
                else
                    rangePercent = 0.2;   % +/-20%
                end
            end
            Mmin = obj.NominalValue * (1 - rangePercent);
            Mmax = obj.NominalValue * (1 + rangePercent);
        end
    end
end
  • Use it in the Reset Function
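For example, the reset function could query the tracker through a shared handle. The sketch below assumes the manager instance is shared via a global variable named DRM, matching the training loop that follows:
function in = localResetFcn(in)
    % Shared tracker created before training starts (see the loop below)
    global DRM

    % Query the adaptively updated variation range
    [M_min, M_max] = DRM.getRandomizationRange();

    % Randomize the mass within the current range
    randomValue_M = M_min + (M_max - M_min)*rand;
    in = setBlockParameter(in, ...
        "SIM_RL/Plant/Mass", ...
        Value=num2str(randomValue_M));
end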
  • Hook into the training loop. MATLAB's built-in "train" function does not provide an easy way to feed information back between episodes, so for full flexibility you can switch to a custom training loop, sketched below. (In recent releases, the "setup", "runEpisode", and "cleanup" functions provide a documented custom-loop workflow; check the documentation for the exact API in your release.)
% Schematic custom training loop (pseudocode): the exact environment/agent
% interaction calls and the agent update step depend on the toolbox release,
% so check the custom training loop documentation before relying on this.
global DRM;
DRM = DomainRandomizationManager();

for ep = 1:maxepisodes
    experience = {};
    totalReward = 0;

    % Reset the environment; the ResetFcn queries DRM internally
    obs = reset(env);

    for t = 1:maxsteps
        action = getAction(agent, obs);
        [nextObs, reward, isDone, ~] = step(env, action);
        totalReward = totalReward + reward;
        experience = [experience; {obs, action, reward, nextObs}]; %#ok<AGROW>
        obs = nextObs;
        if isDone
            break;
        end
    end

    % Update the reward history so the next reset adapts the range
    DRM.updateReward(totalReward);

    % Update the agent from the collected experience here, for example via a
    % replay memory (rlReplayMemory) and the custom training loop workflow;
    % note that train(agent,env,trainOpts) does not accept raw experience data.
end
Refer to the MATLAB documentation for more information.
