
Generate Reward Function from a Model Predictive Controller for a Servomotor

This example shows how to automatically generate a reward function from cost and constraint specifications defined in a model predictive controller object. You then use the generated reward function to train a reinforcement learning agent.

Introduction

You can use the generateRewardFunction function to generate a reward function for reinforcement learning, starting from the cost and constraints specified in a model predictive controller. The resulting reward signal is the negative sum of the cost (as defined by the objective function) and penalties for constraint violations, both computed from the current state of the environment.
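
Concretely, the reward computed by the generated function (shown later in this example) has the form

reward = -(Jy + Jmv + Jmvrate + Py + Pmv + Pmvrate)

where Jy, Jmv, and Jmvrate are the weighted quadratic costs on the output tracking error, the manipulated variable, and the manipulated variable rate, and Py, Pmv, and Pmvrate are the corresponding penalties for violating the output, input, and input-rate bounds.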

This example is based on the DC Servomotor with Constraint on Unmeasured Output (Model Predictive Control Toolbox) example, in which you design a model predictive controller for a DC servomechanism under voltage and shaft torque constraints. Here, you convert the cost and constraint specifications defined in the mpc object into a reward function and use it to train an agent to control the servomotor.

Open the Simulink model for this example, which is based on the MPC example above but has been modified for reinforcement learning.

open_system('rl_motor')

Create Model Predictive Controller

Create the open-loop dynamic model of the motor, plant, and the maximum admissible torque, tau, using a helper function.

[plant,tau] = mpcmotormodel;

Specify the input and output signal types for the MPC controller. The first output, the shaft angular position, is measured. The second output, torque, is unmeasured.

plant = setmpcsignals(plant,'MV',1,'MO',1,'UO',2);

Specify constraints on the manipulated variable, and define a scale factor.

MV = struct('Min',-220,'Max',220,'ScaleFactor',440);

Impose torque constraints during the first three prediction steps, and specify scale factors for both the shaft position and the torque.

OV = struct('Min',{-Inf, [-tau;-tau;-tau;-Inf]},...
            'Max',{Inf, [tau;tau;tau;Inf]},...
            'ScaleFactor',{2*pi, 2*tau});

Specify weights for the quadratic cost function to achieve angular position tracking. Set to zero the weight for the torque, thereby allowing it to float within its constraint.

Weights = struct('MV',0,'MVRate',0.1,'OV',[0.1 0]);

Create an MPC controller for the plant model with a sample time of 0.1 s, a prediction horizon of 10 steps, and a control horizon of 2 steps, using the previously defined structures for the weights, manipulated variables, and output variables.

mpcobj = mpc(plant,0.1,10,2,Weights,MV,OV);
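
The positional arguments of the mpc command map directly to controller properties. As an alternative sketch (not required for this example), you could create the controller with only the plant and sample time, and then set the same specifications through its properties.

mpcobjAlt = mpc(plant,0.1);                       % plant model and sample time only
mpcobjAlt.PredictionHorizon = 10;                 % prediction horizon (steps)
mpcobjAlt.ControlHorizon = 2;                     % control horizon (steps)
mpcobjAlt.Weights.ManipulatedVariables = 0;       % same weights as the Weights structure
mpcobjAlt.Weights.ManipulatedVariablesRate = 0.1;
mpcobjAlt.Weights.OutputVariables = [0.1 0];
mpcobjAlt.ManipulatedVariables.Min = -220;        % same bounds and scaling as the MV structure
mpcobjAlt.ManipulatedVariables.Max = 220;
mpcobjAlt.ManipulatedVariables.ScaleFactor = 440;
mpcobjAlt.OutputVariables(1).ScaleFactor = 2*pi;  % same bounds and scaling as the OV structure
mpcobjAlt.OutputVariables(2).Min = [-tau;-tau;-tau;-Inf];
mpcobjAlt.OutputVariables(2).Max = [tau;tau;tau;Inf];
mpcobjAlt.OutputVariables(2).ScaleFactor = 2*tau;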

Display the controller specifications.

mpcobj
 
MPC object (created on 01-Sep-2021 14:18:45):
---------------------------------------------
Sampling time:      0.1 (seconds)
Prediction Horizon: 10
Control Horizon:    2

Plant Model:        
                                      --------------
      1  manipulated variable(s)   -->|  4 states  |
                                      |            |-->  1 measured output(s)
      0  measured disturbance(s)   -->|  1 inputs  |
                                      |            |-->  1 unmeasured output(s)
      0  unmeasured disturbance(s) -->|  2 outputs |
                                      --------------
Indices:
  (input vector)    Manipulated variables: [1 ]
  (output vector)        Measured outputs: [1 ]
                       Unmeasured outputs: [2 ]

Disturbance and Noise Models:
        Output disturbance model: default (type "getoutdist(mpcobj)" for details)
         Measurement noise model: default (unity gain after scaling)

Weights:
        ManipulatedVariables: 0
    ManipulatedVariablesRate: 0.1000
             OutputVariables: [0.1000 0]
                         ECR: 10000

State Estimation:  Default Kalman Filter (type "getEstimator(mpcobj)" for details)

Constraints:
 -220 <= MV1 (V) <= 220, MV1/rate (V) is unconstrained,       MO1 (rad) is unconstrained
                                                        -78.54 <= UO1 (Nm)(t+1) <= 78.54
                                                        -78.54 <= UO1 (Nm)(t+2) <= 78.54
                                                        -78.54 <= UO1 (Nm)(t+3) <= 78.54
                                                          UO1 (Nm)(t+4) is unconstrained

The controller operates on a plant with 4 states, 1 input (voltage) and 2 output signals (angle and torque) and has the following specifications:

  • The cost function weights for the manipulated variable, manipulated variable rate and output variables are 0, 0.1 and [0.1 0] respectively.

  • The manipulated variable is constrained between -220 V and 220 V.

  • The manipulated variable rate is unconstrained.

  • The first output variable (angle) is unconstrained, but the second output (torque) is constrained between -78.54 Nm and 78.54 Nm during the first three prediction time steps and unconstrained from the fourth step onward.

Note that for reinforcement learning, only the constraint specification from the first prediction time step is used, because the reward is computed for a single time step.

Generate the Reward Function

Generate the reward function code from specifications in the mpc object using generateRewardFunction. The code is displayed in the MATLAB Editor.

generateRewardFunction(mpcobj)

The generated reward function is a starting point for reward design. You can modify the function with different penalty function choices and tune the weights. For this example, make the following changes to the generated code:

  • Scale the original cost weights Qy, Qmv and Qmvrate by a factor of 100.

  • The default exterior penalty function method is step. Change the method to quadratic.

After you make changes, the cost and penalty specifications should be as follows:

Qy      = [10 0];
Qmv     = 0;
Qmvrate = 10;
Py      = Wy      * exteriorPenalty(y,ymin,ymax,'quadratic');
Pmv     = Wmv     * exteriorPenalty(mv,mvmin,mvmax,'quadratic');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'quadratic');
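
The change from step to quadratic affects how constraint violations are penalized. The following sketch (with illustrative values not used elsewhere in this example) contrasts the two methods for the torque bound: the step method returns a constant penalty for any violation, while the quadratic method returns a penalty that grows with the squared size of the violation, giving the agent a smoother learning signal near the constraint boundary.

tauMax = 78.54;                                             % torque bound from the MPC object
torque = 85;                                                % example value that violates the bound
pStep = exteriorPenalty(torque,-tauMax,tauMax,'step');      % constant penalty for any violation
pQuad = exteriorPenalty(torque,-tauMax,tauMax,'quadratic'); % penalty grows with the violation size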

For this example, the modified code has been saved in the MATLAB function file rewardFunctionMpc.m. Display the modified reward function.

type rewardFunctionMpc.m
function reward = rewardFunctionMpc(y,refy,mv,refmv,lastmv)
% REWARDFUNCTIONMPC generates rewards from MPC specifications.
%
% Description of input arguments:
%
% y : Output variable from plant at step k+1
% refy : Reference output variable at step k+1
% mv : Manipulated variable at step k
% refmv : Reference manipulated variable at step k
% lastmv : Manipulated variable at step k-1
%
% Limitations (MPC and NLMPC):
%     - Reward computed based on first step in prediction horizon.
%       Therefore, signal previewing and control horizon settings are ignored.
%     - Online cost and constraint update is not supported.
%     - Custom cost and constraint specifications are not considered.
%     - Time varying cost weights and constraints are not supported.
%     - Mixed constraint specifications are not considered (for the MPC case).

% Reinforcement Learning Toolbox
% 02-Jun-2021 16:05:41

%#codegen

%% Specifications from MPC object
% Standard linear bounds as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
ymin = [-Inf -78.5398163397448];
ymax = [Inf 78.5398163397448];
mvmin = -220;
mvmax = 220;
mvratemin = -Inf;
mvratemax = Inf;

% Scale factors as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
Sy  = [6.28318530717959 157.07963267949];
Smv = 440;

% Standard cost weights as specified in 'Weights' property
Qy      = [10 0];
Qmv     = 0;
Qmvrate = 10;

%% Compute cost
dy      = (refy(:)-y(:)) ./ Sy';
dmv     = (refmv(:)-mv(:)) ./ Smv';
dmvrate = (mv(:)-lastmv(:)) ./ Smv';
Jy      = dy'      * diag(Qy.^2)      * dy;
Jmv     = dmv'     * diag(Qmv.^2)     * dmv;
Jmvrate = dmvrate' * diag(Qmvrate.^2) * dmvrate;
Cost    = Jy + Jmv + Jmvrate;

%% Penalty function weight (specify nonnegative)
Wy = [1 1];
Wmv = 10;
Wmvrate = 10;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
%
% Set Pmv value to 0 if the RL agent action specification has
% appropriate 'LowerLimit' and 'UpperLimit' values.
Py      = Wy      * exteriorPenalty(y,ymin,ymax,'quadratic');
Pmv     = Wmv     * exteriorPenalty(mv,mvmin,mvmax,'quadratic');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'quadratic');
Penalty = Py + Pmv + Pmvrate;

%% Compute reward
reward = -(Cost + Penalty);
end
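
As a quick, optional sanity check (values chosen for illustration), you can call the reward function directly. At the reference, with no input change and all signals within their bounds, both the cost and the penalty are zero, so the reward is 0.

r0 = rewardFunctionMpc([pi 0],[pi 0],0,0,0)   % expected to return 0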

To integrate this reward function, open the MATLAB Function block in the Simulink model.

open_system('rl_motor/Reward Function')

Append the following line of code to the function and save the model.

r = rewardFunctionMpc(y,refy,mv,refmv,lastmv);

The MATLAB Function block will now execute rewardFunctionMpc.m during simulation.

For this example, the MATLAB Function block has already been modified and saved.
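
For reference, the resulting MATLAB Function block body is a thin wrapper around the generated file. A minimal sketch, assuming the block's input and output ports are named to match the reward function arguments (the actual names in rl_motor may differ), might look like this:

function r = rewardWrapper(y,refy,mv,refmv,lastmv)
% Hypothetical MATLAB Function block body; the port names are assumptions
% chosen to match the rewardFunctionMpc arguments.
r = rewardFunctionMpc(y,refy,mv,refmv,lastmv);
end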

Create a Reinforcement Learning Environment

The environment dynamics are modeled in the Servomechanism subsystem. For this environment,

  • The observations are the reference and actual output variables (angle and torque) from the last 8 time steps.

  • The action is the voltage V applied to the servomotor.

  • The sample time is Ts = 0.1 s.

  • The total simulation time is Tf = 20 s.

Specify the total simulation time and sample time.

Tf = 20;
Ts = 0.1;

Create observation and action specifications for the environment.

numObs = 32; % 4 signals (reference and actual angle and torque) over the last 8 time steps
numAct = 1;  % voltage applied to the servomotor
oinfo = rlNumericSpec([numObs 1]);
ainfo = rlNumericSpec([numAct 1],'LowerLimit',-220,'UpperLimit',220);

Create the reinforcement learning environment using the rlSimulinkEnv function.

env = rlSimulinkEnv('rl_motor','rl_motor/RL Agent',oinfo,ainfo);

Create a Reinforcement Learning Agent

Fix the random seed for reproducibility.

rng(0)

The agent in this example is a Twin Delayed Deep Deterministic Policy Gradient (TD3) agent.

Create two critic representations.

% Critic network: observation (state) path; the action path joins at the concatenation layer
cnet = [
    featureInputLayer(numObs,'Normalization','none','Name', 'State')
    fullyConnectedLayer(128, 'Name', 'fc1')
    concatenationLayer(1,2,'Name','concat')
    reluLayer('Name','relu1')
    fullyConnectedLayer(64, 'Name', 'fc3')
    reluLayer('Name','relu2')
    fullyConnectedLayer(1, 'Name', 'CriticOutput')]; 
% Action path
actionPath = [
    featureInputLayer(numAct,'Normalization','none', 'Name', 'Action')
    fullyConnectedLayer(8, 'Name', 'fc2')];
criticNetwork = layerGraph(cnet);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = connectLayers(criticNetwork,'fc2','concat/in2');
criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic1 = rlQValueRepresentation(criticNetwork,oinfo,ainfo,...
    'Observation',{'State'},'Action',{'Action'},criticOptions);
critic2 = rlQValueRepresentation(criticNetwork,oinfo,ainfo,...
    'Observation',{'State'},'Action',{'Action'},criticOptions);

Create an actor representation.

actorNetwork = [
    featureInputLayer(numObs,'Normalization','none','Name','State')
    fullyConnectedLayer(128, 'Name','actorFC1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(64, 'Name','actorFC2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(numAct,'Name','Action')];
actorOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,oinfo,ainfo,...
    'Observation',{'State'},'Action',{'Action'},actorOptions);

Specify the agent options using rlTD3AgentOptions. The agent trains from an experience buffer of maximum capacity 1e6 by randomly selecting mini-batches of size 256. The discount factor of 0.995 favors long-term rewards.

agentOpts = rlTD3AgentOptions("SampleTime",Ts, ...
    "DiscountFactor", 0.995, ...
    "ExperienceBufferLength",1e6, ...
    "MiniBatchSize",256);

The exploration model in this TD3 agent is Gaussian. The noise model adds a normally distributed random value to the action during training. Set the standard deviation of the noise to 100. The standard deviation decays at a rate of 1e-5 at each agent step until it reaches the minimum value of 0.005.

agentOpts.ExplorationModel.StandardDeviationMin = 0.005;
agentOpts.ExplorationModel.StandardDeviation = 100;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;

Create the TD3 agent using the actor and critic representations. For more information on TD3 agents, see rlTD3Agent.

agent = rlTD3Agent(actor,[critic1,critic2],agentOpts);

Train the Agent

To train the agent, first specify the training options using rlTrainingOptions. For this example, use the following options:

  • Run each training for at most 2000 episodes, with each episode lasting at most ceil(Tf/Ts) time steps.

  • Stop the training when the agent receives an average cumulative reward greater than -2 over 20 consecutive episodes. At this point, the agent can track the reference signal.

trainOpts = rlTrainingOptions(...
    'MaxEpisodes',2000, ...
    'MaxStepsPerEpisode',ceil(Tf/Ts), ...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-2,...
    'ScoreAveragingWindowLength',20);

Train the agent using the train function. Training this agent is a computationally intensive process that may take several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    trainingStats = train(agent,env,trainOpts);
else
    load('rlDCServomotorTD3Agent.mat')
end

A snapshot of the training progress is shown in the following figure. You can expect different results due to inherent randomness in the training process.

Validate Controller Response

To validate the performance of the trained agent, simulate the model and view the response in the Scope blocks. The reinforcement learning agent is able to track the reference angle while satisfying the constraints on torque and voltage.

sim('rl_motor');

Close the model.

close_system('rl_motor')

Copyright 2021 The MathWorks, Inc.