
Generate Reward Function from a Model Verification Block for a Water Tank System

This example shows how to automatically generate a reward function from a performance requirement defined in a Simulink® Design Optimization™ model verification block. You then use the generated reward function to train a reinforcement learning agent.

Introduction

You can use the generateRewardFunction function to generate a reward function for reinforcement learning, starting from performance constraints specified in a Simulink Design Optimization model verification block. The resulting reward signal is a sum of weighted penalties for constraint violations by the current state of the environment.

In this example, you convert the cost and constraint specifications defined in a Check Step Response Characteristics block for a water tank system into a reward function. You then use the reward function to train an agent to control the water tank.

Specify parameters for this example.

% Watertank parameters
a = 2;
b = 5;
A = 20;

% Initial and final height
h0 = 1;
hf = 2;

% Simulation and sample times
Tf = 10;
Ts = 0.1;

The original model for this example is the watertank Simulink model (Simulink Control Design).

Open the model.

open_system('rlWatertankStepInput')

The model in this example has been modified for reinforcement learning. The goal is to control the level of the water in the tank using a reinforcement learning agent, while satisfying the response characteristics defined in the Check Step Response Characteristics block. Open the block to view the desired step response specifications.

blk = 'rlWatertankStepInput/WaterLevelStepResponse';
open_system(blk)

Figure: Check Step Response Characteristics [1] - WaterLevelStepResponse, showing the desired step response bounds.

Generate a Reward Function

Generate the reward function code from specifications in the WaterLevelStepResponse block using generateRewardFunction. The code is displayed in the MATLAB Editor.

generateRewardFunction(blk)

The generated reward function is a starting point for reward design. You can modify the function by choosing different penalty functions and tuning the penalty weights. For this example, make the following changes to the generated code:

  • The default penalty weight is 1. Set the weight to 10.

  • The default exterior penalty function method is step. Change the method to quadratic.

After you make changes, the weight and penalty specifications should be as follows:

Weight = 10;
Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'quadratic'));
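The effect of the two penalty methods can be sketched outside MATLAB. The following Python snippet is an illustrative sketch only, assuming that the step method returns a fixed penalty of 1 for any bound violation while the quadratic method returns the squared violation distance; check the exteriorPenalty documentation for the exact definitions.

```python
def exterior_penalty(x, xmin, xmax, method="quadratic"):
    """Illustrative exterior penalty for a scalar x against bounds [xmin, xmax]."""
    violation = max(0.0, x - xmax) + max(0.0, xmin - x)
    if method == "step":
        # Assumed behavior: a fixed penalty of 1 for any violation.
        return 1.0 if violation > 0 else 0.0
    # Assumed behavior: penalty grows with the squared violation distance.
    return violation ** 2

weight = 10

# Water level slightly above the settling band: the quadratic penalty scales
# with how far the bound is violated, giving a smoother learning signal than
# the all-or-nothing step penalty.
reward = -weight * exterior_penalty(2.3, 1.96, 2.04, "quadratic")
```

Because the quadratic penalty grows with the size of the violation, the agent receives a gradient-like signal even near the bounds, which generally makes reward shaping easier than with the binary step penalty.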

For this example, the modified code has been saved in the MATLAB function file rewardFunctionVfb.m. Display the generated reward function.

type rewardFunctionVfb.m
function reward = rewardFunctionVfb(x,t)
% REWARDFUNCTION generates rewards from Simulink block specifications.
%
% x : Input of watertank_stepinput_rl/WaterLevelStepResponse
% t : Simulation time (s)

% Reinforcement Learning Toolbox
% 26-Apr-2021 13:05:16

%#codegen

%% Specifications from watertank_stepinput_rl/WaterLevelStepResponse
Block1_InitialValue = 1;
Block1_FinalValue = 2;
Block1_StepTime = 0;
Block1_StepRange = Block1_FinalValue - Block1_InitialValue;
Block1_MinRise = Block1_InitialValue + Block1_StepRange * 80/100;
Block1_MaxSettling = Block1_InitialValue + Block1_StepRange * (1+2/100);
Block1_MinSettling = Block1_InitialValue + Block1_StepRange * (1-2/100);
Block1_MaxOvershoot = Block1_InitialValue + Block1_StepRange * (1+10/100);
Block1_MinUndershoot = Block1_InitialValue - Block1_StepRange * 5/100;

if t >= Block1_StepTime
    if Block1_InitialValue <= Block1_FinalValue
        Block1_UpperBoundTimes = [0,5; 5,max(5+1,t+1)];
        Block1_UpperBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettling,Block1_MaxSettling];
        Block1_LowerBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
        Block1_LowerBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,Block1_MinRise; Block1_MinSettling,Block1_MinSettling];
    else
        Block1_UpperBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
        Block1_UpperBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,Block1_MinRise; Block1_MinSettling,Block1_MinSettling];
        Block1_LowerBoundTimes = [0,5; 5,max(5+1,t+1)];
        Block1_LowerBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettling,Block1_MaxSettling];
    end

    Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
    for idx = 1:numel(Block1_xmax)
        tseg = Block1_UpperBoundTimes(idx,:);
        xseg = Block1_UpperBoundAmplitudes(idx,:);
        Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmax))
        Block1_xmax = Inf;
    else
        Block1_xmax = max(Block1_xmax,[],'omitnan');
    end

    Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
    for idx = 1:numel(Block1_xmin)
        tseg = Block1_LowerBoundTimes(idx,:);
        xseg = Block1_LowerBoundAmplitudes(idx,:);
        Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmin))
        Block1_xmin = -Inf;
    else
        Block1_xmin = max(Block1_xmin,[],'omitnan');
    end
else
    Block1_xmin = -Inf;
    Block1_xmax = Inf;
end

%% Penalty function weight (specify nonnegative)
Weight = 10;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternately, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'quadratic'));

%% Compute reward
reward = -Weight * Penalty;
end

To integrate this reward function into the water tank model, open the MATLAB Function block in the Reward subsystem.

open_system('rlWatertankStepInput/Reward/Reward Function')

Append the following line of code to the function and save the model.

r = rewardFunctionVfb(x,t);

The MATLAB Function block now executes rewardFunctionVfb.m to compute the reward.

For this example, the MATLAB Function block has already been modified and saved.

Create a Reinforcement Learning Environment

The environment dynamics are modeled in the Water-Tank Subsystem. For this environment,

  • The observations are the reference height ref from the last 5 time steps and the height error err = ref - H.

  • The action is the voltage V applied to the pump.

  • The sample time Ts is 0.1 s.
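The six-element observation can be pictured as a rolling buffer of reference samples plus the current error. The Python sketch below is purely illustrative; the actual signals are routed inside the Simulink model, so the names and ordering here are assumptions.

```python
from collections import deque

# Rolling buffer holding the reference height over the last 5 sample times.
ref_history = deque([0.0] * 5, maxlen=5)

def build_observation(ref, height):
    """Assemble a 6-element observation: 5 buffered ref samples + current error."""
    ref_history.append(ref)
    err = ref - height
    return list(ref_history) + [err]  # numObs = 6

# One step with reference 2.0 m and measured height 1.0 m.
obs = build_observation(ref=2.0, height=1.0)
```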

Create observation and action specifications for the environment.

numObs = 6;
numAct = 1;
oinfo = rlNumericSpec([numObs 1]);
ainfo = rlNumericSpec([numAct 1]);

Create the reinforcement learning environment using the rlSimulinkEnv function.

env = rlSimulinkEnv('rlWatertankStepInput','rlWatertankStepInput/RL Agent',oinfo,ainfo);

Create a Reinforcement Learning Agent

Fix the random seed for reproducibility.

rng(100)

The agent in this example is a Twin Delayed Deep Deterministic Policy Gradient (TD3) agent.

Create two critic representations.

% Critic
cnet = [
    featureInputLayer(numObs,'Normalization','none','Name', 'State')
    fullyConnectedLayer(128, 'Name', 'fc1')
    concatenationLayer(1,2,'Name','concat')
    reluLayer('Name','relu1')
    fullyConnectedLayer(128, 'Name', 'fc3')
    reluLayer('Name','relu2')
    fullyConnectedLayer(1, 'Name', 'CriticOutput')];
actionPath = [
    featureInputLayer(numAct,'Normalization','none', 'Name', 'Action')
    fullyConnectedLayer(8, 'Name', 'fc2')];
criticNetwork = layerGraph(cnet);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = connectLayers(criticNetwork,'fc2','concat/in2');
criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic1 = rlQValueRepresentation(criticNetwork,oinfo,ainfo,...
    'Observation',{'State'},'Action',{'Action'},criticOptions);
critic2 = rlQValueRepresentation(criticNetwork,oinfo,ainfo,...
    'Observation',{'State'},'Action',{'Action'},criticOptions);

Create an actor representation.

actorNetwork = [featureInputLayer(numObs,'Normalization','none','Name','State')
    fullyConnectedLayer(128, 'Name','actorFC1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(128, 'Name','actorFC2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(numAct,'Name','Action')
    ];
actorOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,oinfo,ainfo,...
    'Observation',{'State'},'Action',{'Action'},actorOptions);

Specify the agent options using rlTD3AgentOptions. The agent trains from an experience buffer of maximum capacity 1e6 by randomly selecting mini-batches of size 256. The discount factor of 0.99 favors long-term rewards.

agentOpts = rlTD3AgentOptions("SampleTime",Ts, ...
    "DiscountFactor",0.99, ...
    "ExperienceBufferLength",1e6, ...
    "MiniBatchSize",256);

The exploration model in this TD3 agent is Gaussian. The noise model adds a Gaussian random value to the action during training. Set the standard deviation of the noise to 0.5. The standard deviation decays at the rate of 1e-5 every agent step until it reaches the minimum value of 0.

agentOpts.ExplorationModel.StandardDeviation = 0.5;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;
agentOpts.ExplorationModel.StandardDeviationMin = 0;
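The decay schedule can be sketched as follows, assuming the standard deviation decays multiplicatively each agent step, std <- max(std_min, std * (1 - decay_rate)); see the GaussianActionNoise documentation for the exact rule used by the toolbox.

```python
def decayed_std(std0, decay_rate, std_min, num_steps):
    """Standard deviation after num_steps of (assumed) multiplicative decay."""
    std = std0
    for _ in range(num_steps):
        std = max(std_min, std * (1 - decay_rate))
    return std

# With a decay rate of 1e-5, the noise shrinks very slowly: after 10,000
# agent steps the standard deviation is still roughly 0.45, so the agent
# keeps exploring throughout a typical training run.
std_after = decayed_std(0.5, 1e-5, 0.0, 10_000)
```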

Create the TD3 agent using the actor and critic representations. For more information on TD3 agents, see rlTD3Agent.

agent = rlTD3Agent(actor,[critic1,critic2],agentOpts);

Train the Agent

To train the agent, first specify the training options using rlTrainingOptions. For this example, use the following options:

  • Run each training for at most 100 episodes, with each episode lasting at most ceil(Tf/Ts) time steps, where the total simulation time Tf is 10 s.

  • Stop the training when the agent receives an average cumulative reward greater than -5 over 20 consecutive episodes. At this point, the agent can track the reference height.

trainOpts = rlTrainingOptions(...
    'MaxEpisodes',100, ...
    'MaxStepsPerEpisode',ceil(Tf/Ts), ...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-5,...
    'ScoreAveragingWindowLength',20);

Train the agent using the train function. Training this agent is a computationally intensive process that may take several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    trainingStats = train(agent,env,trainOpts);
else
    load('rlWatertankTD3Agent.mat')
end

A snapshot of the training progress is shown in the following figure. You can expect different results due to inherent randomness in the training process.

Validate Closed Loop Response

Simulate the model to view the closed-loop step response. The reinforcement learning agent tracks the reference height while satisfying the step response constraints.

sim('rlWatertankStepInput');

Figure: Check Step Response Characteristics [1] - WaterLevelStepResponse, showing the closed-loop water level response within the step response bounds.

Close the model.

close_system('rlWatertankStepInput')