Compare Agents on Stochastic Waterfall Grid World
This example shows how to create and train frequently used discrete action space default agents on an 8-by-7 gridworld environment. The environment has a starting location, eight terminal locations, and a stochastic waterfall that pushes the agent toward the bottom of the grid. The goal of the agent is to move from the starting location to the target terminal location while maximizing the total reward.
The results that the agents obtain in this environment, with the selected initial conditions and random number generator seed, do not necessarily imply that specific agents are better than others. Also, note that the training times depend on the computer and operating system you use to run the example, and on other processes running in the background. Your training times might differ substantially from the training times shown in the example.
Fix Random Number Stream for Reproducibility
The example code might involve computation of random numbers at several stages. Fixing the random number stream at the beginning of some sections in the example code preserves the random number sequence in the section every time you run it, which increases the likelihood of reproducing the results. For more information, see Results Reproducibility.
Fix the random number stream with seed zero and random number algorithm Mersenne Twister. For more information on controlling the seed used for random number generation, see rng.
previousRngState = rng(0,"twister");
The output previousRngState is a structure that contains information about the previous state of the stream. You will restore the state at the end of the example.
Stochastic Waterfall Grid World Environment
The grid world environment used in this example has the following configuration and rules:
The grid world is 8-by-7 and bounded by borders, with four possible actions (North = 1, South = 2, East = 3, West = 4).
The agent begins from cell [4,1] (fourth row, first column).
The agent receives a reward of +50 if it reaches the terminal state at cell [4,5] (light blue).
All the locations in the bottom row of the grid are terminal, and the agent receives a –10 penalty for reaching them.
All other actions result in a –1 reward.
The waterfall pushes the agent toward the bottom of the grid with a stochastic intensity. Specifically, the baseline intensity varies between the columns, as shown in the figure, and the agent has an equal chance of experiencing the indicated intensity, one level above that intensity, or one level below that intensity. As a result, when the agent moves into a column with a nonzero intensity, the waterfall pushes it downward a number of squares according to the intensity experienced by the agent. For example, if the agent moves east from state [5,2], it has an equal chance of reaching state [6,3], [7,3], or [8,3].


With respect to the deterministic waterfall comparison example, the main differences are the trap terminal states on the bottom, which prevent the agent from using the bottom pathway to advance toward the goal, and the stochastic nature of the waterfall, which, on the other hand, opens up the possibility of going up first and then, with luck, using the waterfall to advance toward the target. Note that even if the agent executes an optimal policy, it sometimes cannot arrive at the target cell because the waterfall might be too strong to overcome.
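The stochastic push rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the MATLAB example: the per-column baseline intensity of 2 for column 3 is an assumption inferred from the [5,2] example, and the function name is hypothetical.

```python
import random

def apply_waterfall(row, col, baseline, nrows=8):
    # Sample the experienced intensity: the column's baseline
    # intensity, or one level above or below, with equal chance.
    push = max(baseline.get(col, 0) + random.choice((-1, 0, 1)), 0)
    # Push the agent down, clamping at the bottom row.
    return min(row + push, nrows), col

# Moving east from [5,2], the agent first lands in column 3.
# Assuming a baseline intensity of 2 there, the push is 1, 2, or 3,
# so the agent ends up in row 6, 7, or 8 of column 3.
baseline = {3: 2}  # hypothetical per-column intensities
rows = {apply_waterfall(5, 3, baseline)[0] for _ in range(1000)}
print(sorted(rows))
```

Over many samples, the landing rows match the [6,3], [7,3], [8,3] outcomes described in the text.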
For more information on this environment, see Use Predefined Grid World Environments.
Create Environment Object
Create a stochastic waterfall grid world.
env = rlPredefinedEnv("WaterFallGridWorld-Stochastic");
To specify that the initial state of the agent is always [4,1], create a reset function that returns the state number for the initial agent state. This function is called at the start of each training episode and simulation. States are numbered starting at position [1,1]. The state number increases as you move down the first column and then down each subsequent column. Therefore, create an anonymous function handle that sets the initial state to 4.
env.ResetFcn = @() 4;
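The column-major state numbering described above can be sketched in Python (a hypothetical helper, not part of the example):

```python
def state_number(row, col, nrows=8):
    # States are numbered down the first column, then down each
    # subsequent column (1-based, matching the grid world).
    return (col - 1) * nrows + row

print(state_number(4, 1))  # -> 4, the initial state returned by ResetFcn
print(state_number(4, 5))  # -> 36, the target terminal state [4,5]
```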
For this example, enhance the reward for getting to the [4,5] terminal state.
env.Model.R(env.Model.R>0)=50;
Extract the environment observation and action specification objects for later use when creating agents.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Visualize the environment and configure the visualization to maintain a trace of the agent states.
plot(env)

The plot shows the agent location with a red circle and the terminal locations in light blue. Note that the initial agent location is [5,1] because the environment reset function has not yet been called.
Create a Table-Based Q-Value Function Critic
In this section, you create a table-based Q-value function critic for use within Q, SARSA, and DQN agents.
Create a table object.
table = rlTable(obsInfo,actInfo);
Create a critic approximator object using rlQValueFunction.
qcritic = rlQValueFunction(table,obsInfo,actInfo);
Check the critic with random observation and action inputs.
getValue(qcritic, ...
    {randi(numel(obsInfo.Elements))}, ...
    {randi(numel(actInfo.Elements))})
ans = 0
Create a Table-Based Value Function Critic
In this section, you create a table-based value function critic for use within AC and PPO agents. You can also use this critic as a baseline for a PG agent.
Create a table object.
table = rlTable(obsInfo);
Create a critic approximator object using rlValueFunction.
vcritic = rlValueFunction(table,obsInfo);
Check the critic with random observation input.
getValue(vcritic,{randi(numel(obsInfo.Elements))})
ans = 0
Create a Custom Basis Function-Based Vector Q-Value Function Critic
In this section, you create a vector Q-value function critic for use within the SAC agent. The critic uses a custom basis function that implements a table.
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Because rlVectorQValueFunction does not support tables, use the helper function defined in the supporting file cbftbl.m, which implements a table using custom basis functions. Note that using local functions to implement a custom basis function is not recommended if you want to save an agent and load it later. This is because local functions are available only in the file in which they are defined, and when you load an agent in the workspace the function is no longer available to the agent. Additionally, local functions are not supported for code generation.
Alternatively, you could also define a custom layer and then use it in a custom network that effectively implements a table. For an example on how to do that, see Use Custom Layer in TRPO Agent to Solve Tabular Approximation Problem.
Display the helper function.
type("cbftbl.m")
function out = cbftbl(obs,nrows)
% Table with nrows rows and one column implemented using a custom basis
% function. The third dimension of obs is the batch dimension.
% Get batch dimension and allocate output matrix.
nbtc = size(obs,3);
out = zeros(nrows,nbtc,like=obs);
% Cycle through batch dimension and set to one the output
% corresponding to the index received as observation.
for k = 1:nbtc
if obs(1,1,k)
out(obs(1,1,k),k) = 1;
end
end
end
Define the custom basis function. Specifically, the anonymous function basisFcn accepts as input the current observation, calls the cbftbl function (the value of the second argument is stored in the basisFcn workspace at definition time) and returns, for each batch element, a one-hot vector of 8*7=56 elements, where the element equal to 1 indicates the current observation.
basisFcn = @(obs) cbftbl(obs,numel(obsInfo.Elements));
Specify initial conditions for the table values. When the transpose of W0 is multiplied by the output of basisFcn, the result is a vector in which each of the four elements is the value of the corresponding action given the observation.
W0 = rand(numel(obsInfo.Elements),numel(actInfo.Elements));
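The table lookup performed by this multiplication can be illustrated in Python (NumPy stands in for the MATLAB matrix operations; the sizes match this example, but the names are illustrative):

```python
import numpy as np

n_obs, n_act = 8 * 7, 4           # 56 states, 4 actions
W = np.random.rand(n_obs, n_act)  # table weights, like W0

def basis(obs):
    # One-hot encoding of the 1-based observation index.
    b = np.zeros(n_obs)
    b[obs - 1] = 1.0
    return b

# W' * basis(obs) simply selects row obs of the table:
# one Q-value per action for the given observation.
obs = 13
q_values = W.T @ basis(obs)
assert np.allclose(q_values, W[obs - 1, :])
```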
Use rlVectorQValueFunction to create the critic object, passing the basis function handle and the initial condition as a cell.
vqcritic = rlVectorQValueFunction({basisFcn,W0},obsInfo,actInfo);
Check the critic with a batch of 10 random observation inputs.
r = randi(numel(obsInfo.Elements),[1 1 10]);
v = getValue(vqcritic,{r});
Display the seventh element of the batch.
v(:,7)
ans = 4×1
0.8003
0.8407
0.0844
0.7447
Create a Custom Basis Function-Based Q-Value Function Critic
In this section, you create a Q-value function critic for use within the LSPI agent. The critic uses a custom basis function that implements a table.
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Because the LSPI agent does not support tables as approximator models, use the helper function defined in the supporting file cbfqtbl.m, which implements a Q-value function table using custom basis functions. Note that using local functions to implement a custom basis function is not recommended if you want to save an agent and load it later. This is because local functions are available only in the file in which they are defined, and when you load an agent in the workspace the function is no longer available to the agent. Additionally, local functions are not supported for code generation.
Alternatively, you could also define a custom layer and then use it in a custom network that effectively implements a table. For an example on how to do that, see Use Custom Layer in TRPO Agent to Solve Tabular Approximation Problem.
Display the helper function.
type("cbfqtbl.m")
function out = cbfqtbl(obs,act,nr,nc)
% Table with nr rows and nc columns implemented using a custom basis
% function. The third dimension of obs and act is the batch dimension.
% Get batch dimension.
nbtc = size(obs,3);
% Allocate output matrix in columnwise format, that is the first two
% dimensions are vectorized as in out=table(:)
out = zeros(nr*nc,nbtc,like=obs);
% Cycle through batch dimension and set to one the output
% corresponding to the indexes of observation and action.
for k = 1:nbtc
if obs(1,1,k)
out(nr*(act(1,1,k)-1)+obs(1,1,k),k) = 1;
end
end
end
Define the custom basis function. Specifically, the anonymous function basisFcnQ accepts as input the current observation and action, calls the cbfqtbl function (the values of the last two arguments are stored in the basisFcnQ workspace at definition time), and returns, for each batch element, a one-hot vector of 8*7*4=224 elements, where the element equal to 1 indicates the columnwise location of the current observation and action.
basisFcnQ = @(obs,act) cbfqtbl(obs,act, ...
    numel(obsInfo.Elements), ...
    numel(actInfo.Elements));
Specify initial conditions for the table values. When the transpose of W0 is multiplied by the output of basisFcnQ, the result is a scalar indicating the value of the corresponding action given the observation.
W0 = rand(numel(obsInfo.Elements)*numel(actInfo.Elements),1);
For example, the weight corresponding to observation number 13 and action number 3 is:
W0(numel(obsInfo.Elements)*(3-1)+13)
ans = 0.8173
Use rlQValueFunction to create the critic object, passing the basis function handle and the initial condition as a cell.
qcbfcritic = rlQValueFunction({basisFcnQ,W0},obsInfo,actInfo);
Ignoring the batch dimension, check the critic by returning the value corresponding to the observation number 13 and the action number 3.
getValue(qcbfcritic,{13},{3})
ans = 0.8173
Check the critic with a batch of 10 random observation and action inputs.
robs = randi(numel(obsInfo.Elements),[1 1 10]);
ract = randi(numel(actInfo.Elements),[1 1 10]);
v = getValue(qcbfcritic,{robs},{ract});
Display the seventh element of the batch.
v(:,7)
ans = 0.3786
Create a Custom Basis Function-Based Actor
In this section, you create a discrete categorical actor for use within PG, AC, and PPO agents. The actor uses a custom basis function that implements a table.
Because rlDiscreteCategoricalActor does not support tables, use the helper function defined in the supporting file cbftbl.m, which implements a table using custom basis functions. Note that using local functions to implement a custom basis function is not recommended if you want to save an agent and load it later. This is because local functions are available only in the file in which they are defined, and when you load an agent in the workspace the function is no longer available to the agent. Additionally, local functions are not supported for code generation.
Alternatively, you could also define a custom layer and then use it in a custom network that effectively implements a table. For an example on how to do that, see Use Custom Layer in TRPO Agent to Solve Tabular Approximation Problem.
Define the custom basis function. Here, the output of the basis function is a one-hot vector of 8*7 elements, where the element equal to 1 indicates the current observation.
basisFcn = @(obs) cbftbl(obs,numel(obsInfo.Elements));
Specify initial conditions for the table values. When the transpose of W0 is multiplied by the output of basisFcn, the result, after passing through a softmax function, is a vector in which each of the four elements is the probability of executing the corresponding action.
W0 = zeros(numel(obsInfo.Elements),numel(actInfo.Elements));
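The effect of the softmax on a zero-initialized table row can be sketched in Python (an illustrative sketch, not part of the example code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

# W0 is all zeros, so every state maps to a row of zeros and the
# softmax yields a uniform distribution over the four actions.
probs = softmax(np.zeros(4))
print(probs)  # -> [0.25 0.25 0.25 0.25]
```

Starting from a uniform policy in this way encourages exploration early in training.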
Use rlDiscreteCategoricalActor to create the actor object, passing the basis function handle and the initial condition as a cell.
actor = rlDiscreteCategoricalActor({basisFcn,W0},obsInfo,actInfo);
Check the actor with a batch of 10 random observation inputs.
robs = randi(numel(obsInfo.Elements),[1 1 10]);
v = getAction(actor,{robs});
Display the seventh element of the batch.
v{1}(:,:,7)
ans = 1
Configure Training and Simulation Options for All Agents
Set up an evaluator object to evaluate the agent ten times without exploration every 100 training episodes. For more information, see rlEvaluator.
evl = rlEvaluator(NumEpisodes=10,EvaluationFrequency=100);
Create a training options object.
trainOpts = rlTrainingOptions;
For this example, use the following options:
Train for a maximum of 5000 episodes. Specify that each episode lasts for at most 50 time steps.
Stop the training when the average reward in the evaluation episodes is greater than 1.
trainOpts.MaxEpisodes = 5000;
trainOpts.MaxStepsPerEpisode = 50;
trainOpts.StopTrainingCriteria = "EvaluationStatistic";
trainOpts.StopTrainingValue = 1;
For more information on training options, see rlTrainingOptions.
To simulate the trained agent, create a simulation options object and configure it to simulate for 50 steps.
simOpts = rlSimulationOptions(MaxSteps=50);
For more information on simulation options, see rlSimulationOptions.
Create, Train, and Simulate a Q-Agent
In this section, you create a Q-learning agent and then configure its parameters.
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Create an rlQAgent object using the critic you have created before.
qAgent = rlQAgent(qcritic);
For tabular problems, you can typically safely set a slightly higher learning rate. However, set a gradient threshold to minimize the risk of learning instabilities.
qAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-2;
qAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("WaterFallGridWorld-Stochastic");
    env.ResetFcn = @() 4;
    env.Model.R(env.Model.R>0)=50;
    % Train the agent. Record the training time.
    tic
    qTngRes = train(qAgent,env,trainOpts,Evaluator=evl);
    qTngTime = toc;
    % Extract the number of training episodes and total steps.
    qTngEps = qTngRes.EpisodeIndex(end);
    qTngSteps = sum(qTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("swgwBchQAgent.mat", ...
    %     "qAgent","qTngEps","qTngSteps","qTngTime")
else
    % Load the pretrained agent and results for the example.
    load("swgwBchQAgent.mat", ...
        "qAgent","qTngEps","qTngSteps","qTngTime")
end

The training does not stop within 5000 episodes. The evaluation statistic indicates a reward of -3.2, suggesting that the agent does not enact the optimal policy.
You can check the trained agent within the environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment and configure the visualization to maintain a trace of the agent states.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
By default, the agent uses a greedy (hence deterministic) policy in simulation. To use the exploratory policy instead, set the UseExplorationPolicy agent property to true.
Simulate the environment with the trained agent for 50 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(qAgent,env,simOpts);

qTotalRwd = sum(experience.Reward)
qTotalRwd = -50
The agent trace shows that the trained agent does not find the target cell.
Extract the table with the final Q-value function.
qAgtFinalQ = getLearnableParameters(getCritic(qAgent));
Display the maximum values for each state (that is, the approximate value function) in an 8-by-7 format.
reshape(max(qAgtFinalQ{1}'),8,7)
ans = 8×7 single matrix
-8.4910 -7.6386 0 0 0 0 -0.3060
-9.7053 -8.2515 -3.7453 0 0 0 -0.3177
-10.3059 -9.7609 -6.6872 -1.4023 -0.6185 -0.3177 -0.9506
-11.3702 -10.3471 -8.4235 -3.6939 0 -0.6037 -2.0492
-11.7910 -11.0499 -8.9738 -5.4382 -4.3062 -1.6570 -2.8689
-10.9000 -10.5582 -10.2957 -7.4134 -5.6825 -4.2986 -3.6178
-10.0000 -10.0393 -10.2703 -9.2107 -8.4403 -5.5935 -4.6992
0 0 0 0 0 0 0
Create, Train, and Simulate a SARSA Agent
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Create an rlSARSAAgent object using the critic you have created before.
sarsaAgent = rlSARSAAgent(qcritic);
For tabular problems, you can typically safely set a slightly higher learning rate. However, set a gradient threshold to minimize the risk of learning instabilities.
sarsaAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-2;
sarsaAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("WaterFallGridWorld-Stochastic");
    env.ResetFcn = @() 4;
    env.Model.R(env.Model.R>0)=50;
    % Train the agent. Record the training time.
    tic
    sarsaTngRes = train(sarsaAgent,env,trainOpts,Evaluator=evl);
    sarsaTngTime = toc;
    % Extract the number of training episodes and total steps.
    sarsaTngEps = sarsaTngRes.EpisodeIndex(end);
    sarsaTngSteps = sum(sarsaTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("swgwBchSARSAAgent.mat", ...
    %     "sarsaAgent","sarsaTngEps","sarsaTngSteps","sarsaTngTime")
else
    % Load the pretrained agent and results for the example.
    load("swgwBchSARSAAgent.mat", ...
        "sarsaAgent","sarsaTngEps","sarsaTngSteps","sarsaTngTime")
end

The training stops after 800 episodes. The evaluation statistic indicates a reward of 2.4, suggesting that the agent learns a policy that can be close to the optimal one.
You can check the trained agent within the environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment and configure the visualization to maintain a trace of the agent states.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
By default, the agent uses a greedy (hence deterministic) policy in simulation. To use the exploratory policy instead, set the UseExplorationPolicy agent property to true.
Simulate the environment with the trained agent for 50 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(sarsaAgent,env,simOpts);

sarsaTotalRwd = sum(experience.Reward)
sarsaTotalRwd = -14
The agent trace shows that, in this simulation, the trained agent does not find the target cell. Note that even when the agent executes an optimal policy, reaching the target cell is not guaranteed, because sometimes the waterfall is too strong to overcome, as seems to be the case in this simulation.
Extract the table with the final Q-value function.
sarsaAgtFinalQ = getLearnableParameters(getCritic(sarsaAgent));
Display the maximum values for each state (that is, the value function) in an 8-by-7 format.
reshape(max(sarsaAgtFinalQ{1}'),8,7)
ans = 8×7 single matrix
-8.4358 -7.6420 0 0 0 0 -0.3177
-9.2846 -8.3873 -3.5061 0 0.3176 0 -0.9372
-10.2814 -9.3075 -6.3571 -1.2661 0.0602 -0.6334 -1.7699
-10.8428 -10.3274 -8.2447 -2.5470 0 -0.9809 -2.6761
-11.7603 -10.9135 -8.4797 -5.9128 -3.0310 -3.0969 -3.5276
-11.0152 -10.8841 -10.4255 -7.1040 -5.5859 -4.3203 -4.7720
-9.9046 -9.8601 -10.3371 -10.2489 -9.6857 -6.7507 -5.9167
0 0 0 0 0 0 0
Create, Train, and Simulate an LSPI Agent
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Create an rlLSPIAgent object using the critic you have created before.
lspiAgent = rlLSPIAgent(qcbfcritic);
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("WaterFallGridWorld-Stochastic");
    env.ResetFcn = @() 4;
    env.Model.R(env.Model.R>0)=50;
    % Train the agent. Record the training time.
    tic
    lspiTngRes = train(lspiAgent,env,trainOpts,Evaluator=evl);
    lspiTngTime = toc;
    % Extract the number of training episodes and total steps.
    lspiTngEps = lspiTngRes.EpisodeIndex(end);
    lspiTngSteps = sum(lspiTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("swgwBchLSPIAgent.mat", ...
    %     "lspiAgent","lspiTngEps","lspiTngSteps","lspiTngTime")
else
    % Load the pretrained agent and results for the example.
    load("swgwBchLSPIAgent.mat", ...
        "lspiAgent","lspiTngEps","lspiTngSteps","lspiTngTime")
end

The training does not converge after 5000 episodes. The evaluation statistic indicates a reward of -13, suggesting that the agent does not learn the optimal policy.
You can check the trained agent within the environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment and configure the visualization to maintain a trace of the agent states.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
By default, the agent uses a greedy (hence deterministic) policy in simulation. To use the exploratory policy instead, set the UseExplorationPolicy agent property to true.
Simulate the environment with the trained agent for 50 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(lspiAgent,env,simOpts);

lspiTotalRwd = sum(experience.Reward)
lspiTotalRwd = -13
The agent trace shows that the trained agent does not find the target cell.
Extract the table with the final Q-value function.
lspiAgtFinalQ = getLearnableParameters(getCritic(lspiAgent));
Display the maximum values for each state (that is, the value function) in an 8-by-7 format.
reshape(max(reshape(lspiAgtFinalQ{1},8*7,4)'),8,7)
ans = 8×7 single matrix
0 0 0 0 0 0 0
0.0000 0 0 0 0 0 0
0 -10.8792 0 0 0 0 0
-12.9099 -14.2565 0 0 0 0 0
-11.7942 -12.7368 -11.1774 0 0 0 0
-10.9021 -10.3578 0 0 0 0 0
-10.0000 -9.9995 -10.3403 -9.9900 0 0 0
0 0 0 0 0 0 0
Create, Train, and Simulate a DQN Agent
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Create an rlDQNAgent object using the critic you have created before.
dqnAgent = rlDQNAgent(qcritic);
For tabular problems, you can typically safely set a slightly higher learning rate. However, set a gradient threshold to minimize the risk of learning instabilities.
dqnAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-2;
dqnAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("WaterFallGridWorld-Stochastic");
    env.ResetFcn = @() 4;
    env.Model.R(env.Model.R>0)=50;
    % Train the agent. Record the training time.
    tic
    dqnTngRes = train(dqnAgent,env,trainOpts,Evaluator=evl);
    dqnTngTime = toc;
    % Extract the number of training episodes and total steps.
    dqnTngEps = dqnTngRes.EpisodeIndex(end);
    dqnTngSteps = sum(dqnTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("swgwBchDQNAgent.mat", ...
    %     "dqnAgent","dqnTngEps","dqnTngSteps","dqnTngTime")
else
    % Load the pretrained agent and results for the example.
    load("swgwBchDQNAgent.mat", ...
        "dqnAgent","dqnTngEps","dqnTngSteps","dqnTngTime")
end

The training stops after 200 episodes. The evaluation statistic indicates a reward of 15, suggesting that the agent learns the optimal policy.
You can check the trained agent within the environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment and configure the visualization to maintain a trace of the agent states.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
By default, the agent uses a greedy (hence deterministic) policy in simulation. To use the exploratory policy instead, set the UseExplorationPolicy agent property to true.
Simulate the environment with the trained agent for 50 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(dqnAgent,env,simOpts);

dqnTotalRwd = sum(experience.Reward)
dqnTotalRwd = 44
The agent trace shows that the trained agent successfully finds the target cell using just six moves before the final one. As a consequence, the total reward is 50 – 6 = 44.
Extract the table with the final Q-value function.
dqnAgtFinalQ = getLearnableParameters(getCritic(dqnAgent));
Display the maximum values for each state (that is, the value function) in an 8-by-7 format.
reshape(max(dqnAgtFinalQ{1}'),8,7)
ans = 8×7 single matrix
-7.0829 -6.1218 0 0 0.0000 0.0000 0.0000
-7.9759 -7.1075 -1.3664 0 3.9302 6.4918 3.9297
-8.8278 -8.0157 -5.2283 4.9758 6.1051 5.1754 2.9144
-9.6662 -8.9143 -8.6060 0.4956 0 2.4971 1.8557
-10.5174 -9.7126 -7.9808 -5.3856 0.3213 -0.2988 0.7551
-10.1551 -9.3803 -8.9581 -8.1845 -5.4431 -1.0009 -0.3174
-9.3846 -8.6961 -9.2285 -9.1227 -7.5849 -2.4418 -1.4120
0 0 0 0 0 0 0
Create, Train, and Simulate a PG Agent
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Create an rlPGAgent object using the actor you have created before. Alternatively, to use a baseline, supply the previously created critic vcritic as a second argument. However, for this example, do not use a baseline.
pgAgent = rlPGAgent(actor);
For tabular problems, you can typically safely set a slightly higher learning rate. However, set a gradient threshold to minimize the risk of learning instabilities.
pgAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-2;
pgAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Set the entropy loss weight to increase exploration.
pgAgent.AgentOptions.EntropyLossWeight = 0.005;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("WaterFallGridWorld-Stochastic");
    env.ResetFcn = @() 4;
    env.Model.R(env.Model.R>0)=50;
    % Train the agent. Record the training time.
    tic
    pgTngRes = train(pgAgent,env,trainOpts,Evaluator=evl);
    pgTngTime = toc;
    % Extract the number of training episodes and total steps.
    pgTngEps = pgTngRes.EpisodeIndex(end);
    pgTngSteps = sum(pgTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("swgwBchPGAgent.mat", ...
    %     "pgAgent","pgTngEps","pgTngSteps","pgTngTime")
else
    % Load the pretrained agent and results for the example.
    load("swgwBchPGAgent.mat", ...
        "pgAgent","pgTngEps","pgTngSteps","pgTngTime")
end

The training does not stop within 5000 episodes. The evaluation statistic indicates a reward of -50, suggesting that the agent does not learn any acceptable policy.
You can check the trained agent within the environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment and configure the visualization to maintain a trace of the agent states.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
By default, the agent uses a greedy (hence deterministic) policy in simulation. To use the exploratory policy instead, set the UseExplorationPolicy agent property to true.
Simulate the environment with the trained agent for 50 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(pgAgent,env,simOpts);

pgTotalRwd = sum(experience.Reward)
pgTotalRwd = -35
The agent trace shows that the trained agent does not even move toward the target cell.
Create, Train, and Simulate an AC Agent
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Create an rlACAgent object using the actor and critic you have created before.
acAgent = rlACAgent(actor,vcritic);
For tabular problems, you can typically safely set a slightly higher learning rate. However, set a gradient threshold to minimize the risk of learning instabilities.
acAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-2;
acAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
acAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-2;
acAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
Set the entropy loss weight to increase exploration.
acAgent.AgentOptions.EntropyLossWeight = 0.005;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("WaterFallGridWorld-Stochastic");
    env.ResetFcn = @() 4;
    env.Model.R(env.Model.R>0)=50;
    % Train the agent. Record the training time.
    tic
    acTngRes = train(acAgent,env,trainOpts,Evaluator=evl);
    acTngTime = toc;
    % Extract the number of training episodes and total steps.
    acTngEps = acTngRes.EpisodeIndex(end);
    acTngSteps = sum(acTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("swgwBchACAgent.mat", ...
    %     "acAgent","acTngEps","acTngSteps","acTngTime")
else
    % Load the pretrained agent and results for the example.
    load("swgwBchACAgent.mat", ...
        "acAgent","acTngEps","acTngSteps","acTngTime")
end

The training does not stop within 5000 episodes. The evaluation statistic indicates a reward of -50, suggesting that the agent does not learn an acceptable policy.
You can check the trained agent within the environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment and configure the visualization to maintain a trace of the agent states.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
By default, the agent uses a greedy (hence deterministic) policy in simulation. To use the exploratory policy instead, set the UseExplorationPolicy agent property to true.
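For example, a minimal sketch of switching between the two policies (this snippet is illustrative and not part of the original training workflow):

```matlab
% Use the stochastic (exploratory) policy during simulation.
acAgent.UseExplorationPolicy = true;

% Restore the default greedy policy for deterministic evaluation.
acAgent.UseExplorationPolicy = false;
```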
Simulate the environment with the trained agent for 50 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(acAgent,env,simOpts);
acTotalRwd = sum(experience.Reward)
acTotalRwd = -15
The agent trace shows that the trained agent does not learn to avoid the terminal locations in the bottom row.
Because the AC agent has a critic, you can display the final value function represented by the critic.
acAgtFinalV = getLearnableParameters(getCritic(acAgent));
reshape(acAgtFinalV{1},8,7)
ans = 8×7 single matrix
-8.0807 -8.0911 0 0 0 0 0
-8.9049 -9.2618 -4.5136 0 0 0 0
-11.2843 -10.7371 -5.1383 -1.6528 -0.2732 -0.1013 0
-12.9438 -11.6077 -7.0862 -1.3626 0 -0.1750 0
-11.5213 -10.6213 -7.5784 -3.1510 -1.5609 -0.0005 -0.0006
-11.3652 -10.6607 -10.3237 -3.7910 -2.7339 -0.1979 -0.2384
-8.6697 -10.1707 -10.2305 -10.2505 -7.4023 -1.8267 -1.0707
0 0 0 0 0 0 0
Create, Train, and Simulate a PPO Agent
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Create an rlPPOAgent object using the actor and critic you have created before.
ppoAgent = rlPPOAgent(actor,vcritic);
For tabular problems, you can typically safely set a slightly higher learning rate. However, set a gradient threshold to minimize the risk of learning instabilities.
ppoAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-2;
ppoAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
ppoAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-2;
ppoAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("WaterFallGridWorld-Stochastic");
    env.ResetFcn = @() 4;
    env.Model.R(env.Model.R>0) = 50;
    % Train the agent. Record the training time.
    tic
    ppoTngRes = train(ppoAgent,env,trainOpts,Evaluator=evl);
    ppoTngTime = toc;
    % Extract the number of training episodes and total steps.
    ppoTngEps = ppoTngRes.EpisodeIndex(end);
    ppoTngSteps = sum(ppoTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("swgwBchPPOAgent.mat", ...
    %     "ppoAgent","ppoTngEps","ppoTngSteps","ppoTngTime")
else
    % Load the pretrained agent and results for the example.
    load("swgwBchPPOAgent.mat", ...
        "ppoAgent","ppoTngEps","ppoTngSteps","ppoTngTime")
end

The training does not stop within 5000 episodes. The evaluation statistic indicates a reward of -12.3, suggesting that the agent does not learn the optimal policy.
You can check the trained agent within the environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment and configure the visualization to maintain a trace of the agent states.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
By default, the agent uses a greedy (hence deterministic) policy in simulation. To use the exploratory policy instead, set the UseExplorationPolicy agent property to true.
Simulate the environment with the trained agent for 50 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(ppoAgent,env,simOpts);

ppoTotalRwd = sum(experience.Reward)
ppoTotalRwd = -13
The agent trace shows that the trained agent does not learn to avoid the terminal states in the bottom row.
Because the PPO agent has a critic, you can display the final value function represented by the critic.
ppoAgtFinalV = getLearnableParameters(getCritic(ppoAgent));
reshape(ppoAgtFinalV{1},8,7)
ans = 8×7 single matrix
-2.5166 -2.3795 0 0 0 0 0
-2.8635 -2.8792 -1.0599 0 0 0 0
-3.8991 -3.8537 -2.3695 -0.3886 0 0 0
-4.9219 -5.7343 -3.0409 -0.3288 0 0 0
-7.2191 -7.2691 -5.6524 -1.2116 -0.4680 0 0
-6.8135 -6.6869 -6.7841 -2.9645 -1.6540 -0.4749 -0.1796
-6.7691 -6.0512 -7.5760 -6.1112 -3.7429 -1.2074 -0.7553
0 0 0 0 0 0 0
Create, Train, and Simulate a SAC Agent
Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
Create an rlSACAgent object using the actor and vector Q-value critic you have created before.
sacAgent = rlSACAgent(actor,vqcritic);
For tabular problems, you can typically safely set a slightly higher learning rate. However, set a gradient threshold to minimize the risk of learning instabilities.
sacAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-2;
sacAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
sacAgent.AgentOptions.CriticOptimizerOptions(1).LearnRate = 1e-2;
sacAgent.AgentOptions.CriticOptimizerOptions(1).GradientThreshold = 1;
sacAgent.AgentOptions.CriticOptimizerOptions(2).LearnRate = 1e-2;
sacAgent.AgentOptions.CriticOptimizerOptions(2).GradientThreshold = 1;
Set the initial entropy weight and target entropy to increase exploration.
sacAgent.AgentOptions.EntropyWeightOptions.EntropyWeight = 5e-3;
sacAgent.AgentOptions.EntropyWeightOptions.TargetEntropy = 5e-1;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("WaterFallGridWorld-Stochastic");
    env.ResetFcn = @() 4;
    env.Model.R(env.Model.R>0) = 50;
    % Train the agent. Record the training time.
    tic
    sacTngRes = train(sacAgent,env,trainOpts,Evaluator=evl);
    sacTngTime = toc;
    % Extract the number of training episodes and total steps.
    sacTngEps = sacTngRes.EpisodeIndex(end);
    sacTngSteps = sum(sacTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("swgwBchSACAgent.mat", ...
    %     "sacAgent","sacTngEps","sacTngSteps","sacTngTime")
else
    % Load the pretrained agent and results for the example.
    load("swgwBchSACAgent.mat", ...
        "sacAgent","sacTngEps","sacTngSteps","sacTngTime")
end

The training stops after 200 episodes. The evaluation statistic indicates a reward of 8.5, suggesting that the agent learns the optimal policy.
You can check the trained agent within the environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment and configure the visualization to maintain a trace of the agent states.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
By default, the agent uses a greedy (hence deterministic) policy in simulation. To use the exploratory policy instead, set the UseExplorationPolicy agent property to true.
Simulate the environment with the trained agent for 50 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(sacAgent,env,simOpts);

sacTotalRwd = sum(experience.Reward)
sacTotalRwd = -16
The agent trace shows that the trained agent reaches the target cell. However, because the stochastic waterfall can push the agent away from the shortest path, the episode takes more moves than the six-move minimum, so the total reward is lower than the ideal 50 - 6 = 44.
Because the SAC agent has a critic, you can display the final value function represented by the critic.
sacAgtFinalV = getLearnableParameters(getCritic(sacAgent));
reshape(max(sacAgtFinalV{1}),8,7)
ans = 8×7 single matrix
-5.9484 -4.9739 0.0000 0.0000 0.0000 2.3252 1.3694
-6.8278 -5.9814 0.1031 0.0000 5.1474 6.4981 3.7063
-7.5995 -6.9054 -4.4824 5.6607 6.6548 6.2080 2.5382
-8.4234 -7.7959 -8.1279 1.8500 0.0000 2.2902 1.3569
-9.2436 -8.6889 -7.8357 -4.4469 1.3079 -0.1922 0.2753
-9.4661 -8.8603 -8.7405 -7.9632 -4.6766 -1.6075 -0.7324
-8.8719 -8.3455 -9.3886 -9.4166 -7.9859 -2.7729 -1.7554
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
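As an illustrative sketch (not part of the original example), you can also recover the greedy action for each state from the same tabular Q-values, assuming the parameter matrix is arranged with one row per action and one column per state, as the max call above implies:

```matlab
% Greedy action index (North = 1, South = 2, East = 3, West = 4)
% for each of the 56 grid cells.
Q = sacAgtFinalV{1};             % assumed numActions-by-numStates table
[~,greedyActions] = max(Q,[],1); % best action per state (column)
reshape(greedyActions,8,7)       % arrange the action indices on the grid
```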
Plot Training and Simulation Metrics
For each agent, collect the total reward from the final simulation episode, the number of training episodes, the total number of agent steps, and the training time.
simReward = [
qTotalRwd
sarsaTotalRwd
lspiTotalRwd
dqnTotalRwd
pgTotalRwd
acTotalRwd
ppoTotalRwd
sacTotalRwd
];
tngEpisodes = [
qTngEps
sarsaTngEps
lspiTngEps
dqnTngEps
pgTngEps
acTngEps
ppoTngEps
sacTngEps
];
tngSteps = [
qTngSteps
sarsaTngSteps
lspiTngSteps
dqnTngSteps
pgTngSteps
acTngSteps
ppoTngSteps
sacTngSteps
];
tngTime = [
qTngTime
sarsaTngTime
lspiTngTime
dqnTngTime
pgTngTime
acTngTime
ppoTngTime
sacTngTime
];
Because the training for the Q-learning, LSPI, PG, AC, and PPO agents does not converge, to avoid visualizing their metrics, set them to NaN.
simReward([1 3 5 6 7]) = NaN;
tngEpisodes([1 3 5 6 7]) = NaN;
tngSteps([1 3 5 6 7]) = NaN;
tngTime([1 3 5 6 7]) = NaN;
Plot the simulation reward, number of training episodes, number of training steps (that is, the number of interactions between the agent and the environment), and the training time. Divide the data by the factors [5e-3 2 1e4 1] for better visualization.
figure;
bar([simReward,tngEpisodes,tngSteps,tngTime]./[5e-3 2 1e4 1])
xticklabels(["Q" "SARSA" "LSPI" "DQN" "PG" "AC" "PPO" "SAC"])
legend( ...
    ["Simulation Reward", ...
    "Training Episodes", ...
    "Training Steps", ...
    "Training Time"], ...
    "Location","northeast")

The plot shows that, for this environment, with the selected random number generator seed and initial conditions, only DQN and SAC successfully complete the task, with SAC requiring slightly less training time. SARSA probably also learns the optimal policy, though it does not reach the target cell in the final simulation.
Finally, note that with a different random seed, the initial parameters of some agents would be different, and therefore, results might be different. For more information on the relative strengths and weaknesses of each agent, see Reinforcement Learning Agents.
Restore the random number stream using the information stored in previousRngState.
rng(previousRngState);
See Also
Functions
Objects
rlEvaluator | rlTrainingOptions | rlSimulationOptions | rlQValueFunction | rlValueFunction | rlVectorQValueFunction | rlDiscreteCategoricalActor
Topics
- Train Reinforcement Learning Agent in MDP Environment
- Train Reinforcement Learning Agent in Basic Grid World
- Compare Agents on Basic Grid World
- Compare Agents on Deterministic Waterfall Grid World
- Reinforcement Learning Environments
- Use Predefined Grid World Environments
- Reinforcement Learning Agents
- Train Reinforcement Learning Agents







