Train PPO Agent for a Lander Vehicle
This example shows how to train a proximal policy optimization (PPO) agent with a discrete action space to land an airborne vehicle on the ground. For more information on PPO agents, see Proximal Policy Optimization (PPO) Agent.
Fix Random Seed Generator to Improve Reproducibility
The example code may involve computation of random numbers at various stages, such as initialization of the agent, creation of the actor and critic, resetting the environment during simulations, generating observations (for stochastic environments), generating exploration actions, and sampling mini-batches of experiences for learning. Fixing the random number stream preserves the sequence of random numbers every time you run the code and improves reproducibility of results. You will fix the random number stream at various locations in the example.
Fix the random number stream with the seed 0 and the random number algorithm Mersenne Twister. For more information on random number generation, see rng.
previousRngState = rng(0,"twister")
previousRngState = struct with fields:
Type: 'twister'
Seed: 0
State: [625x1 uint32]
The output previousRngState
is a structure that contains information about the previous state of the stream. You will restore the state at the end of the example.
Create Environment Object
The environment in this example is a lander vehicle represented by a 3-DOF circular disc with mass. The vehicle has two thrusters for forward and rotational motion. Gravity acts vertically downwards, and there are no aerodynamic drag forces. The training goal is to make the vehicle land on the ground at a specified location.
For this environment:
Motion of the lander vehicle is bounded in X (horizontal axis) from -100 to 100 meters and Y (vertical axis) from 0 to 120 meters.
The goal position is at (0,0) meters and the goal orientation is 0 radians.
The maximum thrust applied by each thruster is 8.5 N.
The sample time is 0.1 seconds.
The observations from the environment are the vehicle's position (x, y), orientation (θ), linear velocities (ẋ, ẏ), angular velocity (θ̇), and a sensor reading that detects rough landing (-1), soft landing (1), or airborne (0) condition. The observations are normalized between -1 and 1.
The environment has a discrete action space. At every time step, the agent selects one of nine discrete action pairs (T_L, T_R), where T_L and T_R are normalized thrust values for the left and right thrusters. The environment step function scales these values to determine the actual thrust values.
At the beginning of every episode, the vehicle starts from a random initial position and orientation. The altitude is always reset to 100 meters.
The reward provided at the time step t depends on the following quantities:
x_t, y_t, ẋ_t, and ẏ_t are the positions and velocities of the lander vehicle along the x and y axes.
d_t is the normalized distance of the lander vehicle from the goal position.
v_t is the normalized speed of the lander vehicle.
d_max and v_max are the maximum distance and speed.
θ_t is the orientation with respect to the vertical axis.
T_L and T_R are the action values for the left and right thrusters.
A sparse reward is given for a soft landing with horizontal and vertical velocities less than 0.5 m/s.
Create a MATLAB® environment using the LanderVehicle
class provided in the example folder.
env = LanderVehicle()
env = LanderVehicle with properties:
    Mass: 1
    L1: 10
    L2: 5
    Gravity: 9.8060
    ThrustLimits: [0 8.5000]
    Ts: 0.1000
    State: [6x1 double]
    LastAction: [2x1 double]
    LastShaping: 0
    DistanceIntegral: 0
    VelocityIntegral: 0
    TimeCount: 0
To view the implementation of the LanderVehicle
class, open the class file.
open("LanderVehicle.m")
Obtain the observation and action specifications from the environment.
obsInfo = getObservationInfo(env)
obsInfo = rlNumericSpec with properties:
    LowerLimit: -Inf
    UpperLimit: Inf
    Name: "states"
    Description: [0x0 string]
    Dimension: [7 1]
    DataType: "double"
actInfo = getActionInfo(env)
actInfo = rlFiniteSetSpec with properties:
    Elements: {9x1 cell}
    Name: "thrusts"
    Description: [0x0 string]
    Dimension: [2 1]
    DataType: "double"
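The nine elements of actInfo are the pairs of normalized thrust values described earlier. As an illustration only, a specification of this form could be constructed with rlFiniteSetSpec as in the following sketch; the thrust levels used here are assumptions, and the actual pairs are defined in the LanderVehicle class.
% Sketch: building a nine-element discrete action space of [left; right]
% thrust pairs. The normalized levels below are assumed for illustration.
thrustLevels = [0 0.5 1];                    % assumed normalized thrust levels
[TL,TR] = ndgrid(thrustLevels,thrustLevels); % all 3x3 combinations
pairs = [TL(:) TR(:)].';                     % each column is one [T_L; T_R] pair
actSpecSketch = rlFiniteSetSpec(num2cell(pairs,1).');
actSpecSketch.Name = "thrusts";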
Create PPO Agent Object
You will create and train a PPO agent in this example. The agent uses:
A value function critic to estimate the value of the policy. The critic takes the current observation as input and returns a single scalar as output (the estimated discounted cumulative long-term reward for following the policy from the state corresponding to the current observation).
A stochastic actor for computing actions. The actor takes an observation as input and returns a random action sampled from a categorical probability distribution over the finite set of possible actions.
The actor and critic functions are approximated using neural network representations. Specify a hidden layer size of 400 for the networks.
hlsz = 400;
Create an agent initialization object to initialize the networks with the specified layer size.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=hlsz);
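Instead of relying on default initialization, you could also define the critic and actor approximators yourself. The following is a minimal sketch using rlValueFunction and rlDiscreteCategoricalActor; the layer choices are illustrative assumptions and not necessarily the networks that rlPPOAgent creates by default.
% Sketch: manually defining a value-function critic and a discrete
% categorical actor (layer choices are illustrative assumptions).
criticNet = dlnetwork([
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(hlsz)
    reluLayer
    fullyConnectedLayer(1)]);
critic = rlValueFunction(criticNet,obsInfo);

actorNet = dlnetwork([
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(hlsz)
    reluLayer
    fullyConnectedLayer(numel(actInfo.Elements))]);
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
With manually defined approximators, you would create the agent as rlPPOAgent(actor,critic) and then assign the agent options described below.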
Specify hyperparameters for training using the rlPPOAgentOptions object:
Specify an experience horizon of 900 steps. A large experience horizon can improve the stability of the training.
Train with 3 epochs and mini-batches of length 300. Smaller mini-batches are computationally efficient but may introduce variance in training. Conversely, larger batch sizes can make the training more stable but require more memory.
Specify a learning rate of 5e-4 for the actor and 9e-3 for the critic. A large learning rate causes drastic updates that may lead to divergent behavior, while a low value may require many updates before reaching the optimum.
Specify an objective function clip factor of 0.04 to improve training stability.
Specify a discount factor of 0.998 to promote long-term rewards.
Specify an entropy loss weight of 0.02 to enhance exploration during training.
agentOpts = rlPPOAgentOptions(...
    ExperienceHorizon = 900,...
    ClipFactor = 0.04,...
    EntropyLossWeight = 0.02,...
    ActorOptimizerOptions = rlOptimizerOptions(LearnRate=5e-4),...
    CriticOptimizerOptions = rlOptimizerOptions(LearnRate=9e-3),...
    MiniBatchSize = 300,...
    NumEpoch = 3,...
    SampleTime = env.Ts,...
    DiscountFactor = 0.998);
Create the PPO agent object. When you create the agent, the initial parameters of the actor and critic networks are initialized with random values. Fix the random number stream so that the agent is always initialized with the same parameter values.
rng(0,"twister");
agent = rlPPOAgent(obsInfo,actInfo,initOpts,agentOpts);
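As an optional check, not part of the original workflow, you can query the untrained agent for an action from a random observation to verify the input and output dimensions.
% Optional check: the returned cell array should contain one of the
% nine 2-by-1 normalized thrust pairs.
getAction(agent,{rand(obsInfo.Dimension)})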
Train Agent
To train the PPO agent, specify the following training options.
Run the training for at most 10000 episodes, with each episode lasting at most 600 time steps.
Specify an averaging window length of 100 for the episode reward.
Stop the training when the average reward reaches 450 for 100 consecutive episodes.
% Training options
trainOpts = rlTrainingOptions(...
    MaxEpisodes=10000,...
    MaxStepsPerEpisode=600,...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=450,...
    ScoreAveragingWindowLength=100);
Fix the random stream for reproducibility.
rng(0,"twister");
Train the agent using the train function. Due to the complexity of the environment, the training process is computationally intensive and takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false.
doTraining = false;
if doTraining
    trainingStats = train(agent, env, trainOpts);
else
    load("landerVehicleAgent.mat");
end
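If you set doTraining to true and train the agent yourself, you can save the trained agent for later reuse. The file name below mirrors the pretrained file loaded in this example.
% Save the trained agent for later reuse (only meaningful after training).
if doTraining
    save("landerVehicleAgent.mat","agent");
end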
An example of the training is shown below. The actual results may vary because of randomness in the training process.
Simulate Trained Agent
Plot the environment first to create a visualization for the lander vehicle.
plot(env)
Set up simulation options to perform 5 simulations. For more information, see rlSimulationOptions.
simOptions = rlSimulationOptions(MaxSteps=600);
simOptions.NumSimulations = 5;
Fix the random stream for reproducibility.
rng(0,"twister");
Simulate the trained agent within the environment. For more information, see sim.
experience = sim(env, agent, simOptions);
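The sim output experience is a structure array with one entry per simulation, and its Reward field is logged as a timeseries. As a quick sketch, you can compute the total reward accumulated in each simulation as follows.
% Total reward accumulated in each of the five simulations.
totalReward = arrayfun(@(e) sum(e.Reward.Data), experience)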
Plot the time history of the states for all simulations using the helper function plotLanderVehicleTrajectory
provided in the example folder.
% Observations to plot
obsToPlot = ["x", "y", "dx", "dy", "theta", "dtheta", "landing"];

% Create a figure
f = figure();
f.Position(3:4) = [800,1000];

% Create a tiled layout for the plots
t = tiledlayout(f, 4, 2, TileSpacing="compact");

% Plot the data
for ct = 1:numel(obsToPlot)
    ax = nexttile(t);
    plotLanderVehicleTrajectory(ax, experience, env, obsToPlot(ct));
end
Restore the random number stream using the information stored in previousRngState
.
rng(previousRngState);
Related Examples
- Train Discrete Soft Actor Critic Agent for Lander Vehicle
- Train DDPG Agent to Control Sliding Robot
- Train PPO Agent for Automatic Parking Valet
- Train Multiple Agents to Perform Collaborative Task