Train PPO Agent for a Lander Vehicle

This example shows how to train a proximal policy optimization (PPO) agent with a discrete action space to land an airborne vehicle on the ground. For more information on PPO agents, see Proximal Policy Optimization (PPO) Agent.

Fix Random Seed Generator to Improve Reproducibility

The example code may involve computation of random numbers at various stages, such as initialization of the agent, creation of the actor and critic, resetting the environment during simulations, generating observations (for stochastic environments), generating exploration actions, and sampling mini-batches of experiences for learning. Fixing the random number stream preserves the sequence of random numbers every time you run the code and improves the reproducibility of results. You will fix the random number stream at various locations in the example. Fix the random number stream with the seed 0 and the Mersenne Twister random number algorithm. For more information on random number generation, see rng.

previousRngState = rng(0,"twister")
previousRngState = struct with fields:
     Type: 'twister'
     Seed: 0
    State: [625x1 uint32]

The output previousRngState is a structure that contains information about the previous state of the stream. You will restore the state at the end of the example.

Create Environment Object

The environment in this example is a lander vehicle represented by a 3-DOF circular disc with mass. The vehicle has two thrusters for forward and rotational motion. Gravity acts vertically downwards, and there are no aerodynamic drag forces. The training goal is to make the vehicle land on the ground at a specified location.

For this environment:

  • Motion of the lander vehicle is bounded in X (horizontal axis) from -100 to 100 meters and Y (vertical axis) from 0 to 120 meters.

  • The goal position is at (0,0) meters and the goal orientation is 0 radians.

  • The maximum thrust applied by each thruster is 8.5 N.

  • The sample time is 0.1 seconds.

  • The observations from the environment are the vehicle's position (x, y), orientation (θ), velocity (ẋ, ẏ), angular velocity (θ̇), and a sensor reading that detects a rough landing (-1), soft landing (1), or airborne (0) condition. The observations are normalized between -1 and 1.

  • The environment has a discrete action space. At every time step, the agent selects one of the following nine discrete action pairs:

    L,L - do nothing
    L,M - fire right (med)
    L,H - fire right (high)
    M,L - fire left (med)
    M,M - fire left (med) + right (med)
    M,H - fire left (med) + right (high)
    H,L - fire left (high)
    H,M - fire left (high) + right (med)
    H,H - fire left (high) + right (high)

Here, L = 0.0, M = 0.5, and H = 1.0 are the normalized thrust values for each thruster. The environment step function scales these values to determine the actual thrust values; a short scaling sketch follows the action specification later in the example.

  • At the beginning of every episode, the vehicle starts from a random initial x position and orientation. The altitude is always reset to 100 meters.

  • The reward r_t provided at time step t is as follows.

r_t = (s_t − s_{t−1}) − 0.1θ_t² − 0.01(L_t² + R_t²) + 500c
s_t = 1 − (d̂_t + v̂_t²)
c = (y_t ≤ 0) && (ẏ_t ≥ −0.5) && (|ẋ_t| ≤ 0.5)

Here:

  • x_t, y_t, ẋ_t, and ẏ_t are the positions and velocities of the lander vehicle along the x and y axes.

  • d̂_t = √(x_t² + y_t²)/d_max is the normalized distance of the lander vehicle from the goal position.

  • v̂_t = √(ẋ_t² + ẏ_t²)/v_max is the normalized speed of the lander vehicle.

  • d_max and v_max are the maximum distance and speed, respectively.

  • θ_t is the orientation with respect to the vertical axis.

  • L_t and R_t are the action values for the left and right thrusters.

  • c is a sparse reward for soft-landing with horizontal and vertical velocities less than 0.5 m/s.
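As a concrete illustration, the following is a minimal sketch of this reward computation in MATLAB. It is not the implementation used by the example; the function name, its signature, and the normalization constants dMax and vMax are assumptions for illustration, and the actual reward logic is implemented in the LanderVehicle class step function.

function [r, shaping] = landerRewardSketch(x, y, theta, dx, dy, action, lastShaping, dMax, vMax)
% Hypothetical helper (not part of the example files) illustrating the reward above.
dHat = sqrt(x^2 + y^2)/dMax;      % normalized distance from the goal position
vHat = sqrt(dx^2 + dy^2)/vMax;    % normalized speed
shaping = 1 - (dHat + vHat^2);    % shaping term s_t

% Sparse bonus c for a soft landing with small horizontal and vertical speeds
c = (y <= 0) && (dy >= -0.5) && (abs(dx) <= 0.5);

% Total reward: shaping difference, orientation and thrust penalties, landing bonus
r = (shaping - lastShaping) - 0.1*theta^2 ...
    - 0.01*(action(1)^2 + action(2)^2) + 500*c;
end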

Create a MATLAB® environment using the LanderVehicle class provided in the example folder.

env = LanderVehicle()
env = 
  LanderVehicle with properties:

                Mass: 1
                  L1: 10
                  L2: 5
             Gravity: 9.8060
        ThrustLimits: [0 8.5000]
                  Ts: 0.1000
               State: [6x1 double]
          LastAction: [2x1 double]
         LastShaping: 0
    DistanceIntegral: 0
    VelocityIntegral: 0
           TimeCount: 0

To view the implementation of the LanderVehicle class, open the class file.

open("LanderVehicle.m")

Obtain the observation and action specifications from the environment.

obsInfo = getObservationInfo(env)
obsInfo = 
  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "states"
    Description: [0x0 string]
      Dimension: [7 1]
       DataType: "double"

actInfo = getActionInfo(env)
actInfo = 
  rlFiniteSetSpec with properties:

       Elements: {9x1 cell}
           Name: "thrusts"
    Description: [0x0 string]
      Dimension: [2 1]
       DataType: "double"

Create PPO Agent Object

You will create and train a PPO agent in this example. The agent uses:

  • A value function critic to estimate the value of the policy. The critic takes the current observation as input and returns a single scalar as output (the estimated discounted cumulative long-term reward for following the policy from the state corresponding to the current observation).

  • A stochastic actor for computing actions. This actor takes an observation as input and returns as output a random action sampled (among the finite number of possible actions) from a categorical probability distribution.

The actor and critic functions are approximated using neural network representations. Specify a hidden layer size of 400 for the networks.

hlsz = 400;

Create an agent initialization object to initialize the networks with the specified layer size.

initOpts = rlAgentInitializationOptions(NumHiddenUnit=hlsz);

Specify hyperparameters for training using the rlPPOAgentOptions object:

  • Specify an experience horizon of 900 steps. A large experience horizon can improve the stability of the training.

  • Train with 3 epochs and mini-batches of length 300. Smaller mini-batches are computationally efficient but may introduce variance in training. Conversely, larger batch sizes can make training more stable but require more memory.

  • Specify a learning rate of 5e-4 for the actor and 9e-3 for the critic. A large learning rate causes drastic updates that can lead to divergent behavior, while a low value can require many updates before reaching the optimum.

  • Specify an objective function clip factor of 0.04 for improving training stability.

  • Specify a discount factor value of 0.998 to promote long-term rewards.

  • Specify an entropy loss weight factor of 0.02 to enhance exploration during training.

agentOpts = rlPPOAgentOptions(...
    ExperienceHorizon       = 900,...
    ClipFactor              = 0.04,...
    EntropyLossWeight       = 0.02,...
    ActorOptimizerOptions   = rlOptimizerOptions(LearnRate=5e-4),...
    CriticOptimizerOptions  = rlOptimizerOptions(LearnRate=9e-3),...
    MiniBatchSize           = 300,...
    NumEpoch                = 3,...
    SampleTime              = env.Ts,...
    DiscountFactor          = 0.998);

Create the PPO agent object. When you create the agent, the initial parameters of the actor and critic networks are initialized with random values. Fix the random number stream so that the agent is always initialized with the same parameter values.

rng(0,"twister");
agent = rlPPOAgent(obsInfo,actInfo,initOpts,agentOpts);
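Optionally, you can inspect the actor and critic created by the default initialization. The following sketch uses getActor, getCritic, and getModel to extract the underlying networks; in recent toolbox releases getModel returns a dlnetwork object that you can pass to summary.

% Extract the actor and critic function approximators from the agent
actor  = getActor(agent);
critic = getCritic(agent);

% View summaries of the underlying neural networks
summary(getModel(actor))
summary(getModel(critic))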

Train Agent

To train the PPO agent, specify the following training options.

  • Run the training for at most 10000 episodes, with each episode lasting at most 600 time steps.

  • Specify an averaging window length of 100 for the episode reward.

  • Stop the training when the average reward over 100 consecutive episodes reaches 450.

% training options
trainOpts = rlTrainingOptions(...
    MaxEpisodes=10000,...
    MaxStepsPerEpisode=600,...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=450,...
    ScoreAveragingWindowLength=100);

Fix the random stream for reproducibility.

rng(0,"twister");

Train the agent using the train function. Due to the complexity of the environment, the training process is computationally intensive and takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false.

doTraining = false;
if doTraining
    trainingStats = train(agent, env, trainOpts);
else
    load("landerVehicleAgent.mat");
end
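If you train the agent yourself, you can save it so that later runs can load it instead of retraining. For example (this overwrites the pretrained agent file shipped with the example, so you may prefer a different file name):

% Save the trained agent to a MAT file
save("landerVehicleAgent.mat","agent")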

An example of the training is shown below. The actual results may vary because of randomness in the training process.

Simulate Trained Agent

Plot the environment first to create a visualization for the lander vehicle.

plot(env)

Set up simulation options to perform 5 simulations of at most 600 time steps each. For more information, see rlSimulationOptions.

simOptions = rlSimulationOptions(MaxSteps=600);
simOptions.NumSimulations = 5;

Fix the random stream for reproducibility.

rng(0,"twister");

Simulate the trained agent within the environment. For more information see sim.

experience = sim(env, agent, simOptions);

Figure: the Lander Vehicle visualization showing the vehicle during the simulations.

Plot the time history of the states for all simulations using the helper function plotLanderVehicleTrajectory provided in the example folder.

% Observations to plot
obsToPlot = ["x", "y", "dx", "dy", "theta", "dtheta", "landing"];

% Create a figure
f = figure();
f.Position(3:4) = [800,1000];

% Create a tiled layout for the plots
t = tiledlayout(f, 4, 2, TileSpacing="compact");

% Plot the data
for ct = 1:numel(obsToPlot)
    ax = nexttile(t);
    plotLanderVehicleTrajectory(ax, experience, env, obsToPlot(ct));
end

The resulting figure shows, for all five simulations, the time histories of the x and y positions (m), x and y velocities (m/s), angle (rad), angular velocity (rad/s), and the landing flag (0 airborne, 1 soft landing, -1 rough landing).
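You can also inspect the logged data in the experience output directly. The following sketch reads the final value of the landing sensor for each simulation; it assumes the landing flag is the last of the seven observation channels, consistent with the plots above.

% Report the final landing flag for each simulation
% (1 = soft landing, -1 = rough landing, 0 = still airborne)
for ct = 1:numel(experience)
    obsData = experience(ct).Observation.states.Data;   % 7-by-1-by-(T+1) array
    fprintf("Simulation %d: landing flag = %g\n", ct, obsData(7,1,end));
end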

Restore the random number stream using the information stored in previousRngState.

rng(previousRngState);
