evaluate
Evaluate function approximator object given observation (or observation-action) input data
Since R2022a
Syntax
Description
___ = evaluate(___,UseForward=useForward) allows you to explicitly call a forward pass when computing gradients.
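As a minimal sketch of this syntax, the following code evaluates an actor with and without the UseForward option. The network (including the dropout layer) and the variable names probDefault and probForward are assumptions made for illustration only. Whether the two results differ depends on the layers in the network, since layers such as dropout typically behave differently in a forward pass than at prediction time.

```matlab
% Sketch only: requires Reinforcement Learning Toolbox and Deep Learning Toolbox.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-10 10]);

% Hypothetical network with a dropout layer (an assumption for illustration).
net = dlnetwork([
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(8)
    reluLayer
    dropoutLayer(0.5)
    fullyConnectedLayer(numel(actInfo.Elements))
    ]);
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

obs = {rand(obsInfo.Dimension)};

% Default evaluation.
probDefault = evaluate(actor,obs);

% Evaluation that explicitly requests a forward pass.
probForward = evaluate(actor,obs,UseForward=true);
```

Both calls return a one-element cell array containing the probability of each action.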
Examples
Evaluate Function Approximator Object
This example shows you how to evaluate a function approximator object (that is, an actor or a critic). For this example, the function approximator object is a discrete categorical actor and you evaluate it given some observation data, obtaining in return the action probability distribution and the updated network state.
Load the same environment used in Train PG Agent to Balance Discrete Cart-Pole System, and obtain the observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env)
obsInfo =
  rlNumericSpec with properties:
     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "CartPole States"
    Description: "x, dx, theta, dtheta"
      Dimension: [4 1]
       DataType: "double"
actInfo = getActionInfo(env)
actInfo =
  rlFiniteSetSpec with properties:
       Elements: [-10 10]
           Name: "CartPole Action"
    Description: [0x0 string]
      Dimension: [1 1]
       DataType: "double"
To approximate the policy within the actor, use a recurrent deep neural network. Define the network as an array of layer objects. Get the dimensions of the observation space and the number of possible actions directly from the environment specification objects.
net = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(8)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(numel(actInfo.Elements))
];
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 602
Inputs:
  1 'sequenceinput' Sequence input with 4 dimensions
Create a stochastic actor representation for the network.
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
Use evaluate to return the probability of each of the two possible actions. Note that the type of the returned numbers is single, not double.
[prob,state] = evaluate(actor,{rand(obsInfo.Dimension)});
prob{1}
ans = 2x1 single column vector
0.4847
0.5153
Since a recurrent neural network is used for the actor, the second output argument, representing the updated state of the neural network, is not empty. In this case, it contains the updated (cell and hidden) states for the eight units of the lstm layer used in the network.
state{:}
ans = 8x1 single column vector
-0.0833
0.0619
-0.0066
-0.0651
0.0714
-0.0957
0.0614
-0.0326
ans = 8x1 single column vector
-0.1367
0.1142
-0.0158
-0.1820
0.1305
-0.1779
0.0947
-0.0833
You can use dot notation to extract and set the current state of the recurrent neural network in the actor.
actor.State
ans=2×1 cell array
{8x1 dlarray}
{8x1 dlarray}
actor.State = { dlarray(-0.1*rand(8,1)) dlarray(0.1*rand(8,1)) };
You can obtain action probabilities and updated states for a batch of observations. For example, use a batch of five independent observations.
obsBatch = reshape(1:20,4,1,5,1);
[prob,state] = evaluate(actor,{obsBatch})
prob = 1x1 cell array
{2x5 single}
state=2×1 cell array
{8x5 single}
{8x5 single}
The output arguments contain action probabilities and updated states for each observation in the batch.
Note that the actor treats observation data along the batch length dimension independently, not sequentially.
prob{1}
ans = 2x5 single matrix
0.5303 0.5911 0.6083 0.6158 0.6190
0.4697 0.4089 0.3917 0.3842 0.3810
prob = evaluate(actor,{obsBatch(:,:,[5 4 3 1 2])});
prob{1}
ans = 2x5 single matrix
0.6190 0.6158 0.6083 0.5303 0.5911
0.3810 0.3842 0.3917 0.4697 0.4089
To evaluate the actor using sequential observations, use the sequence length (time) dimension. For example, obtain action probabilities for five independent sequences, each one made of nine sequential observations.
[prob,state] = evaluate(actor, ...
{rand([obsInfo.Dimension 5 9])})
prob = 1x1 cell array
{2x5x9 single}
state=2×1 cell array
{8x5 single}
{8x5 single}
The first output argument contains a vector of two probabilities (first dimension) for each element of the observation batch (second dimension) and for each time element of the sequence length (third dimension).
The second output argument contains two vectors of final states for each observation batch (that is, the network maintains a separate state history for each observation batch).
Display the probability of the second action, after the seventh sequential observation in the fourth independent batch.
prob{1}(2,4,7)
ans = single
0.5653
For more information on input and output format for recurrent neural networks, see the Algorithms section of lstmLayer.
Input Arguments
fcnAppx — Function approximator object
function approximator object
Function approximator object, specified as one of the following:
- rlValueFunction object — Value function critic
- rlQValueFunction object — Q-value function critic
- rlVectorQValueFunction object — Multi-output Q-value function critic with a discrete action space
- rlContinuousDeterministicActor object — Deterministic policy actor with a continuous action space
- rlDiscreteCategoricalActor object — Stochastic policy actor with a discrete action space
- rlContinuousGaussianActor object — Stochastic policy actor with a continuous action space
- rlContinuousDeterministicTransitionFunction object — Continuous deterministic transition function for a model-based agent
- rlContinuousGaussianTransitionFunction object — Continuous Gaussian transition function for a model-based agent
- rlContinuousDeterministicRewardFunction object — Continuous deterministic reward function for a model-based agent
- rlContinuousGaussianRewardFunction object — Continuous Gaussian reward function for a model-based agent
- rlIsDoneFunction object — Is-done function for a model-based agent
inData — Input data for function approximator
cell array
Input data for the function approximator, specified as a cell array with as many elements as the number of input channels of fcnAppx. In the following section, the number of observation channels is indicated by NO.
- If fcnAppx is an rlQValueFunction, an rlContinuousDeterministicTransitionFunction, or an rlContinuousGaussianTransitionFunction object, then each of the first NO elements of inData must be a matrix representing the current observation from the corresponding observation channel. They must be followed by a final matrix representing the action.
- If fcnAppx is a function approximator object representing an actor or critic (but not an rlQValueFunction object), inData must contain NO elements, each one a matrix representing the current observation from the corresponding observation channel.
- If fcnAppx is an rlContinuousDeterministicRewardFunction, an rlContinuousGaussianRewardFunction, or an rlIsDoneFunction object, then each of the first NO elements of inData must be a matrix representing the current observation from the corresponding observation channel. They must be followed by a matrix representing the action, and finally by NO elements, each one being a matrix representing the next observation from the corresponding observation channel.
Each element of inData must be a matrix of dimension MC-by-LB-by-LS, where:
- MC corresponds to the dimensions of the associated input channel.
- LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of (independent) inputs, specify LB > 1. If inData has multiple elements, then LB must be the same for all elements of inData.
- LS specifies the sequence length (length of the sequence of inputs along the time dimension) for a recurrent neural network. If fcnAppx does not use a recurrent neural network (which is the case for environment function approximators, as they do not support recurrent neural networks), then LS = 1. If inData has multiple elements, then LS must be the same for all elements of inData.
For more information on input and output formats for recurrent neural networks, see the Algorithms section of lstmLayer.
Example: {rand(8,3,64,1),rand(4,1,64,1),rand(2,1,64,1)}
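To illustrate the ordering of inData for a Q-value function critic (observation elements first, then the action), here is a hedged sketch. The specifications, layer names ("obsIn", "actIn", and so on), and network structure are hypothetical, and the network is assembled with layerGraph, which is one of several ways to build a two-input network.

```matlab
% Sketch only: requires Reinforcement Learning Toolbox and Deep Learning Toolbox.
obsInfo = rlNumericSpec([4 1]);   % one observation channel, so NO = 1
actInfo = rlNumericSpec([2 1]);   % continuous action channel

% Hypothetical two-input network: observation and action paths are concatenated.
obsPath = featureInputLayer(prod(obsInfo.Dimension),Name="obsIn");
actPath = featureInputLayer(prod(actInfo.Dimension),Name="actIn");
commonPath = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(8,Name="fc1")
    reluLayer(Name="relu1")
    fullyConnectedLayer(1,Name="qOut")
    ];

lgraph = layerGraph(obsPath);
lgraph = addLayers(lgraph,actPath);
lgraph = addLayers(lgraph,commonPath);
lgraph = connectLayers(lgraph,"obsIn","concat/in1");
lgraph = connectLayers(lgraph,"actIn","concat/in2");
net = dlnetwork(lgraph);

critic = rlQValueFunction(net,obsInfo,actInfo, ...
    ObservationInputNames="obsIn",ActionInputNames="actIn");

% Batch of LB = 5 independent observation-action pairs.
% The batch dimension follows the channel dimensions MC.
LB = 5;
obsBatch = rand([obsInfo.Dimension LB]);   % 4-by-1-by-5
actBatch = rand([actInfo.Dimension LB]);   % 2-by-1-by-5

% Observation first, action last.
q = evaluate(critic,{obsBatch,actBatch});
size(q{1})
```

The same batch size LB is used for both elements of inData, as required above.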
useForward — Option to use forward pass
false (default) | true
Output Arguments
outData — Output data from evaluation of function approximator object
cell array
Output data from the evaluation of the function approximator object, returned as a cell array. The size and contents of outData depend on the type of object you use for fcnAppx, and are shown in the following list. Here, NO is the number of observation channels.
- rlContinuousDeterministicTransitionFunction - NO matrices, each one representing the predicted observation from the corresponding observation channel.
- rlContinuousGaussianTransitionFunction - NO matrices representing the mean value of the predicted observation for the corresponding observation channel, followed by NO matrices representing the standard deviation of the predicted observation for the corresponding observation channel.
- rlContinuousGaussianActor - Two matrices representing the mean value and standard deviation of the action, respectively.
- rlDiscreteCategoricalActor - A matrix with the probabilities of each action.
- rlContinuousDeterministicActor - A matrix with the action.
- rlVectorQValueFunction - A matrix with the values of each possible action.
- rlQValueFunction - A matrix with the value of the action.
- rlValueFunction - A matrix with the value of the current observation.
- rlContinuousDeterministicRewardFunction - A matrix with the predicted reward as a function of current observation, action, and next observation following the action.
- rlContinuousGaussianRewardFunction - Two matrices representing the mean value and standard deviation, respectively, of the predicted reward as a function of current observation, action, and next observation following the action.
- rlIsDoneFunction - A vector with the probabilities of the predicted termination status. Termination probabilities range from 0 (no termination predicted) to 1 (termination predicted), and depend (in the most general case) on the values of observation, action, and next observation following the action.
Each element of outData is a matrix of dimensions D-by-LB-by-LS, where:
- D is the vector of dimensions of the corresponding output channel of fcnAppx. Depending on the type of approximator function, this channel can carry a predicted observation (or its mean value or standard deviation), an action (or its mean value or standard deviation), the value (or values) of an observation (or observation-action couple), a predicted reward, or a predicted termination status.
- LB is the batch size (length of a batch of independent inputs).
- LS is the sequence length (length of the sequence of inputs along the time dimension) for a recurrent neural network. If fcnAppx does not use a recurrent neural network (which is the case for environment function approximators, as they do not support recurrent neural networks), then LS = 1.
Note
If fcnAppx is a critic, then evaluate behaves identically to getValue except that it returns results inside a single-cell array. If fcnAppx is an rlContinuousDeterministicActor actor, then evaluate behaves identically to getAction. If fcnAppx is a stochastic actor such as an rlDiscreteCategoricalActor or rlContinuousGaussianActor, then evaluate returns the action probability distribution, while getAction returns a sample action. Specifically, for an rlDiscreteCategoricalActor actor object, evaluate returns the probability of each possible action. For an rlContinuousGaussianActor actor object, evaluate returns the mean and standard deviation of the Gaussian distribution. For these kinds of actors, see also the note in getAction regarding the enforcement of constraints set by the action specification.
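The difference between evaluate and getAction for a stochastic actor can be sketched as follows. The specifications and the small feedforward network are assumptions chosen for brevity. They are not taken from a particular shipped example.

```matlab
% Sketch only: requires Reinforcement Learning Toolbox and Deep Learning Toolbox.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-10 10]);

% Hypothetical feedforward network with one output per possible action.
net = dlnetwork([
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(8)
    reluLayer
    fullyConnectedLayer(numel(actInfo.Elements))
    ]);
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

obs = {rand(obsInfo.Dimension)};

% evaluate returns the probability of each possible action.
prob = evaluate(actor,obs);

% getAction returns an action sampled from that distribution.
act = getAction(actor,obs);
```

Here prob{1} holds one probability per element of actInfo.Elements, while act{1} is one of those elements.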
Note
If fcnAppx is an rlContinuousDeterministicRewardFunction object, then evaluate behaves identically to predict except that it returns results inside a single-cell array. If fcnAppx is an rlContinuousDeterministicTransitionFunction object, then evaluate behaves identically to predict. If fcnAppx is an rlContinuousGaussianTransitionFunction object, then evaluate returns the mean value and standard deviation of the observation probability distribution, while predict returns an observation sampled from this distribution. Similarly, for an rlContinuousGaussianRewardFunction object, evaluate returns the mean value and standard deviation of the reward probability distribution, while predict returns a reward sampled from this distribution. Finally, if fcnAppx is an rlIsDoneFunction object, then evaluate returns the probabilities of the termination status being false or true, respectively, while predict returns a predicted termination status sampled with these probabilities.
nextState — Updated state of function approximator object
cell array
Next state of the function approximator object, returned as a cell array. If fcnAppx does not use a recurrent neural network (which is the case for environment function approximators), then nextState is an empty cell array.
You can set the state of the approximator to state using dot notation. For example:
critic.State = state;
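As a hedged sketch of how the returned state might be carried from one call to the next, the following code steps a recurrent actor through a short sequence one observation at a time. The network mirrors the recurrent actor in the example on this page. The variable obsSeq and the loop structure are assumptions for illustration.

```matlab
% Sketch only: requires Reinforcement Learning Toolbox and Deep Learning Toolbox.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-10 10]);

% Recurrent network with an LSTM layer of eight units.
net = dlnetwork([
    sequenceInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(8)
    reluLayer
    lstmLayer(8,OutputMode="sequence")
    fullyConnectedLayer(numel(actInfo.Elements))
    ]);
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

% Hypothetical sequence: one batch element, three time steps.
obsSeq = rand([obsInfo.Dimension 1 3]);

for t = 1:size(obsSeq,4)
    % Evaluate a single time step and get the updated network state.
    [prob,state] = evaluate(actor,{obsSeq(:,:,1,t)});

    % Store the updated state in the actor so that the next call starts from it.
    actor.State = state;
end
```

After the loop, state contains the hidden and cell states of the LSTM layer following the last observation.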
Tips
When the elements of the cell array in inData are dlarray objects, the elements of the cell array returned in outData are also dlarray objects. This allows evaluate to be used with automatic differentiation.
Specifically, you can write a custom loss function that directly uses evaluate and dlgradient within it, and then use dlfeval and dlaccelerate with your custom loss function. For an example, see Train Reinforcement Learning Policy Using Custom Training Loop and Custom Training Loop with Simulink Action Noise.
Version History
Introduced in R2022a
See Also
Functions
runEpisode | update | rlOptimizer | syncParameters | getValue | getAction | getMaxQValue | getLearnableParameters | setLearnableParameters | predict
Objects
rlValueFunction | rlQValueFunction | rlVectorQValueFunction | rlContinuousDeterministicActor | rlDiscreteCategoricalActor | rlContinuousGaussianActor | rlContinuousDeterministicTransitionFunction | rlContinuousGaussianTransitionFunction | rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction | rlIsDoneFunction