getValue

Obtain estimated value from a critic given environment observations and actions

Description

Value Function Critic

value = getValue(valueFcnAppx,obs) evaluates the value function critic valueFcnAppx and returns the value corresponding to the observation obs. In this case, valueFcnAppx is an rlValueFunction approximator object.

Q-Value Function Critics

value = getValue(vqValueFcnAppx,obs) evaluates the discrete-action-space Q-value function critic vqValueFcnAppx and returns the vector value, in which each element represents the estimated value given the state corresponding to the observation obs and the action corresponding to the element number of value. In this case, vqValueFcnAppx is an rlVectorQValueFunction approximator object.

value = getValue(qValueFcnAppx,obs,act) evaluates the Q-value function critic qValueFcnAppx and returns the scalar value, representing the value given the observation obs and action act. In this case, qValueFcnAppx is an rlQValueFunction approximator object.

Return Recurrent Neural Network State

[value,state] = getValue(___) also returns the updated state of the critic object when it contains a recurrent neural network.

Use Forward

___ = getValue(___,UseForward=useForward) allows you to explicitly call a forward pass when computing gradients.

Examples

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

To approximate the value function within the critic, create a neural network. Define a single path from the network input (the observation) to its output (the value), as an array of layer objects.

net = [ featureInputLayer(4) ...
        fullyConnectedLayer(1)];

Convert the network to a dlnetwork object and display the number of weights.

net = dlnetwork(net);
summary(net);
   Initialized: true

   Number of learnables: 5

   Inputs:
      1   'input'   4 features

Create a critic using the network and the observation specification object. When you use this syntax, the network input layer is automatically associated with the environment observation according to the dimension specifications in obsInfo.

critic = rlValueFunction(net,obsInfo);

Obtain a value function estimate for a random single observation. Use an observation array with the same dimensions as the observation specification.

val = getValue(critic,{rand(4,1)})
val = single

0.7904

You can also obtain value function estimates for a batch of observations. For example, obtain value function estimates for a batch of 20 observations.

batchVal = getValue(critic,{rand(4,1,20)});
size(batchVal)
ans = 1×2

     1    20

batchVal contains one value function estimate for each observation in the batch.

Create observation and action specification objects (or alternatively use getObservationInfo and getActionInfo to extract the specification objects from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles, and the action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);

Create a vector Q-value function approximator to use as a critic. A vector Q-value function takes only the observation as input and returns as output a single vector with as many elements as the number of possible actions. The value of each output element represents the expected discounted cumulative long-term reward for taking the action from the state corresponding to the current observation, and following the policy afterwards.

To model the parametrized vector Q-value function within the critic, use a neural network. The network must have one input layer that accepts a four-element vector, as defined by obsInfo. The output must be a single output layer having as many elements as the number of possible discrete actions (three in this case, as defined by actInfo).

Define a single path from the network input to its output as an array of layer objects.

net = [
    featureInputLayer(4) 
    fullyConnectedLayer(3)
    ];

Convert the network to a dlnetwork object and display the number of weights.

net = dlnetwork(net);
summary(net)
   Initialized: true

   Number of learnables: 15

   Inputs:
      1   'input'   4 features

Create the critic using the network, as well as the observation and action specification objects. The network input layer is automatically associated with the observation channel, according to the dimension specifications in obsInfo.

critic = rlVectorQValueFunction(net,obsInfo,actInfo);

Use getValue to return the values of a random observation, using the current network weights.

v = getValue(critic,{rand(obsInfo.Dimension)})
v = 3×1 single column vector

    0.7232
    0.8177
   -0.2212

v contains three value function estimates, one for each possible discrete action.
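Because each element of the returned vector corresponds to one discrete action, the estimates can be used to pick the action with the highest estimated value. The following is a minimal sketch under the assumption that the element order of the output matches the order of actInfo.Elements; the variable names q, idx, and greedyAction are illustrative only.

```matlab
% Sketch: select the action with the largest estimated Q-value.
% Assumes element i of the output corresponds to actInfo.Elements(i).
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);

net = dlnetwork([
    featureInputLayer(4)
    fullyConnectedLayer(3)
    ]);
critic = rlVectorQValueFunction(net,obsInfo,actInfo);

% One Q-value estimate per discrete action (3-by-1 vector).
q = getValue(critic,{rand(obsInfo.Dimension)});

% Index of the largest estimate, then the matching action value.
[~,idx] = max(q);
greedyAction = actInfo.Elements(idx)
```

The network weights are randomly initialized, so the selected action changes from run to run.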

You can also obtain value function estimates for a batch of observations. For example, obtain value function estimates for a batch of 10 observations.

batchV = getValue(critic,{rand([obsInfo.Dimension 10])});
size(batchV)
ans = 1×2

     3    10

batchV contains three value function estimates for each observation in the batch.

Create observation and action specification objects (or alternatively use getObservationInfo and getActionInfo to extract the specification objects from an environment). For this example, define the observation space as having two continuous channels, the first one carrying an 8-by-3 matrix, and the second one a continuous four-dimensional vector. The action specification is a continuous column vector containing two doubles.

obsInfo = [rlNumericSpec([8 3]), rlNumericSpec([4 1])];
actInfo = rlNumericSpec([2 1]);

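Create a custom basis function and its initial weight matrix. Although each channel carries a 2-D array, the corresponding inputs to myBasisFcn also have batch and sequence dimensions, so the function reshapes each input into columns and keeps the remaining dimensions as the third dimension.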

myBasisFcn = @(obsA,obsB,act) [...
    ones(30,1,size(obsA,3),like=obsA);
    reshape(obsA,24,1,[]); 
    reshape(obsB,4,1,[]); 
    reshape(act,2,1,[]);
    reshape(obsA,24,1,[]).^2; 
    reshape(obsB,4,1,[]).^2; 
    reshape(act,2,1,[]).^2;
    sin(reshape(obsA,24,1,[])); 
    sin(reshape(obsB,4,1,[])); 
    sin(reshape(act,2,1,[]));
    cos(reshape(obsA,24,1,[])); 
    cos(reshape(obsB,4,1,[])); 
    cos(reshape(act,2,1,[]))];
W0 = rand(150,1);

The output of the critic is the scalar W'*myBasisFcn(obsA,obsB,act), where W is the weight matrix (initialized to W0). This scalar represents the Q-value function to be approximated.

Create the critic.

critic = rlQValueFunction({myBasisFcn,W0}, ...
    obsInfo,actInfo);

Use getValue to return the value of a random observation-action pair, using the current parameter matrix.

v = getValue(critic,{rand(8,3),(1:4)'},{rand(2,1)})
v = 
72.7248
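To make the relation between the weights, the basis function, and the critic output concrete, the following sketch compares getValue against a manual evaluation of W'*basis. It deliberately uses a smaller, hypothetical basis function (smallBasisFcn) and smaller specifications than the example above, and it assumes the critic still holds the initial weights W0.

```matlab
% Sketch: compare getValue with a manual W'*basis evaluation.
% smallBasisFcn and the specification sizes are illustrative assumptions.
obsInfo = rlNumericSpec([2 1]);
actInfo = rlNumericSpec([1 1]);

% Six basis features: the inputs and their squares.
smallBasisFcn = @(obs,act) [obs; act; obs.^2; act.^2];
W0 = rand(6,1);

critic = rlQValueFunction({smallBasisFcn,W0},obsInfo,actInfo);

o = rand(2,1);
a = rand(1,1);

vCritic = getValue(critic,{o},{a});   % value computed by the critic
vManual = W0'*smallBasisFcn(o,a);     % same quantity computed by hand

% Under the assumptions above, the difference should be at the level
% of numerical precision.
abs(double(vCritic) - vManual)
```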

Create a random observation set of batch size 64 for each channel. The third dimension is the batch size, while the fourth is the sequence length for any recurrent neural network used by the critic (in this case not used).

batchobs_ch1 = rand(8,3,64,1);
batchobs_ch2 = rand(4,1,64,1);

Create a random action set of batch size 64.

batchact = rand(2,1,64,1);

Obtain the state-action value function estimate for the batch of observations and actions.

bv = getValue(critic,{batchobs_ch1,batchobs_ch2},{batchact});
size(bv)
ans = 1×2

     1    64

bv(23)
ans = 
44.8497

Input Arguments

valueFcnAppx: Value function critic, specified as an rlValueFunction approximator object.

vqValueFcnAppx: Vector Q-value function critic, specified as an rlVectorQValueFunction approximator object.

qValueFcnAppx: Q-value function critic, specified as an rlQValueFunction approximator object.

obs: Observations, specified as a cell array with as many elements as there are observation input channels. Each element of obs contains an array of observations for a single observation input channel.

The dimensions of each element in obs are MO-by-LB-by-LS, where:

  • MO corresponds to the dimensions of the associated observation input channel.

  • LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of observations, specify LB > 1. If the critic object given as first input argument has multiple observation input channels, then LB must be the same for all elements of obs.

  • LS specifies the sequence length for a recurrent neural network. If the critic object given as first input argument does not use a recurrent neural network, then LS = 1. If the critic object has multiple observation input channels, then LS must be the same for all elements of obs.

LB and LS must be the same for both act and obs.

For more information on input and output formats for recurrent neural networks, see the Algorithms section of lstmLayer.
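The dimension rules above can be sketched as a short loop that builds the obs cell array directly from the observation specifications. The two-channel specification and the names LB, LS, and k are assumptions chosen for illustration.

```matlab
% Sketch: build a batched observation cell array from the specifications.
obsInfo = [rlNumericSpec([8 3]), rlNumericSpec([4 1])];

LB = 64;   % batch size, identical for every channel
LS = 1;    % sequence length (1 when no recurrent network is used)

obs = cell(1,numel(obsInfo));
for k = 1:numel(obsInfo)
    % Channel dimensions first, then the batch and sequence dimensions.
    obs{k} = rand([obsInfo(k).Dimension LB LS]);
end

% Display the size of each channel array.
cellfun(@size,obs,UniformOutput=false)
```

Because MATLAB drops trailing singleton dimensions, the sequence dimension is not shown by size when LS = 1.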

act: Action, specified as a single-element cell array that contains an array of action values.

The dimensions of this array are MA-by-LB-by-LS, where:

  • MA corresponds to the dimensions of the associated action specification.

  • LB is the batch size. To specify a single action, set LB = 1. To specify a batch of actions, specify LB > 1.

  • LS specifies the sequence length for a recurrent neural network. If the critic object given as a first input argument does not use a recurrent neural network, then LS = 1.

LB and LS must be the same for both act and obs.

For more information on input and output formats for recurrent neural networks, see the Algorithms section of lstmLayer.

useForward: Option to use forward pass, specified as a logical value. When you specify UseForward=true, the function calculates its outputs using forward instead of predict. This allows layers such as batch normalization and dropout to appropriately change their behavior for training.

Example: true
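As an illustration of this option, the sketch below evaluates the same critic with and without UseForward=true. The network, including its dropout layer, is a hypothetical choice made so that the training and inference behavior of a layer can differ.

```matlab
% Sketch: evaluate a critic using predict (default) and using forward.
% The dropout network is an illustrative assumption.
obsInfo = rlNumericSpec([4 1]);

net = dlnetwork([
    featureInputLayer(4)
    fullyConnectedLayer(8)
    dropoutLayer(0.5)
    fullyConnectedLayer(1)
    ]);
critic = rlValueFunction(net,obsInfo);

obs = {rand(4,1)};

% Default: outputs are computed with predict (inference behavior).
vPredict = getValue(critic,obs);

% UseForward=true: outputs are computed with forward (training behavior),
% so the dropout layer can randomly zero some activations.
vForward = getValue(critic,obs,UseForward=true);
```

With dropout active, repeated calls using UseForward=true can return different values for the same observation.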

Output Arguments

value: Estimated value function, returned as an array with dimensions N-by-LB-by-LS, where:

  • N is the number of outputs of the critic network.

    • For a state value critic (valueFcnAppx), N = 1.

    • For a single-output state-action value function critic (qValueFcnAppx), N = 1.

    • For a multi-output state-action value function critic (vqValueFcnAppx), N is the number of discrete actions.

  • LB is the batch size.

  • LS is the sequence length for a recurrent neural network.

state: Updated state of the critic, returned as a cell array. If the critic does not use a recurrent neural network, then state is an empty cell array.

You can set the state of the critic to state using dot notation. For example:

valueFcnAppx.State=state;
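A minimal sketch of this workflow with a recurrent critic follows. The LSTM network and the sequence length of 5 are assumptions for illustration. The observation array places the batch size in the third dimension and the sequence length in the fourth.

```matlab
% Sketch: obtain the updated recurrent state and store it in the critic.
% The network architecture and sequence length are illustrative assumptions.
obsInfo = rlNumericSpec([4 1]);

net = dlnetwork([
    sequenceInputLayer(4)
    lstmLayer(8)
    fullyConnectedLayer(1)
    ]);
critic = rlValueFunction(net,obsInfo);

% One sequence (batch size 1) of five observations.
seqObs = {rand(4,1,1,5)};

% Return the value estimates and the updated network state.
[val,state] = getValue(critic,seqObs);
size(val)

% Carry the state over to the next evaluation using dot notation.
critic.State = state;
```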

Tips

Version History

Introduced in R2020a