Deep Reinforcement Learning for Optimal Trade Execution
This example shows how to use the Reinforcement Learning Toolbox™ and Deep Learning Toolbox™ to design agents for optimal trade execution.
Introduction
Optimal trade execution seeks to minimize trading costs when selling or buying a set number of stock shares over a certain time horizon within the trading day. Optimization is necessary because executing trades too quickly results in suboptimal prices due to market impact, while executing trades too slowly exposes the trader to the risk of adverse price changes or the inability to execute all trades within the time horizon. This optimization affects nearly all trading strategies and portfolio management practices, since minimizing trading costs is directly linked to the profitability of buying and selling decisions. Also, brokers compete to provide better trade order execution, and many jurisdictions legally require brokers to provide the best execution for their clients. Early studies by Bertsimas and Lo [1] and Almgren and Chriss [2] derived analytical solutions for the optimal trade execution problem by assuming a model for the underlying prices. Later, Nevmyvaka et al. [3] demonstrated that using reinforcement learning (RL) for optimal execution does not require assumptions about the market microstructure. More recent studies use deep reinforcement learning, which combines RL with deep learning, for optimal trade execution (Ning et al. [4], Lin and Beling [5,6]). This approach overcomes the limitations of Q-learning noted by Nevmyvaka et al. [3].
This example uses deep reinforcement learning to design and train two deep Q-network (DQN) agents for optimal trade execution. One DQN agent is for sell trades and the other agent is for buy trades.
Data
This example uses the intraday trading data on Intel Corporation stock (INTC) provided by LOBSTER [7] and included with Financial Toolbox™. While the reinforcement learning for optimal execution literature typically uses trading data over one year or more to train and test RL agents [3,4,5,6], in this proof-of-concept example you use a shorter data set, which you can expand for further study. The intraday trading data on the Intel stock (INTC_2012-06-21_34200000_57600000_orderbook_5.csv) contains limit order book data over 6.5 hours, including bid and ask prices as well as corresponding shares for 5 levels. This data is sampled at 5-second intervals to create 390 one-minute trading horizons, each containing 12 steps. Out of these 390 horizons, you use 293 horizons (about 75 percent) to train the agents and the remaining 97 horizons (about 25 percent) to test the agents.
Baseline Trade Execution Algorithm
An agent's performance must be compared with a baseline trade execution algorithm. For this example, the baseline trade execution algorithm is the time-weighted average price (TWAP) policy. Under this policy, the total trading shares are divided equally over the trading horizon, so that the same number of shares is traded at each time step. In contrast, the original paper by Nevmyvaka et al. [3] used a "submit and leave" policy as the baseline, where a limit order for the entire trading amount is placed at a fixed price for the entire horizon, and any remaining unfilled shares are traded with a market order at the last step. However, later studies [4,5,6] adopted the TWAP baseline, since it has become more popular in recent years with the increasing use of algorithmic trading, and it is also a highly effective strategy that is thought to be optimal when the price movement follows a Brownian motion [4,5,6].
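As a minimal illustration of the TWAP schedule (the share count and step count below are hypothetical, not the values used later in this example), the order is simply split evenly across the steps of the horizon:

% Minimal TWAP schedule sketch with hypothetical values
totalShares = 1200;  % total shares to trade over one horizon
numSteps = 12;       % number of trading intervals in the horizon
twapSharesPerStep = round(totalShares/numSteps).*ones(numSteps,1); % 100 shares at every step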
Agents
Different types of agents are used in reinforcement learning studies of optimal execution, including Q-learning [3], DQN [4,5], and PPO [6] agents. In this example, you design and train DQN agents using the Reinforcement Learning Designer (Reinforcement Learning Toolbox) after creating separate training environments for sell trades and buy trades. These DQN agents have only a critic, without an actor, and they use the default setting of UseDoubleDQN set to true, while DiscountFactor is set to 1 due to the short trading horizon [4,5,6]. In addition, the critic network of these agents includes a long short-term memory (LSTM) network because the trading data is sequential in time, and LSTM networks are effective in designing agents for optimal execution [6]. For more information on setting options for DQN agents, see rlDQNAgentOptions (Reinforcement Learning Toolbox).
Observation Space
The observation space defines which information (or feature) to provide to the agents at each step. The agents use this observation to decide the next action, based on the policy learned and reinforced by the rewards received after taking the previous actions. Designing a good observation space for the agents is an area of active research in optimal execution studies. You can experiment with different observation spaces, using variables such as the bid-ask spread, the bid or ask volume, price trends, or even the entire limit order book prices and shares [6]. This example uses the following observation variables:
Remaining shares to be traded by the agent in the current trading horizon
Remaining time intervals (steps) left in the current trading horizon
Cumulative implementation shortfall (IS) of agent until the previous step
Current limit order book price divergence from arrival price
The first two observation variables (remaining shares to be traded and remaining time in horizon) are called "private" variables by Nevmyvaka et al. [3], as they only apply to the specific situation of the trading agents, and they do not apply to other market participants. As the agent trained by Nevmyvaka et al. [3] using only "private" variables was able to outperform the "submit and leave" baseline policy, the "private" variables are also used in the subsequent studies [4,5,6].
The third variable, cumulative IS, is a measure of trading cost that the agents seek to minimize. As the name "shortfall" indicates, a lower implementation shortfall implies better trade execution. Technically, Implementation Shortfall should include the opportunity cost of failing to execute all shares within the trading horizon. In this example, as with many of the previously mentioned studies, the Implementation Shortfall is computed under the ending inventory constraint:
For a sell trade, the constraint is to have zero ending inventory, assuming successful liquidation of all shares.
For a buy trade, the constraint is to have full ending inventory, assuming successful purchase of all shares.
Under this ending inventory constraint, Implementation Shortfall is computed as follows:

For a sell trade:

$$IS = ArrivalPrice \times \sum_{k} n_k - \sum_{k} p_k\, n_k$$

For a buy trade:

$$IS = \sum_{k} p_k\, n_k - ArrivalPrice \times \sum_{k} n_k$$

Here, $p_k$ and $n_k$ are the price and number of shares of the $k$th fill executed over the trading horizon.
The arrival price is the first bid or ask price observed at the beginning of the trade horizon. If Implementation Shortfall is positive, it implies that the average executed price was worse than the arrival price, while a negative Implementation Shortfall implies that the average executed price was better than the arrival price. The cumulative Implementation Shortfall from the beginning of the trade horizon until the previous step reflects the trading cost incurred by the agent in the past, and it serves as the third observation variable.
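The following sketch shows one way to compute the implementation shortfall for a single horizon under the ending inventory constraint. It is a simplified stand-in for the endingInventoryIS local function used later in this example; the prices and share counts are hypothetical.

% Simplified IS computation under the ending inventory constraint
% (hypothetical fills; a stand-in for the endingInventoryIS local function)
ArrivalPrice = 27.46;                    % first quote at the start of the horizon
ExecutedPrices = [27.46; 27.44; 27.43];  % prices obtained for each fill
ExecutedShares = [100; 80; 50];          % shares filled at those prices
BuyTrade = false;                        % sell trade

TradeValue = sum(ExecutedPrices.*ExecutedShares);
BenchmarkValue = ArrivalPrice*sum(ExecutedShares);
if BuyTrade
    IS = TradeValue - BenchmarkValue;    % buy: positive IS means buying above the arrival price
else
    IS = BenchmarkValue - TradeValue;    % sell: positive IS means selling below the arrival price
end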
The last observation variable is the current price divergence from arrival price. This variable reflects the current market condition and it is measured as the difference between the arrival price and the average price of the first two levels of the current limit order book (LOB).
For a sell trade:

$$PriceDivergence = \frac{BidPrice_1 + BidPrice_2}{2} - ArrivalPrice$$

For a buy trade:

$$PriceDivergence = ArrivalPrice - \frac{AskPrice_1 + AskPrice_2}{2}$$

Here, $BidPrice_i$ and $AskPrice_i$ are the level $i$ bid and ask prices of the current limit order book.
A positive price divergence implies more favorable current trading conditions than the arrival price, and a negative divergence less favorable conditions.
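The sketch below computes this observation for one sampled limit order book row. It mirrors the logic in the RL_OptimalExecution_LOB_StepFcn local function, where columns 1 and 5 of the sampled book hold the level 1 and level 2 ask prices and columns 3 and 7 hold the level 1 and level 2 bid prices; the row values and arrival price here are hypothetical.

% Price divergence for one limit order book row (hypothetical values)
% Columns: [ask1 askSize1 bid1 bidSize1 ask2 askSize2 bid2 bidSize2 ...]
CurrentLOBRow = [27.49 500 27.46 300 27.50 800 27.45 600];
ArrivalPrice = 27.46;
BuyTrade = false;

if BuyTrade
    AvgPrice = mean(CurrentLOBRow([1 5]));        % average of first two ask levels
    PriceDivergence = ArrivalPrice - AvgPrice;    % positive: asks below the arrival price
else
    AvgPrice = mean(CurrentLOBRow([3 7]));        % average of first two bid levels
    PriceDivergence = AvgPrice - ArrivalPrice;    % positive: bids above the arrival price
end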
Action Space
The action space is defined as the possible number of shares traded at each step. For a TWAP policy, the only possible action at each step is to trade a constant number of shares, computed as the total trading shares divided by the number of steps in the horizon. Meanwhile, the agents in this example choose from 39 possible actions that are mostly evenly spaced, ranging from trading 0 shares to trading twice the number of shares in the corresponding TWAP trade.
For both TWAP trades and agent trades, you enforce ending inventory constraints so that if the target ending inventory is not met by the last step of the horizon, the action for the last step is to:
Sell any remaining shares in the inventory for a Sell trade.
Buy any missing shares in the inventory for a Buy trade.
Also, you place limits on the agent actions to take advantage of price divergence and to prevent selling more than the available shares in the inventory or buying more than the target number of shares.
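For example, a simplified version of the inventory limit for a sell trade might look like the following sketch; the complete logic, including the price-divergence-based cap, is in the RL_OptimalExecution_LOB_StepFcn local function, and the variable names here are assumptions.

% Simplified action limits for a sell trade (hypothetical values)
Action = 220;               % raw number of shares proposed by the agent
InventoryShares = 180;      % shares still held in the current horizon
IsLastInterval = false;     % true at the final step of the horizon

if IsLastInterval
    TradingShares = InventoryShares;              % ending inventory constraint: sell the rest
else
    TradingShares = min(InventoryShares, Action); % never sell more than the inventory
end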
Reward
The reward is another important aspect of agent design, as agents learn their policies at each step using the rewards received after taking the previous actions. Reward design varies widely across optimal execution studies, and you can experiment with different rewards. Since the agent performance is measured against the baseline using Implementation Shortfall, many studies use Implementation Shortfall directly in the rewards, while others use related quantities. Some studies use the proceeds from the trade as the reward [3], while others also impose a penalty for trading too many shares too quickly [4]. Another study uses an elaborately shaped reward that is designed to follow the TWAP policy most of the time, while encouraging the agent to deviate from the TWAP policy only when it is advantageous to do so [5]. If there is enough data, the reward can be given sparsely, only at the end of each horizon rather than at each step [6]. In this example, the data set is small, so you give the reward at each step and design it so that the reward at each step approximates the expected reward at the end of the horizon.
Specifically, you use a reward that compares the Implementation Shortfall of the TWAP baseline against that of the agent at each step, while also taking into account the penalty for the approximate Implementation Shortfall of the remaining unexecuted shares based on the current limit order book. The reward is also scaled based on the current time relative to the end of the horizon.
$$R_t = \frac{t}{T}\left[\left(IS_t^{TWAP} + Penalty_t^{TWAP}\right) - \left(IS_t^{Agent} + Penalty_t^{Agent}\right)\right]$$

Here:

$R_t$ is the reward at time step $t$.

$IS_t^{TWAP}$ is the Implementation Shortfall of TWAP from the beginning of the horizon until time step $t$.

$Penalty_t^{TWAP}$ is the approximate Implementation Shortfall of TWAP from time step $t$ to the end of the horizon at time step $T$.

$IS_t^{Agent}$ is the Implementation Shortfall of the agent from the beginning of the horizon until time step $t$.

$Penalty_t^{Agent}$ is the approximate Implementation Shortfall of the agent from time step $t$ to the end of the horizon at time step $T$.

Since the agent performance is ultimately measured against the baseline using the Implementation Shortfall at the end of the horizon at time step $T$, and the approximate reward becomes more accurate as $t$ approaches $T$, $R_t$ has the scaling factor $t/T$ so that its magnitude increases as $t$ approaches $T$. At time step $T$, the penalty terms vanish and the reward becomes $IS_T^{TWAP} - IS_T^{Agent}$.
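A minimal sketch of this reward computation, using the $t/T$ scaling described above and hypothetical shortfall and penalty values:

% Per-step reward sketch (hypothetical values)
t = 4;  T = 12;          % current step and number of steps in the horizon
IS_TWAP_t       = 1.8;   % TWAP shortfall from the start of the horizon to step t
Penalty_TWAP_t  = 3.1;   % approximate shortfall of TWAP's remaining shares
IS_Agent_t      = 0.9;   % agent shortfall from the start of the horizon to step t
Penalty_Agent_t = 2.6;   % approximate shortfall of the agent's remaining shares

R_t = (t/T)*((IS_TWAP_t + Penalty_TWAP_t) - (IS_Agent_t + Penalty_Agent_t));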
Performance Analysis
You measure agent performance against the TWAP baseline using Implementation Shortfall (IS), by computing the following for all horizons (the sketch after this list illustrates these computations):

IS total outperformance of agent over TWAP — Sum of $IS_T^{TWAP} - IS_T^{Agent}$ over all horizons

IS total-gain-to-loss ratio (TGLR) — Ratio of the total positive over the total negative magnitudes of $IS_T^{TWAP} - IS_T^{Agent}$

IS gain-to-loss ratio (GLR) — Ratio of the mean positive over the mean negative magnitudes of $IS_T^{TWAP} - IS_T^{Agent}$
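The sketch below shows these computations on a hypothetical vector of per-horizon shortfall differences; it is a simplified stand-in for the ISTGLR local function.

% TGLR and GLR from per-horizon IS differences (IS_TWAP - IS_Agent), hypothetical values
ISDiff = [1.2; -0.4; 0.7; -0.9; 2.1];

TotalOutperformance = sum(ISDiff);       % total IS outperformance of the agent over TWAP
Gains  = ISDiff(ISDiff > 0);
Losses = -ISDiff(ISDiff < 0);            % magnitudes of the negative differences

TGLR = sum(Gains)/sum(Losses);           % total-gain-to-loss ratio
GLR  = mean(Gains)/mean(Losses);         % gain-to-loss ratio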
Workflow Organization
The workflow in this example is:
- LOBSTER Message Data — Contains time stamps
- LOBSTER Limit Order Book Data — Contains bid and ask prices and corresponding shares
- TWAP Baseline
- Train the DQN Agent
- Test the DQN Agent
- Summary of Selling Results

The example uses these local functions:

- executeTWAPTrades — Executes TWAP baseline trades and computes IS
- RL_OptimalExecution_LOB_ResetFcn — Reset function for the training and testing environments
- RL_OptimalExecution_LOB_StepFcn — Step function for the training and testing environments
- tradeLOB — Executes trades on the current limit order book
- endingInventoryIS — Computes IS with ending inventory constraints
- ISTGLR — Computes the total-gain-to-loss ratio and gain-to-loss ratio for IS
Load and Prepare Message Data
Extract the LOBSTER data files from the ZIP file included with Financial Toolbox™. The expanded files include two CSV files of data and the text file LOBSTER_SampleFiles_ReadMe.txt.
unzip("LOBSTER_SampleFile_INTC_2012-06-21_5.zip");
The limit order book file contains the bid and ask prices and corresponding shares for five levels, while the message file contains the description of the limit order book data including time stamps.
LOBFileName = 'INTC_2012-06-21_34200000_57600000_orderbook_5.csv'; % Limit Order Book Data file
MSGFileName = 'INTC_2012-06-21_34200000_57600000_message_5.csv';   % Message file (description of data)
Load the message file and define the time at the start and end of trading hours as seconds after midnight.
TradeMessage = readmatrix(MSGFileName);
StartTradingSec = 9.5*60*60; % Starting time 9:30 in seconds after midnight
EndTradingSec = 16*60*60;    % Ending time 16:00 in seconds after midnight
Define the trading interval as five seconds and pick message index values corresponding to the five-second intervals.
TradingIntervalSec = 5; % Trading interval length in seconds
IntervalBoundSec = (StartTradingSec:TradingIntervalSec:EndTradingSec)';
MessageSampleIdx = zeros(size(IntervalBoundSec));
k1 = 1;
for k2 = 1:size(TradeMessage,1)
    if TradeMessage(k2,1) >= IntervalBoundSec(k1)
        MessageSampleIdx(k1) = k2;
        k1 = k1+1;
    end
end
MessageSampleIdx(end) = length(TradeMessage);
Load and Prepare Limit Order Book Data
Load the limit order book data and define the number of levels in the limit order book.
LOB = readmatrix(LOBFileName); NumLevels = 5;
Since LOBSTER stores prices in units of 10,000 dollars, convert the prices in the limit order book data into dollars by dividing by 10000.
LOB(:,1:2:(4*NumLevels)) = LOB(:,1:2:(4*NumLevels))./10000;
Using the message index values obtained previously, sample the limit order book at 5 second intervals.
SampledBook = LOB(MessageSampleIdx(2:end),1:NumLevels*4);
Define a one minute (60 seconds) trading horizon and compute the total number of intervals over 6.5 hours of trading data.
TradingHorizon = 1*60; TotalNumIntervals = length(MessageSampleIdx(2:end)); TotalNumHorizons = 6.5*60*60/TradingHorizon;
Compute the number of intervals per horizon by dividing the total number of intervals by the total number of horizons.
NumIntervalsPerHorizon = round(TotalNumIntervals/TotalNumHorizons);
Allocate 75 percent of data for training and 25 percent for testing.
NumTrainingHorizons = round(TotalNumHorizons*0.75); NumTestingHorizons = TotalNumHorizons - NumTrainingHorizons; NumTrainingSteps = NumTrainingHorizons*NumIntervalsPerHorizon; NumTestingSteps = NumTestingHorizons*NumIntervalsPerHorizon; TrainingLOB = SampledBook(1:NumTrainingSteps,:); TestingLOB = SampledBook(NumTrainingSteps+1:end,:);
Optimal Execution for Sell Trades
These sections compose the workflow for designing a DQN agent for the optimal execution of sell trades:
Validate Training Environment Reset and Step Functions for Sell Trades
Compute IS Using Trained Agent and Training Data for Sell Trades
Validate Testing Environment Reset and Step Functions for Sell Trades
Compute IS Using Trained Agent and Testing Data for Sell Trades
Display Summary of Training and Testing Results for Sell Trades
Define Settings for Sell Trades
To indicate sell trades, set the BuyTrade logical flag to false.
BuyTrade = false;
Define the total number of shares to sell over the horizon.
TotalTradingShares_Sell = 2738;
TWAP Baseline Analysis on Training Data
Execute TWAP trades and compute the Implementation Shortfall (IS) using the executeTWAPTrades function, which uses the tradeLOB function to execute the trades and the endingInventoryIS function to compute the IS. These functions are in the Local Functions section.
[IS_TWAP_Horizon_Train_Sell,IS_TWAP_Step_Train_Sell, ...
    InventoryShares_Step_TWAP_Train_Sell,TradingShares_Step_TWAP_Train_Sell] = ...
    executeTWAPTrades(TrainingLOB,NumLevels,TotalTradingShares_Sell, ...
    NumIntervalsPerHorizon,NumTrainingHorizons,BuyTrade);
Plot the implementation shortfall, executed shares, and inventory shares for the TWAP baseline for each step using the training data.
figure
subplot(3,1,1)
bar(1:NumTrainingSteps, IS_TWAP_Step_Train_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
ylabel("Implementation Shortfall")
title("Implementation Shortfall (TWAP, Training, Sell)")
xlim([0 300])
subplot(3,1,2)
bar(1:NumTrainingSteps, TradingShares_Step_TWAP_Train_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
ylabel("Executed Shares")
title("Shares Traded at Each Step (TWAP, Training, Sell)")
xlim([0 300])
ylim([0 2000])
subplot(3,1,3)
bar(1:NumTrainingSteps, InventoryShares_Step_TWAP_Train_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
ylabel("Inventory Shares")
title("Inventory Shares (TWAP, Training, Sell)")
xlim([0 300])
The TWAP executed shares in the middle panel have a flat profile, since the same number of shares are traded at each step. Also, the inventory shares in the bottom panel have a nearly perfect triangular sawtooth pattern, with the number of inventory shares declining linearly at each horizon as the shares are sold. The IS in the top panel sometimes declines to negative values (implying better executions than arrival price) for some horizons, while it increases to positive values (implying worse executions than arrival price) for other horizons.
Set Up RL Training Environment for Sell Trades
The RL environment requires observation and action information, reset function, and a step function.
Define the observation information for the four observation variables.
ObservationDimension = [1 4]; ObservationInfo_Train_Sell = rlNumericSpec(ObservationDimension); ObservationInfo_Train_Sell.Name = 'Observation Data'; ObservationInfo_Train_Sell.Description = 'RemainingSharesToTrade, RemainingIntervals, IS_Agent, PriceDivergence'; ObservationInfo_Train_Sell.LowerLimit = -inf; ObservationInfo_Train_Sell.UpperLimit = inf;
Define the environment constants for the training data and the reset function. The function definition for RL_OptimalExecution_LOB_ResetFcn is in the Local Functions section.
EnvConstants_Train_Sell.SampledBook = TrainingLOB; EnvConstants_Train_Sell.BuyTrade = BuyTrade; EnvConstants_Train_Sell.TotalTradingShares = TotalTradingShares_Sell; EnvConstants_Train_Sell.NumIntervalsPerHorizon = NumIntervalsPerHorizon; EnvConstants_Train_Sell.NumHorizons = NumTrainingHorizons; EnvConstants_Train_Sell.NumSteps = NumTrainingSteps; EnvConstants_Train_Sell.NumLevels = NumLevels; EnvConstants_Train_Sell.IS_TWAP = IS_TWAP_Step_Train_Sell; ResetFunction_Sell = @() RL_OptimalExecution_LOB_ResetFcn(EnvConstants_Train_Sell);
Define the action information as the number of shares to be traded at each step, ranging from zero shares to twice the number of shares as in TWAP.
NumActions = 39;
LowerActionRange = linspace(0,TradingShares_Step_TWAP_Train_Sell(1), round(NumActions/2));
ActionRange = round([LowerActionRange(1:end-1) ...
    linspace(TradingShares_Step_TWAP_Train_Sell(1), ...
    TradingShares_Step_TWAP_Train_Sell(1)*2,round(NumActions/2))]);
ActionInfo_Sell = rlFiniteSetSpec(ActionRange);
ActionInfo_Sell.Name = 'Trading Action (Number of Shares)';
ActionInfo_Sell.Description = 'Number of Shares to be Traded at Each Time Step';
Define the step function, which computes the rewards and updates the observations at each step. The function definition for RL_OptimalExecution_LOB_StepFcn is in the Local Functions section.
StepFunction_Train_Sell = @(Action,LoggedSignals) ...
RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants_Train_Sell);
Create the environment using the custom function handles defined in this section.
RL_OptimalExecution_Training_Environment_Sell = rlFunctionEnv(ObservationInfo_Train_Sell,ActionInfo_Sell,StepFunction_Train_Sell,ResetFunction_Sell)
Reset! Reset!
RL_OptimalExecution_Training_Environment_Sell = rlFunctionEnv with properties: StepFcn: @(Action,LoggedSignals)RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants_Train_Sell) ResetFcn: @()RL_OptimalExecution_LOB_ResetFcn(EnvConstants_Train_Sell) Info: [1×1 struct]
Validate Training Environment Reset and Step Functions for Sell Trades
Before using the created environment, the best practice is to validate the reset and step functions for the environment. Call the reset and step functions to see if they produce reasonable results without errors.
Validate the reset function and the initial observation.
InitialObservation = reset(RL_OptimalExecution_Training_Environment_Sell);
Reset!
InitialObservation
InitialObservation = 1×4
2738 12 0 0
Validate the step function by monitoring the logged signals and rewards while calling the step functions multiple times. Here, you can see the time interval index increasing at each step and the current inventory shares changing as the trades are executed at each step.
[~,Reward1,~,LoggedSignals1] = step(RL_OptimalExecution_Training_Environment_Sell,TradingShares_Step_TWAP_Train_Sell(1)); LoggedSignals1
LoggedSignals1 = struct with fields:
IntervalIdx: 1
HorizonIdx: 1
CurrentInventoryShares: 2624
ArrivalPrice: 27.4600
ExecutedShares: 114
ExecutedPrices: 27.4600
HorizonExecutedShares: [60×1 double]
HorizonExecutedPrices: [60×1 double]
IS_Agent: [3516×1 double]
Reward: [3516×1 double]
NumIntervalsPerHorizon: 12
NumHorizons: 293
NumLevels: 5
Reward1
Reward1 = 0
[~,Reward2,~,LoggedSignals2] = step(RL_OptimalExecution_Training_Environment_Sell,10); LoggedSignals2
LoggedSignals2 = struct with fields:
IntervalIdx: 2
HorizonIdx: 1
CurrentInventoryShares: 2614
ArrivalPrice: 27.4600
ExecutedShares: 10
ExecutedPrices: 27.4900
HorizonExecutedShares: [60×1 double]
HorizonExecutedPrices: [60×1 double]
IS_Agent: [3516×1 double]
Reward: [3516×1 double]
NumIntervalsPerHorizon: 12
NumHorizons: 293
NumLevels: 5
Reward2
Reward2 = 0.6150
[~,Reward3,~,LoggedSignals3] = step(RL_OptimalExecution_Training_Environment_Sell,35); LoggedSignals3
LoggedSignals3 = struct with fields:
IntervalIdx: 3
HorizonIdx: 1
CurrentInventoryShares: 2579
ArrivalPrice: 27.4600
ExecutedShares: 35
ExecutedPrices: 27.4900
HorizonExecutedShares: [60×1 double]
HorizonExecutedPrices: [60×1 double]
IS_Agent: [3516×1 double]
Reward: [3516×1 double]
NumIntervalsPerHorizon: 12
NumHorizons: 293
NumLevels: 5
Reward3
Reward3 = 0.8925
Call the reset function again to restore the environment back to the initial state.
reset(RL_OptimalExecution_Training_Environment_Sell);
Reset!
Create and Train Agent for Sell Trades
Open Reinforcement Learning Designer (Reinforcement Learning Toolbox).
reinforcementLearningDesigner
Import the training environment by clicking the Import button and selecting the created training environment.
Create a new DQN agent by selecting New > Agent, then selecting DQN from the Compatible algorithm menu. Enable LSTM by selecting Use recurrent neural network.
Set the agent hyperparameters and exploration parameters in the DQN Agent tab. Use the default options, except for the following:
Discount factor — 1
Learn rate — 0.0001
Batch size — 128
Sequence length — 4
Initial epsilon — 1
Epsilon decay — 0.001
Go to the Train tab, and set the Max Episode Length to 3516 (NumTrainingSteps). You can experiment with different Stopping Criteria, or you can manually stop the training by clicking the Stop button. In this example, you set the Stopping Criteria to EpisodeCount and the Max Episodes to a value greater than or equal to EpisodeCount. Click Train to start the training.
Once the training is complete, export the trained agent to the workspace. Since training can take a long time, you can load a pretrained agent (DQN_agent_Sell_Trained.mat) for sell trades.
load DQN_agent_Sell_Trained.mat
You can view the DQN agent options by examining the AgentOptions property.
DQN_agent_Sell_Trained.AgentOptions
ans = 
  rlDQNAgentOptions with properties:

                             SampleTime: 1
                         DiscountFactor: 1
               EpsilonGreedyExploration: [1×1 rl.option.EpsilonGreedyExploration]
                 ExperienceBufferLength: 10000
                          MiniBatchSize: 128
                         SequenceLength: 4
                 CriticOptimizerOptions: [1×1 rl.option.rlOptimizerOptions]
                    NumStepsToLookAhead: 1
                           UseDoubleDQN: 1
                     TargetSmoothFactor: 1.0000e-03
                  TargetUpdateFrequency: 1
            BatchDataRegularizerOptions: []
    ResetExperienceBufferBeforeTraining: 0
                             InfoToSave: [1×1 struct]
Also, you can visualize the DQN agent's critic network by using Deep Network Designer.
deepNetworkDesigner(layerGraph(getModel(getCritic(DQN_agent_Sell_Trained))))
Compute IS Using Trained Agent and Training Data for Sell Trades
Simulate selling actions using the trained agent and training environment.
rng('default');
reset(RL_OptimalExecution_Training_Environment_Sell);
Reset!
simOptions = rlSimulationOptions(MaxSteps=NumTrainingSteps)
simOptions = 
  rlSimulationOptions with properties:

                  MaxSteps: 3516
            NumSimulations: 1
               StopOnError: "on"
               UseParallel: 0
    ParallelizationOptions: [1×1 rl.option.ParallelSimulation]
Trained_Agent_Sell = DQN_agent_Sell_Trained; experience_Sell = sim(RL_OptimalExecution_Training_Environment_Sell,Trained_Agent_Sell,simOptions);
Reset! Last Horizon. Episode is Done! HorizonIdx:
ans = 293
SimAction_Train_Sell = squeeze(experience_Sell.Action.TradingAction_NumberOfShares_.Data);
Compute the IS using simulated actions on the training data.
NumSimSteps = length(SimAction_Train_Sell); SimInventory_Train_Sell = nan(NumSimSteps,1); SimExecutedShares_Train_Sell = SimInventory_Train_Sell; SimReward_Train_Sell = nan(NumSimSteps,1); reset(RL_OptimalExecution_Training_Environment_Sell);
Reset!
for k=1:NumSimSteps
    [~,SimReward_Train_Sell(k),~,LoggedSignals] = ...
        step(RL_OptimalExecution_Training_Environment_Sell,SimAction_Train_Sell(k));
    SimInventory_Train_Sell(k) = LoggedSignals.CurrentInventoryShares;
    SimExecutedShares_Train_Sell(k) = sum(LoggedSignals.ExecutedShares);
end
Last Horizon. Episode is Done! HorizonIdx:
ans = 293
IS_Agent_Step_Train_Sell = LoggedSignals.IS_Agent;
IS_Agent_Horizon_Train_Sell = IS_Agent_Step_Train_Sell( ...
    NumIntervalsPerHorizon:NumIntervalsPerHorizon:NumTrainingSteps);
Plot Training Results for Sell Trades
Plot Implementation Shortfall (IS), executed shares, and inventory shares for the agent for each step using training data.
figure
subplot(3,1,1)
bar(1:NumTrainingSteps, IS_Agent_Step_Train_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
ylabel("Implementation Shortfall")
title("Implementation Shortfall (Agent, Training, Sell)")
xlim([0 300])
ylim([-200 200])
subplot(3,1,2)
bar(1:NumTrainingSteps, SimExecutedShares_Train_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
ylabel("Executed Shares")
title("Shares Traded at Each Step (Agent, Training, Sell)")
xlim([0 300])
subplot(3,1,3)
bar(1:NumTrainingSteps, SimInventory_Train_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
ylabel("Inventory Shares")
title("Inventory Shares (Agent, Training, Sell)")
xlim([0 300])
Unlike the previous TWAP baseline plot, where the executed shares in the middle panel have a flat profile, here you can see that the agent actively changes the number of executed shares at each step. Also, the inventory shares in the bottom panel decline in a non-linear manner.
To directly compare the results from the agent with the TWAP baseline, you can plot both results on the same chart. Generate plots to compare implementation shortfall, executed shares, and inventory shares between the TWAP baseline and the agent for each step using the training data.
figure
subplot(3,1,1)
plot(1:NumTrainingSteps, [IS_TWAP_Step_Train_Sell IS_Agent_Step_Train_Sell], linewidth=1.5)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
ylabel("Implementation Shortfall")
title("Implementation Shortfall (Training, Sell)")
legend(["TWAP" "Agent"], location="east")
xlim([0 300])
subplot(3,1,2)
plot(1:NumTrainingSteps, [TradingShares_Step_TWAP_Train_Sell ...
    SimExecutedShares_Train_Sell], linewidth=1.5)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
ylabel("Executed Shares")
title("Shares Traded at Each Step (Training, Sell)")
legend(["TWAP" "Agent"], location="east")
xlim([0 300])
subplot(3,1,3)
plot(1:NumTrainingSteps, [InventoryShares_Step_TWAP_Train_Sell ...
    SimInventory_Train_Sell], linewidth=1.5)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
ylabel("Inventory Shares")
title("Inventory Shares (Training, Sell)")
legend(["TWAP" "Agent"], location="east")
xlim([0 300])
You can see the differences between the TWAP baseline (blue) and the agent (red) in their IS, executed shares, and inventory shares. Also, the agent tends to behave differently when the IS trends negative as compared to positive.
For example, in the first horizon (step numbers 1 to 12) and the third horizon (step numbers 25 to 36), the IS trends negative (more favorable prices than the arrival price) and the agent initially trades fewer shares than TWAP, until there are large spikes in the executed shares at the ends of the horizons to meet the ending inventory constraint. Also, the inventory initially declines more slowly than with TWAP, until it drops abruptly to zero at the end of the horizon. This behavior coincides with negative dips in the agent's IS at the ends of the first and third horizons that go even lower than the IS of TWAP.
On the other hand, in the second horizon (step numbers 13 to 24), the IS trends positive (less favorable prices than the arrival price), and the agent initially trades more shares than TWAP to liquidate the inventory faster than TWAP. This has a limiting effect on the agent's IS at the end of the horizon, which is also lower than the IS of TWAP.
You can see similar behaviors in other horizons, although they do not always lead to an outperformance of the agent over TWAP.
Also, you can plot the differences in IS between TWAP and the agent, and then compare them with the rewards.
figure
subplot(3,1,1)
bar(1:NumTrainingHorizons, IS_TWAP_Horizon_Train_Sell - IS_Agent_Horizon_Train_Sell)
xlabel(strcat("Horizon number (", num2str(NumTrainingHorizons), " horizons total.)"))
title("IS TWAP - IS Agent Over Horizon (Training, Sell)")
xlim([0 43])
subplot(3,1,2)
bar(1:NumTrainingSteps, IS_TWAP_Step_Train_Sell - IS_Agent_Step_Train_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
title("IS TWAP - IS Agent Each Step (Training, Sell)")
xlim([0 510])
subplot(3,1,3)
bar(1:NumTrainingSteps, SimReward_Train_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTrainingSteps), " steps total.)"))
title("Reward (Training, Sell)")
xlim([0 510])
The top panel ($IS_T^{TWAP} - IS_T^{Agent}$) shows the final performance of the agent for the training data as the differences in IS between TWAP and the agent at the ends of the horizons, at times $t = T$. For example, in the top panel, you can see the outperformance of the agent (positive IS difference) in the first three horizons, while the agent underperforms in some of the other horizons. The middle panel ($IS_t^{TWAP} - IS_t^{Agent}$) shows the differences in IS for all time steps $t$, including the ones before the ends of the horizons. The middle panel does not appear to track the top panel values very well in some steps (except for the very last steps in the horizons, which are the top panel values when $t = T$), as there are frequent trend reversals between negative and positive values at the last steps of the horizons. Therefore, the simple difference in IS between TWAP and the agent at each step ($IS_t^{TWAP} - IS_t^{Agent}$), shown in the middle panel, may not be a good reward function. On the other hand, the bottom panel, which plots the rewards used by the agent in this example at each step, appears to track the top panel values more closely than the middle panel at most steps.
Display Summary of Training Results for Sell Trades
Display the total Implementation Shortfall outperformance of the agent over TWAP for the training data.
sum(IS_TWAP_Horizon_Train_Sell - IS_Agent_Horizon_Train_Sell)
ans = 164.4700
Display total-gain-to-loss ratio (TGLR) for the training data.
ISTGLR(IS_TWAP_Horizon_Train_Sell - IS_Agent_Horizon_Train_Sell)
ans = 1.2327
TWAP Baseline Analysis on Testing Data
Using the same trained agent as the Compute IS Using Trained Agent and Training Data for Sell Trades section, you can perform the same analysis on the testing data that you kept separate from the training data.
First, execute the TWAP trades and compute IS for the testing data.
[IS_TWAP_Horizon_Test_Sell,IS_TWAP_Step_Test_Sell, ...
    InventoryShares_Step_TWAP_Test_Sell,TradingShares_Step_TWAP_Test_Sell] = ...
    executeTWAPTrades(TestingLOB,NumLevels,TotalTradingShares_Sell, ...
    NumIntervalsPerHorizon,NumTestingHorizons,BuyTrade);
Plot the implementation shortfall, executed shares, and inventory shares for the TWAP baseline for each step using testing data.
figure
subplot(3,1,1)
bar(1:NumTestingSteps, IS_TWAP_Step_Test_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
ylabel("Implementation Shortfall")
title("Implementation Shortfall (TWAP, Testing, Sell)")
xlim([0 300])
ylim([-65 75])
subplot(3,1,2)
bar(1:NumTestingSteps, TradingShares_Step_TWAP_Test_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
ylabel("Executed Shares")
title("Shares Traded at Each Step (TWAP, Testing, Sell)")
xlim([0 300])
ylim([0 2000])
subplot(3,1,3)
bar(1:NumTestingSteps, InventoryShares_Step_TWAP_Test_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
ylabel("Inventory Shares")
title("Inventory Shares (TWAP, Testing, Sell)")
xlim([0 300])
Set Up RL Testing Environment for Sell Trades
Next, define the reset function for the testing environment.
EnvConstants_Test_Sell.SampledBook = TestingLOB; EnvConstants_Test_Sell.BuyTrade = BuyTrade; EnvConstants_Test_Sell.TotalTradingShares = TotalTradingShares_Sell; EnvConstants_Test_Sell.NumIntervalsPerHorizon = NumIntervalsPerHorizon; EnvConstants_Test_Sell.NumHorizons = NumTestingHorizons; EnvConstants_Test_Sell.NumSteps = NumTestingSteps; EnvConstants_Test_Sell.NumLevels = NumLevels; EnvConstants_Test_Sell.IS_TWAP = IS_TWAP_Step_Test_Sell; ResetFunction_Test_Sell = @() RL_OptimalExecution_LOB_ResetFcn(EnvConstants_Test_Sell);
Define the step function for the testing environment.
StepFunction_Test_Sell = @(Action,LoggedSignals) ...
RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants_Test_Sell);
Create the testing environment using the reset and step function handles.
RL_OptimalExecution_Testing_Environment_Sell = ...
    rlFunctionEnv(ObservationInfo_Train_Sell,ActionInfo_Sell, ...
    StepFunction_Test_Sell,ResetFunction_Test_Sell)
Reset! Reset!
RL_OptimalExecution_Testing_Environment_Sell = rlFunctionEnv with properties: StepFcn: @(Action,LoggedSignals)RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants_Test_Sell) ResetFcn: @()RL_OptimalExecution_LOB_ResetFcn(EnvConstants_Test_Sell) Info: [1×1 struct]
Validate Testing Environment Reset and Step Functions for Sell Trades
Now, validate the testing environment using reset and step function handles.
InitialObservation = reset(RL_OptimalExecution_Testing_Environment_Sell);
Reset!
InitialObservation
InitialObservation = 1×4
2738 12 0 0
[~,Reward1,~,LoggedSignals1] = step( ...
RL_OptimalExecution_Testing_Environment_Sell,TradingShares_Step_TWAP_Train_Sell(1));
LoggedSignals1
LoggedSignals1 = struct with fields:
IntervalIdx: 1
HorizonIdx: 1
CurrentInventoryShares: 2624
ArrivalPrice: 26.8000
ExecutedShares: 114
ExecutedPrices: 26.8000
HorizonExecutedShares: [60×1 double]
HorizonExecutedPrices: [60×1 double]
IS_Agent: [1164×1 double]
Reward: [1164×1 double]
NumIntervalsPerHorizon: 12
NumHorizons: 97
NumLevels: 5
Reward1
Reward1 = 0
[~,Reward2,~,LoggedSignals2] = step( ...
RL_OptimalExecution_Testing_Environment_Sell,10);
LoggedSignals2
LoggedSignals2 = struct with fields:
IntervalIdx: 2
HorizonIdx: 1
CurrentInventoryShares: 2614
ArrivalPrice: 26.8000
ExecutedShares: 10
ExecutedPrices: 26.7900
HorizonExecutedShares: [60×1 double]
HorizonExecutedPrices: [60×1 double]
IS_Agent: [1164×1 double]
Reward: [1164×1 double]
NumIntervalsPerHorizon: 12
NumHorizons: 97
NumLevels: 5
Reward2
Reward2 = -0.2050
[~,Reward3,~,LoggedSignals3] = step( ...
RL_OptimalExecution_Testing_Environment_Sell,35);
LoggedSignals3
LoggedSignals3 = struct with fields:
IntervalIdx: 3
HorizonIdx: 1
CurrentInventoryShares: 2579
ArrivalPrice: 26.8000
ExecutedShares: 35
ExecutedPrices: 26.7900
HorizonExecutedShares: [60×1 double]
HorizonExecutedPrices: [60×1 double]
IS_Agent: [1164×1 double]
Reward: [1164×1 double]
NumIntervalsPerHorizon: 12
NumHorizons: 97
NumLevels: 5
Reward3
Reward3 = -0.2975
Call the reset function again to restore the environment back to the initial state.
reset(RL_OptimalExecution_Testing_Environment_Sell);
Reset!
Compute IS Using Trained Agent and Testing Data for Sell Trades
Simulate actions using the trained agent and testing data for sell trades.
simOptions = rlSimulationOptions(MaxSteps=NumTestingSteps); experience_Sell = sim(RL_OptimalExecution_Testing_Environment_Sell,Trained_Agent_Sell,simOptions);
Reset! Last Horizon. Episode is Done! HorizonIdx:
ans = 97
SimAction_Test_Sell = squeeze(experience_Sell.Action.TradingAction_NumberOfShares_.Data);
Compute Implementation Shortfall (IS) using simulated actions on testing data.
NumSimSteps = length(SimAction_Test_Sell); SimInventory_Test_Sell = nan(NumSimSteps,1); SimExecutedShares_Test_Sell = SimInventory_Test_Sell; SimReward_Test_Sell = nan(NumSimSteps,1); reset(RL_OptimalExecution_Testing_Environment_Sell);
Reset!
for k=1:NumSimSteps
    [~,SimReward_Test_Sell(k),~,LoggedSignals] = step( ...
        RL_OptimalExecution_Testing_Environment_Sell,SimAction_Test_Sell(k));
    SimInventory_Test_Sell(k) = LoggedSignals.CurrentInventoryShares;
    SimExecutedShares_Test_Sell(k) = sum(LoggedSignals.ExecutedShares);
end
Last Horizon. Episode is Done! HorizonIdx:
ans = 97
IS_Agent_Step_Test_Sell = LoggedSignals.IS_Agent;
IS_Agent_Horizon_Test_Sell = IS_Agent_Step_Test_Sell( ...
NumIntervalsPerHorizon:NumIntervalsPerHorizon:NumTestingSteps);
Plot Testing Results for Sell Trades
Plot Implementation Shortfall (IS), executed shares, and inventory shares of the agent for each step using testing data.
figure
subplot(3,1,1)
bar(1:NumTestingSteps, IS_Agent_Step_Test_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
ylabel("Implementation Shortfall")
title("Implementation Shortfall (Agent, Testing, Sell)")
xlim([650 950])
ylim([-200 150])
subplot(3,1,2)
bar(1:NumTestingSteps, SimExecutedShares_Test_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
ylabel("Executed Shares")
title("Shares Traded at Each Step (Agent, Testing, Sell)")
xlim([650 950])
ylim([0 2900])
subplot(3,1,3)
bar(1:NumTestingSteps, SimInventory_Test_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
ylabel("Inventory Shares")
title("Inventory Shares (Agent, Testing, Sell)")
xlim([650 950])
Compare implementation shortfall, executed shares, and inventory shares between the TWAP baseline and the agent for each step using testing data.
figure
subplot(3,1,1)
plot(1:NumTestingSteps, [IS_TWAP_Step_Test_Sell IS_Agent_Step_Test_Sell], linewidth=1.5)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
ylabel("Implementation Shortfall")
title("Implementation Shortfall (Testing, Sell)")
legend(["TWAP" "Agent"],location='best')
xlim([650 950])
ylim([-200 150])
subplot(3,1,2)
plot(1:NumTestingSteps, [TradingShares_Step_TWAP_Test_Sell ...
    SimExecutedShares_Test_Sell], linewidth=1.5)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
ylabel("Executed Shares")
title("Shares Traded at Each Step (Testing, Sell)")
legend(["TWAP" "Agent"])
xlim([650 950])
ylim([0 2900])
subplot(3,1,3)
plot(1:NumTestingSteps, [InventoryShares_Step_TWAP_Test_Sell ...
    SimInventory_Test_Sell], linewidth=1.5)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
ylabel("Inventory Shares")
title("Inventory Shares (Testing, Sell)")
legend(["TWAP" "Agent"])
xlim([650 950])
Just as in the training data, the agent (red) actively changes the number of executed shares for each step in the testing data, while the TWAP policy (blue) executes the same number of shares at each step.
Plot the differences in IS between TWAP and the agent and then compare them with the rewards.
figure
subplot(3,1,1)
bar(1:NumTestingHorizons, IS_TWAP_Horizon_Test_Sell - IS_Agent_Horizon_Test_Sell)
xlabel(strcat("Horizon number (", num2str(NumTestingHorizons), " horizons total.)"))
title("IS TWAP - IS Agent Over Horizon (Testing, Sell)")
subplot(3,1,2)
bar(1:NumTestingSteps, IS_TWAP_Step_Test_Sell - IS_Agent_Step_Test_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
title("IS TWAP - IS Agent Each Step (Testing, Sell)")
xlim([0 NumTestingSteps+9])
subplot(3,1,3)
bar(1:NumTestingSteps, SimReward_Test_Sell)
xlabel(strcat("Step number (", num2str(TradingIntervalSec), ...
    " sec intervals, ", num2str(NumTestingSteps), " steps total.)"))
title("Reward (Testing, Sell)")
xlim([0 NumTestingSteps+9])
Display Summary of Training and Testing Results for Sell Trades
You can summarize the results of the sell trade agent's training and testing.
Display total Implementation Shortfall (IS) outperformance of the agent over TWAP for training data.
Total_Outperformance_Train_Sell = sum(IS_TWAP_Horizon_Train_Sell - IS_Agent_Horizon_Train_Sell); table(Total_Outperformance_Train_Sell)
ans=table
Total_Outperformance_Train_Sell
_______________________________
164.47
Display total-gain-to-loss ratio (TGLR) and gain-to-loss ratio (GLR) for the training data.
[TGLR_Train_Sell, GLR_Train_Sell] = ...
ISTGLR(IS_TWAP_Horizon_Train_Sell - IS_Agent_Horizon_Train_Sell);
table(TGLR_Train_Sell, GLR_Train_Sell)
ans=1×2 table
TGLR_Train_Sell GLR_Train_Sell
_______________ ______________
1.2327 1.2532
Display the total IS outperformance of the agent over TWAP for the testing data.
Total_Outperformance_Test_Sell = sum(IS_TWAP_Horizon_Test_Sell - IS_Agent_Horizon_Test_Sell); table(Total_Outperformance_Test_Sell)
ans=table
Total_Outperformance_Test_Sell
______________________________
104.12
Display TGLR and GLR for the testing data.
[TGLR_Test_Sell, GLR_Test_Sell] = ...
ISTGLR(IS_TWAP_Horizon_Test_Sell - IS_Agent_Horizon_Test_Sell);
table(TGLR_Test_Sell, GLR_Test_Sell)
ans=1×2 table
TGLR_Test_Sell GLR_Test_Sell
______________ _____________
1.389 1.1906
Optimal Execution for Buy Trades
You can use a shortened version of the previous procedure (Optimal Execution for Sell Trades) as the workflow for designing a DQN agent for the optimal execution of buy trades. To repeat the workflow for buy trades, first set the BuyTrade logical flag to true and define the total number of shares to buy over the horizon.
BuyTrade = true; TotalTradingShares_Buy = 2551;
Execute TWAP trades and compute the Implementation Shortfall (IS).
[IS_TWAP_Horizon_Train_Buy,IS_TWAP_Step_Train_Buy, ...
    InventoryShares_Step_TWAP_Train_Buy,TradingShares_Step_TWAP_Train_Buy] = ...
    executeTWAPTrades(TrainingLOB,NumLevels,TotalTradingShares_Buy, ...
    NumIntervalsPerHorizon,NumTrainingHorizons,BuyTrade);
Set up the RL training environment for buy trades.
ObservationDimension = [1 4];
ObservationInfo_Train_Buy = rlNumericSpec(ObservationDimension);
ObservationInfo_Train_Buy.Name = 'Observation Data';
ObservationInfo_Train_Buy.Description = ['RemainingSharesToTrade, ...' ...
    'RemainingIntervals, IS_Agent, PriceDivergence'];
ObservationInfo_Train_Buy.LowerLimit = -inf;
ObservationInfo_Train_Buy.UpperLimit = inf;

EnvConstants_Train_Buy.SampledBook = TrainingLOB;
EnvConstants_Train_Buy.BuyTrade = BuyTrade;
EnvConstants_Train_Buy.TotalTradingShares = TotalTradingShares_Buy;
EnvConstants_Train_Buy.NumIntervalsPerHorizon = NumIntervalsPerHorizon;
EnvConstants_Train_Buy.NumHorizons = NumTrainingHorizons;
EnvConstants_Train_Buy.NumSteps = NumTrainingSteps;
EnvConstants_Train_Buy.NumLevels = NumLevels;
EnvConstants_Train_Buy.IS_TWAP = IS_TWAP_Step_Train_Buy;

ResetFunction_Buy = @() RL_OptimalExecution_LOB_ResetFcn(EnvConstants_Train_Buy);

NumActions = 39;
LowerActionRange = linspace(0,TradingShares_Step_TWAP_Train_Buy(1), round(NumActions/2));
ActionRange = round([LowerActionRange(1:end-1) ...
    linspace(TradingShares_Step_TWAP_Train_Buy(1), ...
    TradingShares_Step_TWAP_Train_Buy(1)*2,round(NumActions/2))]);
NumActions = length(ActionRange);
ActionInfo_Buy = rlFiniteSetSpec(ActionRange);
ActionInfo_Buy.Name = 'Trading Action (Number of Shares)';
ActionInfo_Buy.Description = 'Number of Shares to be Traded at Each Time Step';

StepFunction_Train_Buy = @(Action,LoggedSignals) ...
    RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants_Train_Buy);

RL_OptimalExecution_Training_Environment_Buy = rlFunctionEnv( ...
    ObservationInfo_Train_Buy,ActionInfo_Buy,StepFunction_Train_Buy,ResetFunction_Buy)
Reset! Reset!
RL_OptimalExecution_Training_Environment_Buy = rlFunctionEnv with properties: StepFcn: @(Action,LoggedSignals)RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants_Train_Buy) ResetFcn: @()RL_OptimalExecution_LOB_ResetFcn(EnvConstants_Train_Buy) Info: [1×1 struct]
Use Reinforcement Learning Designer (Reinforcement Learning Toolbox) to create and train the agent for the buy trades. You use the same procedure (Optimal Execution for Sell Trades) as with the sell trades, except that the imported training environment is for buy trades (RL_OptimalExecution_Training_Environment_Buy). Once the training is complete, export the trained agent to the workspace. Since training can take a long time, this example uses a pretrained agent (DQN_agent_Buy_Trained.mat).
load DQN_agent_Buy_Trained.mat
Compute IS using the trained agent and training data for buy trades.
rng('default');
reset(RL_OptimalExecution_Training_Environment_Buy);
Reset!
simOptions = rlSimulationOptions(MaxSteps=NumTrainingSteps); Trained_Agent_Buy = DQN_agent_Buy_Trained; experience_Buy = sim(RL_OptimalExecution_Training_Environment_Buy,Trained_Agent_Buy,simOptions);
Reset! Last Horizon. Episode is Done! HorizonIdx:
ans = 293
SimAction_Train_Buy = squeeze(experience_Buy.Action.TradingAction_NumberOfShares_.Data); NumSimSteps = length(SimAction_Train_Buy); SimInventory_Train_Buy = nan(NumSimSteps,1); SimExecutedShares_Train_Buy = SimInventory_Train_Buy; SimReward_Train_Buy = nan(NumSimSteps,1); reset(RL_OptimalExecution_Training_Environment_Buy);
Reset!
for k=1:NumSimSteps
    [~,SimReward_Train_Buy(k),~,LoggedSignals] = ...
        step(RL_OptimalExecution_Training_Environment_Buy,SimAction_Train_Buy(k));
    SimInventory_Train_Buy(k) = LoggedSignals.CurrentInventoryShares;
    SimExecutedShares_Train_Buy(k) = sum(LoggedSignals.ExecutedShares);
end
Last Horizon. Episode is Done! HorizonIdx:
ans = 293
IS_Agent_Step_Train_Buy = LoggedSignals.IS_Agent;
IS_Agent_Horizon_Train_Buy = IS_Agent_Step_Train_Buy( ...
    NumIntervalsPerHorizon:NumIntervalsPerHorizon:NumTrainingSteps);
Execute the TWAP trades and compute the IS for the testing data.
[IS_TWAP_Horizon_Test_Buy,IS_TWAP_Step_Test_Buy, ...
    InventoryShares_Step_TWAP_Test_Buy,TradingShares_Step_TWAP_Test_Buy] = ...
    executeTWAPTrades(TestingLOB,NumLevels,TotalTradingShares_Buy, ...
    NumIntervalsPerHorizon,NumTestingHorizons,BuyTrade);
Set up the RL testing environment for buy trades.
EnvConstants_Test_Buy.SampledBook = TestingLOB;
EnvConstants_Test_Buy.BuyTrade = BuyTrade;
EnvConstants_Test_Buy.TotalTradingShares = TotalTradingShares_Buy;
EnvConstants_Test_Buy.NumIntervalsPerHorizon = NumIntervalsPerHorizon;
EnvConstants_Test_Buy.NumHorizons = NumTestingHorizons;
EnvConstants_Test_Buy.NumSteps = NumTestingSteps;
EnvConstants_Test_Buy.NumLevels = NumLevels;
EnvConstants_Test_Buy.IS_TWAP = IS_TWAP_Step_Test_Buy;

ResetFunction_Test_Buy = @() RL_OptimalExecution_LOB_ResetFcn(EnvConstants_Test_Buy);
StepFunction_Test_Buy = @(Action,LoggedSignals) ...
    RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants_Test_Buy);

RL_OptimalExecution_Testing_Environment_Buy = ...
    rlFunctionEnv(ObservationInfo_Train_Buy,ActionInfo_Buy, ...
    StepFunction_Test_Buy,ResetFunction_Test_Buy)
Reset! Reset!
RL_OptimalExecution_Testing_Environment_Buy = rlFunctionEnv with properties: StepFcn: @(Action,LoggedSignals)RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants_Test_Buy) ResetFcn: @()RL_OptimalExecution_LOB_ResetFcn(EnvConstants_Test_Buy) Info: [1×1 struct]
Compute IS using the trained agent and testing data for buy trades.
simOptions = rlSimulationOptions(MaxSteps=NumTestingSteps); experience_Buy = sim(RL_OptimalExecution_Testing_Environment_Buy,Trained_Agent_Buy,simOptions);
Reset! Last Horizon. Episode is Done! HorizonIdx:
ans = 97
SimAction_Test_Buy = squeeze(experience_Buy.Action.TradingAction_NumberOfShares_.Data); NumSimSteps = length(SimAction_Test_Buy); SimInventory_Test_Buy = nan(NumSimSteps,1); SimExecutedShares_Test_Buy = SimInventory_Test_Buy; SimReward_Test_Buy = nan(NumSimSteps,1); reset(RL_OptimalExecution_Testing_Environment_Buy);
Reset!
for k=1:NumSimSteps
    [NextObs1,SimReward_Test_Buy(k),IsDone1,LoggedSignals] = step( ...
        RL_OptimalExecution_Testing_Environment_Buy,SimAction_Test_Buy(k));
    SimInventory_Test_Buy(k) = LoggedSignals.CurrentInventoryShares;
    SimExecutedShares_Test_Buy(k) = sum(LoggedSignals.ExecutedShares);
end
Last Horizon. Episode is Done! HorizonIdx:
ans = 97
IS_Agent_Step_Test_Buy = LoggedSignals.IS_Agent;
IS_Agent_Horizon_Test_Buy = IS_Agent_Step_Test_Buy( ...
NumIntervalsPerHorizon:NumIntervalsPerHorizon:NumTestingSteps);
Display the total Implementation Shortfall (IS) outperformance of the agent over TWAP for training data.
Total_Outperformance_Train_Buy = sum(IS_TWAP_Horizon_Train_Buy - IS_Agent_Horizon_Train_Buy); table(Total_Outperformance_Train_Buy)
ans=table
Total_Outperformance_Train_Buy
______________________________
326.25
Display total-gain-to-loss ratio (TGLR) and gain-to-loss ratio (GLR) for the training data.
[TGLR_Train_Buy, GLR_Train_Buy] = ...
ISTGLR(IS_TWAP_Horizon_Train_Buy - IS_Agent_Horizon_Train_Buy);
table(TGLR_Train_Buy, GLR_Train_Buy)
ans=1×2 table
TGLR_Train_Buy GLR_Train_Buy
______________ _____________
1.4146 1.2587
Display total IS outperformance of the agent over TWAP for the testing data.
Total_Outperformance_Test_Buy = sum(IS_TWAP_Horizon_Test_Buy - IS_Agent_Horizon_Test_Buy); table(Total_Outperformance_Test_Buy)
ans=table
Total_Outperformance_Test_Buy
_____________________________
196.57
Display TGLR and GLR for the testing data.
[TGLR_Test_Buy, GLR_Test_Buy] = ...
ISTGLR(IS_TWAP_Horizon_Test_Buy - IS_Agent_Horizon_Test_Buy);
table(TGLR_Test_Buy, GLR_Test_Buy)
ans=1×2 table
TGLR_Test_Buy GLR_Test_Buy
_____________ ____________
2.018 1.4641
References
[1] Bertsimas, Dimitris, and Andrew W. Lo. "Optimal Control of Execution Costs." Journal of Financial Markets 1, no. 1 (1998): 1-50.
[2] Almgren, Robert, and Neil Chriss. "Optimal Execution of Portfolio Transactions." Journal of Risk 3, no. 2 (2000): 5-40.
[3] Nevmyvaka, Yuriy, Yi Feng, and Michael Kearns. "Reinforcement Learning for Optimized Trade Execution." In Proceedings of the 23rd International Conference on Machine Learning, pp. 673-680. 2006.
[4] Ning B., F. Lin, and S. Jaimungal. "Double Deep Q-Learning for Optimal Execution." Preprint, submitted June 8, 2020. Available at https://arxiv.org/abs/1812.06600.
[5] Lin S. and P. A. Beling. "Optimal Liquidation with Deep Reinforcement Learning." 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) Deep Reinforcement Learning Workshop. Vancouver, Canada, 2019.
[6] Lin S. and P. A. Beling. "An End-to-End Optimal Trade Execution Framework Based on Proximal Policy Optimization." Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI 2020) Special Track on AI in FinTech. 4548-4554. 2020.
[7] LOBSTER Limit Order Book Data. Berlin: frischedaten UG (haftungsbeschränkt). Available at https://lobsterdata.com/.
Local Functions
function [IS_TWAP_Horizon,IS_TWAP_Step,InventorySharesStepTWAP,TradingSharesStepTWAP] = ...
    executeTWAPTrades(InputLOB,NumLevels,TotalTradingShares,NumIntervalsPerHorizon,NumHorizons,BuyTrade)
% executeTWAPTrades Function for Executing Trades for TWAP Baseline
% This function executes buy or sell trades for the TWAP baseline and
% computes the corresponding implementation shortfalls, inventory shares
% and trading shares at each step.

NumSteps = NumHorizons*NumIntervalsPerHorizon;
IS_TWAP_Horizon = nan(NumHorizons,1);
IS_TWAP_Step = nan(NumSteps,1);
TradingSharesStepTWAP = round(TotalTradingShares./NumIntervalsPerHorizon).*ones(NumSteps,1);

if BuyTrade
    InventorySharesStepTWAP = zeros(NumSteps,1);
else
    InventorySharesStepTWAP = zeros(NumSteps,1) + TotalTradingShares;
end

for HorizonIdx = 1:NumHorizons
    HorizonExecutedPrices = zeros(NumLevels*NumIntervalsPerHorizon,1);
    HorizonExecutedShares = HorizonExecutedPrices;
    for IntervalIdx = 1:NumIntervalsPerHorizon
        CurrentBook = InputLOB(IntervalIdx + NumIntervalsPerHorizon*(HorizonIdx-1),:);

        % Execute trade
        if IntervalIdx==NumIntervalsPerHorizon
            if BuyTrade
                % Enforce target ending inventory constraint for buy trade (buy missing shares)
                CurrentTradingShares = max(TotalTradingShares - ...
                    InventorySharesStepTWAP(IntervalIdx + NumIntervalsPerHorizon.*(HorizonIdx-1)), 0);
            else
                % Enforce zero ending inventory constraint for sell trade (sell remaining shares)
                CurrentTradingShares = max(InventorySharesStepTWAP(IntervalIdx + ...
                    NumIntervalsPerHorizon.*(HorizonIdx-1)), 0);
            end
        else
            CurrentTradingShares = TradingSharesStepTWAP(IntervalIdx + ...
                NumIntervalsPerHorizon.*(HorizonIdx-1));
        end

        [ExecutedShares,ExecutedPrices,InitialPrice,TradedLevels] = ...
            tradeLOB(CurrentBook,CurrentTradingShares,BuyTrade);

        if IntervalIdx==1
            ArrivalPrice = InitialPrice;
        end

        HorizonExecutedShares((1:TradedLevels) + (IntervalIdx-1)*NumLevels) = ExecutedShares;
        HorizonExecutedPrices((1:TradedLevels) + (IntervalIdx-1)*NumLevels) = ExecutedPrices;

        if BuyTrade
            InventorySharesStepTWAP((IntervalIdx:NumIntervalsPerHorizon) + ...
                NumIntervalsPerHorizon.*(HorizonIdx-1)) = ...
                InventorySharesStepTWAP(IntervalIdx + ...
                NumIntervalsPerHorizon.*(HorizonIdx-1)) + sum(ExecutedShares);
        else
            InventorySharesStepTWAP((IntervalIdx:NumIntervalsPerHorizon) + ...
                NumIntervalsPerHorizon.*(HorizonIdx-1)) = ...
                InventorySharesStepTWAP(IntervalIdx + ...
                NumIntervalsPerHorizon.*(HorizonIdx-1)) - sum(ExecutedShares);
        end

        % Compute Implementation Shortfall for each Step
        IS_TWAP_Step(IntervalIdx + NumIntervalsPerHorizon.*(HorizonIdx-1)) = ...
            endingInventoryIS(ArrivalPrice,HorizonExecutedPrices,HorizonExecutedShares,BuyTrade);
    end

    % Compute Implementation Shortfall for each Horizon
    IS_TWAP_Horizon(HorizonIdx) = endingInventoryIS(ArrivalPrice, ...
        HorizonExecutedPrices,HorizonExecutedShares,BuyTrade);
end
end

function [InitialObservation,LoggedSignals] = RL_OptimalExecution_LOB_ResetFcn(EnvConstants)
% RL_OptimalExecution_LOB_ResetFcn is Reset Function for RL Agent
% Define Reset Function
%
% Inputs:
%
% EnvConstants: Structure with the following fields:
% - SampledBook: Limit order book sampled at time intervals
% - BuyTrade: Scalar logical indicating the trading direction
%       true: Buy
%       false: Sell
% - TotalTradingShares: Number of shares to trade over the horizon (scalar)
% - NumIntervalsPerHorizon: Number of intervals per horizon (scalar)
% - NumHorizons: Number of horizons in the data (scalar)
% - NumSteps: Number of steps in the data (scalar)
% - NumLevels: Number of levels in the limit order book (scalar)
% - IS_TWAP: Implementation shortfall for TWAP baseline (NumSteps-by-1 vector)
%
% Outputs:
%
% InitialObservation:
% - Remaining shares to be traded by agent
% - Remaining intervals (steps) in the current trading horizon
% - Implementation shortfall of agent
% - Price divergence times 100
%
% LoggedSignals: Structure with following fields:
% - IntervalIdx: Current time interval index (initial value: 0)
% - HorizonIdx: Current time horizon index (initial value: 0)
% - CurrentInventoryShares: Current number of shares in inventory
%       (Initial value: 0 for Buy trade)
%       (Initial value: TotalTradingShares for Sell trade)
% - ArrivalPrice: Arrival price at the beginning of horizon (initial value: 0)
% - ExecutedShares: Number of executed shares at each traded level (initial value: 0)
% - ExecutedPrices: Prices of executed shares at each traded level (initial value: 0)
% - HorizonExecutedShares: Executed shares at each level over the horizon
%       (Initial value: (NumLevels * NumIntervalsPerHorizon)-by-1 vector of zeros)
% - HorizonExecutedPrices: Prices of Executed shares at each level over the horizon
%       (Initial value: (NumLevels * NumIntervalsPerHorizon)-by-1 vector of zeros)
% - IS_Agent: Implementation shortfall history of agent
%       (Initial value: NumIntervals-by-1 vector of zeros)
% - Reward: Reward history of agent
%       (Initial value: NumIntervals-by-1 vector of zeros)
% - NumIntervalsPerHorizon: Number of intervals per horizon (fixed value: NumIntervalsPerHorizon)
% - NumHorizons: Number of horizons (fixed value: NumHorizon)
% - NumLevels: Number of levels in the limit order book (fixed value: NumLevels)

IntervalIdx = 0;
HorizonIdx = 0;
LoggedSignals.IntervalIdx = IntervalIdx;
LoggedSignals.HorizonIdx = HorizonIdx;

InitialObservation = ...
    [EnvConstants.TotalTradingShares EnvConstants.NumIntervalsPerHorizon 0 0];

if EnvConstants.BuyTrade
    LoggedSignals.CurrentInventoryShares = 0;
else
    LoggedSignals.CurrentInventoryShares = EnvConstants.TotalTradingShares;
end

LoggedSignals.ArrivalPrice = 0;
LoggedSignals.ExecutedShares = 0;
LoggedSignals.ExecutedPrices = 0;
LoggedSignals.HorizonExecutedShares = ...
    zeros(EnvConstants.NumLevels*EnvConstants.NumIntervalsPerHorizon,1);
LoggedSignals.HorizonExecutedPrices = LoggedSignals.HorizonExecutedShares;
LoggedSignals.IS_Agent = zeros(EnvConstants.NumSteps,1);
LoggedSignals.Reward = LoggedSignals.IS_Agent;
LoggedSignals.NumIntervalsPerHorizon = EnvConstants.NumIntervalsPerHorizon;
LoggedSignals.NumHorizons = EnvConstants.NumHorizons;
LoggedSignals.NumLevels = EnvConstants.NumLevels;

disp("Reset!")
end

function [Observation,Reward,IsDone,LoggedSignals] = ...
    RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants)
% RL_OptimalExecution_LOB_StepFcn function is Step Function for RL Environment
% Given current Action and LoggedSignals, update Observation, Reward, and
% LoggedSignals, and indicate whether the episode is complete.
%
% Inputs:
%
% Action: Number of shares to be traded at the current step (scalar)
%
% LoggedSignals: Structure with following fields
% - IntervalIdx: Current time interval index
% - HorizonIdx: Current time horizon index
% - CurrentInventoryShares: Current number of shares in inventory
% - ArrivalPrice: Arrival price at the beginning of horizon
% - ExecutedShares: Number of executed shares at each traded level
% - ExecutedPrices: Prices of executed shares at each traded level
% - HorizonExecutedShares: Executed shares at each level over the horizon
%       ((NumLevels * NumIntervalsPerHorizon)-by-1 vector)
% - HorizonExecutedPrices: Prices of Executed shares at each level over the horizon
%       ((NumLevels * NumIntervalsPerHorizon)-by-1 vector)
% - IS_Agent: Implementation shortfall history of agent
%       (NumIntervals-by-1 vector)
% - Reward: Reward history of agent
%       (NumIntervals-by-1 vector)
% - NumIntervalsPerHorizon: Number of intervals per horizon (fixed value: NumIntervalsPerHorizon)
% - NumHorizons: Number of horizons (fixed value: NumHorizon)
% - NumLevels: Number of levels in the limit order book (fixed value: NumLevels)
%
% EnvConstants: Structure with the following fields
% - SampledBook: Limit order book sampled at time intervals
% - BuyTrade: Scalar logical indicating the trading direction
%       true: Buy
%       false: Sell
% - TotalTradingShares: Number of shares to trade over the horizon (scalar)
% - NumIntervalsPerHorizon: Number of intervals per horizon (scalar)
% - NumHorizons: Number of horizons in the data (scalar)
% - NumSteps: Number of steps in the data (scalar)
% - NumLevels: Number of levels in the limit order book (scalar)
% - IS_TWAP: Implementation shortfall for TWAP baseline (NumSteps-by-1 vector)
%
% Outputs:
%
% Observation:
% - Remaining shares to be traded by agent
% - Remaining intervals (steps) in the current trading horizon
% - Implementation shortfall of agent
% - Price divergence times 100
%
% Reward: Reward for current step
%
% IsDone: Logical indicating whether the current episode is complete
%
% LoggedSignals: Updated LoggedSignals

% Update LoggedSignals indices
if (LoggedSignals.IntervalIdx == 0) && (LoggedSignals.HorizonIdx == 0)
    LoggedSignals.IntervalIdx = 1;
    LoggedSignals.HorizonIdx = 1;
elseif (LoggedSignals.IntervalIdx >= LoggedSignals.NumIntervalsPerHorizon)
    LoggedSignals.IntervalIdx = 1;
    LoggedSignals.HorizonIdx = LoggedSignals.HorizonIdx + 1;
    LoggedSignals.HorizonExecutedShares = ...
        zeros(EnvConstants.NumLevels*EnvConstants.NumIntervalsPerHorizon,1);
    LoggedSignals.HorizonExecutedPrices = LoggedSignals.HorizonExecutedShares;
    if EnvConstants.BuyTrade
        LoggedSignals.CurrentInventoryShares = 0;
    else
        LoggedSignals.CurrentInventoryShares = EnvConstants.TotalTradingShares;
    end
else
    LoggedSignals.IntervalIdx = LoggedSignals.IntervalIdx + 1;
end

StepIdx = LoggedSignals.IntervalIdx + ...
    LoggedSignals.NumIntervalsPerHorizon.*(LoggedSignals.HorizonIdx-1);
RemainingIntervals = LoggedSignals.NumIntervalsPerHorizon - LoggedSignals.IntervalIdx + 1;
IS_TWAP = EnvConstants.IS_TWAP(StepIdx);

if EnvConstants.BuyTrade
    AvgPrice = mean(EnvConstants.SampledBook(StepIdx,[1 5]),2);
    PriceDivergence = LoggedSignals.ArrivalPrice - AvgPrice;
else
    AvgPrice = mean(EnvConstants.SampledBook(StepIdx,[3 7]),2);
    PriceDivergence = AvgPrice - LoggedSignals.ArrivalPrice;
end

% Enforce ending inventory constraint
if LoggedSignals.IntervalIdx >= LoggedSignals.NumIntervalsPerHorizon
    if EnvConstants.BuyTrade
        TradingShares = max(EnvConstants.TotalTradingShares - ...
            LoggedSignals.CurrentInventoryShares, 0);
    else
        TradingShares = LoggedSignals.CurrentInventoryShares;
    end
else
    if EnvConstants.BuyTrade
        if PriceDivergence > 0
            AgentRemainingSharestoTrade = max(EnvConstants.TotalTradingShares - ...
                LoggedSignals.CurrentInventoryShares,0);
            TradingShares = round(AgentRemainingSharestoTrade./RemainingIntervals./2);
            TradingShares = min(TradingShares,Action);
        else
            TradingShares = min(EnvConstants.TotalTradingShares - ...
                LoggedSignals.CurrentInventoryShares, Action);
        end
    else
        if PriceDivergence > 0
            AgentRemainingSharestoTrade = max(LoggedSignals.CurrentInventoryShares, 0);
            TradingShares = round(AgentRemainingSharestoTrade./RemainingIntervals./2);
            TradingShares = min(TradingShares,Action);
        else
            TradingShares = min(LoggedSignals.CurrentInventoryShares, Action);
        end
    end
end

% Update IsDone:
% Episode is complete (IsDone is true) when time step reaches the end of
% the trading data.
if (LoggedSignals.HorizonIdx >= LoggedSignals.NumHorizons) && ...
        (LoggedSignals.IntervalIdx >= LoggedSignals.NumIntervalsPerHorizon)
    IsDone = true;
    disp("Last Horizon. Episode is Done!");
    disp("HorizonIdx:");
    LoggedSignals.HorizonIdx
elseif (LoggedSignals.IntervalIdx >= LoggedSignals.NumIntervalsPerHorizon)
    IsDone = false;
    % disp("HorizonIdx:");
    % LoggedSignals.HorizonIdx
else
    IsDone = false;
end

% Execute trade
CurrentLOB = EnvConstants.SampledBook(LoggedSignals.IntervalIdx + ...
    LoggedSignals.NumIntervalsPerHorizon*(LoggedSignals.HorizonIdx-1),:);
[ExecutedShares,ExecutedPrices,InitialPrice,TradedLevels] = ...
    tradeLOB(CurrentLOB,TradingShares,EnvConstants.BuyTrade);

% Update Reward
if LoggedSignals.IntervalIdx==1
    LoggedSignals.ArrivalPrice = InitialPrice;
end
LoggedSignals.HorizonExecutedShares((1:TradedLevels) + ...
    (LoggedSignals.IntervalIdx-1)*LoggedSignals.NumLevels) = ExecutedShares;
LoggedSignals.HorizonExecutedPrices((1:TradedLevels) + ...
    (LoggedSignals.IntervalIdx-1)*LoggedSignals.NumLevels) = ExecutedPrices;
IS_Agent = endingInventoryIS(LoggedSignals.ArrivalPrice,LoggedSignals.HorizonExecutedPrices, ...
    LoggedSignals.HorizonExecutedShares,EnvConstants.BuyTrade);

if EnvConstants.BuyTrade
    AgentRemainingSharestoTrade = max(EnvConstants.TotalTradingShares - ...
        LoggedSignals.CurrentInventoryShares - sum(ExecutedShares),0);
else
    AgentRemainingSharestoTrade = max(LoggedSignals.CurrentInventoryShares - ...
        sum(ExecutedShares), 0);
end

[PenaltyAgentExecutedShares,PenaltyAgentExecutedPrices] = tradeLOB(CurrentLOB, ...
    round(AgentRemainingSharestoTrade./RemainingIntervals),EnvConstants.BuyTrade);
IS_Agent_Penalty = RemainingIntervals.*endingInventoryIS(LoggedSignals.ArrivalPrice, ...
    PenaltyAgentExecutedPrices,PenaltyAgentExecutedShares,EnvConstants.BuyTrade);
TWAPRemainingSharestoTrade = (EnvConstants.NumIntervalsPerHorizon - LoggedSignals.IntervalIdx)* ...
EnvConstants.TotalTradingShares/EnvConstants.NumIntervalsPerHorizon; [PenaltyTWAPExecutedShares,PenaltyTWAPExecutedPrices] = tradeLOB(CurrentLOB, ... round(TWAPRemainingSharestoTrade./RemainingIntervals),EnvConstants.BuyTrade); IS_TWAP_Penalty = RemainingIntervals.*endingInventoryIS(LoggedSignals.ArrivalPrice, ... PenaltyTWAPExecutedPrices,PenaltyTWAPExecutedShares,EnvConstants.BuyTrade); Reward = ((IS_TWAP + IS_TWAP_Penalty) - (IS_Agent + IS_Agent_Penalty)) .* ... LoggedSignals.IntervalIdx./ LoggedSignals.NumIntervalsPerHorizon; % No reward if trading target already met if EnvConstants.BuyTrade if LoggedSignals.CurrentInventoryShares >= EnvConstants.TotalTradingShares Reward = 0; end else if LoggedSignals.CurrentInventoryShares <= 0 Reward = 0; end end % Update LoggedSignals LoggedSignals.ExecutedPrices = ExecutedPrices; LoggedSignals.ExecutedShares = ExecutedShares; if EnvConstants.BuyTrade LoggedSignals.CurrentInventoryShares = ... LoggedSignals.CurrentInventoryShares + sum(ExecutedShares); else LoggedSignals.CurrentInventoryShares = ... max(LoggedSignals.CurrentInventoryShares - sum(ExecutedShares), 0); end % Update LoggedSignals.IS_Agent LoggedSignals.IS_Agent(StepIdx) = IS_Agent; % Update LoggedSignals.Reward LoggedSignals.Reward(StepIdx) = Reward; % Update Observation if LoggedSignals.IntervalIdx == LoggedSignals.NumIntervalsPerHorizon Observation = ... [EnvConstants.TotalTradingShares RemainingIntervals-1 IS_Agent PriceDivergence*100]; else Observation = ... [AgentRemainingSharestoTrade RemainingIntervals-1 IS_Agent PriceDivergence*100]; end end function [ExecutedShares,ExecutedPrices,InitialPrice,TradedLevels] = ... tradeLOB(CurrentLOB,TradingShares,BuyTrade) % tradeLOB function executes market order trade based on current limit order book. % This function computes the results of executing a market order trade % for a specified number of shares and trade direction (buy/sell) based on % the information available in the current snapshot of the limit order book. % % Inputs: % % CurrentLOB - 1-by-(NumLevels*4) vector of current limit order book information % % e.g. For NumLevels==5, 1-by-20 vector with the following values: % [AskPrice1 AskSize1 BidPrice1 BidSize1 ... AskPrice5 AskSize5 BidPrice5 BidSize5] % % e.g. For NumLevels==10, 1-by-40 vector with the following values: % [AskPrice1 AskSize1 BidPrice1 BidSize1 ... 
AskPrice10 AskSize10 BidPrice10 BidSize10] % % TradingShares - Scalar number of shares to be traded % % BuyTrade - Scalar logical indicating trading direction: % true: Buy % false: Sell % % Outputs: % ExecutedShares - Vector of number of executed shares at each traded level % ExecutedPrices - Vector of number of executed prices at each traded level % InitialPrice - Scalar numeric initial price % TradedLevels - Scalar numeric traded levels LOBLength = length(CurrentLOB); % Separate Bid and Ask data AskPriceIdx = (1:4:LOBLength); AskSizeIdx = AskPriceIdx + 1; BidPriceIdx = AskSizeIdx + 1; BidSizeIdx = BidPriceIdx + 1; AskPrices = CurrentLOB(AskPriceIdx)'; AskSizes = CurrentLOB(AskSizeIdx)'; BidPrices = CurrentLOB(BidPriceIdx)'; BidSizes = CurrentLOB(BidSizeIdx)'; if BuyTrade LOBPrices = AskPrices; LOBSizes = AskSizes; else LOBPrices = BidPrices; LOBSizes = BidSizes; end InitialPrice = LOBPrices(1); % Execute trade CumulativeLOBSizes = cumsum(LOBSizes); TradedLevels = sum(CumulativeLOBSizes <= TradingShares); if TradedLevels < 1 TradedLevels = 1; elseif CumulativeLOBSizes(TradedLevels) < TradingShares TradedLevels = TradedLevels + 1; end ExecutedPrices = LOBPrices(1:TradedLevels); ExecutedShares = LOBSizes(1:TradedLevels); ExecutedShares(end) = ExecutedShares(end) - (sum(ExecutedShares) - TradingShares); end function outIS = endingInventoryIS(ArrivalPrice,ExecutedPrices,TradedVolumes,BuyTrade) % EndingInventoryIS function for Implementation Shortfall given ending inventory constraints % This function computes the Implementation Shortfall (IS) under the % assumption of ending inventory constraints (i.e. no opportunity cost). % % For a Sell trade, the constraint is to have zero ending inventory % assuming successful liquidation of all shares. % % For a Buy trade, the constraint is to have full ending inventory % assuming successful purchase of all shares. % % Inputs: % % ArrivalPrice - Scalar price immediately before trade execution % % ExecutedPrices - Vector of executed prices % % TradedVolumes - Vector of executed trade volumes % % BuyTrade - Scalar logical indicating trading direction: % true: Buy % false: Sell % % Outputs: % outIS - Implementation shortfall outIS = ArrivalPrice.*sum(TradedVolumes) - sum(ExecutedPrices.*TradedVolumes); if BuyTrade outIS = -outIS; end end function [TGLR, GLR] = ISTGLR(ISDiffValues) % ISTGLR function for Total-Gain-to-Loss Ratio for Implementation Shortfalls (IS) % This function computes the Total-Gain-to-Loss Ratio (TGLR) and the % Gain-to-Loss Ratio (GLR) for a vector of values (IS_Baseline - IS_Agent) % where: % - IS_Baseline: Implementation shortfalls for baseline (e.g. TWAP) % - IS_Agent: Implementation shortfalls for agent % % Input: % - ISDiffValues: Vector of IS difference values (IS_Baseline - IS_Agent) % % Outputs: % - TGLR: Total-Gain-to-Loss Ratio computed as the ratio of % Total Positive over Total Negative magnitudes of IS difference % - GLR: Gain-to-Loss Ratio computed as the ratio of % Mean Positive over Mean Negative magnitudes of IS difference ISDiffValues = ISDiffValues(:); ISGains = ISDiffValues(ISDiffValues>0); ISLosses = ISDiffValues(ISDiffValues<0); MeanGains = abs(mean(ISGains)); MeanLosses = abs(mean(ISLosses)); TGLR = sum(ISGains)./abs(sum(ISLosses)); GLR = MeanGains./MeanLosses; end
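To see how tradeLOB walks the book, consider a minimal sketch with a hypothetical 5-level snapshot. The prices and sizes below are illustrative placeholders, not values taken from the INTC data:

% Hypothetical 5-level snapshot in the format
% [AskPrice1 AskSize1 BidPrice1 BidSize1 ... AskPrice5 AskSize5 BidPrice5 BidSize5],
% with prices in the same units as the LOBSTER file. Values are illustrative only.
CurrentLOB = [275100 300 275000 200 ...
              275200 450 274900 350 ...
              275300 500 274800 400 ...
              275400 250 274700 300 ...
              275500 600 274600 450];

% Sell 700 shares with a market order. The order consumes the bid side level by
% level: 200 shares at 275000, 350 at 274900, and the remaining 150 at 274800.
BuyTrade = false;
[ExecutedShares,ExecutedPrices,InitialPrice,TradedLevels] = ...
    tradeLOB(CurrentLOB,700,BuyTrade)

Here the initial price is the best bid, three levels are traded, and endingInventoryIS(InitialPrice,ExecutedPrices,ExecutedShares,BuyTrade) then measures how much worse the realized proceeds are than selling all 700 shares at the arrival price.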
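The reset and step functions above are designed to be plugged into a function-handle environment. The following is a minimal sketch of that wiring, assuming you have already built the EnvConstants structure described in the function help; ActionShares (the finite set of share amounts the agent can choose) and IS_Agent_Horizon (the agent's per-horizon implementation shortfalls collected after simulation) are placeholder names used here for illustration:

% Observation: [remaining shares, remaining intervals, implementation shortfall,
% price divergence x 100], returned as a 1-by-4 row vector by the functions above.
ObservationInfo = rlNumericSpec([1 4]);

% Finite action set of trading shares; ActionShares is a placeholder vector,
% for example multiples of the per-step TWAP quantity.
ActionInfo = rlFiniteSetSpec(ActionShares);

% Bind EnvConstants into the reset and step functions with anonymous functions.
ResetFcn = @() RL_OptimalExecution_LOB_ResetFcn(EnvConstants);
StepFcn  = @(Action,LoggedSignals) ...
    RL_OptimalExecution_LOB_StepFcn(Action,LoggedSignals,EnvConstants);

env = rlFunctionEnv(ObservationInfo,ActionInfo,StepFcn,ResetFcn);

% After training and simulating an agent, compare its per-horizon implementation
% shortfalls against the TWAP baseline with the total-gain-to-loss ratio.
[TGLR,GLR] = ISTGLR(IS_TWAP_Horizon - IS_Agent_Horizon);

A TGLR greater than 1 indicates that, summed over the evaluated horizons, the agent's gains relative to TWAP outweigh its losses.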
See Also
Deep Network Designer | trainnet | trainingOptions | dlnetwork