Sequence-to-Sequence Classification Using 1-D Convolutions

This example shows how to classify each time step of sequence data using a generic temporal convolutional network (TCN).

While sequence-to-sequence tasks are commonly solved with recurrent neural network architectures, Bai et al. [1] show that convolutional neural networks can match the performance of recurrent networks on typical sequence modeling tasks or even outperform them. Potential benefits of using convolutional networks can be better parallelism, better control over the receptive field size, better control of the network's memory footprint during training, and more stable gradients. Just like recurrent networks, convolutional networks can operate on variable length input sequences and can be used to model sequence-to-sequence or sequence-to-one tasks.

This example uses sensor data obtained from a smartphone worn on the body and trains a temporal convolutional network as a function using a custom training loop and automatic differentiation to recognize the activity of the wearer given time series data representing accelerometer readings in three different directions.

Load Training Data

Load the human activity recognition data. The data contains seven time series of sensor data obtained from a smartphone worn on the body. Each sequence has three features and varies in length. The three features correspond to the accelerometer readings in three different directions. Six sequences are used for training and one sequence is used for testing after training.

s = load("HumanActivityTrain.mat");
XTrain = s.XTrain;
YTrain = s.YTrain;

numObservations = numel(XTrain);
classes = categories(YTrain{1});
numClasses = numel(classes);

Visualize one training sequence in a plot. Plot the first feature of the first training sequence and color the plot according to the corresponding activity.

X = XTrain{1}(1,:);

for j = 1:numel(classes)
    label = classes(j);
    idx = find(YTrain{1} == label);
    hold on
hold off

xlabel("Time Step")
title("Training Sequence 1, Feature 1")

Define Deep Learning Model

The main building block of a temporal convolutional network is a dilated causal convolution layer which operates over the time steps of each sequence. In this context, "causal" means that the activations computed for a particular time step cannot depend on activations from future time steps.

In order to build up context from previous time steps, multiple convolutional layers are typically stacked on top of each other. To achieve large receptive field sizes the dilation factor of subsequent convolution layers is increased exponentially as shown in the image below. Assuming the dilation factor of the k-th convolutional layer is 2(k-1) and the stride is 1, then the receptive field size of such a network can be computed as R=(f-1)(2K-1)+1, where f is the filter size and K is the number of convolutional layers. Change the filter size and number of layers to easily adjust the receptive field size and the number or learnable parameters as necessary for the data and task at hand.

One of the disadvantages of TCNs compared to RNNs is the higher memory footprint during inference. The entire raw sequence is required to compute the next time step. To reduce inference time and memory consumption, especially for step-ahead predictions, it can be beneficial to train with the smallest sensible receptive field size R and only perform prediction with the last R time steps of the input sequence.

A general TCN architecture (as described in [1]) consists of multiple residual blocks, each containing two sets of dilated causal convolution layers with the same dilation factor, followed by normalization, ReLU activation, and spatial dropout layers. The input to each block is added to the output of the block (including a 1-by-1 convolution on the input when the number of channels between the input and output do not match) and a final activation function is applied.

Define and Initialize Model Parameters

Specify the parameters for the TCN architecture with four residual blocks containing dilated causal convolution layers with each 175 filters of size 3.

numBlocks = 4;
numFilters = 175;
filterSize = 3;
dropoutFactor = 0.05;

To pass the model hyperparameters to the model functions (the number of blocks and the dropout factor), create a struct containing these values.

hyperparameters = struct;
hyperparameters.NumBlocks = numBlocks;
hyperparameters.DropoutFactor = dropoutFactor;

Create a struct containing dlarray objects for all the learnable parameters of the model based on the number of input channels and the hyperparameters that define the model architecture. Each residual block requires weights parameters and bias parameters for each of the two convolution operations. The first residual block usually also requires weights and biases for an additional convolution operation with filter size 1. The final fully connected operation requires a weights and bias parameter as well. Initialize the learnable layer weights using the function initializeGaussian, listed at the end of the example. Initialize the learnable layer biases with zeros.

numInputChannels = 3;

parameters = struct;
numChannels = numInputChannels;

for k = 1:numBlocks
    parametersBlock = struct;
    blockName = "Block"+k;
    weights = initializeGaussian([filterSize, numChannels, numFilters]);
    bias = zeros(numFilters, 1, 'single');
    parametersBlock.Conv1.Weights = dlarray(weights);
    parametersBlock.Conv1.Bias = dlarray(bias);
    weights = initializeGaussian([filterSize, numFilters, numFilters]);
    bias = zeros(numFilters, 1, 'single');
    parametersBlock.Conv2.Weights = dlarray(weights);
    parametersBlock.Conv2.Bias = dlarray(bias);
    % If the input and output of the block have different numbers of
    % channels, then add a convolution with filter size 1.
    if numChannels ~= numFilters
        weights = initializeGaussian([1, numChannels, numFilters]);
        bias = zeros(numFilters, 1, 'single');
        parametersBlock.Conv3.Weights = dlarray(weights);
        parametersBlock.Conv3.Bias = dlarray(bias);
    numChannels = numFilters;
    parameters.(blockName) = parametersBlock;

weights = initializeGaussian([numClasses,numChannels]);
bias = zeros(numClasses,1,'single');

parameters.FC.Weights = dlarray(weights);
parameters.FC.Bias = dlarray(bias);

View the network parameters.

parameters = struct with fields:
    Block1: [1×1 struct]
    Block2: [1×1 struct]
    Block3: [1×1 struct]
    Block4: [1×1 struct]
        FC: [1×1 struct]

View the parameters of the first block.

ans = struct with fields:
    Conv1: [1×1 struct]
    Conv2: [1×1 struct]
    Conv3: [1×1 struct]

View the parameters of the first convolutional operation of the first block.

ans = struct with fields:
    Weights: [3×3×175 dlarray]
       Bias: [175×1 dlarray]

Define Model and Model Gradients Functions

Create the function model, listed in the Model Function section at the end of the example, that computes the outputs of the deep learning model. The function model takes the input data, the learnable model parameters, the model hyperparameters, and a flag that specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence.

Create the function modelGradients, listed in the Model Gradients Function section at the end of the example, that takes a mini-batch of input data, the corresponding target sequences, and the parameters of the network, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss.

Specify Training Options

Specify a set of training options used in the custom training loop.

  • Train for 30 epochs with mini-batch size 1.

  • Start with an initial learn rate of 0.001

  • Multiply the learn rate by 0.1 every 12 epochs.

  • Clip the gradients using the L2 norm with threshold 1.

maxEpochs = 30;
miniBatchSize = 1;
initialLearnRate = 0.001;
learnRateDropFactor = 0.1;
learnRateDropPeriod = 12;
gradientThreshold = 1;

To train on a GPU if one is available, specify the execution environment "auto". To explicitly train on the CPU or GPU, specify "cpu" or "gpu" respectively. Using a GPU requires Parallel Computing Toolbox™ and a CUDA® enabled NVIDIA® GPU with compute capability 3.0 or higher.

executionEnvironment = "auto";

To monitor the training progress, you can plot the training loss after each iteration. Create the variable plots that contains "training-progress". If you do not want to plot the training progress, then set this value to "none".

plots = "training-progress";

Train Model

Train the network via stochastic gradient descent by looping over the sequences in the training dataset, computing parameter gradients, and updating the network parameters via the Adam update rule. This process is repeated multiple times (referred to as epochs) until training converges and the maximum number of epochs is reached.

For each epoch, shuffle the training data.

For each mini-batch:

  • Convert the labels to dummy variables.

  • Preprocess the sequence data using the transformSequences function, listed at the end of the example.

  • Convert the data to dlarray objects with underlying type single

  • For GPU training, convert the data to gpuArray.

  • Evaluate the model gradients and loss using dlfeval and the modelGradients function.

  • Clip the gradients if they are too large, using the function thresholdL2Norm, listed at the end of the example, and the dlupdate function.

  • Update the network parameters using the adamupdate function.

  • Update the training progress plot.

After completing learnRateDropPeriod epochs, reduce the learn rate by multiplying the current learning rate by learnRateDropFactor.

Initialize the learning rate which will be multiplied by the LearnRateDropFactor value every LearnRateDropPeriod epochs.

learnRate = initialLearnRate;

Initialize the moving average of the parameter gradients and the element-wise squares of the gradients used by the Adam optimizer.

trailingAvg = [];
trailingAvgSq = [];

Initialize a plot showing the training progress.

if plots == "training-progress"
    lineLossTrain = animatedline('Color',[0.85 0.325 0.098]);
    ylim([0 inf])
    grid on

Train the model.

iteration = 0;
numIterationsPerEpoch = floor(numObservations./miniBatchSize);

start = tic;

% Loop over epochs.
for epoch = 1:maxEpochs
    % Shuffle the data.
    idx = randperm(numObservations);
    XTrain = XTrain(idx);
    YTrain = YTrain(idx);
    % Loop over mini-batches.
    for i = 1:numIterationsPerEpoch
        iteration = iteration + 1;
        % Read mini-batch of data and apply the transformSequences
        % preprocessing function.
        idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
        [X,Y,numTimeSteps] = transformSequences(XTrain(idx),YTrain(idx));
        % Convert to dlarray.
        dlX = dlarray(X);
        % If training on a GPU, convert data to gpuArray.
        if (executionEnvironment == "auto" && canUseGPU) || executionEnvironment == "gpu"
            dlX = gpuArray(dlX);
        % Evaluate the model gradients and loss using dlfeval.
        [gradients, loss] = dlfeval(@modelGradients,dlX,Y,parameters,hyperparameters,numTimeSteps);
        % Clip the gradients.
        gradients = dlupdate(@(g) thresholdL2Norm(g,gradientThreshold),gradients);
        % Update the network parameters using the Adam optimizer.
        [parameters,trailingAvg,trailingAvgSq] = adamupdate(parameters,gradients, ...
            trailingAvg, trailingAvgSq, iteration, learnRate);
        if plots == "training-progress"
            % Plot training progress.
            D = duration(0,0,toc(start),'Format','hh:mm:ss');
            % Normalize the loss over the sequence lengths            
            loss = mean(loss ./ numTimeSteps);
            loss = double(gather(extractdata(loss)));
            loss = mean(loss);
            addpoints(lineLossTrain,iteration, mean(loss));

            title("Epoch: " + epoch + ", Elapsed: " + string(D))
    % Reduce the learning rate after learnRateDropPeriod epochs
    if mod(epoch,learnRateDropPeriod) == 0
        learnRate = learnRate*learnRateDropFactor;

Test Model

Test the classification accuracy of the model by comparing the predictions on a held-out test set with the true labels for each time step.

s = load("HumanActivityTest.mat");
XTest = s.XTest;
YTest = s.YTest;
numObservationsTest = numel(XTest);

Preprocess the test data using the same function used for training and convert the data to dlarray.

[X,Y] = transformSequences(XTest,YTest);
dlXTest = dlarray(X);

To predict the labels of the test data, use the model function with the trained parameters, the hyperparameters, and the doTraining option set to false.

doTraining = false;
dlYPred = model(dlXTest,parameters,hyperparameters,doTraining);

To get the categorical labels of the predictions, find the class label with the highest score for each time step. Calculate the accuracy by evaluating the mean classification accuracy for the sequences.

YPred = gather(extractdata(dlYPred));

labelsPred = categorical(zeros(numObservationsTest,size(dlYPred,3)));
accuracy = zeros(1,numObservationsTest);

for i = 1:numObservationsTest
    [~,idxPred] = max(YPred(:,i,:),[],1);
    [~,idxTest] = max(Y(:,i,:),[],1);
    labelsPred(i,:) = classes(idxPred)';
    accuracy(i) = mean(idxPred == idxTest);

View the mean accuracy over the test set.

ans = 0.9997

Compare the predictions for a single sequence with the corresponding test data by using a plot.

idx = 1;
hold on
hold off

xlabel("Time Step")
title("Predicted Activities")
legend(["Predicted" "Test Data"])

Model Function

The function model takes the input data dlX, the learnable model parameters, the model hyperparameters, and the flag doTraining which specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence. The model consists of multiple residual blocks with exponentially increasing dilation factors. After the last residual block, a final fullyconnect operation maps the output to the number of classes in the target data.

function dlY = model(dlX,parameters,hyperparameters,doTraining)

numBlocks = hyperparameters.NumBlocks;
dropoutFactor = hyperparameters.DropoutFactor;

dlY = dlX;

% Residual blocks.
for k = 1:numBlocks
    dilationFactor = 2^(k-1);
    parametersBlock = parameters.("Block"+k);
    dlY = residualBlock(dlY,dilationFactor,dropoutFactor,parametersBlock,doTraining);

% Fully connect.
weights = parameters.FC.Weights;
bias = parameters.FC.Bias;
dlY = fullyconnect(dlY,weights,bias,'DataFormat','CBT');


Residual Block Function

The function residualBlock implements the core building block of the temporal convolutional network.

To apply 1-D causal dilated convolution, use the dlconv function:

  • To convolve over the spatial dimensions, set the 'DataFormat' option to 'CBS' (use the dimension label 'S' instead of 'T'),

  • Set the 'DilationFactor' option according to the dilation factor of the residual block.

  • To ensure only the past time steps are used, apply padding only at the beginning of the sequence.

function dlY = residualBlock(dlX,dilationFactor,dropoutFactor,parametersBlock,doTraining)

% Convolution options.
filterSize = size(parametersBlock.Conv1.Weights,1);
paddingSize = (filterSize - 1) * dilationFactor;

% Convolution.
weights = parametersBlock.Conv1.Weights;
bias =  parametersBlock.Conv1.Bias;
dlY = dlconv(dlX,weights,bias, ...
    'DataFormat','CBS', ...
    'Stride', 1, ...
    'DilationFactor', dilationFactor, ...
    'Padding', [paddingSize; 0] );

% Instance normalization, ReLU, spatial dropout.
dlY = instanceNormalization(dlY,'CBS');
dlY = relu(dlY);
dlY = spatialDropout(dlY,dropoutFactor,'CBS',doTraining);

% Convolution.
weights = parametersBlock.Conv2.Weights;
bias = parametersBlock.Conv2.Bias;
dlY = dlconv(dlY,weights,bias, ...
    'DataFormat','CBS', ...
    'Stride', 1, ...
    'DilationFactor', dilationFactor, ...
    'Padding',[paddingSize; 0] );

% Instance normalization, ReLU, spatial dropout.
dlY = instanceNormalization(dlY,'CBS');
dlY = relu(dlY);
dlY = spatialDropout(dlY,dropoutFactor,'CBS',doTraining);

% Optional 1-by-1 convolution.
if ~isequal(size(dlX),size(dlY))
    weights = parametersBlock.Conv3.Weights;
    bias = parametersBlock.Conv3.Bias;
    dlX = dlconv(dlX,weights,bias,'DataFormat','CBS');

% Addition and ReLU
dlY = relu(dlX+dlY);


Model Gradients Function

The modelGradients function takes a mini-batch of input data dlX, the corresponding target sequences T, the learnable parameters, and the hyperparameters, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss. To compute the gradients, evaluate the modelGradients function using the dlfeval function in the training loop.

function [gradients,loss] = modelGradients(dlX,T,parameters,hyperparameters,numTimeSteps)

dlY = model(dlX,parameters,hyperparameters,true);
dlY = softmax(dlY,'DataFormat','CBT');

dlT = dlarray(T,'CBT');
loss = maskedCrossEntropyLoss(dlY, dlT, numTimeSteps);

gradients = dlgradient(mean(loss),parameters);


Masked Cross-Entropy Loss Function

The maskedCrossEntropyLoss function computes the cross-entropy loss for mini-batches of sequences, where the sequences are different lengths.

function loss = maskedCrossEntropyLoss(dlY, dlT, numTimeSteps)

numObservations = size(dlY,2);
loss = dlarray(zeros(1,numObservations,'like',dlY));

for i = 1:numObservations
    idx = 1:numTimeSteps(i);
    loss(i) = crossentropy(dlY(:,i,idx),dlT(:,i,idx),'DataFormat','CBT');


Instance Normalization Function

The instanceNormalization function normalizes the input dlX by first calculating the mean μ and the variance σ2 for each observation over each input channel. Then it calculates the normalized activations as


In comparison to batch normalization, the mean and variance is different for each observation in the mini-batch. Use normalization, such as instance normalization, between convolutional layers and nonlinearities to speed up training of convolutional neural networks and improve convergence.

function dlY = instanceNormalization(dlX,fmt)

reductionDims = find(fmt == 'S');
mu = mean(dlX,reductionDims);
sigmaSq = var(dlX,1,reductionDims);

epsilon = 1e-5;
dlY = (dlX-mu) ./ sqrt(sigmaSq+epsilon);


Spatial Dropout Function

The spatialDropout function performs spatial dropout [3] on the input dlX with dimension labels fmt when the doTraining flag is true. Spatial dropout drops an entire channel of the input data. That is, all time steps of a certain channel are dropped with the probability specified by dropoutFactor. The channels are dropped out independently in the batch dimension.

function dlY = spatialDropout(dlX,dropoutFactor,fmt,doTraining)

if doTraining
    maskSize = size(dlX);
    maskSize(fmt=='S') = 1;
    dropoutScaleFactor = single(1 - dropoutFactor);
    dropoutMask = (rand(maskSize,'like',dlX) > dropoutFactor) / dropoutScaleFactor;
    dlY = dlX .* dropoutMask;
    dlY = dlX;


Weights Initialization Function

The initializeGaussian function samples weights from a Gaussian distribution with mean 0 and standard deviation 0.01.

function parameter = initializeGaussian(sz)

parameter = randn(sz,'single') .* 0.01;


Sequence Transform Function

The sequenceTransform function takes a cell array of N sequences and returns a C-by-N-by-S numeric array of left-padded 1-D sequences and the number of time steps in each sequence, where C corresponds to the number of features of the sequences and S corresponds to the number of time steps of the longest sequence.

function [XTransformed, YTransformed, numTimeSteps] = transformSequences(X,Y)

numTimeSteps = cellfun(@(sequence) size(sequence,2),X);

miniBatchSize = numel(X);
numFeatures = size(X{1},1);
sequenceLength = max(cellfun(@(sequence) size(sequence,2),X));
classes = categories(Y{1});
numClasses = numel(classes);

sz = [numFeatures miniBatchSize sequenceLength];
XTransformed = zeros(sz,'single');

sz = [numClasses miniBatchSize sequenceLength];
YTransformed = zeros(sz,'single');

for i = 1:miniBatchSize
    predictors = X{i};
    % Create dummy labels.
    responses = zeros(numClasses, numTimeSteps(i), 'single'); 
    for c = 1:numClasses
        responses(c,Y{i}==classes(c)) = 1;
    % Left pad.
    XTransformed(:,i,:) = leftPad(predictors,sequenceLength);
    YTransformed(:,i,:) = leftPad(responses,sequenceLength);


Left Padding Function

The leftPad function takes a sequence and left-pads it with zeros to have the specified sequence length.

function sequencePadded = leftPad(sequence,sequenceLength)

[numFeatures,numTimeSteps] = size(sequence);

paddingSize = sequenceLength - numTimeSteps;
padding = zeros(numFeatures,paddingSize);

sequencePadded = [padding sequence];


Gradient Clipping Function

The thresholdL2Norm function scales the gradient g so that its L2 norm equals gradientThreshold when the L2 norm of the gradient is larger than gradientThreshold.

function g = thresholdL2Norm(g,gradientThreshold)

gradientNorm = sqrt(sum(g.^2,'all'));
if gradientNorm > gradientThreshold
    g = g * (gradientThreshold / gradientNorm);



[1] Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling." arXiv preprint arXiv:1803.01271 (2018).

[2] Van Den Oord, Aäron, et al. "WaveNet: A generative model for raw audio." SSW 125 (2016).

[3] Tompson, Jonathan, et al. "Efficient object localization using convolutional networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

See Also

| | | | | | | |

Related Topics