
incrementalLearner

Convert robust random cut forest model to incremental learner

Since R2023b

    Description

    IncrementalForest = incrementalLearner(forest) returns a robust random cut forest (RRCF) model IncrementalForest for anomaly detection, initialized using the parameters provided in the RRCF model forest. Because its property values reflect the knowledge gained from forest, IncrementalForest can detect anomalies given new observations, and it is warm, meaning that the incremental fit function can return scores and detect anomalies.


    IncrementalForest = incrementalLearner(forest,Name=Value) specifies additional options using one or more name-value arguments. For example, ScoreWarmupPeriod=500 specifies to process 500 observations before score computation and anomaly detection.
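The two syntaxes above can be sketched as follows. This is a minimal illustration, assuming Statistics and Machine Learning Toolbox R2023b or later; the synthetic predictor matrix X is hypothetical and only stands in for real training data.

```matlab
rng(0,"twister")          % For reproducibility
X = randn(500,3);         % Hypothetical training data: 500 observations, 3 predictors
forest = rrcforest(X);    % Traditionally trained RRCF model

% Syntax 1: convert the model using the default options
IncrementalForest1 = incrementalLearner(forest);

% Syntax 2: process 500 observations before score computation and anomaly detection
IncrementalForest2 = incrementalLearner(forest,ScoreWarmupPeriod=500);

class(IncrementalForest2)
```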


    Examples


    Train an incremental robust random cut forest (RRCF) model and perform anomaly detection on a data set with categorical predictors.

    Load Data

    Load census1994.mat. The data set consists of demographic data from the US Census Bureau.

    load census1994.mat

    incrementalRobustRandomCutForest does not use observations with missing values. Remove missing values in the data to reduce memory consumption and speed up training. Keep only the first 1000 observations in the training data set and the first 2000 observations in the test data set.

    adultdata = rmmissing(adultdata);
    adulttest = rmmissing(adulttest);
    Xtrain = adultdata(1:1000,:);
    Xstream = adulttest(1:2000,:);

    Train RRCF Model

    Fit an RRCF model to the training data. Specify an anomaly contamination fraction of 0.001.

    rng(0,"twister"); % For reproducibility
    TTforest = rrcforest(Xtrain,ContaminationFraction=0.001);
    details(TTforest)
      RobustRandomCutForest with properties:
    
            CollusiveDisplacement: 'maximal'
                      NumLearners: 100
        NumObservationsPerLearner: 256
                               Mu: []
                            Sigma: []
            CategoricalPredictors: [2 4 6 7 8 9 10 14 15]
            ContaminationFraction: 1.0000e-03
                   ScoreThreshold: 55.5745
                   PredictorNames: {'age'  'workClass'  'fnlwgt'  'education'  'education_num'  'marital_status'  'occupation'  'relationship'  'race'  'sex'  'capital_gain'  'capital_loss'  'hours_per_week'  'native_country'  'salary'}
    

    TTforest is a RobustRandomCutForest model object representing a traditionally trained RRCF model. The software identifies nine variables in the data as categorical predictors because they contain string arrays.

    Convert Trained Model

    Convert the traditionally trained RRCF model to an RRCF model for incremental learning.

    Incrementalforest = incrementalLearner(TTforest);

    Incrementalforest is an incrementalRobustRandomCutForest model object that is ready for incremental learning and anomaly detection.

    Fit Incremental Model and Detect Anomalies

    Perform incremental learning on the Xstream data by using the fit function. To simulate a data stream, fit the model in chunks of 100 observations at a time. At each iteration:

    • Process 100 observations.

    • Overwrite the previous incremental model with a new one fitted to the incoming observations.

    • Store medianscore, the median score value of the data chunk, to see how it evolves during incremental learning.

    • Store threshold, the score threshold value for anomalies, to see how it evolves during incremental learning.

    • Store numAnom, the number of detected anomalies in the chunk, to see how it evolves during incremental learning.

    n = numel(Xstream(:,1));
    numObsPerChunk = 100;
    nchunk = floor(n/numObsPerChunk);
    medianscore = zeros(nchunk,1);
    numAnom = zeros(nchunk,1);
    threshold = zeros(nchunk,1);
    
    % Incremental fitting
    for j = 1:nchunk
        ibegin = min(n,numObsPerChunk*(j-1) + 1);
        iend = min(n,numObsPerChunk*j);
        idx = ibegin:iend;    
        [Incrementalforest,tf,scores] = fit(Incrementalforest,Xstream(idx,:));
        medianscore(j) = median(scores);
        numAnom(j) = sum(tf);
        threshold(j) = Incrementalforest.ScoreThreshold;
    end

    Analyze Incremental Model During Training

    To see how the median score, score threshold, and number of detected anomalies per chunk evolve during training, plot them on separate tiles.

    tiledlayout(3,1);
    nexttile
    plot(medianscore)
    ylabel("Median Score")
    xlabel("Iteration")
    xlim([0 nchunk])
    nexttile
    plot(threshold)
    ylabel("Score Threshold")
    xlabel("Iteration")
    xlim([0 nchunk])
    nexttile
    plot(numAnom,"+")
    ylabel("Anomalies")
    xlabel("Iteration")
    xlim([0 nchunk])
    ylim([0 max(numAnom)+0.2])

    Figure contains 3 axes objects. Axes object 1 with xlabel Iteration, ylabel Median Score contains an object of type line. Axes object 2 with xlabel Iteration, ylabel Score Threshold contains an object of type line. Axes object 3 with xlabel Iteration, ylabel Anomalies contains a line object which displays its values using only markers.

    totalanomalies=sum(numAnom)
    totalanomalies = 
    1
    
    anomfrac= totalanomalies/n
    anomfrac = 
    5.0000e-04
    

    fit updates the model and returns the observation scores and the indices of observations with scores above the score threshold value as anomalies. A high score value indicates an anomaly, and a low value indicates a normal observation. The median score fluctuates between approximately 230 and 270. The score threshold rises from a value of 260 after the first iteration and steadily approaches 285 after 12 iterations. The software detected 1 anomaly in the Xstream data, yielding a total contamination fraction of 0.0005.

    Train a robust random cut forest (RRCF) model on a simulated, noisy, periodic shingled time series containing no anomalies by using rrcforest. Convert the trained model to an incremental learner object, and then incrementally fit the time series and detect anomalies.

    Create Simulated Data Stream

    Create a simulated data stream of observations representing a noisy sinusoid signal.

    rng(0,"twister"); % For reproducibility
    period = 100;
    n = 2001+period;
    sigma = 0.04;
    a = linspace(1,n,n)';
    b = sin(2*pi*(a-1)/period)+sigma*randn(n,1);

    Introduce an anomalous region into the data stream. Plot the data stream portion that contains the anomalous region, and circle the anomalous data points.

    c = 2*(sin(2*pi*(a-35)/period)+sigma*randn(n,1));
    b(1150:1170) = c(1150:1170);
    scatter(a,b,".")
    xlim([900,1200])
    xlabel("Observation")
    hold on
    scatter(a(1150:1170),b(1150:1170),"r")
    hold off

    Figure contains an axes object. The axes object with xlabel Observation contains 2 objects of type scatter.

    Convert the single-featured data set b into a multi-featured data set by shingling [1] with a shingle size equal to the period of the signal. The ith shingled observation is a vector of k features with values b_i, b_{i+1}, ..., b_{i+k-1}, where k is the shingle size.

    X = [];
    shingleSize = period;
    for i = 1:n-shingleSize
        X = [X;b(i:i+shingleSize-1)'];
    end
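The loop above grows X one row at a time. As a hedged alternative, the same shingled matrix can be built in one step with an index matrix. The sketch below uses only base MATLAB and a noise-free stand-in for the signal b, so it runs on its own; the variable names numShingles and idx are introduced here for illustration.

```matlab
% Assumed inputs, mirroring the example (noise omitted for brevity)
period = 100;
n = 2001 + period;
b = sin(2*pi*((1:n)' - 1)/period);

shingleSize = period;
numShingles = n - shingleSize;    % Same number of rows as the loop produces

% Row i of idx holds the indices i, i+1, ..., i+shingleSize-1
idx = (1:numShingles)' + (0:shingleSize-1);

% Indexing a vector with a matrix returns an array with the shape of idx
X = b(idx);
size(X)
```

Because the index matrix is allocated once, this form avoids repeatedly resizing X inside the loop.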

    Train Model and Perform Incremental Anomaly Detection

    Fit a robust random cut forest model to the first 1000 shingled observations, specifying a contamination fraction of 0. Convert the model to an incrementalRobustRandomCutForest model object. Specify to keep the 100 most recent observations relevant for anomaly detection.

    Mdl = rrcforest(X(1:1000,:),ContaminationFraction=0);
    IncrementalMdl = incrementalLearner(Mdl,NumObservationsToKeep=100);

    To simulate a data stream, process the full shingled data set in chunks of 100 observations at a time. At each iteration:

    • Process 100 observations.

    • Calculate scores and detect anomalies using the isanomaly function.

    • Store anomIdx, the indices of shingled observations marked as anomalies.

    • If the chunk contains fewer than three anomalies, fit and update the previous incremental model.

    n = numel(X(:,1));
    numObsPerChunk = 100;
    nchunk = floor(n/numObsPerChunk);
    anomIdx = [];
    allscores = [];
    
    % Incremental fitting
    rng("default"); % For reproducibility
    for j = 1:nchunk
        ibegin = min(n,numObsPerChunk*(j-1) + 1);
        iend = min(n,numObsPerChunk*j);
        idx = ibegin:iend;
        [isanom,scores] = isanomaly(IncrementalMdl,X(idx,:));
        allscores = [allscores;scores];
        anomIdx = [anomIdx;find(isanom)+ibegin-1];
        if (sum(isanom) < 3)
            IncrementalMdl = fit(IncrementalMdl,X(idx,:));
        end
    end
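The chunk bookkeeping in the loop above can be checked in isolation. The sketch below, in base MATLAB only, tabulates the chunk boundaries for the 2001 shingled observations and uses ceil instead of floor to show where the final partial chunk would fall; chunkBounds is a name introduced here for illustration.

```matlab
n = 2001;                           % Number of shingled observations in this example
numObsPerChunk = 100;
nchunk = ceil(n/numObsPerChunk);    % ceil keeps the final partial chunk; floor drops it

chunkBounds = zeros(nchunk,2);
for j = 1:nchunk
    ibegin = numObsPerChunk*(j-1) + 1;
    iend = min(n,numObsPerChunk*j); % Clip the last chunk to the data length
    chunkBounds(j,:) = [ibegin iend];
end
chunkBounds([1 end],:)
```

With floor, as in the example, the loop processes 20 full chunks and leaves shingle 2001 unprocessed, which is why the plotting code that follows uses a(1:2000).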

    Analyze Incremental Model During Training

    At each iteration, the software calculates a score value for each observation in the data chunk. A low score value indicates a normal observation, and a high value indicates an anomaly. Plot the anomaly score for the observations in the vicinity of the anomaly. Circle the scores of shingles that the software returns as anomalous.

    figure
    scatter(a(1:2000),allscores,".")
    hold on
    scatter(a(anomIdx),allscores(anomIdx),20,"or")
    xlim([900,1200])
    xlabel("Shingle")
    ylabel("Score")
    hold off

    Figure contains an axes object. The axes object with xlabel Shingle, ylabel Score contains 2 objects of type scatter.

    Because the introduced anomalous region begins at observation 1150, and the shingle size is 100, shingle 1051 is the first to show a high anomaly score. Some shingles between 1050 and 1170 have scores lying just below the anomaly score threshold, due to the noise in the sinusoidal signal. The shingle size affects the performance of the model by defining how many subsequent consecutive data points in the original time series the software uses to calculate the anomaly score for each shingle.
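The shingle arithmetic in the preceding paragraph can be verified with a few lines of base MATLAB. The variable names below are introduced for illustration only.

```matlab
shingleSize = 100;
anomStart = 1150;    % First observation of the introduced anomalous region
anomEnd = 1170;      % Last observation of the introduced anomalous region

% Shingle i spans observations i through i+shingleSize-1, so it contains an
% anomalous point when i <= anomEnd and i+shingleSize-1 >= anomStart
firstShingle = anomStart - shingleSize + 1;
lastShingle = anomEnd;
numAffected = lastShingle - firstShingle + 1;

fprintf("Shingles %d through %d (%d in total) overlap the anomalous region.\n", ...
    firstShingle,lastShingle,numAffected)
```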

    Plot the unshingled data and highlight the introduced anomalous region. Circle the observation number of the first element in each shingle that the software returns as anomalous.

    figure
    xlim([900,1200])
    ylim([-1.5 2])
    rectangle(Position=[1150 -1.5 20 3.5],FaceColor=[0.9 0.9 0.9], ...
        EdgeColor=[0.9 0.9 0.9])
    hold on
    scatter(a,b,".")
    scatter(a(anomIdx),b(anomIdx),20,"or")
    xlabel("Observation")
    hold off

    Figure contains an axes object. The axes object with xlabel Observation contains 3 objects of type rectangle, scatter.

    Input Arguments

    forest

    Traditionally trained RRCF model for anomaly detection, specified as a RobustRandomCutForest model object returned by rrcforest.

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: incrementalLearner(forest,ObservationRemoval="timedecaying",ScoreWarmupPeriod=500) sets the observation removal method to "timedecaying" and specifies to process 500 observations before the incremental fit function returns scores and detects anomalies.

    NumObservationsToKeep

    Number of the most recent observations relevant for anomaly detection, specified as a nonnegative integer.

    Example: NumObservationsToKeep=250

    Data Types: single | double

    ObservationRemoval

    Observation removal method, specified as "oldest", "timedecaying", or "random". When the robust random cut trees reach their capacity, the software removes old observations to accommodate the most recent data.

    Value | Description
    "oldest" | Oldest observations are removed first.
    "timedecaying" | Observations are removed randomly in a weighted fashion. Older observations have a higher probability of being removed first.
    "random" | Observations are removed in random order.

    Data Types: string | char

    Options

    Options for computing in parallel and setting random streams, specified as a structure. Create the Options structure using statset. This table lists the option fields and their values.

    Field Name | Value | Default
    UseParallel | Set this value to true to run computations in parallel. | false
    UseSubstreams | Set this value to true to run computations in a reproducible manner. To compute reproducibly, set Streams to a type that allows substreams: "mlfg6331_64" or "mrg32k3a". | false
    Streams | Specify this value as a RandStream object or cell array of such objects. Use a single object except when the UseParallel value is true and the UseSubstreams value is false. In that case, use a cell array that has the same size as the parallel pool. | If you do not specify Streams, then incrementalLearner uses the default stream or streams.

    Note

    You need Parallel Computing Toolbox™ to run computations in parallel.

    Example: Options=statset(UseParallel=true,UseSubstreams=true,Streams=RandStream("mlfg6331_64"))

    Data Types: struct

    ScoreWarmupPeriod

    Warm-up period before score computation and anomaly detection, specified as a nonnegative integer. This option specifies the number of observations used by the incremental fit function to train the model and estimate the score threshold.

    Note

    When processing observations during the score warm-up period, the software ignores observations that contain missing values for all predictors.

    Example: ScoreWarmupPeriod=200

    Data Types: single | double

    ScoreWindowSize

    Running window size used to estimate the score threshold (ScoreThreshold), specified as a positive integer. The default ScoreWindowSize value is 1000.

    If ScoreWindowSize is greater than the number of observations in the training data, the software determines ScoreThreshold by subsampling from the training data. Otherwise, ScoreThreshold is set to forest.ScoreThreshold.
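The rule above can be illustrated with a small sketch, assuming Statistics and Machine Learning Toolbox R2023b or later. The synthetic data X and the variable names IncA and IncB are hypothetical; the comments restate the behavior described above rather than verified output.

```matlab
rng(0,"twister")          % For reproducibility
X = randn(500,3);         % Hypothetical training data: 500 observations
forest = rrcforest(X,ContaminationFraction=0.01);

% ScoreWindowSize (100) does not exceed the number of training observations (500),
% so, per the description above, the threshold is taken from forest.ScoreThreshold
IncA = incrementalLearner(forest,ScoreWindowSize=100);

% The default ScoreWindowSize (1000) exceeds 500, so the threshold is estimated
% by subsampling and can differ from forest.ScoreThreshold
IncB = incrementalLearner(forest);

[forest.ScoreThreshold IncA.ScoreThreshold IncB.ScoreThreshold]
```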

    Example: ScoreWindowSize=100

    Data Types: single | double

    Output Arguments

    IncrementalForest

    RRCF model for incremental anomaly detection, returned as an incrementalRobustRandomCutForest model object.

    To initialize IncrementalForest for incremental anomaly detection, incrementalLearner passes the values of the following properties of forest to the corresponding properties of IncrementalForest.

    Property | Description
    CategoricalPredictors | Categorical predictor indices, a vector of positive integers
    ContaminationFraction | Fraction of anomalies in the training data, a numeric scalar in the range [0,1]
    Mu | Predictor means of the training data, a numeric vector
    NumLearners | Number of robust random cut trees, a positive integer scalar
    NumObservationsPerLearner | Number of observations for each robust random cut tree, a nonnegative integer
    PredictorNames | Predictor variable names, a cell array of character vectors
    ScoreThreshold | Threshold score for anomalies in the training data, a numeric scalar in the range [0,Inf). If ScoreWindowSize is greater than the number of observations used to train forest, then incrementalLearner approximates ScoreThreshold by subsampling from the training data. Otherwise, incrementalLearner passes forest.ScoreThreshold to IncrementalForest.ScoreThreshold.
    Sigma | Predictor standard deviations of the training data, a numeric vector

    More About


    Incremental Learning for Anomaly Detection

    Incremental learning, or online learning, is a branch of machine learning concerned with processing incoming data from a data stream, possibly given little to no knowledge of the distribution of the predictor variables, aspects of the prediction or objective function (including tuning parameter values), or whether the observations contain anomalies. Incremental learning differs from traditional machine learning, where enough data is available to fit to a model, perform cross-validation to tune hyperparameters, and infer the predictor distribution.

    Anomaly detection is used to identify unexpected events and departures from normal behavior. In situations where the full data set is not immediately available, or new data is arriving, you can use incremental learning for anomaly detection to incrementally train a model so it adjusts to the characteristics of the incoming data.

    Given incoming observations, an incremental learning model for anomaly detection does the following:

    • Computes anomaly scores

    • Updates the anomaly score threshold

    • Detects data points above the score threshold as anomalies

    • Fits the model to the incoming observations

    For more information, see Incremental Anomaly Detection with MATLAB.

    References

    [1] Guha, Sudipto, N. Mishra, G. Roy, and O. Schrijvers. "Robust Random Cut Forest Based Anomaly Detection on Streams," Proceedings of The 33rd International Conference on Machine Learning 48 (June 2016): 2712–21.

    [2] Bartos, Matthew D., A. Mullapudi, and S. C. Troutman. "rrcf: Implementation of the Robust Random Cut Forest Algorithm for Anomaly Detection on Streams." Journal of Open Source Software 4, no. 35 (2019): 1336.


    Version History

    Introduced in R2023b