
splitAnomalyData

Split data into training, validation and testing sets for anomaly detection

Since R2023a

    Description

    [dsTrain,dsVal,dsTest] = splitAnomalyData(ds,anomalyLabels) automatically splits the input image datastore, ds, into three datastores for training, validation, and testing of anomaly detection networks. anomalyLabels indicates which class labels of the images in ds belong to the anomaly (positive) class. By default, 70% of the total input data is used for training, 10% is used for validation, and 20% is used for testing. By default, the training datastore does not include anomaly images.

    Note

    This functionality requires the Automated Visual Inspection Library for Computer Vision Toolbox™. You can install the Automated Visual Inspection Library for Computer Vision Toolbox from Add-On Explorer. For more information about installing add-ons, see Get and Manage Add-Ons.


    [dsTrain,trainLabels,dsVal,valLabels,dsTest,testLabels] = splitAnomalyData(ds,gtLabels,anomalyLabels) automatically splits the input datastore, ds, into three datastores and their labels for training, validation, and testing of anomaly detection networks. gtLabels represents the ground-truth label of each corresponding image in the datastore, and anomalyLabels determines which labels in gtLabels belong to the anomaly class.
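
    The six-output syntax can be sketched as follows. This is a minimal illustration, assuming the Automated Visual Inspection Library is installed and reusing the digits data set from the Examples section. Taking gtLabels from the Labels property of the image datastore is a choice made for this sketch; any label vector with one entry per image would serve.

```matlab
% Load the digits images; folder names become the class labels.
dataDir = fullfile(toolboxdir("vision"),"visiondata","digits","synthetic");
ds = imageDatastore(dataDir,IncludeSubfolders=true, ...
    LabelSource="foldernames");

% Ground truth labels: one categorical label per image (assumption for
% this sketch: reuse the labels already stored in the datastore).
gtLabels = ds.Labels;

% Treat every digit except 8 as anomalous.
anomalyLabels = ["0","1","2","3","4","5","6","7","9"];

% Request the label vectors alongside each datastore.
[dsTrain,trainLabels,dsVal,valLabels,dsTest,testLabels] = ...
    splitAnomalyData(ds,gtLabels,anomalyLabels);
```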

    [___] = splitAnomalyData(___,Name=Value) uses name-value arguments to perform custom proportioning of the images. You can specify either the proportion of images in each datastore or the proportion of anomaly and normal images.

    If you specify name-value arguments for both proportioning strategies, then splitAnomalyData uses the arguments that specify the proportion of images in each datastore. In this case, the function ignores the arguments that specify the proportion of anomaly and normal images.
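
    The two proportioning strategies can be sketched as follows. This is a minimal illustration, assuming the Automated Visual Inspection Library is installed and reusing the digits data set from the Examples section. The ratio values are arbitrary, and the output names ending in 2 exist only to keep the two calls apart.

```matlab
% Load the digits images; folder names become the class labels.
dataDir = fullfile(toolboxdir("vision"),"visiondata","digits","synthetic");
ds = imageDatastore(dataDir,IncludeSubfolders=true, ...
    LabelSource="foldernames");
anomalyLabels = ["0","1","2","3","4","5","6","7","9"];

% Strategy 1: specify the proportion of images in each datastore.
[dsTrain,dsVal,dsTest] = splitAnomalyData(ds,anomalyLabels, ...
    DataAllocationRatio=[0.8 0.1 0.1]);

% Strategy 2: specify separate proportions for normal and anomaly images.
% Here no anomaly images go to training (illustrative values only).
[dsTrain2,dsVal2,dsTest2] = splitAnomalyData(ds,anomalyLabels, ...
    NormalLabelsRatio=[0.8 0.1 0.1], ...
    AnomalyLabelsRatio=[0 0.5 0.5]);
```

    Keeping the two strategies in separate calls avoids the case described above, in which the function ignores the normal and anomaly ratios.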

    Examples


    Load a data set that consists of images of digits from 0 to 9.

    dataDir = fullfile(toolboxdir("vision"),"visiondata","digits","synthetic");
    ds = imageDatastore(dataDir,IncludeSubfolders=true, ...
        LabelSource="foldernames");

    Specify the digits that count as anomalous. For instance, consider images of the digit 8 to be normal, and all other digits to be anomalous.

    anomalyLabels = ["0","1","2","3","4","5","6","7","9"];

    Split the data set into training, validation, and testing datastores.

    [dsTrain,dsVal,dsTest] = splitAnomalyData(ds,anomalyLabels);
    Splitting anomaly dataset
    -------------------------
    * Finalizing... Done.
    * Number of files and proportions per class in all the datasets:
    
                   Input                Train               Validation                 Test        
             _________________    _________________    ____________________    ____________________
    
             NumFiles    Ratio    NumFiles    Ratio    NumFiles     Ratio      NumFiles     Ratio  
             ________    _____    ________    _____    ________    ________    ________    ________
                                                                                                   
        0      101        0.1         0         0         34        0.10863       67        0.10686
        1      101        0.1         0         0         34        0.10863       67        0.10686
        2      101        0.1         0         0         34        0.10863       67        0.10686
        3      101        0.1         0         0         34        0.10863       67        0.10686
        4      101        0.1         0         0         33        0.10543       68        0.10845
        5      101        0.1         0         0         34        0.10863       67        0.10686
        6      101        0.1         0         0         33        0.10543       68        0.10845
        7      101        0.1         0         0         34        0.10863       67        0.10686
        8      101        0.1        70         1         10       0.031949       21       0.033493
        9      101        0.1         0         0         33        0.10543       68        0.10845
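
    One way to cross-check the table is to count the files in each output datastore. The sketch below assumes the Automated Visual Inspection Library is installed and relies on the outputs being image datastores, which expose a Files property. The counts should correspond to the column totals of NumFiles in the table.

```matlab
% Repeat the example setup so that this sketch runs on its own.
dataDir = fullfile(toolboxdir("vision"),"visiondata","digits","synthetic");
ds = imageDatastore(dataDir,IncludeSubfolders=true, ...
    LabelSource="foldernames");
anomalyLabels = ["0","1","2","3","4","5","6","7","9"];
[dsTrain,dsVal,dsTest] = splitAnomalyData(ds,anomalyLabels);

% Files lists one path per image, so its length is the image count.
numTrain = numel(dsTrain.Files);
numVal = numel(dsVal.Files);
numTest = numel(dsTest.Files);
fprintf("train: %d, validation: %d, test: %d\n",numTrain,numVal,numTest)
```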
    

    Input Arguments


    Data set of normal and anomaly images, ds, specified as a datastore. If you do not specify the gtLabels argument, then ds must be an image datastore containing labeled images.

    Ground truth labels for each image, gtLabels, specified as a numeric vector, logical vector, or categorical vector. The splitAnomalyData function converts the labels into a logical vector according to the set of anomaly labels in anomalyLabels.

    Anomaly labels, anomalyLabels, specified as a vector of the same data type as gtLabels. When gtLabels is categorical, anomalyLabels can also be a string array whose values correspond to categories in gtLabels.

    The splitAnomalyData function converts all ground truth labels in gtLabels that belong to the set of anomaly labels to a logical true, indicating an anomaly (positive detection). The function converts all other ground truth labels to a logical false, indicating normality (negative detection).
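
    The conversion described above can be pictured with base MATLAB. This is only a sketch of the equivalent logic using ismember, not the internal implementation of splitAnomalyData, and the five sample labels are made up for illustration.

```matlab
% Hypothetical ground truth labels for five images.
gtLabels = categorical(["8";"3";"8";"0";"9"]);

% Every digit except 8 counts as an anomaly.
anomalyLabels = ["0","1","2","3","4","5","6","7","9"];

% true marks an anomaly (positive), false marks a normal image (negative).
isAnomaly = ismember(string(gtLabels),anomalyLabels);
disp(isAnomaly')
```

    In this sketch the two images labeled 8 map to false and the other three map to true.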

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: [dsTrain,dsVal,dsTest] = splitAnomalyData(ds,anomalyLabels,DataAllocationRatio=[0.8 0.1 0.1]) assigns 80% of the total input data for training, 10% for validation, and 10% for testing.

    Specify Proportion of Images in Each Datastore


    DataAllocationRatio sets the proportion of input data to include in the training, validation, and testing datastores, respectively, specified as a 1-by-3 numeric vector. The function ignores class labels when splitting the data. The elements of the vector must sum to a number between 0 and 1. If the elements sum to a number less than 1, then the splitAnomalyData function does not allocate the remaining data to a training, validation, or testing datastore.

    KeepAnomalyLabelsInTrainingDatastore controls whether to include anomaly images in the training datastore, specified as a numeric or logical false (0) or true (1).

    KeepUnusedAnomalyLabels controls whether to include all anomaly images in the validation and testing datastores, specified as a numeric or logical true (1) or false (0). This argument applies only when the value of KeepAnomalyLabelsInTrainingDatastore is false.

    By default, KeepUnusedAnomalyLabels is true and the splitAnomalyData function distributes all anomaly images to the validation and testing datastores. Therefore, the validation datastore dsVal and testing datastore dsTest have a higher proportion of anomaly to normal labels than the input datastore ds.

    When KeepUnusedAnomalyLabels is false, the splitAnomalyData function omits some anomaly files from dsVal and dsTest. As a result, dsVal and dsTest have the same proportion of anomaly to normal images as ds. Specify KeepUnusedAnomalyLabels as false for stratified partitions.
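
    The effect of the default setting can be checked against the output table in the Examples section, where digit 8 is the only normal class. The following base MATLAB sketch copies the file counts from that table and compares the anomaly share of the validation datastore with that of the input. The variable names are illustrative.

```matlab
% Counts copied from the example output table (class 8 is normal).
valAnomaly = 6*34 + 3*33;   % anomaly files in dsVal: six classes of 34, three of 33
valNormal = 10;             % normal files in dsVal
inputAnomaly = 9*101;       % anomaly files in ds
inputNormal = 101;          % normal files in ds

% Fraction of anomaly images in each datastore.
valShare = valAnomaly/(valAnomaly + valNormal);
inputShare = inputAnomaly/(inputAnomaly + inputNormal);
fprintf("validation: %.4f, input: %.4f\n",valShare,inputShare)
```

    The validation share (303 of 313 files) exceeds the input share (909 of 1010 files), which is the higher anomaly proportion that the default setting produces.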

    Specify Proportion of Normal and Anomaly Images


    NormalLabelsRatio sets the ratio of files with normal class labels to include in the training, validation, and testing datastores, specified as a 1-by-3 numeric vector. The elements of the vector must sum to a number between 0 and 1. If the elements sum to a number less than 1, then the splitAnomalyData function does not allocate the remaining normal data to a training, validation, or testing datastore.

    Example: NormalLabelsRatio=[0.8 0.1 0.1] assigns 80% of the normal data for training, 10% for validation, and 10% for testing.

    AnomalyLabelsRatio sets the ratio of files with anomaly class labels to include in the training, validation, and testing datastores, specified as a 1-by-3 numeric vector. The elements of the vector must sum to a number between 0 and 1. If the elements sum to a number less than 1, then the splitAnomalyData function does not allocate the remaining anomaly data to a training, validation, or testing datastore.

    Example: AnomalyLabelsRatio=[0.1 0.1 0.5] assigns 10% of the anomaly data for training, 10% for validation, and 50% for testing, and omits the remaining 30% of the anomaly data.
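
    The arithmetic behind a ratio vector that sums to less than 1 can be sketched in base MATLAB. The count of 909 anomaly images is taken from the digits example (nine anomaly classes of 101 files each). The counts below are approximate, because how splitAnomalyData rounds fractional counts is not stated here.

```matlab
% Number of anomaly images in the digits example: 9 classes x 101 files.
numAnomaly = 9*101;

% Ratios for the training, validation, and testing datastores.
ratio = [0.1 0.1 0.5];

% Approximate number of anomaly images per datastore (before any rounding).
approxCounts = ratio*numAnomaly;

% Fraction of anomaly images left out of all three datastores.
unusedFraction = 1 - sum(ratio);
fprintf("unused fraction of anomaly data: %.2f\n",unusedFraction)
```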

    Other Partitioning Options


    Display statistics of the partitioned data for each class, specified as a numeric or logical true (1) or false (0).

    Output Arguments


    Training datastore, dsTrain, returned as a datastore of the same type as ds.

    Validation datastore, dsVal, returned as a datastore of the same type as ds.

    Testing datastore, dsTest, returned as a datastore of the same type as ds.

    Labels of data in the training datastore, trainLabels, returned as a numeric vector, logical vector, or categorical vector.

    Labels of data in the validation datastore, valLabels, returned as a numeric vector, logical vector, or categorical vector.

    Labels of data in the testing datastore, testLabels, returned as a numeric vector, logical vector, or categorical vector.

    Version History

    Introduced in R2023a