binningTabularSynthesizer

Binning-based synthesizer for tabular data synthesis

Since R2024b

Description

To generate synthetic data, you can first create a binningTabularSynthesizer object using an existing multivariate data set. The object uses binning techniques to learn the distribution of the data set. Then, use the synthesizeTabularData object function to synthesize data using the object. After you synthesize data, you can test whether the new data set comes from the same distribution as the original data set. Use the mmdtest function to determine how close the data distributions are to each other.

Creation

Syntax

synthesizer = binningTabularSynthesizer(X)

synthesizer = binningTabularSynthesizer(X,Name=Value)

Description

synthesizer = binningTabularSynthesizer(X) creates a binning-based synthesizer object (synthesizer) using the existing data X.

synthesizer = binningTabularSynthesizer(X,Name=Value) specifies additional options using one or more name-value arguments. For example, you can specify the bin method and the variables to use.

Input Arguments

`X` — Existing data set
numeric matrix | table

Existing data set, specified as a numeric matrix or a table. Rows of X correspond to observations, and columns of X correspond to variables. Multicolumn variables and cell arrays other than cell arrays of character vectors are not allowed in X.

Data Types: single | double | table

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: binningTabularSynthesizer(X,BinMethod="equiprobable",NumBins=10) specifies to use 10 equiprobable bins for each variable in X.

`BinMethod` — Binning algorithm
`"auto"` (default) | `"equal-width"` | `"equiprobable"` | `"dagostino-stephens"` | `"freedman-diaconis"` | `"scott"` | ...

Binning algorithm, specified as one of the values in this table.

Value	Description
`"auto"`	`"auto"` corresponds to: `"dagostino-stephens"` when the existing data `X` contains nonfinite values and you do not specify the `NumBins` name-value argument; `"equiprobable"` when the existing data `X` contains nonfinite values and you specify the `NumBins` name-value argument; `"equal-width"` when the existing data `X` contains finite values only and you specify the `NumBins` name-value argument; `"terrell-scott"` otherwise.
`"equal-width"`	Equal-width binning, where you must specify the number of bins using the `NumBins` name-value argument
`"equiprobable"`	Equiprobable binning, where you must specify the number of bins using the `NumBins` name-value argument
`"dagostino-stephens"` or `"ds"`	Equiprobable binning with `ceil(2*m^(2/5))` bins, where `m` is the number of observations in the existing data
`"freedman-diaconis"` or `"fd"`	Equal-width binning, where each bin for variable `k` has a width of `ceil(2*iqr(X(:,k))*m^(-1/3))`. Here, `X` is the existing data set and `m` is the number of observations in variable `k` (that is, in `X(:,k)`).
`"scott"`	Equal-width binning, where each bin for variable `k` has a width of `ceil(3.5*std(X(:,k))*m^(-1/3))`. Here, `X` is the existing data set and `m` is the number of observations in variable `k` (that is, in `X(:,k)`).
`"scott-multivariate"`	Equal-width binning, where each bin for variable `k` has a width of `3.5*std(X(:,k))*m^(-1/(2+d))`. Here, `X` is the existing data set, `m` is the number of observations in variable `k` (that is, in `X(:,k)`), and `d` is the number of variables in `X`.
`"terrell-iqr"`	Equal-width binning, where each bin for variable `k` has a width of `2.603*iqr(X(:,k))*m^(-1/3)`. Here, `X` is the existing data set and `m` is the number of observations in variable `k` (that is, in `X(:,k)`).
`"terrell-scott"` or `"ts"`	Equal-width binning with `ceil((2*m)^(1/3))` bins, where `m` is the number of observations in the existing data
`"terrell-std"`	Equal-width binning, where each bin for variable `k` has a width of `3.729*std(X(:,k))*m^(-1/3)`. Here, `X` is the existing data set and `m` is the number of observations in variable `k` (that is, in `X(:,k)`).

Example: BinMethod="scott"

Data Types: char | string
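
The rules in the table can be evaluated by hand to see what they produce for a given sample. The following is a minimal sketch that only evaluates some of the documented formulas; it is not the function's internal implementation. The sample `x`, the observation count `m`, and the variable count `d` are hypothetical values chosen for illustration.

```matlab
% Hypothetical sample: one continuous variable with m observations
rng(0,"twister")
x = randn(200,1);
m = numel(x);
d = 4;   % assumed number of variables in the full data set

% Rules that fix the number of bins
numBinsDS = ceil(2*m^(2/5));     % "dagostino-stephens"
numBinsTS = ceil((2*m)^(1/3));   % "terrell-scott"

% Rules that fix the bin width
widthScottMV    = 3.5*std(x)*m^(-1/(2+d));   % "scott-multivariate"
widthTerrellIQR = 2.603*iqr(x)*m^(-1/3);     % "terrell-iqr"
widthTerrellStd = 3.729*std(x)*m^(-1/3);     % "terrell-std"
```

The count-based rules depend only on `m`, whereas the width-based rules also depend on the spread of each variable, so different variables in the same data set can end up with different numbers of bins.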

`NumBins` — Number of bins to use for continuous variables
`[]` (default) | positive integer scalar | positive integer vector

Number of bins to use for continuous variables, specified as a positive integer scalar or vector.

- If NumBins is a scalar, then the function uses the same number of bins for each continuous variable.
- If NumBins is a vector, then the function uses NumBins(k) bins for continuous variable k.

Specify this value only when BinMethod is "auto", "equal-width", or "equiprobable".

Example: NumBins=[10 25 10 15]

Data Types: single | double
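
To see how the two binning schemes that accept NumBins differ, you can compute both sets of edges for a skewed sample. This is an illustrative sketch only: building the edges with `linspace` and `quantile` is an assumption made for the example and is not necessarily how the function computes its bin edges. The sample `x` is hypothetical.

```matlab
rng(0,"twister")
x = -log(rand(500,1));   % hypothetical right-skewed sample
numBins = 10;

% Equal-width: split [min(x), max(x)] into intervals of identical length
edgesEqualWidth = linspace(min(x),max(x),numBins+1);

% Equiprobable: place edges at evenly spaced quantiles, so that each bin
% holds about the same number of observations
edgesEquiprobable = quantile(x,linspace(0,1,numBins+1));

% Count the observations that fall in each bin
countsEqualWidth   = histcounts(x,edgesEqualWidth);
countsEquiprobable = histcounts(x,edgesEquiprobable);
```

For skewed data, the equal-width counts are concentrated in the first few bins, while the equiprobable counts are nearly uniform.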

`VariableNames` — Variable names
string array | cell array of character vectors

Variable names, specified as a string array or a cell array of character vectors.

- If X is a numeric matrix, then you can use VariableNames to assign names to the variables in X.
  - The order of the names in VariableNames must correspond to the order of the variables in X. That is, VariableNames{1} is the name of X(:,1), VariableNames{2} is the name of X(:,2), and so on. size(X,2) and numel(VariableNames) must be equal.
  - By default, VariableNames is {'x1','x2',...}.
- If X is a table, then you can use VariableNames to choose which variables to use. That is, binningTabularSynthesizer uses only the variables in VariableNames to generate synthetic data.
  - VariableNames must be a subset of X.Properties.VariableNames.
  - By default, VariableNames contains the names of all variables.

Example: VariableNames=["SepalLength","SepalWidth","PetalLength","PetalWidth"]

Data Types: string | cell
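
The two roles of VariableNames can be sketched as follows. The data and the names `Height`, `Width`, and `Depth` are hypothetical; the calls rely only on the behavior described above.

```matlab
rng(0,"twister")
X = rand(50,3);   % hypothetical numeric data with three variables

% Numeric matrix: assign a name to each column of X
sMatrix = binningTabularSynthesizer(X, ...
    VariableNames=["Height","Width","Depth"]);

% Table: use only a subset of the table variables
tbl = array2table(X,VariableNames=["Height","Width","Depth"]);
sTable = binningTabularSynthesizer(tbl, ...
    VariableNames=["Height","Depth"]);

sMatrix.VariableNames   % names assigned to the matrix columns
sTable.VariableNames    % only the selected table variables
```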

`CategoricalVariables` — List of categorical variables
positive integer vector | logical vector | string array | cell array of character vectors | `"all"`

List of the categorical variables, specified as one of the values in this table.

Value	Description
Positive integer vector	Each entry in the vector is an index value indicating that the corresponding variable is categorical. The index values are between 1 and v, where v is the number of variables listed in `VariableNames`.
Logical vector	A `true` entry means that the corresponding variable is categorical. The length of the vector is v.
String array or cell array of character vectors	Each element in the array is the name of a categorical variable. The names must match the entries in `VariableNames`.
`"all"`	All variables are categorical.

By default, if the variables are in a numeric matrix, the software assumes all the variables are continuous. If the variables are in a table, the software assumes they are categorical if they are logical vectors, categorical vectors, character arrays, string arrays, or cell arrays of character vectors. To identify any other variables as categorical, specify them by using the CategoricalVariables name-value argument.

Do not specify discrete numeric variables as categorical variables. Use the DiscreteNumericVariables name-value argument instead.

Example: CategoricalVariables="all"

Data Types: single | double | logical | string | cell

`DiscreteNumericVariables` — List of discrete numeric variables
`[]` (default) | positive integer vector | logical vector | string array | cell array of character vectors | `"all"`

List of the discrete numeric variables, specified as one of the values in this table.

Value	Description
Positive integer vector	Each entry in the vector is an index value indicating that the corresponding variable is a discrete numeric variable. The index values are between 1 and v, where v is the number of variables listed in `VariableNames`.
Logical vector	A `true` entry means that the corresponding variable is a discrete numeric variable. The length of the vector is v.
String array or cell array of character vectors	Each element in the array is the name of a discrete numeric variable. The names must match the entries in `VariableNames`.
`"all"`	All variables are discrete numeric variables.

You cannot specify categorical variables as discrete numeric variables.

Example: DiscreteNumericVariables=[2 5]

Data Types: single | double | logical | string | cell
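
As a sketch of how the CategoricalVariables and DiscreteNumericVariables arguments work together, the following code marks one column of a numeric matrix as categorical (by index) and another as discrete numeric (by name). The data and variable names are hypothetical.

```matlab
rng(0,"twister")
n = 100;
% Hypothetical data: a continuous measurement, a group code, and a die roll
X = [randn(n,1), randi(3,n,1), randi([1 6],n,1)];

synthesizer = binningTabularSynthesizer(X, ...
    VariableNames=["Measurement","Group","Roll"], ...
    CategoricalVariables=2, ...            % specified by index
    DiscreteNumericVariables="Roll");      % specified by name

synthesizer.CategoricalVariables       % indices of the categorical variables
synthesizer.DiscreteNumericVariables   % indices of the discrete numeric variables
synthesizer.BinnedVariables            % indices of the binned continuous variables
```

Only the variables that are neither categorical nor discrete numeric are treated as continuous and binned.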

Properties

`VariableNames` — Variable names
string array

This property is read-only.

Variable names, specified as a string array. The order of the elements of VariableNames corresponds to the order in which the variable names appear in the existing data set X.

Data Types: string

`CategoricalVariables` — Indices of categorical variables
positive integer vector | `[]`

This property is read-only.

Indices of the categorical variables, specified as a positive integer vector. Each index value in CategoricalVariables indicates that the corresponding variable listed in VariableNames is categorical. If none of the variables are categorical, then this property is empty ([]).

Data Types: double

`DiscreteNumericVariables` — Indices of discrete numeric variables
positive integer vector | `[]`

This property is read-only.

Indices of the discrete numeric variables, specified as a positive integer vector. Each index value in DiscreteNumericVariables indicates that the corresponding variable listed in VariableNames is a discrete numeric variable. If none of the variables are discrete numeric variables, then this property is empty ([]).

Data Types: double

`BinnedVariables` — Indices of binned variables
positive integer vector | `[]`

This property is read-only.

Indices of the binned variables, specified as a positive integer vector. Each index value in BinnedVariables indicates that the corresponding variable listed in VariableNames is a binned variable. If none of the variables are binned, then this property is empty ([]).

Data Types: double

`BinMethod` — Binning algorithm used to bin continuous variables
`"equal-width"` | `"equiprobable"` | `"dagostino-stephens"` | `"freedman-diaconis"` | `"scott"` | ...

This property is read-only.

Binning algorithm used to bin the continuous variables indicated by BinnedVariables, specified as "equal-width", "equiprobable", "dagostino-stephens", "freedman-diaconis", "scott", "scott-multivariate", "terrell-iqr", "terrell-scott", or "terrell-std". For more information on these binning algorithms, see the BinMethod name-value argument.

If none of the variables are binned, then this property is empty.

Data Types: string

`NumBins` — Number of bins used to bin continuous variables
positive integer vector | `[]`

This property is read-only.

Number of bins used to bin the continuous variables indicated by BinnedVariables, specified as a positive integer vector. Element k in NumBins indicates the number of bins for continuous variable k. If none of the variables are binned, then this property is empty ([]).

Data Types: double

`BinEdges` — Bin edges used to bin continuous variables
cell array

This property is read-only.

Bin edges used to bin the continuous variables indicated by BinnedVariables, specified as a cell array. Element k in BinEdges contains the bin edges for continuous variable k. If none of the variables are binned, then this property is empty.

Data Types: cell

`NumObservations` — Number of observations
positive integer scalar

This property is read-only.

Number of observations in the existing data set X, specified as a positive integer scalar.

Data Types: double

Object Functions

`synthesizeTabularData`	Synthesize tabular data using binning-based synthesizer

Examples

Synthesize Data for Model Training

Use existing training data to create a binningTabularSynthesizer object. Then, synthesize data using the synthesizeTabularData object function. Train a model using the existing training data, and then train the same type of model using the synthetic data. Compare the performance of the two models using test data.

Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s. Create a table containing the predictor variables Acceleration, Displacement, and so on, as well as the response variable MPG.

load carbig
tbl = table(Acceleration,Cylinders,Displacement,Horsepower, ...
    Model_Year,Origin,MPG,Weight);

Remove rows of tbl where the table has missing values.

tbl = rmmissing(tbl);

Partition the data into training and test sets. Use approximately 60% of the observations for model training and synthesizing new data, and 40% of the observations for model testing. Use cvpartition to partition the data.

rng("default")
cv = cvpartition(size(tbl,1),"Holdout",0.4);
trainTbl = tbl(training(cv),:);
testTbl = tbl(test(cv),:);

Create a binningTabularSynthesizer object by using the trainTbl data set. The binningTabularSynthesizer function uses binning techniques to learn the distribution of the multivariate data set. Use 20 equal-width bins for each continuous variable. Specify the Cylinders and Model_Year variables as discrete numeric variables.

synthesizer = binningTabularSynthesizer(trainTbl, ...
    BinMethod="equal-width",NumBins=20, ...
    DiscreteNumericVariables=["Cylinders","Model_Year"])

synthesizer = 
  binningTabularSynthesizer

               VariableNames: ["Acceleration"    "Cylinders"    "Displacement"    "Horsepower"    "Model_Year"    "Origin"    "MPG"    "Weight"]
        CategoricalVariables: 6
    DiscreteNumericVariables: [2 5]
             BinnedVariables: [1 3 4 7 8]
                   BinMethod: "equal-width"
                     NumBins: [20 20 20 20 20]
                    BinEdges: {[21x1 double]  [21x1 double]  [21x1 double]  [21x1 double]  [21x1 double]}
             NumObservations: 236

synthesizer is a binningTabularSynthesizer object with five binned variables. Each binned variable has the same number of bins.

Synthesize new data by using synthesizer. Specify to generate 1000 observations.

syntheticTbl = synthesizeTabularData(synthesizer,1000);

The synthesizeTabularData object function uses the data distribution information stored in synthesizer to generate syntheticTbl.

To visualize the difference between the existing data and synthetic data, you can use the detectdrift function. The function uses permutation testing to detect drift between trainTbl and syntheticTbl.

dd = detectdrift(trainTbl,syntheticTbl);

dd is a DriftDiagnostics object with plotEmpiricalCDF and plotHistogram object functions for visualization.

For continuous variables, use the plotEmpiricalCDF function to see the difference between the empirical cumulative distribution function (ecdf) of the values in trainTbl and the ecdf of the values in syntheticTbl.

continuousVariable = "Displacement";
plotEmpiricalCDF(dd,Variable=continuousVariable)
legend(["Real Data","Synthetic Data"])

Figure: ECDF for Displacement (x-axis: Displacement, y-axis: Cumulative Probability), with stair plots for Real Data and Synthetic Data.

For the Displacement predictor, the ecdf plot for the existing values (in blue) matches the ecdf plot for the synthetic values (in red) fairly well.

For discrete variables, use the plotHistogram function to see the difference between the histogram of the values in trainTbl and the histogram of the values in syntheticTbl.

discreteVariable = "Model_Year";
plotHistogram(dd,Variable=discreteVariable)
legend(["Real Data","Synthetic Data"])

Figure: Histogram for Model_Year (x-axis: Model_Year Bins, y-axis: Distribution (%)), with bar plots for Real Data and Synthetic Data.

For the Model_Year predictor, the histogram for the existing values (in blue) matches the histogram for the synthetic values (in red) fairly well.

Train a bagged ensemble of trees using the original training data trainTbl. Specify MPG as the response variable. Then, train the same kind of regression model using the synthetic data syntheticTbl.

originalMdl = fitrensemble(trainTbl,"MPG",Method="Bag");
newMdl = fitrensemble(syntheticTbl,"MPG",Method="Bag");

Evaluate the performance of the two models on the test set by computing the test mean squared error (MSE). Smaller MSE values indicate better performance.

originalMSE = loss(originalMdl,testTbl)

originalMSE = 
7.0784

newMSE = loss(newMdl,testTbl)

newMSE = 
6.1031

The model trained on the synthetic data performs slightly better on the test data.

Evaluate Synthetic Data

Evaluate data synthesized from an existing data set. Compare the existing and synthetic data sets to determine the similarity between the two multivariate data distributions.

Load the sample file fisheriris.csv, which contains iris data including sepal length, sepal width, petal length, petal width, and species type. Read the file into a table, and then convert the Species variable into a categorical variable. Print a summary of the variables in the table.

fisheriris = readtable("fisheriris.csv");
fisheriris.Species = categorical(fisheriris.Species);
summary(fisheriris)

fisheriris: 150x5 table

Variables:

    SepalLength: double
    SepalWidth: double
    PetalLength: double
    PetalWidth: double
    Species: categorical (3 categories)

Statistics for applicable variables:

                   NumMissing       Min          Median         Max           Mean          Std    

    SepalLength        0           4.3000        5.8000        7.9000        5.8433        0.8281  
    SepalWidth         0                2             3        4.4000        3.0573        0.4359  
    PetalLength        0                1        4.3500        6.9000        3.7580        1.7653  
    PetalWidth         0           0.1000        1.3000        2.5000        1.1993        0.7622  
    Species            0

The summary display includes statistics for each variable. For example, the sepal length values range from 4.3 to 7.9, with a median of 5.8.

Create 150 new observations from the data in fisheriris. First, create an object by using the binningTabularSynthesizer function. Then, synthesize the data by using the synthesizeTabularData object function. Print a summary of the variables in the new syntheticData data set.

rng(0,"twister") % For reproducibility
synthesizer = binningTabularSynthesizer(fisheriris);
syntheticData = synthesizeTabularData(synthesizer,150);
summary(syntheticData)

syntheticData: 150x5 table

Variables:

    SepalLength: double
    SepalWidth: double
    PetalLength: double
    PetalWidth: double
    Species: categorical (3 categories)

Statistics for applicable variables:

                   NumMissing       Min          Median         Max           Mean          Std    

    SepalLength        0           4.3079        5.7174        7.6399        5.8280        0.8576  
    SepalWidth         0           2.0236        3.0336        4.2866        3.0819        0.4572  
    PetalLength        0           1.0010        4.4453        6.8538        3.6572        1.8192  
    PetalWidth         0           0.1002        1.3502        2.4759        1.1719        0.7597  
    Species            0

You can compare the variable statistics for syntheticData to the variable statistics for fisheriris. For example, the sepal length values in the synthetic data set range from approximately 4.3 to 7.6, with a median of 5.7. These statistics are similar to the statistics in the fisheriris data set.

Visually compare the observations in fisheriris and syntheticData by using scatter plots. Each point corresponds to an observation. The point color indicates the species of the corresponding iris.

tiledlayout(1,2)
nexttile
gscatter(fisheriris.SepalLength,fisheriris.PetalLength,fisheriris.Species)
xlabel("Sepal Length")
ylabel("Petal Length")
title("Existing Data")
nexttile
gscatter(syntheticData.SepalLength,syntheticData.PetalLength,syntheticData.Species)
xlabel("Sepal Length")
ylabel("Petal Length")
title("Synthetic Data")

The scatter plots indicate that the existing data set and the synthetic data set have similar characteristics.

Compare the existing and synthetic data sets by using the mmdtest function. The function performs a two-sample hypothesis test for the null hypothesis that the data sets come from the same distribution.

[mmd2,p,h] = mmdtest(fisheriris,syntheticData)

mmd2 = 
0.0020

p = 
0.9600

h = 
0

The returned value of h = 0 indicates that mmdtest fails to reject the null hypothesis that the data sets come from the same distribution at the 5% significance level. As with other hypothesis tests, this result does not guarantee that the null hypothesis is true. That is, the data sets do not necessarily come from the same distribution, but the low mmd2 value (squared maximum mean discrepancy) and the high p-value indicate that the distributions of the real and synthetic data sets are similar.

Algorithms

Estimate Multivariate Data Distribution by Binning

The binningTabularSynthesizer function estimates the distribution of the multivariate data set X by performing these steps:

1. Bin each continuous variable using equiprobable or equal-width binning, as specified by the BinMethod and NumBins name-value arguments.
2. Encode the continuous variables using the bin indices.
3. One-hot encode all binned and discrete variables.
4. Compute the probability of each unique row in the encoded data set.

The synthesizeTabularData function uses the computed probabilities to generate synthetic data.
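
The steps above can be sketched in base MATLAB for a small all-continuous data set. This is a simplified illustration, not the actual implementation. It assumes equal-width binning, identifies unique rows directly from the bin indices instead of building an explicit one-hot encoding (both identify the same rows), and assumes that new values are drawn uniformly within each sampled bin, which the description above does not specify. All variable names are hypothetical.

```matlab
rng(0,"twister")
X = [randn(300,1), rand(300,1)];   % two hypothetical continuous variables
numBins = 5;
[m,d] = size(X);

% Steps 1-2: bin each variable, then replace each value with its bin index
edges = cell(1,d);
binIdx = zeros(m,d);
for k = 1:d
    edges{k} = linspace(min(X(:,k)),max(X(:,k)),numBins+1);
    binIdx(:,k) = discretize(X(:,k),edges{k});
end

% Steps 3-4: each unique row of bin indices is one occupied cell of the
% joint histogram; its relative frequency is the estimated probability
[cells,~,rowToCell] = unique(binIdx,"rows");
prob = accumarray(rowToCell,1)/m;

% Sampling (assumption): draw cells according to prob, then draw each
% value uniformly within the corresponding bin
nNew = 1000;
cdf = cumsum(prob);
cdf(end) = 1;                                  % guard against round-off
picked = discretize(rand(nNew,1),[0; cdf]);    % index of the sampled cell
Xnew = zeros(nNew,d);
for k = 1:d
    lo = edges{k}(cells(picked,k));            % left edge of sampled bin
    hi = edges{k}(cells(picked,k)+1);          % right edge of sampled bin
    Xnew(:,k) = lo(:) + rand(nNew,1).*(hi(:)-lo(:));
end
```

Because the probabilities are stored per joint cell rather than per variable, the sketch preserves dependence between variables at the resolution of the bins.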

Alternative Functionality

Instead of creating a binningTabularSynthesizer object and then using the synthesizeTabularData object function to synthesize data, you can generate synthetic data directly by using the synthesizeTabularData function. Create an object if you want to easily generate synthetic data multiple times without having to relearn characteristics of the existing data set.
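
The two workflows can be compared side by side. In this sketch, the direct call form `synthesizeTabularData(X,n)` is an assumption based on the description above; check the synthesizeTabularData reference page for the exact syntax. The example uses the `meas` matrix from the fisheriris sample data.

```matlab
load fisheriris   % provides meas, a 150-by-4 numeric matrix
rng(0,"twister")

% One-off generation: learn the distribution and synthesize in one call
% (assumed syntax)
syntheticOnce = synthesizeTabularData(meas,100);

% Repeated generation: learn the distribution once, then reuse the object
synthesizer = binningTabularSynthesizer(meas);
batch1 = synthesizeTabularData(synthesizer,100);
batch2 = synthesizeTabularData(synthesizer,250);
```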

Version History

Introduced in R2024b
