isoutlier

Find outliers in data

Description

TF = isoutlier(A) returns a logical array whose elements are true when an outlier is detected in the corresponding element of A.

  • If A is a matrix, then isoutlier operates on each column of A separately.

  • If A is a multidimensional array, then isoutlier operates along the first dimension of A whose size does not equal 1.

  • If A is a table or timetable, then isoutlier operates on each variable of A separately.

By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) from the median.

You can use isoutlier functionality interactively by adding the Clean Outlier Data task to a live script.

TF = isoutlier(A,method) specifies a method for detecting outliers. For example, isoutlier(A,"mean") returns true for all elements more than three standard deviations from the mean.

TF = isoutlier(A,"percentiles",threshold) defines outliers as points outside of the percentiles specified in threshold. The threshold argument is a two-element row vector containing the lower and upper percentile thresholds, such as [10 90].

TF = isoutlier(A,movmethod,window) detects local outliers using a moving window method with window length window. For example, isoutlier(A,"movmedian",5) returns true for all elements more than three local scaled MAD from the local median within a five-element window.

TF = isoutlier(___,dim) specifies the dimension of A to operate along for any of the previous syntaxes. For example, isoutlier(A,2) operates on each row of a matrix A.

TF = isoutlier(___,Name,Value) specifies additional parameters for detecting outliers using one or more name-value arguments. For example, isoutlier(A,"SamplePoints",t) detects outliers in array A relative to the corresponding elements of a time vector t.

[TF,L,U,C] = isoutlier(___) also returns the lower threshold L, upper threshold U, and center value C used by the outlier detection method.

Examples

Find the outliers in a vector of data. A logical 1 in the output indicates the location of an outlier.

A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
TF = isoutlier(A)
TF = 1x15 logical array

   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0

Define outliers as points more than three standard deviations from the mean, and find the locations of outliers in a vector.

A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
TF = isoutlier(A,"mean")
TF = 1x15 logical array

   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0

Use a moving detection method to detect local outliers in a sine wave that corresponds to a time vector.

Create a vector of data containing a local outlier.

x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;

Create a time vector that corresponds to the data in A.

t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);

Define outliers as points more than three local scaled MAD from the local median within a sliding window. Find the locations of the outliers in A relative to the points in t with a window size of 5 hours. Plot the data and detected outliers.

TF = isoutlier(A,"movmedian",hours(5),"SamplePoints",t);
plot(t,A)
hold on
plot(t(TF),A(TF),"x")
legend("Original Data","Outlier Data")

[Figure: line plot of Original Data with Outlier Data shown as markers.]

Find outliers for each row of a matrix.

Create a matrix of data containing outliers along the diagonal.

A = magic(5) + diag(200*ones(1,5))
A = 5×5

   217    24     1     8    15
    23   205     7    14    16
     4     6   213    20    22
    10    12    19   221     3
    11    18    25     2   209

Find the locations of outliers based on the data in each row.

TF = isoutlier(A,2)
TF = 5x5 logical array

   1   0   0   0   0
   0   1   0   0   0
   0   0   1   0   0
   0   0   0   1   0
   0   0   0   0   1

Locate an outlier in a vector of data and visualize the outlier.

Create a vector of data containing an outlier.

x = 1:10;
A = [60 59 49 49 58 100 61 57 48 58];

Locate the outlier using the default detection method "median".

[TF,L,U,C] = isoutlier(A);

Plot the original data, the outlier, and the thresholds and center value determined by the detection method. The center value is the median of the data, and the upper and lower thresholds are three scaled MAD above and below the median.

plot(x,A)
hold on
plot(x(TF),A(TF),"x")
yline([L U C],":",["Lower Threshold","Upper Threshold","Center Value"])
legend("Original Data","Outlier Data")

[Figure: line plot of Original Data with Outlier Data shown as markers, plus constant lines for the lower threshold, upper threshold, and center value.]

Input Arguments

A: Input data, specified as a vector, matrix, multidimensional array, table, or timetable.

  • If A is a table, then its variables must be of type double or single, or you can use the DataVariables argument to list double or single variables explicitly. Specifying variables is useful when you are working with a table that contains variables with data types other than double or single.

  • If A is a timetable, then isoutlier operates only on the table elements. If row times are used as sample points, then they must be unique and listed in ascending order.

Data Types: double | single | table | timetable

method: Method for detecting outliers, specified as one of these values.

Method | Description
"median" | Outliers are defined as elements more than three scaled MAD from the median. The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
"mean" | Outliers are defined as elements more than three standard deviations from the mean. This method is faster but less robust than "median".
"quartiles" | Outliers are defined as elements more than 1.5 interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). This method is useful when the data in A is not normally distributed.
"grubbs" | Outliers are detected using Grubbs’ test for outliers, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data in A is normally distributed.
"gesd" | Outliers are detected using the generalized extreme Studentized deviate test for outliers. This iterative method is similar to "grubbs", but can perform better when there are multiple outliers masking each other.

To detect outliers using a specified range, use the isbetween function.

threshold: Percentile thresholds, specified as a two-element row vector whose elements are in the interval [0, 100]. The first element indicates the lower percentile threshold, and the second element indicates the upper percentile threshold. The first element of threshold must be less than the second element.

For example, a threshold of [10 90] defines outliers as points below the 10th percentile or above the 90th percentile.
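
As a brief illustration of the percentile syntax, the following sketch reuses the sample vector from the examples on this page and flags values outside the 10th to 90th percentile range. The variable names outlierValues and thresholds are illustrative only, and no output is shown because the exact thresholds depend on how the percentiles are interpolated.

```matlab
% Sample data reused from the examples on this page.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Flag points below the 10th percentile or above the 90th percentile.
[TF,L,U] = isoutlier(A,"percentiles",[10 90]);

% Values treated as outliers, and the thresholds that were applied.
outlierValues = A(TF)
thresholds = [L U]
```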

movmethod: Moving method for detecting outliers, specified as one of these values.

Method | Description
"movmedian" | Outliers are defined as elements more than three local scaled MAD from the local median over a window length specified by window. This method is also known as a Hampel filter.
"movmean" | Outliers are defined as elements more than three local standard deviations from the local mean over a window length specified by window.

window: Window length, specified as a positive integer scalar, a two-element vector of positive integers, a positive duration scalar, or a two-element vector of positive durations.

When window is a positive integer scalar, the window is centered about the current element and contains window-1 neighboring elements. If window is even, then the window is centered about the current and previous elements.

When window is a two-element vector of positive integers [b f], the window contains the current element, b elements backward, and f elements forward.

When A is a timetable or SamplePoints is specified as a datetime or duration vector, window must be of type duration, and the windows are computed relative to the sample points.
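
The scalar and two-element window forms can be compared on the sine-wave data from the examples on this page. This is a minimal sketch; the window lengths 5 and [3 1] are arbitrary choices for illustration.

```matlab
% Sine wave with one corrupted sample, as in the examples on this page.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;

% Scalar window: five elements centered on the current element.
TFcentered = isoutlier(A,"movmedian",5);

% Two-element window [b f]: the current element plus three elements
% backward and one element forward.
TFasym = isoutlier(A,"movmedian",[3 1]);

% Check whether each window shape flags the corrupted sample at index 47.
[TFcentered(47) TFasym(47)]
```

Both windows contain five elements, but they are positioned differently around the current element, so the local median and local scaled MAD can differ near rapid changes in the data.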

dim: Operating dimension, specified as a positive integer scalar. If no value is specified, then the default is the first array dimension whose size does not equal 1.

Consider an m-by-n input matrix, A:

  • isoutlier(A,1) detects outliers based on the data in each column of A and returns an m-by-n matrix.

  • isoutlier(A,2) detects outliers based on the data in each row of A and returns an m-by-n matrix.

For table or timetable input data, dim is not supported and operation is along each table or timetable variable separately.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: isoutlier(A,"mean",ThresholdFactor=4)

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: isoutlier(A,"mean","ThresholdFactor",4)

Data Options

SamplePoints: Sample points, specified as a vector of sample point values or, when the input data is a table, as one of the options in the following table. The sample points represent the x-axis locations of the data, and must be sorted and contain unique elements. Sample points do not need to be uniformly sampled. The vector [1 2 3 ...] is the default.

When the input data is a table, you can specify the sample points as a table variable using one of these options.

Indexing Scheme | Examples
Variable name: A string scalar or character vector | "A" or 'A': a variable named A
Variable index: An index number that refers to the location of a variable in the table | 3: the third variable from the table
Variable index: A logical vector. Typically, this vector is the same length as the number of variables, but you can omit trailing 0 or false values | [false false true]: the third variable
Function handle: A function handle that takes a table variable as input and returns a logical scalar | @isnumeric: one variable containing numeric values
Variable type: A vartype subscript that selects one variable of a specified type | vartype("numeric"): one variable containing numeric values

Note

This name-value argument is not supported when the input data is a timetable. Timetables use the vector of row times as the sample points. To use different sample points, you must edit the timetable so that the row times contain the desired sample points.

Moving windows are defined relative to the sample points. For example, if t is a vector of times corresponding to the input data, then isoutlier(rand(1,10),"movmean",3,"SamplePoints",t) has a window that represents the time interval between t(i)-1.5 and t(i)+1.5.

When the sample points vector has data type datetime or duration, the moving window length must have type duration.

Example: isoutlier(A,"SamplePoints",0:0.1:10)

Example: isoutlier(T,"SamplePoints","Var1")

Data Types: single | double | datetime | duration

DataVariables: Table variables to operate on, specified as one of the options in this table. The DataVariables value indicates which variables of the input table to examine for outliers. The data type associated with the indicated variables must be double or single.

The first output TF contains false for variables not specified by DataVariables unless the value of OutputFormat is "tabular".

Indexing Scheme | Values to Specify | Examples
Variable names | A string scalar or character vector | "A" or 'A': a variable named A
Variable names | A string array or cell array of character vectors | ["A" "B"] or {'A','B'}: two variables named A and B
Variable names | A pattern object | "Var"+digitsPattern(1): variables named "Var" followed by a single digit
Variable index | An index number that refers to the location of a variable in the table | 3: the third variable from the table
Variable index | A vector of numbers | [2 3]: the second and third variables from the table
Variable index | A logical vector. Typically, this vector is the same length as the number of variables, but you can omit trailing 0 (false) values. | [false false true]: the third variable
Function handle | A function handle that takes a table variable as input and returns a logical scalar | @isnumeric: all the variables containing numeric values
Variable type | A vartype subscript that selects variables of a specified type | vartype("numeric"): all the variables containing numeric values

Example: isoutlier(T,"DataVariables",["Var1" "Var2" "Var4"])

OutputFormat: Output data type, specified as one of these values:

  • "logical" — For table or timetable input data, return the output TF as a logical array.

  • "tabular" — For table input data, return the output TF as a table. For timetable input data, return the output TF as a timetable.

For vector, matrix, or multidimensional array input data, OutputFormat is not supported.

Example: isoutlier(T,"OutputFormat","tabular")
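
The following sketch shows how DataVariables and OutputFormat might be combined for table input. The table T, its variable names, and its values are made up for illustration, and OutputFormat requires a release that supports that argument.

```matlab
% Hypothetical table with two numeric variables and one string variable.
T = table([1;2;3;100;2;3],[5;6;5;6;5;6],["a";"b";"c";"d";"e";"f"], ...
    'VariableNames',{'Var1','Var2','Label'});

% Examine only the numeric variables. TF is a 6-by-3 logical array,
% and the column for the unexamined variable Label is all false.
TF = isoutlier(T,"DataVariables",["Var1" "Var2"]);

% Return a table instead. Only the variable listed in DataVariables
% appears in the output.
TFtab = isoutlier(T,"DataVariables","Var1","OutputFormat","tabular");
```

With the default "median" method, the value 100 in Var1 is the only element flagged, because every value in Var2 lies within three scaled MAD of its median.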

Outlier Detection Options

ThresholdFactor: Detection threshold factor, specified as a nonnegative scalar.

For methods "median" and "movmedian", the detection threshold factor replaces the number of scaled MAD, which is 3 by default.

For methods "mean" and "movmean", the detection threshold factor replaces the number of standard deviations from the mean, which is 3 by default.

For methods "grubbs" and "gesd", the detection threshold factor is a scalar ranging from 0 to 1. Values close to 0 result in a smaller number of outliers, and values close to 1 result in a larger number of outliers. The default detection threshold factor is 0.05.

For the "quartiles" method, the detection threshold factor replaces the number of interquartile ranges, which is 1.5 by default.

This name-value argument is not supported when the specified method is "percentiles".
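
To make the effect of the threshold factor concrete, this sketch compares the default "median" detection with a much wider band on the sample vector used in the examples on this page. The factor 20 is an arbitrary value chosen for illustration.

```matlab
% Sample data reused from the examples on this page.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Default "median" method: three scaled MAD from the median.
TFdefault = isoutlier(A);

% Widen the band to 20 scaled MAD.
TFwide = isoutlier(A,"median","ThresholdFactor",20);

find(TFdefault)   % indices 4 and 9 (values 100 and 300)
find(TFwide)      % index 9 only (value 300)
```

Here the median is 59 and the scaled MAD is about 2.97, so the value 100 lies outside a band of 3 scaled MAD (about 8.9) but inside a band of 20 scaled MAD (about 59.3).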

MaxNumOutliers: Maximum outlier count, for the "gesd" method only, specified as a positive integer scalar. The MaxNumOutliers value specifies the maximum number of outliers returned by the "gesd" method. For example, isoutlier(A,"gesd","MaxNumOutliers",5) returns no more than five outliers.

The default value for MaxNumOutliers is the integer nearest to 10 percent of the number of elements in A. Setting a larger value for the maximum number of outliers makes it more likely that all outliers are detected but at the cost of reduced computational efficiency.

The "gesd" method assumes the nonoutlier input data is sampled from an approximately normal distribution. When the data is not sampled in this way, the number of returned outliers might exceed the MaxNumOutliers value.

Output Arguments

TF: Outlier indicator, returned as a vector, matrix, multidimensional array, table, or timetable.

TF is the same size as A unless the value of OutputFormat is "tabular". If the value of OutputFormat is "tabular", then TF has only variables corresponding to the DataVariables specified.

Data Types: logical

L: Lower threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the lower threshold value of the default outlier detection method is three scaled MAD below the median of the input data.

If method is used for outlier detection, then L has the same size as A in all dimensions except for the operating dimension where the length is 1. If movmethod is used, then L has the same size as A.

Data Types: double | single | table | timetable

U: Upper threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the upper threshold value of the default outlier detection method is three scaled MAD above the median of the input data.

If method is used for outlier detection, then U has the same size as A in all dimensions except for the operating dimension where the length is 1. If movmethod is used, then U has the same size as A.

Data Types: double | single | table | timetable

C: Center value used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the center value of the default outlier detection method is the median of the input data.

If method is used for outlier detection, then C has the same size as A in all dimensions except for the operating dimension where the length is 1. If movmethod is used, then C has the same size as A.

Data Types: double | single | table | timetable
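
The size rules for the threshold and center outputs can be checked with a short sketch. It uses the 5-by-5 matrix from the examples on this page; the output names Lcol, Lrow, and Lmov are illustrative.

```matlab
% Matrix with outliers along the diagonal, as in the examples on this page.
A = magic(5) + diag(200*ones(1,5));

[~,Lcol] = isoutlier(A);                % operates along columns (dim 1)
[~,Lrow] = isoutlier(A,2);              % operates along rows (dim 2)
[~,Lmov] = isoutlier(A,"movmedian",3);  % moving method, along columns

size(Lcol)   % 1 5: one threshold per column
size(Lrow)   % 5 1: one threshold per row
size(Lmov)   % 5 5: one threshold per element
```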

More About

Median Absolute Deviation

For a finite-length vector A made up of N scalar observations, the median absolute deviation (MAD) is defined as

$$\mathrm{MAD} = \operatorname{median}\left(\left|A_i - \operatorname{median}(A)\right|\right)$$

for i = 1,2,...,N.

The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
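
As a sanity check, the default detection rule can be reproduced by hand from the scaled MAD definition. This is a minimal sketch on the sample vector from the examples on this page, not a description of how isoutlier is implemented internally; the names scaledMAD and TFmanual are illustrative.

```matlab
% Sample data reused from the examples on this page.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Scaling constant from the definition; approximately 1.4826.
c = -1/(sqrt(2)*erfcinv(3/2));

% Scaled MAD, center value, and thresholds at three scaled MAD.
scaledMAD = c*median(abs(A - median(A)));
C = median(A);
L = C - 3*scaledMAD;
U = C + 3*scaledMAD;

% Elements strictly outside [L, U] are treated as outliers.
TFmanual = (A < L) | (A > U);

% Compare with the built-in default method.
isequal(TFmanual,isoutlier(A))
```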

Alternative Functionality

Live Editor Task

You can use isoutlier functionality interactively by adding the Clean Outlier Data task to a live script.

Version History

Introduced in R2017a
