
When you examine a data plot, you might find that some points
appear to differ dramatically from the
rest of the data. In some cases, it is reasonable to consider such
points *outliers*, or data values that appear to be inconsistent with the
rest of the data.

The following example illustrates how to remove outliers from three data sets in the 24-by-3 matrix `count`. In this case, an outlier is defined as a value that is more than three standard deviations away from the mean.

**Caution**

Be cautious about changing data unless you are confident that you understand the source of the problem you want to correct. Removing an outlier has a greater effect on the standard deviation than on the mean of the data. Deleting one such point leads to a smaller new standard deviation, which might result in making some remaining points appear to be outliers!
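The effect described in this caution can be seen in a minimal sketch. The vector `x` below is made-up data chosen for illustration, not part of `count`, and the variable names are hypothetical. It applies the three-standard-deviation test twice.

```matlab
% Synthetic column (an assumption for illustration): 20 ordinary
% readings of 9 and 11, plus two unusual values, 20 and 100
x = [repmat([9; 11], 10, 1); 20; 100];

% First pass: flag values more than three standard deviations from the mean
out1 = abs(x - mean(x)) > 3*std(x);
n1 = sum(out1);     % only the value 100 is flagged

% Remove the flagged value and apply the same test to what is left
y = x(~out1);
out2 = abs(y - mean(y)) > 3*std(y);
n2 = sum(out2);     % the value 20 is flagged now
```

After 100 is removed, the standard deviation drops sharply, so the value 20 falls outside the new three-standard-deviation band even though it passed the first test.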

    % Import the sample data
    load count.dat;
    % Calculate the mean and the standard deviation
    % of each data column in the matrix
    mu = mean(count)
    sigma = std(count)

The Command Window displays

    mu =
       32.0000   46.5417   65.5833
    sigma =
       25.3703   41.4057   68.0281

When an *outlier* is considered to be more than three standard deviations away from the mean, use the following code to determine the number of outliers in each column of the `count` matrix:

    [n,p] = size(count);
    % Create a matrix of mean values by
    % replicating the mu vector for n rows
    MeanMat = repmat(mu,n,1);
    % Create a matrix of standard deviation values by
    % replicating the sigma vector for n rows
    SigmaMat = repmat(sigma,n,1);
    % Create a matrix of zeros and ones, where ones indicate
    % the location of outliers
    outliers = abs(count - MeanMat) > 3*SigmaMat;
    % Calculate the number of outliers in each column
    nout = sum(outliers)

The procedure returns the following number of outliers in each column:

    nout =
         1     0     0

There is one outlier in the first data column of `count` and none in the other two columns.
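The replicated matrices `MeanMat` and `SigmaMat` can be avoided in a more compact version of the same test. The sketch below assumes MATLAB R2016b or later, where a matrix and a row vector with matching column counts are expanded automatically in element-wise operations. It is an alternative formulation, not the procedure shown above.

```matlab
% Load the same sample data file used in this example
load count.dat;

% Assumption: R2016b or later, so the 24-by-3 matrix and the
% 1-by-3 row vectors combine without explicit replication
mu = mean(count);
sigma = std(count);
outliers = abs(count - mu) > 3*sigma;

% Count the flagged values in each column
nout = sum(outliers);
```

On earlier releases, the `repmat` version remains the safe choice because it makes both operands the same size explicitly.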

To remove every row of data that contains an outlier, type

    count(any(outliers,2),:) = [];

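Here, `any(outliers,2)` returns a logical column vector with one element per row of `outliers`. An element is `1` when any value in the corresponding row is nonzero. The argument `2` specifies that `any` works along the second dimension of the `outliers` matrix, that is, across the columns within each row. Indexing `count` with this vector selects every row that contains at least one outlier, and assigning `[]` deletes those rows.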
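To make the row-wise behavior of `any` concrete, here is a small sketch on a made-up 4-by-3 example. The names `mask`, `data`, and `rowHasOutlier` are hypothetical and stand in for `outliers`, `count`, and the index expression.

```matlab
% Hypothetical logical mask: true marks an outlier position
mask = logical([0 0 0; 1 0 0; 0 0 0; 0 1 1]);

% Hypothetical data with rows [1 5 9; 2 6 10; 3 7 11; 4 8 12]
data = reshape(1:12, 4, 3);

% Collapse across columns: one logical value per row
rowHasOutlier = any(mask, 2);

% For comparison, dimension 1 collapses down the rows: one value per column
colHasOutlier = any(mask, 1);

% Delete rows 2 and 4, which each contain at least one flagged value
data(rowHasOutlier, :) = [];
```

Note that the fourth row is removed only once even though it contains two flagged values, because `any` reports whether a row has an outlier, not how many.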