removing outliers
Hi,
I have per-event data for n companies (not time series data). Visually, I can see that there are outliers, but I don't know which method to use to remove them in MATLAB. Any help is appreciated.
Answers (3)
Richard Willey
2011-4-1
Automatically detecting outliers is tricky stuff.
You normally need fairly precise information regarding your data as well as the model that you are fitting to your data.
Here's a relatively simple technique that will work for many types of linear models. The methodology is based on a statistic called "Cook's Distance" that you can extract from regstats.
Cook's Distance for a given data point measures the extent to which a regression model would change if this data point were excluded from the regression. Cook's Distance is sometimes used to suggest whether a given data point might be an outlier.
Here's a simple example illustrating how this works:
% Create a vector of X values
X = 1:100;
X = X';
% Create a noise vector
noise = randn(100,1);
% Create a second noise value where sigma is much larger
noise2 = 10*randn(100,1);
% Substitute noise2 for noise at obs# (11, 31, 51, 71, 91)
% Many of these points will have an undue influence on the model
noise(11:20:91) = noise2(11:20:91);
% Specify Y = F(X)
Y = 3*X + 2 + noise;
% Cook's Distance for a given data point measures the extent to
% which a regression model would change if this data point
% were excluded from the regression. Cook's Distance is
% sometimes used to suggest whether a given data point might be an outlier.
% Use regstats to calculate Cook's Distance
stats = regstats(Y,X,'linear');
% Cook's Distance > 4/n (n = number of observations) is a typical
% threshold that is used to suggest the presence of an outlier
potential_outlier = stats.cookd > 4/length(X);
% Display the index of potential outliers and graph the results
X(potential_outlier)
scatter(X,Y, 'b.')
hold on
scatter(X(potential_outlier),Y(potential_outlier), 'r.')
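If regstats is not available, the same quantity can be worked out from the residuals and leverage values of the fit. What follows is only a sketch under a few assumptions: a straight-line model with an intercept, a fixed random seed, and one planted outlier at observation 51. The variable names (A, h, s2, cookd, flagged) are illustrative and not part of the example above.

```matlab
% Sketch: Cook's distance for a straight-line fit, computed by hand.
rng(0);                          % fixed seed (assumption, for repeatability)
n = 100;
x = (1:n)';
y = 3*x + 2 + randn(n,1);
y(51) = y(51) + 40;              % plant one gross outlier (illustrative)

A  = [ones(n,1) x];              % design matrix: intercept and slope
p  = size(A,2);                  % number of fitted coefficients
b  = A \ y;                      % least-squares coefficients
e  = y - A*b;                    % residuals
[Q,~] = qr(A,0);                 % economy-size QR of the design matrix
h  = sum(Q.^2, 2);               % leverage: diagonal of the hat matrix
s2 = sum(e.^2) / (n - p);        % residual variance estimate (MSE)

% Cook's distance: D_i = e_i^2 / (p*s2) * h_i / (1 - h_i)^2
cookd = (e.^2 ./ (p*s2)) .* (h ./ (1 - h).^2);

flagged = find(cookd > 4/n);     % same 4/n rule of thumb as above
```

The flagged indices are candidates to inspect, not points to delete automatically; whether they are really outliers still depends on the model you believe in.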
2 comments
Anirudh Thatipelli
2018-5-24
Thanks for referring to Cook's distance @Richard Willey. It has been a great help for me in removing outliers.
Matt Fig
2011-3-26
What form is the data? You might be able to use logical indexing. For example:
% x is some data with outliers 99 and -70. We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10); % Take those values less than 10
x = x(x>0); % Take those values greater than zero.
You could also do this in one shot, as below.
% x is some data with outliers 99 and -70. We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10 & x>0)
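Since the question mentions per-event data for several companies, the same idea may be applied row-wise so that each value stays paired with its company. This is a sketch only: the two-column layout [companyID, value] and the bounds 0 and 10 are assumptions made for illustration.

```matlab
% Hypothetical layout: one row per event, columns = [companyID, value].
data = [1,   2.1;
        1,   3.0;
        2,  99.0;
        2,   1.5;
        3, -70.0;
        3,   4.2];

keep    = data(:,2) > 0 & data(:,2) < 10;  % logical mask, one entry per row
cleaned = data(keep, :);                   % keep whole rows so IDs stay aligned
removed = data(~keep, :);                  % rows that were dropped, for review
```

Keeping the removed rows around makes it easy to check afterwards which companies lost events.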
Walter Roberson
2011-3-27
"outlier" is mathematically a matter of interpretation.
What is the outlier in this data?
1 2 3 1 2 3
Answer: 2, because the underlying process is believed to create 2 only 1 time in 1000 compared to 1 or 3, so for 2 to show up twice is unusual for this data.
But if you only had the data, how would you know that?
Thus, in order for a program to determine what is an "outlier" or not, you need to encode a model about what is "typical" data and what is not.
5 comments
Walter Roberson
2011-4-1
Sometimes it is more effective to compute deviations with a "leave one out" method: if this point were not already part of the dataset, how many standard deviations would it lie from the mean of the remaining (smaller) dataset?
For normally distributed data, about 99.7% of values fall within three standard deviations of the mean; for your purposes, a lower cutoff such as 2.5 standard deviations may be warranted.
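A minimal sketch of the leave-one-out test, reusing the small vector from the logical indexing answer. The cutoff of 2.5 and the variable names (z, is_outlier, x_clean) are assumptions for illustration.

```matlab
% Leave-one-out deviation test (sketch).
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
n = numel(x);
z = zeros(1, n);
for i = 1:n
    rest = x([1:i-1, i+1:n]);                 % dataset without point i
    z(i) = (x(i) - mean(rest)) / std(rest);   % deviations from the rest
end

cutoff     = 2.5;                             % assumed cutoff, see above
is_outlier = abs(z) > cutoff;
x_clean    = x(~is_outlier);
```

When more than one outlier is present, each one still inflates the standard deviation of the remaining data for the others, which is one reason a cutoff below 3 can be needed in practice.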