How can I detect and remove outliers from a large dataset?

I am trying to process a large dataset (n = 5,000,000) and am struggling to write code that detects and removes all the outliers in it. I tried the modified Thompson tau method, but it didn't work. I am now trying the modified z-score method, but I still can't make any headway with the MATLAB code.
Attached is a plot of the signal showing the peaks and dips. I would also like to fill the removed outlier points by interpolation and would appreciate a suggestion on how to do that.
Any further help with getting rid of the peaks and dips in the signal would be appreciated.
I would also welcome suggestions for other methods of removing the outliers and, if possible, code for them.
Thank you.
  2 comments
Star Strider 2014-3-12
Do you have any trends in your data that you could model, perhaps with nlinfit or other regression routines? I have no idea what you are doing or what your data are, but detecting trends and other patterns first could make your task easier.
Arinze 2014-3-15
Star Strider, I attached a picture of the plot for better understanding.


Answers (4)

Shahab B 2016-9-30
How can I use it for simple data such as: main=[0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
Note that the outlier is 347.666506871168.
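For a short vector like this one, the mean-centred median-absolute-deviation check quoted in the comments below can be applied directly. This is only a minimal sketch, assuming a sensitivity factor of 6; `madValue` is an illustrative name chosen here so that the Statistics Toolbox function `mad` is not shadowed.

```matlab
main = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];

% Centre the data on its mean and take absolute deviations.
meanValue = mean(main);
absoluteDeviation = abs(main - meanValue);

% Median of the absolute deviations (illustrative variable name).
madValue = median(absoluteDeviation);

% Flag points whose deviation exceeds the chosen multiple of madValue.
sensitivityFactor = 6;   % assumed value; tune for your data
outlierIndexes = absoluteDeviation > sensitivityFactor * madValue;

outliers = main(outlierIndexes);
nonOutliers = main(~outlierIndexes);
```

With this factor only 347.666506871168 is flagged. The leading 0 is kept because its deviation from the mean (about 128.2) is below the threshold (about 190.6); a factor below roughly 4 would flag it as well.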
  4 comments
Image Analyst 2016-11-21
There are several definitions of MAD. My code above does definition 1.2.1 as listed on this page https://en.wikipedia.org/wiki/Average_absolute_deviation which gives 4 definitions using all combinations of mean and median. You're welcome to use whichever of those definitions best meets your needs.
Nivodi 2018-8-14
Image Analyst, how can I apply this part of your code to several columns?
% Compute the median absolute difference
meanValue = mean(vector)
% Compute the absolute differences. It will be a vector.
absoluteDeviation = abs(vector - meanValue)
% Compute the median of the absolute differences
mad = median(absoluteDeviation)
% Find outliers. They're outliers if the absolute difference
% is more than some factor times the mad value.
sensitivityFactor = 6 % Whatever you want.
thresholdValue = sensitivityFactor * mad;
outlierIndexes = abs(absoluteDeviation) > thresholdValue
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
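One possible way to apply the same steps to several columns is to compute every statistic along dimension 1, so that each column gets its own mean, median deviation, and threshold. The sketch below is an assumption-laden illustration, not the original author's code: the matrix `data` is made up, `madValue` and `cleaned` are illustrative names, and the subtraction relies on implicit expansion (MATLAB R2016b or later; use `bsxfun` on older releases).

```matlab
% Made-up example: two columns, each with one planted outlier.
data = [ 1 10;  2 11;  3 12;  4 13;  5 5000; ...
         6 14;  7 15;  8 16;  9 17; 1000 18 ];

% Column-wise mean (1-by-nCols row vector).
meanValue = mean(data, 1);

% Absolute deviation of every element from its own column mean.
absoluteDeviation = abs(data - meanValue);

% Column-wise median of the absolute deviations.
madValue = median(absoluteDeviation, 1);

% One threshold per column.
sensitivityFactor = 6;   % whatever you want
thresholdValue = sensitivityFactor * madValue;

% Logical matrix, same size as data, true where a value is an outlier.
outlierIndexes = absoluteDeviation > thresholdValue;

% Columns must stay the same length, so mark outliers as NaN
% rather than deleting them.
cleaned = data;
cleaned(outlierIndexes) = NaN;
```

Marking outliers as NaN instead of deleting them keeps the matrix rectangular, since different columns will generally have different numbers of outliers.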



Image Analyst 2014-3-12
That's not large. It's just a fraction of the size of a typical digital image. You can use "deleteoutliers" by Brett Shoelson of MathWorks: http://www.mathworks.com/matlabcentral/fileexchange/3961-deleteoutliers Or you could try the Median Absolute Deviation (a popular statistical method for detecting outliers), as demonstrated on an image in the file I attached.
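Since the question also asks about the modified z-score and about filling removed points by interpolation, here is a minimal sketch of both on a 1-D signal. It is not the attached demo: the test signal, the spike positions, and the variable names are made up for illustration, and the constants 0.6745 and 3.5 are the ones commonly quoted for the modified z-score.

```matlab
% Made-up test signal: a sine wave with three planted spikes.
t = (1:1000)';
clean = sin(2*pi*t/200);
x = clean;
x([100 400 750]) = x([100 400 750]) + [10; -10; 10];

% Modified z-score: deviations from the median, scaled by the
% median absolute deviation (MAD) about the median.
med = median(x);
madValue = median(abs(x - med));
modZ = 0.6745 * (x - med) / madValue;

% Commonly quoted cutoff for the modified z-score.
outlierMask = abs(modZ) > 3.5;

% Fill the flagged samples by linear interpolation over the good ones.
good = ~outlierMask;
xFilled = x;
xFilled(outlierMask) = interp1(t(good), x(good), t(outlierMask), 'linear');
```

If an outlier sits at the very start or end of the signal, `interp1` returns NaN there unless an extrapolation option is passed. A global median like this also assumes the signal has no strong trend; for a drifting signal a moving window would be needed. In MATLAB R2017a and later, the built-in `isoutlier` and `filloutliers` functions cover similar ground.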
  7 comments
Arinze 2014-3-15
Edited: Arinze 2014-3-15
Here is a new plot of the signal. My MATLAB doesn't recognise the 'deleteoutliers' command. Any idea why?
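One likely explanation, offered as an assumption since the thread does not confirm it: deleteoutliers is a File Exchange submission rather than a built-in function, so MATLAB only recognises it once the downloaded file is on the path. A quick check might look like this, where `downloadFolder` is a hypothetical location for the unzipped download:

```matlab
% Hypothetical location: the folder where the File Exchange download was unzipped.
downloadFolder = fullfile(pwd, 'deleteoutliers');

% exist(...) returns 2 when a file of that name is visible to MATLAB.
onPath = exist('deleteoutliers', 'file') == 2;

% If it is not visible but the folder exists, add the folder and check again.
if ~onPath && exist(downloadFolder, 'dir') == 7
    addpath(downloadFolder);
    onPath = exist('deleteoutliers', 'file') == 2;
end

if ~onPath
    disp('deleteoutliers.m was not found; download it and add its folder to the path.');
end
```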



Tim leonard 2014-3-12
Trimming your values based on percentiles is quick and powerful:
vector = randi(100,100,1);
percentiles = prctile(vector,[5 95]); % 5th and 95th percentiles
outlierIndex = vector < percentiles(1) | vector > percentiles(2);
% Remove outlier values
vector(outlierIndex) = [];
  1 comment
Image Analyst 2014-3-12
But something at the 1st, 99th, or 100th percentile is not necessarily an outlier, so you could be getting rid of good data. It's quick, but I wouldn't call it powerful. I'd call it risky, unless you know for a fact that a specific fraction of your data is noise.
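To illustrate the risk, here is a small sketch on made-up data that contains no outliers at all: a plain ramp. Trimming at the 5th and 95th percentiles still discards about 10% of it. The variable names are illustrative.

```matlab
% Made-up data with no outliers: an evenly spaced ramp.
vector = (1:1000)';

% Same trimming rule as above: keep only the 5th to 95th percentile range.
limits = prctile(vector, [5 95]);
outlierIndex = vector < limits(1) | vector > limits(2);

% Fraction of perfectly good points that would be thrown away.
fractionRemoved = mean(outlierIndex);
```

Here `fractionRemoved` comes out at roughly 0.10, because a percentile cut removes a fixed share of the data whether or not any of it is actually anomalous.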



Amir H. Souri 2017-6-26
Hi, I may be late, but I just want to point out that the definition of an outlier is totally subjective. To find outliers, you need to estimate the probability distribution of your data: fit a distribution (for example, a Gaussian) and check whether the fit is statistically acceptable (you may use the Kolmogorov–Smirnov test or a bootstrap method). You can then identify the outliers by defining a confidence interval. For example, you can say that any data within the 95% confidence interval are acceptable and everything else can be ignored as outliers. As I mentioned, there is no absolute answer; it depends on the nature of the data and on how strict you want to be with the confidence interval.
Good luck!
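The procedure described above could be sketched roughly as follows, assuming a Gaussian model and the Statistics Toolbox functions `kstest` and `norminv`. The data are made up (a deterministic grid of normal quantiles plus one planted outlier) and the variable names are illustrative.

```matlab
% Made-up data: 1000 values following N(10, 2^2), built from evenly
% spaced normal quantiles so the example is repeatable, plus one
% planted outlier at 40.
p = ((1:1000)' - 0.5) / 1000;
x = [norminv(p, 10, 2); 40];

% Fit a Gaussian by estimating its mean and standard deviation.
mu = mean(x);
sigma = std(x);

% Kolmogorov-Smirnov test of the standardized data against N(0,1).
% h = 0 means normality is not rejected at the 5% level.
[h, pValue] = kstest((x - mu) / sigma);

% Two-sided 95% interval under the fitted Gaussian.
alpha = 0.05;
z = norminv(1 - alpha/2);
lowerLimit = mu - z * sigma;
upperLimit = mu + z * sigma;

% Everything outside the interval is treated as an outlier.
outlierMask = x < lowerLimit | x > upperLimit;
```

Two caveats: the planted outlier itself inflates `mu` and `sigma`, and a 95% interval flags a few percent of ordinary points even when the data are perfectly Gaussian. Also, because the parameters are estimated from the same data, the plain KS p-value is only approximate; `lillietest` is designed for that situation.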
