How to check for and remove outliers when the data are not normally distributed

Question

J1 2015-11-18

0
Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/255870-how-to-check-and-remove-outliers-when-it-is-non-normal-distribution

Commented: Star Strider 2015-11-19

I have read that z-score and mapstd standardization are good ways to detect outliers. But the z-score is only appropriate when the data are normally distributed, and I found that my data do not follow a normal distribution. What should I do?

(1) Should I transform my data (Box-Cox or Johnson transformation) to a normal distribution and then use the z-score to detect outliers?

(2) After transforming the data and removing the outliers, should I use the transformed data or the original data (with the outliers removed from both) as the input to the neural network? I found that when I feed the transformed data (Johnson transformation) into the neural network, it performs worse than with the original data. Why is that?

Can anybody help? Thanks a lot.


Answer 1

Star Strider 2015-11-18

1
Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/255870-how-to-check-and-remove-outliers-when-it-is-non-normal-distribution#answer_200342

The z-score is frequently used because, according to the Central Limit Theorem, when the data are sufficiently numerous they tend to be normally distributed regardless of the underlying distribution. (Strictly speaking, the theorem concerns sums and means of the observations, so there is more to it than this simple statement, but that is the most basic explanation.)

If you know how your data are distributed, you can get the ‘critical values’ of the 0.025 and 0.975 probabilities for it and use them as your decision criteria to reject outliers. Again, outlier detection and rejection is another topic that goes beyond this simple explanation, and I encourage you to explore it on your own. How you decide to implement it with your data is something you will have to experiment with.
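As a rough sketch of the critical-value idea, assume the data are approximately lognormal (an assumption made purely for illustration) and that the Statistics Toolbox is available. The 0.025 and 0.975 quantiles of the fitted distribution then serve as rejection limits. The variable names and the synthetic data are hypothetical.

```matlab
rng(0);                                     % reproducible synthetic example
data = exp(0.5*randn(200,1));               % hypothetical right-skewed (lognormal) sample
data(end) = 50;                             % one planted extreme value

pd   = fitdist(data, 'Lognormal');          % fit the assumed distribution
lims = icdf(pd, [0.025 0.975]);             % lower and upper critical values

isOut   = data < lims(1) | data > lims(2);  % points outside the central 95%
cleaned = data(~isOut);                     % data with the flagged points dropped
```

Note that this criterion flags roughly 5% of even a perfectly clean sample by construction, so the flagged points are candidates to inspect rather than confirmed outliers.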


Greg Heath 2015-11-19


As I mentioned in my answer, zscore is so useful for detecting outliers in non-normal distributions that I use it most of the time.

Again: for outlier detection I recommend using the combination of zscore and plots with all non-binary data.
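A minimal sketch of combining zscore with a plot might look like the following. The data are synthetic, and the threshold of 3 is a common convention assumed here for illustration, not something prescribed in this thread.

```matlab
rng(1);                                   % reproducible synthetic example
x = [randn(100,1); 8];                    % hypothetical data with one planted outlier
z = zscore(x);                            % standardize: (x - mean(x)) / std(x)
suspect = abs(z) > 3;                     % assumed threshold; adjust after looking at the plot

plot(x, 'b.'); hold on
plot(find(suspect), x(suspect), 'ro');    % circle the flagged points
hold off
xlabel('Observation'); ylabel('Value');
```

The plot is what lets you judge whether a flagged point is an isolated bad value or part of a genuine tail.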

Greg

Star Strider 2015-11-19


My pleasure.

A data set with n > 30 that is otherwise t-distributed will approximate a normal distribution, but you would have to look at your data to see whether they actually do. If you have any doubts about the distribution, I would use one of the histogram functions and, if you have the Statistics Toolbox, the histfit function.

The most reliable way to determine if your data are normally distributed is to use the Statistics Toolbox Kolmogorov-Smirnov test, implemented in the kstest function. Another related test for the normal and other distributions is the Lilliefors test, implemented in the lillietest function.
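The two tests could be called along these lines. This is a sketch on a synthetic, deliberately non-normal sample. One detail worth knowing: by default kstest compares the data against the standard normal distribution, so the data are standardized first, whereas lillietest allows for the mean and standard deviation being estimated from the sample.

```matlab
rng(2);                                   % reproducible synthetic example
x = -log(rand(200,1));                    % hypothetical exponential (non-normal) sample

% kstest tests against the standard normal by default, so standardize first
hKS = kstest(zscore(x));                  % 1 = reject normality at the 5% level

% lillietest accounts for parameters estimated from the data
hLF = lillietest(x);                      % 1 = reject normality at the 5% level
```

A result of 0 means only that the test did not reject normality at the chosen level, not that the data are proven normal.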

If you don’t have the Statistics Toolbox, one simple test is to see if the median approximates the mean. It should for normally-distributed data, but generally will not for skewed distributions. (I leave the interpretation of ‘approximates’ to you, in the context of your data. They should be virtually the same for normally-distributed data.) You can also use the randn function with the mean and std of your data, then use a histogram function to compare them. The randn call would be (with ‘data’ being your data):

% Simulate normally distributed data with the same mean and std as 'data'
data_mean = mean(data);
data_std  = std(data);
data_sim  = data_mean + data_std*randn(size(data));   % same size as 'data'
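The comparison itself might be sketched as follows, using a synthetic skewed sample in place of real data. The sample, its size, and the name gap are hypothetical, and histogram requires R2014b or later.

```matlab
rng(3);                                   % reproducible synthetic example
data = -log(rand(89,1));                  % hypothetical skewed sample standing in for your data
data_sim = mean(data) + std(data)*randn(size(data));  % normal sample, same mean and std

histogram(data); hold on                  % actual data
histogram(data_sim); hold off             % simulated normal data
legend('data', 'simulated normal');

% Mean-versus-median check, expressed in units of the standard deviation
gap = abs(mean(data) - median(data)) / std(data);
```

If the two histograms have visibly different shapes, or gap is far from zero, the normal approximation is doubtful.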

If your data turn out to be normally-distributed, you can certainly use the z-score reliably to scale them or test them with respect to detecting outliers. In the limit (which is to say a huge number of observations), the CLT would certainly apply. However N=89 is not huge, so you will have to analyse your data and see how they are distributed.


Answer 2

Greg Heath 2015-11-18

2
Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/255870-how-to-check-and-remove-outliers-when-it-is-non-normal-distribution#answer_200385

Regardless of the distribution, I find that a combination of zscore with plots of original and transformed data is sufficient for me to detect outliers. Whether points are deleted or replaced by a reduced value depends on how I interpret the plots.
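One way to look at z-scores on both the original and a transformed scale is sketched below. A plain log transform is used as a stand-in for whatever transformation suits the data (it is not the Box-Cox or Johnson transformation from the question). The data and the threshold of 3 are assumptions for illustration.

```matlab
rng(4);                                   % reproducible synthetic example
x  = exp(randn(150,1));                   % hypothetical positive, right-skewed data
xt = log(x);                              % log transform (stand-in transformation)

zRaw = zscore(x);                         % z-scores on the original scale
zLog = zscore(xt);                        % z-scores after the transform

subplot(2,1,1); plot(zRaw, '.'); ylabel('z (original)');
subplot(2,1,2); plot(zLog, '.'); ylabel('z (log)');

flagRaw = abs(zRaw) > 3;                  % flagged on the original scale
flagLog = abs(zLog) > 3;                  % flagged after the transform
```

Points flagged on one scale but not the other are the ones where the plots, rather than the threshold, have to settle the question.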

If you have doubts you can always make multiple models based on original and modified data.

Hope this helps.

Thank you for formally accepting my answer

Greg


J1 2015-11-19

If we find that there are outliers, should I look for more variables to predict my output? For example, I use weather data to predict the sales of a product, and I found that an outlier was due to a promotion or some other cause. Should I add these causes as new variables in the neural network to make the prediction?

Greg Heath 2015-11-19

Outliers are usually isolated points that are the result of bad measurements or bad transcriptions. Therefore they should be removed. However, if you plot the data, very often you can guess the approximate true value of the measurement. Then you have the option of replacing the outlier with the approximation.
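Replacing an isolated outlier with an approximation could be sketched as below, assuming the bad point has already been identified by inspecting a plot and that linear interpolation between its neighbours is an acceptable guess. The series and variable names are hypothetical.

```matlab
y = [10 11 12 95 14 15 16]';              % hypothetical series; 95 is a planted bad reading
t = (1:numel(y))';                        % sample index

bad  = (t == 4);                          % bad point marked by hand after plotting
yFix = y;
% Estimate the bad value from its neighbours by linear interpolation
yFix(bad) = interp1(t(~bad), y(~bad), t(bad), 'linear');
```

Here the value 95 is replaced by 13, the midpoint of its neighbours 12 and 14. All other points are left unchanged.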
