Outlier removal from a matrix

I removed the outliers from my dataset with the rmoutliers(A,'mean') command. It should remove data more than 3 standard deviations from the mean of each column. But when I plot the histogram of each column, some data are still as far as 6 standard deviations away. What do you suggest? Here is my code:
A = rmoutliers(table_data,'mean');
Zscores = zscore(A); % A is a 50000-by-12 matrix
figure
histogram(Zscores(:,2))
In the histogram, there are still some data as far as 6 standard deviations away.
1 Comment
John D'Errico 2022-10-11
help rmoutliers
 RMOUTLIERS   Remove outliers from data

    B = RMOUTLIERS(A) detects and removes outliers from data. A can be a vector, matrix, table, or timetable. If A is a vector, RMOUTLIERS removes the entries detected as outliers. If A is a matrix or a table, RMOUTLIERS detects outliers for each column and then removes the rows containing outliers.

    B = RMOUTLIERS(A,METHOD) specifies the method used to determine outliers. METHOD must be one of the following: 'median' (default), 'mean', 'quartiles', 'grubbs', or 'gesd'.

    B = RMOUTLIERS(A,'percentiles',[LP UP]) detects as outliers all elements which are below the lower percentile LP and above the upper percentile UP. LP and UP must be scalars between 0 and 100 with LP <= UP.

    B = RMOUTLIERS(A,MOVMETHOD,WL) uses a moving window method to determine contextual outliers instead of global outliers. MOVMETHOD can be 'movmedian' or 'movmean'.

    B = RMOUTLIERS(...,DIM) reduces the size of A along the dimension DIM. Use DIM = 1 to remove rows and DIM = 2 to remove columns. RMOUTLIERS(A,DIM) first calls ISOUTLIER(A) to detect outliers.

    B = RMOUTLIERS(...,'MinNumOutliers',N) removes rows (columns) that contain at least N outliers. N must be an integer. By default, N = 1.

    B = RMOUTLIERS(...,'ThresholdFactor',P) modifies the outlier detection thresholds by a factor P. See the documentation for more information.

    B = RMOUTLIERS(...,'SamplePoints',X) specifies the sample points X representing the location of the data in A for the moving window methods 'movmedian' and 'movmean'. If the first input A is a table, X can also specify a table variable in A.

    B = RMOUTLIERS(...,'MaxNumOutliers',MAXN) specifies the maximum number of outliers for the 'gesd' method only.

    B = RMOUTLIERS(...,'OutlierLocations',INDOUTLIER) specifies the outlier locations according to the logical array INDOUTLIER. Elements of INDOUTLIER that are true indicate outliers in the corresponding element of A. INDOUTLIER must have the same size as A.

    [B,INDRM,INDOUTLIER,LTHRESH,UTHRESH,CENTER] = RMOUTLIERS(...) also returns a logical column (row) vector INDRM indicating which rows (columns) of A were removed, a logical array INDOUTLIER indicating the location of the detected outliers, and the lower threshold, upper threshold, and center value used by the outlier detection method.

    Arguments supported only for table inputs:

    B = RMOUTLIERS(...,'DataVariables',DV) removes rows (variables) according to outliers in table variables DV. The default is all table variables in A. DV must be a table variable name, a cell array of table variable names, a vector of table variable indices, a logical vector, a function handle that returns a logical scalar (such as @isnumeric), or a table vartype subscript.

    Examples:

      % Remove outliers from a vector
      a = [1 2 1000 3 4 5]
      b = rmoutliers(a)

      % Remove only the rows which contain at least 2 outliers
      A = [[1 2 1000 3 4 5]', [1 2 1000 3 4 1000]']
      [B,removedRows,outlierLocs] = rmoutliers(A,'MinNumOutliers',2)

    See also ISOUTLIER, FILLOUTLIERS, RMMISSING, ISMISSING, FILLMISSING

    Documentation for rmoutliers
       doc rmoutliers

    Other uses of rmoutliers
       gpuArray/rmoutliers
       tall/rmoutliers
I had to go to the doc to check your claim that rmoutliers with the 'mean' option specifically uses 3 standard deviations away from the mean as the cutoff, and that it then removes the entire row containing that outlier. This is true. But rmoutliers is not a perfect tool, and any such tool can have problems if you dare to push its limits.
x = [ones(1,5),1 + eps,10]
x = 1×7
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 10.0000
xhat = rmoutliers(x)
xhat = 1×5
1 1 1 1 1
xhat == 1
ans = 1×5 logical array
1 1 1 1 1
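Here rmoutliers was called without a method, so it used the default 'median' method, which flags elements more than three scaled median absolute deviations (MAD) from the median. Because five of the seven elements are exactly 1, the median is 1 and the MAD is exactly zero. So not only is the 10 removed, but 1+eps is ALSO more than 3 (zero-width) deviations out, and counts as a clear outlier. The point is, if you try hard enough, you can always cause any such adaptive tool to exhibit strange behavior.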
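The zero spread in this example can be checked by hand. Note that the call above passes no method, so per the help text the default 'median' method applies. The sketch below assumes the documented behavior of isoutlier (elements more than three scaled MAD from the median are flagged); the variable names are only illustrative.

```matlab
% Same vector as above: five exact ones, one value a single eps above 1, and 10.
x = [ones(1,5), 1 + eps, 10];

% Default method ('median'): the spread estimate is the median absolute
% deviation (MAD) from the median, multiplied by a constant scale factor.
med    = median(x);                 % exactly 1
rawMAD = median(abs(x - med));      % exactly 0: five of seven deviations are 0

% With zero spread, both thresholds collapse onto the median, so every
% element that differs from 1 at all is flagged, including 1 + eps.
tfMedian = isoutlier(x);            % default 'median' method

% For contrast, the 'mean' method on the same vector flags nothing: with only
% seven points, the value 10 lies fewer than 3 standard deviations out.
tfMean = isoutlier(x, 'mean');
zOfTen = (10 - mean(x)) / std(x);   % 6/sqrt(7), about 2.27

disp(tfMedian)
disp(tfMean)
```

So the choice of method matters a great deal on degenerate data like this: the same vector yields two outliers under 'median' and none under 'mean'.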
But if you want to know what happened, then you need to provide your data. Otherwise, anything is just a wild guess.
Attach it to a comment (not as an answer), in a .mat file.
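Until the data is posted, one thing worth checking is whether the 6 sigma values come from recomputing the z-scores after the removal. rmoutliers makes a single pass using the mean and standard deviation of the original data, whereas z-scores computed on the cleaned data use a new mean and a smaller standard deviation. The sketch below shows the effect on a small synthetic vector. The data is made up for illustration and is not the asker's data, so this is only a possible explanation, not a diagnosis.

```matlab
% Synthetic column (an assumption, not the asker's data): 98 zeros, two
% moderate values at +/-4, and one extreme value at 100.
x = [zeros(98,1); 4; -4; 100];

% The 'mean' method measures distance using the ORIGINAL mean and std.
mu0     = mean(x);
sigma0  = std(x);
zBefore = (x - mu0) ./ sigma0;      % only the value 100 exceeds 3 here

% Single pass of outlier removal: only the value 100 is dropped.
xClean = rmoutliers(x, 'mean');

% Z-scores recomputed on the cleaned data use a new, much smaller std.
% (Written out by hand; zscore(xClean) computes the same quantity.)
zAfter = (xClean - mean(xClean)) ./ std(xClean);

fprintf('std before: %.3f, std after: %.3f\n', sigma0, std(xClean));
fprintf('max |z| of the first 100 points, original scale: %.3f\n', ...
    max(abs(zBefore(1:100))));
fprintf('max |z| after removal, recomputed scale: %.3f\n', max(abs(zAfter)));
```

In this toy case the values at +/-4 sit well inside 3 standard deviations of the original data, yet land around 7 standard deviations out once the extreme value is gone and the standard deviation shrinks. If something similar is happening in the real data, comparing the cleaned values against the original mean and standard deviation would show nothing beyond 3.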


Answers (0)
