finding extreme values in a panel data set

Question

salva 2012-7-22

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/44230-finding-extreme-values-in-a-panel-data-set

采纳的回答： Image Analyst

在 MATLAB Online 中打开

Dear Matlab experts

I have

A={

    [1]    ' '              [NaN]     [NaN]     [NaN]     
    [1]    '30/11/08'    [11.1447]    [0.2117]    [463.1360]
    [1]    '28/12/08'    [11.7129]    [0.2209]    [436.6316]
    [1]    '25/01/09'    [11.5152]    [0.2212]    [441.5430]
    [1]    '22/02/09'    [11.4854]    [0.2201]    [453.7015]
    [1]    '22/03/09'    [NaN]       [0.2185]    [461.3925]
    [1]    '19/04/09'    [10.9700]    [0.2104]    [486.4095]
    [1]    '17/05/09'    [11.4846]    [0.2162]    [451.4833]
    [1]    '14/06/09'    [10.9967]    [0.2158]    [475.8620]
    [1]    '12/07/09'    [11.1990]    [0.2211]    [449.4574]
    [2]    ' '             [NaN]     [NaN]     [NaN]     
    [2]    '30/11/08'    [10.8638]    [0.2175]    [472.2659]
    [2]    '28/12/08'    [11.0897]    [0.2182]    [479.3408]
    [2]    '25/01/09'    [11.0996]    [0.2286]    [442.6719]
    [2]    '22/02/09'    [11.0473]    [0.2211]    [481.4548]
    [2]    '22/03/09'    [11.1151]    [0.2259]    [468.2757]
    [2]    '19/04/09'    [10.9865]    [0.2300]    [461.5581]
    [2]    '17/05/09'    [10.7931]    [0.2259]    [487.6257]
    [2]    '14/06/09'    [10.5845]    [0.2200]    [529.8777]
    [2]    '12/07/09'    [10.8926]    [0.2315]    [433.3039]
    [3]    ' '             [NaN]     [NaN]     [NaN]     
    [3]    '30/11/08'    [10.7658]    [0.2339]    [487.7729]
    [3]    '28/12/08'    [10.5227]    [0.2305]    [505.4553]
    [3]    '25/01/09'    [10.6700]    [0.2279]    [480.0745]
    [3]    '22/02/09'    [10.7632]    [NaN]    [518.5076]
    [3]    '22/03/09'    [10.5065]    [0.2271]    [470.0111]
    [3]    '19/04/09'    [10.6112]    [0.2215]    [483.5394]
    [3]    '17/05/09'    [10.5855]    [0.2190]    [497.3172]
    [3]    '14/06/09'    [10.3085]    [0.2158]    [515.7492]
    [3]    '12/07/09'    [10.4475]    [0.2211]    [482.3380]
}

A is a panel data set that shows the evolution of some variables (columns 3-5) over time (column 2) for 3 individuals (column 1).

I would like

1) to detect outliers (their position and their corresponding values) in columns 3, 4 and 5 where an outlier is defined as a “a value that it is larger than (or equal to) 3 standard deviations over the mean”.

2) plot a diagram for each of the columns 3, 4 and 5 where outliers will stand out with, say, a red color

Could I do that with some matlab code?

Thank you very much

just to mention that my real A is huge where I have 400 individuals.

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

Image Analyst 2012-7-22

编辑：Image Analyst 2012-7-22

That's far, far from huge. Do you want this done separately for each group listed in column 1? It looks complicated by the fact that not only do NaNs separate the groups listed in column 1, but there are random Nan's scattered about.

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Image Analyst 2012-7-22

1
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/44230-finding-extreme-values-in-a-panel-data-set#answer_54237

编辑：Image Analyst 2012-7-22

在 MATLAB Online 中打开

What have you tried so far? Did you try anything along the lines of

column3 = cell2mat(A(:,3))
goodIndexes = ~isnan(column3)
goodColumn3 = column3(goodIndexes)
mean3 = mean(goodColumn3)
std3 = std(goodColumn3)
lowerLimit = mean3 - std3
upperLimit = mean3 + std3
outliers = column3 < lowerLimit | column3 > upperLimit
outlierIndexes = find(outliers);
normalIndexes = find(~outliers);
x = (1:length(column3))';
stem(x(normalIndexes), column3(normalIndexes), 'b');
hold on;
grid on;
stem(x(outlierIndexes), column3(outlierIndexes), 'r');
% Enlarge figure to full screen.
set(gcf, 'units','normalized','outerposition',[0 0 1 1]);

and so on for the other columns?

6 个评论
显示 4更早的评论隐藏 4更早的评论

Image Analyst 2012-7-22

在 MATLAB Online 中打开

Well you probably already did this but you could adapt it to find the groups, then process each group:

column3 = cell2mat(A(:, 3))
column4 = cell2mat(A(:, 4))
column5 = cell2mat(A(:, 5))
badIndexes3 = isnan(column3)
badIndexes4 = isnan(column4)
badIndexes5 = isnan(column5)
separatorIndexes = badIndexes3 & badIndexes4 & badIndexes5;
numberOfGroups = length(separatorIndexes)
% Tack on one more.
separatorIndexes = find([separatorIndexes; true]);
% Preallocate cells for outliers.
% Cells because there are different numbers of outliers for each group.
outliers = cell(1,1);
outlierIndexes = cell(1,1);
normalIndexes = cell(1,1);
outlierValues3 = cell(1,1);
for group = 1 : numberOfGroups
  % Go inbetween the separators.
  row1 = separatorIndexes(group)+1;
  row2 = separatorIndexes(group+1) - 1
  extractedGroup3 = column3(row1:row2);
  goodIndexes3 = ~isnan(extractedGroup3)
  goodColumn3 = extractedGroup3(goodIndexes3)
  mean3(group) = mean(extractedGroup3)
  std3(group) = std(extractedGroup3)
  lowerLimit(group) = mean3 - std3
  upperLimit(group) = mean3 + std3
  badLogicalIndexes = extractedGroup3 < lowerLimit | extractedGroup3 > upperLimit;
  outliers(group) = {badLogicalIndexes};
  outlierIndexes(group) = {find(badLogicalIndexes)};
  normalIndexes(group) = {find(~badLogicalIndexes)};
  outlierValues3(group) = extractedGroup3(badLogicalIndexes)
end

But you probably already know that mean+3*SD is not a good method for finding outliers for small groups because a single outlier could really affect the true mean A LOT. There are better methods, such as the Minimum Absolute Deviation (MAD, http://en.wikipedia.org/wiki/Median_absolute_deviation)

Image Analyst 2012-7-22

You first calculate the MAD. Then you multiply that by a factor that says how many of those you need to consider something an outlier, just like you chose 3 for sigma to say that if it were more than 3 sigma away from the mean it's an outlier. Same thing except use MAD instead of the standard deviation sigma. Like, if that data point's value is more than 2 (or 3 or 4 or whatever suits your situation) MADs away from the mean, then that data point is an outlier.

salva 2012-7-22

thanks imager

请先登录，再进行评论。

finding extreme values in a panel data set

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

采纳的回答

6 个评论
显示 4更早的评论隐藏 4更早的评论

更多回答（0 个）

另请参阅

类别

标签

Community Treasure Hunt

finding extreme values in a panel data set

1 个评论 显示 -1更早的评论隐藏 -1更早的评论

采纳的回答

6 个评论 显示 4更早的评论隐藏 4更早的评论

更多回答（0 个）

另请参阅

类别

标签

Community Treasure Hunt

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

6 个评论
显示 4更早的评论隐藏 4更早的评论