Draw a curve defining the bounds of a scatter plot to detect anomalous (sparse or low density) points

25 次查看(过去 30 天)
Dear all, I have thousands of x,y data paris of observed and theoretical temperatures in a given region, where I need to identify some anomalous (for the purpose of my work) data points. I have tried PCA analysys (i.e., DBSCAN), statistical thresholds, curve fittings, etc., but none of them did work. I then tried a KDE-based approach, which seems to 'better' work for my data. To achieve the results shown below I used https://it.mathworks.com/matlabcentral/fileexchange/8430-flow-cytometry-data-reader-and-visualization as:
[hAxes,col,ctrs1,ctrs2,F] = dscatter(X,Y,'BINS',[250,250]);
% Please note, I modified the code to export hAxes,col,ctrs1,ctrs2 and plot
% the contour line as:
contour(ctrs1,ctrs2,F,0.0015,'k-'); % or 0.015 ---> See figure below
By using this apporach I was able to draw a contour line around the main cluster of my data, and I was mostly able to draw the line between 'main cluster' and 'anomalous' points. However, if I have too many 'anomalous' data points (see xy4 in the figure below), the method, using a fixed threshold, fails. Beside, I have to adjust the threshold based on the region of each x,y pair and I have no idea on how to find the correct threshold level (see the figure below - For the same region, a threshold level of 0.0015 seems to work for situations with a few anomalous points, but a threshold of 0.015 is needed when more spreaded points occurred).
What I would really like to do is to draw a curve around the main cluster of my scatterplot, so to devide the anomalous data points. I fully understand this may be a challenging task, but I hope you may provide some good alternatives and/or solutions.
Another solution, as it seems to work, may be defining automatically the density threshold level, but I don't really know where to start from.
Below, you can see 4 examples (xy 1 to 4 are also attached as .mat files). To the left is the simple x,y scatterplot. In the middle, the ideal curve (in red) defining the boundary of my scatterplot (manually sketched). To the right, the KDE-based approach discussed above. Please note the last figure where a threshold of 0.0015 fails, and a threshld of 0.015 is needed instead.
Any help is grately appreciated!
Any other approach to identify the points outside the red boundaries is more than wellcome!
Thanks a lot in advance

采纳的回答

Matt J
Matt J 2023-9-13
编辑:Matt J 2023-9-13
Sets=["xy1" "xy2" "xy3" "xy4"];
Factors=[20,20,5,5];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
p=envelope(xy,Factors(i));
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
Warning: Polyshape has duplicate vertices, intersections, or other inconsistencies that may produce inaccurate or unexpected results. Input data has been modified to create a well-defined polyshape.
function p=envelope(xy,spreadFactor)
[D,I]=pdist2(xy,xy,'euclidean','Smallest',5);
Dmax=max(D,[],1);
d=median(D(:))*spreadFactor;
XY=xy(Dmax<=d,:);
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
  11 个评论
Simone A.
Simone A. 2023-9-16
编辑:Simone A. 2023-9-16
@Matt J Absolutely amazing! This code is performing extremely well for most of my data distributions and, with a couple of tweaks, such as adding a smooth (as kindly suggested by @Mathieu NOE - see below for future reference), I am mostly achieving what I needed! Thank you very much indeed!!!
As a last thing, could you kindly provide a brief summary on the rationale and on what this code (thus pdist2 function) is actually doing, or better yet, how it determines 'where to draw a boundary'?
Especially, why is
D2=max(pdist2(xy,xy,'euclidean','Smallest', 8),[],1);
running 3 times, keeping the values >1? Is this related to an initial guess of the pdist2 function, which may return different results at every run? What is 8 defining (in the code you wrote it was 10, but I found that 8 better adapts to different distributions)?
Also, do you have any suggestions for further reading so that I can better understand the approach? It would be greatly appreciated!
Thanks a lot for your time, patience, and consideration!
xy= [x,y];
p=envelope(xy);
p = polyshape(p.Vertices(:,1), p.Vertices(:,2));
p = polybuffer(p,0.25);
function [p,XY]=envelope(xy)
for i=1:3
D2=max(pdist2(xy,xy,'euclidean','Smallest', 8),[],1);
xy(D2>1,:)=[];
end
XY=xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
Matt J
Matt J 2023-9-16
编辑:Matt J 2023-9-16
As a last thing, could you kindly provide a brief summary on the rationale and on what this code (thus pdist2 function) is actually doing, or better yet, how it determines 'where to draw a boundary'?
Well, pdist2 finds the 8 nearest neighbors of every point (x0,y0) and if any of the neighbors is at a distance of >1 than (x0,y0) is classified as belonging to the sparse region, and is discarded from the list. Meanwhile, points well-nested in the dense region will survive this elimination process. Ultimately, the idea is that all the sparse points will be stripped away. Some points within the dense central cloud may also be stripped away, but the expectation is that enough points there will survive so that the boundaries of the central cloud can be well-delineated by the boundary command.
To determine the threshold of >1, I had to read your mind :-). It seemed to me from your initial post and where you were drawing the boundaries, that your criterion for classifying points as an outlier was that they were spaced apart by >1 on average.
Especially, why is D2=max(__); running 3 times,
In the cloud of sparse outlying points, you may have a small, isolated clump of dense points, separate from the main lobe. By iterating a few times, it will eat away at the edges of those clumps, eventually consuming them.
Also, do you have any suggestions for further reading so that I can better understand the approach?
I don't. This is a very improvised clustering strategy, and one that is very customized to your specific data sets. I'm sure you will find much better algorithms if you really dive into the literature on clustering. If you do come across this one in the literature, though, I would be interested to know!

请先登录,再进行评论。

更多回答(1 个)

Mathieu NOE
Mathieu NOE 2023-9-13
hello
maybe this ?
I simply decided that I wanted the contour line corresponding to a level 5% of F range above the min value of F
Result :
[hAxes,col,ctrs1,ctrs2,F] = dscatter(xy4(:,1),xy4(:,2),'BINS',[250,250]);
colorbar('vert');
hold on
lb = min(F,[],'all');
ub = max(F,[],'all');
threshold = lb+0.05*(ub-lb); % threshold is 5% of data range above min value
contour(ctrs1,ctrs2,F,[threshold threshold],'k-'); %
hold off
  9 个评论
Mathieu NOE
Mathieu NOE 2023-9-18
No problem
I was conscient that my proposal is not as good as @Matt Jone. He's done an amazing job
I just need now to find an alternative to pdist2 as I don't have the relevant toolbox
Mathieu NOE
Mathieu NOE 2023-9-18
so finally (and for my own fun) I tried some variants to replace pdist2
so for those like me that lack the right toolbox , here two codes that do the same process .
the differences are minor, one keeps valid points and the other one remove invalid points like Matt did in his code.
One thing I removed is the 3 iterations in the envelope function ; I didn't see any benefit of doing this , but maybe it's specific to my own code
solution 1
Sets=["xy1" "xy2" "xy3" "xy4"];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
[p,XY]=myenvelope(xy);
p = polybuffer(p,0.25);
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [p,XY]=myenvelope(xy)
% D2=max(pdist2(xy,xy,'euclidean','Smallest', 10),[],1);
% the one line engine above is replaced by the following 8 lines of code
x = xy(:,1);
y = xy(:,2);
D2 = sqrt((x-x').^2 + (y-y').^2);
D2(D2 <eps ) = NaN;% replace diag elements that are 0 by NaN's
% find cols where we can have n neigbours within distance of 1
D2 = sort(D2,'ascend');
D2 = D2(1:10,:); % keep the n smallest values
f = all(D2<=1); % keep those points
xy=xy(f,:);
XY = xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
solution 2
Sets=["xy1" "xy2" "xy3" "xy4"];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
[p,XY]=myenvelope(xy);
p = polybuffer(p,0.25);
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [p,XY]=myenvelope(xy)
% D2=max(pdist2(xy,xy,'euclidean','Smallest', 10),[],1);
% the one line engine above is replaced by the following 8 lines of code
x = xy(:,1);
y = xy(:,2);
D2 = sqrt((x-x').^2 + (y-y').^2);
D2(D2 <eps ) = NaN;% replace diag elements that are 0 by NaN's
% remove rows where we haven't n neigbours within distance of 1
D2 = sort(D2,'ascend');
D2 = D2(1:10,:); % keep the n smallest values
D2=max(D2,[],1);
xy(D2>1,:)=[];
XY = xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end

请先登录,再进行评论。

产品


版本

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by