Why are negative Gap statistic values, returned by evalclusters, allowed as a solution?

Dear fellow Matlab users and developers,
While trying to determine the optimal number of clusters for a data set, I wondered why negative Gap values are sometimes accepted as the solution by the evalclusters function of the Statistics Toolbox.
Example:
data = [[25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46];...
[79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]]';
% get optimal number of cluster
eva1 = evalclusters(data,'kmeans','Gap','KList',[1:5],'SearchMethod','firstMaxSE');
figure()
plot(eva1)
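The original implementation of the evalclusters function (more precisely, of the GapEvaluation class) selects the optimal number of clusters by the criterion $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$. The Gap value is defined as $\mathrm{Gap}_n(k) = E_n^*\{\log(W_k)\} - \log(W_k)$.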
This is in accordance with the original paper provided by
Tibshirani, R.;Walther, G. and Hastie, T., 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B 63: pp. 411–423.
For the example above this is true. The Gap value of two clusters minus the error would be below the Gap value of one cluster.
However, if the Gap value is negative, the $\log(W_k)$ curve is still above the reference curve $E_n^*\{\log(W_k)\}$.
The original paper further states: "Our estimate of the optimal number of clusters is then the value of k for which $\log(W_k)$ falls the farthest below this reference curve."
In my interpretation, negative Gap values should not be allowed as a solution, since the "below the reference curve" condition is not fulfilled.
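To spell out the step behind this interpretation, here is the inequality written out. It assumes the notation of Tibshirani et al., with $W_k$ the pooled within-cluster dispersion and $E_n^*$ the expectation under the reference distribution:

```latex
\mathrm{Gap}_n(k) = E_n^*\{\log(W_k)\} - \log(W_k) < 0
\quad\Longleftrightarrow\quad
\log(W_k) > E_n^*\{\log(W_k)\}
```

So a negative Gap value at some k means the observed curve lies above, not below, the reference curve at that k.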
The current implementation in GapEvaluation.m reads as follows:
if ( isempty(nextValid) ...
|| this.CriterionValues(j) >=...
(this.CriterionValues(j+nextValid)-this.SE(j+nextValid)))
this.OptimalK = NC;
this.OptimalY = IDX;
end
A further condition, this.CriterionValues(j)>0, should be added to the if statement. This would ensure that the actual $\log(W_k)$ is below $E_n^*\{\log(W_k)\}$.
if (( isempty(nextValid) ...
|| this.CriterionValues(j) >=...
(this.CriterionValues(j+nextValid)-this.SE(j+nextValid)))) && this.CriterionValues(j)>0
this.OptimalK = NC;
this.OptimalY = IDX;
end
For the example above, this would give three as the optimal number of clusters.
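As a minimal standalone sketch of this selection rule outside the class, the loop below applies the 'firstMaxSE' test together with the extra positivity condition. The vectors gap and se are hypothetical numbers chosen to mimic the situation described (they are not the output of evalclusters on the data above), and the sketch assumes every k has a valid criterion value, so the nextValid bookkeeping of GapEvaluation.m is left out.

```matlab
% Hypothetical Gap values and standard errors for k = 1..5
% (illustrative numbers only, not computed from the example data).
kList = 1:5;
gap   = [-0.20 -0.25 0.15 0.10 0.12];
se    = [ 0.03  0.04 0.03 0.04 0.05];

optimalK = NaN;                        % stays NaN if no k qualifies
for j = 1:numel(kList)
    isLast   = (j == numel(kList));
    % firstMaxSE test: Gap(k) >= Gap(k+1) - SE(k+1); the last k has no successor
    passesSE = isLast || gap(j) >= gap(j+1) - se(j+1);
    % proposed extra condition: only accept positive Gap values
    if passesSE && gap(j) > 0
        optimalK = kList(j);
        break                          % take the first k that qualifies
    end
end
disp(optimalK)
```

With these numbers, k = 1 passes the standard-error test but has a negative Gap value, so it is skipped and the loop settles on k = 3. Without the gap(j) > 0 check, the same loop would stop at k = 1.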
Would you agree or am I missing something important?
And how should the result be treated if all Gap values are negative?
Cheers

Accepted Answer

James Hong, 2020-4-27
I have forwarded Matthias's points to our development team, and they agree. We will be looking into this further.
For the second question:
Per Matthias's suggestion in an email written to us, when all GAP values are negative, the optimal number of clusters should be k=1, indicating no clustering. I think this is a reasonable suggestion.
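A rough sketch of that fallback, applied as a post-processing step, could look like the following. The gap vector is hypothetical. With a real result, the values would come from the CriterionValues property of the evaluation object. This is only an illustration of the suggestion, not the toolbox's behavior.

```matlab
% Hypothetical Gap values for k = 1..5, all negative (illustrative only).
kList = 1:5;
gap   = [-0.30 -0.22 -0.41 -0.35 -0.28];

optimalK = [];                 % left empty unless the fallback applies
if all(gap < 0)
    % No k puts log(W_k) below the reference curve: report "no clustering".
    optimalK = 1;
end
disp(optimalK)
```

If at least one Gap value is positive, optimalK stays empty here and the usual selection rule would be applied instead.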
Thanks again for your feedback!
