# GapEvaluation

Gap criterion clustering evaluation object

## Description

`GapEvaluation` is an object consisting of sample data (`X`), clustering data (`OptimalY`), and gap criterion values (`CriterionValues`) used to evaluate the optimal number of clusters (`OptimalK`). The gap criterion values correspond to the difference `ExpectedLogW - LogW`, where *W* is the within-cluster dispersion, `ExpectedLogW` is determined by Monte Carlo sampling from a reference distribution, and `LogW` is computed from the sample data. The optimal number of clusters corresponds to the solution with the largest local or global gap value within a tolerance range (`SearchMethod`). For more information, see Gap Value.

## Creation

Create a gap criterion clustering evaluation object by using the `evalclusters` function and specifying the criterion as `"gap"`.

You can then use `compact` to create a compact version of the gap criterion clustering evaluation object. The function removes the contents of the properties `X`, `OptimalY`, and `Missing`.
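The following sketch illustrates the creation and compaction steps described above. The data set, the `KList` value, and the variable names are assumptions made for the example.

```matlab
load fisheriris
rng("default") % For reproducibility
evaluation = evalclusters(meas,"kmeans","gap","KList",1:4);

% Create a compact copy; X, OptimalY, and Missing are emptied
compactEvaluation = compact(evaluation);

% The data-dependent properties are empty in the compact object ...
isXEmpty       = isempty(compactEvaluation.X)
isYEmpty       = isempty(compactEvaluation.OptimalY)
isMissingEmpty = isempty(compactEvaluation.Missing)

% ... while the evaluation results are still available
compactEvaluation.OptimalK
```

A compact object is useful when you need only the evaluation results and want to avoid storing the sample data with them.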

## Properties

### Clustering Evaluation Properties

`ClusteringFunction`

— Clustering algorithm

`'kmeans'` | `'linkage'` | `'gmdistribution'` | function handle

This property is read-only.

Clustering algorithm used to cluster the sample data, returned as `'kmeans'`, `'linkage'`, `'gmdistribution'`, or a function handle.

Value | Description |
---|---|
`'kmeans'` | Cluster the data in `X` using the `kmeans` clustering algorithm, with `EmptyAction` set to `"singleton"` and `Replicates` set to `5`. |
`'linkage'` | Cluster the data in `X` using the `clusterdata` agglomerative clustering algorithm, with `Linkage` set to `"ward"`. |
`'gmdistribution'` | Cluster the data in `X` using the `gmdistribution` Gaussian mixture distribution algorithm, with `SharedCov` set to `true` and `Replicates` set to `5`. |

**Data Types:** `char` | `function_handle`
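A minimal sketch of the function-handle case follows. The handle name `myCluster` is hypothetical, and the sketch assumes that `evalclusters` accepts a handle of the form `C = clustfun(DATA,K)` that returns a vector of cluster indices. The value `"B",20` is chosen only to keep the run short.

```matlab
load fisheriris
rng("default") % For reproducibility

% Hypothetical custom clustering function: takes the data and the number
% of clusters, and returns a vector of cluster indices
myCluster = @(X,K) kmeans(X,K,"EmptyAction","singleton","Replicates",5);

% Use the handle in place of a named clustering algorithm
evaluation = evalclusters(meas,myCluster,"gap","KList",1:4,"B",20);

% The property reports how the data was clustered
class(evaluation.ClusteringFunction)
```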

`CriterionName`

— Name of criterion

`'Gap'`

This property is read-only.

Name of the criterion used for clustering evaluation, returned as `'Gap'`.

`CriterionValues`

— Criterion values

numeric vector

This property is read-only.

Criterion values, returned as a numeric vector. Each value corresponds to a proposed number of clusters in `InspectedK`.

**Data Types:** `double`

`Distance`

— Distance metric

`'sqEuclidean'` | `'Euclidean'` | `'cityblock'` | `'cosine'` | `'correlation'` | function handle

This property is read-only.

Distance metric used for clustering data and computing the criterion values, returned as one of the values in this table or a function handle.

Value | Description |
---|---|
`'sqEuclidean'` | Squared Euclidean distance |
`'Euclidean'` | Euclidean distance |
`'cityblock'` | Sum of absolute differences |
`'cosine'` | One minus the cosine of the included angle between points (treated as vectors) |
`'correlation'` | One minus the sample correlation between points (treated as sequences of values) |

**Data Types:** `char` | `function_handle`
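The sketch below shows one way the distance metric might be chosen when the object is created, assuming the `"Distance"` name-value argument of `evalclusters`. The data set, `KList`, and `B` values are illustrative assumptions.

```matlab
load fisheriris
rng("default") % For reproducibility

% Cluster with kmeans and compute the dispersion W using the
% city block (sum of absolute differences) metric
evaluation = evalclusters(meas,"kmeans","gap", ...
    "KList",1:4,"Distance","cityblock","B",20);

% The metric used for clustering and for the criterion values
evaluation.Distance
```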

`InspectedK`

— List of number of proposed clusters

positive integer vector

This property is read-only.

List of the number of proposed clusters for which to compute criterion values, returned as a positive integer vector.

**Data Types:** `double`

`OptimalK`

— Optimal number of clusters

positive integer scalar

This property is read-only.

Optimal number of clusters, returned as a positive integer scalar.

**Data Types:** `double`

`OptimalY`

— Optimal clustering solution

positive integer column vector | `[]`

This property is read-only.

Optimal clustering solution corresponding to `OptimalK`, returned as a positive integer column vector. Each row of `OptimalY` represents the cluster index of the corresponding observation (or row) in `X`. If you specify the clustering solutions as an input argument to `evalclusters` when you create the clustering evaluation object, or if the clustering evaluation object is compact (see `compact`), then `OptimalY` is empty.

**Data Types:** `double`

`SearchMethod`

— Method for selecting optimal number of clusters

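`'globalMaxSE'` | `'firstMaxSE'`

This property is read-only.

Method for selecting the optimal number of clusters, returned as `'globalMaxSE'` or `'firstMaxSE'`.

Value | Description |
---|---|
`'globalMaxSE'` | Evaluate each proposed number of clusters in `InspectedK` and select the smallest number of clusters satisfying $\text{Gap}(K)\ge \text{GAPMAX}-\text{SE}(\text{GAPMAX})$, where *K* is the number of clusters, Gap(*K*) is the gap value for the clustering solution with *K* clusters, GAPMAX is the largest gap value, and SE(GAPMAX) is the standard error corresponding to the largest gap value. |
`'firstMaxSE'` | Evaluate each proposed number of clusters in `InspectedK` and select the smallest number of clusters satisfying $\text{Gap}(K)\ge \text{Gap}(K+1)-\text{SE}(K+1)$, where *K* is the number of clusters, Gap(*K*) is the gap value for the clustering solution with *K* clusters, and SE(*K* + 1) is the standard error of the clustering solution with *K* + 1 clusters. |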
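The two selection rules can be sketched by hand from the `CriterionValues` and `SE` properties. This is an illustration that assumes each rule selects the smallest number of clusters satisfying its inequality; it is not the toolbox's internal code. The fallback used when no value satisfies the `firstMaxSE` inequality is also an assumption.

```matlab
load fisheriris
rng("default") % For reproducibility
evaluation = evalclusters(meas,"kmeans","gap","KList",1:6);

gap   = evaluation.CriterionValues; % Gap(K) for each inspected K
se    = evaluation.SE;              % SE(K) for each inspected K
kList = evaluation.InspectedK;

% globalMaxSE: smallest K with Gap(K) >= GAPMAX - SE(GAPMAX)
[gapMax,iMax] = max(gap);
kGlobal = kList(find(gap >= gapMax - se(iMax),1,"first"))

% firstMaxSE: smallest K with Gap(K) >= Gap(K+1) - SE(K+1)
iFirst = find(gap(1:end-1) >= gap(2:end) - se(2:end),1,"first");
if isempty(iFirst)
    iFirst = numel(gap); % Assumed fallback: take the last inspected K
end
kFirst = kList(iFirst)
```

Because `evalclusters` was called without a `SearchMethod` argument, `kGlobal` is expected to agree with `evaluation.OptimalK` if the default method is `'globalMaxSE'`.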

### Sample Data Properties

`LogW`

— Natural logarithm of within-cluster dispersion

numeric vector

This property is read-only.

Natural logarithm of the within-cluster dispersion *W* based on the sample data `X`, returned as a numeric vector. *W* is the within-cluster dispersion computed using the distance metric `Distance`. Each element of `LogW` corresponds to a specific number of proposed clusters (an element of `InspectedK`).

**Data Types:** `double`

`Missing`

— Excluded data

logical column vector | `[]`

This property is read-only.

Excluded data, returned as a logical column vector. If an element of `Missing` is `true`, then the corresponding observation (or row) in the data matrix `X` is not used in the clustering solutions. If the clustering evaluation object is compact (see `compact`), then `Missing` is empty.

**Data Types:** `double` | `logical`

`NumObservations`

— Number of observations

positive integer scalar

This property is read-only.

Number of observations in the data matrix `X`, ignoring observations with missing (`NaN`) values, returned as a positive integer scalar.

**Data Types:** `double`
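To illustrate how `Missing` and `NumObservations` relate, the following sketch inserts a single `NaN` into a copy of the `fisheriris` measurements. The modified matrix `Xnan` and the small `KList` and `B` values are assumptions made for the example.

```matlab
load fisheriris
rng("default") % For reproducibility

% Copy the data and mark one observation as missing
Xnan = meas;
Xnan(1,1) = NaN;

evaluation = evalclusters(Xnan,"kmeans","gap","KList",1:3,"B",10);

% Rows flagged as excluded from the clustering solutions
numExcluded = nnz(evaluation.Missing)

% Observations actually used, compared with the rows supplied
numUsed = evaluation.NumObservations
numRows = size(Xnan,1)
```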

`X`

— Data used for clustering

numeric matrix | `[]`

This property is read-only.

Data used for clustering, returned as a numeric matrix. Rows correspond to observations, and columns correspond to variables. If the clustering evaluation object is compact (see `compact`), then `X` is empty.

**Data Types:** `single` | `double`

### Reference Data Properties

`B`

— Number of reference data sets

positive integer scalar

This property is read-only.

Number of reference data sets generated from the reference distribution `ReferenceDistribution`, returned as a positive integer scalar.

**Data Types:** `double`

`ExpectedLogW`

— Expectation of natural logarithm of within-cluster dispersion

numeric vector

This property is read-only.

Expectation of the natural logarithm of the within-cluster dispersion *W* based on the generated reference data, returned as a numeric vector. *W* is the within-cluster dispersion computed using the distance metric `Distance`. Each element of `ExpectedLogW` corresponds to a specific number of proposed clusters (an element of `InspectedK`).

**Data Types:** `double`

`ReferenceDistribution`

— Reference data generation method

`'PCA'` | `'uniform'`

This property is read-only.

Reference data generation method, returned as `'PCA'` or `'uniform'`.

Value | Description |
---|---|
`'PCA'` | Generate reference data from a uniform distribution over a box aligned with the principal components of the data matrix `X`. |
`'uniform'` | Generate reference data uniformly over the range of each feature in the data matrix `X`. |
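The two generation methods can be sketched in a few lines of base MATLAB. This is an illustration of the idea under stated assumptions, not the toolbox's internal implementation. The variable names are hypothetical, and the subtraction of a row vector from a matrix relies on implicit expansion (R2016b or later).

```matlab
load fisheriris
rng("default") % For reproducibility
[n,p] = size(meas);

% 'uniform': sample each feature uniformly over its observed range
lo = min(meas,[],1);
hi = max(meas,[],1);
refUniform = lo + rand(n,p).*(hi - lo);

% 'PCA': sample uniformly over a box aligned with the principal components
mu = mean(meas,1);
[~,~,V] = svd(meas - mu,"econ");  % Columns of V are principal directions
scores = (meas - mu)*V;           % Data expressed in principal coordinates
sLo = min(scores,[],1);
sHi = max(scores,[],1);
refScores = sLo + rand(n,p).*(sHi - sLo);
refPCA = refScores*V' + mu;       % Rotate back to the original coordinates
```

Each reference data set has the same size as the sample data, so the same clustering algorithm can be applied to it for every proposed number of clusters.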

`SE`

— Standard error of natural logarithm of within-cluster dispersion

numeric vector

This property is read-only.

Standard error of the natural logarithm of the within-cluster dispersion *W* with respect to the reference data, returned as a numeric vector. *W* is the within-cluster dispersion computed using the distance metric `Distance`. Each element of `SE` corresponds to a specific number of proposed clusters (an element of `InspectedK`).

**Data Types:** `double`

`StdLogW`

— Standard deviation of natural logarithm of within-cluster dispersion

numeric vector

This property is read-only.

Standard deviation of the natural logarithm of the within-cluster dispersion *W* with respect to the reference data, returned as a numeric vector. *W* is the within-cluster dispersion computed using the distance metric `Distance`. Each element of `StdLogW` corresponds to a specific number of proposed clusters (an element of `InspectedK`).

**Data Types:** `double`

## Object Functions

## Examples

### Evaluate Clustering Solution Using Gap Criterion

Evaluate the optimal number of clusters using the gap clustering evaluation criterion.

Load the `fisheriris` data set. The data contains length and width measurements from the sepals and petals of three species of iris flowers.

`load fisheriris`

Evaluate the optimal number of clusters based on the gap criterion values. Cluster the data using `kmeans`.

```
rng("default") % For reproducibility
evaluation = evalclusters(meas,"kmeans","gap","KList",1:6)
```

```
evaluation = 
  GapEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [0.0720 0.5928 0.8762 1.0114 1.0534 1.0720]
           OptimalK: 5
```

The `OptimalK`

value indicates that, based on the gap criterion, the optimal number of clusters is five.

Plot the gap criterion values for each number of clusters tested.

```
plot(evaluation)
```

Based on the plot, the maximum value of the gap criterion occurs at six clusters. However, the value at five clusters is within one standard error of the maximum, so the suggested optimal number of clusters is five.

Create a grouped scatter plot to examine the relationship between petal length and width. Group the data by the suggested clusters.

```
PetalLength = meas(:,3);
PetalWidth = meas(:,4);
clusters = evaluation.OptimalY;
gscatter(PetalLength,PetalWidth,clusters,[],"xod^*");
```

The plot shows cluster 4 in the lower-left corner, completely separated from the other four clusters. Cluster 4 contains flowers with the smallest petal widths and lengths. Cluster 2 is in the upper-right corner, and contains flowers with the largest petal widths and lengths. Cluster 5 is next to cluster 2, and contains flowers with similar petal widths but smaller petal lengths compared to the flowers in cluster 2. Clusters 1 and 3 are near the center of the plot, and contain flowers with measurements between the extremes.

## More About

### Gap Value

A common graphical approach to clustering evaluation involves plotting an error measurement versus several proposed numbers of clusters, and locating the “elbow” of this plot. The “elbow” occurs at the most dramatic decrease in error measurement. The gap criterion formalizes this approach by estimating the “elbow” location as the number of clusters with the largest gap value. Therefore, under the gap criterion, the optimal number of clusters corresponds to the solution with the largest local or global gap value within a tolerance range.

The gap value is defined as

$$\mathrm{Gap}_n(k) = E_n^*\{\log(W_k)\} - \log(W_k),$$

where *n* is the sample size, *k* is the number of clusters being evaluated, and $W_k$ is the pooled within-cluster dispersion measurement

$$W_k = \sum_{r=1}^{k} \frac{1}{2n_r} D_r,$$

where $n_r$ is the number of data points in cluster *r*, and $D_r$ is the sum of the pairwise distances for all points in cluster *r*.

The expected value $E_n^*\{\log(W_k)\}$ is determined by Monte Carlo sampling from a reference distribution, and $\log(W_k)$ is computed from the sample data.
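The pooled dispersion can be computed directly from its definition. The sketch below assumes the squared Euclidean metric, a `kmeans` solution with three clusters, and that the pairwise sum for each cluster runs over all ordered pairs of points. Under those assumptions the term for cluster *r* equals the sum of squared distances from its points to its centroid, so the result can be compared with the `sumd` output of `kmeans`.

```matlab
load fisheriris
rng("default") % For reproducibility
k = 3;
[idx,~,sumd] = kmeans(meas,k,"Replicates",5);

W = 0;
for r = 1:k
    Xr = meas(idx == r,:);                 % Points assigned to cluster r
    nr = size(Xr,1);                       % Number of points in cluster r
    D  = pdist2(Xr,Xr,"squaredeuclidean"); % All pairwise squared distances
    Dr = sum(D(:));                        % Sum over all ordered pairs
    W  = W + Dr/(2*nr);                    % Contribution of cluster r
end
logW = log(W)

% Relative difference from the within-cluster sums reported by kmeans
relDiff = abs(W - sum(sumd))/sum(sumd)
```

Repeating this computation on each reference data set and averaging the resulting values of `log(W)` gives the Monte Carlo estimate of the expected value in the gap formula.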

The gap value is defined even for clustering solutions that contain only one cluster, and can be used with any distance metric. However, the gap criterion is more computationally expensive than other clustering evaluation criteria, because the clustering algorithm must be applied to the reference data for each proposed clustering solution.

## References

[1] Tibshirani, R., G. Walther, and T. Hastie. “Estimating the number of clusters in a data set via the gap statistic.” *Journal of the Royal Statistical Society: Series B*. Vol. 63, Part 2, 2001, pp. 411–423.

## Version History

**Introduced in R2013b**
