silhouette
Silhouette plot
Syntax
Description
silhouette(X,clust,Distance,DistParameter) accepts one or more additional distance metric parameter values when you specify Distance as a custom distance function handle @distfun that accepts the additional parameter values.
Examples
Create Silhouette Plot
Create silhouette plots from clustered data using different distance metrics.
Generate random sample data.
rng('default') % For reproducibility
X = [randn(10,2)+3;randn(10,2)-3];
Create a scatter plot of the data.
scatter(X(:,1),X(:,2));
title('Randomly Generated Data');
The scatter plot shows that the data appears to be split into two clusters of equal size.
Partition the data into two clusters using kmeans with the default squared Euclidean distance metric.
clust = kmeans(X,2);
clust contains the cluster indices of the data.
Create a silhouette plot from the clustered data using the default squared Euclidean distance metric.
silhouette(X,clust)
The silhouette plot shows that the data is split into two clusters of equal size. All the points in the two clusters have large silhouette values (0.8 or greater), indicating that the clusters are well separated.
Create a silhouette plot from the clustered data using the Euclidean distance metric.
silhouette(X,clust,'Euclidean')
The silhouette plot shows that the data is split into two clusters of equal size. All the points in the two clusters have large silhouette values (0.6 or greater), indicating that the clusters are well separated.
Compute Silhouette Values
Compute the silhouette values from clustered data.
Generate random sample data.
rng('default') % For reproducibility
X = [randn(10,2)+1;randn(10,2)-1];
Cluster the data in X based on the sum of absolute differences in distance by using kmeans.
clust = kmeans(X,2,'distance','cityblock');
clust contains the cluster indices of the data.
Compute the silhouette values from the clustered data. Specify the distance metric as 'cityblock' to indicate that the kmeans clustering is based on the sum of absolute differences.
s = silhouette(X,clust,'cityblock')
s = 20×1
0.0816
0.5848
0.1906
0.2781
0.3954
0.4050
0.0897
0.5416
0.6203
0.6664
⋮
Find Silhouette Values Using Custom Distance Metric
Find silhouette values from clustered data using a custom chi-square distance metric. Verify that the chi-square distance metric is equivalent to the Euclidean distance metric, but with an optional scaling parameter.
Generate random sample data.
rng('default'); % For reproducibility
X = [randn(10,2)+3;randn(10,2)-3];
Cluster the data in X using kmeans with the default squared Euclidean distance metric.
clust = kmeans(X,2);
Find silhouette values and create a silhouette plot from the clustered data using the Euclidean distance metric.
[s,h] = silhouette(X,clust,'Euclidean')
s = 20×1
0.6472
0.7241
0.5682
0.7658
0.7864
0.6397
0.7253
0.7783
0.7054
0.7442
⋮
h = 
  Figure (1) with properties:

      Number: 1
        Name: ''
       Color: [1 1 1]
    Position: [348 376 583 437]
       Units: 'pixels'

  Use GET to show all properties
The chi-square distance between J-dimensional points x and z is

$$\chi(x,z) = \sqrt{\sum_{j=1}^{J} w_j (x_j - z_j)^2},$$

where $w_j$ is the weight associated with dimension j.
Set weights for each dimension and specify the chi-square distance function. The distance function must:
- Take as input arguments, in this order, one row of X (for example, x), the n-by-p input data matrix X, and a scaling (or weight) parameter w.
- Calculate the distance from x to each row of X.
- Return a vector of length n. Each element of the vector is the distance between the observation corresponding to x and the observation corresponding to one row of X.
w = [0.4; 0.6]; % Set arbitrary weights for illustration
chiSqrDist = @(x,Z,w)sqrt(((x-Z).^2)*w);
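Before passing a handle like this to silhouette, it can help to check on a tiny input that it returns one distance per row. The following is a minimal sketch, not part of the original example: the matrix Zsmall, the point x0, and the variables dUnit and dWeighted are hypothetical names chosen for illustration. It assumes a MATLAB release with implicit expansion (R2016b or later), which the expression x-Z relies on.

```matlab
% Same handle as in the example: weighted Euclidean-style distance from
% one point x (1-by-p) to every row of Z (n-by-p), with weights w (p-by-1).
chiSqrDist = @(x,Z,w) sqrt(((x - Z).^2) * w);

% Hypothetical toy input: three 2-D points, with x0 taken as the first row.
Zsmall = [0 0; 3 4; 1 1];
x0 = Zsmall(1,:);

% Unit weights reduce the formula to the ordinary Euclidean distance.
dUnit = chiSqrDist(x0, Zsmall, [1; 1]);

% Non-unit weights rescale each squared coordinate difference.
dWeighted = chiSqrDist(x0, Zsmall, [0.4; 0.6]);

disp(size(dUnit))   % one distance per row of Zsmall
disp(dUnit)
disp(dWeighted)
```

With unit weights, the second entry is the familiar 3-4-5 distance, which gives a quick sanity check that the handle behaves as intended.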
Find silhouette values from the clustered data using the custom distance metric chiSqrDist.
s1 = silhouette(X,clust,chiSqrDist,w)
s1 = 20×1
0.6288
0.7239
0.6244
0.7696
0.7957
0.6688
0.7386
0.7865
0.7223
0.7572
⋮
Set the weight for both dimensions to 1 to use chiSqrDist as the Euclidean distance metric. Find silhouette values and verify that they are the same as the values in s.
w2 = [1; 1];
s2 = silhouette(X,clust,chiSqrDist,w2);
AreValuesEqual = isequal(s2,s)
AreValuesEqual = logical
1
The silhouette values are the same in s and s2.
Input Arguments
X — Input data
numeric matrix
Input data, specified as a numeric matrix of size n-by-p. Rows correspond to points, and columns correspond to coordinates.
Data Types: single | double
clust — Cluster assignment
categorical variable | numeric vector | character matrix | string array | cell array of character vectors
Cluster assignment, specified as a categorical variable, numeric vector, character matrix, string array, or cell array of character vectors containing a cluster name for each point in X.
silhouette treats NaNs and empty values in clust as missing values and ignores the corresponding rows of X.
Data Types: single | double | char | string | cell | categorical
Distance — Distance metric
'sqEuclidean' (default) | 'Euclidean' | 'cityblock' | function handle | vector of pairwise distances | ...
Distance metric, specified as a character vector, string scalar, numeric row vector of pairwise distances, or function handle, as described in this table.
Metric | Description |
---|---|
'Euclidean' | Euclidean distance |
'sqEuclidean' | Squared Euclidean distance (default) |
'cityblock' | Sum of absolute differences |
'cosine' | One minus the cosine of the included angle between points (treated as vectors) |
'correlation' | One minus the sample correlation between points (treated as sequences of values) |
'Hamming' | Percentage of coordinates that differ |
'Jaccard' | Percentage of nonzero coordinates that differ |
Vector | A numeric row vector of pairwise distances, in the form created by the pdist function. X is not used in this case, and can safely be set to []. |
@distfun | Custom distance function handle. A distance function has the form function D = distfun(X0,X, |
For more information, see Distance Metrics.
Example: 'cosine'
Data Types: char | string | function_handle | single | double
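The Vector option in the table can be illustrated with a short sketch. It assumes Statistics and Machine Learning Toolbox is available (for pdist and silhouette) and uses fixed, hand-assigned cluster labels rather than kmeans output so that the result does not depend on a random initialization. The variable names sFromVector, sFromMetric, and maxDiff are hypothetical.

```matlab
% Sketch: pass precomputed pairwise distances instead of a metric name.
rng('default') % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];

% Hand-assigned labels (an assumption for illustration): first ten points
% in cluster 1, last ten in cluster 2.
clust = [ones(10,1); 2*ones(10,1)];

% pdist returns the pairwise Euclidean distances as a row vector.
D = pdist(X);

% With a distance vector, X is not used and can be passed as [].
sFromVector = silhouette([],clust,D);

% Reference: the same values computed from X with the named metric.
sFromMetric = silhouette(X,clust,'Euclidean');

% The two results are expected to agree up to floating-point rounding.
maxDiff = max(abs(sFromVector - sFromMetric));
disp(maxDiff)
```

Precomputing the distances this way can be convenient when the same distance vector is reused across several candidate clusterings of the same data.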
DistParameter — Distance metric parameter value
positive scalar | numeric vector | numeric matrix
Distance metric parameter value, specified as a positive scalar, numeric vector, or numeric matrix. This argument is valid only when you specify a custom distance function handle @distfun that accepts one or more parameter values in addition to the input parameters X0 and X.
Example: silhouette(X,clust,distfun,p1,p2) where p1 and p2 are additional distance metric parameter values for @distfun
Data Types: single | double
Output Arguments
s — Silhouette values
n-by-1 vector of values ranging from –1 to 1
Silhouette values, returned as an n-by-1 vector of values ranging from –1 to 1. A silhouette value measures how similar a point is to points in its own cluster, when compared to points in other clusters. A high silhouette value indicates that a point is well matched to its own cluster, and poorly matched to other clusters.
Data Types: single | double
h — Figure handle
scalar
Figure handle, returned as a scalar. You can use the figure handle to query and modify figure properties. For more information, see Figure Properties.
More About
Silhouette Value
The silhouette value for each point is a measure of how similar that point is to other points in the same cluster, compared to points in other clusters.
The silhouette value $s_i$ for the ith point is defined as

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)},$$

where $a_i$ is the average distance from the ith point to the other points in the same cluster as i, and $b_i$ is the minimum average distance from the ith point to points in a different cluster, minimized over the clusters. If the ith point is the only point in its cluster, then the silhouette value $s_i$ is set to 1.
The silhouette values range from –1 to 1. A high silhouette value indicates that the point is well matched to its own cluster, and poorly matched to other clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution might have too many or too few clusters. You can use silhouette values as a clustering evaluation criterion with any distance metric.
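To make the definition concrete, the following sketch computes silhouette values by hand with the standard formula s_i = (b_i - a_i)/max(a_i, b_i) and the Euclidean distance. It uses only base MATLAB and is an illustration of the definition, not the implementation used by silhouette. The toy data and the names Xtoy, ctoy, and sManual are assumptions made for this example.

```matlab
% Hypothetical toy data: two tight, well-separated groups of three points.
Xtoy = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
ctoy = [1; 1; 1; 2; 2; 2];

n = size(Xtoy,1);

% Pairwise Euclidean distances, D(i,j) = distance between rows i and j.
D = zeros(n);
for i = 1:n
    for j = 1:n
        D(i,j) = sqrt(sum((Xtoy(i,:) - Xtoy(j,:)).^2));
    end
end

labels = unique(ctoy);
sManual = zeros(n,1);
for i = 1:n
    own = (ctoy == ctoy(i));
    own(i) = false;                  % exclude the point itself
    if ~any(own)
        sManual(i) = 1;              % single-point cluster, as stated in the text
        continue
    end
    a = mean(D(i,own));              % average distance within own cluster
    b = Inf;                         % smallest average distance to another cluster
    for k = labels'
        if k ~= ctoy(i)
            b = min(b, mean(D(i,ctoy == k)));
        end
    end
    sManual(i) = (b - a) / max(a,b);
end

disp(sManual)
```

Because the two groups are far apart relative to their spread, every value comes out close to 1. If the toolbox is available, silhouette(Xtoy,ctoy,'Euclidean') should give matching values up to rounding.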
Version History
Introduced before R2006a