This topic provides a brief overview of the available clustering methods in Statistics and Machine Learning Toolbox™.
Cluster analysis, also called segmentation analysis or taxonomy analysis, is a common unsupervised learning method. Unsupervised learning is used to draw inferences from data sets consisting of input data without labeled responses. For example, you can use cluster analysis for exploratory data analysis to find hidden patterns or groupings in unlabeled data.
Cluster analysis creates groups, or clusters, of data. Objects that belong to the same cluster are similar to one another and distinct from objects that belong to different clusters. To quantify "similar" and "distinct," you can use a dissimilarity measure (or distance metric) that is specific to the domain of your application and your data set. Also, depending on your application, you might consider scaling (or standardizing) the variables in your data to give them equal importance during clustering.
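The effect of scaling on a dissimilarity measure can be illustrated with a short sketch. The data below is made up for illustration, and the choice of `zscore` for standardizing and `pdist` for computing dissimilarities is one reasonable option, not the only one.

```matlab
% Hypothetical data: 50 observations, two variables on very different scales.
rng(1);                                   % fix the seed so the example is reproducible
X = [randn(50,1), 1000*randn(50,1)];

% Standardize each column to zero mean and unit standard deviation.
Z = zscore(X);

% Pairwise dissimilarities before and after scaling (Euclidean by default).
dRaw    = pdist(X);
dScaled = pdist(Z);

% A different dissimilarity measure can be requested by name.
dCity = pdist(Z, 'cityblock');

disp(std(Z));                             % spread of each column after scaling
```

Without scaling, the second variable dominates every Euclidean distance in `dRaw`. After standardizing, both variables contribute on a comparable scale.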
Statistics and Machine Learning Toolbox provides functionality for these clustering methods:
Hierarchical clustering groups data over a variety of scales by creating a cluster tree, or dendrogram. The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one level combine to form clusters at the next level. This multilevel hierarchy allows you to choose the level, or scale, of clustering that is most appropriate for your application. Hierarchical clustering assigns every point in your data to a cluster.
Use clusterdata to perform hierarchical clustering on input data. clusterdata incorporates the pdist, linkage, and cluster functions, which you can use separately for more detailed analysis. The dendrogram function plots the cluster tree. For more information, see Introduction to Hierarchical Clustering.
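As a minimal sketch of the hierarchical workflow, the example below builds a cluster tree step by step and then repeats the same analysis in a single call. The data, the average linkage method, and the choice of two clusters are assumptions made for illustration.

```matlab
rng(1);                                   % reproducible example
X = [randn(20,2); randn(20,2) + 5];       % made-up data with two separated groups

% Step by step: pairwise distances -> cluster tree -> cluster assignments.
D = pdist(X);                             % Euclidean distances between all pairs
Z = linkage(D, 'average');                % agglomerative tree using average linkage
T = cluster(Z, 'MaxClust', 2);            % cut the tree into at most 2 clusters

% One-call alternative that performs the same three steps.
T2 = clusterdata(X, 'Linkage', 'average', 'MaxClust', 2);

dendrogram(Z);                            % plot the cluster tree
```

Working with the separate functions keeps the tree `Z` available, so you can cut it at a different level without recomputing the distances.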
k-means clustering and k-medoids clustering partition data into k mutually exclusive clusters. These clustering methods require that you specify the number of clusters k. Both k-means and k-medoids clustering assign every point in your data to a cluster; however, unlike hierarchical clustering, these methods operate on actual observations (rather than dissimilarity measures), and create a single level of clusters. Therefore, k-means or k-medoids clustering is often more suitable than hierarchical clustering for large amounts of data.
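The following sketch partitions a made-up data set with both methods. The data, the value of k, and the use of the 'Replicates' option are illustrative assumptions.

```matlab
rng(1);                                   % reproducible example
X = [randn(100,2); randn(100,2) + 4];     % hypothetical data with two groups
k = 2;                                    % number of clusters, chosen by the user

% k-means: each cluster center is the mean of its members.
% 'Replicates' reruns the algorithm from several starts and keeps the best result.
[idxMeans, C] = kmeans(X, k, 'Replicates', 5);

% k-medoids: each cluster center is an actual observation from X.
[idxMedoids, M] = kmedoids(X, k);
```

Both calls return one cluster index per row of `X`. The rows of `M` are members of the data set, whereas the rows of `C` generally are not.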
DBSCAN is a density-based algorithm that identifies arbitrarily shaped clusters and outliers (noise) in data. During clustering, DBSCAN identifies points that do not belong to any cluster, which makes this method useful for density-based outlier detection. Unlike k-means and k-medoids clustering, DBSCAN does not require prior knowledge of the number of clusters.
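A minimal sketch of density-based clustering with the dbscan function follows. The data is made up, and the values of `epsilon` and `minpts` are assumptions that would normally need tuning for a real data set.

```matlab
rng(1);                                   % reproducible example
% Two dense groups plus one isolated point far from both (hypothetical data).
X = [0.3*randn(50,2); 0.3*randn(50,2) + 3; 10 10];

epsilon = 1;                              % neighborhood radius (assumed value)
minpts  = 5;                              % neighbors required for a core point (assumed value)

idx = dbscan(X, epsilon, minpts);         % no number of clusters is specified

% dbscan labels points that belong to no cluster with -1.
isNoise = (idx == -1);
disp(find(isNoise)');                     % row indices of the points flagged as noise
```

The isolated point in the last row has no neighbors within `epsilon`, so it is labeled as noise rather than forced into a cluster.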
A Gaussian mixture model (GMM) forms clusters as a mixture of multivariate normal density components. For a given observation, the GMM assigns posterior probabilities to each component density (or cluster). The posterior probabilities indicate that the observation has some probability of belonging to each cluster. A GMM can perform hard clustering by selecting the component that maximizes the posterior probability as the assigned cluster for the observation. You can also use a GMM to perform soft, or fuzzy, clustering by assigning the observation to multiple clusters based on the scores or posterior probabilities of the observation for the clusters. A GMM can be a more appropriate method than k-means clustering when clusters have different sizes and different correlation structures within them.
Use fitgmdist to fit a gmdistribution object to your data. You can also use gmdistribution to create a GMM object by specifying the distribution parameters. When you have a fitted GMM, you can cluster query data by using the cluster function. For more information, see Cluster Using Gaussian Mixture Model.
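As a sketch of hard and soft clustering with a GMM, the example below fits a two-component model and also builds one directly from parameters. The data and all parameter values are assumptions chosen for illustration.

```matlab
rng(1);                                   % reproducible example
X = [randn(100,2); 0.5*randn(100,2) + 4]; % hypothetical data, two groups of different spread

% Fit a two-component GMM to the data.
gm = fitgmdist(X, 2);

% Hard clustering: assign each observation to its most probable component.
idx = cluster(gm, X);

% Soft clustering: posterior probability of each component for each observation.
P = posterior(gm, X);                     % one row per observation, one column per component

% Alternatively, create a GMM by specifying its parameters (assumed values).
mu    = [0 0; 4 4];                       % component means, one row per component
sigma = cat(3, eye(2), 0.25*eye(2));      % one covariance matrix per component
p     = [0.5 0.5];                        % mixing proportions
gmManual  = gmdistribution(mu, sigma, p);
idxManual = cluster(gmManual, X);
```

Each row of `P` sums to 1, and the hard assignment in `idx` corresponds to the column with the largest posterior probability in that row.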
k-nearest neighbor search finds the k closest points in your data to a query point or set of query points. In contrast, radius search finds all points in your data that are within a specified distance from a query point or set of query points. The results of these methods depend on the distance metric that you specify.
Use the knnsearch function to find k-nearest neighbors or the rangesearch function to find all neighbors within a specified distance of your input data. You can also create a searcher object using a training data set, and pass the object and query data sets to the object functions (knnsearch and rangesearch). For more information, see Classification Using Nearest Neighbors.
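The two kinds of search can be sketched as follows. The reference data, query points, radius, and distance metric are all assumptions made for illustration. The searcher object is created here with createns, which is one way to build it.

```matlab
rng(1);                                   % reproducible example
X = rand(100, 2);                         % hypothetical reference data
Y = [0.5 0.5; 0.1 0.9];                   % two query points

% k-nearest neighbor search: the 3 closest rows of X for each query point.
[idxK, distK] = knnsearch(X, Y, 'K', 3);

% Radius search: all rows of X within distance 0.1 of each query point.
% The outputs are cell arrays because the number of neighbors varies per query.
[idxR, distR] = rangesearch(X, Y, 0.1);

% Reusable searcher object with a different distance metric.
Mdl   = createns(X, 'Distance', 'cityblock');
idxK2 = knnsearch(Mdl, Y, 'K', 3);
idxR2 = rangesearch(Mdl, Y, 0.1);
```

Creating the searcher object once is useful when you run many queries against the same reference data, because the search structure is built only one time.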
Spectral clustering is a graph-based algorithm for finding k arbitrarily shaped clusters in data. The technique involves representing the data in a low-dimensional space. In this space, clusters in the data are more widely separated, enabling you to use algorithms such as k-means or k-medoids clustering. The low-dimensional representation is based on eigenvectors of a Laplacian matrix. A Laplacian matrix is one way of representing a similarity graph that models the local neighborhood relationships between data points as an undirected graph.
Use spectralcluster to perform spectral clustering on an input data matrix or on a similarity matrix of a similarity graph. spectralcluster requires that you specify the number of clusters. However, the algorithm for spectral clustering also provides a way to estimate the number of clusters in your data. For more information, see Partition Data Using Spectral Clustering.
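A minimal sketch of spectral clustering on made-up data with two concentric rings follows. The data, the number of clusters, and the number of eigenvalues requested are assumptions. Inspecting the eigenvalues is shown as one informal way to judge the number of clusters, on the assumption that eigenvalues near zero correspond to well-separated groups in the similarity graph.

```matlab
rng(1);                                   % reproducible example
theta = linspace(0, 2*pi, 100)';
inner = [cos(theta), sin(theta)] + 0.05*randn(100,2);      % small ring
outer = 5*[cos(theta), sin(theta)] + 0.05*randn(100,2);    % large ring
X = [inner; outer];                       % hypothetical data, not linearly separable

k = 2;                                    % number of clusters, specified by the user
idx = spectralcluster(X, k);              % one cluster index per row of X

% Request more eigenvalues than the expected number of clusters and inspect them.
[~, ~, D] = spectralcluster(X, 5);
disp(D');                                 % smallest eigenvalues of the Laplacian matrix
```

Clusters of this shape are difficult for a centroid-based method, because both rings share the same center.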
This table compares the features of available clustering methods in Statistics and Machine Learning Toolbox.
|Method|Basis of Algorithm|Input to Algorithm|Requires Specified Number of Clusters|Cluster Shapes Identified|Useful for Outlier Detection|
|---|---|---|---|---|---|
|Hierarchical Clustering|Distance between objects|Pairwise distances between observations|No|Arbitrarily shaped clusters, depending on the specified linkage algorithm|No|
|k-Means Clustering and k-Medoids Clustering|Distance between objects and centroids|Actual observations|Yes|Spheroidal clusters with equal diagonal covariance|No|
|Density-Based Spatial Clustering of Applications with Noise (DBSCAN)|Density of regions in the data|Actual observations or pairwise distances between observations|No|Arbitrarily shaped clusters|Yes|
|Gaussian Mixture Models|Mixture of Gaussian distributions|Actual observations|Yes|Spheroidal clusters with different covariance structures|Yes|
|Nearest Neighbors|Distance between objects|Actual observations|No|Arbitrarily shaped clusters|Yes, depending on the specified number of neighbors|
|Spectral Clustering (Partition Data Using Spectral Clustering)|Graph representing connections between data points|Actual observations or similarity matrix|Yes, but the algorithm also provides a way to estimate the number of clusters|Arbitrarily shaped clusters|No|