Machine Learning with MATLAB

Chapter 3 Applying Unsupervised Learning

When to Consider Unsupervised Learning

Unsupervised learning is useful when you want to explore your data but don’t yet have a specific goal or are not sure what information the data contains. It’s also a good way to reduce the dimensionality of your data.

Most unsupervised learning techniques are a form of cluster analysis, as we saw in Chapter 1.

In cluster analysis, data is partitioned into groups based on some measure of similarity or shared characteristic. Clusters are formed so that objects in the same cluster are very similar and objects in different clusters are very distinct.

Clustering algorithms fall into two broad groups:

Hard clustering, where each data point belongs to only one cluster.
Soft clustering, where each data point can belong to more than one cluster.

You can use hard or soft clustering techniques if you already know the possible data groupings.

Figure: Gaussian mixture model used to separate data into two clusters.

If you don’t yet know how the data might be grouped:

Use self-organizing feature maps or hierarchical clustering to look for possible structures in the data.
Use cluster evaluation to look for the “best” number of groups for a given clustering algorithm.
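
As one illustration of cluster evaluation, the sketch below computes the mean silhouette value for two candidate groupings of the same points. It is written in plain Python for illustration only; the function name `silhouette`, the sample data, and the two label lists are assumptions and do not come from this chapter. The silhouette value is only one of several possible evaluation criteria.

```python
import math


def silhouette(points, labels):
    """Mean silhouette value; closer to 1 means tighter, better-separated clusters.

    Assumes at least two clusters and that the points are not all identical.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself contributes 0 to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical sample data: two tight groups of three points each.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
score_two = silhouette(data, [0, 0, 0, 1, 1, 1])    # two groups
score_three = silhouette(data, [0, 0, 2, 1, 1, 1])  # same points forced into three
print(score_two, score_three)
```

A higher mean silhouette suggests a grouping that fits the data better, so comparing the score across candidate cluster counts is one way to pick a number of groups.
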

Common Hard Clustering Algorithms

k-Means

HOW IT WORKS
Partitions data into k mutually exclusive clusters. How well a point fits into a cluster is determined by its distance to the cluster’s center.

BEST USED...

When the number of clusters is known

For fast clustering of large data sets

RESULT
Cluster centers

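
The assign-then-update loop behind k-means can be sketched in a few lines. This is a minimal plain-Python illustration of Lloyd's algorithm, not a toolbox implementation; the function name `kmeans`, the `initial_centers` argument, and the sample data are assumptions made for the example.

```python
import math


def kmeans(points, initial_centers, iterations=100):
    """Lloyd's algorithm; returns (centers, labels) for a list of coordinate tuples."""
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # centers stopped moving, so assignments can no longer change
        centers = new_centers
    return centers, labels


# Hypothetical sample data; one point from each visible group seeds the centers.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, labels = kmeans(data, [data[0], data[3]])
print(centers, labels)
```

The result depends on the starting centers, which is why practical implementations usually repeat the procedure from several starting points and keep the best run.
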
k-Medoids

HOW IT WORKS
Similar to k-means, but with the requirement that the cluster centers coincide with points in the data.

BEST USED...

When the number of clusters is known

For fast clustering of categorical data

To scale to large data sets

RESULT
Cluster centers that coincide with data points

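
The difference from k-means shows up in the update step: instead of averaging, each cluster picks one of its own members as its center. The sketch below uses a simple alternating scheme in plain Python (it is not the PAM algorithm and not a toolbox implementation); the name `kmedoids`, its arguments, and the sample data are assumptions made for the example.

```python
import math


def kmedoids(points, initial_medoids, dist=math.dist, iterations=100):
    """Alternating k-medoids; returns (medoids, labels). Medoids are always data points."""
    medoids = list(initial_medoids)
    k = len(medoids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest medoid.
        labels = [min(range(k), key=lambda j: dist(p, medoids[j])) for p in points]
        # Update step: the new medoid is the member with the smallest total
        # distance to all members of its cluster.
        new_medoids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                best = min(members, key=lambda c: sum(dist(c, m) for m in members))
            else:
                best = medoids[j]  # an empty cluster keeps its old medoid
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Hypothetical sample data, started from two off-center points.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
medoids, labels = kmedoids(data, [data[1], data[4]])
print(medoids, labels)
```

Because the procedure only ever calls `dist` on pairs of points, the Euclidean distance can be swapped for any dissimilarity measure, which is what makes the approach usable on categorical data.
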
Hierarchical Clustering

HOW IT WORKS
Produces nested sets of clusters by analyzing similarities between pairs of points and grouping objects into a binary, hierarchical tree.

BEST USED...

When you don’t know in advance how many clusters are in your data

When you want visualization to guide your selection

RESULT
Dendrogram showing the hierarchical relationship between clusters

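
To make the "binary, hierarchical tree" concrete, the sketch below builds one bottom-up by repeatedly merging the two closest clusters. It is a naive plain-Python illustration that assumes single linkage (cluster distance is the distance between the closest pair of members), which is only one of several possible linkage rules; the name `single_linkage` and the sample data are assumptions made for the example.

```python
import math


def single_linkage(points):
    """Naive agglomerative clustering.

    Returns the merges in order as (members_a, members_b, distance), where
    members are indices into `points`. This list is the information a
    dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges


# Hypothetical one-dimensional sample data.
points = [(0.0,), (1.0,), (5.0,), (6.5,)]
merges = single_linkage(points)
for members_a, members_b, height in merges:
    print(members_a, members_b, height)
```

Cutting the tree at a chosen merge distance yields a flat clustering, so the number of clusters does not have to be fixed in advance.
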
Self-Organizing Map

HOW IT WORKS
Neural-network-based clustering that transforms a data set into a topology-preserving 2D map.

BEST USED...

To visualize high-dimensional data in 2D or 3D

To deduce the dimensionality of data by preserving its topology (shape)

RESULT
Lower-dimensional (typically 2D) representation
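
A self-organizing map can be sketched as a grid of nodes, each holding a weight vector that is pulled toward the samples it (or its grid neighbors) matches best. The plain-Python code below is a minimal illustration under several assumptions that do not come from this chapter: a rectangular grid, a Gaussian neighborhood, linearly decaying learning rate and radius, and the names `train_som` and `map_position`.

```python
import math
import random


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_list}."""
    rng = random.Random(seed)
    # Assumption: node weights start as copies of randomly chosen samples.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        # Learning rate and neighborhood radius shrink as training proceeds.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 0.1)
        for x in rng.sample(data, len(data)):  # visit samples in random order
            # Best-matching unit: the node whose weights are closest to x.
            bmu = min(weights, key=lambda node: distance(weights[node], x))
            for (r, c), w in weights.items():
                # Nodes close to the BMU on the grid are pulled toward x more strongly.
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma ** 2))
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


def map_position(weights, x):
    """Grid coordinates of the node that best matches sample x."""
    return min(weights, key=lambda node: distance(weights[node], x))


# Hypothetical 3-D sample data mapped onto a 3 x 3 grid.
data = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (1.0, 1.0, 1.0),
        (0.9, 1.0, 0.9), (0.0, 1.0, 0.5), (1.0, 0.0, 0.5)]
som = train_som(data, rows=3, cols=3)
positions = [map_position(som, x) for x in data]
print(positions)
```

Each sample ends up with a 2D grid coordinate, which is the lower-dimensional representation; the neighborhood update is what encourages similar samples to land on nearby nodes.
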

Common Soft Clustering Algorithms

Fuzzy c-Means

HOW IT WORKS
Partition-based clustering in which data points may belong to more than one cluster.

BEST USED...

When the number of clusters is known

For pattern recognition

When clusters overlap

RESULT
Cluster centers (similar to k-means) but with fuzziness so that points may belong to more than one cluster

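
The "fuzziness" can be seen in a small sketch: every point receives a degree of membership in every cluster, and the degrees for one point sum to 1. The plain-Python code below follows the standard fuzzy c-means updates with a fuzzifier `m` of 2; the name `fuzzy_cmeans`, its arguments, and the sample data are assumptions made for the example.

```python
import math


def fuzzy_cmeans(points, initial_centers, m=2.0, iterations=100, tol=1e-6):
    """Returns (centers, memberships); memberships[i][j] is point i's degree in cluster j."""
    centers = [tuple(c) for c in initial_centers]
    dims = len(points[0])
    memberships = []
    for _ in range(iterations):
        # Membership update: nearer centers get a larger share; each row sums to 1.
        memberships = []
        for p in points:
            d = [max(math.dist(p, c), 1e-12) for c in centers]  # guard against zero distance
            inv = [dj ** (-2.0 / (m - 1.0)) for dj in d]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Center update: mean of all points, weighted by membership ** m.
        new_centers = []
        for j in range(len(centers)):
            w = [u[j] ** m for u in memberships]
            new_centers.append(tuple(
                sum(wi * p[dim] for wi, p in zip(w, points)) / sum(w)
                for dim in range(dims)))
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


# Hypothetical sample data: two groups plus one point halfway between them.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (3.0, 3.0)]
centers, memberships = fuzzy_cmeans(data, [data[0], data[3]])
for point, row in zip(data, memberships):
    print(point, [round(u, 3) for u in row])
```

Points inside a group get a membership close to 1 for that group, while the in-between point is shared roughly equally between the two clusters.
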
Gaussian Mixture Model

HOW IT WORKS
Partition-based clustering in which data points are assumed to come from different multivariate normal distributions, each with a certain probability.

BEST USED...

When a data point might belong to more than one cluster

When clusters have different sizes and correlation structures within them

RESULT
A model of Gaussian distributions that gives the probability of a point belonging to each cluster
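
A Gaussian mixture model is typically fitted with the expectation-maximization (EM) algorithm. The sketch below is a deliberately simplified plain-Python version for one-dimensional data, so it has a single variance per component rather than the full covariance matrices the multivariate case needs. The names `fit_gmm_1d` and `responsibilities`, the starting values, and the sample data are assumptions made for the example.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(xs, weights, means, variances):
    """Posterior probability of each component for each value in xs."""
    resp = []
    for x in xs:
        joint = [w * normal_pdf(x, mu, var) for w, mu, var in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([v / total for v in joint])
    return resp


def fit_gmm_1d(xs, initial_means, iterations=100):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k    # assumption: unit starting variances
    weights = [1.0 / k] * k  # assumption: equal starting mixing proportions
    for _ in range(iterations):
        # E-step: how strongly each component claims each point.
        resp = responsibilities(xs, weights, means, variances)
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, 1e-6)  # floor keeps a component from collapsing
    return weights, means, variances


# Hypothetical sample data: two groups with different spreads.
xs = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
weights, means, variances = fit_gmm_1d(xs, [0.0, 6.0])
probs = responsibilities([1.0, 5.2], weights, means, variances)
print(means, variances, probs)
```

Each row of `probs` gives the probability that the corresponding value belongs to each component, which is the soft assignment described in the RESULT entry above.
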