# pdist

Pairwise distance between pairs of observations

## Syntax

``D = pdist(X)``
``D = pdist(X,Distance)``
``D = pdist(X,Distance,DistParameter)``

## Description

`D = pdist(X)` returns the Euclidean distance between pairs of observations in `X`.

`D = pdist(X,Distance)` returns the distance by using the method specified by `Distance`.

`D = pdist(X,Distance,DistParameter)` returns the distance by using the method specified by `Distance` and `DistParameter`. You can specify `DistParameter` only when `Distance` is `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`.

## Examples

Compute the Euclidean distance between pairs of observations, and convert the distance vector to a matrix using `squareform`.

Create a matrix with three observations and two variables.

```
rng('default') % For reproducibility
X = rand(3,2);
```

Compute the Euclidean distance.

`D = pdist(X)`
```
D = 1×3

    0.2954    1.0670    0.9448
```

The pairwise distances are arranged in the order (2,1), (3,1), (3,2). You can easily locate the distance between observations `i` and `j` by using `squareform`.

`Z = squareform(D)`
```
Z = 3×3

         0    0.2954    1.0670
    0.2954         0    0.9448
    1.0670    0.9448         0
```

`squareform` returns a symmetric matrix where `Z(i,j)` corresponds to the pairwise distance between observations `i` and `j`. For example, you can find the distance between observations 2 and 3.

`Z(2,3)`
```ans = 0.9448 ```

Pass `Z` to the `squareform` function to reproduce the output of the `pdist` function.

`y = squareform(Z)`
```
y = 1×3

    0.2954    1.0670    0.9448
```

The outputs `y` from `squareform` and `D` from `pdist` are the same.

Create a matrix with three observations and two variables.

```
rng('default') % For reproducibility
X = rand(3,2);
```

Compute the Minkowski distance with the default exponent 2.

`D1 = pdist(X,'minkowski')`
```
D1 = 1×3

    0.2954    1.0670    0.9448
```

Compute the Minkowski distance with an exponent of 1, which is equal to the city block distance.

`D2 = pdist(X,'minkowski',1)`
```
D2 = 1×3

    0.3721    1.5036    1.3136
```

`D3 = pdist(X,'cityblock')`
```
D3 = 1×3

    0.3721    1.5036    1.3136
```

Define a custom distance function that ignores coordinates with `NaN` values, and compute pairwise distance by using the custom distance function.

Create a matrix with three observations and two variables.

```
rng('default') % For reproducibility
X = rand(3,2);
```

Assume that the first element of the first observation is missing.

`X(1,1) = NaN;`

Compute the Euclidean distance.

`D1 = pdist(X)`
```
D1 = 1×3

       NaN       NaN    0.9448
```

If observation `i` or `j` contains `NaN` values, the function `pdist` returns `NaN` for the pairwise distance between `i` and `j`. Therefore, `D1(1)` and `D1(2)`, the pairwise distances (2,1) and (3,1), are `NaN` values.

Define a custom distance function `naneucdist` that ignores coordinates with `NaN` values and returns the Euclidean distance.

```
function D2 = naneucdist(XI,XJ)
%NANEUCDIST Euclidean distance ignoring coordinates with NaNs
n = size(XI,2);
sqdx = (XI-XJ).^2;
nstar = sum(~isnan(sqdx),2); % Number of pairs that do not contain NaNs
nstar(nstar == 0) = NaN; % To return NaN if all pairs include NaNs
D2squared = sum(sqdx,2,'omitnan').*n./nstar; % Correction for missing coordinates
D2 = sqrt(D2squared);
```

Compute the distance with `naneucdist` by passing the function handle as an input argument of `pdist`.

`D2 = pdist(X,@naneucdist)`
```
D2 = 1×3

    0.3974    1.1538    0.9448
```

## Input Arguments

Input data `X`, specified as a numeric matrix of size m-by-n. Rows correspond to individual observations, and columns correspond to individual variables.

Data Types: `single` | `double`

Distance metric `Distance`, specified as a character vector, string scalar, or function handle, as described in the following table.

| Value | Description |
| --- | --- |
| `'euclidean'` | Euclidean distance (default). |
| `'squaredeuclidean'` | Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.) |
| `'seuclidean'` | Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, `S = std(X,'omitnan')`. Use `DistParameter` to specify another value for `S`. |
| `'mahalanobis'` | Mahalanobis distance using the sample covariance of `X`, `C = cov(X,'omitrows')`. Use `DistParameter` to specify another value for `C`, where the matrix `C` is symmetric and positive definite. |
| `'cityblock'` | City block distance. |
| `'minkowski'` | Minkowski distance. The default exponent is 2. Use `DistParameter` to specify a different exponent `P`, where `P` is a positive scalar. |
| `'chebychev'` | Chebychev distance (maximum coordinate difference). |
| `'cosine'` | One minus the cosine of the included angle between points (treated as vectors). |
| `'correlation'` | One minus the sample correlation between points (treated as sequences of values). |
| `'hamming'` | Hamming distance, which is the fraction of coordinates that differ. |
| `'jaccard'` | One minus the Jaccard coefficient, which is the fraction of nonzero coordinates that differ. |
| `'spearman'` | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
| `@distfun` | Custom distance function handle, in the form described after this table. |

A custom distance function has the form

```
function D2 = distfun(ZI,ZJ)
% calculation of distance
...
```

where

• `ZI` is a `1`-by-`n` vector containing a single observation.

• `ZJ` is an `m2`-by-`n` matrix containing multiple observations. `distfun` must accept a matrix `ZJ` with an arbitrary number of observations.

• `D2` is an `m2`-by-`1` vector of distances, and `D2(k)` is the distance between observations `ZI` and `ZJ(k,:)`.

If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle.
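
As a minimal sketch of the `ZI`/`ZJ` contract for a custom distance function, the following anonymous function reimplements the city block distance and compares it with the built-in metric. The name `cityfun` and the sample data are illustrative assumptions, and the subtraction `ZJ - ZI` relies on implicit expansion, which assumes MATLAB R2016b or later.

```matlab
% Hypothetical custom distance: city block distance written as a function handle.
% ZI is 1-by-n, ZJ is m2-by-n; the result is m2-by-1, one distance per row of ZJ.
cityfun = @(ZI,ZJ) sum(abs(ZJ - ZI),2);

X = [1 2; 4 6; 7 1];              % three observations, two variables
Dcustom = pdist(X,cityfun);       % custom function handle
Dbuiltin = pdist(X,'cityblock');  % built-in metric

% Largest disagreement between the two results (expected to be negligible)
maxAbsDiff = max(abs(Dcustom - Dbuiltin));
```

The handle version is useful for checking that a custom function returns a column vector of the right length before applying it to larger data.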

For definitions, see Distance Metrics.

When you use `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`, you can specify an additional input argument `DistParameter` to control these metrics. You can also use these metrics in the same way as the other metrics with a default value of `DistParameter`.

Example: `'minkowski'`

Distance metric parameter values `DistParameter`, specified as a positive scalar, numeric vector, or numeric matrix. This argument is valid only when you specify `Distance` as `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`.

• If `Distance` is `'seuclidean'`, `DistParameter` is a vector of scaling factors for each dimension, specified as a positive vector. The default value is `std(X,'omitnan')`.

• If `Distance` is `'minkowski'`, `DistParameter` is the exponent of Minkowski distance, specified as a positive scalar. The default value is 2.

• If `Distance` is `'mahalanobis'`, `DistParameter` is a covariance matrix, specified as a numeric matrix. The default value is `cov(X,'omitrows')`. `DistParameter` must be symmetric and positive definite.

Example: `'minkowski',3`

Data Types: `single` | `double`
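
The following sketch illustrates the three `DistParameter` cases listed above. It passes the documented default values explicitly and checks that the results agree with the calls that omit `DistParameter`. The variable names and the 6-by-3 random data are illustrative assumptions.

```matlab
rng('default') % For reproducibility
X = rand(6,3);

% 'seuclidean': pass the default scaling vector explicitly
S = std(X,'omitnan');
Dse_default = pdist(X,'seuclidean');
Dse_explicit = pdist(X,'seuclidean',S);

% 'mahalanobis': pass the default covariance matrix explicitly
C = cov(X,'omitrows');
Dm_default = pdist(X,'mahalanobis');
Dm_explicit = pdist(X,'mahalanobis',C);

% 'minkowski' with exponent 3, checked by hand for the pair (2,1)
P = 3;
Dmink = pdist(X,'minkowski',P);
d21 = sum(abs(X(2,:) - X(1,:)).^P)^(1/P);

% Differences between the paired results (expected to be negligible)
errSE = max(abs(Dse_default - Dse_explicit));
errM = max(abs(Dm_default - Dm_explicit));
errMink = abs(Dmink(1) - d21);
```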

## Output Arguments

Pairwise distances `D`, returned as a numeric row vector of length m(m–1)/2, corresponding to pairs of observations, where m is the number of observations in `X`.

The distances are arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1), that is, the lower-left triangle of the m-by-m distance matrix in column order. The pairwise distance between observations i and j is in `D((i-1)*(m-i/2)+j-i)` for i < j.
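
As a quick check of the index formula, the sketch below compares every element selected by the formula with the corresponding entry of the square matrix returned by `squareform`. The sample size `m = 5` and the variable names are illustrative assumptions.

```matlab
rng('default') % For reproducibility
m = 5;
X = rand(m,3);
D = pdist(X);
Z = squareform(D);

ok = true;
for i = 1:m-1
    for j = i+1:m
        k = (i-1)*(m-i/2) + j - i;    % position in D of the pair (i,j), i < j
        ok = ok && (D(k) == Z(i,j));  % same value as the square-matrix entry
    end
end
```

After the loop, `ok` remains `true` only if the formula locates every pair correctly.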

You can convert `D` into a symmetric matrix by using the `squareform` function. `Z = squareform(D)` returns an m-by-m matrix where `Z(i,j)` corresponds to the pairwise distance between observations i and j.

If observation i or j contains `NaN`s, then the corresponding value in `D` is `NaN` for the built-in distance functions.

`D` is commonly used as a dissimilarity matrix in clustering or multidimensional scaling. For details, see Hierarchical Clustering and the function reference pages for `cmdscale`, `cophenet`, `linkage`, `mdscale`, and `optimalleaforder`. These functions take `D` as an input argument.
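
For instance, a distance vector can be passed directly to `linkage`. The sketch below uses made-up data with two well-separated groups, and it additionally calls `cluster` (a function from the same toolbox that this page does not otherwise mention) to cut the tree into two clusters.

```matlab
% Two well-separated groups of three observations each (illustrative data)
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];

D = pdist(X);                      % Euclidean distance vector
tree = linkage(D);                 % hierarchical cluster tree from the distance vector
T = cluster(tree,'maxclust',2);    % cluster index for each observation

% Observations 1-3 share one cluster index and observations 4-6 share the other
sameFirst = all(T(1:3) == T(1));
sameSecond = all(T(4:6) == T(4));
```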

### Distance Metrics

A distance metric is a function that defines a distance between two observations. `pdist` supports various distance metrics: Euclidean distance, standardized Euclidean distance, Mahalanobis distance, city block distance, Minkowski distance, Chebychev distance, cosine distance, correlation distance, Hamming distance, Jaccard distance, and Spearman distance.

Given an m-by-n data matrix `X`, which is treated as m (1-by-n) row vectors $x_1, x_2, \ldots, x_m$, the various distances between the vectors $x_s$ and $x_t$ are defined as follows:

• Euclidean distance

`${d}_{st}^{2}=\left({x}_{s}-{x}_{t}\right){\left({x}_{s}-{x}_{t}\right)}^{\prime }.$`

The Euclidean distance is a special case of the Minkowski distance, where p = 2.

• Standardized Euclidean distance

`${d}_{st}^{2}=\left({x}_{s}-{x}_{t}\right){V}^{-1}{\left({x}_{s}-{x}_{t}\right)}^{\prime },$`

where V is the n-by-n diagonal matrix whose jth diagonal element is $(S(j))^2$, where S is a vector of scaling factors for each dimension.

• Mahalanobis distance

`${d}_{st}^{2}=\left({x}_{s}-{x}_{t}\right){C}^{-1}{\left({x}_{s}-{x}_{t}\right)}^{\prime },$`

where C is the covariance matrix.

• City block distance

`${d}_{st}=\sum _{j=1}^{n}|{x}_{sj}-{x}_{tj}|.$`

The city block distance is a special case of the Minkowski distance, where p = 1.

• Minkowski distance

`${d}_{st}=\sqrt[p]{\sum _{j=1}^{n}{|{x}_{sj}-{x}_{tj}|}^{p}}.$`

For the special case of p = 1, the Minkowski distance gives the city block distance. For the special case of p = 2, the Minkowski distance gives the Euclidean distance. For the special case of p = ∞, the Minkowski distance gives the Chebychev distance.

• Chebychev distance

`${d}_{st}={\mathrm{max}}_{j}\left\{|{x}_{sj}-{x}_{tj}|\right\}.$`

The Chebychev distance is a special case of the Minkowski distance, where p = ∞.

• Cosine distance

`${d}_{st}=1-\frac{{x}_{s}{x}_{t}^{\prime }}{\sqrt{\left({x}_{s}{x}_{s}^{\prime }\right)\left({x}_{t}{x}_{t}^{\prime }\right)}}.$`

• Correlation distance

`${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime }}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime }}\sqrt{\left({x}_{t}-{\overline{x}}_{t}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime }}},$`

where

${\overline{x}}_{s}=\frac{1}{n}\sum _{j}{x}_{sj}$ and ${\overline{x}}_{t}=\frac{1}{n}\sum _{j}{x}_{tj}$.

• Hamming distance

`${d}_{st}=\left(\#\left({x}_{sj}\ne {x}_{tj}\right)/n\right).$`

• Jaccard distance

`${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {x}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right]}.$`

• Spearman distance

`${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime }}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}},$`

where

• $r_{sj}$ is the rank of $x_{sj}$ taken over $x_{1j}, x_{2j}, \ldots, x_{mj}$, as computed by `tiedrank`.

• $r_s$ and $r_t$ are the coordinate-wise rank vectors of $x_s$ and $x_t$, i.e., $r_s = (r_{s1}, r_{s2}, \ldots, r_{sn})$.

• ${\overline{r}}_{s}=\frac{1}{n}\sum _{j}{r}_{sj}=\frac{\left(n+1\right)}{2}$.

• ${\overline{r}}_{t}=\frac{1}{n}\sum _{j}{r}_{tj}=\frac{\left(n+1\right)}{2}$.
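
To connect the definitions above with the function output, the following sketch evaluates several of the formulas by hand for one pair of observations and compares each value with the corresponding `pdist` result. The data matrix and all variable names are illustrative assumptions.

```matlab
% Small data set: 4 observations (rows), 3 variables (columns)
X = [1 0 2; 0 0 5; 3 1 2; 4 1 0];
xs = X(1,:);                  % observation s = 1
xt = X(2,:);                  % observation t = 2
n = size(X,2);

dEuc  = sqrt((xs - xt)*(xs - xt)');                % Euclidean
dCity = sum(abs(xs - xt));                         % city block
dCheb = max(abs(xs - xt));                         % Chebychev
dCos  = 1 - (xs*xt')/sqrt((xs*xs')*(xt*xt'));      % cosine
dHam  = sum(xs ~= xt)/n;                           % Hamming
nz    = (xs ~= 0) | (xt ~= 0);                     % coordinates where either value is nonzero
dJac  = sum((xs ~= xt) & nz)/sum(nz);              % Jaccard

manual = [dEuc dCity dCheb dCos dHam dJac];
names = {'euclidean','cityblock','chebychev','cosine','hamming','jaccard'};

% The pair (2,1) is the first element of each pdist output
builtin = zeros(size(manual));
for k = 1:numel(names)
    D = pdist(X,names{k});
    builtin(k) = D(1);
end

% Largest disagreement between hand-computed and built-in values
maxAbsDiff = max(abs(manual - builtin));
```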