seqpdist

Calculate pairwise distance between sequences

collapse all in page

Syntax

D = seqpdist(Seqs)

D = seqpdist(Seqs,Name=Value)

Description

D = seqpdist(Seqs) returns the biological distances between each pair of sequences stored in Seqs.

D = seqpdist(Seqs,Name=Value) adjusts aspects of the calculation using name-value arguments.

example

Examples

collapse all

Measure Distance Between Amino Acid Sequences

Open Live Script

Read amino acid alignment data into a structure.

seqs = fastaread("pf00002.fa");

For every possible pair of sequences in the multiple alignment, ignore sites with gaps and score with the scoring matrix PAM250.

dist = seqpdist(seqs,Method="alignment-score", ...
    Indels="pairwise-delete",...
    ScoringMatrix="pam250");

Force the realignment of each sequence pair ignoring the provided multiple alignment.

dist = seqpdist(seqs,Method="alignment-score",...
    Indels="pairwise-delete", ...
    ScoringMatrix="pam250", ...
    PairwiseAlignment=1);

Measure the Jukes-Cantor pairwise distances after realigning each sequence pair, counting the gaps as point mutations.

dist = seqpdist(seqs,Method="jukes-cantor", ...
    Indels="score", ...
    ScoringMatrix="pam250", ...
    PairwiseAlignment=true);

Input Arguments

collapse all

`Seqs` — Nucleotide or amino acid sequences
cell array of character vectors | vector of strings | matrix of characters | vector of structures

Nucleotide or amino acid sequences, specified as a cell array of character vectors, vector of strings, matrix of characters, or vector of structures.

You can specify:

Cell array of character vectors or vector of strings containing nucleotide or amino acid sequences.
Matrix of characters, in which each row corresponds to a nucleotide or amino acid sequence.
Vector of structures containing a Sequence field.

Name-Value Arguments

collapse all

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: D = seqpdist(Seqs,Method="p-distance") calculates the pairwise distance using the p-distance method.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: D = seqpdist(Seqs,"Method","p-distance")

`Method` — Method to calculate pairwise distances
`"Jukes-Cantor"` (default) | character vector | string scalar | function handle

Method to calculate pairwise distances, specified by a character vector or string scalar in the table.

Methods for Nucleotides and Amino Acids

Method Description

Method	Description
`p-distance`	Proportion of sites at which the two sequences are different. `p` is close to `1` for poorly related sequences, and `p` is close to `0` for similar sequences. d = p
`Jukes-Cantor` (default)	Maximum likelihood estimate of the number of substitutions between two sequences. `p` is described with the method `p-distance`. For nucleotides: `d = -3/4 log(1-p * 4/3)` For amino acids: `d = -19/20 log(1-p * 20/19)`
`alignment-score`	Distance (`d`) between two sequences (`1, 2`) is computed from the pairwise alignment score between the two sequences (`score12`), and the pairwise alignment score between each sequence and itself (`score11`, `score22`) as follows: d = (1-score12/score11)* (1-score12/score22) This method does not imply that prealigned input sequences will be realigned, it only scores them. Use with care; this distance method does not comply with the ultrametric condition. In the rare case where the score between sequences is greater than the score when aligning a sequence with itself, then `d = 0`.

p-distance

Proportion of sites at which the two sequences are different. p is close to 1 for poorly related sequences, and p is close to 0 for similar sequences.

d = p

Jukes-Cantor (default)

Maximum likelihood estimate of the number of substitutions between two sequences. p is described with the method p-distance.

For nucleotides:

d = -3/4 log(1-p * 4/3)

For amino acids:

d = -19/20 log(1-p * 20/19)

alignment-score

Distance (d) between two sequences (1, 2) is computed from the pairwise alignment score between the two sequences (score12), and the pairwise alignment score between each sequence and itself (score11, score22) as follows:

d = (1-score12/score11)* (1-score12/score22)

This method does not imply that prealigned input sequences will be realigned, it only scores them. Use with care; this distance method does not comply with the ultrametric condition. In the rare case where the score between sequences is greater than the score when aligning a sequence with itself, then d = 0.

Methods with No Scoring of Gaps (Nucleotides Only)

Method	Description
`Tajima-Nei`	Maximum likelihood estimate considering the background nucleotide frequencies. The `seqpdist` function can calculate the background nucleotide frequencies from the input sequences. Instead, if you want to specify the nucleotide frequencies, set the `OptArgs` name-value argument to `[gA gC gG gT]`. The elements `gA`, `gC`, `gG`, `gT` are scalar values for the nucleotide frequencies.
`Kimura`	Considers separately the transitional nucleotide substitution and the transversional nucleotide substitution.
`Tamura`	Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the GC content. The `seqpdist` function can calculate the GC content from the input sequences. Instead, if you want to specify the proportion of GC content, set the `OptArgs` name-value argument to a numeric scalar in the range [0, 1].
`Hasegawa`	Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the background nucleotide frequencies. The `seqpdist` function can calculate the background nucleotide frequencies from the input sequences. Instead, if you want to specify the nucleotide frequencies, set the `OptArgs` name-value argument to `[gA gC gG gT]`. The elements `gA`, `gC`, `gG`, `gT` are scalar values for the nucleotide frequencies.
`Nei-Tamura`	Considers separately the transitional nucleotide substitution between purines, the transitional nucleotide substitution between pyrimidines, the transversional nucleotide substitution, and the background nucleotide frequencies. The `seqpdist` function can calculate the background nucleotide frequencies from the input sequences. Instead, if you want to specify the nucleotide frequencies, set the `OptArgs` name-value argument to `[gA gC gG gT]`. The elements `gA`, `gC`, `gG`, `gT` are scalar values for the nucleotide frequencies.

Methods with No Scoring of Gaps (Amino Acids Only)

Method Description

Poisson Assumes that the number of amino acid substitutions at each site has a Poisson distribution.

Method	Description
`Poisson`	Assumes that the number of amino acid substitutions at each site has a Poisson distribution.
`Gamma`	Assumes that the number of amino acid substitutions at each site has a Gamma distribution with parameter `a`. By default, the `seqpdist` function sets the value of `a` as `2`. Instead, if you want to specify the value of `a`, set the `OptArgs` name-value argument to a numeric scalar.

Gamma

Assumes that the number of amino acid substitutions at each site has a Gamma distribution with parameter a.

By default, the seqpdist function sets the value of a as 2. Instead, if you want to specify the value of a, set the OptArgs name-value argument to a numeric scalar.

You can also specify a user-defined distance function using @, for example, @distfun. The distance function must have the form:

function D = distfun(S1, S2, OptionalArgs)

The distfun function takes the following arguments:

S1 , S2 — Two sequences of the same length (nucleotide or amino acid).
OptionalArgs — Optional problem-dependent arguments.

The distfun function returns a scalar that represents the distance between S1 and S2.

Data Types: char | string | function_handle

`Indels` — How to treat sites with gaps
`"score"` (default) | `"pairwise-del"` | `"complete-del"`

How to treat sites with gaps, specified as one of these strings or character vectors.

score — Scores these sites either as a point mutation or with the alignment parameters, depending on the method selected.
pairwise-del — For every pairwise comparison, it ignores the sites with gaps.
complete-del — Ignores all the columns in the multiple alignment that contain a gap. This option is available only if you provided a multiple alignment as the input Seqs.

Data Types: char | string

`OptArgs` — Optional arguments for distance method
numeric scalar | four-element numeric vector

Optional arguments for the distance method, specified as a numeric scalar or a four-element numeric vector.

Methods with No Scoring of Gaps (Nucleotides Only)

Method	Description
`Tajima-Nei`	Background nucleotide frequencies, specified as a 4-element numeric vector of the form `[gA gC gG gT]`.
`Tamura`	Proportion of GC content, specified as a number in the range [0, 1].
`Hasegawa`	Background nucleotide frequencies, specified as a 4-element numeric vector of the form `[gA gC gG gT]`.
`Nei-Tamura`	Background nucleotide frequencies, specified as a 4-element numeric vector of the form `[gA gC gG gT]`.

Methods with No Scoring of Gaps (Amino Acids Only)

Method	Description
`Gamma`	Shape parameter `a` of the Gamma distribution, specified as a numeric scalar.

`PairwiseAlignment` — Global pairwise alignment
`true` or `1` | `false` or `0`

Controls the global pairwise alignment of input sequences, specified as a numeric or logical true (1) or false (0). When true, the seqpdist function performs global pairwise alignment using the nwalign function, while ignoring the multiple alignment of the input sequences (if any).

The default value depends on the length of the sequences:

true — When all input sequences do not have the same length.
false — When all input sequences have the same length.

Tip

If your input sequences are the same length, then seqpdist assumes they are aligned. If they are not aligned, do one of the following:

Align the sequences before passing them to seqpdist, for example, using the multialign function.
Set PairwiseAlignment to true when using seqpdist.

`UseParallel` — Option to perform computations in parallel
`"off"` (default) | `"auto"` | `"on"`

Option to perform computations in parallel using a parallel pool of workers, specified as one of these values:

"off" — Run in serial on the MATLAB^® client.
"auto" — Use a parallel pool if one is open or if MATLAB can automatically create one. If a parallel pool is not available, run in serial on the MATLAB client.
"on" — Use a parallel pool if one is open or if MATLAB can automatically create one. If a parallel pool is not available, throw an error.

If you do not have a parallel pool open and automatic pool creation is enabled, MATLAB opens a pool using the default cluster profile. To use a parallel pool to run computations in MATLAB, you must have Parallel Computing Toolbox™.

Before R2026a: You can specify this argument as true or false only. The default value is false. To run computations in parallel, set this argument to true.

`SquareForm` — Return output as square matrix
`false` or `0` (default) | `true` or `1`

Return the output D as a square matrix, specified as a numeric or logical false (0) or true (1).

When true, the seqpdist function converts the output into a square matrix such that D(I,J) denotes the distance between the Ith and Jth sequences. The square matrix is symmetric and has a zero diagonal. Setting Squareform to true is the same as using the squareform function in Statistics and Machine Learning Toolbox™.

`Alphabet` — Type of sequence
`"AA"` (default) | `"NT"`

Type of sequence, specified as "AA" (amino acid) or "NT" (nucleotide).

Data Types: char | string

`ScoringMatrix` — Scoring matrix
character vector | string scalar | numeric matrix

Scoring matrix to use for the global pairwise alignment, specified as a character vector, string scalar, or numeric matrix.

You can specify a scoring matrix name. Valid choices are:

"BLOSUM50" (default for amino acid sequences)
"NUC44" (default for nucleotide sequences). This choice is not supported for amino acid sequences.
"BLOSUM62"
"BLOSUM30" increasing by 5 up to "BLOSUM90"
"BLOSUM100"
"PAM10" increasing by 10 up to "PAM500"
"DAYHOFF"
"GONNET"

Note

The above scoring matrices, provided with the software, also include a scale factor that converts the units of the output score to bits. You can also specify the Scale name-value argument to specify an additional scale factor to convert the output score from bits to another unit.

You can also specify a numeric matrix, such as the one returned by the blosum, pam, dayhoff, gonnet, or nuc44 function.

Note

If you use a scoring matrix that you created or was created by one of these scoring matrix functions, the matrix does not include a scale factor. The output score will be returned in the same units as the scoring matrix. You can use the Scale name-value argument to specify a scale factor to convert the output score to another unit.
If you need to compile seqpdist into a standalone application or software component using MATLAB Compiler™, use a matrix instead of a character vector or string for ScoringMatrix.

You can specify this argument only when the Method argument is "alignment-score" or the PairwiseAlignment argument is true.

`Scale` — Scale factor applied to output distance
`1` (default) | positive number

Scale factor applied to the output distance, specified as a positive number. By default, there is no scaling or change in the units of the output distance. If the scoring matrix information also provides a scale factor, then both are used.

Use this argument to control the units of the output distance.

You can specify this argument only when the Method argument is "alignment-score" or the PairwiseAlignment argument is true.

`GapOpen` — Penalty for opening gap
`8` (default) | positive integer

Penalty for opening a gap in the alignment, specified as a positive integer.

You can specify this argument only when the Method argument is "alignment-score" or the PairwiseAlignment argument is true.

`ExtendGap` — Penalty for extending gap
positive integer

Penalty for extending a gap in the alignment, specified as a positive integer. The default is equal to GapOpen.

You can specify this argument only when the Method argument is "alignment-score" or the PairwiseAlignment argument is true.

Output Arguments

collapse all

`D` — Biological distance between all pairs of sequences
numeric row vector | numeric matrix

Biological distance between all pairs of sequences stored in the M elements of Seqs, returned as a numeric row vector or a numeric matrix.

By default, D is a row vector of length 1-by-(M*(M-1)/2). The elements are arranged in the order ((2,1),(3,1),..., (M,1),(3,2),...(M,2),...(M,M-1)). This is the lower-left triangle of the full M-by-M distance matrix. To get the distance between the Ith and the Jth sequences for I > J, use the formula D((J-1)*(M-J/2)+I-J).

When you specify the SquareForm name-value argument as true, D is the full M-by-M distance matrix.

Data Types: double

Extended Capabilities

expand all

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

seqpdist has automatic parallel support.

To run computations in parallel, set the UseParallel argument to "on" or "auto".

Version History

Introduced before R2006a

expand all

R2026a: Enhanced control over parallel execution with `UseParallel` argument

The UseParallel name-value argument now accepts "off", "auto", or "on" instead of true or false. This change gives you more control over when to use a parallel pool for parallel execution.

Specifying the UseParallel argument as true or false is not recommended.

This table shows how to update your code depending on your goal.

Goal	Not recommended	Recommended
Write code that runs on the MATLAB client	`D = seqpdist(Seqs,UseParallel=false)`	`D = seqpdist(Seqs,UseParallel="off")` (default)
Write portable code that runs on a parallel pool and, if a pool is not available runs on the MATLAB client.	`D = seqpdist(Seqs,UseParallel=true)`	`D = seqpdist(Seqs,UseParallel="auto")`
Write code that runs on a parallel pool and errors if a pool is not available.	N/A	`D = seqpdist(Seqs,UseParallel="on")`

There are no plans to remove support for true or false values.

seqpdist

Syntax

Description

Examples

Measure Distance Between Amino Acid Sequences

Input Arguments

Seqs — Nucleotide or amino acid sequences cell array of character vectors | vector of strings | matrix of characters | vector of structures

Name-Value Arguments

Method — Method to calculate pairwise distances "Jukes-Cantor" (default) | character vector | string scalar | function handle

Indels — How to treat sites with gaps "score" (default) | "pairwise-del" | "complete-del"

OptArgs — Optional arguments for distance method numeric scalar | four-element numeric vector

PairwiseAlignment — Global pairwise alignment true or 1 | false or 0

UseParallel — Option to perform computations in parallel "off" (default) | "auto" | "on"

SquareForm — Return output as square matrix false or 0 (default) | true or 1

Alphabet — Type of sequence "AA" (default) | "NT"

ScoringMatrix — Scoring matrix character vector | string scalar | numeric matrix

Scale — Scale factor applied to output distance 1 (default) | positive number

GapOpen — Penalty for opening gap 8 (default) | positive integer

ExtendGap — Penalty for extending gap positive integer

Output Arguments

D — Biological distance between all pairs of sequences numeric row vector | numeric matrix

Extended Capabilities

Automatic Parallel Support Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

Version History

R2026a: Enhanced control over parallel execution with UseParallel argument

See Also

`Seqs` — Nucleotide or amino acid sequences
cell array of character vectors | vector of strings | matrix of characters | vector of structures

`Method` — Method to calculate pairwise distances
`"Jukes-Cantor"` (default) | character vector | string scalar | function handle

`Indels` — How to treat sites with gaps
`"score"` (default) | `"pairwise-del"` | `"complete-del"`

`OptArgs` — Optional arguments for distance method
numeric scalar | four-element numeric vector

`PairwiseAlignment` — Global pairwise alignment
`true` or `1` | `false` or `0`

`UseParallel` — Option to perform computations in parallel
`"off"` (default) | `"auto"` | `"on"`

`SquareForm` — Return output as square matrix
`false` or `0` (default) | `true` or `1`

`Alphabet` — Type of sequence
`"AA"` (default) | `"NT"`

`ScoringMatrix` — Scoring matrix
character vector | string scalar | numeric matrix

`Scale` — Scale factor applied to output distance
`1` (default) | positive number

`GapOpen` — Penalty for opening gap
`8` (default) | positive integer

`ExtendGap` — Penalty for extending gap
positive integer

`D` — Biological distance between all pairs of sequences
numeric row vector | numeric matrix

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

R2026a: Enhanced control over parallel execution with `UseParallel` argument