seqpdist
Calculate pairwise distance between sequences
Syntax
D = seqpdist(Seqs)
D = seqpdist(Seqs, ...'PropertyName', PropertyValue, ...)
D = seqpdist(Seqs, ...'Method', MethodValue, ...)
D = seqpdist(Seqs, ...'Indels', IndelsValue, ...)
D = seqpdist(Seqs, ...'OptArgs', OptArgsValue, ...)
D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...)
D = seqpdist(Seqs, ...'UseParallel', UseParallelValue, ...)
D = seqpdist(Seqs, ...'SquareForm', SquareFormValue, ...)
D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...)
D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...)
D = seqpdist(Seqs, ...'Scale', ScaleValue, ...)
D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...)
D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...)
Input Arguments
Seqs  Cell array of sequences, vector of structures, or matrix of sequences.

MethodValue  Character vector or string that specifies the method to calculate pairwise distances. Default is 'Jukes-Cantor'.
IndelsValue  Character vector or string that specifies how to treat sites with gaps. Default is 'score'.
OptArgsValue  Character vector or cell array that specifies one or more input arguments required or accepted by the distance method specified by the Method property.
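PairwiseAlignmentValue  Controls the global pairwise alignment of input sequences (using the nwalign function), while ignoring the multiple alignment of the input sequences (if any). Choices are true or false. Default is true when the input sequences do not all have the same length, and false when they all have the same length.
Tip If your input sequences are the same length, seqpdist assumes they are aligned. If they are not aligned, either align them before passing them to seqpdist (for example, using the multialign function) or set PairwiseAlignment to true.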

UseParallelValue  Controls the calculation of the pairwise distances using parfor-loops. When true, and Parallel Computing Toolbox™ is installed and a parpool is open, computation occurs in parallel. If no parpool is open but automatic creation is enabled in the Parallel Preferences, the default pool opens automatically and computation occurs in parallel. If Parallel Computing Toolbox is installed but no parpool is open and automatic creation is disabled, or if Parallel Computing Toolbox is not installed, computation uses parfor-loops in serial mode. Default is false, which uses for-loops in serial mode.
SquareFormValue  Controls the conversion of the output into a square matrix. Choices are true or false (default).
AlphabetValue  Character vector or string specifying the type of sequence (nucleotide or amino acid). Choices are 'NT' or 'AA' (default).
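ScoringMatrixValue  Scoring matrix to use for the global pairwise alignment. Default is 'NUC44' when AlphabetValue equals 'NT', and 'BLOSUM50' when AlphabetValue equals 'AA'.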

ScaleValue  Positive value that specifies the scale factor used to return the score in arbitrary units. If the scoring matrix information also provides a scale factor, then both are used. 
GapOpenValue  Positive integer that specifies the penalty for opening a gap in the alignment. Default is 8.
ExtendGapValue  Positive integer that specifies the penalty for extending a gap. Default is equal to GapOpenValue.
Output Arguments
D  Vector that contains biological distances between each pair of sequences stored in the M elements of Seqs.
Description
D = seqpdist(Seqs) returns D, a vector containing biological distances between each pair of sequences stored in the M sequences of Seqs, a cell array of sequences, a vector of structures, or a matrix of sequences.
D is a 1-by-(M*(M-1)/2) row vector corresponding to the M*(M-1)/2 pairs of sequences in Seqs. The output D is arranged in the order ((2,1),(3,1),...,(M,1),(3,2),...,(M,2),...,(M,M-1)). This is the lower-left triangle of the full M-by-M distance matrix. To get the distance between the Ith and the Jth sequences for I > J, use the formula D((J-1)*(M-J/2)+I-J).
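The indexing formula can be checked without calling seqpdist. The following sketch uses only base MATLAB and an arbitrary M = 5 (an illustrative choice, not a value from this page). It enumerates the pairs in the documented order and compares each position with the closed-form index.

```matlab
% Illustrative check of the index formula (M is an arbitrary choice).
M = 5;
pos = zeros(M);          % pos(I,J) holds the position of pair (I,J) in D
k = 0;
for J = 1:M-1
    for I = J+1:M        % documented order: (2,1),(3,1),...,(M,1),(3,2),...
        k = k + 1;
        pos(I,J) = k;
    end
end

% Closed-form index from the text, evaluated for every pair with I > J.
formulaOK = true;
for J = 1:M-1
    for I = J+1:M
        formulaOK = formulaOK && (pos(I,J) == (J-1)*(M-J/2) + I - J);
    end
end
disp(formulaOK)
```

The same expression can be used to pick a single distance out of the vector form when the square form is not requested.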
D = seqpdist(Seqs, ...'PropertyName', PropertyValue, ...) calls seqpdist with optional properties that use property name/property value pairs. Specify one or more properties in any order. Enclose each PropertyName in single quotation marks. Each PropertyName is case insensitive. These property name/property value pairs are as follows:
D = seqpdist(Seqs, ...'Method', MethodValue, ...) specifies a method to compute distances between each sequence pair. Choices are shown in the following tables.
Methods for Nucleotides and Amino Acids
Method  Description 

p-distance  Proportion of sites at which the two sequences are different. p is close to 1 for poorly related sequences, and p is close to 0 for similar sequences. d = p
Jukes-Cantor (default)  Maximum likelihood estimate of the number of substitutions between two sequences, where p is the p-distance. For nucleotides: d = -3/4 log(1 - p*4/3). For amino acids: d = -19/20 log(1 - p*20/19).
alignment-score  Distance (d) between two sequences (1, 2) is computed from the pairwise alignment score between the two sequences (score12), and the pairwise alignment score between each sequence and itself (score11, score22) as follows: d = (1 - score12/score11) * (1 - score12/score22). If the score between the two sequences is greater than the score of a sequence aligned with itself, then d = 0.
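To make the relationship between the first two methods concrete, here is a minimal sketch in base MATLAB. It computes the p-distance by hand for two hypothetical, already aligned nucleotide sequences, then applies the standard Jukes-Cantor correction for nucleotides, d = -3/4 ln(1 - 4p/3). The sequences are invented for illustration, and the sketch is not the seqpdist implementation.

```matlab
% Two hypothetical, already aligned nucleotide sequences of equal length.
s1 = 'ACGTACGTAC';
s2 = 'ACGTTCGTAA';

p   = mean(s1 ~= s2);             % p-distance: proportion of differing sites
dJC = -3/4 * log(1 - 4*p/3);      % Jukes-Cantor correction (standard nucleotide form)

fprintf('p = %.4f, Jukes-Cantor d = %.4f\n', p, dJC);
```

The corrected distance exceeds p because it accounts for multiple substitutions at the same site.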
Methods with No Scoring of Gaps (Nucleotides Only)
Method  Description 

Tajima-Nei  Maximum likelihood estimate considering the background nucleotide frequencies. The frequencies can be computed from the input sequences or given by setting OptArgs to [gA gC gG gT], where gA, gC, gG, and gT are scalar values for the nucleotide frequencies.
Kimura  Considers separately the transitional nucleotide substitution and the transversional nucleotide substitution.
Tamura  Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the GC content. GC content can be computed from the input sequences or given by setting OptArgs to the proportion of GC content (scalar value from 0 to 1).
Hasegawa  Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the OptArgs property to [gA gC gG gT].
Nei-Tamura  Considers separately the transitional nucleotide substitution between purines, the transitional nucleotide substitution between pyrimidines, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the OptArgs property to [gA gC gG gT].
Methods with No Scoring of Gaps (Amino Acids Only)
Method  Description 

Poisson  Assumes that the number of amino acid substitutions at each site has a Poisson distribution. 
Gamma  Assumes that the number of amino acid substitutions at each site has a Gamma distribution with parameter a. Set a using the OptArgs property. Default is 2.
You can also specify a user-defined distance function using @, for example, @distfun. The distance function must have the form:
function D = distfun(S1, S2, OptArgsValue)
The distfun function takes the following arguments:
S1, S2: Two sequences of the same length (nucleotide or amino acid).
OptArgsValue: Optional problem-dependent arguments.
The distfun function returns a scalar that represents the distance between S1 and S2.
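As a rough illustration of that contract, the sketch below defines a hypothetical distance function, named mismatchdist here (not part of the toolbox), that returns the proportion of mismatched sites. It is written as an anonymous function so that it can be tried directly at the command line; varargin stands in for the optional OptArgsValue slot.

```matlab
% Hypothetical user-defined distance: proportion of mismatched sites.
% varargin absorbs any optional arguments (the OptArgsValue slot) not needed here.
mismatchdist = @(S1, S2, varargin) sum(S1 ~= S2) / numel(S1);

% Direct call on two equal-length sequences (one mismatch in six sites).
d = mismatchdist('ACGTAC', 'ACGAAC');
disp(d)
```

A handle of this shape is what the 'Method' property accepts in place of a built-in method name, for example seqpdist(Seqs, 'Method', mismatchdist). Whether an anonymous handle behaves exactly like a named function file in that call is an assumption worth confirming in your release.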
D = seqpdist(Seqs, ...'Indels', IndelsValue, ...) specifies how to treat sites with gaps. Choices are:
score (default): Scores these sites either as a point mutation or with the alignment parameters, depending on the method selected.
pairwise-del: For every pairwise comparison, ignores the sites with gaps.
complete-del: Ignores all the columns in the multiple alignment that contain a gap. This option is available only if you provide a multiple alignment as the input Seqs.
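The difference between the two deletion options can be sketched by hand on a tiny, invented multiple alignment. This is a conceptual illustration in base MATLAB, under the assumption that '-' marks a gap; it does not reproduce how seqpdist handles gaps internally.

```matlab
% Hypothetical three-sequence multiple alignment; '-' marks a gap.
aln = ['AC-GT'; 'ACGGT'; 'A-GGA'];

% complete-del: drop every column that contains a gap in any sequence.
gapCols  = any(aln == '-', 1);
complete = aln(:, ~gapCols);

% pairwise-del: for one pair, drop only the sites where either member has a gap.
keep12 = aln(1,:) ~= '-' & aln(2,:) ~= '-';   % pair (1,2)
keep13 = aln(1,:) ~= '-' & aln(3,:) ~= '-';   % pair (1,3)
p13    = mean(aln(1,keep13) ~= aln(3,keep13));

fprintf('complete-del keeps %d sites; pairwise-del keeps %d for (1,2) and %d for (1,3)\n', ...
        size(complete, 2), nnz(keep12), nnz(keep13));
```

Pairwise deletion can retain more sites for a given pair than complete deletion, because a gap in a third sequence does not remove the column.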
D = seqpdist(Seqs, ...'OptArgs', OptArgsValue, ...) passes one or more arguments required or accepted by the distance method specified by the Method property. Use a character vector or cell array to pass one or more input arguments. For example, provide the nucleotide frequencies for the Tajima-Nei distance method, instead of computing them from the input sequences.
D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...) controls the global pairwise alignment of input sequences (using the nwalign function), while ignoring the multiple alignment of the input sequences (if any). Default is:
true: When the input sequences do not all have the same length.
false: When all input sequences have the same length.
Tip
If your input sequences have the same length, seqpdist assumes they are aligned. If they are not aligned, do one of the following:
Align the sequences before passing them to seqpdist, for example, using the multialign function.
Set PairwiseAlignment to true when using seqpdist.
D = seqpdist(Seqs, ...'UseParallel', UseParallelValue, ...) specifies whether to use parfor-loops when calculating the pairwise distances. When true, and Parallel Computing Toolbox is installed and a parpool is open, computation occurs in parallel. If no parpool is open but automatic creation is enabled in the Parallel Preferences, the default pool opens automatically and computation occurs in parallel. If Parallel Computing Toolbox is installed but no parpool is open and automatic creation is disabled, or if Parallel Computing Toolbox is not installed, computation uses parfor-loops in serial mode. Default is false, which uses for-loops in serial mode.
D = seqpdist(Seqs, ...'SquareForm', SquareFormValue, ...) controls the conversion of the output into a square matrix such that D(I,J) denotes the distance between the Ith and Jth sequences. The square matrix is symmetric and has a zero diagonal. Choices are true or false (default). Setting SquareForm to true is the same as using the squareform function in Statistics and Machine Learning Toolbox™.
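For readers without Statistics and Machine Learning Toolbox, the conversion can be sketched in base MATLAB. The example assumes a made-up distance vector for M = 4 sequences, stored in the lower-left-triangle order described earlier; the numeric values are placeholders, not real distances.

```matlab
% Placeholder distance vector for M = 4 sequences, in the order
% (2,1),(3,1),(4,1),(3,2),(4,2),(4,3).
M = 4;
D = [1 2 3 4 5 6];

S = zeros(M);
S(tril(true(M), -1)) = D;   % logical indexing fills the lower triangle column by column
S = S + S.';                % mirror to obtain a symmetric matrix with a zero diagonal

disp(S)
```

Column-by-column filling of the strict lower triangle matches the documented pair order, which is why no explicit index arithmetic is needed here.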
D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...) specifies the type of sequence (nucleotide or amino acid). Choices are 'NT' or 'AA' (default).
The remaining input properties are available when the Method property equals 'alignment-score' or the PairwiseAlignment property equals true.
D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...) specifies the scoring matrix to use for the global pairwise alignment. Default is:
'NUC44': When AlphabetValue equals 'NT'.
'BLOSUM50': When AlphabetValue equals 'AA'.
D = seqpdist(Seqs, ...'Scale', ScaleValue, ...) specifies the scale factor used to return the score in arbitrary units. Choices are any positive value. If the scoring matrix information also provides a scale factor, then both are used.
D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...) specifies the penalty for opening a gap in the alignment. Choices are any positive integer. Default is 8.
D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...) specifies the penalty for extending a gap in the alignment. Choices are any positive integer. Default is equal to GapOpenValue.
Examples
Read amino acid alignment data into a MATLAB structure.
seqs = fastaread('pf00002.fa');
For every possible pair of sequences in the multiple alignment, ignore sites with gaps and score with the scoring matrix PAM250.
dist = seqpdist(seqs,'Method','alignment-score',...
                'Indels','pairwise-delete',...
                'ScoringMatrix','pam250');
Force the realignment of each sequence pair, ignoring the provided multiple alignment.
dist = seqpdist(seqs,'Method','alignment-score',...
                'Indels','pairwise-delete',...
                'ScoringMatrix','pam250',...
                'PairwiseAlignment',true);
Measure the Jukes-Cantor pairwise distances after realigning each sequence pair, counting the gaps as point mutations.
dist = seqpdist(seqs,'Method','jukes-cantor',...
                'Indels','score',...
                'ScoringMatrix','pam250',...
                'PairwiseAlignment',true);
Version History
Introduced before R2006a