seq2regexp

Convert sequence with ambiguous characters to regular expression

Syntax

RegExp = seq2regexp(Seq)
RegExp = seq2regexp(Seq, ...'Alphabet', AlphabetValue, ...)
RegExp = seq2regexp(Seq, ...'Ambiguous', AmbiguousValue, ...)

Input Arguments

`Seq`	Either of the following: (1) a character vector or string containing codes specifying an amino acid or nucleotide sequence, or (2) a structure containing a `Sequence` field that holds an amino acid or nucleotide sequence, such as one returned by `fastaread`, `fastqread`, `getembl`, `getgenbank`, `getgenpept`, or `getpdb`.
`AlphabetValue`	Character vector or string specifying the sequence alphabet. Choices are `'NT'` (default) for nucleotide or `'AA'` for amino acid.
`AmbiguousValue`	Controls whether ambiguous characters are included in `RegExp`, the regular expression return value. Choices are `true` (default), which includes ambiguous characters in the return value, or `false`, which returns only unambiguous characters.
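
As a minimal sketch of the structure form of `Seq`, the snippet below builds a structure by hand rather than reading one with `fastaread`. The `Header` text and the variable names are illustrative assumptions; only the `Sequence` field is called for by the description above. The sketch assumes Bioinformatics Toolbox is installed.

```matlab
% Hand-built record; the Header value is a made-up placeholder.
record = struct('Header', 'illustrative record', 'Sequence', 'ACWTMAN');

% Pass the structure directly; the Sequence field supplies the sequence.
patternFromStruct = seq2regexp(record);

% Pass the same sequence as a character vector for comparison.
patternFromChar = seq2regexp('ACWTMAN');

% Both calls are expected to yield the same regular expression.
disp(isequal(patternFromStruct, patternFromChar))
```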

Output Arguments

RegExp

Character vector of codes specifying an amino acid or nucleotide sequence in regular expression format using IUB/IUPAC codes.

Description

RegExp = seq2regexp(Seq) converts ambiguous amino acid or nucleotide symbols in a sequence to a regular expression format using IUB/IUPAC codes.

RegExp = seq2regexp(Seq, ...'PropertyName', PropertyValue, ...) calls seq2regexp with optional properties specified as property name/property value pairs. You can specify one or more properties in any order. Enclose each PropertyName in single quotation marks; property names are case insensitive. The property name/property value pairs are as follows:

RegExp = seq2regexp(Seq, ...'Alphabet', AlphabetValue, ...) specifies the sequence alphabet. AlphabetValue can be either 'NT' for nucleotide sequences or 'AA' for amino acid sequences. Default is 'NT'.

RegExp = seq2regexp(Seq, ...'Ambiguous', AmbiguousValue, ...) controls whether ambiguous characters are included in RegExp, the regular expression return value. Choices are true (default) or false. For example:

If Seq = 'ACGTK' and AmbiguousValue is true, the MATLAB® software returns ACGT[GTK], with the unambiguous characters G and T and the ambiguous character K.
If Seq = 'ACGTK' and AmbiguousValue is false, the MATLAB software returns ACGT[GT], with only the unambiguous characters.
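
The practical effect of the 'Ambiguous' property shows up when the text being searched still contains ambiguity codes. The following sketch, which assumes Bioinformatics Toolbox is installed and uses illustrative variable names, builds both patterns from 'ACGTK' and tests them against a target that ends in the code K. Given the return values described above, the pattern that keeps K should match and the other should not.

```matlab
% Build both patterns from the same sequence.
withCodes = seq2regexp('ACGTK');                      % described above as ACGT[GTK]
basesOnly = seq2regexp('ACGTK', 'Ambiguous', false);  % described above as ACGT[GT]

% A target that still contains the ambiguity code K.
target = 'ACGTK';

% regexp with 'once' returns [] when there is no match.
hitWithCodes = ~isempty(regexp(target, withCodes, 'once'));
hitBasesOnly = ~isempty(regexp(target, basesOnly, 'once'));

fprintf('with codes: %d, bases only: %d\n', hitWithCodes, hitBasesOnly);
```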

Nucleotide Conversion

Nucleotide Code	Nucleotide	Conversion
`A`	Adenosine	`A`
`C`	Cytosine	`C`
`G`	Guanine	`G`
`T`	Thymidine	`T`
`U`	Uridine	`U`
`R`	Purine	`[AG]`
`Y`	Pyrimidine	`[TC]`
`K`	Keto	`[GT]`
`M`	Amino	`[AC]`
`S`	Strong interaction (3 H bonds)	`[GC]`
`W`	Weak interaction (2 H bonds)	`[AT]`
`B`	Not `A`	`[CGT]`
`D`	Not `C`	`[AGT]`
`H`	Not `G`	`[ACT]`
`V`	Not `T` or `U`	`[ACG]`
`N`	Any nucleotide	`[ACGT]`
`-`	Gap of indeterminate length	`-`
`?`	Unknown	`?`
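
To make the table concrete, here is a minimal re-implementation of the nucleotide lookup in plain MATLAB. It is an illustrative sketch only, not the toolbox's code: it uses `containers.Map` as an assumed data structure, covers only the bracketed entries of the table, and corresponds to the case where AmbiguousValue is false.

```matlab
% Ambiguity codes and their character classes, copied from the table above.
codes   = {'R','Y','K','M','S','W','B','D','H','V','N'};
classes = {'[AG]','[TC]','[GT]','[AC]','[GC]','[AT]', ...
           '[CGT]','[AGT]','[ACT]','[ACG]','[ACGT]'};
lookup = containers.Map(codes, classes);

seq = upper('ACWTMAN');
parts = cell(1, numel(seq));
for k = 1:numel(seq)
    symbol = seq(k);
    if isKey(lookup, symbol)
        parts{k} = lookup(symbol);   % ambiguity code: substitute its class
    else
        parts{k} = symbol;           % A, C, G, T, U, -, ? pass through unchanged
    end
end

% Join the pieces into one regular expression.
pattern = [parts{:}];
disp(pattern)   % AC[AT]T[AC]A[ACGT]
```

Building the pieces in a cell array and joining them once avoids the risk that a later substitution rewrites letters inside a class inserted by an earlier one.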

Amino Acid Conversion

Amino Acid Code	Amino Acid	Conversion
`B`	Asparagine or Aspartic acid (Aspartate)	`[DN]`
`Z`	Glutamine or Glutamic acid (Glutamate)	`[EQ]`
`X`	Any amino acid	`[ARNDCQEGHILKMFPSTWYV]`
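
The amino acid conversions can be exercised in the same way as the nucleotide ones. The sketch below assumes Bioinformatics Toolbox is installed; the peptide strings are made-up test inputs, and the variable names are illustrative. It converts a short sequence containing B and Z with the 'AA' alphabet and checks which candidate peptides the resulting pattern accepts.

```matlab
% B stands for D or N, and Z stands for E or Q, per the table above.
aaPattern = seq2regexp('GBZL', 'Alphabet', 'AA', 'Ambiguous', false);

% Made-up candidates: the first two fit the pattern, the third has A where D or N is required.
peptides = {'GDEL', 'GNQL', 'GAEL'};

% regexp on a cell array returns one result per element; [] means no match.
isHit = ~cellfun(@isempty, regexp(peptides, aaPattern, 'once'));
disp(isHit)
```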

Examples

Convert a nucleotide sequence to a regular expression.

```
seq2regexp('ACWTMAN')

ans =
AC[ATW]T[ACM]A[ACGTRYKMSWBDHVN]
```

Convert the same nucleotide sequence, but remove ambiguous characters from the regular expression.
```
seq2regexp('ACWTMAN', 'ambiguous', false)

ans =
AC[AT]T[AC]A[ACGT]
```
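
A typical next step is to pass the returned expression to `regexp` to locate matching sites in a longer sequence. The following sketch assumes Bioinformatics Toolbox is installed; the target sequence is a made-up example, and the variable names are illustrative.

```matlab
% Pattern without ambiguity codes, as in the example above: AC[AT]T[AC]A[ACGT]
pattern = seq2regexp('ACWTMAN', 'Ambiguous', false);

% Made-up target sequence containing two sites that fit the pattern.
target = 'GGACATCAGTTACTTAAC';

% Request the start index and the matched text of every occurrence.
[startIdx, matches] = regexp(target, pattern, 'start', 'match');

disp(startIdx)   % 3 and 12
disp(matches)    % 'ACATCAG' and 'ACTTAAC'
```

Using 'Ambiguous', false keeps the character classes limited to A, C, G, and T, which is appropriate when the target is known to contain only unambiguous bases.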

Version History

Introduced before R2006a