Manage Sequence Read Data in Objects
Overview
High-throughput sequencing instruments produce large amounts of sequence read data that can be challenging to store and manage. Using objects to contain this data lets you easily access, manipulate, and filter the data.
Bioinformatics Toolbox™ includes two objects for working with sequence read data.
Object | Contains This Information | Construct from One of These |
---|---|---|
BioRead
|
| |
BioMap
|
|
Represent Sequence and Quality Data in a BioRead Object
Prerequisites
A BioRead
object represents a collection of sequence reads. Each
element in the object is associated with a sequence, sequence header, and sequence quality
information.
Construct a BioRead
object in one of two ways:
Indexed — The data remains in the source file. Constructing the object and accessing its contents is memory efficient. However, you cannot modify object properties, other than the
Name
property. This is the default method if you construct aBioRead
object from a FASTQ- or SAM-formatted file.In Memory — The data is read into memory. Constructing the object and accessing its contents is limited by the amount of available memory. However, you can modify object properties. When you construct a
BioRead
object from a FASTQ structure or cell arrays, the data is read into memory. When you construct aBioRead
object from a FASTQ- or SAM-formatted file, use theInMemory
name-value pair argument to read the data into memory.
Construct a BioRead Object from a FASTQ- or SAM-Formatted File
Note
This example constructs a BioRead
object from a FASTQ-formatted
file. Use similar steps to construct a BioRead
object from a
SAM-formatted file.
Use the BioRead
constructor function to construct a
BioRead
object from a FASTQ-formatted file and set the
Name
property:
BRObj1 = BioRead('SRR005164_1_50.fastq', 'Name', 'MyObject')
BRObj1 = BioRead with properties: Quality: [50x1 File indexed property] Sequence: [50x1 File indexed property] Header: [50x1 File indexed property] NSeqs: 50 Name: 'MyObject'
The constructor function construct a BioRead
object and, if an
index file does not already exist, it also creates an index file with the same file name,
but with an .IDX extension. This index file, by default, is stored in the same location as
the source file.
Caution
Your source file and index file must always be in sync.
After constructing a
BioRead
object, do not modify the index file, or you can get invalid results when using the existing object or constructing new objects.If you modify the source file, delete the index file, so the object constructor creates a new index file when constructing new objects.
Note
Because you constructed this BioRead
object from a source file,
you cannot modify the properties (except for Name
) of the
BioRead
object.
Represent Sequence, Quality, and Alignment/Mapping Data in a BioMap Object
Prerequisites
A BioMap
object represents a collection of sequence reads that
map against a single reference sequence. Each element in the object is associated with a
read sequence, sequence header, sequence quality information, and alignment/mapping
information.
When constructing a BioMap
object from a BAM file, the maximum
size of the file is limited by your operating system and available memory.
Construct a BioMap
object in one of two ways:
Indexed — The data remains in the source file. Constructing the object and accessing its contents is memory efficient. However, you cannot modify object properties, other than the
Name
property. This is the default method if you construct aBioMap
object from a SAM- or BAM-formatted file.In Memory — The data is read into memory. Constructing the object and accessing its contents is limited by the amount of available memory. However, you can modify object properties. When you construct a
BioMap
object from a structure, the data stays in memory. When you construct aBioMap
object from a SAM- or BAM-formatted file, use theInMemory
name-value pair argument to read the data into memory.
Construct a BioMap Object from a SAM- or BAM-Formatted File
Note
This example constructs a BioMap
object from a SAM-formatted
file. Use similar steps to construct a BioMap
object from a
BAM-formatted file.
If you do not know the number and names of the reference sequences in your source file, determine them using the
saminfo
orbaminfo
function and theScanDictionary
name-value pair argument.samstruct = saminfo('ex2.sam', 'ScanDictionary', true); samstruct.ScannedDictionary
ans = 'seq1' 'seq2'
Tip
The previous syntax scans the entire SAM file, which is time consuming. If you are confident that the Header information of the SAM file is correct, omit the
ScanDictionary
name-value pair argument, and inspect theSequenceDictionary
field instead.Use the
BioMap
constructor function to construct aBioMap
object from the SAM file and set theName
property. Because the SAM-formatted file in this example,ex2.sam
, contains multiple reference sequences, use theSelectRef
name-value pair argument to specify one reference sequence,seq1
:BMObj2 = BioMap('ex2.sam', 'SelectRef', 'seq1', 'Name', 'MyObject')
BMObj2 = BioMap with properties: SequenceDictionary: 'seq1' Reference: [1501x1 File indexed property] Signature: [1501x1 File indexed property] Start: [1501x1 File indexed property] MappingQuality: [1501x1 File indexed property] Flag: [1501x1 File indexed property] MatePosition: [1501x1 File indexed property] Quality: [1501x1 File indexed property] Sequence: [1501x1 File indexed property] Header: [1501x1 File indexed property] NSeqs: 1501 Name: 'MyObject'
The constructor function constructs a BioMap
object and, if index
files do not already exist, it also creates one or two index files:
If constructing from a SAM-formatted file, it creates one index file that has the same file name as the source file, but with an .IDX extension. This index file, by default, is stored in the same location as the source file.
If constructing from a BAM-formatted file, it creates two index files that have the same file name as the source file, but one with a .BAI extension and one with a .LINEARINDEX extension. These index files, by default, are stored in the same location as the source file.
Caution
Your source file and index files must always be in sync.
After constructing a
BioMap
object, do not modify the index files, or you can get invalid results when using the existing object or constructing new objects.If you modify the source file, delete the index files, so the object constructor creates new index files when constructing new objects.
Note
Because you constructed this BioMap
object from a source file,
you cannot modify the properties (except for Name
and
Reference
) of the BioMap
object.
Construct a BioMap Object from a SAM or BAM Structure
Note
This example constructs a BioMap
object from a SAM structure
using samread
. Use similar steps to construct a
BioMap
object from a BAM structure using
bamread
.
Use the
samread
function to create a SAM structure from a SAM-formatted file:SAMStruct = samread('ex2.sam');
To construct a valid
BioMap
object from a SAM-formatted file, the file must contain only one reference sequence. Determine the number and names of the reference sequences in your SAM-formatted file using theunique
function to find unique names in theReferenceName
field of the structure:unique({SAMStruct.ReferenceName})
ans = 'seq1' 'seq2'
Use the
BioMap
constructor function to construct aBioMap
object from a SAM structure. Because the SAM structure contains multiple reference sequences, use theSelectRef
name-value pair argument to specify one reference sequence,seq1
:BMObj1 = BioMap(SAMStruct, 'SelectRef', 'seq1')
BMObj1 = BioMap with properties: SequenceDictionary: {'seq1'} Reference: {1501x1 cell} Signature: {1501x1 cell} Start: [1501x1 uint32] MappingQuality: [1501x1 uint8] Flag: [1501x1 uint16] MatePosition: [1501x1 uint32] Quality: {1501x1 cell} Sequence: {1501x1 cell} Header: {1501x1 cell} NSeqs: 1501 Name: ''
Retrieve Information from a BioRead or BioMap Object
You can retrieve all or a subset of information from a BioRead
or
BioMap
object.
Retrieve a Property from a BioRead or BioMap Object
You can retrieve a specific property from elements in a BioRead
or BioMap
object.
For example, to retrieve all headers from a BioRead
object, use
the Header
property as follows:
allHeaders = BRObj1.Header;
This syntax returns a cell array containing the headers for all elements in the
BioRead
object.
Similarly, to retrieve all start positions of aligned read sequences from a
BioMap
object, use the Start
property of the
object:
allStarts = BMObj1.Start;
This syntax returns a vector containing the start positions of aligned read sequences
with respect to the position numbers in the reference sequence in a
BioMap
object.
Retrieve Multiple Properties from a BioRead or BioMap Object
You can retrieve multiple properties from a BioRead
or
BioMap
object in a single command using the get
method. For example, to retrieve both start positions and headers
information of a BioMap
object, use the get
method as
follows:
multiProp = get(BMObj1, {'Start', 'Header'});
This syntax returns a cell array containing all start positions and headers
information of a BioMap
object.
Retrieve a Subset of Information from a BioRead or BioMap Object
Use specialized get
methods with a numeric vector, logical
vector, or cell array of headers to retrieve a subset of information from an object. For
example, to retrieve the first 10 elements from a BioRead
object, use
the getSubset
method:
newBRObj = getSubset(BRObj1, [1:10]);
This syntax returns a new BioRead
object containing the first 10
elements in the original BioRead
object.
For example, to retrieve the first 12 positions of sequences with headers SRR005164.1,
SRR005164.7, and SRR005164.16, use the getSubsequence
method:
subSeqs = getSubsequence(BRObj1, ... {'SRR005164.1', 'SRR005164.7', 'SRR005164.16'}, [1:12]')
subSeqs = 'TGGCTTTAAAGC' 'CCCGAAAGCTAG' 'AATTTTGCGGCT'
For example, to retrieve information about the third element in a
BioMap
object, use the getInfo
method:
Info_3 = getInfo(BMObj1, 3);
This syntax returns a tab-delimited character vector containing this information for the third element:
Sequence header
SAM flags for the sequence
Start position of the aligned read sequence with respect to the reference sequence
Mapping quality score for the sequence
Signature (CIGAR-formatted character vector) for the sequence
Sequence
Quality scores for sequence positions
Set Information in a BioRead or BioMap Object
Prerequisites
To modify properties (other than Name
and
Reference
) of a BioRead
or
BioMap
object, the data must be in memory, and not indexed. To
ensure the data is in memory, do one of the following:
Construct the object from a structure as described in Construct a BioMap Object from a SAM or BAM Structure.
Construct the object from a source file using the
InMemory
name-value pair argument.
Provide Custom Headers for Sequences
First, create an object with the data in memory:
BRObj1 = BioRead('SRR005164_1_50.fastq','InMemory',true);
To provide custom headers for sequences of interest (in this case sequences 1 to 5), do the following:
BRObj1.Header(1:5) = {'H1', 'H2', 'H3', 'H4', 'H5'};
Alternatively, you can use the setHeader
method:
BRObj1 = setHeader(BRObj1, {'H1', 'H2', 'H3', 'H4', 'H5'}, [1:5]);
Several other specialized set
methods let you set the properties
of a subset of elements in a BioRead
or BioMap
object.
Determine Coverage of a Reference Sequence
When working with a BioMap
object, you can determine the number of
read sequences that:
Align within a specific region of the reference sequence
Align to each position within a specific region of the reference sequence
For example, you can compute the number, indices, and start positions of the read
sequences that align within the first 25 positions of the reference sequence. To do so, use
the getCounts
, getIndex
, and getStart
methods:
Cov = getCounts(BMObj1, 1, 25)
Cov = 12
Indices = getIndex(BMObj1, 1, 25)
Indices = 1 2 3 4 5 6 7 8 9 10 11 12
startPos = getStart(BMObj1, Indices)
startPos = 1 3 5 6 9 13 13 15 18 22 22 24
The first two syntaxes return the number and indices of the read sequences that align within the specified region of the reference sequence. The last syntax returns a vector containing the start position of each aligned read sequence, corresponding to the position numbers of the reference sequence.
For example, you can also compute the number of the read sequences that align to
each of the first 10 positions of the reference sequence. For this
computation, use the getBaseCoverage
method:
Cov = getBaseCoverage(BMObj1, 1, 10)
Cov = 1 1 2 2 3 4 4 4 5 5
Construct Sequence Alignments to a Reference Sequence
It is useful to construct and view the alignment of the read sequences that align to a
specific region of the reference sequence. It is also helpful to know which read sequences
align to this region in a BioMap
object.
For example, to retrieve the alignment of read sequences to the first 12 positions of
the reference sequence in a BioMap
object, use the getAlignment
method:
[Alignment_1_12, Indices] = getAlignment(BMObj2, 1, 12)
Alignment_1_12 = CACTAGTGGCTC CTAGTGGCTC AGTGGCTC GTGGCTC GCTC Indices = 1 2 3 4 5
Return the headers of the read sequences that align to a specific region of the reference sequence:
alignedHeaders = getHeader(BMObj2, Indices)
alignedHeaders = 'B7_591:4:96:693:509' 'EAS54_65:7:152:368:113' 'EAS51_64:8:5:734:57' 'B7_591:1:289:587:906' 'EAS56_59:8:38:671:758'
Filter Read Sequences Using SAM Flags
SAM- and BAM-formatted files include the status of 11 binary flags for each read
sequence. These flags describe different sequencing and alignment aspects of a read
sequence. For more information on the flags, see the SAM Format
Specification. The filterByFlag
method lets you filter the read
sequences in a BioMap
object by using these flags.
Filter Unmapped Read Sequences
Construct a
BioMap
object from a SAM-formatted file.BMObj2 = BioMap('ex1.sam');
Use the
filterByFlag
method to create a logical vector indicating the read sequences in aBioMap
object that are mapped.LogicalVec_mapped = filterByFlag(BMObj2, 'unmappedQuery', false);
Use this logical vector and the
getSubset
method to create a newBioMap
object containing only the mapped read sequences.filteredBMObj_1 = getSubset(BMObj2, LogicalVec_mapped);
Filter Read Sequences That Are Not Mapped in a Pair
Construct a
BioMap
object from a SAM-formatted file.BMObj2 = BioMap('ex1.sam');
Use the
filterByFlag
method to create a logical vector indicating the read sequences in aBioMap
object that are mapped in a proper pair, that is, both the read sequence and its mate are mapped to the reference sequence.LogicalVec_paired = filterByFlag(BMObj2, 'pairedInMap', true);
Use this logical vector and the
getSubset
method to create a newBioMap
object containing only the read sequences that are mapped in a proper pair.filteredBMObj_2 = getSubset(BMObj2, LogicalVec_paired);