Work with Next-Generation Sequencing Data
Overview
Many biological experiments produce huge data files that are difficult to access due to their size, which can cause memory issues when reading the file into the MATLAB® Workspace. You can construct a BioIndexedFile object to access the contents of a large text file containing entries of nonuniform size, such as sequences, annotations, and cross-references to data sets. The BioIndexedFile object lets you quickly and efficiently access this data without loading the source file into memory.
You can use the BioIndexedFile object to access individual entries or a subset of entries when the source file is too big to fit into memory. You can access entries using indices or keys. You can read and parse one or more entries using provided interpreters or a custom interpreter function.
Use the BioIndexedFile object in conjunction with your large source file to:
Access a subset of the entries for validation or further analysis.
Parse entries using a custom interpreter function.
What Files Can You Access?
You can use the BioIndexedFile object to access large text files.
Your source file can have these application-specific formats:
FASTA
FASTQ
SAM
Your source file can also have these general formats:
Table: Tab-delimited table with multiple columns. Keys can be in any column. Rows with the same key are considered separate entries.
Multi-row Table: Tab-delimited table with multiple columns. Keys can be in any column. Contiguous rows with the same key are considered a single entry. Noncontiguous rows with the same key are considered separate entries.
Flat: Flat file with concatenated entries separated by a character vector, typically //. Within an entry, the key is separated from the rest of the entry by white space.
Before You Begin
Before constructing a BioIndexedFile object, locate your source file on your hard drive or a local network.
When you construct a BioIndexedFile object from your source file for the first time, you also create an auxiliary index file, which by default is saved to the same location as your source file. However, if your source file is in a read-only location, you can specify a different location to save the index file.
Tip
If you construct a BioIndexedFile object from your source file on subsequent occasions, the constructor reuses the existing index file, which saves time. However, the index file must be either in the same location as the source file or in the location that you specify in the subsequent constructor call.
Tip
If memory is not a concern when accessing your source file, consider using an appropriate read function, such as genbankread for importing data from GenBank® files. Additionally, several read functions, such as fastaread, fastqread, samread, and sffread, include a Blockread option, which lets you read a subset of entries from a file, thus saving memory.
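As a rough illustration of the Blockread option described in this tip, the following sketch reads only a block of entries from a FASTQ file. It assumes that the sample file SRR005164_1_50.fastq is available on the MATLAB path (an assumption; substitute any FASTQ file you have). The block boundaries 5 and 10 are arbitrary.

```matlab
% Sketch only: the file name is an assumption, not taken from this page.
fastqFile = 'SRR005164_1_50.fastq';

% Read entries 5 through 10 instead of loading every entry in the file.
reads = fastqread(fastqFile, 'Blockread', [5 10]);

% Each element of the returned structure array holds one entry.
numReads = numel(reads)
firstHeader = reads(1).Header
```

This approach suits files that are large but still sequentially readable; for random access by key, the BioIndexedFile object is the better fit.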
Create a BioIndexedFile Object to Access Your Source File
To construct a BioIndexedFile object from a multi-row table file:
Create a variable containing the full absolute path of your source file. For your source file, use the yeastgenes.sgd file, which is included with the Bioinformatics Toolbox™ software.
sourcefile = which('yeastgenes.sgd');
Use the BioIndexedFile constructor function to construct a BioIndexedFile object from the yeastgenes.sgd source file, which is a multi-row table file. Save the index file in the Current Folder. Indicate that the source file keys are in column 3. Also, indicate that the header lines in the source file are prefaced with !, so the constructor ignores them.
gene2goObj = BioIndexedFile('mrtab', sourcefile, '.', ...
    'KeyColumn', 3, 'HeaderPrefix', '!')
The BioIndexedFile constructor function constructs gene2goObj, a BioIndexedFile object, and also creates an index file with the same name as the source file, but with an IDX extension. It stores this index file in the Current Folder because you specified this location. However, the default location for the index file is the same location as the source file.
Caution
Do not modify the index file. If you modify it, you can get invalid results. Also, the constructor function cannot use a modified index file to construct future objects from the associated source file.
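The construction steps above can be sketched end to end as follows. Running the same constructor call a second time should reuse the index file already saved in the Current Folder, as described in the earlier tip. The property names shown (InputFile, IndexFile, FileFormat) are BioIndexedFile properties; the displayed values depend on your installation, so none are shown here.

```matlab
% Locate the sample multi-row table file shipped with the toolbox.
sourcefile = which('yeastgenes.sgd');

% Construct the object; the index file is written to (or reused from)
% the Current Folder, given here as '.'.
gene2goObj = BioIndexedFile('mrtab', sourcefile, '.', ...
    'KeyColumn', 3, 'HeaderPrefix', '!');

% Inspect where the source and index files live and how the file was parsed.
gene2goObj.InputFile
gene2goObj.IndexFile
gene2goObj.FileFormat
```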
Determine the Number of Entries Indexed by a BioIndexedFile Object
To determine the number of entries indexed by a BioIndexedFile object, use the NumEntries property of the BioIndexedFile object. For example, for the gene2goObj object:
gene2goObj.NumEntries
ans = 6476
Note
For a list and description of all properties of the object, see BioIndexedFile.
Retrieve Entries from Your Source File
Retrieve entries from your source file using either:
The index of the entry
The entry key
Retrieve Entries Using Indices
Use the getEntryByIndex method to retrieve a subset of entries from your source file that correspond to specified indices. For example, retrieve the first 12 entries from the yeastgenes.sgd source file:
subset_entries = getEntryByIndex(gene2goObj, 1:12);
Retrieve Entries Using Keys
Use the getEntryByKey method to retrieve a subset of entries from your source file that are associated with specified keys. For example, retrieve all entries with keys of AAC1 and AAD10 from the yeastgenes.sgd source file:
subset_entries = getEntryByKey(gene2goObj, {'AAC1' 'AAD10'});
The output subset_entries is a character vector of concatenated entries. Because the keys in the yeastgenes.sgd source file are not unique, this method returns all entries that have a key of AAC1 or AAD10.
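To see how keys and indices relate, the following sketch looks up the indices that correspond to a key and then retrieves the same entries by index. It uses the getIndexByKey method of the BioIndexedFile object, which is not described on this page, and rebuilds gene2goObj so that the block runs on its own.

```matlab
% Rebuild the object from the sample file (reuses the index file if present).
sourcefile = which('yeastgenes.sgd');
gene2goObj = BioIndexedFile('mrtab', sourcefile, '.', ...
    'KeyColumn', 3, 'HeaderPrefix', '!');

% Find the indices of every entry whose key is AAC1.
% Keys in this file are not unique, so several indices can come back.
idx = getIndexByKey(gene2goObj, 'AAC1');

% Retrieve the entries once by index and once by key for comparison.
entriesByIndex = getEntryByIndex(gene2goObj, idx);
entriesByKey = getEntryByKey(gene2goObj, 'AAC1');
```

Looking up indices first is useful when you want to count or filter matches before pulling the entry text from the source file.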
Read Entries from Your Source File
The BioIndexedFile object includes a read method, which you can use to read and parse a subset of entries from your source file. The read method parses the entries using an interpreter function specified by the Interpreter property of the BioIndexedFile object.
Set the Interpreter Property
Before using the read method, make sure the Interpreter property of the BioIndexedFile object is set appropriately.
If you constructed a BioIndexedFile object from ... | The Interpreter property ... |
---|---|
A source file with an application-specific format (FASTA, FASTQ, or SAM) | By default is a handle to a function appropriate for that file type and typically does not require you to change it. |
A source file with a table, multi-row table, or flat format | By default is [], which means the interpreter is an anonymous function in which the output is equivalent to the input. You can change this to a handle to a function that accepts a character vector of one or more concatenated entries and returns a structure or an array of structures containing the interpreted data. |
There are two ways to set the Interpreter property of the BioIndexedFile object:
When constructing the BioIndexedFile object, use the Interpreter property name/property value pair
After constructing the BioIndexedFile object, set the Interpreter property
Note
For more information on setting the Interpreter property of the object, see BioIndexedFile.
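The two ways of setting the Interpreter property can be sketched as follows. The anonymous function used here, which extracts GO identifiers with regexp, is only an illustrative interpreter; the variable names objA and objB are hypothetical.

```matlab
sourcefile = which('yeastgenes.sgd');

% Illustrative interpreter: return every GO identifier found in the entries.
goInterpreter = @(x) regexp(x, 'GO:\d+', 'match');

% Option 1: pass the interpreter as a name-value pair at construction.
objA = BioIndexedFile('mrtab', sourcefile, '.', ...
    'KeyColumn', 3, 'HeaderPrefix', '!', ...
    'Interpreter', goInterpreter);

% Option 2: construct first, then assign the property.
objB = BioIndexedFile('mrtab', sourcefile, '.', ...
    'KeyColumn', 3, 'HeaderPrefix', '!');
objB.Interpreter = goInterpreter;
```

Assigning the property after construction is convenient when you want to try several interpreters against the same indexed file without rebuilding the object.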
Read a Subset of Entries
The read method reads and parses a subset of entries that you specify using either entry indices or keys.
Example
Because the entry keys are gene names, you can quickly find all the gene ontology (GO) terms associated with a particular gene:
Set the Interpreter property of the gene2goObj BioIndexedFile object to a handle to a function that reads entries and returns only the GO terms. In this case, the interpreter is a handle to an anonymous function that accepts a character vector and extracts the substrings that consist of the characters GO: followed by digits.
gene2goObj.Interpreter = @(x) regexp(x,'GO:\d+','match')
Read only the entries that have a key of YAT2, and return their GO terms.
GO_YAT2_entries = read(gene2goObj, 'YAT2')
GO_YAT2_entries =
    'GO:0004092'    'GO:0005737'    'GO:0006066'    'GO:0006066'    'GO:0009437'
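The read method also accepts entry indices instead of keys. As a minimal sketch, assuming the same GO-term interpreter as in the example above, the following reads the first three entries by index. The variable name GO_first3 is hypothetical, and the returned terms depend on the contents of the source file, so no output is shown.

```matlab
% Rebuild the object and set the same GO-term interpreter as above.
sourcefile = which('yeastgenes.sgd');
gene2goObj = BioIndexedFile('mrtab', sourcefile, '.', ...
    'KeyColumn', 3, 'HeaderPrefix', '!');
gene2goObj.Interpreter = @(x) regexp(x, 'GO:\d+', 'match');

% Read and parse entries 1 through 3, selected by index rather than by key.
GO_first3 = read(gene2goObj, 1:3)
```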