Bioinformatics Pipeline SplitDimension

Some of the blocks in a bioinformatics pipeline operate on their input data arrays as one single input while other blocks can operate on individual elements or slices of the input data array independently. The SplitDimension property of a block input controls how to split the block input data (or input array) across multiple runs of the same block in a pipeline. In other words, SplitDimension allows you to control how to parallelize independent runs of the same block (with a different input for each run).

Specify `SplitDimension` to Select Which Input Array Dimensions to Split

By default, the values of the input array are passed unchanged (that is, there is no dimensional splitting of the input data) to the run method of the block, which means that the block runs once for all of the input data.

You can specify a vector of integers to indicate which dimensions (such as the row or column dimension) of the input array to split and pass to the block run method. By splitting the input data, you specify how many times the same block runs, each time with a different input.

For example, the bioinfo.pipeline.block.SeqTrim block can apply the same trimming operation to an array of input FASTQ files. To specify that SeqTrim runs on each input file in the array independently, set the SplitDimension property of the block input to a specific dimension (such as 1 for the row dimension or 2 for the column dimension of the array).

Specify "all" to pass all elements of the input array to the run method of the block independently. For instance, if there are n elements, the block runs n times independently.
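The two settings described above might look like the following minimal sketch. It requires Bioinformatics Toolbox, and it assumes that the FASTQ input port of the SeqTrim block is named FASTQFiles; check the block reference page for the actual port name before relying on it.

```matlab
% Minimal sketch; requires Bioinformatics Toolbox.
import bioinfo.pipeline.block.*

trimBlock = SeqTrim;

% Assumption: the FASTQ input port of SeqTrim is named FASTQFiles.
% Split along dimension 1 so that the block runs once per row of the
% input array of files.
trimBlock.Inputs.FASTQFiles.SplitDimension = 1;

% Alternatively, run the block once per element of the input array,
% whatever the shape of the array.
trimBlock.Inputs.FASTQFiles.SplitDimension = "all";
```

Leaving SplitDimension at its default keeps the single-run behavior described earlier.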

For an example of how to use SplitDimension, see Split Input SAM Files and Assemble Transcriptomes Using Bioinformatics Pipeline.

Note

If you are running the Bioinformatics Toolbox Software Support Packages (such as Bowtie2, BWA, or Cufflinks) remotely, ensure that these support packages are installed on the remote clusters where you are running the pipeline.

Provide Compatible Array Sizes

A block can have different split dimensions for each input (port), but inputs that share split dimensions must have compatible sizes. As with binary operations on MATLAB® arrays, two inputs have compatible sizes in a dimension if their sizes in that dimension are the same or if one of them is 1. An input whose size is 1 (or that is scalar) in a split dimension is implicitly expanded in that dimension to match the size of the other input. For MATLAB arrays, dimension 1 refers to the rows and dimension 2 refers to the columns.
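The analogy with binary operations can be tried directly in base MATLAB. The sketch below uses placeholder arrays of zeros, not pipeline inputs, and shows only the size rule. Note that plain array arithmetic checks every dimension, whereas a pipeline block checks only the dimensions that the inputs share as split dimensions.

```matlab
% Compatible sizes: 5-by-1 and 1-by-3 expand implicitly to 5-by-3.
X = zeros(5, 1);
Y = zeros(1, 3);
expandedSize = size(X + Y);   % [5 3]

% Incompatible sizes: 2-by-2 and 3-by-3 differ, and neither size is 1,
% so the addition throws an error.
try
    zeros(2, 2) + zeros(3, 3);
    compatible = true;
catch
    compatible = false;
end
```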

The total number of times the block runs within a pipeline is the product of the sizes of the input value in the split dimensions. For example, consider a block with two input ports X and Y. The following table shows the total number of runs (or processes) for various values of SplitDimension.

X array size	Y array size	X.`SplitDimension`	Y.`SplitDimension`	Total number of runs
1-by-1	2-by-2	[]	[]	1⨉1 = 1. This is the default (no dimensional splitting).
1-by-1	2-by-3	[]	1	2⨉1 = 2
5-by-1	1-by-3	1	2	5⨉3 = 15
2-by-2	3-by-3	2	2	0 because of dimension mismatch
2-by-3	2-by-4	2	`"all"`	0 because of dimension mismatch
3-by-1-by-4	1-by-3	`"all"`	2	3⨉3⨉4 = 36
0-by-1	1-by-1	[]	[]	1⨉1 = 1
0-by-1	1-by-1	1	[]	0 because of size 0 in dimension 1
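The rows of the table can be reproduced with a small helper written in base MATLAB. This is an illustrative model of the counting rule only: countRuns is a hypothetical function, not part of Bioinformatics Toolbox, and its argument layout (one cell per input) is an assumption made for this sketch.

```matlab
function n = countRuns(sizes, splitDims)
% countRuns  Hypothetical helper that models the run-count rule.
%   sizes     - cell array of size vectors, one per block input
%   splitDims - cell array with one entry per input: [], a row vector
%               of dimensions, or "all"

    % Resolve "all" to every dimension of the corresponding input.
    for k = 1:numel(sizes)
        if ischar(splitDims{k}) || isstring(splitDims{k})
            splitDims{k} = 1:numel(sizes{k});
        end
    end

    n = 1;   % with no split dimensions, the block runs exactly once
    for d = unique([splitDims{:}])
        % Collect the size in dimension d of every input that splits on d.
        extents = [];
        for k = 1:numel(sizes)
            if ismember(d, splitDims{k})
                sz = [sizes{k}, ones(1, d)];   % pad trailing singleton dims
                extents(end+1) = sz(d);
            end
        end

        % Sizes of 1 expand implicitly; all other sizes must agree.
        nonSingleton = unique(extents(extents ~= 1));
        if numel(nonSingleton) > 1
            n = 0;   % size mismatch in a shared split dimension
            return
        elseif isscalar(nonSingleton)
            n = n * nonSingleton;   % a size of 0 gives zero runs
        end
    end
end
```

For example, countRuns({[5 1], [1 3]}, {1, 2}) returns 15 and countRuns({[3 1 4], [1 3]}, {"all", 2}) returns 36, matching the corresponding table rows.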

Empty sizes (a size of 0) are allowed only in dimensions that are not split dimensions; a size of 0 in a split dimension results in zero runs. If no input specifies a SplitDimension, the block runs exactly once, regardless of the input array sizes. You can merge the output results from multiple block runs using cell arrays. For details, see UniformOutput.