## Reproducibility in Parallel Statistical Computations

### Issues and Considerations in Reproducing Parallel Computations

A *reproducible* computation is one that gives the same results every time it runs. Reproducibility is important for:

- Debugging: To correct an anomalous result, you need to reproduce the result.
- Confidence: When you can reproduce results, you can investigate and understand them.
- Modifying existing code: When you change existing code, you want to ensure that you do not break anything.

Generally, you do not need to ensure reproducibility for your computation. Often, when you want reproducibility, the simplest technique is to run in serial instead of in parallel. In serial computation, you can simply call the `rng` function as follows:

```matlab
s = rng  % Obtain the current state of the random stream
% run the statistical function
rng(s)   % Reset the stream to the previous state
% run the statistical function again, obtain identical results
```
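As a concrete illustration of the serial pattern, the following minimal sketch uses `rand` as a stand-in for the statistical function. The variable names `a`, `b`, and `tf` are illustrative only.

```matlab
s = rng;            % save the current generator settings
a = rand(1,5);      % stand-in for the first run of a statistical function
rng(s)              % restore the saved settings
b = rand(1,5);      % stand-in for the second run
tf = isequal(a,b);  % compare the two runs
```

Because the generator is restored to the saved state before the second call, both calls draw the same numbers.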

This section addresses the case when your function uses random numbers, and you want reproducible results in parallel. This section also addresses the case when you want the same results in parallel as in serial.

### Running Reproducible Parallel Computations

To run a Statistics and Machine Learning Toolbox™ function reproducibly:

1. Set the `UseSubstreams` option to `true` using `statset`.
2. Set the `Streams` option to a type that supports substreams: `'mlfg6331_64'` or `'mrg32k3a'`. For information on these streams, see `RandStream.list`.
3. To compute in parallel, set the `UseParallel` option to `true`.
4. To fit an ensemble in parallel using `fitcensemble` or `fitrensemble`, create a tree template with the `'Reproducible'` name-value pair set to `true`:

   ```matlab
   t = templateTree('Reproducible',true);
   ens = fitcensemble(X,Y,'Method','bag','Learners',t,...
       'Options',options);
   ```

5. Call the function with the options structure.
6. To reproduce the computation, reset the stream, then call the function again.

To understand why this technique gives reproducibility, see How Substreams Enable Reproducible Parallel Computations.

For example, to use the `'mlfg6331_64'` stream for reproducible computation:

1. Create an appropriate options structure:

   ```matlab
   s = RandStream('mlfg6331_64');
   options = statset('UseParallel',true, ...
       'Streams',s,'UseSubstreams',true);
   ```

2. Run your parallel computation. For instructions, see Quick Start Parallel Computing for Statistics and Machine Learning Toolbox.
3. Reset the random stream:

   ```matlab
   reset(s);
   ```

4. Rerun your parallel computation. You obtain identical results.

For examples of parallel computation run this reproducible way, see Reproducible Parallel Bootstrap and Train Classification Ensemble in Parallel.
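The whole procedure can be sketched end to end with `bootstrp`. This is a minimal sketch, not the referenced examples: the data `x`, the bootstrap count `nboot`, and the choice of `@mean` as the statistic are assumptions made for illustration. Running in parallel additionally assumes that Parallel Computing Toolbox is available.

```matlab
x = randn(100,1);                  % hypothetical sample data
nboot = 50;                        % hypothetical number of bootstrap samples

s = RandStream('mlfg6331_64');     % stream type that supports substreams
options = statset('UseParallel',true, ...
    'Streams',s,'UseSubstreams',true);

m1 = bootstrp(nboot,@mean,x,'Options',options);  % first run
reset(s);                                        % return the stream to its initial state
m2 = bootstrp(nboot,@mean,x,'Options',options);  % second run on the same data

sameResults = isequal(m1,m2);      % compare the two sets of bootstrap statistics
```

The same `x` is reused for both runs, so any difference between `m1` and `m2` could come only from the resampling, which the reset stream controls.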

### Parallel Statistical Computation Using Random Numbers

#### What Are Substreams?

A *substream* is a portion of a random stream that `RandStream` can access quickly. There is a number `M` such that for any positive integer `k`, `RandStream` can go to the `kM`th pseudorandom number in the stream. From that point, `RandStream` can generate the subsequent entries in the stream. Currently, `RandStream` has `M` = $2^{72}$, about 5e21, or more.

The entries in different substreams have good statistical properties, similar to the properties of entries in a single stream: independence, and lack of *k*-way correlation at various lags. The substreams are so long that you can view the substreams as being independent streams.

Two `RandStream` stream types support substreams: `'mlfg6331_64'` and `'mrg32k3a'`.
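To see the quick access in practice, the following minimal sketch sets the `Substream` property of a `RandStream` object to jump between substreams. The substream indices 3 and 7 and the variable names are arbitrary choices for illustration.

```matlab
s = RandStream('mrg32k3a');  % stream type that supports substreams

s.Substream = 3;             % jump to the start of substream 3
a = rand(s,1,4);

s.Substream = 7;             % jump to a different substream
c = rand(s,1,4);

s.Substream = 3;             % return to the start of substream 3
b = rand(s,1,4);

sameSubstream = isequal(a,b);   % draws from substream 3, taken twice
otherSubstream = isequal(a,c);  % draws from substream 3 versus substream 7
```

Returning to substream 3 repeats the values in `a`, while substream 7 yields a different sequence.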

#### How Substreams Enable Reproducible Parallel Computations

When MATLAB® performs computations in parallel with `parfor`, each worker receives loop iterations in an unpredictable order. Therefore, you cannot predict which worker gets which iteration, so you cannot determine the random numbers associated with each iteration.

Substreams allow MATLAB to tie each iteration to a particular sequence of random numbers. `parfor` gives each iteration an index. The iteration uses the index as the substream number. Since the random numbers are associated with the iterations, not with the workers, the entire computation is reproducible.

To obtain reproducible results, simply reset the stream, and all the substreams generate identical random numbers when called again. This method succeeds when all the workers use the same stream, and the stream supports substreams. This concludes the discussion of how the procedure in Running Reproducible Parallel Computations gives reproducible parallel results.
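The idea can be imitated in serial code, without a parallel pool. In this minimal sketch, two ordinary `for` loops visit the same iteration indices in different orders, standing in for the unpredictable scheduling of `parfor`. The loop length, the shuffled order, and the variable names are assumptions for illustration; this is not how the toolbox functions are implemented internally.

```matlab
s = RandStream('mlfg6331_64');   % stream type that supports substreams
n = 6;                           % hypothetical number of iterations

% Visit the iterations in natural order.
forwardDraws = zeros(n,1);
for i = 1:n
    s.Substream = i;             % tie iteration i to substream i
    forwardDraws(i) = rand(s);
end

% Visit the same iterations in a scrambled order.
shuffledDraws = zeros(n,1);
for i = [4 1 6 2 5 3]
    s.Substream = i;             % same tie, regardless of visiting order
    shuffledDraws(i) = rand(s);
end

orderIndependent = isequal(forwardDraws,shuffledDraws);
```

Each iteration draws from the substream selected by its own index, so the value stored for iteration `i` does not depend on when that iteration runs.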

#### Random Numbers on the Client or Workers

A few functions generate random numbers on the client before distributing them to parallel workers. The workers do not use random numbers, so they operate purely deterministically. For these functions, you can run a parallel computation reproducibly using any random stream type.

The functions that operate this way include:

To obtain identical results, reset the random stream on the client, or the random stream you pass to the client. For example:

```matlab
s = rng  % Obtain the current state of the random stream
% run the statistical function
rng(s)   % Reset the stream to the previous state
% run the statistical function again, obtain identical results
```

While this method enables you to run reproducibly in parallel, the results can differ from a serial computation. The reason for the difference is that `parfor` loops run in reverse order from `for` loops. Therefore, a serial computation can generate random numbers in a different order than a parallel computation. For unequivocal reproducibility, use the technique in Running Reproducible Parallel Computations.

#### Distributing Streams Explicitly

For testing or comparison using particular random number algorithms, you must set the random number generators. How do you set these generators in parallel, or initialize streams on each worker in a particular way? Or you might want to run a computation using a different sequence of random numbers than any other you have run. How can you ensure the sequence you use is statistically independent?

Parallel Statistics and Machine Learning Toolbox functions allow you to set random streams on each worker explicitly. For information on *creating* multiple streams, enter `help RandStream/create` at the command line. To create four independent streams using the `'mrg32k3a'` generator:

```matlab
s = RandStream.create('mrg32k3a','NumStreams',4,...
    'CellOutput',true);
```
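Before handing such streams to a function, it can help to inspect them. The following minimal sketch creates the set of four streams twice with the default seed and compares the first draw from each stream. The names `streamsA`, `streamsB`, `idx`, `drawsA`, and `drawsB` are illustrative only.

```matlab
% Two sets of four streams, both created with the default seed.
streamsA = RandStream.create('mrg32k3a','NumStreams',4,'CellOutput',true);
streamsB = RandStream.create('mrg32k3a','NumStreams',4,'CellOutput',true);

% Each stream records its position within the set.
idx = cellfun(@(st) st.StreamIndex, streamsA);

% First draw from every stream in each set.
drawsA = cellfun(@(st) rand(st), streamsA);
drawsB = cellfun(@(st) rand(st), streamsB);

repeatable = isequal(drawsA,drawsB);         % same seed and index, same values
distinct = numel(unique(drawsA)) == 4;       % different indices, different values
```

Streams with the same seed and stream index repeat each other, while streams with different indices within one set produce different sequences.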

Pass these streams to a statistical function using the `Streams` option. For example:

```matlab
parpool(4) % if you have at least 4 cores
s = RandStream.create('mrg32k3a','NumStreams',4,...
    'CellOutput',true); % create 4 independent streams
paroptions = statset('UseParallel',true,...
    'Streams',s); % set the 4 different streams
x = [randn(700,1); 4 + 2*randn(300,1)];
latt = -4:0.01:12;
myfun = @(X) ksdensity(X,latt);
pdfestimate = myfun(x);
B = bootstrp(200,myfun,x,'Options',paroptions);
```

This method of distributing streams gives each worker a different stream for the computation. However, it does not allow for a reproducible computation, because the workers perform the 200 bootstraps in an unpredictable order. If you want to perform a reproducible computation, use substreams as described in Running Reproducible Parallel Computations.

If you set the `UseSubstreams` option to `true`, then set the `Streams` option to a single random stream of the type that supports substreams (`'mlfg6331_64'` or `'mrg32k3a'`). This setting gives reproducible computations.