Partition a Datastore in Parallel

Partitioning a datastore in parallel, with a portion of the datastore on each worker in a parallel pool, can provide benefits in many cases:

Perform some action on only one part of the whole datastore, or on several defined parts simultaneously.
Search for specific values in the data store, with all workers acting simultaneously on their own partitions.
Perform a reduction calculation on the workers across all partitions.

Read Data from Datastore in Parallel

Open Live Script

This example shows how to use the partition function to parallelize the reading of data from a datastore. It uses a small datastore of airline data provided in MATLAB®, and finds the mean of the non-NaN values from its 'ArrDelay' column.

Serial Execution

A simple way to calculate the mean is to divide the sum of all the non-NaN values by the number of non-NaN values. The code in the sumAndCountArrivalDelay helper function does this for the datastore first in a non-parallel way.

First, delete any existing parallel pools.

delete(gcp('nocreate'));

Create a datastore from the collection of worksheets in airlinesmall_subset.xlsx and select the ArrDelay variables to import.

Use the function sumAndCountArrivalDelay to calculate the mean without any parallel execution. Use the tic and toc functions to time the execution, here and in the later parallel cases.

ds = spreadsheetDatastore(repmat({'airlinesmall_subset.xlsx'},20,1));
ds.SelectedVariableNames = 'ArrDelay';
reset(ds);
tic
  [total,count] = sumAndCountArrivalDelay(ds)

total = 3098060

count = 394940

sumtime = toc

sumtime = 36.6618

mean = total/count

mean = 7.8444

Parallel Execution

The partition function allows you to partition the datastore into smaller parts, each represented as a datastore itself. These smaller datastores work completely independently of each other, so that you can work with them inside of parallel language features such as parfor loops and spmd blocks.

Use Automatic Partitions

You can use the numpartitions function to specify the number of partitions, which is based on the datastore itself and the parallel pool size. This does not necessarily equal the number of workers in the pool. Set the number of loop iterations to the number of partitions (N).

The following code starts a parallel pool on a local cluster, then partitions the datastore among workers for iterating over the loop. This code calls the helper function parforSumAndCountArrivalDelay, which includes a parfor loop to amass the count and sum totals in parallel loop iterations.

p = parpool('Processes',4);

Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to parallel pool with 4 workers.

reset(ds);
tic
  [total,count] = parforSumAndCountArrivalDelay(ds)

total = 3098060

count = 394940

parfortime = toc

parfortime = 11.6383

mean = total/count

mean = 7.8444

Specify Number of Partitions

Rather than let the software calculate the number of partitions, you can explicitly set this value, so that the data can be appropriately partitioned to fit your algorithm. For example, to parallelize data from within an spmd block, you can specify the number of workers (spmdSize) as the number of partitions to use. The spmdSumAndCountArrivalDelay helper function uses an spmd block to perform a parallel read, and explicitly sets the number of partitions equal to the number of workers.

reset(ds);
tic
[total,count] = spmdSumAndCountArrivalDelay(ds)

total = 3098060

count = 394940

spmdtime = toc

spmdtime = 11.7520

mean = total/count

mean = 7.8444

When you are done with your computation, you can delete the current parallel pool.

delete(p);

Helper Functions

Create a helper function to amass the count and sum in a non-parallel way.

function [total,count] = sumAndCountArrivalDelay(ds)
    total = 0;
    count = 0;
    while hasdata(ds)
        data = read(ds);
        total = total + sum(data.ArrDelay,1,'OmitNaN');
        count = count + sum(~isnan(data.ArrDelay));
    end
end

Create a helper function to amass the count and sum in parallel using parfor.

function [total, count] = parforSumAndCountArrivalDelay(ds)
    N = numpartitions(ds,gcp);
    total = 0;
    count = 0;    
    parfor ii = 1:N
        % Get partition ii of the datastore.
        subds = partition(ds,N,ii);
    
        [localTotal,localCount] = sumAndCountArrivalDelay(subds);
        total = total + localTotal;
        count = count + localCount;
    end
end

Create a helper function to amass the count and sum in parallel using spmd.

function [total,count] = spmdSumAndCountArrivalDelay(ds)
    spmd
        subds = partition(ds,spmdSize,spmdIndex);
        [total,count] = sumAndCountArrivalDelay(subds);    
    end
    total = sum([total{:}]);
    count = sum([count{:}]);
end