Partition a Datastore in Parallel
Partitioning a datastore in parallel, with a portion of the datastore on each worker in a parallel pool, can provide benefits in many cases:
Perform some action on only one part of the whole datastore, or on several defined parts simultaneously.
Search for specific values in the datastore, with all workers acting simultaneously on their own partitions.
Perform a reduction calculation on the workers across all partitions.
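As a rough illustration of the search case, the sketch below looks for values above a threshold, with each parfor iteration scanning its own partition. The arrayDatastore (available in R2020b and later), the sample data, the threshold, and the partition count are all assumptions chosen to keep the sketch self-contained; they are not part of the airline example that follows.

```matlab
% Hypothetical in-memory data standing in for a real datastore.
ads = arrayDatastore((1:1000)','OutputType','same');

threshold = 990;   % hypothetical search criterion
N = 4;             % hypothetical number of partitions

found = [];
parfor ii = 1:N
    % Each iteration searches only its own partition.
    subds = partition(ads,N,ii);
    data = readall(subds);
    % Concatenation is a supported parfor reduction.
    found = [found; data(data > threshold)];
end
```

Without a parallel pool the loop still runs, only serially, so the sketch can be tried before any pool is started.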
Read Data from Datastore in Parallel
This example shows how to use the partition function to parallelize the reading of data from a datastore. It uses a small datastore of airline data provided in MATLAB®, and finds the mean of the non-NaN values from its 'ArrDelay' column.
Serial Execution
A simple way to calculate the mean is to divide the sum of all the non-NaN values by the number of non-NaN values. The code in the sumAndCountArrivalDelay helper function does this for the datastore first in a non-parallel way.
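To make the sum-and-count approach concrete, here is a minimal sketch on a made-up vector (the values are illustrative and not taken from the airline data). It also compares the result with the 'omitnan' flag of the built-in mean function.

```matlab
% Hypothetical sample data with missing entries.
x = [4; NaN; 10; NaN; 1];

total = sum(x,'omitnan');    % sum of the non-NaN values
count = sum(~isnan(x));      % number of non-NaN values
avg = total/count;           % mean of the non-NaN values

% The built-in flag gives the same result for in-memory data.
avgBuiltin = mean(x,'omitnan');
```

Keeping the sum and the count separate matters for the datastore case: partial sums and counts from different reads or partitions can be added together, whereas partial means cannot be combined so directly.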
First, delete any existing parallel pools.
delete(gcp('nocreate'));
Create a datastore from the collection of worksheets in airlinesmall_subset.xlsx and select the ArrDelay variable to import.
Use the function sumAndCountArrivalDelay to calculate the mean without any parallel execution. Use the tic and toc functions to time the execution, here and in the later parallel cases.
ds = spreadsheetDatastore(repmat({'airlinesmall_subset.xlsx'},20,1));
ds.SelectedVariableNames = 'ArrDelay';
reset(ds);
tic
[total,count] = sumAndCountArrivalDelay(ds)
total = 3098060
count = 394940
sumtime = toc
sumtime = 36.6618
mean = total/count
mean = 7.8444
Parallel Execution
The partition
function allows you to partition the datastore into smaller parts, each represented as a datastore itself. These smaller datastores work completely independently of each other, so that you can work with them inside of parallel language features such as parfor
loops and spmd
blocks.
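The following sketch shows the partition call on its own, outside any parallel construct. It assumes an arrayDatastore (R2020b or later) as a stand-in for the spreadsheet datastore so that no data files are needed; the data and the partition count are hypothetical.

```matlab
% Hypothetical in-memory data standing in for a real datastore.
ads = arrayDatastore((1:10)','OutputType','same');

N = 3;                 % hypothetical number of partitions
parts = cell(N,1);
for ii = 1:N
    % partition(ds,N,ii) returns the ii-th of N parts as its own datastore.
    subds = partition(ads,N,ii);
    parts{ii} = readall(subds);
end

% Taken together, the partitions cover every row exactly once.
allRows = sort(vertcat(parts{:}));
```

How the rows are divided among the partitions is up to the datastore, so the sketch checks only that nothing is lost or duplicated.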
Use Automatic Partitions
You can use the numpartitions
function to specify the number of partitions, which is based on the datastore itself and the parallel pool size. This does not necessarily equal the number of workers in the pool. Set the number of loop iterations to the number of partitions (N
).
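A minimal sketch of this pattern follows. It assumes an arrayDatastore (R2020b or later) with made-up data, and it uses gcp('nocreate') so that it does not start a pool on its own; when no pool is running, it falls back to the one-argument form of numpartitions.

```matlab
% Hypothetical in-memory data standing in for a real datastore.
ads = arrayDatastore((1:100)','OutputType','same');

pool = gcp('nocreate');            % empty if no pool is running
if isempty(pool)
    N = numpartitions(ads);        % based on the datastore alone
else
    N = numpartitions(ads,pool);   % also accounts for the pool size
end

% One loop iteration per partition.
total = 0;
parfor ii = 1:N
    subds = partition(ads,N,ii);
    total = total + sum(readall(subds));
end
```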
The following code starts a parallel pool on a local cluster, then partitions the datastore among workers for iterating over the loop. This code calls the helper function parforSumAndCountArrivalDelay, which includes a parfor loop to amass the count and sum totals in parallel loop iterations.
p = parpool('Processes',4);
Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to parallel pool with 4 workers.
reset(ds);
tic
[total,count] = parforSumAndCountArrivalDelay(ds)
total = 3098060
count = 394940
parfortime = toc
parfortime = 11.6383
mean = total/count
mean = 7.8444
Specify Number of Partitions
Rather than let the software calculate the number of partitions, you can explicitly set this value, so that the data can be appropriately partitioned to fit your algorithm. For example, to read data in parallel from within an spmd block, you can specify the number of workers (spmdSize) as the number of partitions to use. The spmdSumAndCountArrivalDelay helper function uses an spmd block to perform a parallel read, and explicitly sets the number of partitions equal to the number of workers.
reset(ds);
tic
[total,count] = spmdSumAndCountArrivalDelay(ds)
total = 3098060
count = 394940
spmdtime = toc
spmdtime = 11.7520
mean = total/count
mean = 7.8444
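For a rough sense of the gain, the elapsed times reported above can be turned into speedup ratios. The numbers below are simply the timings from this example; actual timings depend on the hardware, the pool, and the data, so treat the ratios as indicative only.

```matlab
% Elapsed times (in seconds) reported in this example.
sumtime = 36.6618;       % serial
parfortime = 11.6383;    % parfor, 4 workers
spmdtime = 11.7520;      % spmd, 4 workers

parforSpeedup = sumtime/parfortime;   % about 3.15
spmdSpeedup = sumtime/spmdtime;       % about 3.12
```

Both ratios are below 4, the number of workers in the pool used for these runs.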
When you are done with your computation, you can delete the current parallel pool.
delete(p);
Helper Functions
Create a helper function to amass the count and sum in a non-parallel way.
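function [total,count] = sumAndCountArrivalDelay(ds)
    total = 0;
    count = 0;
    % Read the datastore one block at a time and accumulate the results.
    while hasdata(ds)
        data = read(ds);
        % Sum of the non-NaN arrival delays in this block.
        total = total + sum(data.ArrDelay,1,'OmitNaN');
        % Number of non-NaN arrival delays in this block.
        count = count + sum(~isnan(data.ArrDelay));
    end
end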
Create a helper function to amass the count and sum in parallel using parfor.
function [total,count] = parforSumAndCountArrivalDelay(ds)
    % Number of partitions, based on the datastore and the current pool.
    N = numpartitions(ds,gcp);
    total = 0;
    count = 0;
    parfor ii = 1:N
        % Get partition ii of the datastore.
        subds = partition(ds,N,ii);
        [localTotal,localCount] = sumAndCountArrivalDelay(subds);
        % total and count are parfor reduction variables.
        total = total + localTotal;
        count = count + localCount;
    end
end
Create a helper function to amass the count and sum in parallel using spmd.
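function [total,count] = spmdSumAndCountArrivalDelay(ds)
    spmd
        % Each worker reads the partition that matches its own index.
        subds = partition(ds,spmdSize,spmdIndex);
        [total,count] = sumAndCountArrivalDelay(subds);
    end
    % Outside the spmd block, total and count are Composite objects;
    % combine the per-worker values.
    total = sum([total{:}]);
    count = sum([count{:}]);
end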
See Also
datastore | spreadsheetDatastore