Run mapreduce on a Parallel Pool
Start Parallel Pool
If you have Parallel Computing Toolbox™ installed, execution of mapreduce
can open a parallel pool on the cluster specified by your
default profile, for use as the execution environment.
You can set your parallel preferences so that a pool does not automatically open. In this case, you must explicitly start a pool if you want mapreduce to use it for parallelization of its work. To learn more about parallel preferences, see Specify Your Parallel Preferences.
For example, this conceptual code starts a pool with 12 workers. Then it sets the execution environment to the pool using mapreducer, which creates the MapReducer object mr. Finally, it uses mr to run mapreduce on the transformed datastore tds.
p = parpool('Processes',12);
mr = mapreducer(p);
outds = mapreduce(tds,@MeanDistMapFun,@MeanDistReduceFun,mr)
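If a pool might already be open, starting a second one with parpool produces an error. A minimal sketch of a guard, assuming you want to reuse any existing pool regardless of its size, and using the 'Processes' profile name from the code above (the profile name may differ in your release):

```matlab
% Reuse the current pool if one is open; otherwise start one.
% gcp('nocreate') returns an empty array when no pool exists.
p = gcp('nocreate');
if isempty(p)
    p = parpool('Processes',12);   % worker count taken from the example above
end

% Point mapreduce at the pool, as in the conceptual code above.
mr = mapreducer(p);
```

The resulting mr can then be passed as the last argument to mapreduce in the same way as before.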
Note
mapreduce can run on any cluster that supports parallel pools. The examples in this topic use a local cluster, which works for all Parallel Computing Toolbox installations.
Compare Parallel mapreduce
The following example calculates the mean arrival delay from a datastore of airline data. First it runs mapreduce in the MATLAB client session, then it runs in parallel on a local cluster. The mapreducer function explicitly controls the execution environment.
Begin by starting a parallel pool on a local cluster.
p = parpool('Processes',4);
Starting parallel pool (parpool) using the 'Processes' profile ... connected to 4 workers.
Create two MapReducer objects for specifying the different execution environments for mapreduce.
inMatlab = mapreducer(0);
inPool = mapreducer(p);
Create and preview the datastore. The data set used in this example is available in matlabroot/toolbox/matlab/demos.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);
preview(ds)
    ArrDelay
    ________

        8
        8
       21
       13
        4
       59
        3
       11
Next, run the mapreduce calculation in the MATLAB® client session. The map and reduce functions are available in matlabroot/toolbox/matlab/demos.
meanDelay = mapreduce(ds,@meanArrivalDelayMapper,...
@meanArrivalDelayReducer,inMatlab);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  10% Reduce   0%
Map  20% Reduce   0%
Map  30% Reduce   0%
Map  40% Reduce   0%
Map  50% Reduce   0%
Map  60% Reduce   0%
Map  70% Reduce   0%
Map  80% Reduce   0%
Map  90% Reduce   0%
Map 100% Reduce 100%
readall(meanDelay)
           Key             Value
    __________________    ________

    'MeanArrivalDelay'    [7.1201]
Then, run the calculation on the current parallel pool. Note that the output text indicates a parallel mapreduce.
meanDelay = mapreduce(ds,@meanArrivalDelayMapper,...
@meanArrivalDelayReducer,inPool);
Parallel mapreduce execution on the parallel pool:
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map 100% Reduce  50%
Map 100% Reduce 100%
readall(meanDelay)
           Key             Value
    __________________    ________

    'MeanArrivalDelay'    [7.1201]
With this relatively small data set, the parallel pool is unlikely to improve performance. The purpose of this example is to show the mechanism for running mapreduce on a parallel pool. As the data set grows, or as the map and reduce functions themselves become more computationally intensive, you might expect to see improved performance with the parallel pool, compared to running mapreduce in the MATLAB client session.
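One way to check whether the pool helps for a given workload is to time both runs. The following is a rough sketch, assuming airlinesmall.csv and the meanArrivalDelay map and reduce functions are on the path as in the example above; the variable names tClient and tPool are introduced here for illustration only, and single tic/toc measurements are noisy, so treat the numbers as indicative.

```matlab
% Set up the datastore and both execution environments as in the example.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);
p = gcp('nocreate');                 % reuse an open pool if there is one
if isempty(p)
    p = parpool('Processes',4);
end
inMatlab = mapreducer(0);
inPool = mapreducer(p);

% Time the run in the MATLAB client session (progress display turned off).
tic
mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,...
    inMatlab,'Display','off');
tClient = toc;

% Time the same calculation on the parallel pool.
tic
mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,...
    inPool,'Display','off');
tPool = toc;

fprintf('Client: %.2f s, pool: %.2f s\n',tClient,tPool);
```

For a data set this small, the pool run may well be slower because of the overhead of distributing the work.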
Note
When running parallel mapreduce on a cluster, the order of the key-value pairs in the output is different from the order you get when running mapreduce in MATLAB. If your application depends on the arrangement of data in the output, you must sort the data according to your own requirements.
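As an illustration of sorting the output, the sketch below builds a small table in the Key/Value layout that readall returns and orders it by key with sortrows. The keys and values are made up for this illustration and do not come from the airline data; with real output you would start from the result of readall instead.

```matlab
% Hypothetical key-value output in arbitrary order (illustrative data only).
result = table({'UA';'AA';'DL'},{12.3;7.1;9.8},...
    'VariableNames',{'Key','Value'});

% Sort the rows alphabetically by key so the order no longer depends
% on the execution environment.
sorted = sortrows(result,'Key');
disp(sorted)
```

Sorting by key is only one option; an application that depends on a different arrangement can sort on whatever criterion it needs.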