Getting Started with MapReduce
As the number and type of data acquisition devices grow annually, the sheer size and rate of data being collected are rapidly expanding. These big data sets can contain gigabytes or terabytes of data, and can grow on the order of megabytes or gigabytes per day. While the collection of this information presents opportunities for insight, it also presents many challenges. Most algorithms are not designed to process big data sets in a reasonable amount of time or with a reasonable amount of memory. MapReduce allows you to meet many of these challenges to gain important insights from large data sets.
What Is MapReduce?
MapReduce is a programming technique for analyzing data sets that do not fit in memory. You may be familiar with Hadoop® MapReduce, which is a popular implementation that works with the Hadoop Distributed File System (HDFS™). MATLAB® provides a slightly different implementation of the MapReduce technique with the mapreduce function.

mapreduce uses a datastore to process data in small blocks that individually fit into memory. Each block goes through a Map phase, which formats the data to be processed. Then the intermediate data blocks go through a Reduce phase, which aggregates the intermediate results to produce a final result. The Map and Reduce phases are encoded by map and reduce functions, which are primary inputs to mapreduce. There are endless combinations of map and reduce functions to process data, so this technique is both flexible and extremely powerful for tackling large data processing tasks.

mapreduce lends itself to being extended to run in several environments. For more information about these capabilities, see Speed Up and Deploy MapReduce Using Other Products.
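For example, if Parallel Computing Toolbox™ is available, one sketch of running mapreduce on a parallel pool looks like the following, where ds is a datastore and the map and reduce function names are placeholders:

mapreducer(gcp);                                  % run mapreduce on the current parallel pool (requires Parallel Computing Toolbox)
outds = mapreduce(ds, @myMapFun, @myReduceFun);   % same call as in the serial case
mapreducer(0);                                    % return to serial execution in the local MATLAB session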
The utility of the mapreduce function lies in its ability to perform calculations on large collections of data. Thus, mapreduce is not well-suited for performing calculations on normal sized data sets which can be loaded directly into computer memory and analyzed with traditional techniques. Instead, use mapreduce to perform a statistical or analytical calculation on a data set that does not fit in memory.
Each call to the map or reduce function by mapreduce is independent of all others. For example, a call to the map function cannot depend on inputs or results from a previous call to the map function. It is best to break up such calculations into multiple calls to mapreduce.
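If a calculation does require results from an earlier pass over the data, one option is to chain two mapreduce calls, using the output datastore of the first call as the input datastore of the second. The function names in this sketch are placeholders:

% First pass produces an intermediate result, such as a global total.
firstds = mapreduce(ds, @firstPassMapFun, @firstPassReduceFun);

% Second pass reads the key-value pairs stored by the first pass.
finalds = mapreduce(firstds, @secondPassMapFun, @secondPassReduceFun);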
MapReduce Algorithm Phases
mapreduce moves each block of data in the input datastore through several phases before reaching the final output. The following figure outlines the phases of the algorithm for mapreduce.
The algorithm has the following steps:

1. mapreduce reads a block of data from the input datastore using [data,info] = read(ds), and then calls the map function to work on that block.

2. The map function receives the block of data, organizes it or performs a precursory calculation, and then uses the add and addmulti functions to add key-value pairs to an intermediate data storage object called a KeyValueStore. The number of calls to the map function by mapreduce is equal to the number of blocks in the input datastore.

3. After the map function works on all of the blocks of data in the datastore, mapreduce groups all of the values in the intermediate KeyValueStore object by unique key.

4. Next, mapreduce calls the reduce function once for each unique key added by the map function. Each unique key can have many associated values. mapreduce passes the values to the reduce function as a ValueIterator object, which is an object used to iterate over the values. The ValueIterator object for each unique key contains all the associated values for that key.

5. The reduce function uses the hasnext and getnext functions to iterate through the values in the ValueIterator object one at a time. Then, after aggregating the intermediate results from the map function, the reduce function adds final key-value pairs to the output using the add and addmulti functions. The order of the keys in the output is the same as the order in which the reduce function adds them to the final KeyValueStore object. That is, mapreduce does not explicitly sort the output. (A sketch illustrating the per-key grouping in steps 3–5 follows the note below.)

Note

The reduce function writes the final key-value pairs to a final KeyValueStore object. From this object, mapreduce pulls the key-value pairs into the output datastore, which is a KeyValueDatastore object by default.
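To see how this per-key grouping plays out when the map function adds more than one key, consider a sketch that counts flights per airline carrier. It assumes the tabular data has a UniqueCarrier variable; the function names are placeholders, and each function would be saved in its own file:

function CountFlightsMapFun(data, info, intermKVStore)
    % Count the rows for each carrier that appears in this block, then
    % add one key-value pair per carrier to the intermediate KeyValueStore.
    [carriers, ~, idx] = unique(data.UniqueCarrier);
    counts = accumarray(idx, 1);
    addmulti(intermKVStore, carriers, num2cell(counts));
end

function CountFlightsReduceFun(intermKey, intermValIter, outKVStore)
    % Total the per-block counts for the active carrier key.
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, intermKey, total);
end

In this sketch, mapreduce calls the reduce function once per carrier, because each carrier code is a distinct key in the intermediate KeyValueStore.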
Example MapReduce Calculation
This example uses a simple calculation (the mean travel distance in a set of flight data) to illustrate the steps needed to run mapreduce.
Prepare Data
The first step to using mapreduce is to construct a datastore for the data set. Along with the map and reduce functions, the datastore for a data set is a required input to mapreduce, since it allows mapreduce to process the data in blocks.

mapreduce works with most types of datastores. For example, create a TabularTextDatastore object for the airlinesmall.csv data set.
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA')
ds =

  TabularTextDatastore with properties:

                      Files: {
                             ' ...\matlab\toolbox\matlab\demos\airlinesmall.csv'
                             }
                    Folders: {
                             ' ...\matlab\toolbox\matlab\demos'
                             }
               FileEncoding: 'UTF-8'
   AlternateFileSystemRoots: {}
      PreserveVariableNames: false
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
             DatetimeLocale: en_US

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: 'NA'
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%f', '%f' ... and 26 more}
                   TextType: 'char'
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
            SelectedFormats: {'%f', '%f', '%f' ... and 26 more}
                   ReadSize: 20000 rows
                 OutputType: 'table'
                   RowTimes: []

  Write-specific Properties:
     SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq"]
        DefaultOutputFormat: "txt"
Several of the previously described options are useful in the context of mapreduce. The mapreduce function executes read on the datastore to retrieve data to pass to the map function. Therefore, you can use the SelectedVariableNames, SelectedFormats, and ReadSize options to directly configure the block size and type of data that mapreduce passes to the map function.
For example, to select the Distance (total flight distance) variable as the only variable of interest, specify SelectedVariableNames.
ds.SelectedVariableNames = 'Distance';
Now, whenever the read, readall, or preview functions act on ds, they will return only information for the Distance variable.
To confirm this, you can preview the first few rows of data in the datastore. This allows you to examine the format of the data that the mapreduce function will pass to the map function.
preview(ds)
ans =

  8×1 table

    Distance
    ________

      308
      296
      480
      296
      373
      308
      447
      954
To view the exact data that mapreduce will pass to the map function, use read.
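For instance, a quick interactive check might look like the following sketch. The block size of 1000 rows is an arbitrary value chosen for illustration; any positive integer works:

ds.ReadSize = 1000;        % number of rows in each block that mapreduce passes to the map function
[data, info] = read(ds);   % data is a table containing only the Distance variable; info describes the block
reset(ds);                 % rewind the datastore to the first block before calling mapreduce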
For additional information and a complete summary of the available options, see Datastore.
Write Map and Reduce Functions
The mapreduce function automatically calls the map and reduce functions during execution, so these functions must meet certain requirements to run properly.
The inputs to the map function are data, info, and intermKVStore:

- data and info are the result of a call to the read function on the input datastore, which mapreduce executes automatically before each call to the map function.

- intermKVStore is the name of the intermediate KeyValueStore object to which the map function needs to add key-value pairs. The add and addmulti functions use this object name to add key-value pairs. If none of the calls to the map function add key-value pairs to intermKVStore, then mapreduce does not call the reduce function and the resulting datastore is empty.
A simple example of a map function is:
function MeanDistMapFun(data, info, intermKVStore)
    distances = data.Distance(~isnan(data.Distance));
    sumLenValue = [sum(distances) length(distances)];
    add(intermKVStore, 'sumAndLength', sumLenValue);
end
This map function has only three lines, which perform some straightforward roles. The first line filters out all NaN values in the block of distance data. The second line creates a two-element vector with the total distance and count for the block, and the third line adds that vector of values to intermKVStore with the key 'sumAndLength'. After this map function runs on all of the blocks of data in ds, the intermKVStore object contains the total distance and count for each block of distance data.

Save this function in your current folder as MeanDistMapFun.m.

The inputs to the reduce function are intermKey, intermValIter, and outKVStore:

- intermKey is the active key added by the map function. Each call to the reduce function by mapreduce specifies a new unique key from the keys in the intermediate KeyValueStore object.

- intermValIter is the ValueIterator associated with the active key, intermKey. This ValueIterator object contains all of the values associated with the active key. Scroll through the values using the hasnext and getnext functions.

- outKVStore is the name for the final KeyValueStore object to which the reduce function needs to add key-value pairs. mapreduce takes the output key-value pairs from outKVStore and returns them in the output datastore, which is a KeyValueDatastore object by default. If none of the calls to the reduce function add key-value pairs to outKVStore, then mapreduce returns an empty datastore.
A simple example of a reduce function is:
function MeanDistReduceFun(intermKey, intermValIter, outKVStore)
    sumLen = [0 0];
    while hasnext(intermValIter)
        sumLen = sumLen + getnext(intermValIter);
    end
    add(outKVStore, 'Mean', sumLen(1)/sumLen(2));
end
This reduce function loops through each of the distance and count values in intermValIter, keeping a running total of the distance and count after each pass. After this loop, the reduce function calculates the overall mean flight distance with a simple division, and then adds a single key to outKVStore.

Save this function in your current folder as MeanDistReduceFun.m.
For information about writing more advanced map and reduce functions, see Write a Map Function and Write a Reduce Function.
Run mapreduce
After you have a datastore, a map function, and a reduce function, you can call mapreduce to perform the calculation. To calculate the average flight distance in the data set, call mapreduce using ds, MeanDistMapFun, and MeanDistReduceFun.
outds = mapreduce(ds, @MeanDistMapFun, @MeanDistReduceFun);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%
By default, the mapreduce function displays progress information at the command line and returns a KeyValueDatastore object that points to files in the current folder. You can adjust all three of these options using the Name,Value pair arguments for 'OutputFolder', 'OutputType', and 'Display'. For more information, see the reference page for mapreduce.
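For example, a sketch that turns off the progress display and writes the output files to the system temporary folder (an arbitrary choice for illustration) might look like this:

outds = mapreduce(ds, @MeanDistMapFun, @MeanDistReduceFun, ...
    'Display', 'off', ...         % suppress the progress display
    'OutputFolder', tempdir);     % store the output files in the temporary folder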
View Results
Use the readall function to read the key-value pairs from the output datastore.
readall(outds)
ans =

  1×2 table

       Key          Value
    ________    ____________

    {'Mean'}    {[702.1630]}
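To work with the numeric result rather than the table display, one option is to index into the Value variable of the returned table:

result = readall(outds);
meanDistance = result.Value{1};   % approximately 702.16, the mean flight distance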
See Also
tabularTextDatastore | mapreduce