Analyze Large Data in Database Using MapReduce
This example determines the mean arrival delay of a large set of flight data that is stored in a database. You can access large data sets using a
object with Database Toolbox™. After creating a databaseDatastore
DatabaseDatastore
object, you can write a MapReduce algorithm that defines the chunking and reduction of the data. Alternatively, you can use a tall array to run algorithms on large data sets.
The DatabaseDatastore
object does not support using a parallel pool with Parallel Computing Toolbox™ installed. To analyze data using tall arrays or run MapReduce algorithms, set the global execution environment to be the local MATLAB® session.
This example uses a preconfigured JDBC data source to create the database connection. For more information, see the
function.configureJDBCDataSource
Create DatabaseDatastore
Object
Set the global execution environment to be the local MATLAB® session.
mapreducer(0);
The file airlinesmall.csv
contains a large set of flight data. Load this file into the Microsoft® SQL Server® database table airlinesmall
. This table contains 123,523 records.
Create a database connection to the JDBC data source MSSQLServerJDBCAuth
. This data source configures a JDBC driver to a Microsoft® SQL Server® database with Windows® authentication. Specify a blank user name and password.
datasource = "MSSQLServerJDBCAuth"; username = ""; password = ""; conn = database(datasource,username,password);
Create a DatabaseDatastore
object using the database connection and an SQL query. This SQL query retrieves arrival-delay data from the airlinesmall
table.
sqlquery = 'select ArrDelay from airlinesmall';
dbds = databaseDatastore(conn,sqlquery);
Define Mapper and Reducer Functions
To process large data sets in chunks, you can write your own mapper function. For each chunk in this example, use meanArrivalDelayMapper.m
to:
Read arrival-delay data from the
DatabaseDatastore
object.Determine the number of delays and the total delay in the chunk.
Store both values in
KeyValueDatastore
.
The meanArrivalDelayMapper.m
file contains this code.
function meanArrivalDelayMapper (data, info, intermKVStore) % Mapper function for the MeanMapReduceExample. % Copyright 2014 The MathWorks, Inc. % Data is an n-by-1 table of the ArrDelay. Remove missing value first: data(isnan(data.ArrDelay),:) = []; % Record the partial counts and sums and the reducer will accumulate them. partCountSum = [length(data.ArrDelay), sum(data.ArrDelay)]; add(intermKVStore, 'PartialCountSumDelay',partCountSum);
You also can write your own reducer function. In this example, use meanArrivalDelayReducer.m
to read intermediate values for the number of delays and the total arrival delay. Then, determine the overall mean arrival delay. mapreduce
calls this reducer function only once because the mapper function adds just one key to KeyValueStore
. The meanArrivalDelayReducer.m
file contains this code.
function meanArrivalDelayReducer(intermKey, intermValIter, outKVStore) % Reducer function for the MeanMapReduceExample. % Copyright 2014 The MathWorks, Inc. % intermKey is 'PartialCountSumDelay' count = 0; sum = 0; while hasnext(intermValIter) countSum = getnext(intermValIter); count = count + countSum(1); sum = sum + countSum(2); end meanDelay = sum/count; % The key-value pair added to outKVStore will become the output of mapreduce add(outKVStore,'MeanArrivalDelay',meanDelay);
Run MapReduce Using Mapper and Reducer Functions
To determine the mean arrival delay in the flight data, run MapReduce with the DatabaseDatastore
object, mapper function, and reducer function.
outds = mapreduce(dbds,@meanArrivalDelayMapper,@meanArrivalDelayReducer);
******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 15% Reduce 0% Map 30% Reduce 0% Map 46% Reduce 0% Map 61% Reduce 0% Map 76% Reduce 0% Map 92% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 100%
Display Output from MapReduce
Read the table from the output datastore using readall
.
outtab = readall(outds)
outtab = 1×2 table Key Value __________________ ________ 'MeanArrivalDelay' [7.1201]
The table has only one row containing one key-value pair.
Display the mean arrival delay from the table.
meanArrDelay = outtab.Value{1}
meanArrDelay = 7.1201
Close DatabaseDatastore
Object and Database Connection
close(dbds)
See Also
setdbprefs
| database
| databaseDatastore
| readall
| close
| mapreduce
| TabularTextDatastore
Related Topics
- Analyze Large Data in Database Using Tall Arrays
- Compute Mean Value with MapReduce
- Import Large Data Using DatabaseDatastore Object
- Getting Started with MapReduce