Tall Skinny QR (TSQR) Matrix Factorization Using MapReduce
This example shows how to compute a tall skinny QR (TSQR) factorization using mapreduce. It demonstrates how to chain mapreduce calls to perform multiple iterations of factorizations, and uses the info argument of the map function to compute numeric keys.
Prepare Data
Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, the variables of interest are ArrDelay (flight arrival delay), DepDelay (flight departure delay), and Distance (total flight distance).
ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.ReadSize = 1000;
ds.SelectedVariableNames = {'ArrDelay', 'DepDelay', 'Distance'};
The datastore treats 'NA' values as missing and replaces the missing values with NaN values by default. The ReadSize property lets you specify how to partition the data into blocks. Additionally, the SelectedVariableNames property allows you to work with only the specified variables of interest, which you can verify using preview.
preview(ds)
ans=8×3 table

    ArrDelay    DepDelay    Distance
    ________    ________    ________

        8          12          308
        8           1          296
       21          20          480
       13          12          296
        4          -1          373
       59          63          308
        3          -2          447
       11          -1          954
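To see the block size that mapreduce passes to the map function, you can also read one block directly (a quick check; reset the datastore afterward so that later operations start from the beginning of the file).

t = read(ds);    % one block of up to ReadSize (1000) rows of the selected variables
size(t)
reset(ds)        % rewind the datastore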
Chain MapReduce Calls
The implementation of the multi-iteration TSQR algorithm needs to chain consecutive mapreduce calls. To demonstrate the general chaining design pattern, this example uses two mapreduce iterations. The output from the map function calls is passed into a large set of reducers, and then the output of these reducers becomes the input for the next mapreduce iteration.
First MapReduce Iteration
In the first iteration, the map function, tsqrMapper, receives one block (the ith) of data, which is a table of size Ni-by-3, where Ni is the number of rows in the block. The mapper computes the R matrix of this block of data and stores it as an intermediate result. Then, mapreduce aggregates the intermediate results by unique key before sending them to the reduce function. Thus, mapreduce sends all intermediate R matrices with the same key to the same reducer.
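The following snippet is a minimal in-memory sketch of why this works, using synthetic data rather than the mapreduce workflow: the R factor of the vertically stacked per-block R factors matches the R factor of the full matrix, up to the sign of each row.

% Minimal in-memory sketch of the TSQR idea on synthetic data (illustrative
% only; not part of the mapreduce workflow in this example).
A  = randn(1000,3);                  % tall skinny matrix
A1 = A(1:500,:);   A2 = A(501:end,:);
[~, R1] = qr(A1,0);                  % per-block R factors (map step)
[~, R2] = qr(A2,0);
[~, Rtsqr] = qr([R1; R2],0);         % R factor of the stacked R factors (reduce step)
[~, Rfull] = qr(A,0);                % direct factorization for comparison
norm(abs(Rtsqr) - abs(Rfull))        % near round-off; rows of R agree up to sign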
Since the reducer uses qr, which is an in-memory MATLAB® function, it is best to first make sure that the R matrices fit in memory. This example divides the data set into eight partitions. The mapreduce function reads the data in blocks and passes the data along with some meta information to the map function. The info input argument is the second input to the map function and it contains the read offset and file size information that are necessary to generate the key,

key = ceil(offset/(fileSize/numPartitions)).
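As an illustration with made-up numbers (the actual offsets depend on ReadSize and the underlying file), a block read 5 MB into the 12 MB file would map to partition 4 of 8.

% Illustrative key computation with hypothetical numbers.
fileSize = 12e6;  numPartitions = 8;  offset = 5e6;
key = ceil(offset/(fileSize/numPartitions))   % ceil(5/1.5) = 4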
Display the map function file.
function tsqrMapper(data, info, intermKVStore)
    x = data{:,:};
    x(any(isnan(x),2),:) = [];  % Remove missing values
    [~, r] = qr(x,0);

    % intermKey = randi(4); % random integer key for partitioning intermediate results
    intermKey = computeKey(info, 8);
    add(intermKVStore, intermKey, r);

    function key = computeKey(info, numPartitions)
        fileSize = info.FileSize;               % total size of the underlying data file
        partitionSize = fileSize/numPartitions; % size in bytes of each partition
        offset = info.Offset;                   % offset in bytes of the current read
        key = ceil(offset/partitionSize);
    end
end
The reduce function receives a list of the intermediate R matrices, vertically concatenates them, and computes the R matrix of the concatenated matrix.
Display the reduce function file.
function tsqrReducer(intermKey, intermValIter, outKVStore)
    x = [];
    while (intermValIter.hasnext)
        x = [x; intermValIter.getnext];
    end
    % Note that this approach assumes the concatenated intermediate values fit
    % in memory. Consider increasing the number of reduce tasks (increasing the
    % number of partitions in the tsqrMapper) and adding more iterations if it
    % does not fit in memory.
    [~, r] = qr(x,0);

    add(outKVStore, intermKey, r);
end
Use mapreduce to apply the map and reduce functions to the datastore, ds.
outds1 = mapreduce(ds, @tsqrMapper, @tsqrReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  10% Reduce   0%
Map  20% Reduce   0%
Map  30% Reduce   0%
Map  40% Reduce   0%
Map  50% Reduce   0%
Map  60% Reduce   0%
Map  70% Reduce   0%
Map  80% Reduce   0%
Map  90% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce  11%
Map 100% Reduce  22%
Map 100% Reduce  33%
Map 100% Reduce  44%
Map 100% Reduce  56%
Map 100% Reduce  67%
Map 100% Reduce  78%
Map 100% Reduce  89%
Map 100% Reduce 100%
mapreduce returns an output datastore, outds1, with files in the current folder.
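Before chaining the second iteration, you can optionally read the intermediate results back into memory to inspect them (readall also works on the key-value datastore that mapreduce returns).

% Optional inspection of the intermediate results (one key-value pair per reducer call).
interm = readall(outds1);   % table with Key and Value variables
interm.Value{1}             % R matrix contributed by the first partition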
Second MapReduce Iteration
The second iteration uses the output of the first iteration, outds1, as its input. This iteration uses an identity mapper, identityMapper, which simply copies over the data using a single key, 'Identity'.
Display the identity mapper file.
function identityMapper(data, info, intermKVStore)
    % This mapper function simply copies the data and adds them to the
    % intermKVStore as intermediate values.
    x = data.Value{:,:};
    add(intermKVStore, 'Identity', x);
end
The reducer function is the same in both iterations. The use of a single key by the map function means that mapreduce only calls the reduce function once in the second iteration.
Use mapreduce to apply the identity mapper and the same reducer to the output from the first mapreduce call.
outds2 = mapreduce(outds1, @identityMapper, @tsqrReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  11% Reduce   0%
Map  22% Reduce   0%
Map  33% Reduce   0%
Map  44% Reduce   0%
Map  55% Reduce   0%
Map  66% Reduce   0%
Map  77% Reduce   0%
Map  88% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%
View Results
Read the final results from the output datastore.
r = readall(outds2);
r.Value{:}
ans = 3×3
10^5 ×

    0.1091    0.0893    0.5564
         0   -0.0478   -0.4890
         0         0    3.0130
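As a sanity check, which is not part of the original workflow, you can compare this result with an in-memory factorization, since the 12-megabyte data set fits in memory on most machines. The two R factors agree up to the sign of each row.

% In-memory comparison (a minimal sketch; assumes the full data set fits in memory).
reset(ds);                          % rewind the datastore to the beginning of the file
tbl = readall(ds);                  % read all selected variables into memory
x = tbl{:,:};
x(any(isnan(x),2),:) = [];          % remove rows with missing values, as in tsqrMapper
[~, rInMemory] = qr(x,0);           % in-memory R factor
rTSQR = r.Value{:};                 % R factor computed by the chained mapreduce calls
norm(abs(rInMemory) - abs(rTSQR))   % small; rows of R can differ only in sign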
References
Constantine, Paul G., and David F. Gleich. “Tall and Skinny QR Factorizations in MapReduce Architectures.” Proceedings of the Second International Workshop on MapReduce and Its Applications, ACM, 2011, pp. 43–50. https://dl.acm.org/doi/10.1145/1996092.1996103
See Also
mapreduce | tabularTextDatastore