Analyze Data Using MDF Datastore and Tall Arrays
This example shows how to work with a big data set using tall arrays and the MDF datastore feature. Tall arrays are commonly used to perform calculations on different types of data that do not fit in memory.
This example first operates on a small subset of data and then scales up to analyze the entire data set. Although the data set used here might not represent the actual size in real-world applications, the same analysis technique can scale up further to work on data sets so large that they cannot be read into memory.
To learn more about tall arrays, see the example Analyze Big Data in MATLAB Using Tall Arrays.
Introduction to Tall Arrays
Tall arrays and tall tables are used to work with out-of-memory data that has any number of rows. Using tall arrays and tables, you can work with large data sets in a manner similar to in-memory MATLAB arrays.
The difference is that tall arrays typically remain unevaluated until the calculations are requested to be performed. This deferred evaluation enables MATLAB to combine the queued calculations where possible and take the minimum number of passes through the data.
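The deferred evaluation described above can be sketched with a small in-memory vector standing in for out-of-memory data. The variable names (x, mu, hi, wasDeferred) are illustrative and are not part of this example's data set, and the mapreducer(0) call is an optional assumption that keeps evaluation in the local MATLAB session instead of a parallel pool.

```matlab
% Minimal sketch: queue two calculations on a tall array, then evaluate both at once.
mapreducer(0);                  % assumption: run locally, without a parallel pool

x = tall((1:1000)');            % in-memory column used as a stand-in for big data
mu = mean(x);                   % queued, not yet computed
hi = max(x);                    % queued, not yet computed
wasDeferred = istall(mu);       % true: mu is still an unevaluated tall scalar

% A single gather call lets MATLAB combine the queued operations.
[mu, hi] = gather(mu, hi);      % mu is 500.5 and hi is 1000
```

Requesting both results in one gather call, rather than two separate calls, gives MATLAB the chance to share passes through the data.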
Create an MDF Datastore
An MDF datastore can be used to read and process homogeneous data stored in multiple MDF files as a single entity. If the data set is too large to fit in memory, a datastore also makes it possible to work with the data set in smaller blocks that individually fit in memory. Tall arrays extend this capability further: they let you work with out-of-memory data backed by a datastore using common functions.
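The idea of reading a data set in blocks that individually fit in memory can be illustrated with a generic datastore. This sketch uses arrayDatastore as a stand-in, not the MDF datastore itself, so that it runs without an MDF file; the block size of 4 and the variable names are arbitrary choices for illustration.

```matlab
% Minimal sketch: process a data set block by block through a datastore.
data = (1:10)';                                       % stand-in for a large data set
ds = arrayDatastore(data, "ReadSize", 4, "OutputType", "same");

total = 0;
numBlocks = 0;
while hasdata(ds)
    block = read(ds);                                 % at most 4 rows per read
    total = total + sum(block);                       % accumulate a running result
    numBlocks = numBlocks + 1;
end
% total is 55, computed from 3 blocks (4 + 4 + 2 rows)
```

Tall arrays automate this kind of loop, so the rest of the workflow can be written as ordinary array operations.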
Create an MDF datastore using the mdfDatastore function by selecting MDF file EngineData_MDF_TallArray.mf4 in the current workflow directory. This file contains time-stamped data logged from a Simulink model representing an engine plant and controller connected to a dynamometer.
mds = mdfDatastore("EngineData_MDF_TallArray.mf4")
mds =
  MDFDatastore with properties:

  Datastore Details
    Files: {
      ' ...\michellw.Bdoc24a_MDF\vnt-ex08773747\EngineData_MDF_TallArray.mf4'
    }
    ChannelGroups:
      GroupNumber AcquisitionName Comment ... and 10 more columns
      ___________ _______________ __________
      1 {[<undefined>]} {[Python]}
    Channels:
      Name GroupNumber DisplayName ... and 17 more columns
      _______________ ___________ ___________
      "EngineSpeed" 1 ""
      "EngineTorque" 1 ""
      "TorqueCommand" 1 ""
      ... and 1 more rows

  Options
    SelectedChannelNames: {
      'EngineSpeed';
      'EngineTorque';
      'TorqueCommand'
      ... and 1 more
    }
    SelectedChannelGroupNumber: 1
    ReadSize: "file"
    ReadRaw: 0
It is possible to further configure the MDF datastore to control what and how data is read from the MDF file. By default, the first channel group is selected and all channels from the group are read.
mds.SelectedChannelGroupNumber
ans = 1
mds.SelectedChannelNames
ans = 4×1 string
"EngineSpeed"
"EngineTorque"
"TorqueCommand"
"t"
Configure the MDF datastore to select only three variables of interest: EngineSpeed, TorqueCommand, and EngineTorque.
mds.SelectedChannelNames = ["EngineSpeed", "TorqueCommand", "EngineTorque"]
mds =
  MDFDatastore with properties:

  Datastore Details
    Files: {
      ' ...\michellw.Bdoc24a_MDF\vnt-ex08773747\EngineData_MDF_TallArray.mf4'
    }
    ChannelGroups:
      GroupNumber AcquisitionName Comment ... and 10 more columns
      ___________ _______________ __________
      1 {[<undefined>]} {[Python]}
    Channels:
      Name GroupNumber DisplayName ... and 17 more columns
      _______________ ___________ ___________
      "EngineSpeed" 1 ""
      "EngineTorque" 1 ""
      "TorqueCommand" 1 ""
      ... and 1 more rows

  Options
    SelectedChannelNames: {
      'EngineSpeed';
      'TorqueCommand';
      'EngineTorque'
    }
    SelectedChannelGroupNumber: 1
    ReadSize: "file"
    ReadRaw: 0
Preview the selected data using the preview function.
preview(mds)
ans=8×3 timetable
t EngineSpeed TorqueCommand EngineTorque
______________ ___________ _____________ ____________
0 sec 0 0 47.153
0 sec 2.37e-26 0 47.153
1.47e-05 sec 0.11056 47.158 47.158
8.85e-05 sec 0.66312 48.708 48.708
0.00010107 sec 0.75762 49.77 49.77
0.00010107 sec 0.75762 49.77 49.77
0.0001405 sec 1.053 39.967 39.967
0.00017993 sec 1.3482 23.143 23.143
Create Tall Array
Tall arrays are similar to in-memory MATLAB arrays, except that they can have any number of rows. Because the MDF datastore mds contains time-stamped tabular data, the tall function returns a tall timetable containing data from the datastore.
tt = tall(mds)
tt = M×3 tall timetable
t EngineSpeed TorqueCommand EngineTorque
______________ ___________ _____________ ____________
0 sec 0 0 47.153
0 sec 2.37e-26 0 47.153
1.47e-05 sec 0.11056 47.158 47.158
8.85e-05 sec 0.66312 48.708 48.708
0.00010107 sec 0.75762 49.77 49.77
0.00010107 sec 0.75762 49.77 49.77
0.0001405 sec 1.053 39.967 39.967
0.00017993 sec 1.3482 23.143 23.143
: : : :
: : : :
The display includes the first several rows of data. The timetable size may display as M×3 to indicate that the number of rows is not yet known to MATLAB.
Perform Calculations on Tall Array
You can work with tall arrays and tall tables in much the same way as in-memory MATLAB arrays and tables. However, MATLAB does not perform most operations on tall arrays immediately; it defers the actual computations until the output is requested.
It is common to work with unevaluated tall arrays and request output only when required. MATLAB does not know the content or size of an unevaluated tall array until you request that it be evaluated and displayed.
Calculate the median, minimum, and maximum values of the TorqueCommand variable. Note that the results are not immediately evaluated.
medianTorqueCommand = median(tt.TorqueCommand)
medianTorqueCommand =
tall double
?
Preview deferred. Learn more.
minTorqueCommand = min(tt.TorqueCommand)
minTorqueCommand =
tall double
?
Preview deferred. Learn more.
maxTorqueCommand = max(tt.TorqueCommand)
maxTorqueCommand =
tall double
?
Preview deferred. Learn more.
Gather Results into Workspace
The gather function forces evaluation of all queued operations and brings the resulting output back into memory.
Perform the queued operations, median, min, and max, and evaluate the answers. If the calculation requires several passes through the data, MATLAB determines the minimum number of passes to save execution time and displays this information at the command line.
[medianTorqueCommand, minTorqueCommand, maxTorqueCommand] = gather(medianTorqueCommand, minTorqueCommand, maxTorqueCommand)
Evaluating tall expression using the Parallel Pool 'Processes':
- Pass 1 of 4: Completed in 0.74 sec
- Pass 2 of 4: Completed in 0.37 sec
- Pass 3 of 4: Completed in 0.61 sec
- Pass 4 of 4: Completed in 0.47 sec
Evaluation completed in 3.2 sec
medianTorqueCommand = 116.2799
minTorqueCommand = 0
maxTorqueCommand = 232.9807
Select Subset of Tall Array
Use head to select a subset of 10,000 rows from the data for prototyping code before scaling to the full data set.
ttSubset = head(tt, 10000)
ttSubset = 10,000×3 tall timetable
t EngineSpeed TorqueCommand EngineTorque
______________ ___________ _____________ ____________
0 sec 0 0 47.153
0 sec 2.37e-26 0 47.153
1.47e-05 sec 0.11056 47.158 47.158
8.85e-05 sec 0.66312 48.708 48.708
0.00010107 sec 0.75762 49.77 49.77
0.00010107 sec 0.75762 49.77 49.77
0.0001405 sec 1.053 39.967 39.967
0.00017993 sec 1.3482 23.143 23.143
: : : :
: : : :
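As a minimal sketch of how head behaves on a tall array, the following uses a small in-memory vector as a stand-in for the tall timetable. The names x and firstRows are illustrative, and mapreducer(0) is an optional assumption that keeps evaluation in the local session.

```matlab
% Minimal sketch: head takes the leading rows of a tall array.
mapreducer(0);                     % assumption: run locally, without a parallel pool

x = tall((1:100)');                % stand-in for a large tall array
firstRows = gather(head(x, 5));    % the first 5 rows, [1; 2; 3; 4; 5]
```

Because head returns the leading rows rather than a random sample, the subset reflects only the start of the recording, which is worth keeping in mind when prototyping.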
Remove Duplicate Rows in Tall Array
Timetable rows are duplicates if they have the same row times and the same data values. Use the unique function to remove duplicate rows from the subset tall timetable.
ttSubset = unique(ttSubset)
ttSubset = 9,968×3 tall timetable
t EngineSpeed TorqueCommand EngineTorque
______________ ___________ _____________ ____________
0 sec 0 0 47.153
0 sec 2.37e-26 0 47.153
1.47e-05 sec 0.11056 47.158 47.158
8.85e-05 sec 0.66312 48.708 48.708
0.00010107 sec 0.75762 49.77 49.77
0.0001405 sec 1.053 39.967 39.967
0.00017993 sec 1.3482 23.143 23.143
0.00037708 sec 2.8228 23.143 -0.021071
: : : :
: : : :
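The duplicate rule can be checked on a tiny in-memory timetable. This is only a sketch with made-up values; the names t, x, TT, and U are illustrative and do not come from the MDF file.

```matlab
% Minimal sketch: unique removes rows only when both the row time and the data match.
t = seconds([0; 0; 1; 1; 2]);      % row times, with two repeated time stamps
x = [10; 10; 20; 21; 30];          % data values
TT = timetable(t, x);

U = unique(TT);
% The two rows at 0 sec are identical, so one is removed.
% The two rows at 1 sec differ in x, so both are kept.
% U has 4 rows.
```

This matches the preview earlier in the example, where the two 0 sec rows survive because their EngineSpeed values differ, while the repeated 0.00010107 sec row is dropped.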
Calculate Engine Power
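Calculate engine power in kilowatts (kW) from EngineSpeed and EngineTorque using the formula EnginePower = (π × EngineSpeed × EngineTorque) / (30 × 1000). Assuming EngineSpeed is in revolutions per minute and EngineTorque is in Nm, the factor π/30 converts the speed to radians per second and the division by 1000 converts watts to kilowatts. Save the results to a new variable named EnginePower in the tall timetable.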
ttSubset.EnginePower = (pi * ttSubset.EngineSpeed .* ttSubset.EngineTorque) / (30 * 1000)
ttSubset = 9,968×4 tall timetable
t EngineSpeed TorqueCommand EngineTorque EnginePower
______________ ___________ _____________ ____________ ___________
0 sec 0 0 47.153 0
0 sec 2.37e-26 0 47.153 1.1703e-28
1.47e-05 sec 0.11056 47.158 47.158 0.00054599
8.85e-05 sec 0.66312 48.708 48.708 0.0033824
0.00010107 sec 0.75762 49.77 49.77 0.0039487
0.0001405 sec 1.053 39.967 39.967 0.0044072
0.00017993 sec 1.3482 23.143 23.143 0.0032675
0.00037708 sec 2.8228 23.143 -0.021071 -6.2287e-06
: : : :
: : : :
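As a quick sanity check of the power expression, the same arithmetic can be applied to two speed and torque pairs that appear in the maxEnginePower outputs of this example. The variable names below are illustrative, and the unit interpretation (speed in rpm, torque in Nm) is an assumption inferred from the expression and the plot labels.

```matlab
% Minimal sketch: evaluate the power expression on two rows from the example output.
speed  = [750; 5000];              % EngineSpeed values (assumed rpm)
torque = [78.052; 217.53];         % EngineTorque values (Nm)

% pi/30 converts rpm to rad/s; dividing by 1000 converts W to kW.
powerKW = (pi * speed .* torque) / (30 * 1000);
% powerKW is approximately [6.1302; 113.9], matching the displayed EnginePower values
```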
The topkrows function for tall arrays returns the top k rows in sorted order. Obtain the top 20 rows with maximum EnginePower values.
maxEnginePower = topkrows(ttSubset, 20, "EnginePower")
maxEnginePower = 20×4 tall timetable
t EngineSpeed TorqueCommand EngineTorque EnginePower
_________ ___________ _____________ ____________ ___________
15.17 sec 750 78.052 78.052 6.1302
15.16 sec 750 77.841 77.841 6.1136
15.15 sec 750 77.556 77.556 6.0912
15.14 sec 750 77.326 77.326 6.0732
15.18 sec 750 77.277 77.277 6.0693
15.13 sec 750 77.157 77.157 6.0599
15.12 sec 750 77.082 77.082 6.054
15.11 sec 750 77.067 77.075 6.0534
: : : : :
: : : : :
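The behavior of topkrows can be seen on a small in-memory table. This sketch uses made-up values and hypothetical variable names (Power, Label) rather than the engine data.

```matlab
% Minimal sketch: topkrows returns the k largest rows by the chosen variable.
T = table([3; 9; 1; 7], ["a"; "b"; "c"; "d"], ...
    'VariableNames', ["Power", "Label"]);

top2 = topkrows(T, 2, "Power");    % sorted in descending order by default
% top2.Power is [9; 7] and top2.Label is ["b"; "d"]
```

For a tall input this avoids sorting the whole data set when only the few largest rows are needed.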
Call the gather function to execute all queued operations and collect the results into memory.
[ttSubset, maxEnginePower] = gather(ttSubset, maxEnginePower)
ttSubset=9968×4 timetable
t EngineSpeed TorqueCommand EngineTorque EnginePower
______________ ___________ _____________ ____________ ___________
0 sec 0 0 47.153 0
0 sec 2.37e-26 0 47.153 1.1703e-28
1.47e-05 sec 0.11056 47.158 47.158 0.00054599
8.85e-05 sec 0.66312 48.708 48.708 0.0033824
0.00010107 sec 0.75762 49.77 49.77 0.0039487
0.0001405 sec 1.053 39.967 39.967 0.0044072
0.00017993 sec 1.3482 23.143 23.143 0.0032675
0.00037708 sec 2.8228 23.143 -0.021071 -6.2287e-06
0.00076951 sec 5.7492 15 -0.042938 -2.5851e-05
0.0014014 sec 10.437 15 -0.078013 -8.5265e-05
0.0023449 sec 17.382 15 -0.13009 -0.00023679
0.0036773 sec 27.079 15 -0.20304 -0.00057575
0.0054808 sec 40 15 -0.30067 -0.0012595
0.0072843 sec 52.691 15 -0.39703 -0.0021907
0.01 sec 71.373 15 -0.53973 -0.0040341
0.013562 sec 95.119 15 51.176 0.50976
⋮
maxEnginePower=20×4 timetable
t EngineSpeed TorqueCommand EngineTorque EnginePower
_________ ___________ _____________ ____________ ___________
15.17 sec 750 78.052 78.052 6.1302
15.16 sec 750 77.841 77.841 6.1136
15.15 sec 750 77.556 77.556 6.0912
15.14 sec 750 77.326 77.326 6.0732
15.18 sec 750 77.277 77.277 6.0693
15.13 sec 750 77.157 77.157 6.0599
15.12 sec 750 77.082 77.082 6.054
15.11 sec 750 77.067 77.075 6.0534
15.1 sec 750 77.067 77.067 6.0528
15.09 sec 750 77.059 77.059 6.0522
15.08 sec 750 77.051 77.051 6.0516
15.07 sec 750 77.042 77.042 6.0509
15.06 sec 750 77.034 77.034 6.0502
15.05 sec 750 77.025 77.025 6.0495
15.04 sec 750 77.016 77.016 6.0488
15.03 sec 750 77.006 77.006 6.0481
⋮
Visualize Data in Tall Array
Visualize the EngineTorque and EnginePower signals over time in a plot with two y-axes.
figure
yyaxis left
plot(ttSubset.t, ttSubset.EngineTorque)
title("Engine Torque and Engine Power Over Time")
xlabel("Time")
ylabel("Engine Torque [Nm]")
yyaxis right
plot(ttSubset.t, ttSubset.EnginePower)
ylabel("Engine Power [kW]")
Scale to Entire Data Set
Instead of using the smaller data set returned from head, scale up to apply the same steps on the entire data set by using the complete tall timetable.
tt = tall(mds)
tt = M×3 tall timetable
t EngineSpeed TorqueCommand EngineTorque
______________ ___________ _____________ ____________
0 sec 0 0 47.153
0 sec 2.37e-26 0 47.153
1.47e-05 sec 0.11056 47.158 47.158
8.85e-05 sec 0.66312 48.708 48.708
0.00010107 sec 0.75762 49.77 49.77
0.00010107 sec 0.75762 49.77 49.77
0.0001405 sec 1.053 39.967 39.967
0.00017993 sec 1.3482 23.143 23.143
: : : :
: : : :
First, remove duplicate rows from the tall timetable.
tt = unique(tt)
tt = M×3 tall timetable
t EngineSpeed TorqueCommand EngineTorque
_ ___________ _____________ ____________
? ? ? ?
? ? ? ?
? ? ? ?
: : : :
: : : :
Preview deferred. Learn more.
Next, calculate engine power and obtain the top 20 rows with maximum EnginePower values.
tt.EnginePower = (pi * tt.EngineSpeed .* tt.EngineTorque) / (30 * 1000)
tt = M×4 tall timetable
t EngineSpeed TorqueCommand EngineTorque EnginePower
_ ___________ _____________ ____________ ___________
? ? ? ? ?
? ? ? ? ?
? ? ? ? ?
: : : : :
: : : : :
Preview deferred. Learn more.
maxEnginePower = topkrows(tt, 20, "EnginePower")
maxEnginePower = M×4 tall timetable
t EngineSpeed TorqueCommand EngineTorque EnginePower
_ ___________ _____________ ____________ ___________
? ? ? ? ?
? ? ? ? ?
? ? ? ? ?
: : : : :
: : : : :
Preview deferred. Learn more.
[tt, maxEnginePower] = gather(tt, maxEnginePower)
Evaluating tall expression using the Parallel Pool 'Processes':
- Pass 1 of 1: Completed in 0.95 sec
Evaluation completed in 1.4 sec
tt=359326×4 timetable
t EngineSpeed TorqueCommand EngineTorque EnginePower
______________ ___________ _____________ ____________ ___________
0 sec 0 0 47.153 0
0 sec 2.37e-26 0 47.153 1.1703e-28
1.47e-05 sec 0.11056 47.158 47.158 0.00054599
8.85e-05 sec 0.66312 48.708 48.708 0.0033824
0.00010107 sec 0.75762 49.77 49.77 0.0039487
0.0001405 sec 1.053 39.967 39.967 0.0044072
0.00017993 sec 1.3482 23.143 23.143 0.0032675
0.00037708 sec 2.8228 23.143 -0.021071 -6.2287e-06
0.00076951 sec 5.7492 15 -0.042938 -2.5851e-05
0.0014014 sec 10.437 15 -0.078013 -8.5265e-05
0.0023449 sec 17.382 15 -0.13009 -0.00023679
0.0036773 sec 27.079 15 -0.20304 -0.00057575
0.0054808 sec 40 15 -0.30067 -0.0012595
0.0072843 sec 52.691 15 -0.39703 -0.0021907
0.01 sec 71.373 15 -0.53973 -0.0040341
0.013562 sec 95.119 15 51.176 0.50976
⋮
maxEnginePower=20×4 timetable
t EngineSpeed TorqueCommand EngineTorque EnginePower
__________ ___________ _____________ ____________ ___________
3819.8 sec 5000 217.53 217.53 113.9
3819.8 sec 5000 217.53 217.53 113.9
3819.8 sec 5000 217.53 217.53 113.9
3819.8 sec 5000 217.53 217.53 113.9
3819.8 sec 5000 217.53 217.53 113.9
3819.9 sec 5000 217.53 217.53 113.9
3819.9 sec 5000 217.53 217.53 113.9
3819.9 sec 5000 217.53 217.53 113.9
3819.9 sec 5000 217.52 217.52 113.89
3819.9 sec 5000 217.52 217.52 113.89
3820 sec 5000 217.52 217.52 113.89
3820.1 sec 5000 217.52 217.52 113.89
3820.2 sec 5000 217.52 217.52 113.89
3820.3 sec 5000 217.52 217.52 113.89
3820.4 sec 5000 217.52 217.52 113.89
3820.5 sec 5000 217.52 217.52 113.89
⋮
Finally, visualize the EngineTorque and EnginePower signals over time in a plot with two y-axes.
figure
yyaxis left
plot(tt.t, tt.EngineTorque)
title("Engine Torque and Engine Power Over Time")
xlabel("Time")
ylabel("Engine Torque [Nm]")
yyaxis right
plot(tt.t, tt.EnginePower)
ylabel("Engine Power [kW]")
Close MDF File
Close access to the MDF file by clearing the MDF datastore variable from the workspace.
clear mds