Statistics of datastore of tabular data

Hey all,
I have thousands of Parquet files. Each file has more than 50,000 rows of numerical data and more than 100 columns. The data can't fit in memory, so I use datastores to import and handle it for a downstream machine learning workflow. Is it possible to calculate some statistics (max, min, mean, std for each channel) for each file during the datastore creation process, so that I can use them afterwards to filter and select the relevant segments of data for my downstream analysis?
Thanks in advance

Accepted Answer

Abhas 2024-3-26
Hi Omar,
To calculate statistics (max, min, mean, std for each channel) during the datastore creation process in MATLAB and use them for filtering and selecting relevant data segments for downstream analysis, you can follow these steps:
  1. Create a Datastore: Initialize a 'datastore' for your Parquet files.
  2. Define Custom Function: Create a function to compute the desired statistics for each chunk of data.
  3. Apply Transformation: Use the 'transform' function to apply your custom statistics calculation to the datastore.
  4. Read and Aggregate Statistics: Iterate over the datastore to read the statistics of each chunk and aggregate them globally.
  5. Use Statistics for Filtering: Leverage the aggregated statistics to filter and select relevant data segments.
The following MATLAB code implements these steps:
% Step 1: Create Your Datastore
% ReadSize 'file' makes each read return one whole file, so each statistics row describes one file
ds = parquetDatastore('path/to/your/parquet/files/*.parquet', 'ReadSize', 'file');
% Step 2: Define Your Custom Function
% Save this as calculateStats.m, or move it to the end of the script:
% in R2023b, local functions in a script must come after all other code.
function statsTable = calculateStats(tbl)
    % varfun prefixes each output name with the function name, e.g. min_Var1
    minTable = varfun(@min, tbl);
    maxTable = varfun(@max, tbl);
    meanTable = varfun(@mean, tbl);
    stdTable = varfun(@std, tbl);
    % One row holding all four statistics for every channel
    statsTable = [minTable, maxTable, meanTable, stdTable];
end
% Step 3: Apply the Transformation
% Keep the original datastore in ds so the raw data can still be read later
statsDs = transform(ds, @calculateStats);
% Step 4: Read and Aggregate the Statistics
globalMin = inf; % Expands to one value per channel on the first update
while hasdata(statsDs)
    statsChunk = read(statsDs);
    isMin = startsWith(statsChunk.Properties.VariableNames, 'min_');
    chunkMin = min(statsChunk{:, isMin}, [], 1); % Per-channel minimum of this chunk
    globalMin = min(globalMin, chunkMin);
    % Update the global max the same way using the 'max_' columns.
    % Means and stds cannot be combined with min/max; they also need each chunk's row count.
end
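Aggregating the mean and std needs a little more care than min and max, because chunk means and stds have to be weighted by the number of rows in each chunk. The following is a minimal sketch for a single channel, using made-up in-memory chunks in place of datastore reads. The names `chunks`, `n`, `m`, and `s` are hypothetical, and it assumes the custom function is extended to also return each chunk's row count (for example via `height(tbl)`).

```matlab
% Hypothetical data: three chunks of one channel, standing in for three reads.
rng(0);
chunks = {2*randn(500,1) + 1, 3*randn(800,1) - 4, randn(300,1)};

% Per-chunk summaries, as a statistics function could return them.
n = cellfun(@numel, chunks);   % row count of each chunk
m = cellfun(@mean, chunks);    % mean of each chunk
s = cellfun(@std, chunks);     % std of each chunk (normalized by n-1)

% Global mean: count-weighted average of the chunk means.
N = sum(n);
globalMean = sum(n .* m) / N;

% Global std: within-chunk plus between-chunk sums of squares.
ssWithin  = sum((n - 1) .* s.^2);
ssBetween = sum(n .* (m - globalMean).^2);
globalStd = sqrt((ssWithin + ssBetween) / (N - 1));

% Reference values computed directly from all rows, for comparison only.
allData = vertcat(chunks{:});
directMean = mean(allData);
directStd  = std(allData);
```

The combined values agree with `directMean` and `directStd` up to floating-point rounding, without ever holding all rows in memory at once in the real datastore setting.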
At this point, you have the aggregated statistics (e.g., globalMin) which you can use to filter and select relevant segments of your data for further analysis.
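For the filtering step, it may be more convenient to keep one statistics row per file rather than a single global value. Here is a rough sketch under a few assumptions: the three small Parquet files, the channel names `ch1` and `ch2`, and the threshold of 10 are all hypothetical, and `ReadSize` is set to `'file'` so that every read covers exactly one file.

```matlab
% Write three small hypothetical Parquet files to a temporary folder.
folder = tempname;
mkdir(folder);
scales = [1 5 10];
for k = 1:numel(scales)
    T = table(scales(k) * (1:4)', -(1:4)', 'VariableNames', {'ch1', 'ch2'});
    parquetwrite(fullfile(folder, sprintf('part%d.parquet', k)), T);
end

% One read per file, so readall returns one statistics row per file.
ds = parquetDatastore(fullfile(folder, '*.parquet'), 'ReadSize', 'file');
statsDs = transform(ds, @(tbl) varfun(@max, tbl));
fileStats = readall(statsDs);

% Attach the file names so each row can be traced back to its file.
fileStats.File = string(ds.Files);

% Keep only the files whose ch1 maximum exceeds the hypothetical threshold.
keep = fileStats.max_ch1 > 10;
selectedDs = parquetDatastore(ds.Files(keep), 'ReadSize', 'file');
```

`selectedDs` then contains only the files that passed the filter and can be used in place of `ds` downstream. Even with thousands of files, `fileStats` has just one row per file, so it should fit in memory comfortably.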
For more on working with datastores and transform in MATLAB, see the following documentation:
  1. parquetDatastore: https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.parquetdatastore.html?s_tid=doc_ta
  2. transform: https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.transform.html?s_tid=doc_ta
1 Comment
Omar Kamel 2024-3-28
Hi Abhas, thanks a lot for the elaborate answer. This is exactly what I was looking for.

Release

R2023b
