parquetread

Read columnar data from a Parquet file

Syntax

T = parquetread(filename)

T = parquetread(filename,Name,Value)

Description

T = parquetread(filename) reads the Parquet file specified in filename into a table or timetable T.

T = parquetread(filename,Name,Value) reads the Parquet file into a table or timetable with additional options specified by one or more name-value pair arguments.

example

Examples

collapse all

Read Parquet File into Table

Open Live Script

Get information about a Parquet file, read the data from the file into a table, and then read a subset of the variables into a table.

Create a ParquetInfo object for the file outages.parquet.

info = parquetinfo('outages.parquet')

info = 
  ParquetInfo with properties:

               Filename: "/mathworks/devel/bat/filer/batfs2566-0/Bdoc24b.2855429/build/runnable/matlab/toolbox/matlab/demos/outages.parquet"
               FileSize: 44202
           NumRowGroups: 1
        RowGroupHeights: 1468
          VariableNames: ["Region"    "OutageTime"    "Loss"    "Customers"    "RestorationTime"    "Cause"]
          VariableTypes: ["string"    "datetime"    "double"    "double"    "datetime"    "string"]
    VariableCompression: ["snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"]
       VariableEncoding: ["plain"    "plain"    "plain"    "plain"    "plain"    "plain"]
                Version: "2.0"

Read data from the file into a table and display the first 10 rows.

T = parquetread('outages.parquet');
T(1:10,:)

ans=10×6 table
      Region            OutageTime          Loss     Customers       RestorationTime             Cause      
    ___________    ____________________    ______    __________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
    "West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
    "NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
    "MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"

Select and import the variables Region, OutageTime, and Cause into a table and display the first 10 rows.

SelVarNames = {'Region','OutageTime','Cause'};
T_subset = parquetread('outages.parquet','SelectedVariableNames',SelVarNames);
T_subset(1:10,:)

ans=10×3 table
      Region            OutageTime               Cause      
    ___________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00    "attack"         
    "West"         20-Jun-2004 14:39:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    "equipment fault"
    "NorthEast"    16-Jul-2003 16:23:00    "fire"           
    "MidWest"      27-Sep-2004 11:09:00    "equipment fault"

Read Parquet File into Timetable

Open Live Script

Read the data from the file into a timetable, and then use timetable functions to determine if the timetable is regular and sorted.

Read data from outages.parquet into a timetable and display the first 10 rows. Use the second variable OutageTime in the data as the time vector for the timetable.

TT = parquetread('outages.parquet','RowTimes','OutageTime');
TT(1:10,:)

ans=10×5 timetable
         OutageTime           Region        Loss     Customers       RestorationTime             Cause      
    ____________________    ___________    ______    __________    ____________________    _________________

    01-Feb-2002 12:18:00    "SouthWest"    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    23-Jan-2003 00:49:00    "SouthEast"    530.14    2.1204e+05                     NaT    "winter storm"   
    07-Feb-2003 21:15:00    "SouthEast"     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    06-Apr-2004 05:44:00    "West"         434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    16-Mar-2002 06:18:00    "MidWest"      186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    18-Jun-2003 02:49:00    "West"              0             0    18-Jun-2003 10:54:00    "attack"         
    20-Jun-2004 14:39:00    "West"         231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    06-Jun-2002 19:28:00    "West"         311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
    16-Jul-2003 16:23:00    "NorthEast"    239.93         49434    17-Jul-2003 01:12:00    "fire"           
    27-Sep-2004 11:09:00    "MidWest"      286.72         66104    27-Sep-2004 16:37:00    "equipment fault"

Determine if the timetable is regular and sorted. A regular timetable has the same time interval between consecutive row times and a sorted timetable has a row time vector is in ascending order.

isregular(TT)

ans = logical
   0

issorted(TT)

ans = logical
   0

Sort the timetable on its row times using the sortrows function and display the first 10 rows of the sorted data.

TT = sortrows(TT);
TT(1:10,:)

ans=10×5 timetable
         OutageTime           Region        Loss     Customers       RestorationTime             Cause       
    ____________________    ___________    ______    __________    ____________________    __________________

    01-Feb-2002 12:18:00    "SouthWest"    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"    
    05-Mar-2002 17:53:00    "MidWest"      96.563    2.8666e+05    10-Mar-2002 14:41:00    "wind"            
    16-Mar-2002 06:18:00    "MidWest"      186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"    
    26-Mar-2002 01:59:00    "MidWest"      388.04    5.6422e+05    28-Mar-2002 19:55:00    "winter storm"    
    20-Apr-2002 16:46:00    "MidWest"       23141           NaN                     NaT    "unknown"         
    08-May-2002 20:34:00    "SouthWest"    50.732         34481    08-May-2002 22:21:00    "thunder storm"   
    18-May-2002 11:04:00    "MidWest"      1389.1    1.3447e+05    21-May-2002 01:22:00    "unknown"         
    20-May-2002 10:57:00    "NorthEast"    9116.6    2.4983e+06    21-May-2002 15:22:00    "unknown"         
    27-May-2002 09:44:00    "SouthEast"    237.28    1.7101e+05    27-May-2002 16:19:00    "wind"            
    02-Jun-2002 16:11:00    "SouthEast"         0             0    05-Jun-2002 05:55:00    "energy emergency"

Conditionally Import Subset of Data Using Row Filter

Open Live Script

Import a subset of data by specifying variables and rows to import by using a row filter.

To import a subset of the outages.parquet file, create a filter to import only the OutageTime, Region, and Cause variables. Then, refine the filter to import only rows with values that meet certain conditions.

rf = rowfilter(["OutageTime" "Region" "Cause"]);
rf2 = (rf.OutageTime > datetime("2013-02-01")) & (rf.Region == "NorthEast") & (rf.Cause == "winter storm");
d = parquetread("outages.parquet",RowFilter=rf2,SelectedVariableNames=["OutageTime" "Region" "Cause"])

d=6×3 table
         OutageTime           Region           Cause     
    ____________________    ___________    ______________

    09-Feb-2013 00:55:00    "NorthEast"    "winter storm"
    13-Feb-2013 01:44:00    "NorthEast"    "winter storm"
    25-Dec-2013 11:24:00    "NorthEast"    "winter storm"
    30-Dec-2013 11:40:00    "NorthEast"    "winter storm"
    22-Feb-2013 02:17:00    "NorthEast"    "winter storm"
    23-Feb-2013 01:53:00    "NorthEast"    "winter storm"

The resulting subset of filtered data contains only the 6 rows that meet the filter conditions and the 3 specified variables.

Input Arguments

collapse all

`filename` — Name of Parquet file
character vector | string scalar

Name of Parquet file, specified as a character vector or string scalar. parquetread works with Parquet 1.0 or Parquet 2.0 files.

Depending on the location of the file, filename can take on one of these forms.

Location

Form

Current folder or folder on the MATLAB^® path

Specify the name of the file in filename.

Example: 'data.parquet'

File in a folder

If the file is not in the current folder or in a folder on the MATLAB path, then specify the full or relative path name.

Example: 'C:\myFolder\data.parquet'

Example: 'myDir\myFile.ext'

Internet URL

If the file is specified as an internet uniform resource locator (URL), then filename must contain the protocol type 'http://' or 'https://' and end with '?raw=true'.

Example: 'http://hostname/path_to_file/my_data.parquet?raw=true'

Remote Location

If the file is stored at a remote location, then filename must contain the full path of the file specified with the form:

scheme_name://path_to_file/my_file.ext

Based on the remote location, scheme_name can be one of the values in this table.

Remote Location	`scheme_name`
Amazon S3™	`s3`
Azure^® Blob Storage	`wasb`, `wasbs`
HDFS™	`hdfs`

For more information, see Work with Remote Data.

Example: 's3://bucketname/path_to_file/data.parquet'

The parquetread function can import structured data from Parquet files. For more information on Parquet data types supported for reading, see Apache Parquet Data Type Mappings.

Data Types: char | string

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'OutputType','table' imports the data in the Parquet file as a table.

`OutputType` — Output datatype
`'auto'` (default) | `'table'` | `'timetable'`

Output datatype, specified as the comma-separated pair consisting of 'OutputType' and 'auto', 'table', or 'timetable'.

'auto' — Return a table or a timetable. The parquetread detects if the output should be a table or a timetable based on other name-value pairs that you specify. For example, when you set timetable related name-value pairs, then parquetread infers that the output is a timetable. Setting these name-value pairs indicates that the output is a timetable: RowTimes, StartTime, SampleRate, or TimeStep.
'table' — Return a table. For more information on the table datatype, see table.
'timetable' — Return a timetable. For more information on the timetable datatype, see timetable.

Example: 'OutputType','timetable'

Data Types: char | string

`SelectedVariableNames` — Subset of variables to import
character vector | string scalar | cell array of character vectors | string array

Subset of variables to import, specified as the comma-separated pair consisting of 'SelectedVariableNames' and a character vector, string scalar, cell array of character vectors, or a string array.

SelectedVariableNames must be a subset of variable names contained in the Parquet file. To get the names of all the variables in the file, use the VariableNames property of the ParquetInfo object.
If you do not specify the SelectedVariableNames name-value pair, parquetread reads all the variables from the file.

Data Types: char | string | cell

`RowTimes` — Row times variable
variable name | time vector

Row times variable, specified as the comma-separated pair consisting of 'RowTimes' and a variable name or a time vector.

Variable name must be a character vector or string scalar containing the name of any variable in the input table that contains datetime or duration values. The variable specified by the variable name provides row time labels for the rows. The remaining variables of the input table become the variables of the timetable.
Time vector must be a datetime vector or a duration vector. The number of elements of time vector must equal the number of rows of the input table. The time values in the time vector do not need to be unique, sorted, or regular. All the variables of the input table become variables of the timetable.

Data Types: char | string | datetime | duration

`StartTime` — Start time of row times
datetime scalar | duration scalar

Start time of the row times, specified as the comma-separated pair consisting of StartTime and a datetime scalar or duration scalar.

If the start time is a datetime, then the row times of T are datetime values.
If the start time is a duration, then the row times of T are duration values.
If the time step is a calendar duration, then the start time must be a datetime value.

StartTime is a timetable related parameter. The parquetread function uses StartTime along with SampleRate or TimeStep to define the time vector for the output T.

Data Types: datetime | duration

`SampleRate` — Sample rate
numeric scalar

Sample rate, specified as the comma-separated pair consisting of 'SampleRate' and a numeric scalar. The sample rate is the number of samples per second (Hz) of the time vector of the output timetable T.

SampleRate is a timetable related parameter. The parquetread function uses SampleRate along with other timetable parameters to define the time vector for the output T.

Data Types: double

`RowGroups` — Indices of row groups to import
positive numeric scalar | vector of positive integers

Indices of row groups to import, specified as a positive numeric scalar or vector of positive integers, referring to indices of row groups to read.

If you specify a scalar, then the function reads a single row group.
If you specify a vector, then the function reads all the specified row groups.
If you do not specify row groups, then parquetread imports the entire file.

Example: RowGroups=701:720

`RowFilter` — Filter to select rows to import
`matlab.io.RowFilter` object

Filter to select rows to import, specified as a matlab.io.RowFilter object. The matlab.io.RowFilter object designates conditions each row must satisfy to be included in your output table or timetable. If you do not specify RowFilter, then parquetread imports all rows from the input Parquet file.

`TimeStep` — Time step of time vector
duration | calendarDuration

Time step of time vector, specified as the comma-separated pair consisting of 'TimeStep' and a duration scalar.

If you specify the time step as a calendar duration (for example, calendar months), then the vector of row times must be a datetime vector.
If you specify the time step as a duration (for example, seconds), then the vector of row times can either be a datetime or duration vector.

TimeStep is a timetable related parameter. The parquetread function uses TimeStep along with other timetable parameters to define the time vector for the output T.

`VariableNamingRule` — Flag to preserve variable names
`"modify"` (default) | `"preserve"`

Flag to preserve variable names, specified as either "modify" or "preserve".

"modify" — Convert invalid variable names (as determined by the isvarname function) to valid MATLAB identifiers.
"preserve" — Preserve variable names that are not valid MATLAB identifiers such as variable names that include spaces and non-ASCII characters.

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve". Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".

Data Types: char | string

Output Arguments

collapse all

`T` — Output data
table | timetable

Output data, returned as a table or timetable. The output of the parquetread function depends on the value of the OutputType name-value pair. For more information, see the name-value pair description for OutputType.

Limitations

In some cases, parquetwrite creates files that do not represent the original array T exactly. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.

Extended Capabilities

Thread-Based Environment
Run code in the background using MATLAB® `backgroundPool` or accelerate code with Parallel Computing Toolbox™ `ThreadPool`.

This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.

Version History

Introduced in R2019a

expand all

R2022b: Read Parquet files containing structured data

Read structured data from Parquet files as nested tables.

R2022b: Use function in thread-based environments

This function supports thread-based environments.

R2022a: Read Parquet file data more efficiently using `rowfilter` to conditionally filter rows

Conditionally filter and read data faster (Predicate Pushdown) from Parquet files when using parquetread and parquetDatastore. You can create conditions for filtering by using the rowfilter function, matlab.io.RowFilter object, and RowFilter name-value argument.

R2022a: Determine and define row groups in Parquet file data

A Parquet file can store a range of rows as a distinct row group for increased granularity and targeted analysis. parquetread uses the RowGroups name-value argument to determine row groups while reading Parquet file data. parquetwrite uses the RowGroupHeights name-value argument to define row groups while writing Parquet file data.

R2022a: Import nested Parquet file data

You can now import nested Parquet file data with:

LogicalType as LIST.
LogicalType as NONE and PhysicalType as either BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.

The parquetread function converts and imports these data architectures as cell arrays.

R2021b: Read and write datetimes with original time zones

Parquet files require time-zone-aware timestamps to be in the UTC time zone. When writing datetimes, parquetwrite converts them to equivalent UTC values and stores the original time zone values in the metadata of the Parquet file. parquetread uses the stored original time zone values to enable roundtripping.

R2021a: Read online data

Read files from an internet URL by specifying filename as a string scalar or character vector that contains the protocol type 'http://' or 'https://'.

R2021a: Use categorical data in Parquet data format

Import Parquet data that contains the categorical data type.

R2019b: Read tabular data containing any characters

Import tabular data that has variable names containing any Unicode characters, including spaces and non-ASCII characters. To read tabular data that contains arbitrary variable names, such as variable names with spaces and non-ASCII characters, set the PreserveVariableNames parameter to true.

parquetread

Syntax

Description

Examples

Read Parquet File into Table

Read Parquet File into Timetable

Conditionally Import Subset of Data Using Row Filter

Input Arguments

filename — Name of Parquet file character vector | string scalar

Name-Value Arguments

OutputType — Output datatype 'auto' (default) | 'table' | 'timetable'

SelectedVariableNames — Subset of variables to import character vector | string scalar | cell array of character vectors | string array

RowTimes — Row times variable variable name | time vector

StartTime — Start time of row times datetime scalar | duration scalar

SampleRate — Sample rate numeric scalar

RowGroups — Indices of row groups to import positive numeric scalar | vector of positive integers

RowFilter — Filter to select rows to import matlab.io.RowFilter object

TimeStep — Time step of time vector duration | calendarDuration

VariableNamingRule — Flag to preserve variable names "modify" (default) | "preserve"

Output Arguments

T — Output data table | timetable

Limitations

Extended Capabilities

Thread-Based Environment Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.

Version History

R2022b: Read Parquet files containing structured data

R2022b: Use function in thread-based environments

R2022a: Read Parquet file data more efficiently using rowfilter to conditionally filter rows

R2022a: Determine and define row groups in Parquet file data

R2022a: Import nested Parquet file data

R2021b: Read and write datetimes with original time zones

R2021a: Read online data

R2021a: Use categorical data in Parquet data format

R2019b: Read tabular data containing any characters

See Also

`filename` — Name of Parquet file
character vector | string scalar

`OutputType` — Output datatype
`'auto'` (default) | `'table'` | `'timetable'`

`SelectedVariableNames` — Subset of variables to import
character vector | string scalar | cell array of character vectors | string array

`RowTimes` — Row times variable
variable name | time vector

`StartTime` — Start time of row times
datetime scalar | duration scalar

`SampleRate` — Sample rate
numeric scalar

`RowGroups` — Indices of row groups to import
positive numeric scalar | vector of positive integers

`RowFilter` — Filter to select rows to import
`matlab.io.RowFilter` object

`TimeStep` — Time step of time vector
duration | calendarDuration

`VariableNamingRule` — Flag to preserve variable names
`"modify"` (default) | `"preserve"`

`T` — Output data
table | timetable

Thread-Based Environment
Run code in the background using MATLAB® `backgroundPool` or accelerate code with Parallel Computing Toolbox™ `ThreadPool`.

R2022a: Read Parquet file data more efficiently using `rowfilter` to conditionally filter rows