parquetwrite
Write columnar data to Parquet file
Description
parquetwrite(filename,T) writes a table or timetable T to a Parquet file with the filename specified in filename.
parquetwrite(filename,T,Name,Value) specifies additional options with one or more name-value pair arguments. For example, you can specify "VariableCompression" to change the compression algorithm used, or "Version" to write the data to a Parquet 1.0 file.
Examples
Write Table or Timetable to Parquet File
Write tabular data into a Parquet file and compare the size of the same tabular data in .csv and .parquet file formats.
Read the tabular data from the file outages.csv into a table.
T = readtable('outages.csv');
Write the data to Parquet file format. By default, the parquetwrite function uses the Snappy compression scheme. To specify other compression schemes, see the 'VariableCompression' name-value argument.
parquetwrite('outagesDefault.parquet',T)
Get the file sizes and compute the ratio of the size of the tabular data in .csv format to the size of the same data in .parquet format.
Get the size of the .csv file.
fcsv = dir(which('outages.csv'));
size_csv = fcsv.bytes
size_csv = 101040
Get the size of the .parquet file.
fparquet = dir('outagesDefault.parquet');
size_parquet = fparquet.bytes
size_parquet = 44881
Compute the ratio.
sizeRatio = ( size_parquet/size_csv )*100 ;
disp(['Size Ratio = ', num2str(sizeRatio) '% of original size'])
Size Ratio = 44.419% of original size
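As a quick check (not part of the shipped example; the comparison shown is illustrative), you can read the file back with parquetread and confirm that the table height and variable names round-trip:
% Read the Parquet file back and compare basic properties with the source table.
T2 = parquetread('outagesDefault.parquet');
assert(height(T2) == height(T))                                           % same number of rows
assert(isequal(T2.Properties.VariableNames, T.Properties.VariableNames))  % same variable names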
Write Nested Data to Parquet File
Create nested data and write it to a Parquet file.
Create a table with one nested layer of data.
FirstName = ["Akane"; "Omar"; "Maria"];
LastName = ["Saito"; "Ali"; "Silva"];
Names = table(FirstName,LastName);
NumCourse = [5; 3; 6];
Courses = {["Calculus I"; "U.S. History"; "English Literature"; "Studio Art"; "Organic Chemistry II"]; ...
    ["U.S. History"; "Art History"; "Philosophy"]; ...
    ["Calculus II"; "Philosophy II"; "Ballet"; "Music Theory"; "Organic Chemistry I"; "English Literature"]};
data = table(Names,NumCourse,Courses)
data=3×3 table
           Names            NumCourse      Courses
    _____________________   _________   ____________
    FirstName    LastName
    _________    ________
     "Akane"     "Saito"        5        {5x1 string}
     "Omar"      "Ali"          3        {3x1 string}
     "Maria"     "Silva"        6        {6x1 string}
Write your nested data to a Parquet file.
parquetwrite("StudentCourseLoads.parq",data)
Read the nested Parquet data.
t2 = parquetread("StudentCourseLoads.parq")
t2=3×3 table
           Names            NumCourse      Courses
    _____________________   _________   ____________
    FirstName    LastName
    _________    ________
     "Akane"     "Saito"        5        {5x1 string}
     "Omar"      "Ali"          3        {3x1 string}
     "Maria"     "Silva"        6        {6x1 string}
Input Arguments
filename
— Name of output Parquet file
character vector | string scalar
Name of output Parquet file, specified as a character vector or string scalar.
Depending on the location you are writing to, filename can take on one of these forms.
Current folder: To write to the current folder, specify the name of the file in filename.
Other folders: To write to a folder different from the current folder, specify the full or relative path name in filename.
Remote Location: To write to a remote location, specify filename as the full path to the file as a uniform resource locator (URL). Based on the remote location, the form of the URL differs. For more information, see Work with Remote Data.
Data Types: char | string
T
— Input data
table | timetable
Input data, specified as a table or timetable.
Use parquetwrite to export structured Parquet data. For more information on Parquet data types supported for writing, see Apache Parquet Data Type Mappings.
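For instance (an illustrative sketch; the timetable values and file name are arbitrary, not from the shipped documentation), a timetable can be written the same way as a table:
% Write a small timetable to a Parquet file.
TT = timetable(datetime(2023,1,1) + days(0:2)', [10;20;30], ...
    'VariableNames', {'Reading'});
parquetwrite('readings.parquet', TT)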
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: parquetwrite(filename,T,'VariableCompression','gzip','Version','1.0')
VariableCompression
— Compression scheme names
'snappy' (default) | 'brotli' | 'gzip' | 'uncompressed' | cell array of character vectors | string vector
Compression scheme names, specified as one of these values:
'snappy', 'brotli', 'gzip', or 'uncompressed'. If you specify one compression algorithm, then parquetwrite compresses all variables using the same algorithm.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the compression algorithms to use for each variable.
In general, 'snappy' has better performance for reading and writing, 'gzip' has a higher compression ratio at the cost of more CPU processing time, and 'brotli' typically produces the smallest file size at the cost of compression speed.
Example: parquetwrite('myData.parquet', T, 'VariableCompression', 'brotli')
Example: parquetwrite('myData.parquet', T, 'VariableCompression', {'brotli' 'snappy' 'gzip'})
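As an illustration of per-variable compression (the table and file name below are arbitrary, not from the shipped examples), this sketch compresses each variable with a different algorithm and then reports the resulting file size:
% Build a small table with three variables of different types.
X = table((1:1000)', rand(1000,1), repmat("abc",1000,1), ...
    'VariableNames', {'Id','Value','Label'});

% Compress Id with gzip, Value with snappy, and Label with brotli.
parquetwrite('perVarCompression.parquet', X, ...
    'VariableCompression', {'gzip','snappy','brotli'})

% Report the size of the written file in bytes.
d = dir('perVarCompression.parquet');
d.bytes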
VariableEncoding
— Encoding scheme names
'auto' (default) | 'dictionary' | 'plain' | cell array of character vectors | string vector
Encoding scheme names, specified as one of these values:
'auto': parquetwrite uses 'plain' encoding for logical variables, and 'dictionary' encoding for all others.
'dictionary', 'plain': If you specify one encoding scheme, then parquetwrite encodes all variables with that scheme.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the encoding schemes to use for each variable.
In general, 'dictionary' encoding results in smaller file sizes, but 'plain' encoding can be faster for variables that do not contain many repeated values. If the size of the dictionary or the number of unique values grows too large, the encoding automatically reverts to plain encoding. For more information on Parquet encodings, see Parquet encoding definitions.
Example: parquetwrite('myData.parquet', T, 'VariableEncoding', 'plain')
Example: parquetwrite('myData.parquet', T, 'VariableEncoding', {'plain' 'dictionary' 'plain'})
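For example (an illustrative sketch; the table and file name are arbitrary), a variable with mostly unique values can use 'plain' encoding while a low-cardinality variable uses 'dictionary' encoding:
% Measurement has mostly unique values; GroupId repeats only three values.
Y = table(rand(1e4,1), randi(3,1e4,1), 'VariableNames', {'Measurement','GroupId'});

% Encode Measurement with plain encoding and GroupId with dictionary encoding.
parquetwrite('encodedData.parquet', Y, 'VariableEncoding', {'plain','dictionary'})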
VariableNames
— Custom variable names
string array | character vector | cell array of character vectors
Since R2024a
Custom variable names to use in exported data, specified as a string array, character vector, or cell array of character vectors. By default, parquetwrite uses the variable names of the input table or timetable.
Example: parquetwrite(filename,T,VariableNames=["name","address","phone"])
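The following sketch (illustrative table, file name, and custom names; requires R2024a or later) shows that the exported file carries the custom names rather than the names stored in the input table:
% Table variables are named Var1 and Var2 by default.
C = table([1;2], ["a";"b"]);

% Export with custom column names.
parquetwrite('renamed.parquet', C, VariableNames=["id","code"])

% Reading the file back shows the custom names.
C2 = parquetread('renamed.parquet');
C2.Properties.VariableNames    % expected: {'id'}  {'code'}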
RowGroupHeights
— Number of rows to write per output row group
nonnegative numeric scalar | vector of nonnegative integers
Number of rows to write per output row group, specified as a nonnegative numeric scalar or vector of nonnegative integers.
If you specify a scalar, the scalar value sets the height of all row groups in the output Parquet file. The last row group may contain fewer rows if there is not an exact multiple.
If you specify a vector, each value in the vector sets the height of a corresponding row group in the output Parquet file. The sum of all the values in the vector must match the height of the input table.
A row group is the smallest subset of a Parquet file that can be read into memory at once. Reducing the row group height helps the data fit into memory when reading. Row group height also affects the performance of filtering operations on a Parquet data set, because a larger row group height allows more data to be filtered in a single read.
If RowGroupHeights is unspecified and the input table exceeds 67108864 rows, the number of row groups in the output file is equal to floor(TotalNumberOfRows/67108864)+1.
Example: RowGroupHeights=100
Example: RowGroupHeights=[300, 400, 500, 0, 268]
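As an illustration (the table and file name are arbitrary; the parquetinfo property and the parquetread RowGroups argument shown here are assumptions based on the related Parquet functions), this sketch writes three row groups and then reads only the first one back:
% Split a 1000-row table into row groups of 400, 400, and 200 rows.
Z = table(rand(1000,1), rand(1000,1));
parquetwrite('grouped.parquet', Z, RowGroupHeights=[400 400 200])

% Inspect the row group layout of the written file (assumes parquetinfo
% exposes a RowGroupHeights property).
info = parquetinfo('grouped.parquet');
info.RowGroupHeights           % expected: 400   400   200

% Read only the first row group (RowGroups is a parquetread argument, R2022a).
firstGroup = parquetread('grouped.parquet', RowGroups=1);
height(firstGroup)             % expected: 400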
Version
— Parquet version to use
'2.0' (default) | '1.0'
Parquet version to use, specified as either '1.0' or '2.0'. The default, '2.0', offers the most efficient storage, but you can select '1.0' for the broadest compatibility with external applications that support the Parquet format.
Caution
Parquet version 1.0 has a limitation that it cannot round-trip variables of type uint32 (they are read back into MATLAB® as int64).
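A brief sketch illustrating this caution (the table and file name are arbitrary): a uint32 variable written with Version '1.0' is read back as int64.
% Write a uint32 variable using Parquet version 1.0.
U = table(uint32([1;2;3]), 'VariableNames', {'Counts'});
parquetwrite('v1.parquet', U, 'Version', '1.0')

% Reading it back returns int64 rather than uint32.
U2 = parquetread('v1.parquet');
class(U2.Counts)               % expected: 'int64'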
Limitations
In some cases, parquetwrite creates files that do not represent the original array T exactly. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.
Extended Capabilities
Thread-Based Environment
Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.
This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
Version History
Introduced in R2019a
R2024a: Specify custom variable names
You can specify custom variable names to use in exported data by using the VariableNames name-value argument.
R2022b: Write nested data to Parquet files
Write nested table and timetable variables to Parquet files using parquetwrite.
R2022b: Use function in thread-based environments
This function supports thread-based environments.
R2022a: Determine and define row groups in Parquet file data
A Parquet file can store a range of rows as a distinct row group for increased granularity and targeted analysis. parquetread uses the RowGroups name-value argument to determine row groups while reading Parquet file data. parquetwrite uses the RowGroupHeights name-value argument to define row groups while writing Parquet file data.
R2022a: Export nested data
You can now export nested cell arrays as LIST arrays.
R2021b: Read and write datetimes with original time zones
Parquet files require time-zone-aware timestamps to be in the UTC time zone. When writing datetimes, parquetwrite converts them to equivalent UTC values and stores the original time zone values in the metadata of the Parquet file. parquetread uses the stored original time zone values to enable roundtripping.
R2021a: Use categorical data in Parquet data format
Write Parquet data that contains the categorical data type.
R2020a: Control encoding scheme and Parquet version when writing files
The parquetwrite function has two new name-value arguments:
'VariableEncoding' controls whether a Parquet file uses plain or dictionary encoding for each variable.
'Version' specifies whether to use Parquet 1.0 or Parquet 2.0 file formatting.
R2019b: Write tabular data containing any characters
Write tabular data that has variable names containing any Unicode characters, including spaces and non-ASCII characters. To write tabular data with such arbitrary variable names, set the PreserveVariableNames parameter to true.
See Also