What is the correct way to save a large MATLAB structure?

I have a MATLAB structure which is just over 21GB in memory (from whos) and when I save this to a MAT file with the "-v7.3" and "-nocompression" flags it takes well over an hour (on a high performance workstation with an NVMe SSD) and I get a file which is 77GB on disk. I understand that there is some overhead in saving to a MAT file and that "-nocompression" will result in larger files than with compression (but I gave up after about 3 hours waiting for a compressed version to save), but how can 56GB of "overhead" be considered acceptable?
I only need to save this one structure and I won't be adding any other data or modifying the MAT file, so all of the additional features of the v7.3 format are of no use to me; I just need support for >2GB variables. I attempted to use the undocumented getByteStreamFromArray function to get a byte array I can just dump to a file, but this just returned "Error during serialization".
Am I somehow missing a "correct" way to do this efficiently? Or are my only options to either split my data into a bunch of <2GB variables to save in a v7 format or write my own serializer? I appreciate that doing either of these isn't exactly a massive job, I'm just very surprised there isn't better native support for large files!

Answers (2)

Matt J, about 18 hours ago
Edited: Matt J, about 18 hours ago
but how can 56GB of "overhead" be considered acceptable?
It depends on what your struct contains. Field data containing handle objects, for example, will give a deceptively small memory total according to whos() because only the handle, and not what it points to, is counted. However, when you save to a .mat file, all of the data the handle points to is written out as well, resulting in a much larger footprint on disk. Example:
s.h=gcf;
whos('s').bytes
ans =
176
save file1 s
dir('file1.mat').bytes
ans =
1836
Or are my only options to either split my data in to a bunch of <2Gb variables to save in a v7 format
It is hard for me to imagine why one would ever want 21GB of data in a single file. It would block off a huge chunk of contiguous disk space and it would take forever to load.

Rahul, about 6 hours ago
Hi Owen,
The issue you're encountering stems from the design of MATLAB's MAT-file formats and the inherent inefficiencies of the -v7.3 format for your specific use case.
  • With the -v7.3 flag you can store variables larger than 2GB, and the data is compressed by default.
  • With -v7 the data is also compressed by default (-v6 is uncompressed), but neither of these older formats can store a variable larger than 2GB.
MAT-file structure: The -v7.3 format uses HDF5 as its backend, which is highly versatile but not optimized for cases where you have a single, large variable. HDF5 is designed for general-purpose storage, including metadata and other overheads that can lead to excessive file sizes.
Serialization limitations: Large and complex data structures like struct can incur significant overhead because every field and subfield is treated as a separate dataset in HDF5.
getByteStreamFromArray is limited to serializing objects that MATLAB's internal serializer can handle. Structures or arrays with greater than 4 GB of data often hit limitations in MATLAB's serialization mechanism.
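To get a rough feel for how much per-item overhead a structure might incur, you can count how many containers (struct and cell arrays) and leaf arrays it holds. The sketch below is only a proxy: the example struct is made up, and the exact mapping from MATLAB arrays to HDF5 objects inside a -v7.3 file is an assumption, not something the count proves.

```matlab
% Hypothetical example struct; replace with your own variable.
inner = struct('c', {{1, 'text', [1 2 3]}}, 'd', uint8(5));
myStruct = struct('a', rand(3), 'b', inner);

% Walk the structure iteratively and count containers and leaf arrays.
stack = {myStruct};     % row cell array used as a stack
nContainers = 0;        % struct and cell arrays
nLeaves = 0;            % numeric, char, logical, ... arrays
while ~isempty(stack)
    item = stack{end};
    stack(end) = [];
    if isstruct(item)
        nContainers = nContainers + 1;
        children = struct2cell(item);               % one cell per field value
        stack = [stack, reshape(children, 1, [])];  %#ok<AGROW>
    elseif iscell(item)
        nContainers = nContainers + 1;
        stack = [stack, reshape(item, 1, [])];      %#ok<AGROW>
    else
        nLeaves = nLeaves + 1;
    end
end
fprintf('%d containers, %d leaf arrays\n', nContainers, nLeaves);
```

A structure with millions of small leaves is far more likely to balloon on disk than one with a handful of large numeric arrays.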
Some of the possible solutions that could resolve this issue are as follows:
Split into multiple variables and use -v7 format
  • If feasible, divide your large structure into several smaller variables, each <2GB.
  • Save these in the older -v7 format, which is more space-efficient for such cases.
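fields = fieldnames(myStruct);
for i = 1:numel(fields)
    % Save only field i, as a variable named after the field (myStruct must be a scalar struct)
    save(['part_' fields{i} '.mat'], '-struct', 'myStruct', fields{i}, '-v7');
end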
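Before splitting by field, it may help to check that every field actually fits under the 2GB limit of the older formats. This is a minimal sketch with a made-up struct; the variable names are placeholders, and whos can under-report fields that hold handle objects, as noted in the other answer.

```matlab
% Hypothetical example struct; replace with your own variable.
myStruct = struct('small', zeros(1, 10), 'text', 'abc');

limitBytes = 2^31;                  % per-variable limit of the -v7 format
fields = fieldnames(myStruct);
tooBig = false(numel(fields), 1);
for i = 1:numel(fields)
    value = myStruct.(fields{i});   %#ok<NASGU> inspected through whos below
    info = whos('value');           % in-memory size of this field only
    tooBig(i) = info.bytes >= limitBytes;
    fprintf('%-10s %12d bytes\n', fields{i}, info.bytes);
end
```

Any field flagged in tooBig would need to be split further (for example by rows or by subfield) before it can go into a -v7 file.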
Writing custom serializer
  • If the structure doesn't contain complex objects, you can recursively serialize it into a binary file with custom MATLAB code.
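fid = fopen('large_struct.bin', 'w');
% fwrite cannot take a struct; write each numeric field itself (myStruct.data is a hypothetical double array)
fwrite(fid, myStruct.data, 'double');
fclose(fid);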
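A custom serializer has to record each field's name, class and dimensions next to the raw data so the file can be read back. The following is a minimal sketch under strong assumptions: a scalar struct whose fields are all real numeric arrays, a made-up file layout, and native byte order. Nested structs, cells, strings and objects would each need their own handling.

```matlab
% Hypothetical example: a scalar struct whose fields are all real numeric arrays.
myStruct = struct('x', rand(4, 3), 'id', uint16([7 8 9]));
fileName = [tempname '.bin'];

% Write: for each field store name, class, dimensions, then the raw data.
fid = fopen(fileName, 'w');
fields = fieldnames(myStruct);
fwrite(fid, numel(fields), 'uint32');
for i = 1:numel(fields)
    value = myStruct.(fields{i});
    assert(isnumeric(value) && isreal(value), 'Sketch handles real numeric fields only');
    cls = class(value);
    fwrite(fid, numel(fields{i}), 'uint32');  fwrite(fid, fields{i}, 'char*1');
    fwrite(fid, numel(cls), 'uint32');        fwrite(fid, cls, 'char*1');
    fwrite(fid, ndims(value), 'uint32');      fwrite(fid, size(value), 'uint64');
    fwrite(fid, value, cls);
end
fclose(fid);

% Read back using the same layout.
fid = fopen(fileName, 'r');
restored = struct();
nFields = fread(fid, 1, 'uint32');
for i = 1:nFields
    nameLen = fread(fid, 1, 'uint32');
    name = fread(fid, nameLen, 'char*1=>char')';
    clsLen = fread(fid, 1, 'uint32');
    cls = fread(fid, clsLen, 'char*1=>char')';
    nDims = fread(fid, 1, 'uint32');
    dims = fread(fid, nDims, 'uint64')';
    data = fread(fid, prod(dims), [cls '=>' cls]);   % keep the original class
    restored.(name) = reshape(data, dims);
end
fclose(fid);
delete(fileName);
```

Because the data is written as one contiguous block per field, the file size should be close to the in-memory size plus a few bytes of header per field.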
Use low-level HDF5 tools
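  • If you want to stay with an HDF5-based file, you can write the data yourself with MATLAB's HDF5 functions instead of going through save. h5create and h5write (the high-level interface) accept numeric arrays rather than structs, so each field needs its own dataset.
% myStruct.data is a hypothetical numeric field; repeat for each field you need to store
h5create('large_struct.h5', '/myStruct/data', size(myStruct.data), 'Datatype', class(myStruct.data));
h5write('large_struct.h5', '/myStruct/data', myStruct.data);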
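The same idea can be applied to every field in a loop and checked with a round trip through h5read. This sketch assumes a scalar struct with only numeric fields; the struct, the '/myStruct/' group name and the temporary file are placeholders.

```matlab
% Hypothetical example: a scalar struct whose fields are all numeric arrays.
myStruct = struct('x', rand(4, 3), 'id', uint16([7 8 9]));
fileName = [tempname '.h5'];

% Write one dataset per field under the /myStruct group.
fields = fieldnames(myStruct);
for i = 1:numel(fields)
    value = myStruct.(fields{i});
    dataset = ['/myStruct/' fields{i}];
    h5create(fileName, dataset, size(value), 'Datatype', class(value));
    h5write(fileName, dataset, value);
end

% Read every dataset back into a new struct.
restored = struct();
for i = 1:numel(fields)
    restored.(fields{i}) = h5read(fileName, ['/myStruct/' fields{i}]);
end
delete(fileName);
```

The resulting file is plain HDF5, so it can also be opened from other tools, but it is not a MAT file and load will not read it.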
If performance is your priority and you don't need the -v7.3 features, you can split the data into smaller parts and use the -v7 format.
Moreover, if your workstation supports parallel computing, you can consider using MATLAB's Parallel Computing Toolbox to parallelize the saving process. This might help speed up the process, especially if your structure can be split into independent parts.
To learn more about the HDF5 functions used in the code above, refer to the MATLAB documentation for h5create and h5write.
Best!


Release

R2024a
