matfile and half inefficient storage
Dear MATLAB users,
I have encountered the following inefficient storage problem:
delete('myfile.mat')
handle = matfile('myfile.mat');
handle.X = half(X); % X is big
handle.Y = half(Y); % Y is big
handle.a = a;
handle.b = b;
%%% the size of myfile.mat is 2.4 GB %%%
data = load('myfile');
save('mynewfile1.mat', '-v7.3', '-struct', 'data')
%%% the size of mynewfile1.mat is 1.2 GB %%%
data = load('myfile');
save('mynewfile2.mat', '-struct', 'data')
%%% the size of mynewfile2.mat is 1.2 GB %%%
What could be causing this doubling of storage, and how can I avoid it without loading and resaving the file?
Update: the problem does not seem to be caused by the -v7.3 flag. I updated the code above to show this.
Thank you for your help.
30 Comments
dpb
2021-7-24
-v7.3 is the latest version; -v7 is the default (https://www.mathworks.com/help/matlab/ref/save.html), set by TMW in the preferences, apparently for compatibility.
There's a note in the doc under the 'version' named parameter that says--
"Version 7.3 MAT-files use an HDF5 based format that requires some overhead storage to describe the contents of the file. For cell arrays, structure arrays, or other containers that can store heterogeneous data types, Version 7.3 MAT-files are sometimes larger than Version 7 MAT-files."
The blowup is something I've noted in some other questions over the last few months. There was another conversation just the other day, it seems, where a file saved with the -v7.3 flag came out at something like 2X the size of the same save without the flag. It turned out that -v7 is set as the default in the preferences on initial install.
Seems as though this needs some attention from TMW -- the huge blow-up in size indicates something's not kosher/as intended in the implementation.
dpb
2021-7-24
Edited: dpb
2021-7-24
Quite possible; there's got to be overhead with the matfile object in order to be able to access pieces-parts.
Alternatively, what does half actually do? Does it create some object or what? I don't have any of the TBs that have it so not sure.
Just to check, what are the settings in Preferences > General > MAT-Files? Just so we know for sure which version is used with no explicit flag on the command line.
dpb
2021-7-24
OK, so the default is -v7, and the fact that both
save('mynewfile1.mat', '-v7.3', '-struct', 'data')
save('mynewfile2.mat', '-struct', 'data')
returned the same file size shows that the size difference is not related to the MAT-file version, whatever the data actually is.
Now, what we (at least me, since I can't test) don't know yet is what half actually returns -- the doc above was unclear.
What does
x=half(X);
whos x X
return?
Mika
2021-7-24
Here is what I get:
>> X = rand(1000);
>> x = half(X);
>> whos
  Name         Size                Bytes  Class     Attributes

  X         1000x1000            8000000  double
  x         1000x1000            2000000  half
dpb
2021-7-24
Well, then, it would seem it is the matfile overhead that's the killer -- if you just store X and x, nothing untoward happens, does it?
Walter Roberson
2021-7-25
"But I need matfile to save in a parfor loop."
I have not seen any guarantee that two different processes writing to the same matfile() will not interfere with each other.
The file structure designed for simultaneous access is memmapfile().
Walter Roberson
2021-7-25
Please explain more about why using parfor requires you to use matfile? As opposed to just saving (possibly using 7.3 if you have big objects)?
Mika
2021-7-25
Edited: Mika
2021-7-25
save cannot be called directly in a parfor loop; see https://www.mathworks.com/help/parallel-computing/transparency.html
Yes, I could write a separate function, but matfile would be a more elegant solution if it worked as expected.
So I guess this is my main concern: the unexpected behavior of matfile (maybe in combination with half).
dpb
2021-7-25
Edited: dpb
2021-7-26
We've eliminated everything on the size conundrum except matfile, with the one exception that we haven't seen the explicit result of a save statement for the half object (which I can't test). We got as far as showing that it doesn't use extra memory according to whos, but that doesn't prove save doesn't need some extra info to go with it. One presumes not, but it hasn't been proven.
If performance is a question, as I would presume it is when using parfor anyway, the matfile solution may seem "elegant" in minimizing source code, but I think it would still be a sizable time hit, even without the file size issue, compared to the suggested workaround.
Mika
2021-7-26
Thank you for your help. I did try save on the half object, and it doesn't seem to be causing the problem. I guess I should submit a bug report to MathWorks in this case?
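The check described here can be sketched roughly as follows. This is only an illustration, not the code that was actually run: it assumes a toolbox that provides half is installed, and the variable names, the 1000x1000 test matrix and the temporary file names are made up for the example.

```matlab
% Sketch: save the same data as double and as half, then compare file sizes.
% Assumes half() is available (it ships with certain toolboxes only).
X = rand(1000);                 % stand-in data (assumption)
x = half(X);                    % half-precision copy

fileDouble = [tempname '.mat']; % hypothetical temporary file names
fileHalf   = [tempname '.mat'];

save(fileDouble, 'X', '-v7.3'); % one save call per file, no matfile involved
save(fileHalf,   'x', '-v7.3');

infoDouble = dir(fileDouble);
infoHalf   = dir(fileHalf);
fprintf('double: %d bytes on disk\n', infoDouble.bytes);
fprintf('half:   %d bytes on disk\n', infoHalf.bytes);
```

If the half file is not unexpectedly large here, that points away from half itself and toward the way matfile writes the file.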
dpb
2021-7-26
Edited: dpb
2021-7-26
I had presumed that would be the result, but since I couldn't/can't test, just for the record... :)
I agree, I think it's well worth bringing to their explicit attention (altho I would presume they're already aware of it) as it appears they may need to re-examine just what is causing such a huge blowup and rethink what they're doing going forward.
While they probably won't classify it as a bug since it seems to still work to provide the documented functionality, certainly from a performance and quality of implementation POV it deserves to be flagged.
dpb
2021-7-26
That avoids it, but doesn't resolve that storage requirements blow up remarkably with matfile which seems to me at least to be a problem even if one can get around it in some instances by not using it. If never going to use it, isn't much point in having it in the language... :)
James Tursa
2021-7-27
For the record, half data types are stored as opaque classdef objects. They are fundamentally different from the other native numeric types such as double and single. Whether this has anything to do with the behavior I don't know.
Walter Roberson
2021-7-27
Good point, James. The representation of classdef objects can end up being quite different in HDF5.
dpb
2021-7-27
But the testing didn't show any difference with save of the raw type; only with matfile...
Eike Blechschmidt
2021-7-28
You could do the following and see if there is a difference in how the files are stored as hdf5 files:
h5disp('myfile.mat');
h5disp('mynewfile1.mat');
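h5disp prints a lot of text for large files. A more compact comparison might use h5info instead, which returns the same metadata as a struct. The sketch below is only an illustration: it builds its own small v7.3 file with plain double data (the file name and variables are made up), and it lists only the top-level datasets. Objects such as half may be stored in additional groups that this loop does not visit.

```matlab
% Sketch: summarize the top-level HDF5 datasets of a v7.3 MAT-file.
fname = [tempname '.mat'];          % hypothetical temporary file
A = rand(200);                      % stand-in variables (assumption)
B = rand(50, 10);
save(fname, 'A', 'B', '-v7.3');     % v7.3 MAT-files are HDF5 based

info = h5info(fname);               % metadata for the root group
for k = 1:numel(info.Datasets)
    d = info.Datasets(k);
    fprintf('%s: size [%s], chunk [%s], %d filter(s)\n', ...
        d.Name, num2str(d.Dataspace.Size), num2str(d.ChunkSize), ...
        numel(d.Filters));
end
names = {info.Datasets.Name};       % dataset names found at the root
```

Running the same loop on myfile.mat and mynewfile1.mat and comparing chunk sizes and filters may show where the two files differ.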
Mika
2021-7-29
Thank you, here is a note from MathWorks support:
The MAT-file v7.3 is based on HDF5, and HDF5 does not manage free space as effectively as it should. If a dataset in an HDF5 file is frequently added and written, the files can grow unnecessarily large.
A possible work around, as you have demonstrated, is to use the “save” function. This allows MATLAB to compress the data more efficiently, since there is no repetitive write to the MAT file. Please see this documentation link, specifically the tips section for information about efficiently storing to a MAT file in this way:
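One way to see the effect described in this note on your own machine is to write a file through matfile, resave its contents with a single save call, and compare the sizes on disk. The following is a rough sketch under assumptions: the data, the number of repeated writes and the file names are invented for the example, and whether (and by how much) the first file is larger will depend on the data and the MATLAB release.

```matlab
% Sketch: compare a file written piecewise via matfile with a single save.
fileA = [tempname '.mat'];              % written through matfile
fileB = [tempname '.mat'];              % written with one save call

m = matfile(fileA, 'Writable', true);   % a new file is created as v7.3
for k = 1:5
    m.X = rand(500);                    % overwrite the same variable repeatedly
end
m.a = 1;

data = load(fileA);                     % read everything back ...
save(fileB, '-v7.3', '-struct', 'data');% ... and write it in one pass

infoA = dir(fileA);
infoB = dir(fileB);
fprintf('matfile: %d bytes, save: %d bytes\n', infoA.bytes, infoB.bytes);
```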
Mika
2021-8-16
Edited: Mika
2021-8-16
Just to follow up, this small function should do the job:
function mysave(filename, varargin)
    % Pack the name/value pairs into a scalar struct and save each field as a variable.
    data = struct(varargin{:});
    save(filename, '-v7.3', '-struct', 'data');
end
Here is an example usage:
% run and check result
mysave('myfile.mat', 'a', 1, 'b', 2, 'c', 3);
whos -file myfile.mat
  Name      Size            Bytes  Class     Attributes

  a         1x1                 8  double
  b         1x1                 8  double
  c         1x1                 8  double
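Since the original motivation was saving from inside parfor, here is a sketch of how a helper like this might be used there. It is an illustration only: the wrapper function name demo_parfor_save, the output folder, the file naming pattern and the stand-in data are all assumptions. Each iteration writes its own file, so no two workers ever write to the same MAT-file.

```matlab
function outDir = demo_parfor_save()
    % Sketch: one output file per parfor iteration, written via a helper.
    outDir = tempname;                      % hypothetical output folder
    mkdir(outDir);
    N = 4;                                  % number of iterations (assumption)
    parfor k = 1:N
        X = rand(100) * k;                  % stand-in for the real result
        fname = fullfile(outDir, sprintf('result_%03d.mat', k));
        mysave(fname, 'X', X, 'k', k);      % save is called inside a function
    end
end

function mysave(filename, varargin)
    % Same helper as above: name/value pairs become variables in the file.
    data = struct(varargin{:});
    save(filename, '-v7.3', '-struct', 'data');
end
```

One caveat with struct(varargin{:}): a value that is itself a cell array makes struct return a struct array rather than a scalar struct, which save with '-struct' does not accept. Wrapping such a value in an extra pair of braces, for example mysave(fname, 'c', {myCell}), avoids that.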
Q490
2021-8-16
As a side note, and not sure if this is directly related to an answer to your question, a function I've found very useful that can be a good substitute for using matfile is "savefast", written by Tim Holy (https://www.mathworks.com/matlabcentral/profile/authors/1337381) and which can be downloaded at:
For the file sizes you are talking about it saves it extremely quickly and in the smallest possible file size. I highly recommend it.
cui,xingxing
2021-8-24
Edited: cui,xingxing
2021-8-24
Similar questions here; TMW should provide an effective solution.
Answers (0)