How to increase reading speed from a multi-gigabyte file?
farzad
2019-6-17
Hi all
How do I increase the reading speed from an Excel file, several gigabytes in size, that contains rows and columns of data?
18 Comments
dpb
2019-6-17
Dunno...'pends on what the data are and how saved...getting it out of Excel and into a .mat or stream file would undoubtedly be the fastest.
farzad
2019-6-17
The data are floats, let's say 5 gigabytes' worth.
Why .mat and why a stream file? What would the code look like?
Is using a table useful?
dpb
2019-6-17
'Cuz both .mat and stream files are binary representations of the actual bytes in memory, thus eliminating the need for conversion.
You've still not said which form of file it actually is; if it is .xls(x), then xlsread is fairly slow.
A table would be one choice for internal storage in MATLAB; how useful depends entirely on what the data are and how they need to be processed, which, like the actual file itself, you're keeping us totally in the dark about, so all we can do is guess...
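A minimal sketch of the convert-once-then-reread pattern described above (the file names here are illustrative, not from the thread):
data = xlsread('bigdata.xlsx');             % slow text/XML parse, done once
fid = fopen('bigdata.bin', 'w');
fwrite(fid, data, 'double');                % raw bytes, no text conversion
fclose(fid);
fid = fopen('bigdata.bin', 'r');            % subsequent reads skip the parse
fast = fread(fid, size(data), '*double');   % '*double' keeps class double
fclose(fid);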
dpb
2019-6-17
Well, with .xlsx files you have the choice between xlsread and readtable. You'll just have to test which is faster--one presumes readtable. If you have R2019a, you can try the new readmatrix, which is now recommended instead of xlsread.
For csv files, the historic ways are csvread, textscan, and fscanf; although, again with the caveat of requiring R2019a, readmatrix is now the TMW-recommended alternative.
I don't have R2019a installed yet, so I can't comment on the relative performance between it and alternatives.
Still, if speed and doing this more than once will be required, then doing it once and then using .mat or stream files will undoubtedly beat any of the alternatives.
You could, if your application can live with single precision, cut the file size in half by saving single instead of double. Whether that is a viable alternative is purely a question of what is required of the data itself.
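A sketch of that advice, assuming R2019a or later (the file names are illustrative):
data = readmatrix('bigdata.csv');            % TMW-recommended reader, R2019a+
s = single(data);                            % halves the size, if precision allows
save('bigdata_single.mat', 's', '-v7.3');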
Walter Roberson
2019-6-18
Edited: Walter Roberson
2019-6-18
I wrote out a 1e6 by 500 array of doubles = 4 gigabytes in several forms, and tested how long loading took.
When saved as space-delimited doubles using save -ascii -double, load() of the 12,501,000,000-byte text file took 1416 seconds.
textscan() of that same file took 265 seconds.
fscanf() of the same file took 371 seconds.
When saved as a .csv file using dlmwrite() with precision 16, load() took 1107 seconds.
When saved as a -v7.3 .mat, load() of the 3,796,914,266-byte file took 25 seconds.
When saved as a pure binary file, fread(fid, [1e6 500], '*double') took 14.25 seconds the first time, and 2.1 seconds the second time (file in the operating system cache). fread(fid, [1 inf], '*double') takes 4.6 seconds when the file is in the operating system cache, which tells us that there is more memory-management overhead when the size is unknown.
(I will update as I generate more times.)
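For reference, a sketch of the pure-binary read pattern timed above; the file name is illustrative, and the 1e6-by-500 shape matches the sizes reported:
fid = fopen('testdata.bin', 'r');
tic
data = fread(fid, [1e6 500], '*double');   % known size: least overhead
toc
fclose(fid);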
farzad
2019-6-18
Thank you very much Walter
That is very much what I was searching for. How do you save as .mat?
Walter Roberson
2019-6-18
data = rand(1e6, 500);          % the same 1e6-by-500 test array as above
save testdata.mat data -v7.3    % -v7.3 is required for variables over 2 GB
but this relies upon having the data in the first place to write out as .mat.
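Reading it back is just load(); for a file this size, matfile() can also pull out a block of rows without loading the whole array. A sketch, using the variable name from the example above:
S = load('testdata.mat');       % loads the whole 'data' variable into S.data
m = matfile('testdata.mat');    % or map the -v7.3 file instead
firstRows = m.data(1:1000, :);  % partial read, no full load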
Walter Roberson
2019-6-18
I am having difficulty creating an Excel file that large. I wrote the file as .csv, but my Excel complains about running out of memory when trying to import it, which does not make sense to me.
Walter Roberson
2019-6-18
I have been updating the timings; you might want to have another look, above.
dpb
2019-6-18
All of which continues to say "ditch Excel" entirely for such large files...
I do find it interesting that textscan manages to beat fscanf -- one would think they would boil down to the same C runtime library call. Just out of curiosity, what were the two specific commands used, Walter? Oh--did you include the overhead of casting the cell array from textscan to double?
Walter Roberson
2019-6-18
Edited: Walter Roberson
2019-6-18
I created a format with repmat of '%f' 500 times. I fopen the file and then
datacell = textscan(fid, fmt, 'collectoutput', 1);
Because this puts everything into a single cell the overhead to extract the array is trivial.
The timing with 'collectoutput', 0, without joining the columns afterwards, was a hair higher, but not statistically significant.
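Putting those pieces together, a self-contained sketch of the textscan approach as described (the file name is illustrative):
fmt = repmat('%f', 1, 500);                        % 500 float columns
fid = fopen('testdata.txt', 'r');
datacell = textscan(fid, fmt, 'collectoutput', 1); % one cell, one matrix
fclose(fid);
data = datacell{1};                                % extraction is trivial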
dpb
2019-6-18
Yeah, that's kinda' what I suspected, thanks for confirming, Walter.
I still find it more than strange that there's a 30% reduction over fscanf -- what are they doing wrong with fscanf, then, that there's that much room for improvement?
These timings couldn't possibly be related to caching issues, I presume; you're too careful for that! :)
Answers (0)