Loading part of a text file (i.e., fileread the first X bytes)

I'm using fileread to load data. The problem is that the files are large (several MB) and I only need to load and process the first fraction (say 100 kB) of each file. There are over 1M files, so the computation time wasted on loading all of this "data fat" at the end of the files adds up to several days.
Does anyone know of a way to use fileread (or something similar) where you can specify to only load part of the file into MATLAB's memory buffer? With this many files even saving a fraction of a second will make a big difference.

Accepted Answer

Walter Roberson 2019-10-2
Edited: Walter Roberson 2019-10-2
You would use fopen(), fread() with a size, then fclose(). You would want to use a "precision" specifier such as '*c'.
However, if there is a possibility that your files are UTF-encoded or use a multibyte character set, then you need to define more clearly what the size is intended to indicate. Is it (say) 100000 bytes that then potentially have to be decoded, or do you want to read 100000 decoded characters?
Also, remember to take line terminators into account in your counting. Does your file use carriage returns as well as linefeeds?
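A minimal sketch of the fopen / fread / fclose sequence described above, assuming the limit is meant as a raw byte count. The temporary demo file, the variable names, and the value of nBytes are illustrative assumptions, not taken from the thread. The precision 'uint8=>char' is used here so that the count is unambiguously in bytes:

```matlab
% Hypothetical demo file so the sketch runs on its own; substitute a real path.
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fwrite(fid, sprintf('header line\nbody line\n'));
fclose(fid);

nBytes = 6;   % assumed limit for the demo; the question suggests roughly 100e3

fid = fopen(filename, 'r');
if fid < 0
    error('Could not open %s', filename);
end
% 'uint8=>char' reads raw bytes and returns them as char, so nBytes counts
% bytes rather than decoded characters. A file shorter than nBytes simply
% returns fewer elements. The transpose turns the column into a row vector.
head = fread(fid, nBytes, 'uint8=>char').';
fclose(fid);

delete(filename);   % remove the demo file
```

If the files might be UTF-8, one option is to read the bytes as uint8 and decode them afterwards with native2unicode(bytes, 'UTF-8'). Note that a fixed byte cutoff can land in the middle of a multibyte character, so the last character of the block may not decode cleanly.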
2 Comments
Scott 2019-10-2
Thanks Walter. I'll see how the run time compares. I could likely also save time by not passing the full block of text to the various parsing functions.
This is helpful, thanks!
Walter Roberson 2019-10-2
Extracting the beginning of a character vector is not always more efficient, provided the parsing code can handle extra characters beyond what you need. But if you are using regexp, be sure to make .* lazy by adding the ? quantifier (that is, use .*?), or use the 'dotexceptnewline' option with .*. A plain .* first moves the pointer to the end of the entire stretch of characters and then works backwards to find matches, instead of finding the first match from the current position.
Extracting the beginning of a character vector usually does not cost much and can save you from having to carefully code .*?, but when you are counting fractions of a second, it is a small cost that is not strictly necessary. Extracting the beginning before parsing is cleaner programming in most cases, but not always the best optimization.
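To illustrate the difference between greedy .*, lazy .*?, and 'dotexceptnewline', here is a small sketch on made-up input (the text and the pattern are hypothetical, chosen only to show the matching behaviour):

```matlab
% Made-up two-line input; sprintf turns \n into real newlines.
txt = sprintf('name=alpha;\nname=beta;\n');

% Greedy: .* runs to the end of the text (newlines included by default),
% then backtracks to the LAST ';', so the token spans both lines.
greedy = regexp(txt, 'name=(.*);', 'tokens', 'once');

% Lazy: .*? stops at the FIRST ';' after the current position.
lazy = regexp(txt, 'name=(.*?);', 'tokens', 'once');

% 'dotexceptnewline': . no longer matches a newline, so even a greedy .*
% stays within the first line.
perLine = regexp(txt, 'name=(.*);', 'tokens', 'once', 'dotexceptnewline');

disp(greedy{1});
disp(lazy{1});
disp(perLine{1});
```

On a multi-megabyte character vector the greedy form has to scan to the end and work backwards, which is the cost the comment above warns about.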
