Regular expressions on uint8 or single byte characters

I have a 200 MB text file encoded in UTF-8. My maximum array size is around 350 MB, so I can safely read it in as bytes using fopen followed by fread(fid,'*uint8'). To use regular expressions, I need to convert this into a char array, which inflates the array size by at least a factor of two (depending on encoding, but for my application I can ignore all fancy characters) and thus leads to an "out of memory" error.
I wrote some code that breaks up the original array, so that the matching of the regular expressions works on smaller chunks, but I am still wondering: Can I somehow run regular expressions on the uint8 array? Or is there a char-like variable type that only uses 1 byte per character?
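The chunking approach described above might look roughly like the following. This is a minimal sketch, not the asker's actual code (which is not shown): the function name `chunked_regexp`, its arguments, and the chunk size are hypothetical, and it assumes that no match ever spans a newline, so each chunk can be cut at its last line break and the incomplete trailing line carried over to the next read. Only one chunk at a time is converted to char, which keeps the 2-bytes-per-character cost small.

```matlab
function starts = chunked_regexp(filename, pattern, chunkBytes)
% CHUNKED_REGEXP  Hypothetical helper: byte offsets (1-based) of regexp matches.
% Assumes matches never span a newline (byte value 10).
    fid = fopen(filename, 'r');
    assert(fid ~= -1, 'Could not open %s', filename);
    cleanup = onCleanup(@() fclose(fid));   % close the file on exit or error

    starts = [];                   % match positions relative to start of file
    carry  = zeros(1, 0, 'uint8'); % incomplete last line of the previous chunk
    offset = 0;                    % number of bytes that precede the buffer

    while true
        raw   = reshape(fread(fid, chunkBytes, '*uint8'), 1, []);
        atEnd = numel(raw) < chunkBytes;     % short read means end of file
        buf   = [carry, raw];

        if atEnd
            cut = numel(buf);                % process everything that is left
        else
            cut = find(buf == 10, 1, 'last'); % cut after the last newline
            if isempty(cut), cut = 0; end     % no newline yet: keep reading
        end

        % Only this slice is converted to char (2 bytes per character).
        txt    = char(buf(1:cut));
        starts = [starts, offset + regexp(txt, pattern, 'start')]; %#ok<AGROW>

        carry  = buf(cut+1:end);
        offset = offset + cut;
        if atEnd, break; end
    end
end
```

A call such as `starts = chunked_regexp('data.txt', '\d+', 2^20)` would then scan the file 1 MB at a time. If a pattern can cross line breaks, the cut point would need a different rule, for example an overlap at least as long as the longest possible match.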
5 Comments
dpb 2013-8-26
Instead of 'uint8', try 'uchar'. Not sure it'll help, but it is at least a character class, not an integer.
Cedric 2013-8-27
Edited: Cedric 2013-8-27
Actually, it is simpler to ask what you are trying to match instead of the pattern (copy/paste of chunk of file content or string, and an explanation of what you want to extract). With a little luck, we can perform this using STRFIND (which works on uint8 arrays) or some numeric test on uint8's.
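To illustrate the two ideas in this comment, here is a small sketch that stays entirely in uint8. The sample bytes and the search token `id=` are made up for illustration, and the use of STRFIND on uint8 row vectors relies on the behavior described in the comment rather than on a documented guarantee, so it may be worth checking on your MATLAB release.

```matlab
% Stand-in for file content; in practice this would be the row vector
% obtained from fread(fid, '*uint8')'.
data = uint8('id=12; name=abc; id=7;');

% Literal substring search directly on the bytes (both inputs are uint8 rows).
pos = strfind(data, uint8('id='));

% Numeric test on the bytes: logical mask of ASCII digits '0'..'9'.
isDigit = data >= uint8('0') & data <= uint8('9');

% Example use: the digit bytes converted to char one small piece at a time.
digitsOnly = char(data(isDigit));
```

Because neither step creates a char copy of the whole array, peak memory stays close to the size of the uint8 data plus one logical mask of the same length.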

Answers (0)
