How to force textscan to include the custom EOL character

Question

dymitr ruta 2022-9-20

Direct link to this question: https://ww2.mathworks.cn/matlabcentral/answers/1808895-how-to-force-textscan-to-include-the-custom-eol-character

Commented: dymitr ruta 2022-9-22

Hi Folks,

I am loading multiline chunks of text from very big files into separate cells; each chunk starts with the '@' character. Here is my code to do that:

y=textscan(x,'%s',1e7,'EndOfLine','@'); y=y{1};

The result is good and fast, except that I also want to include the opening '@'. I am running it against massive (TB-sized) files in blocks of 1e7 chunks. I know I can do y=strcat('@',y) afterwards, but this post-processing step takes longer than the original textscan itself. Is there a way to force textscan to also include the specified EOL character, or is there any other faster solution?

Here is a testing line where I create a big multiline string to simulate the file:

x=repmat(['@abc:1:abc:1:2:3:4\ndef:1:abc:1:2:3:4\n'],1,1e7); tic; 
y=textscan(x,'%s','EndOfLine','@'); y=y{1}; 
t(1)=toc; 
y=strcat('@',y); 
t(2)=toc

Note: I want to retain the ability to rapidly put the filtered string/file back together with x=[y{:}];

Help much appreciated


Answer 1

Walter Roberson 2022-9-20

Direct link to this answer: https://ww2.mathworks.cn/matlabcentral/answers/1808895-how-to-force-textscan-to-include-the-custom-eol-character#answer_1057465

No, textscan() will always eat the EndOfLine delimiters.

One approach:

Keep a buffer of unprocessed text, initially []

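% fid is assumed to be a file identifier already opened with fopen
CHUNKSIZE = 2^20;   % characters per read; an arbitrary value, tune to available memory
buffer = [];        % unprocessed text carried over between reads
while ~feof(fid)
    % '*char' returns char data, which regexp requires; fread returns a
    % column vector, so transpose it before appending to the row buffer
    buffer = [buffer, fread(fid, CHUNKSIZE, '*char').'];
    if isempty(buffer) %end of file
        break
    elseif buffer(1) ~= '@'
        %something is wrong with the input stream, we expected a @ at the
        %beginning
        error('Expected ''@'' at the start of the buffer.');
    else
        parts = regexp(buffer, '@[^@]*', 'match');
        buffer = parts{end};   %last chunk may be incomplete, keep it for the next read
        parts(end) = [];
        %now process the chunks in cell array parts
    end
end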

At this point, buffer should be non-empty and should hold the last chunk. Because the last chunk is not followed by @, we cannot know at the time we read it that it is complete: we can only tell that a chunk has ended by peeking ahead and seeing @ as the next character, or by detecting that we have reached end of file. So expect buffer to contain the last chunk after the loop.
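The splitting step and the carried-over buffer can be illustrated on a short in-memory string instead of a file. This is only a sketch; the sample text is made up, and its last chunk is deliberately cut off to mimic a read boundary.

```matlab
% Hypothetical sample: the third chunk is cut off mid-record,
% as can happen at the end of a fixed-size read.
buffer = '@abc:1@def:2@gh';

% Each match starts at '@' and runs up to (not including) the next '@'.
parts = regexp(buffer, '@[^@]*', 'match');

% The last match may be incomplete, so carry it over to the next read.
buffer = parts{end};
parts(end) = [];

disp(parts)    % complete chunks, each still starting with '@'
disp(buffer)   % leftover text to prepend to the next read
```

Here parts holds '@abc:1' and '@def:2', and buffer holds '@gh'. After the final read, whatever remains in buffer is the last chunk and has to be processed separately.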


dymitr ruta 2022-9-21

Edited: dymitr ruta 2022-9-21

Fantastic, that cut the time in half. However, your code does not work with '*uchar', because regexp does not accept the uint8 values returned by fread(fid, CHUNKSIZE, '*uchar'). It works with fread(fid, CHUNKSIZE, '*char'), but then the chunk that is read occupies twice as much space. There is also a limit on the chunk size in Matlab of 2^29 (as the error message informs), but I managed to get chunksize=1e9 working. With my data it gave me only 2.9 million parts at the end, which is not that many compared to the 1e7 I was working with, but given the speedup it is worth it. Thanks
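The uint8-versus-char trade-off mentioned here can be seen on a tiny example. This is a sketch with made-up data: the variable bytes stands in for what a '*uchar' read would return, and the names are chosen only for illustration.

```matlab
% Stand-in for data read with '*uchar' (uint8, 1 byte per element).
bytes = uint8('@abc@def');

% regexp needs char input, so convert first (char uses 2 bytes per element).
txt = char(bytes);
parts = regexp(txt, '@[^@]*', 'match');

% Compare the memory footprint of the two representations.
infoBytes = whos('bytes');
infoTxt   = whos('txt');
fprintf('uint8: %d bytes, char: %d bytes\n', infoBytes.bytes, infoTxt.bytes);
```

For these 8 characters the uint8 array takes 8 bytes and the char array 16, which matches the doubling described above.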


Answer 2

dpb 2022-9-20

Direct link to this answer: https://ww2.mathworks.cn/matlabcentral/answers/1808895-how-to-force-textscan-to-include-the-custom-eol-character#answer_1057475

x=repmat(['@abc:1:abc:1:2:3:4\ndef:1:abc:1:2:3:4\n'],1,1e7); tic; 
y=textscan(x,'%s','EndOfLine','@'); y=y{1}; 
t(1)=toc; 
y=strcat('@',y); 
t(2)=toc; disp([t sum(t)])
    2.5350   21.9892   24.5242
tic; 
y=textscan(x,'%s','EndOfLine','@'); y=string(y{1}); 
t(1)=toc; 
y="@"+y; 
t(2)=toc; disp([t sum(t)])
    3.7476    5.2454    8.9931

This trades the very expensive strcat function for direct addition with the newer string class. It takes a second longer to convert to string, but saves 15 or so seconds in the catenation operation.

Direct catenation of the cellstr array with the cellfun variant was even slower than strcat, at least without more effort than I had time to give at the moment, but the above may lead to some other ideas on direct memory manipulation. Presuming the character strings in the real application aren't of uniform length, conversion to a straight char() array is probably not the way to go, so I didn't even look at that variant.


dymitr ruta 2022-9-21

Thanks, good solution, although in my setup it gives a much smaller improvement: 21s->17s, and if you want to convert the strings into cells (which I prefer) via cellstr, you end up with almost the same time. The solution above via fread cuts this time down to only 9s. I will bear in mind the strings + fast operation, though, many thanks.

dymitr ruta 2022-9-22


After some experimentation I got your string solution to work the fastest in combination with the split function, even beating Walter's regexp:

y='@'+split(string(x),'@'); y(1)=[];

Fantastic, many thanks for the lead
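For reference, a minimal sketch of the split approach together with the reassembly step asked for in the question. The sample text is made up, and it is assumed that the text starts with '@', so the first piece returned by split is empty and can be dropped.

```matlab
% Hypothetical sample text; sprintf turns \n into real newlines.
x = sprintf('@abc:1\ndef:2\n@ghi:3\njkl:4\n');

% Split on '@' and put the '@' back in front of every piece.
y = "@" + split(string(x), "@");
y(1) = [];                 % the piece before the first '@' is empty here

% Convert to a cell array of char vectors if cells are preferred.
c = cellstr(y);

% Reassemble the original text from either representation.
x1 = char(join(y, ""));
x2 = [c{:}];
```

Both x1 and x2 equal the original x in this example. If the input did not start with '@', y(1)=[] would discard real text, so that case would need separate handling.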
