How can I delete numeric Headings/Delimiters from a large text file

Question

Tate Shorthill 2018-1-19

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/377884-how-can-i-delete-numeric-headings-delimiters-from-a-large-text-file

Commented: Greg 2018-1-23

examplefile.txt

I have packets of data, each (47,4) (rows,columns), in a massive text file. The packets are separated by a row of numerical headings (1,3). I would like to create a script that finds all the 3-column rows and deletes them, giving me one massive 4-column data file. (Note: all values are numeric.) Thanks for the help!

I've attached an example file.

2 Comments

Greg 2018-1-19

Can you upload a representative sample file?

Sounds like a job for memmapfile, if we had an example to work with.

Tate Shorthill 2018-1-19

I've attached an example file.


Answer 1

Jan 2018-1-19

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/377884-how-can-i-delete-numeric-headings-delimiters-from-a-large-text-file#answer_300842

Edited: Jan 2018-1-23

inFID     = fopen(FileName, 'r');
outFID    = fopen([FileName, '.fixed'], 'W');  % [EDITED: 'w' => 'W']
Delimiter = ',';  % Or whatever it is
Break     = char(10);
while ~feof(inFID)
  S = fgetl(inFID);  % Read one line without its line break
  if length(strfind(S, Delimiter)) > 2  % 4 columns contain at least 3 delimiters
    fwrite(outFID, S, 'char');
    fwrite(outFID, Break, 'char');  % Linebreak
  end
end
fclose(inFID);
fclose(outFID);

Read each line and write it to the output only if it is not recognized as one of the "numerical headings (1,3)". The delimiter count test (more than 2 delimiters per line) may not be optimal, so please post a short example that shows how the two kinds of lines can be distinguished.
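One possible way to distinguish the two kinds of lines, sketched here under the assumption that the values are separated by whitespace, is to count the numbers parsed from each line instead of counting delimiters. The sample lines and the variable names are made up for illustration.

```matlab
% Sketch: keep only lines that hold exactly four numeric values.
% Assumption: whitespace-delimited numbers; the sample lines are invented.
rows = {' 7 8 9', ' 1 5 0 0', ' 2 15 0 0', ' 10 11 12', ' 3 25 0 0'};
keep = false(size(rows));
for k = 1:numel(rows)
  vals    = sscanf(rows{k}, '%f');   % parse all numbers on this line
  keep(k) = (numel(vals) == 4);      % 4 values: data row, 3 values: heading
end
dataRows = rows(keep);               % the 4-column lines only
disp(dataRows(:));
```

The same test could replace the delimiter count inside a line-by-line loop; it would not work unchanged for comma-separated values, because sscanf with '%f' stops at the first comma.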

By the way: "massive" text files are nonsense. Text files are useful only if they are read and edited by humans, which is impossible for huge data. If you only want to store a matrix, consider converting the data to a binary format. This would be more efficient.
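As a rough illustration of the binary option, the following sketch writes a small matrix as raw doubles and reads it back. The matrix contents and the temporary file name are placeholders, not taken from the example file.

```matlab
% Sketch: store a numeric matrix as raw doubles and read it back.
% Assumption: the matrix and the temporary file name are placeholders.
data    = [1 5 0 0; 2 15 0 0; 3 25 0 0];   % n-by-4 matrix
binFile = [tempname, '.bin'];

fid = fopen(binFile, 'w');
fwrite(fid, data.', 'double');   % transpose so each row is stored contiguously
fclose(fid);

fid  = fopen(binFile, 'r');
back = fread(fid, [4, Inf], 'double').';   % 4 values per record, then transpose
fclose(fid);
delete(binFile);

disp(isequal(back, data));       % round trip preserves the values
```

Because the file holds no row count, the reader has to know that each record has 4 values; that is the price for the compact format.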

[EDITED 2] Do the lines to be kept start with a space? Then this might be faster:

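S = fileread(FileName);             % Read the entire file as one char vector
C = strsplit(S, char(10));          % Split into lines
C = C(strncmp(C, ' ', 1));          % Keep only lines starting with a space
fid = fopen([FileName, '.fixed'], 'W');
fprintf(fid, '%s\n', C{:});
fclose(fid);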

3 Comments (1 older comment not shown)

Jan 2018-1-22

I do not see a need to advertise another thread here. My answer does not concern textscan or regexp. But if you have a good reason to do this, post a link. It is not convenient to let the readers search.

Jan 2018-1-23

I have changed the output buffering by using fopen('W'). See [EDITED 2] for another idea.

Answer 2

Greg 2018-1-20

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/377884-how-can-i-delete-numeric-headings-delimiters-from-a-large-text-file#answer_300961

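f = 'examplefile.txt';
% Skip the 3 heading values, then read 47 rows of 4 values as one record
fmt = ['%*f%*f%*f',repmat('%f%f%f%f',1,47)];
fid = fopen(f,'rt');
data = textscan(fid,fmt,Inf,'Delimiter',{' ','\n'}, ...
    'MultipleDelimsAsOne',true,'CollectOutput',true);
fclose(fid);
data = data{1}';              % one column per 47-row block
data = reshape(data,4,[])';   % one row per data line, 4 columns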

Jan's solution is the brute-force approach; it won't break if a block has 46 lines instead of 47. Mine takes advantage of the given repetition. I'll leave it to you to output the data; as noted in other comments, I don't recommend writing it out to another .txt file. Chances are that file will just need to be read in again, so a .mat file or a standard fwrite(...,'double') format would be drastically more appropriate.
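The idea of exploiting the fixed block length can be checked on a tiny in-memory example. This sketch uses sscanf and reshape rather than textscan, and it assumes blocks of 2 data rows (instead of 47) after each 3-value heading; the sample numbers are invented.

```matlab
% Sketch: strip headings by exploiting a fixed block length.
% Assumptions: 2 data rows per block (not 47), whitespace-delimited numbers.
nRows = 2;                                   % data rows per block
txt   = sprintf(' 7 8 9\n 1 5 0 0\n 2 15 0 0\n 10 11 12\n 3 25 0 0\n 4 35 0 0\n');

vals     = sscanf(txt, '%f');                % all numbers in file order
blockLen = 3 + 4 * nRows;                    % heading values + data values
vals     = reshape(vals, blockLen, []);      % one column per block
vals(1:3, :) = [];                           % drop the 3 heading values
data     = reshape(vals, 4, []).';           % one row per data line
disp(data);
```

Like the textscan version, this fails if any block has a different number of rows, because reshape then either errors or silently misaligns the columns.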

2 Comments

Tate Shorthill 2018-1-23

Thanks for your feedback. I will likely come back to this because the code provided by Jan, while effective, is fairly slow.

Greg 2018-1-23

You're dealing with ASCII text, so slow is kind of your only option. This should be a bit faster than Jan's, but keep in mind that his is reading and writing, while mine is reading only.

Answer 3

per isakson 2018-1-20

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/377884-how-can-i-delete-numeric-headings-delimiters-from-a-large-text-file#answer_300968

Edited: per isakson 2018-1-20

And with regular expressions

>> out = cssm('examplefile.txt');
>> out(1:32)
ans =
  1 5 0 0
  2 15 0 0
  3 25 0 
>>

where cssm is

function    out = cssm( ffs )
      xpr = [         ... Match the "numerical headings"                  
      '(?m)       '   ... "^" and "$" match beginning and end of line
      '^\x20*     '   ... beginning of line and optional space, '\x20'    
      '[\d.]+     '   ... one or more digits and periods
      '[,\x20]+   '   ... list delimiter; one or more comma and space
      '[\d.]+     '   ... one or more digits and periods
      '[,\x20]+   '   ... list delimiter; one or more comma and space
      '[\d.]+     '   ... one or more digits and periods
      '\x20*      '   ... optional trailing spaces
      '\r?\n      '   ... zero or one CR and one LF; new line
      ];
      xpr( isspace( xpr ) ) = [];         % remove space
      str = fileread( ffs );              % read entire file as string
      out = regexprep( str, xpr, '' );    % replace "numerical headings" by empty
      fid = fopen( 'out.txt', 'w' );      
      [~] = fwrite( fid, out, 'char' );   % write the modified string
      fclose( fid );
  end
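To see what the expression removes without touching any file, the pattern can be applied to a short in-memory string. This is only a sketch: the sample text is invented, and the pattern is written out as a single compact char vector equivalent to the commented one above.

```matlab
% Sketch: remove 3-value heading lines from an in-memory string.
% Assumption: the sample text is made up; the pattern is the compact form
% of the commented expression in cssm.
str = sprintf(' 7 8 9\n 1 5 0 0\n 2 15 0 0\n 10 11 12\n 3 25 0 0\n');
xpr = '(?m)^\x20*[\d.]+[,\x20]+[\d.]+[,\x20]+[\d.]+\x20*\r?\n';
out = regexprep(str, xpr, '');   % heading lines are replaced by nothing
disp(out);
```

A 4-value line is never matched, because after the third number the pattern requires only optional spaces and then the line break. Negative numbers or exponent notation would need a wider character class than [\d.].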
