Removing unwanted lines from text file

Question

jgillis16 2015-8-6

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/232544-removing-unwanted-lines-from-text-file

Commented: Cedric 2015-8-6

Accepted Answer: Cedric

virgorm.txt


I am trying to remove all the NaN from column 7 of the attached text file and move them into a new text file.

I have written the code below:

% - Read original.
 content = fileread( 'virgorm.txt' ) ;
 % - Match and eliminate lines without pattern matching.
 sepId = reshape( strfind( content, '|' ), 7, [] ) ;
 match = content(sepId(7,:)+1) == 'NaN' ;
 lines = strsplit( content, '\n' ) ;
 lines(match) = [] ;
 % - Export updated content.
 fId = fopen( 'virgormwou.txt', 'w' ) ;
 fprintf( fId, strjoin( lines, '\n' )) ;
 fclose( fId ) ;

But, it doesn't seem to be working. I suspect it is because of line:

match = content(sepId(7,:)+1) == 'NaN' ;

The error I get is:

Error using reshape
Product of known dimensions, 7, not divisible into total number of elements, 6492.


Answer 1

Cedric 2015-8-6

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/232544-removing-unwanted-lines-from-text-file#answer_188402

Edited: Cedric 2015-8-6

Not far! You actually made two small mistakes. The first is that you have 7 columns, and hence 6 separators per line, so the array of separator IDs must be reshaped using 6 rows:

sepId = reshape( strfind( content, '|' ), 6, [] ) ;
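A quick way to see why 6 rows work and 7 do not is to check divisibility before reshaping. The following is only a sketch on made-up sample data (the two-line `content` string and the names `pos`, `nSep`, and `remainders` are assumptions for illustration, not taken from virgorm.txt); the number 6492 is the element count reported in the error message.

```matlab
% Hypothetical sample with 7 columns per line, hence 6 separators per line.
content = sprintf('1|2|3|4|5|6|NaN\n1|2|3|4|5|6|7.5');
pos  = strfind(content, '|');   % positions of all separators
nSep = 6;                       % number of columns minus one

% Fail early with a readable message if the file is not regular.
assert(mod(numel(pos), nSep) == 0, ...
    'Separator count %d is not a multiple of %d.', numel(pos), nSep);
sepId = reshape(pos, nSep, []); % one column of positions per line

% The count from the error message fits 6 rows but not 7.
remainders = [mod(6492, 7), mod(6492, 6)];
```

If the assertion fails on a real file, some line probably has a different number of separators, and the reshape-based approach would not apply as is.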

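Then you cannot test whether one char/element equals 'NaN' the way you do: == works element by element and needs operands of matching size (or a scalar), so a vector of single characters cannot be compared against the three-character string 'NaN'. I would just check for the presence of 'N' after the 6th separator: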

found = content(sepId(6,:)+1) == 'N' ;
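If checking only the first character feels too loose, one possible variant compares all three characters after the last separator. This is a sketch under assumptions: the sample `content` has 3 columns instead of 7, every last field is at least 3 characters long (otherwise the index could run past the end of the text), and the names `start`, `idx`, and `isNaNField` are made up for the illustration. BSXFUN is used so that it also runs on releases without implicit expansion.

```matlab
% Hypothetical sample: 3 columns, 2 separators per line.
content = sprintf('a|b|NaN\nc|d|1.5\ne|f|NaN');
sepId = reshape(strfind(content, '|'), 2, []);
start = sepId(2,:) + 1;                 % first character of the last field

% One row of indices per line, covering the 3 characters of the field.
idx = bsxfun(@plus, start(:), 0:2);

% Compare each row with 'NaN'; a row matches only if all 3 characters agree.
isNaNField = all(content(idx) == repmat('NaN', numel(start), 1), 2).';
```

On this sample the result agrees with the single-character test, which is usually enough when the column holds either numbers or NaN.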

Finally, and I renamed the variable match into found for that purpose (which should remind you of one of your previous questions), you can split and export to two files as follows:

 lines = strsplit( content, '\n' ) ;   % UPDATED: I forgot to copy this line.
 % - Export lines with NaN in column 7.
 fId = fopen( 'output_nan.txt', 'w' ) ;
 fprintf( fId, '%s', strjoin( lines(found), '\n' )) ;   % '%s' keeps any % or \ in the data literal.
 fclose( fId ) ;
 % - Export the remaining lines.
 fId = fopen( 'output_noNan.txt', 'w' ) ;
 fprintf( fId, '%s', strjoin( lines(~found), '\n' )) ;
 fclose( fId ) ;

4 Comments (the 2 earlier comments are not shown)

Cedric 2015-8-6

Did you change the sepId(7,:) into sepId(6,:) as well?

Cedric 2015-8-6


You should actually work on a small example, to get a better understanding of what we do:

 >> buffer = sprintf( '1|3|2|~|7\n2|1|5|100|12\n3|2|28|~|137' )
 buffer =
        1|3|2|~|7
        2|1|5|100|12
        3|2|28|~|137

This creates a string of characters which has the same structure as your files. The \n is an escape code that creates a new line.

Now we can look for the positions/IDs of | in this string:

 >> strfind( buffer, '|' )
 ans =
     2     4     6     8    12    14    16    20    25    27    30    32

You can check that these positions are correct if you count each new line as a single character. By the way, you can see the ASCII codes of all these characters by converting to numeric (adding 0 triggers an automatic conversion to numeric):

 >> buffer + 0 
 ans =
     49   124    51   124    50   124   126   124    55    10    50   124    49   124    53   124    49    48    48   124    49    50    10    51   124    50   124    50    56   124   126   124    49    51    55

Here, 49 is the ASCII code of '1', 51 is the ASCII code of '3', 124 is the ASCII code of '|', and 10 is the ASCII code of the new line (line feed) character. This shows that SPRINTF encodes '\n' as 10, which is a single character.
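The difference between the two-character literal and the single new line character can be checked directly. A small sketch (the variable names `literal`, `nl`, and `codes` are chosen here for illustration):

```matlab
literal = '\n';                   % in single quotes: backslash followed by n
nl      = sprintf('\n');          % SPRINTF turns the escape code into one character
codes   = double(sprintf('a\nb'));% numeric codes of 'a', new line, 'b'

nLiteral = numel(literal);        % length of the raw literal
nNewline = numel(nl);             % length after SPRINTF
```

This is why positions returned by STRFIND on the buffer count each line break as one character.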

Back to positions: accounting for the fact that new lines are single characters, you can check that the positions are correct. Now if we want to get the position of the 3rd | on each line, we can compute the start and the step for extracting the relevant positions. Another way is to build an array whose number of rows equals the number of | on a line (4 here), which means reshaping the vector of positions as follows:

 >> sepId = reshape( strfind( buffer, '|' ), 4, [] )
 sepId =
     2    12    25
     4    14    27
     6    16    30
     8    20    32

The result is transposed with respect to the text (RESHAPE fills arrays column by column), but you can recognize in the first column all positions associated with line 1, in the second column all positions associated with line 2, etc. So getting the positions/IDs associated with the 3rd | means extracting row 3 of this array:

 >> sepId(3,:)
 ans =
     6    16    30
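The "start and step" alternative mentioned earlier can be sketched as follows, assuming every line has the same number of separators. The names `pos`, `nSep`, `k`, and `third` are made up for this illustration.

```matlab
buffer = sprintf('1|3|2|~|7\n2|1|5|100|12\n3|2|28|~|137');
pos  = strfind(buffer, '|');    % all separator positions, line after line
nSep = 4;                       % separators per line
k    = 3;                       % we want the 3rd separator of each line

third = pos(k:nSep:end);        % start at the k-th position, step by nSep

% Same result as reshaping and extracting row k.
sepId = reshape(pos, nSep, []);
sameAsRow = isequal(third, sepId(k,:));
```

Both forms rely on the file being regular; the reshaped array is mostly convenient when several separators per line are needed.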

Now we can get the character that follows immediately by extracting elements of buffer at these positions +1 :

 >> buffer(sepId(3,:)+1)
 ans =
 ~1~

and we can test whether these characters are '~' or not:

 >> found = buffer(sepId(3,:)+1) == '~'
 found =
     1     0     1

Note that found is a vector of logicals (booleans: true is displayed as 1 and false as 0):

 >> class( found )
 ans =
     logical

which means that we can create "not found":

 >> ~found
 ans =
     0     1     0

We can use both for indexing arrays (logical indexing). If we want to index lines, we have to split buffer into lines, which we do with STRSPLIT using the new line as delimiter:

 >> lines = strsplit( buffer, '\n' )
 lines = 
    '1|3|2|~|7'    '2|1|5|100|12'    '3|2|28|~|137'

This is a cell array of lines/strings:

 >> class( lines )
 ans =
     cell

and we can index its cells using a logical index (true=1 elements flag cells to extract):

 >> lines(found)
 ans = 
    '1|3|2|~|7'    '3|2|28|~|137'
 >> lines(~found)
 ans = 
    '2|1|5|100|12'

Now we can export these to files, but we have to join lines with a new line character:

 >> strjoin( lines(found), '\n' )
 ans =
     1|3|2|~|7
     3|2|28|~|137

and the rest you know well, it's opening files for writing, writing, and closing files.
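For completeness, here is a sketch of the whole small example including the export step. The temporary file names built with TEMPNAME and the variables `fileFound`, `fileOther`, `textFound`, and `textOther` are assumptions made for the illustration; the files are read back only to check the result and are deleted afterwards.

```matlab
buffer = sprintf('1|3|2|~|7\n2|1|5|100|12\n3|2|28|~|137');
sepId  = reshape(strfind(buffer, '|'), 4, []);
found  = buffer(sepId(3,:)+1) == '~';   % lines with '~' in column 4
lines  = strsplit(buffer, '\n');

% Temporary file names (hypothetical, for this sketch only).
fileFound = [tempname '_found.txt'];
fileOther = [tempname '_other.txt'];

% Write flagged lines; '%s' prevents the data from being read as a format.
fId = fopen(fileFound, 'w');
fprintf(fId, '%s', strjoin(lines(found), '\n'));
fclose(fId);

% Write the remaining lines.
fId = fopen(fileOther, 'w');
fprintf(fId, '%s', strjoin(lines(~found), '\n'));
fclose(fId);

% Read back to verify, then clean up.
textFound = fileread(fileFound);
textOther = fileread(fileOther);
delete(fileFound);
delete(fileOther);
```

Replacing the buffer with the output of FILEREAD, the 4 with 6, and '~' with 'N' gives back the code from the answer.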
