Reading mixed format data from '.txt' file in matlab

Question

JAMMI ASHOK 2020-8-16

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/580143-reading-mixed-format-data-from-txt-file-in-matlab

Commented: JAMMI ASHOK 2020-8-18

george
 23.91  29.70  19.08  48.00  23.33  10.25   2.28  3.36
 23.84  29.88  19.78  48.75  21.76   8.30   2.62   
  0.08  -0.18  -0.70  -0.75   1.57   1.94  -0.34  5.6  2.3  4.9  5.68
sams
 18.90  29.30  15.12  43.20  19.71   8.87   2.22
 18.76  31.28  15.15  50.18  16.15   5.96   2.71  21.76   8.30   2.62 
  0.14  -1.98  -0.03  -6.98   3.56   2.91  -0.49
peter
 22.71  78.30  18.27  82.90  21.28  36.08   0.59
 21.60  73.83  17.03  84.30  20.11  39.14   0.51
  1.10   4.47   1.24  -1.40   1.17  -3.07   0.08
jack
 18.56  40.70  14.85  45.30  19.13  11.34   1.69  78.30  18.27  82.90
 19.12  26.06  15.30  47.38  16.90   5.71   2.96
 -0.56  14.64  -0.45  -2.08   2.23   5.63  -1.27

This is a sample. I want to know how to read the data, when we have different lines and different formats of data.

Thank you for your time,

Ashok.

2 Comments

Walter Roberson 2020-8-16

Is the number of numeric lines between names always the same?

I notice that the number of numeric items is not the same for every line. Do you want it to be loaded in as a cell array with a vector for every line, so that the length of the lines can be preserved? Do you want shorter lines to be padded out with zeros so that every line is stored as the same length? Do you want shorter lines to be padded with NaN?

For the above sample, what output would you want?

JAMMI ASHOK 2020-8-16

Thank you Walter.

I want to be able to access every point in the data.

Like for example -> (1,1) -> "george"

(2,3)-> 19.08, (9,1) -> "peter".....

I don't know whether my expectations are right.

At least I want it to be loaded in as a cell array with a vector for every line, so that the length of the lines can be preserved.

The number of numeric lines between names is not constant.

Thank you,

Ashok.


Answer 1

Walter Roberson 2020-8-16

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/580143-reading-mixed-format-data-from-txt-file-in-matlab#answer_480504


filename = 'sample.txt';
S = fileread(filename);
lines = cellfun(@strtrim, regexp(S,'\r?\n', 'split'), 'uniform', 0);
values = cellfun(@(s) cell2mat(textscan(s, '')),lines, 'uniform', 0);
mask = cellfun(@isempty,values);
values(mask) = lines(mask);

I did take a shortcut here, however. The code above will error if any line has text in a column after numbers, and on any line that has numbers in a column after text, the numbers will not be converted. If numbers and text need to be parsed on the same line, the code will need to be improved.
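One possible way around that limitation is to split each line into whitespace-separated tokens and convert them one at a time. This is only a sketch, not part of the original answer: the sample text, the variable names, and the choice to store mixed lines as cell rows are all assumptions made for illustration.

```matlab
% Hypothetical sample: the second line mixes numbers and text.
S = sprintf('george\n 23.91  29.70  n/a  48.00\n 0.08 -0.18\n');

lines = strtrim(regexp(S, '\r?\n', 'split'));
lines = lines(~cellfun(@isempty, lines));   % drop blank lines

parsed = cell(size(lines));
for k = 1:numel(lines)
    tokens = strsplit(lines{k});            % split on runs of whitespace
    nums   = str2double(tokens);            % NaN where a token is not numeric
    isNum  = ~isnan(nums);
    if all(isNum)
        parsed{k} = nums;                   % purely numeric line: row vector
    elseif ~any(isNum)
        parsed{k} = lines{k};               % purely text line: keep as char
    else
        row = tokens;                       % mixed line: cell row
        for j = find(isNum)
            row{j} = nums(j);               % replace numeric tokens by doubles
        end
        parsed{k} = row;
    end
end
```

With this layout, parsed{1} is the name and parsed{3}(2) is the second number on the third line. Note that a literal NaN token in the file would be treated as text by this sketch, because str2double() also returns NaN for it.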

2 Comments

JAMMI ASHOK 2020-8-16

Thank you Walter.

The code helped.

Can you please suggest a good reference for file I/O in MATLAB?

Walter Roberson 2020-8-16

File I/O in MATLAB is not so different from file I/O in C. The usual functions such as fopen(), fclose(), fseek(), fread(), fwrite() and fscanf() are there, along with sscanf().
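As an illustration of that C-like style, here is a minimal sketch that reads a file line by line with fopen(), fgetl() and sscanf(). The temporary file and its contents are made up for the example; treating a line as numeric only when sscanf() reports no error is one possible convention, not the only one.

```matlab
% Write a small hypothetical sample file so the sketch is self-contained.
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, 'george\n 23.91  29.70\n 0.08\nsams\n 18.90 29.30 15.12\n');
fclose(fid);

fid = fopen(filename, 'r');
if fid == -1
    error('Could not open %s', filename);
end
result = {};
while true
    tline = fgetl(fid);               % returns numeric -1 at end of file
    if ~ischar(tline), break; end
    tline = strtrim(tline);
    if isempty(tline), continue; end  % skip blank lines
    [nums, ~, errmsg] = sscanf(tline, '%f');
    if isempty(errmsg)
        result{end+1} = nums.';       % numeric line: store as a row vector
    else
        result{end+1} = tline;        % anything else: keep the text
    end
end
fclose(fid);
delete(filename);
```

One caveat of this convention: a name that happens to read as a number to sscanf(), such as "inf", would be stored as a numeric value.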

The utility routine fileread() is useful to read in entire files as characters, instead of having to fopen/fread/fclose(). As of R2020a, fileread() also does character set translation if it can figure out the encoding.

One useful function that MATLAB has and C / C++ do not is textscan(), which is designed for reading groups of text that have a repeated structure. The repeated structure does not have to be columns: in your file, if the number of numeric lines were the same each time, then we would have been able to read the name / values sections with a single call (as character vectors for each group).

MATLAB also has the higher level I/O functions readtable(), readcell(), readmatrix(), along with lower level csvread() and dlmread(). [csvread() and dlmread() are implemented by using textscan(), and are not good at reading text.]

Important utilities for text processing are regexp() and the closely related regexprep(). They are quite powerful for parsing purposes, but it can be difficult to figure out how to use them for more complicated tasks. They effectively have their own built-in language, and using them well can take quite a bit of experience. Fortunately, there are very closely related routines in a number of other programming languages (but some of the fine details are specific to MATLAB).
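For a feel of what regexp() and regexprep() do on data like the sample in the question, here is a small sketch. The sample text and the patterns are assumptions chosen for illustration; in particular, the number pattern only covers plain decimals, not exponents.

```matlab
% Hypothetical sample in the same layout as the question's file.
txt = sprintf('george\n 23.91  29.70\nsams\n 18.90 29.30\n');

% 'lineanchors' makes ^ match at the start of every line, so this picks
% out the lines that begin with a letter (the names).
names = regexp(txt, '^[a-zA-Z]\w*', 'match', 'lineanchors');

% Match plain decimal numbers anywhere in the text and convert them.
numtxt = regexp(txt, '[-+]?\d+\.?\d*', 'match');
vals   = str2double(numtxt);

% regexprep() replaces matches: collapse runs of blanks into one space.
cleaned = regexprep(' 23.91  29.70', '\s+', ' ');
```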


Answer 2

per isakson 2020-8-17

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/580143-reading-mixed-format-data-from-txt-file-in-matlab#answer_480837

Edited: per isakson 2020-8-18

Here is an alternative

Running

>> out = cssm('cssm.txt')

outputs

out = 
  struct with fields:
    george: [3×11 double]
      sams: [3×10 double]
     peter: [3×7 double]
      jack: [3×10 double]
>> out.peter
ans =
        22.71         78.3        18.27         82.9        21.28        36.08         0.59
         21.6        73.83        17.03         84.3        20.11        39.14         0.51
          1.1         4.47         1.24         -1.4         1.17        -3.07         0.08
>> out.peter(2,4)
ans =
         84.3
         
>> out.sams(:,6:end)
ans =
         8.87         2.22          NaN          NaN          NaN
         5.96         2.71        21.76          8.3         2.62
         2.91        -0.49          NaN          NaN          NaN         

where

function    out = cssm( ffs )
    %%
    chr = fileread( ffs );
    %   Getting rid of carriage returns simplifies the following code.
    chr = strrep( chr, char(13), '' );  
    %   Split the text string, using a line that holds a single name as delimiter.
    %   Convert from class char to string, because I want to use strings.
    %   Allow the name to be a valid MATLAB variable name; allow trailing spaces (\x20).
    [ data, names ] = strsplit( string(chr), '(?m)^[a-zA-Z]\w+\x20*$' ...
                                , 'DelimiterType','RegularExpression' );
    %   Since the file starts with a delimiter (name) there will be a leading
    %   empty data block. Delete it.
    data(1) = [];
    %   The first character of the data blocks will be newline. Skip it. 
    %   Would "extractAfter(data,1)" be better?
    data = extractAfter( data, newline );
    %%
    %   The lines of a data block contain different numbers of columns. One way
    %   to cope with this is to add many empty columns and read the first fifteen
    %   columns. textscan() can handle too many but not too few (I thought).
    data = strrep( data, newline, ",,,,,,,,,,,,,,,"+newline );
    for jj = 1 : numel(names)
        % Read the first fifteen columns and skip the rest. Fifteen is a
        % magic number that I chose. Using both white-space and comma as
        % delimiters seems to work fine. However, I'm not sure whether the
        % documentation says so.
        num = textscan( data(jj), '%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%*[^\n]' ...
                                , 'Delimiter',{','}, 'Collectoutput',true );
        num = num{1};
        % Delete columns with only NaNs (and hope such columns cannot occur 
        % intentionally). 
        num( :, all(isnan(num),1) ) = [];
        % Assign the result of a block to an output structure.
        out(1).(names(jj)) = num;
    end
end
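If the magic number of fifteen columns is a concern, one alternative is to parse each line of a block separately and pad the shorter rows with NaN afterwards. The following is only a sketch of that idea, not a drop-in replacement for the function above; the sample block and variable names are assumptions.

```matlab
% Hypothetical data block with rows of different lengths.
block = sprintf(' 18.90  29.30  15.12\n 18.76  31.28  15.15  50.18  16.15\n  0.14  -1.98\n');

% Split the block into lines and convert every line to a row vector.
rows = regexp(strtrim(block), '\n', 'split');
vecs = cellfun(@(s) sscanf(s, '%f').', rows, 'UniformOutput', false);

% Allocate a NaN matrix as wide as the longest row, then fill in the rows.
width = max(cellfun(@numel, vecs));
num   = NaN(numel(vecs), width);
for k = 1:numel(vecs)
    num(k, 1:numel(vecs{k})) = vecs{k};
end
```

The width is taken from the data, so no column limit has to be chosen in advance, and no all-NaN columns need to be deleted afterwards.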

In response to comments

I've modified my function with this goal: it shall be straightforward to add code to handle new types of blocks without breaking the handling of existing types.

And I've added the text, "<missing>" to the jack block of the file cssm.txt. Now

>> out = cssm('cssm.txt');
>> out.sams(:,6:end)
ans =
         8.87         2.22          NaN          NaN          NaN
         5.96         2.71        21.76          8.3         2.62
         2.91        -0.49          NaN          NaN          NaN
>> out.jack
ans = 
    " 18.56  40.70  14.85  45.30  19.13  11.34   1.69  78.30  18.27  82.90
      19.12  26.06  15.30  47.38  16.90   5.71   2.96  <missing>
      -0.56  14.64  -0.45  -2.08   2.23   5.63  -1.27
     "
>> 

where

function    out = cssm( ffs )
    %%
    chr = fileread( ffs );
    %   Getting rid of carriage returns simplifies the following code.
    chr = strrep( chr, char(13), '' );  
    %   Split the text string, using a line that holds a single name as delimiter.
    %   Convert from class char to string, because I want to use strings.
    %   Allow the name to be a valid MATLAB variable name; allow trailing spaces (\x20).
    [ data, names ] = strsplit( string(chr), '(?m)^[a-zA-Z]\w+\x20*$' ...
        , 'DelimiterType','RegularExpression' );
    %   Since the file starts with a delimiter (name) there will be a leading
    %   empty data block. Delete it.
    data(1) = [];
    %   The first character of the data blocks will be newline. Skip it. 
    %   Would "extractAfter(data,1)" be better?
    data = extractAfter( data, newline );
    
    for jj = 1 : numel(names)
        
        if all( ismember( char(data(jj)), [newline,' +-.0123456789']' ) )
            block_type = "pure_numeric";
        else
            block_type = "unidentified";
        end
        
        switch block_type
            case "pure_numeric"
                block = pure_numeric_data_( data(jj) );
            otherwise
                block = unidentified_data_( data(jj) );
        end
        
        % Assign the result of a block to an output structure.
        out(1).(names(jj)) = block;
    end
end
function    num = pure_numeric_data_( data )        %
%%
%   The lines of a data block contain different numbers of columns. One way
%   to cope with this is to add many empty columns and read the first fifteen
%   columns. textscan() can handle too many but not too few (I thought).
%
%   Read the first fifteen columns and skip the rest. Fifteen is a
%   magic number that I chose. Using both white-space and comma as
%   delimiters seems to work fine. However, I'm not sure whether the
%   documentation says so.
    data = strrep( data, newline, ",,,,,,,,,,,,,,,"+newline );
    num = textscan( data, '%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%*[^\n]' ...
                        , 'Delimiter',{','}, 'Collectoutput',true );
    num = num{1};
    % Delete columns with only NaNs (and hope such columns cannot occur 
    % intentionally). 
    num( :, all(isnan(num),1) ) = [];
end
function    blk = unidentified_data_( data )        %
    blk = data;
end
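To illustrate how a new block type might be added to this kind of dispatch, here is a small standalone sketch. The block type name numeric_with_missing and the decision to map "<missing>" to NaN are hypothetical choices made for the example, not part of the function above, and the sketch assumes all rows of the block have the same number of items.

```matlab
% Hypothetical block containing the "<missing>" marker.
block = sprintf(' 18.56  40.70\n 19.12  <missing>\n -0.56  14.64\n');

% Classify the block; the middle branch is the newly added type.
if all(ismember(block, [newline ' +-.0123456789']))
    block_type = 'pure_numeric';
elseif contains(block, '<missing>')
    block_type = 'numeric_with_missing';
else
    block_type = 'unidentified';
end

switch block_type
    case 'numeric_with_missing'
        % Replace the marker by NaN, then convert token by token.
        cleaned = strrep(block, '<missing>', 'NaN');
        rows = regexp(strtrim(cleaned), '\n', 'split');
        vals = cellfun(@(s) str2double(strsplit(strtrim(s))), rows, ...
                       'UniformOutput', false);
        num = vertcat(vals{:});     % assumes rows of equal length
    otherwise
        num = block;                % leave other types untouched here
end
```

The existing branches are not modified; only one elseif and one case are added, which is the property the design above aims for.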

and with the functions folded, the code looks like this: [screenshot not preserved]

8 Comments (6 earlier comments not shown)

per isakson 2020-8-18

Edited: per isakson 2020-8-18

I added some kind of response to my answer. It's more about programming style.

In your case the statement

    [ data, names ] = strsplit( string(chr), '(?m)^[a-zA-Z]\w+\x20*$' ...
        , 'DelimiterType','RegularExpression' );

must be modified. Maybe it suffices to match lines that start with "#", using '(?m)^#.+$'. The output name, names, is then misleading.

The code block

if all( ismember( char(data(jj)), [newline,' +-.0123456789']' ) )
    block_type = "pure_numeric";
else
    block_type = "unidentified";
end

can be replaced by code that deduces the value of block_type, and some appropriate field names, from the now misnamed variable names.

Use profile() to decide whether the function is becoming too slow.

JAMMI ASHOK 2020-8-18

Thanks, per isakson.

Your last post worked.

Your code is very useful to me. I will try this change.
