Extract data from text file

Question

matlab noob 2019-4-29

https://ww2.mathworks.cn/matlabcentral/answers/459096-extract-data-from-text-file

Edited: Stephen23 2019-5-1

sample data.txt

I have this text file, 'sample data.txt', in which the data is not in the right form. I need to read the file, extract the data, and tabulate it in the order shown in the figure below. I am not sure how to do it.

I would really appreciate it if someone could guide me. Thank you.

2 Comments

Guillaume 2019-4-29

The format of your text file is dreadful! Has it been altered in any way from its original format? It would be much easier to parse if the column data were separated by a tab or comma character instead of spaces, and if the table header weren't split onto two lines within one of the column headers.

The screenshot that you show doesn't match the text file you've attached and therefore leaves some questions unanswered:

It would appear that the Delayed Gadolinium Enhancement column can have multiword entries (e.g. Full thickness). Can any other column also have multi word entries? If so, how can we identify which column a word belongs to?
The formatting of the text is not even consistent across the table. Sometimes you have < 50 (with a space), sometimes <50 (without a space) for that last column. Do you want the text as is, or normalised in the output? Even better in my opinion would be to convert to numbers, in that case should Full thickness be converted to 100?

Unfortunately, because of that awful formatting, you're going to have to write a parser for the file and make plenty of assumptions that may be invalid and cause the parsing to fail on future files. If you can get the same data in a more sensible format that would be better.
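
To illustrate the normalisation question raised above, here is a minimal sketch of what normalised text would look like if it were wanted. The sample values and the variable names `raw` and `tidy` are hypothetical; only `regexprep` is a real MATLAB function.

```matlab
% Hypothetical last-column entries in the styles described above.
raw = {'< 50%', '<50%', '> 50%', '50%', 'Full thickness', 'Nil'};

% Drop any whitespace between a leading < or > and the digits that follow.
% Entries without a leading comparison sign are left untouched.
tidy = regexprep(raw, '^([<>])\s+(?=\d)', '$1');

disp(tidy)
```

Keeping the text as is simply means skipping this step.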

matlab noob 2019-4-29

I really appreciate your reply to this question. In answer to the questions you asked:

It would appear that the Delayed Gadolinium Enhancement column can have multiword entries (e.g. Full thickness). Can any other column also have multi word entries? If so, how can we identify which column a word belongs to? Answer: The other columns can also have multi-word entries.
The formatting of the text is not even consistent across the table. Sometimes you have < 50 (with a space), sometimes <50 (without a space) for that last column. Do you want the text as is, or normalised in the output? Answer: I need the original text as it is; no conversion is wanted in my case.

Meanwhile, I am looking for something that can read the specific strings in between the data that I would like to extract. Could that idea be applied in this case?

Thank you.


Answer 1

Stephen23 2019-4-29

https://ww2.mathworks.cn/matlabcentral/answers/459096-extract-data-from-text-file#answer_372731

Edited: Stephen23 2019-4-29

sample data.txt

That is a very badly formatted file. For example, the field delimiters are space characters, yet space characters also occur within the fields, with no text delimiters to group the fields together. There is no robust general solution for parsing such a poorly formatted file. In some limited cases (such as with prior knowledge of the field contents) you might be able to parse it, but parsing such files will always be fragile. On that basis I assumed that the fields contain only the text, in the number and types, that you have shown, i.e. each line contains exactly:

1 or 2 words (starts with 'Basal' or 'Mid' or 'Apical', or constitutes 'Apex')
1 number
1 word
('Nil' or 'Present')
('Nil' or 'Present')
('Nil' or 'Full thickness' or a percentage)

This matches all of the seventeen rows in your example data file:

str = fileread('sample data.txt');
rgx = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)\s+(\d+)\s+([A-Z][a-z]+)',...
	'\s+(Nil|Present)\s+(Nil|Present)\s+(Nil|Full thickness|([<>]\s?)?\d+\%)'];
tkn = regexpi(str,rgx,'tokens');
tkn = vertcat(tkn{:})

Giving:

tkn = 
    'Basal Anterior'         '1'     'Hypokinetic'    'Nil'        'Nil'        '50%'           
    'Basal Anteroseptal'     '2'     'Dyskinetic'     'Present'    'Present'    'Full thickness'
    'Basal Inferoseptal'     '3'     'Hypokinetic'    'Present'    'Present'    '50%'           
    'Basal Inferior'         '4'     'Hypokinetic'    'Nil'        'Present'    '50%'           
    'Basal Inferolateral'    '5'     'Normal'         'Nil'        'Nil'        'Nil'           
    'Basal Anterolateral'    '6'     'Normal'         'Nil'        'Nil'        'Nil'           
    'Mid Anterior'           '7'     'Hypokinetic'    'Nil'        'Nil'        '<50%'          
    'Mid Anteroseptal'       '8'     'Dyskinetic'     'Present'    'Present'    'Full thickness'
    'Mid Inferoseptal'       '9'     'Akinetic'       'Present'    'Present'    'Full thickness'
    'Mid Inferior'           '10'    'Hypokinetic'    'Nil'        'Present'    '<50%'          
    'Mid Inferolateral'      '11'    'Normal'         'Nil'        'Nil'        'Nil'           
    'Mid Anterolateral'      '12'    'Normal'         'Nil'        'Nil'        '<50%'          
    'Apical Anterior'        '13'    'Akinetic'       'Nil'        'Nil'        '50%'           
    'Apical Septal'          '14'    'Akinetic'       'Nil'        'Nil'        '< 50%'         
    'Apical Inferior'        '15'    'Akinetic'       'Nil'        'Nil'        '> 50%'         
    'Apical Lateral'         '16'    'Hypokinetic'    'Nil'        'Nil'        'Full thickness'
    'Apex'                   '17'    'Akinetic'       'Nil'        'Nil'        'Full thickness'
>> size(tkn)
ans =
    17     6
>>     

Clearly you can put that into a table if you really want to:

>> hdr = {'LeftVentricularSegments','No','WallMotion','PerfusionAtRest','PerfusionAtStress','DelayedGadoliniumEnhancement'};
>> T = cell2table(tkn,'VariableNames',hdr)
T = 
    LeftVentricularSegments     No      WallMotion      PerfusionAtRest    PerfusionAtStress    DelayedGadoliniumEnhancement
    _______________________    ____    _____________    _______________    _________________    ____________________________
    'Basal Anterior'           '1'     'Hypokinetic'    'Nil'              'Nil'                '50%'                       
    'Basal Anteroseptal'       '2'     'Dyskinetic'     'Present'          'Present'            'Full thickness'            
    'Basal Inferoseptal'       '3'     'Hypokinetic'    'Present'          'Present'            '50%'                       
    'Basal Inferior'           '4'     'Hypokinetic'    'Nil'              'Present'            '50%'                       
    'Basal Inferolateral'      '5'     'Normal'         'Nil'              'Nil'                'Nil'                       
    'Basal Anterolateral'      '6'     'Normal'         'Nil'              'Nil'                'Nil'                       
    'Mid Anterior'             '7'     'Hypokinetic'    'Nil'              'Nil'                '<50%'                      
    'Mid Anteroseptal'         '8'     'Dyskinetic'     'Present'          'Present'            'Full thickness'            
    'Mid Inferoseptal'         '9'     'Akinetic'       'Present'          'Present'            'Full thickness'            
    'Mid Inferior'             '10'    'Hypokinetic'    'Nil'              'Present'            '<50%'                      
    'Mid Inferolateral'        '11'    'Normal'         'Nil'              'Nil'                'Nil'                       
    'Mid Anterolateral'        '12'    'Normal'         'Nil'              'Nil'                '<50%'                      
    'Apical Anterior'          '13'    'Akinetic'       'Nil'              'Nil'                '50%'                       
    'Apical Septal'            '14'    'Akinetic'       'Nil'              'Nil'                '< 50%'                     
    'Apical Inferior'          '15'    'Akinetic'       'Nil'              'Nil'                '> 50%'                     
    'Apical Lateral'           '16'    'Hypokinetic'    'Nil'              'Nil'                'Full thickness'            
    'Apex'                     '17'    'Akinetic'       'Nil'              'Nil'                'Full thickness' 
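
Every token comes back as text, so the No column above holds character vectors rather than numbers. If numeric segment numbers are wanted, a minimal sketch of the conversion follows. The two rows are copied from the output above, and the variable name `num` is hypothetical.

```matlab
% Two rows copied from the output above; in practice tkn comes from regexpi.
tkn = {'Basal Anterior', '1',  'Hypokinetic', 'Nil', 'Nil', '50%'; ...
       'Apex',           '17', 'Akinetic',    'Nil', 'Nil', 'Full thickness'};

% Convert the second column to doubles; non-numeric text would become NaN.
num = str2double(tkn(:,2));

% Fail early if any segment number did not parse.
assert(~any(isnan(num)), 'Non-numeric segment number found.');
```

The same call should work on the table column, e.g. `T.No = str2double(T.No)`.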

12 Comments (10 older comments hidden)

matlab noob 2019-5-1

Edited: matlab noob 2019-5-1

% capture next line 
nl = '[\r\n]+';
% read the text file
file = fileread(a);
expression = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)',... % extract all string begin with 'Apex', 'Basal', 'Mid', 'Apical'
              nl,'(\d+)',... % number after the LVsegments
              nl, '([A-Z][a-z]+\s?[a-z]+)',...% Wall motion 
              nl,'(([<>]\s?)?\d+\%|(\d+\%)|([<>]\s?)?\d+\s?\%|[A-Z]+\s?[-][A-Z]+|[A-Z][a-z]+\s?[a-z]+)' % Delayed Gadolinium Enhancement
              ]; 
str = regexpi(file, expression, 'tokens');
str = vertcat(str{:});
% Insert header for each data extracted
header = {'Left_Ventricular_Segments','No','Wall_Motion','Delayed_Gadolinium_Enhancement'};
% Data tabulation
table = cell2table(str,'VariableNames', header)

This is the code that I've written to read all my text files (100+), but I've run into a problem.

As mentioned before, the text files do not have a consistent arrangement of data.

Most of the text files consist of

"Left_Ventricular_Segments" "No" "Wall_Motion" "Delayed_Gadolinium_Enhancement"

However, a few of the text files contain one extra column:

"Left_Ventricular_Segments" "No" "Wall_Motion" "Perfusion Defect At Stress" "Delayed_Gadolinium_Enhancement"

I've read that there are conditional expressions, (?(cond)expr) and (?(cond)expr1|expr2). Are they applicable in my situation? I'm still struggling to work out how to use them.

Or is there a smarter way to include this condition in the code? Else I will go the dumb way and add another line for this purpose, as below. Thank you.

% capture next line 
nl = '[\r\n]+';
expression = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)',... % extract all string begin with 'Apex', 'Basal', 'Mid', 'Apical'
              nl,'(\d+)',... % number after the LVsegments
              nl, '([A-Z][a-z]+\s?[a-z]+)',...% Wall motion 
              nl, '([A-Z][a-z]+\s?[a-z]+)',...% Perfusion Defect At Stress
              nl,'(([<>]\s?)?\d+\%|(\d+\%)|([<>]\s?)?\d+\s?\%|[A-Z]+\s?[-][A-Z]+|[A-Z][a-z]+\s?[a-z]+)' % Delayed Gadolinium Enhancement
              ]; 
table = 
    Left_Ventricular_Segments     No      Wall_Motion     Perfusion_Defect_At_Stress    Delayed_Gadolinium_Enhancement
    'Basal Anterior'             '1'     'Hypokinetic'    'Nil'                         'Nil'                         
   ...
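
One possible alternative to a conditional expression is to look for the extra header name first and then pick between the two patterns. The sketch below assumes that files with the extra column name it "Perfusion Defect At Stress" somewhere in their header, as described above. The sample text, the simplified sub-patterns, and the variable names are hypothetical, and the pattern for the last column is narrower than the one in the code above.

```matlab
% Hypothetical sample: one field per line, four columns (no extra column).
txt = sprintf('%s\n', 'Left Ventricular Segments', 'No', 'Wall Motion', ...
    'Delayed Gadolinium Enhancement', ...
    'Basal Anterior', '1', 'Hypokinetic', 'Nil', ...
    'Apex', '17', 'Akinetic', 'Full thickness');

nl  = '[\r\n]+';                                         % line break(s)
seg = '(Apex|(?:Basal|Mid|Apical) [A-Z][a-z]+)';         % segment name
num = '(\d+)';                                           % segment number
wrd = '([A-Z][a-z]+(?: [a-z]+)?)';                       % one or two words
dge = '((?:[<>]\s?)?\d+\s?\%|[A-Z][a-z]+(?: [a-z]+)?)';  % percentage or words

% Assumption: the extra column is announced by its header name.
hasExtra = ~isempty(regexpi(txt, 'Perfusion\s+Defect\s+At\s+Stress', 'once'));
if hasExtra
    expression = [seg, nl, num, nl, wrd, nl, wrd, nl, dge];  % five fields
else
    expression = [seg, nl, num, nl, wrd, nl, dge];           % four fields
end

tok = regexp(txt, expression, 'tokens');
tok = vertcat(tok{:});   % one row per record
```

Non-capturing groups `(?:...)` keep the token count equal to the number of columns, and a literal space inside the word patterns stops a field from running over a line break.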

Stephen23 2019-5-1

Edited: Stephen23 2019-5-1

This code reads your three later files, where each field is on its own line.

The code relies on one main assumption: that the header name "No" appears by itself on one line, which is used to anchor and identify the block of data that you are looking for. The other lines are simply contiguous with that header name. It also uses the "No" field values to identify the number of fields: this requires that only the "No" fields contain numeric values.

R = '([^\n]+\n)*No(\n[^\n]+)+'; % regular expression, contiguous around "No".
S = dir('sample*.txt');
N = numel(S);
C = cell(1,N);
for k = 1:N
	str = fileread(S(k).name);
	str = regexprep(str,'\r\n','\n'); % replace Windows newlines.
	M = regexp(str,R,'match','once'); % match lines of text file.
	P = regexp(M,'\n','split');       % split lines into cell array.
	V = str2double(P);                % convert lines into numbers.
	D = mean(diff(find(~isnan(V))));  % spacing of non-NaN (i.e. "No") lines = number of fields.
	H = regexprep(P(1:D),'\s+','_');  % get header lines.
	X = strcmpi(P{D+1},'Enhancement');   % identify superfluous header.
	A = reshape(P(1+X+D:end),D,[]).';    % get data lines.
	T = cell2table(A,'VariableNames',H); % convert data + header into table.
	C{k} = T;
end
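
To see how the loop above infers the number of fields, it may help to trace the same steps on a tiny example. The cell array below is a hypothetical miniature of the split lines `P`, with three fields per record and no superfluous header line. The variable `idx` is introduced only for illustration.

```matlab
% Hypothetical miniature "file", already split into lines (P in the loop above).
P = {'Left Ventricular Segments', 'No', 'Wall Motion', ...
     'Basal Anterior',            '1',  'Normal', ...
     'Basal Anteroseptal',        '2',  'Normal'};

V   = str2double(P);        % NaN for every line except the "No" values
idx = find(~isnan(V));      % positions of the numeric lines
D   = mean(diff(idx));      % spacing between them = fields per record

H = regexprep(P(1:D), '\s+', '_');   % the first D lines are the header names
A = reshape(P(D+1:end), D, []).';    % remaining lines, one record per row
```

Here the numeric lines sit at positions 5 and 8, so `D` is 3 and `A` is a 2-by-3 cell array. At least two numeric lines are needed, since the `diff` of a single position is empty.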

Giving:

>> C{:}
ans = 
    Left_Ventricular_Segments     No     Perfusion_defect_at_rest    Perfusion_defect_at_stress     Wall_Motion     Delayed_Gadolinium
    _________________________    ____    ________________________    __________________________    _____________    __________________
    'Basal Anterior'             '1'     'Nil'                       'Nil'                         'Normal'         'Mid wall'        
    'Basal Anteroseptal'         '2'     'Nil'                       'Nil'                         'Normal'         'Mid wall'        
    'Basal Inferoseptal'         '3'     'Nil'                       'Nil'                         'Normal'         'Mid wall'        
    'Basal Inferior'             '4'     'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Basal Inferolateral'        '5'     'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Basal Anterolateral'        '6'     'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Mid Anterior'               '7'     'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Mid Anteroseptal'           '8'     'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Mid Inferoseptal'           '9'     'Present'                   'Present'                     'Hypokinetic'    'Full thickness'  
    'Mid Inferior'               '10'    'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Mid Inferolateral'          '11'    'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Mid Anterolateral'          '12'    'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Apical Anterior'            '13'    'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Apical Septal'              '14'    'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Apical Inferior'            '15'    'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Apical Lateral'             '16'    'Nil'                       'Nil'                         'Normal'         '<50%'            
    'Apex'                       '17'    'Nil'                       'Nil'                         'Dystkinetic'    '<50%'            
ans = 
    Left_Ventricular_Segments     No     Wall_Motion    Perfusion_At_Rest    Perfusion_At_Stress    Delayed_Gadolinium
    _________________________    ____    ___________    _________________    ___________________    __________________
    'Basal Anterior'             '1'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Basal Anteroseptal'         '2'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Basal Inferoseptal'         '3'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Basal Inferior'             '4'     'Normal'       'Nil'                'Nil'                  '50% (mid wall)'  
    'Basal Inferolateral'        '5'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Basal Anterolateral'        '6'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Anterior'               '7'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Anteroseptal'           '8'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Inferoseptal'           '9'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Inferior'               '10'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Inferolateral'          '11'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Anterolateral'          '12'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apical Anterior'            '13'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apical Septal'              '14'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apical Inferior'            '15'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apical Lateral'             '16'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apex'                       '17'    'Normal'       'Nil'                'Nil'                  'Nil'             
ans = 
    Left_Ventricular_Segments     No      Wall_Motion     Perfusion_Defect_At_Stress    Delayed_Gadolinium
    _________________________    ____    _____________    __________________________    __________________
    'Basal Anterior'             '1'     'Hypokinetic'    'Nil'                         'Nil'             
    'Basal Anteroseptal'         '2'     'Hypokinetic'    'Nil'                         'Mid wall'        
    'Basal Inferoseptal'         '3'     'Hypokinetic'    'Nil'                         'Mid wall'        
    'Basal Inferior'             '4'     'Hypokinetic'    'Nil'                         'Nil'             
    'Basal Inferolateral'        '5'     'Hypokinetic'    'Nil'                         'Nil'             
    'Basal Anterolateral'        '6'     'Hypokinetic'    'Nil'                         'Nil'             
    'Mid Anterior'               '7'     'Hypokinetic'    'Nil'                         '50%'             
    'Mid Anteroseptal'           '8'     'Hypokinetic'    'Nil'                         'Mid wall'        
    'Mid Inferoseptal'           '9'     'Hypokinetic'    'Nil'                         'Mid wall'        
    'Mid Inferior'               '10'    'Hypokinetic'    'Possibly'                    'Nil'             
    'Mid Inferolateral'          '11'    'Hypokinetic'    'Nil'                         'Nil'             
    'Mid Anterolateral'          '12'    'Hypokinetic'    'Nil'                         'Nil'             
    'Apical Anterior'            '13'    'Akinetic'       'Nil'                         '50%'             
    'Apical Septal'              '14'    'Hypokinetic'    'Nil'                         '50%'             
    'Apical Inferior'            '15'    'Hypokinetic'    'Nil'                         '50%'             
    'Apical Lateral'             '16'    'Hypokinetic'    'Nil'                         '< 50%'           
    'Apex'                       '17'    'Dyskinetic'     'Nil'                         '50%'             
>> 

matlab noob 2019-5-1

Thank you so much for your help and explanation. I think I understand your concept. However, I'll need some time to understand the code. Once again, thank you!


Answer 2

KSSV 2019-4-29

https://ww2.mathworks.cn/matlabcentral/answers/459096-extract-data-from-text-file#answer_372728


T = readtable(myfile)

2 Comments

Guillaume 2019-4-29

There is no way that readtable can cope with the sample file supplied by the OP.

matlab noob 2019-4-29

I appreciate the reply. Thanks!
