Extracting Data field of a Series in HTML file

Question

b 2020-4-7

https://ww2.mathworks.cn/matlabcentral/answers/515926-extracting-data-field-of-a-series-in-html-file

Commented: b 2020-5-10

In an HTML file, there is a section like this:

        series: [{
            name: 'Numbers',
            color: '#33CCFF',
            lineWidth: 5,
            data: [45,78,84,91,111,125,178,231,274,283,303,333]        }],

How can I extract the 'data' field into an array in MATLAB code?

There are many such series in the same HTML file, each with a different 'name' field, for example 'Total Value', 'Log Scale', 'Base Value', etc.

4 comments (2 older comments not shown)

Mohammad Sami 2020-4-7

Are you parsing the HTML in MATLAB as a char array? regexp works on string, cellstr, or char data.

You can easily change the pattern to name: \'Numbers\'
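
For illustration, a minimal sketch of a pattern that is tied to one series name. The sample text, the variable names, and the pattern itself are assumptions made up for this example; they are not the pattern from the earlier comments.

```matlab
% Sample text standing in for the HTML source (hypothetical, two series).
chr = [ 'series: [{ name: ''Total Value'', data: [1,2,3] }],' ...
        'series: [{ name: ''Numbers'', data: [45,78,84] }],' ];

name = 'Numbers';   % the series to pick

% Escape the name in case it contains characters that regexp treats specially.
esc = regexptranslate( 'escape', name );

% After name: 'Numbers', skip anything that is not "]" (so the match cannot
% run into the next series), then capture the text between "[" and "]".
pat = [ 'name:\s*''', esc, '''[^\]]*?data:\s*\[([^\]]*)\]' ];
tok = regexp( chr, pat, 'tokens', 'once' );

% Convert the comma-separated text to a numeric row vector.
data = str2double( strsplit( tok{1}, ',' ) );
disp( data )
```

Excluding "]" between the name and the data field keeps the match inside one block, on the assumption that name: comes before data: within each series.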

b 2020-4-7

I am trying to do the following:

url="c:\finCase\case1.html";
code=webread(url);
tree=htmlTree(code);
selector="series";
subtrees=findElement(tree,selector);

The subtrees variable is empty, whereas it should contain all the series corresponding to the various names ('Numbers', 'Total Value', 'Log Scale', 'Base Value', etc.).
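
A possible explanation, stated as an assumption because the full file is not shown: the series: block looks like script text (chart configuration) rather than an HTML element named series, so an element selector has nothing to match. A minimal sketch that treats the content as plain text instead; the sample text is made up for this example.

```matlab
% Hypothetical sample standing in for the file content. For a local file,
% fileread would return the same kind of char vector.
code = sprintf( '%s\n', ...
    '<script>', ...
    'series: [{', ...
    '    name: ''Numbers'',', ...
    '    data: [45,78,84]        }],', ...
    '</script>' );

% Locate every occurrence of "series:" in the text. These are not tags,
% so they have to be found by text search, not by an element selector.
starts = regexp( code, 'series\s*:', 'start' );
fprintf( 'Found %d series block(s)\n', numel( starts ) );
```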

Answer 1

per isakson 2020-4-7

https://ww2.mathworks.cn/matlabcentral/answers/515926-extracting-data-field-of-a-series-in-html-file#answer_424483

Edited: per isakson 2020-5-9

cssm.txt

I misunderstood your question. This is a bit of overkill.

Assumptions

the string, series:, always indicates the start of a block of interest

I created a sample file, cssm.txt, which I uploaded. (MATLAB Answers doesn't allow the extension .html.)

This script reads all blocks:

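%% Read the file and cut out every block that follows "series:"
chr = fileread('cssm.txt');
% the lookbehind skips "series:"; the match runs up to and including "}],"
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
%% Parse the fields of each block into a struct array
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] ); 
for jj = 1 : len
    
    % name: text after "name:" up to the next comma, quotes removed.
    % strtrim keeps names such as 'Total Value' unchanged, so that they
    % can be matched with strcmp later.
    txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).name = strtrim( txt );
    % color: same approach as for name
    txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).color = strtrim( txt );
    
    % lineWidth: a single number
    txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
    series(jj).lineWidth = str2double( txt );
    
    % data: the bracketed list up to the closing "}"
    txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
    series(jj).data = str2num( txt );  %#ok<ST2NM>
end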

and this extracts the data of the series whose name is 'Numbers', not the other series:

>> series(strcmp({series.name},'Numbers')).data
ans =
    45    78    84    91   111   125   178   231   274   283   303   333
    
    

In response to the comment below

Assumptions

the string, series:, always indicates the start of a block of interest
the string, }], indicates the end of a block of interest
all html-files of interest are named index.html
all files named index.html are of interest
all html-files of interest are in subfolders under a root-folder, ...\finCase
every html-file, index.html, contains exactly one block that has a specific value of the field name:, e.g. Numbers
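
Under these assumptions, the name of the subfolder that holds each index.html serves as the client label. A small sketch of that step; the folder path is made up for this example, and the fileparts variant is shown only for comparison.

```matlab
% Hypothetical folder of one client below the root folder.
folder = fullfile( 'finCase', 'paul-judd' );

% Split the path at the file separator and take the last part.
cac    = strsplit( folder, filesep );
client = cac{end};

% fileparts gives the same result here, but it would cut a folder name
% at its last dot, because it treats what follows as a file extension.
[ ~, client2 ] = fileparts( folder );

fprintf( '%s | %s\n', client, client2 );
```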

The overkill is still there. However, reading and parsing four HTML files (copies of cssm.txt) takes less than 10 ms.

Try

>> client_data = read_client_data( 'd:\m\cssm\finCase', 'index.html', 'Numbers' )
client_data =
  4×2 cell array
    {'anderson'       }    {1×9  double}
    {'kim-j-clijsters'}    {1×10 double}
    {'paul-judd'      }    {1×11 double}
    {'simmi'          }    {1×12 double}
>> 

where (in one m-file)

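function    client_data = read_client_data( root, file, name )
    % Return an N-by-2 cell array holding, for every FILE found below the
    % folder ROOT, the client folder name and the data of the series NAME.
    
    sad = dir( fullfile( root, '**', file ) );   % recursive search
    len = length( sad );
    client_data = cell( len, 2 );
    for jj = 1 : len 
        % the client name is the name of the folder that holds the file
        cac = strsplit( sad(jj).folder, filesep );
        client = cac{end};
        series = read_one_file_( fullfile( sad(jj).folder, sad(jj).name ) );
        % assumes that exactly one series in the file matches NAME
        client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
    end
end
function    series = read_one_file_( file )
    % Parse every "series:" block of FILE into a struct array.
    
    chr = fileread( fullfile( file ) );
    % the lookbehind skips "series:"; the match runs up to and including "}],"
    cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
    
    len = length( cac );
    series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
    
    for jj = 1 : len
        
        % name: text after "name:" up to the next comma, quotes removed
        txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).name = strtrim( txt );
        
        % color: same approach as for name
        txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).color = txt;
        
        % lineWidth: a single number
        txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
        series(jj).lineWidth = str2double( txt );
        
        % data: the bracketed list up to the closing "}"
        txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
        series(jj).data = str2num( txt );  %#ok<ST2NM>
        
    end
end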

TODO: add error handling and comments
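
As a starting point for the error handling, a minimal sketch of one check: that exactly one series matches the requested name before its data is used. The sample struct array stands in for the output of read_one_file_, and the error identifier is made up for this example.

```matlab
% Hypothetical sample standing in for the output of read_one_file_.
series = struct( 'name', { 'Numbers', 'Total Value' }, ...
                 'data', { [45 78 84], [1 2 3] } );
name   = 'Numbers';

% Logical index of the series whose name matches exactly.
is_match = strcmp( { series.name }, name );

% Fail with a clear message instead of an indexing error when the
% "exactly one block" assumption does not hold.
if nnz( is_match ) ~= 1
    error( 'read_client_data:nameCount', ...
           'Expected exactly one series named "%s", found %d.', ...
           name, nnz( is_match ) );
end

data = series(is_match).data;
disp( data )
```

Inside read_client_data, a check like this would sit just before the line that fills client_data(jj,:).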

10 comments (8 older comments not shown)

per isakson 2020-5-9

A nice thing about standards is that there are so many to choose from. Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database [Wikipedia]. However, MATLAB doesn't honor null.

Replace the statement

series(jj).data = str2num( txt ); %#ok<ST2NM>

by

out = textscan( txt             , '%f'      ...
            ,   'CollectOutput' , true      ...  
            ,   'Delimiter'     , ','       ...
            ,   'EmptyValue'    , 0         ...      
            ,   'TreatAsEmpty'  , 'null'    ...
            ,   'Whitespace'    , ' \t[]'   );
series(jj).data = reshape( out{:}, 1,[] );

and read about textscan in the documentation.
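
As a cross-check that does not use textscan, here is a small sketch of the same null-to-zero mapping with str2double, which returns NaN for text that is not a number. The sample text is made up for this example, and this is a swapped-in alternative, not the replacement given above.

```matlab
% Hypothetical text as captured after "data:", with one null entry.
txt = ' [45,null,84,91]        ';

% Keep only what lies between the square brackets.
body = regexp( txt, '(?<=\[)[^\]]*(?=\])', 'match', 'once' );

% Convert each comma-separated item; 'null' is not a number and becomes NaN.
vals = str2double( strsplit( body, ',' ) );

% Map the missing entries to 0, as EmptyValue does in the textscan call.
vals( isnan( vals ) ) = 0;
disp( vals )
```

Whether 0 is the right stand-in for a missing value depends on the data; leaving the NaN in place keeps missing entries distinguishable from real zeros.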

b 2020-5-10

LOL on the tragedy of being Null.

The code section works nicely with output as needed.

Indebted once again.
