How to extract data from an HTML table?

Question


Hi,

I want to read an HTML page and extract some information from it. However, when I use webread and then htmlTree, part of the HTML data is missing and I don't know why.

Example:

Using this URL:

http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152

I would like to get the rows or columns for the SMILES and InChI fields. However, the code below does not return this information. I have tried different selectors, and I wonder whether the data is dynamically generated.

url = "http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152";
html = webread(url);                      % download the page source as text
tree = htmlTree(html);                    % parse the HTML (Text Analytics Toolbox)
selector = "td";
subtrees = findElement(tree, selector);   % all table data cells
str = extractHTMLText(subtrees);
table_data = str(1:end);

Thank you,

Alan


Answer 1

Jonas 2023-2-2


Without digging deeper into the HTML, we can use a plain text search:

d=webread('http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152',weboptions('Timeout',15));
SMILESfirstTry=extractBetween(d,'<th class="inf">SMILES</th>','</td>','Boundaries','exclusive');
SMILESsecondTry=extractAfter(SMILESfirstTry{1},'<td colspan="4">')
SMILESsecondTry = 'c1c(ccc(c1)/C=C/C(=O)O)O'

The same can be done for the other fields.
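A rough sketch of how that text search could be looped over several fields. The sample string here is a made-up stand-in that only mimics the `<th class="inf">` markup quoted above, and the assumption that every field is followed by a `<td ...>` cell is untested against the real page:

```matlab
% Hypothetical sample mimicking the page markup; for the real page use
%   d = string(webread(url, weboptions('Timeout', 15)));
d = "<tr><th class=""inf"">InChIKey</th><td colspan=""4"">NGSWKAQJJWESNS-ZZXKWVIFSA-N</td></tr>" + ...
    "<tr><th class=""inf"">SMILES</th><td colspan=""4"">c1c(ccc(c1)/C=C/C(=O)O)O</td></tr>";

fields = ["InChIKey" "SMILES"];
values = strings(size(fields));
for k = 1:numel(fields)
    % Text between this field's header cell and the end of the next data cell
    startPat = "<th class=""inf"">" + fields(k) + "</th>";
    chunk = extractBetween(d, startPat, "</td>");
    if ~isempty(chunk)
        % Drop the opening <td ...> tag, whatever its attributes are
        values(k) = strtrim(extractAfter(chunk(1), ">"));
    end
end
```

Fields that are not found keep an empty string in `values`, which makes a missing field easy to spot.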

Similarly, using a bit more of the HTML tooling:

tree = htmlTree(d);
selector = "tr";
subtrees = findElement(tree,selector);
str = extractHTMLText(subtrees);
searchTags = {'InChIKey' 'InChICode' 'SMILES'};
location = contains(str,searchTags);
rawEntries = str(location)
rawEntries = 3×1 string array
    "InChIKey  NGSWKAQJJWESNS-ZZXKWVIFSA-N"
    "InChICode  InChI=1S/C9H8O3/c10-8-4-1-7(2-5-8)3-6-9(11)12/h1-6,10H,(H,11,12)/b6-3+"
    "SMILES  c1c(ccc(c1)/C=C/C(=O)O)O"
extractAfter(rawEntries,'  ')
ans = 3×1 string array
    "NGSWKAQJJWESNS-ZZXKWVIFSA-N"
    "InChI=1S/C9H8O3/c10-8-4-1-7(2-5-8)3-6-9(11)12/h1-6,10H,(H,11,12)/b6-3+"
    "c1c(ccc(c1)/C=C/C(=O)O)O"
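If you want to look values up by field name rather than by position, the rows could also be split into a small table. This is only a sketch: it starts from the `rawEntries` output shown above, and the variable names `Field` and `Value` are my own choice:

```matlab
% rawEntries as displayed above (label, two spaces, value)
rawEntries = ["InChIKey  NGSWKAQJJWESNS-ZZXKWVIFSA-N"; ...
              "InChICode  InChI=1S/C9H8O3/c10-8-4-1-7(2-5-8)3-6-9(11)12/h1-6,10H,(H,11,12)/b6-3+"; ...
              "SMILES  c1c(ccc(c1)/C=C/C(=O)O)O"];

% Split each row at the first double space into label and value
names  = extractBefore(rawEntries, "  ");
values = strtrim(extractAfter(rawEntries, "  "));

% Collect both columns in a table so a field can be selected by name
result = table(names, values, 'VariableNames', {'Field', 'Value'});
smiles = result.Value(result.Field == "SMILES");
```

This relies on the label and value being separated by two spaces in the extracted text, as in the output above.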


Alan Cesar Pilon Miro 2023-2-3

Hi Jonas,

Thank you! The first method worked very well.

Just to mention: I had some difficulties with the second approach, as I could not find the objects.

Jonas 2023-2-6

Thanks for your reply. Make sure that the data returned from webread is not empty: the website seems to be quite slow, and sometimes the returned data is empty. Increasing the timeout limit further may help here.
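One way to guard against an empty response is to retry with a longer timeout each time. A minimal sketch; the timeout values are arbitrary guesses, not something I tuned for this site:

```matlab
url = "http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152";
timeouts = [15 30 60];   % seconds, illustrative values only
d = '';
for t = timeouts
    try
        d = webread(url, weboptions('Timeout', t));
    catch err
        % Timeouts and connection errors end up here; try the next, longer timeout
        warning('Attempt with Timeout=%d failed: %s', t, err.message);
    end
    if ~isempty(strtrim(d))
        break   % non-empty page source received
    end
end
if isempty(strtrim(d))
    warning('No data received from %s; htmlTree would have nothing to parse.', url);
end
```

Only pass `d` on to htmlTree or extractBetween once it is non-empty.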
