How do I read the text between href tags and return the results in a cell array?
Currently, I have an html webpage saved in a text format. Below is an example of the portion of the text I am interested in:
<a href='/some/1056-text-stuff'>
I want to search the text document for every occurrence of the "<a href='/some/" pattern and extract the text between the quotes, i.e.
/some/1056-text-stuff
Matlab has regexp, match and tags but I am struggling to pick out the string cleanly. Ideally, I would like to search the document and return a cell array of strings which lists all of the matches. Here is my current code:
str= fileread('C:\Users\Me\Documents\MATLAB\trial.txt'); %read in text file
urls = regexp(str, 'href=(\S+)(\s*)$', 'tokens', 'lineAnchors'); %find urls
Accepted Answer
Julian
2016-6-17
You can try something like this, where html holds the page text (e.g. the output of fileread):
>> RE='<a[\s]+href="(?<target>.*?)"[^>]*>(?<text>.*?)</a>';
>> list=regexp(html, RE, 'names')
I can recommend this tool https://www.regexbuddy.com/
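Note that the pattern above expects double-quoted href values, while the sample in the question uses single quotes. A minimal sketch of one way to handle both quote styles and get a plain cell array of strings is shown here. The sample text, the variable names (pattern, tokens, urls), and the '/some/' filter are assumptions for illustration; in practice str would come from fileread as in the question.

```matlab
% Hypothetical sample standing in for the text returned by fileread(...).
str = sprintf('%s\n%s\n', ...
    '<p><a href=''/some/1056-text-stuff''>First</a></p>', ...
    '<p><a class="x" href="/some/2000-other">Second</a></p>');

% <a, whitespace, any other attributes, then href= followed by either
% quote character. Token 1 is the opening quote, token 2 is the URL,
% and \1 requires the same quote character to close the value.
pattern = '<a\s[^>]*?href\s*=\s*([''"])(.*?)\1';

% 'tokens' returns one cell per match, each holding {quote, url}.
tokens = regexp(str, pattern, 'tokens');

% Keep only the URL from each match, giving a 1-by-N cell array of strings.
urls = cellfun(@(t) t{2}, tokens, 'UniformOutput', false);

% Optional: keep only links that start with '/some/' (assumed prefix).
urls = urls(strncmp(urls, '/some/', 6));
```

With the 'names' option used in the answer, regexp instead returns a struct array whose fields are the named tokens, so (?<target>...) becomes the field target (the link address) and (?<text>...) becomes the field text (the visible link text). Regular expressions are fragile on irregular HTML, so treat this as a sketch for simple, well-formed pages.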
2 Comments
Ana Alonso
2019-12-17
Hi there,
What do the (?<target>.*?) and (?<text>.*?) expressions correspond to?
I've never worked with html before and I'm just trying to scrape urls from the html code.
Thanks!