Matlab extract url from html source
16 次查看(过去 30 天)
显示 更早的评论
Hi, I am trying to extract all urls from a HTML source code. I used strfind command to find "http" as the starting of url and ".html", ".php" , ".png" as the end of the url. After that i join the starting and the ending to form a complete URL
But this give very bad result because it usually mix up.
I want to ask if there is any easier way to do this?
I'm thinking about searching for a pattern, a single command to give all urls that start with http:// and end with .html , .php, or .png
In the html source code, there are some other url extension, but i want to ignore all of them.
Thank you very much for any help
0 个评论
回答(3 个)
Jason Ross
2012-5-2
I would do this using a series of regular expressions. Take a look at "Parsing Strings with Regular Expressions" on the following page for an example. It uses email addresses, but doing it for a URL is very similar since you know how it starts and ends, and you care about what's in between.
1 个评论
Walter Roberson
2015-2-28
Walter Roberson
2012-5-2
regexp(TheString, 'http://.*?\.(html|php|png)')
However, this cannot notice that (say) http://mathworks.com/scripts.htmlx/logo.png should extend to the .png instead of just to the .html . In order to be able to determine that you have reached the end of the URI, you need to know the list of characters which terminate URI in your context. Taking into account that sloppy pages often send URI with embedded blanks, which is syntactically invalid...
0 个评论
Abhisar Ekka
2021-2-13
You can run this piece of code and it works.
html = webread("<----paste your url here ---->");
hyperlinks = regexp(html,'https?://[^"]+','match')'
Inside webread, paste your url. Webread does the work of reading & parsing the html code . And upon using regexp which matches regular expression we get all kinds of http and https links in the url.
1 个评论
Gobert
2021-6-13
How can one check each html code to find emails? For example, see below: How to make this code work?
html = webread("https://edition.cnn.com");
hyperlinks = regexp(html,'https?://[^"]+','match')';
rgx ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
emails = regexpi(hyperlinks,rgx,'match')';
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Adding custom doc 的更多信息
产品
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!