How to select specific urls in a webpage with regexp?

Hi all,
I'm doing some web scraping from this website. I need to extract the tractor links, which appear in many lines similar to the following one:
<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>
So after the link there is a string matching '\d* hp'. Here is the code I use to detect them:
url='http://www.tractordata.com/farm-tractors/tractor-brands/johndeere/johndeere-tractors.html';
html=urlread(url);
hyperlinks = regexp(html,'(?<=<tr><td.*>)<a.*?/a>(?=.*{8,50}\d* hp</td>)','match');
This code works rather well, but I'm not able to get rid of the first result, which is wrong:
<a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>
<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a>
As you can see, the match starts one row above the link that should be selected. How can I solve this? Thanks.
1 Comment
Michael Dombrowski 2017-6-29
When I run your code I get no results in hyperlinks. But, have you thought of adding "farm-tractors" into your regex? It would resolve your issue, and as long as all the links also go to the farm-tractors directory it would work fine.

Accepted Answer

Guillaume 2017-6-29
Edited: Guillaume 2017-6-29
Note: avoid the greedy .*, particularly in complex expressions; it is bound to cause you problems. Negated character classes often work better. For example, instead of <td.*>, use <td[^>]*>.
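To illustrate the difference, here is a minimal sketch on a made-up table row (the string `row` and its `class="a"` attribute are assumptions for the example, not taken from the page):

```matlab
% Hypothetical single table row, for illustration only.
row = '<tr><td class="a"><a href="x.html">20A</a></td><td>21 hp</td></tr>';

% Greedy: .* runs to the end of the string, then backs up to the LAST '>'.
greedy = regexp(row, '<td.*>', 'match', 'once');

% Negated class: [^>]* cannot cross a '>', so the match stops at the end of the first tag.
negated = regexp(row, '<td[^>]*>', 'match', 'once');

disp(greedy);
disp(negated);
```

The greedy version swallows the rest of the row, while the negated class returns only the opening `<td ...>` tag.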
As Michael's comment points out, your posted regex does not work. But even with the simplified regex:
hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a.*?/a>', 'match')' %transposed for easy viewing in command window
you can see that there is a problem. Unfortunately for you, the problem is the webpage itself, which is not valid HTML. The spacer.gif <a> tag (on line 131 of the page source) is never closed, so your regex captures everything up to the next /a>, which belongs to the following <tr><td> row.
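The effect can be reproduced without downloading the page. The sketch below builds a two-row sample from the fragments quoted in the question; the `<tr><td>` in front of the spacer.gif link and the variable names are assumptions made for the example:

```matlab
% Row with the unclosed <a> tag (the leading <tr><td> is assumed).
brokenRow = '<tr><td><a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>';
% A normal tractor row, as quoted in the question.
goodRow = '<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>';
sample = [brokenRow char(10) goodRow];  % char(10) is a newline

% The lazy .*? has no closing /a> in the first row, so it keeps going
% (in MATLAB, '.' also matches newlines) until the /a> of the second row.
runaway = regexp(sample, '(?<=<tr><td[^>]*>)<a.*?/a>', 'match');

disp(numel(runaway));
disp(runaway{1});
```

On this sample the two rows collapse into a single match that starts at the spacer.gif link, which is the wrong first result described in the question.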
That makes your life rather more difficult. Try a stricter pattern that does not allow any other tag between the opening <a ...> and the closing </a>:
hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a[^>]*>[^<]*</a>(?=</td><td[^>]*>\d+ hp</td>)', 'match')' %transposed for easy viewing in command window
And, if you can, report to the website owner that their page is missing a closing </a> tag.
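As a quick check, the stricter pattern can be run on a small two-row sample instead of the live page. This is only a sketch: the `<tr><td>` in front of the spacer.gif link, the variable names, and the second `regexp` call that splits each link into URL and label are assumptions added for illustration.

```matlab
% Two rows modelled on the question: one with the unclosed <a>, one valid.
brokenRow = '<tr><td><a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>';
goodRow = '<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>';
sample = [brokenRow char(10) goodRow];

% [^>]* and [^<]* cannot cross a tag boundary, so the unclosed spacer link
% can never reach a </a> and is rejected.
strict = '(?<=<tr><td[^>]*>)<a[^>]*>[^<]*</a>(?=</td><td[^>]*>\d+ hp</td>)';
links = regexp(sample, strict, 'match');

% Optional second step: split each matched link into its URL and its label.
for k = 1:numel(links)
    tok = regexp(links{k}, '<a href="([^"]*)"[^>]*>([^<]*)</a>', 'tokens', 'once');
    fprintf('%s -> %s\n', tok{2}, tok{1});
end
```

Note that \d+ only accepts whole-number horsepower values placed directly after the opening <td>; if the page lists other formats, the lookahead would need to be relaxed accordingly.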
