Regular expressions help with HTML source code

Question

I'm looking to parse some HTML source code to pull information from the Wall Street Journal. I need the prices of the following commodities: the 4 domestic crude oil spot prices, copper, aluminum, cotton, and cocoa.

This is the URL: http://online.wsj.com/mdc/public/page/2_3023-cashprices.html

I'm having some trouble with getting regexp to work the way I want it to.

What string expression would you use to pull out the middle (bold) price listed? If the value is n.a., it's okay if it just returns 'n.a.' or its equivalent.

I tried a variety of methods and I couldn't get it to work.

Could someone show an example of the string he or she would use for extracting the price?

Thanks!

Answer 1

Cedric 2013-3-12

Edited: Cedric 2013-3-12

Did you see my answer to your previous question? Named tokens work well in such situations:

 >> buffer = urlread('http://online.wsj.com/mdc/public/page/2_3023-cashprices.html');
 >> item    = 'West Texas Intermediate, Cushing' ;
 >> % Skip lazily to the first <b> after the item name, then capture an
 >> % optional prefix (e.g. a currency code) and the digits/dots of the price.
 >> pattern = [item, '.*?<b>(?<prefix>.*?)(?<price>[\d\.]*)</b>'] ;
 >> tokens  = regexp(buffer, pattern, 'names')
 tokens = 
    prefix: ''
     price: '92.06'
 >> item    = 'London fixing, spot price' ;
 >> pattern = [item, '.*?<b>(?<prefix>.*?)(?<price>[\d\.]*)</b>'] ;
 >> tokens  = regexp(buffer, pattern, 'names')
 tokens = 
    prefix: '&#163;'          % Code, but the forum renders it.
     price: '19.4273'

Cheers,

Cedric

Note that a '.' is returned as the price for n.a. entries.
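
To cover several commodities and deal with n.a. entries, the same named-token pattern could be applied in a loop, with the captured text converted by str2double so that a lone '.' becomes NaN. This is only a sketch: the HTML fragment and the item name 'Cocoa, Ivory Coast' are made up for illustration, and the markup of the real page may differ.

```matlab
% Hypothetical HTML fragment standing in for the downloaded page source.
buffer = ['<tr><td>West Texas Intermediate, Cushing</td>', ...
          '<td>91.50</td><td><b>92.06</b></td><td>93.10</td></tr>', ...
          '<tr><td>London fixing, spot price</td>', ...
          '<td>&#163;19.10</td><td><b>&#163;19.4273</b></td></tr>', ...
          '<tr><td>Cocoa, Ivory Coast</td>', ...
          '<td>n.a.</td><td><b>n.a.</b></td><td>n.a.</td></tr>'];

% Item names to look up (the last one is an assumed label).
items  = {'West Texas Intermediate, Cushing', ...
          'London fixing, spot price', ...
          'Cocoa, Ivory Coast'};
prices = nan(size(items));

for k = 1:numel(items)
    % Escape any regexp metacharacters that may appear in the item name.
    pattern = [regexptranslate('escape', items{k}), ...
               '.*?<b>(?<prefix>.*?)(?<price>[\d\.]*)</b>'];
    tokens  = regexp(buffer, pattern, 'names', 'once');
    if ~isempty(tokens)
        % str2double returns NaN for text that is not a number, e.g. '.'
        prices(k) = str2double(tokens.price);
    end
end
```

With this fragment, prices holds 92.06, 19.4273 and NaN, so isnan(prices) flags the n.a. rows.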

EDIT 1: corrected pattern thanks to Walter's comment about pound-signs.

EDIT 2: updated with named tokens so we get the prefix (e.g. pound-sign).

3 comments (1 earlier comment hidden)

Cedric 2013-3-12

Ah thank you Walter, I had not realized that there could be these signs!

Cedric 2013-3-12

Updated so the prefix is extracted (e.g. pound-sign).

Answer 2

Walter Roberson 2013-3-11

'^<b>.*?\d+(\.\d+)?<\\b>$'

This should allow for the currency symbol, and for the possibility that the decimal point and following digits are not there. The only real "trick" here is the use of .*? to indicate the minimum expansion of repeated . (i.e., match any one character) where .* by itself is "greedy" and would match as many characters as possible.
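
The difference between the greedy and the lazy quantifier can be seen on a short string. The fragment below is a made-up example, not text taken from the page.

```matlab
% Hypothetical one-line fragment with two bold cells.
str = '<b>92.06</b> <b>&#163;19.4273</b>';

% Greedy: .* grabs as much as possible, so the token runs to the LAST </b>.
greedy = regexp(str, '<b>(.*)</b>',  'tokens', 'once');

% Lazy: .*? grabs as little as possible, so the token stops at the FIRST </b>.
lazy   = regexp(str, '<b>(.*?)</b>', 'tokens', 'once');

disp(greedy{1})   % 92.06</b> <b>&#163;19.4273
disp(lazy{1})     % 92.06
```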

1 comment

Joseph Williams 2013-3-12

That doesn't work, in part because in the source code at the URL the end tags are written with a '/' instead of a '\'. Even after changing that, it still returns an empty answer. Do you have any other suggestions?

Best, J. Williams
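
One likely reason for the empty result, even with the slash corrected: by default MATLAB's regexp treats ^ and $ as anchors for the whole input text rather than for each line, so an anchored pattern can only match when the entire page is a single bold element. A possible alternative, sketched here on a made-up fragment, drops the anchors and uses [^<]* so a match cannot run past the closing tag.

```matlab
% Hypothetical fragment; the real page markup may differ.
str = ['<td><b>92.06</b></td>', ...
       '<td><b>&#163;19.4273</b></td>', ...
       '<td><b>n.a.</b></td>'];

% Token 1: optional prefix such as a currency code (no '<' allowed).
% Token 2: digits with an optional decimal part.
pattern = '<b>([^<]*?)(\d+(?:\.\d+)?)</b>';
tok = regexp(str, pattern, 'tokens');

for k = 1:numel(tok)
    fprintf('prefix: "%s"  price: %s\n', tok{k}{1}, tok{k}{2});
end
```

The n.a. cell contains no digits, so it yields no match here and would need to be handled separately.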
