Regular Expression to detect spaces in a string
Hello all, I have a string, for example:
string='<a href="abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided">abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided</a>'
I want to use regexp to get all the white spaces that occur between " and </a>. I have been trying to figure out how to do this with regexp but have not yet found an elegant solution. For example, regexp(string,'(?<="\S*)\s') returns only 2 spaces, not all of them. Could someone help me out?
Thanks a lot
2 Comments
Jan
2013-10-14
There are two " characters in this string. Which one do you mean? Please post the wanted result by editing the question (not as comment or pseudo-answer).
Accepted Answer
Deepak
2013-10-14
1 Comment
Cedric
2013-10-14
Edited: Cedric
2013-10-14
Hi Deepak, the issue with counting spaces using regexp is that it cannot be done with a simple pattern. The call to regexp (or possibly regexprep) that it would take is much more complicated than a single call to regexp with a simple pattern followed by a few additional operations.
More Answers (1)
Cedric
2013-10-14
Edited: Cedric
2013-10-14
Here is an example assuming that you want characters between " and </a> and not only white spaces:
>> s = regexp( html, '(?<=")[^"]+(?=</a>)', 'match' )
s =
'>Mathworks' '>Google'
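The variable html is not defined anywhere in this thread. A minimal sketch for reproducing the call, assuming html holds two anchor tags like the ones below (the string is a guess based on the outputs shown in this answer, not the original input):

```matlab
% Hypothetical sample input; an assumption, not taken from the original post.
html = ['<a href="http://www.mathworks.com">Mathworks</a> ', ...
        '<a href="http://www.google.com">Google</a>'] ;

% Look-behind: the match must come right after a double quote.
% [^"]+     : one or more characters that are not a double quote.
% Look-ahead: the match must be followed by </a>.
s = regexp( html, '(?<=")[^"]+(?=</a>)', 'match' ) ;

% Each match starts with '>' because the closing quote of the href value
% is the last double quote before </a>.
disp( s )
```

With this input, the leading > in each match is the character that follows the closing quote of the href value.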
Look-arounds are treacherous in this type of situation, where the expression in the look-behind can appear multiple times before the expression in the look-ahead is found. The following example illustrates this:
>> s = regexp( html, '(?<=").+?(?=</a>)', 'match' )
s =
'http://www.mathworks.com">Mathworks' 'http://www.google.com">Google'
where we see that the "smallest possible match" fails despite the lazy .+?. Let me know if you want to understand why, or see the example/discussion between Per and me here.
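One way to read this behavior: the engine scans from left to right and keeps the first starting position from which a match can succeed, and a lazy quantifier only limits how far the match extends from that start. The sketch below contrasts the two patterns, assuming a hypothetical html string (not the original input):

```matlab
% Hypothetical sample input; an assumption, not taken from the original post.
html = ['<a href="http://www.mathworks.com">Mathworks</a> ', ...
        '<a href="http://www.google.com">Google</a>'] ;

% Lazy dot: the first position preceded by a double quote is the start of
% the href value, and .+? is allowed to run across the closing quote until
% it reaches </a>, so the match begins too early.
lazyDot = regexp( html, '(?<=").+?(?=</a>)', 'match' ) ;

% Excluding double quotes from the matched characters makes that early
% start fail, so the engine moves on to the position after the last quote.
noQuote = regexp( html, '(?<=")[^"]+?(?=</a>)', 'match' ) ;

disp( lazyDot )
disp( noQuote )
```

Restricting the character class, rather than relying on laziness, is what keeps the match from starting at the first double quote.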
Note that using tokens is generally more efficient than using look-arounds:
>> s = regexp( html, '"([^"]+)</a>', 'tokens' ) ;
>> celldisp(s)
s{1}{1} =
>Mathworks
s{2}{1} =
>Google
Back to the initial question: the pattern could be more specific if you wanted to extract the content of the anchor or the value of its href attribute, e.g.
>> s = regexp( html, '[^>]+(?=</a>)', 'match' )
s =
'Mathworks' 'Google'
Or
>> s = regexp( html, 'href.+?"([^"]*)', 'tokens' ) ;
>> celldisp(s)
s{1}{1} =
http://www.mathworks.com
s{2}{1} =
http://www.google.com
Or
>> s = regexp( html, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names' )
s =
1x2 struct array with fields:
href
content
>> s(1)
ans =
href: 'http://www.mathworks.com'
content: 'Mathworks'
>> s(2)
ans =
href: 'http://www.google.com'
content: 'Google'
All these approaches can be fine-tuned/complex-ified for managing a broader set of cases, e.g. when there is a tag in the content of the anchor tag.
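As one possible illustration of the case with a tag inside the anchor content, the sketch below reuses the named-token pattern and then strips the inner tags with regexprep. The input string and the variable plainText are hypothetical, and the tag-stripping pattern is a simplification that assumes no > characters inside attribute values:

```matlab
% Hypothetical input with a nested tag; an assumption for illustration only.
html = '<a href="http://www.example.com"><b>Example</b> site</a>' ;

% Same named-token pattern as above: the content token is lazy and stops
% at the first </a>, so the inner <b>...</b> stays inside it.
s = regexp( html, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names' ) ;

% Remove anything that looks like a tag from the captured content.
plainText = regexprep( s.content, '<[^>]*>', '' ) ;

disp( s.href )
disp( plainText )
```

Nested anchors or malformed HTML would need more than this; a dedicated HTML parser is likely the safer route for such input.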
EDIT: if you really want to get the positions of the white spaces, your expression does work, but not the way you expected. Together with its look-behind, it actually matches
'"abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '
and
'">abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '
which both start with a " followed by non-whitespace characters up to the t of Limit, so only the white space right after Limit is found in each case. One thing that you could do, if you wanted to keep the pattern simple, is to get the starting and ending positions of the relevant sub-string:
>> [mStart, mStop] = regexp( html, '(?<=")[^"]+(?=</a>)', 'start', 'end' )
mStart =
76
mStop =
132
and use them to mask a logical index of the white-space positions:
>> isSpace = html == ' ' ;
>> isSpace(1:mStart-1) = false ;
>> isSpace(mStop+1:end) = false ;
>> find( isSpace )
ans =
117 121 125
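The masking steps above handle a single match. A minimal sketch of the same idea for several anchors, assuming a hypothetical two-anchor string (the variable names isSpace and spacePos and the input are illustrative, not from the original post):

```matlab
% Hypothetical input with two anchors; an assumption for illustration only.
html = '<a href="x y">x y z</a> <a href="p q">p q</a>' ;

% Start and end positions of every sub-string between " and </a>.
[mStart, mStop] = regexp( html, '(?<=")[^"]+(?=</a>)', 'start', 'end' ) ;

% Flag white spaces only inside the matched ranges.
isSpace = false( size( html ) ) ;
for k = 1 : numel( mStart )
    range = mStart(k) : mStop(k) ;
    isSpace(range) = html(range) == ' ' ;
end

% Positions of the flagged white spaces in the full string.
spacePos = find( isSpace ) ;
disp( spacePos )
```

White spaces inside the href values and between the two anchors fall outside the matched ranges, so they are not reported.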