Regular Expression to detect spaces in a string

Question

1 vote

Hello all, I have a string, for example:

string='<abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided  abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided>'

I want to use regexp to get all the white spaces that occur between " and </a>. I have been trying to figure out how to do this with regexp but have not yet found an elegant solution. For example, regexp(string,'(?<="\S*)\s') returns only 2 spaces and not all of them. Could someone help me out?

Thanks a lot

2 Comments

Cedric 2013-10-14

Edited: Cedric 2013-10-14

What do you mean by "spaces"? Is it just white spaces or all characters? If you really meant white spaces, is it their position that you want? If you want characters, what is the purpose? REGEXP can parse the whole tag and extract whatever part you want.

Jan 2013-10-14

There are two " characters in this string. Which one do you mean? Please post the wanted result by editing the question (not as a comment or pseudo-answer).


Answer 1

Deepak 2013-10-14

0 votes

Hi Cedric, thanks for the really detailed answer. It really helped. I actually wanted to get the position of the white spaces, so the second part of the answer really addresses my query. I was hoping to get the white spaces with one regexp without using any other commands like isspace, but I guess that would be complicated. I am not really familiar with tokens. So once again, thanks for your detailed answer.

1 Comment

Cedric 2013-10-14

Edited: Cedric 2013-10-14

Hi Deepak, The issue with counting spaces using regexp is that it's not possible to do it using a simple query. The call to regexp (possibly regexprep) that we would have to use would be much more complicated than doing the whole operation using one call to regexp with a simple pattern and a few additional operations.


Answer 2

Cedric 2013-10-14

Edited: Cedric 2013-10-14

3 votes

Here is an example assuming that you want characters between " and </a> and not only white spaces:

 >> s = regexp( html, '(?<=")[^"]+(?=</a>)', 'match' )
 s = 
    '>Mathworks'    '>Google'

Look-arounds are treacherous in this type of situation, where the expression in the look-behind can appear multiple times before the expression in the look-ahead is found. The following example illustrates this:

 >> s = regexp( html, '(?<=").+?(?=</a>)', 'match' )
 s = 
    'http://www.mathworks.com">Mathworks'    'http://www.google.com">Google'

where we see that the "smallest possible match" fails despite the lazy .+?. Let me know if you want to understand why, or see the example/discussion between Per and me here.
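The two behaviours above can be reproduced with the sketch below. The definition of html is an assumption, since it is not shown in this thread; it was chosen to be consistent with the outputs above. The reason the lazy version "fails" is that a lazy quantifier only makes the match end as early as possible. The match still starts at the first position where the look-behind succeeds, which is right after the first ".

```matlab
% Assumed input: the thread does not show how html was defined.
html = ['<a href="http://www.mathworks.com">Mathworks</a> ' ...
        '<a href="http://www.google.com">Google</a>'];

% Lazy quantifier: the match still begins right after the FIRST quote,
% so the rest of the href value and the closing quote are included.
lazy = regexp(html, '(?<=").+?(?=</a>)', 'match');

% Negated class: the match cannot contain a quote, so it can only begin
% after the LAST quote that precedes </a>.
strict = regexp(html, '(?<=")[^"]+(?=</a>)', 'match');

disp(lazy)
disp(strict)
```

Excluding the delimiter with [^"] is usually a more dependable way to get the shortest match than relying on laziness alone.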

Note that using tokens is generally more efficient than using look-arounds:

 >> s = regexp( html, '"([^"]+)</a>', 'tokens' ) ;
 >> celldisp(s)
 s{1}{1} =
         >Mathworks
 s{2}{1} =
         >Google

Back to the initial question: the pattern could be more specific if you wanted to extract the content or the value of the href attribute, e.g.

 >> s = regexp( html, '[^>]+(?=</a>)', 'match' )
 s = 
    'Mathworks'    'Google'

Or

 >> s = regexp( html, 'href.+?"([^"]*)', 'tokens' ) ;
 >> celldisp(s)
   s{1}{1} =
           http://www.mathworks.com
   s{2}{1} =
           http://www.google.com

Or

 >> s = regexp( html, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names' )
 s = 
    1x2 struct array with fields:
      href
      content
 >> s(1)
 ans = 
       href: 'http://www.mathworks.com'
    content: 'Mathworks'
 >> s(2)
 ans = 
       href: 'http://www.google.com'
    content: 'Google'

All these approaches can be fine-tuned/complex-ified for managing a broader set of cases, e.g. when there is a tag in the content of the anchor tag.
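As one possible illustration of such fine-tuning, here is a minimal sketch for the case where the anchor content itself contains a tag. The input string is hypothetical, and stripping the inner tags with regexprep is just one option, not necessarily the approach intended above.

```matlab
% Hypothetical input: an anchor whose content contains a nested <b> tag.
html = '<a href="http://www.mathworks.com"><b>Math</b>works</a>';

% Same named-token pattern as above; the lazy content token runs up to </a>.
s = regexp(html, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names');

% Remove any tags left inside the captured content.
plain = regexprep(s.content, '<[^>]+>', '');

disp(s.href)
disp(plain)
```

This assumes no nested anchors and no > characters inside attribute values; real-world HTML is generally better handled by a proper parser.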

EDIT: if you really want to get the position of white spaces, your expression does work but not as you thought. It actually matches

'"abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '

and

'">abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '

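which both start with a " followed by non-whitespace characters up to and including the t of Limit. Because the look-behind "\S* allows only non-whitespace characters between the " and the matched white space, only the first white space after each " qualifies, hence the 2 matches. One thing that you could do, if you wanted to keep the pattern simple, is to get the starting and ending positions of the relevant sub-string: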

 >> [mStart, mStop] = regexp( html, '(?<=")[^"]+(?=</a>)', 'start', 'end' )
 mStart =
    76
 mStop =
   132

and use them to mask a logical index of position of white spaces:

 >> isSpace = html == ' ' ;
 >> isSpace(1:mStart-1)  = false ;
 >> isSpace(mStop+1:end) = false ;
 >> find( isSpace )
 ans =
   117   121   125
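For reference, here is a self-contained sketch of the masking approach, followed by a single-regexp alternative based on a look-ahead. The string str is an assumed reconstruction of the anchor tag (an href value and a label that are identical), so the positions differ from the ones printed above. The look-ahead variant is a suggestion rather than part of the original answer, and it assumes that no " appears between the white space and </a>.

```matlab
% Assumed reconstruction of the input; the exact original string is unknown.
label = 'abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided';
str   = ['<a href="', label, '">', label, '</a>'];

% Masking approach: locate the sub-string between the last " and </a> ...
[mStart, mStop] = regexp(str, '(?<=")[^"]+(?=</a>)', 'start', 'end');

% ... and keep only the white spaces that fall inside it.
isSpace = str == ' ';
isSpace(1:mStart-1)  = false;
isSpace(mStop+1:end) = false;
posMask = find(isSpace);

% Single-regexp alternative: a white space qualifies if only non-quote
% characters separate it from the closing </a>.
posRegexp = regexp(str, '\s(?=[^"]*</a>)');

disp(posMask)     % 108 112 116 for this assumed string
disp(posRegexp)   % same positions
```

With several anchors in one string, the look-ahead version would also need to stop [^"]* from running across a preceding </a>, so the masking approach may remain easier to reason about.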
