How to capture tokens using regular expressions?

Question

Patrick Mboma 2015-9-16

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions

Commented: Cedric 2015-9-19

Dear all, I would like to capture two parts of a sequence of strings. I would like to call the first part "main" and the second part "digits". The expressions in the strings have a distinct pattern in that they either have ONE underscore or parentheses. What I am looking to capture is the part before the underscore or the opening parenthesis (main) and the part after the underscore or inside the parenthesis (digits). As an example, the typical exercise will be of the form

 expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'}
 out=regexp(expression,pattern,'name')

The result should be a cell array where each cell contains a structure with fields "main" and "digits". In the first case, for instance, the result should be

main='abcd' and digits='1'.

What I am missing is the right "pattern". Any suggestions?

5 Comments

Cedric 2015-9-17

Edited: Cedric 2015-9-17

Dear Patrick,

In summary, to extract and validate digits and the decimal point, I would write a pattern like

'(.*?)[\(_]([\d\.]*)'

which explicitly requires the second part to consist of zero or more (*) elements of the set ([]) of digits (\d) or decimal points (\.). If, instead, I wanted to leave validation to STR2DOUBLE, I would extract whatever is inside the parentheses or after the underscore:

'(.*?)[\(_]([^\)]*)'

which reads as zero or more (*) characters that are not ([^]) a literal closing parenthesis. Another way is given by Benjamin, who adds an optional closing parenthesis at the end of the pattern.
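As a minimal sketch of how the two patterns above behave, the snippet below adds the token names main and digits requested in the question (an assumption: the patterns as posted use unnamed tokens) and runs both variants on a few sample strings. The input 'fadae(8.5)' is invented here only to exercise the decimal point.

```matlab
% Sample inputs; 'fadae(8.5)' is hypothetical and only exercises the decimal point.
expression = {'abcd_1', 'ghsa(22)', 'whatever345whatever_100', 'fadae(8.5)'};

% Strict variant: second token limited to digits and decimal points.
strict = regexp(expression, '(?<main>.*?)[\(_](?<digits>[\d\.]*)', 'names', 'once');

% Permissive variant: second token is anything up to a closing parenthesis.
loose = regexp(expression, '(?<main>.*?)[\(_](?<digits>[^\)]*)', 'names', 'once');

% Leave validation to STR2DOUBLE, which returns NaN for non-numeric text.
values = cellfun(@(s) str2double(s.digits), loose);
```

Because the first token is lazy (.*?), it stops at the first underscore or opening parenthesis, so main may itself contain digits.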

I also asked about how these strings are defined initially, because the context is important. If you are dealing with a reasonable number of cells, performing pattern matching on a cell array will be efficient enough. If, on the contrary, you have e.g. a 1GB file of entries to process, you may be much more efficient working on it "manually". To illustrate, say the file contains

 name1_45 
 name2(45)
 name2b_32
 name2c(84)
 ..

then you could load it as a char array, replace all '_', '(', ')', new lines, and carriage returns with white spaces, and extract names and contents in one shot with SSCANF or TEXTSCAN:

 % - Dummy file content.
 content = sprintf( 'name1_45\nname2(45)\nname2b_32\nname2c(84)\n' ) ;
 % - Flag elements to replace.
 doReplace = content == '_' | content == '(' | content == ')' | content == 10 ;
 % - Replace with a space.
 content(doReplace) = ' ' ;
 % - Parse.
 parsed = textscan( content, '%s %f' ) ;

(10 is the ASCII code of the newline \n; the code should also handle 13, the carriage return. It may be possible to make this even more efficient using BSXFUN.) With that we get

 >> parsed
 parsed = 
    {4x1 cell}    [4x1 double]
 >> parsed{1}
 ans = 
    'name1'
    'name2'
    'name2b'
    'name2c'
 >> parsed{2}
 ans =
    45
    45
    32
    84
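The remark about carriage returns could be handled as in the sketch below, which flags all separators with a single call to ISMEMBER instead of chaining comparisons. The \r\n line endings in the dummy content and the variable names names and numbers are assumptions made for illustration.

```matlab
% - Dummy file content with \r\n line endings (assumed for illustration).
content = sprintf('name1_45\r\nname2(45)\r\nname2b_32\r\nname2c(84)\r\n');
% - Flag '_', '(', ')', line feed (10), and carriage return (13).
doReplace = ismember(content, ['_()', char(10), char(13)]);
% - Replace flagged characters with spaces.
content(doReplace) = ' ';
% - Parse names and numbers.
parsed = textscan(content, '%s %f');
names = parsed{1};
numbers = parsed{2};
```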

Patrick Mboma 2015-9-19

Thanks a lot Cedric!!!

Cedric 2015-9-19

My pleasure!

Answer 1

Benjamin Kraus 2015-9-16

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions#answer_192653

expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression,pattern,'once','names');

The pattern breaks down like this:

(?<main>[a-zA-Z]+) - A token named "main" containing only letters.
(?:[_\(]) - A non-capturing group matching either an underscore or "(".
(?<digits>[0-9]+) - A token named "digits" containing only digits.
\)? - An optional ")" character at the end.

The 'once' option returns only the first match in each input string. I think in this case you can leave it out.
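A short usage sketch, assuming the closing parenthesis in the pattern is escaped as \)? so that it is matched literally: it runs the pattern on the question's inputs and collects the two fields. The variable names mains and digits are chosen here for illustration.

```matlab
expression = {'abcd_1', 'ghsa(22)', 'gaver_45', 'fadae(8)'};
% Closing parenthesis escaped so that it is matched literally and optionally.
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression, pattern, 'names', 'once');

% With 'once', each cell holds a scalar structure, so fields can be read directly.
mains = cellfun(@(s) s.main, out, 'UniformOutput', false);
digits = cellfun(@(s) str2double(s.digits), out);
```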

1 Comment

Patrick Mboma 2015-9-17

Dear Benjamin,

Thanks for your input. Your solution would work but would probably need to be refined, in the sense that the first part, main, may also include some digits. For instance,

whatever345whatever_100

would also be something I would like to capture. It is the second part that would only include digits.

A potential algorithm would be to say everything before an opening parenthesis or an underscore is to be captured in "main", while everything after an underscore or inside parentheses is to be captured in "digits".
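One possible reading of this algorithm is sketched below. It compares a letters-only main with a main defined as "anything except an underscore or opening parenthesis". The anchors ^ and $ and the variable names are assumptions added here, not part of any pattern posted in the thread.

```matlab
str = 'whatever345whatever_100';

% Letters-only main: the match can only start at the last run of letters.
lettersOnly = regexp(str, '(?<main>[a-zA-Z]+)[_\(](?<digits>[0-9]+)\)?', 'names', 'once');

% Main as "anything except underscore or opening parenthesis", anchored to the whole string.
refined = regexp(str, '^(?<main>[^_\(]+)[_\(](?<digits>[0-9]+)\)?$', 'names', 'once');
```

Here lettersOnly.main holds only the trailing 'whatever', while refined.main holds the full 'whatever345whatever'.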

Answer 2

Kirby Fears 2015-9-16

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions#answer_192648

This isn't the most efficient or elegant solution, but it solves the problem. Let me know if your data is large enough that this code is slow. I can optimize it.

ex={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
temp=cellfun(@(s)strsplit(s,{'_','(',')'}),ex,'UniformOutput',false);
ex_main=cellfun(@(s)s{1},temp,'UniformOutput',false);
ex_digit=cellfun(@(s)s{2},temp,'UniformOutput',false);
clear temp;

1 Comment

Patrick Mboma 2015-9-17

Dear Kirby,

There are many ways to solve this problem and what you are suggesting is definitely one way to do it. However, I would like to use the elegance of regular expressions and get to practice something I am not very good at yet.

In my current solution for instance, I first use regular expressions to transform all the inputs into the same format

whatever_45

then I look for the underscore, etc. But this entails several lines of code.
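The code for this two-step approach is not shown in the thread, so the following is only a hypothetical reconstruction: REGEXPREP rewrites the parenthesized form into the underscore form, and the result is then split at the underscore. All variable names are illustrative.

```matlab
ex = {'abcd_1', 'ghsa(22)', 'gaver_45', 'fadae(8)'};
% Step 1: rewrite 'name(12)' as 'name_12' so every entry has the same format.
normalized = regexprep(ex, '\((\d+)\)$', '_$1');
% Step 2: split each entry at the underscore (assumes exactly one per entry).
parts = regexp(normalized, '_', 'split');
main = cellfun(@(p) p{1}, parts, 'UniformOutput', false);
digits = cellfun(@(p) p{2}, parts, 'UniformOutput', false);
```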

Thanks for your input!
