help with regexpi expression match

Question

J M 2017-12-4

Direct link to this question: https://ww2.mathworks.cn/matlabcentral/answers/370829-help-with-regexpi-expression-match

Edited: per isakson 2017-12-4

I have a question about matching with a regexpi expression, which may be an easy one (though not for me). I have a set of strings in a single cell. An example is as follows:

d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};

I would like a regexpi expression to pick up the two names that follow "chromosome", which are found after the ":" and the "/". So, for example, I want to match NC_011985.1, CP000628.1, NC_011983.1, CP000629.1, NC_002945.4 and LT708304.1, but I want to ignore the names that follow "plasmid". I chose this long string as an example because I want the names to be preceded by the word "chromosome". However, as you can see, after the word "chromosome" there may be a number, a word or even nothing, followed by a colon ":", a name (that we want to keep) and then another name that follows "/" (that we also want to keep). Keeping all the names in one cell is fine; I just want to pick up these names.

If I use the following code:

accession6 = regexpi(d2,'(?<=:)\w+','match');

Using this as a base, I do not know how to require that the match be preceded by the word "chromosome", followed by an optional number, a word or even nothing, without messing it up. That part would have to come before the necessary ":" and "/" parts of the expression, which precede the names we want to keep.

Any help would be super appreciated.

1 comment

Stephen23 2017-12-4

Edited: Stephen23 2017-12-4

"Any help would be super appreciated"

You might like to download my FEX submission iregexp, an interactive regular expression tool:

https://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-tool

It lets you quickly experiment with different regular expressions and shows all of regexp's outputs in real-time as you type.

Answer 1

per isakson 2017-12-4

Direct link to this answer: https://ww2.mathworks.cn/matlabcentral/answers/370829-help-with-regexpi-expression-match#answer_294533

Edited: per isakson 2017-12-4

One expression:

'chromosome' followed by one or more characters other than ':', then one ':'
a capturing group of one or more letters, digits, underscores or '.' (greedy)
zero or more characters other than '/', then one '/'
a capturing group of one or more letters, digits, underscores or '.' (greedy)

The search repeats until no more matches are found.

>> cac = regexpi( d2, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac{:}{:}
ans = 
    'NC_011985.1'    'CP000628.1'
ans = 
    'NC_011983.1'    'CP000629.1'
ans = 
    'NC_002945.4'    'LT708304.1'
>>

If d2 contains only one string, pass it as a character vector:

>> cac = regexpi( d2{:}, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac
cac = 
    {1x2 cell}    {1x2 cell}    {1x2 cell}
>>
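
The question mentions that keeping all the names in one cell is fine. As a minimal sketch (not part of the original answer), the nested tokens returned above can be flattened by concatenating the inner cells. The variable names pat and names are only illustrative.

```matlab
% d2 as given in the question: a 1x1 cell holding one long string.
d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};

% Same expression as in the answer; "pat" is an illustrative name.
pat = 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)';

% For a char input, 'tokens' returns a 1x3 cell whose elements are 1x2 cells.
cac = regexpi(d2{:}, pat, 'tokens');

% Concatenate the inner 1x2 cells into one flat 1x6 cell of names.
names = [cac{:}];
% names = {'NC_011985.1','CP000628.1','NC_011983.1','CP000629.1','NC_002945.4','LT708304.1'}
```

The order of the names follows the order of the matches in the string, so each pair stays adjacent in the flat list.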

6 comments (the 4 earlier comments are not shown)

per isakson 2017-12-4

Edited: per isakson 2017-12-4

And an alternative that uses @JM's approach. In a first step match "name slash name" between

look-behind: (?<=chromosome[^:]+[:])
look-ahead: (?=;|$)

and in a second step split the two names at slash

cac = regexpi( d2{:}, '(?<=chromosome[^:]+[:])[\w\.]+[/][\w\.]+(?=;|$)', 'match' );
cac = regexp( cac, '/', 'split' );
cac{:}
ans = 
    'NC_011985.1'    'CP000628.1'
ans = 
    'NC_011983.1'    'CP000629.1'
ans = 
    'NC_002945.4'    'LT708304.1'
>>

per isakson 2017-12-4

chromosome[^:;] with a semi-colon (proposed by @Guillaume) is better than chromosome[^:] without, because the latter will return a plasmid-name-pair if a colon is missing in the string after 'chromosome'. With semi-colon the pair is missed altogether.
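
To make the difference concrete, here is a small sketch on a made-up string in which the colon after "chromosome 1" is missing. The string s and the variable names loose and strict are assumptions for illustration only; they do not come from the thread.

```matlab
% Hypothetical input: the ':' after 'chromosome 1' is missing.
s = 'chromosome 1 NC_011985.1/CP000628.1; plasmid pAgK84:NC_011994.1/CP000632.1';

% Without ';' in the negated class, [^:]+ runs past the ';' up to the
% plasmid's colon, so the plasmid pair is returned by mistake.
loose = regexpi(s, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');
% loose{1} = {'NC_011994.1','CP000632.1'}

% With ';' excluded, the run stops at the ';' and no ':' follows,
% so nothing is matched and the malformed entry is skipped.
strict = regexpi(s, 'chromosome[^:;]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');
% isempty(strict) is true
```

The question also allows nothing at all between "chromosome" and ":". Since + requires at least one character, [^:;]* in place of [^:;]+ would presumably be needed for that case, which the comments shown here do not cover.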
