help with regexpi expression match

J M 2017-12-4
I have a question about matching with regexpi that may be an easy one (though not for me). I have a set of strings stored in a single cell. An example is as follows:
d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};
I would like a regexpi expression to pick up the two names that follow "chromosome", found after the ":" and after the "/". For example, I want to match NC_011985.1, CP000628.1, NC_011983.1, CP000629.1, NC_002945.4 and LT708304.1, but ignore the names that follow "plasmid". I chose this long string as an example because the names must be preceded by the word "chromosome". As you can see, the word "chromosome" may be followed by a number, a word or even nothing, then a colon ":", a name (that we want to keep), then a "/" and another name (that we also want to keep). Keeping all the names in one cell is fine; I just want to pick up these names.
If I use the following code:
accession6 = regexpi(d2,'(?<=:)\w+','match');
Using this as a base, I do not know how to require that the match be preceded by the word "chromosome", optionally followed by a number, a word or nothing at all, without messing the expression up. That part would have to come before the required ":" and "/" parts that precede the names we want to keep.
Any help would be super appreciated.
1 comment
Stephen23 2017-12-4
Edited: Stephen23 2017-12-4
"Any help would be super appreciated"
You might like to download my FEX submission iregexp, an interactive regular expression tool:
It lets you quickly experiment with different regular expressions and shows all of regexp's outputs in real-time as you type.


Answers (1)

per isakson 2017-12-4
Edited: per isakson 2017-12-4
A single expression does it:
  • 'chromosome' followed by one or more characters other than ':', then one ':'
  • a capturing group of one or more letters, digits, underscores or '.' (greedy)
  • zero or more characters other than '/', then one '/'
  • a capturing group of one or more letters, digits, underscores or '.' (greedy)
regexpi repeats this until no more matches are found:
>> cac = regexpi( d2, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac{:}{:}
ans =
'NC_011985.1' 'CP000628.1'
ans =
'NC_011983.1' 'CP000629.1'
ans =
'NC_002945.4' 'LT708304.1'
>>
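If d2 contains only one string, pass the string itself (d2{:}) instead of the cell array; the result then has one level of nesting less: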
>> cac = regexpi( d2{:}, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac
cac =
{1x2 cell} {1x2 cell} {1x2 cell}
>>
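Since the question says that keeping all the names in one cell is fine, the nested tokens can be flattened with a plain concatenation. The following is a minimal sketch, assuming the example string from the question; the variable names `str`, `tok` and `names` are only illustrative.

```matlab
% Example string copied from the question.
str = ['chromosome 1:NC_011985.1/CP000628.1; ', ...
       'chromosome 2:NC_011983.1/CP000629.1; ', ...
       'plasmid pAgK84:NC_011994.1/CP000632.1; ', ...
       'plasmid pAtK84b:NC_011990.1/CP000630.1; ', ...
       'plasmid pAtK84c:NC_011987.1/CP000631.1 ', ...
       'chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'];

% One 1x2 cell of tokens per 'chromosome' entry.
tok = regexpi(str, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

% Concatenating the inner cells gives a single flat 1x6 cell of names.
names = [tok{:}];
disp(names)
```

The order of the names follows their order in the string, so each pair stays adjacent in `names`.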
6 comments
per isakson 2017-12-4
Edited: per isakson 2017-12-4
And an alternative that builds on @JM's approach. In a first step, match "name/name" between
  • look-behind: (?<=chromosome[^:]+[:])
  • look-ahead: (?=;|$)
and in a second step, split the two names at the slash:
cac = regexpi( d2{:}, '(?<=chromosome[^:]+[:])[\w\.]+[/][\w\.]+(?=;|$)', 'match' );
cac = regexp( cac, '/', 'split' );
cac{:}
ans =
'NC_011985.1' 'CP000628.1'
ans =
'NC_011983.1' 'CP000629.1'
ans =
'NC_002945.4' 'LT708304.1'
>>
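As a quick sanity check, the token-based expression and the two-step look-around version can be run on the same string and compared. This is only a sketch on the example string from the question; the variable names are illustrative, and agreement on this one string does not prove the two patterns are equivalent in general.

```matlab
% Example string copied from the question.
str = ['chromosome 1:NC_011985.1/CP000628.1; ', ...
       'chromosome 2:NC_011983.1/CP000629.1; ', ...
       'plasmid pAgK84:NC_011994.1/CP000632.1; ', ...
       'plasmid pAtK84b:NC_011990.1/CP000630.1; ', ...
       'plasmid pAtK84c:NC_011987.1/CP000631.1 ', ...
       'chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'];

% One-step approach: two capturing groups per match.
tok = regexpi(str, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

% Two-step approach: match "name/name", then split each match at the slash.
pairs = regexpi(str, '(?<=chromosome[^:]+[:])[\w\.]+[/][\w\.]+(?=;|$)', 'match');
parts = regexp(pairs, '/', 'split');

% Both results are cells of 1x2 cells, so they can be compared directly.
sameResult = isequal(tok, parts);
disp(sameResult)
```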
per isakson 2017-12-4
chromosome[^:;] with a semicolon in the negated set (proposed by @Guillaume) is better than chromosome[^:] without it. If the colon is missing after 'chromosome', the latter runs on past the ';' to the next colon and returns the name pair of the following plasmid. With the semicolon in the set, that malformed entry yields no match at all.
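The difference can be seen on a small made-up string in which the first chromosome entry lacks its colon. This is a sketch only: the string and the names in it (`NC_1.1`, `CP_1.1`, and so on) are hypothetical and not taken from the question.

```matlab
% Hypothetical input: the first 'chromosome' entry has no colon and no names.
str = 'chromosome X; plasmid pA:NC_1.1/CP_1.1; chromosome 2:NC_2.1/CP_2.1';

% Without ';' in the negated set, the match runs past the first ';' and
% picks up the plasmid's pair.
loose  = regexpi(str, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

% With ';' in the negated set, the malformed entry is skipped.
strict = regexpi(str, 'chromosome[^:;]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

fprintf('loose:  %d pair(s), first = %s / %s\n', numel(loose),  loose{1}{:});
fprintf('strict: %d pair(s), first = %s / %s\n', numel(strict), strict{1}{:});
```

Here `loose` contains the plasmid pair in addition to the genuine chromosome pair, while `strict` contains only the latter.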
