help with regexpi expression match
I have a question regarding regexpi expression match which may be an easy one (not for me). I have a set of strings from a single cell. An example is as follows:
d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};
I would like a regexpi expression that picks up the two names that follow "chromosome", found after the ":" and after the "/". For example, I want to match NC_011985.1, CP000628.1, NC_011983.1, CP000629.1, NC_002945.4 and LT708304.1, but ignore the names that follow "plasmid". I chose this long string as an example because I want the names to be preceded by the word "chromosome". However, as you can see, the word "chromosome" may be followed by a number, a word, or even nothing, then by a colon ":", a name (that we want to keep), and then another name after a "/" (that we also want to keep). Keeping all the names in one cell is fine; I just want to pick up these names.
If I use the following code:
accession6 = regexpi(d2,'(?<=:)\w+','match');
Using this as a base, I do not know how to precede the match with the word "chromosome" followed by an optional number, a word, or nothing at all, without messing it up. That part would have to come before the required ":" and "/" parts of the expression that precede the names we want to keep.
Any help would be super appreciated.
Answers (1)
per isakson
2017-12-4
Edited: per isakson
2017-12-4
One expression:
- 'chromosome' followed by one or more characters other than ':', then one ':'
- capturing group of one or more letters, digits, underscores, or '.' (greedy)
- zero or more characters other than '/', then one '/'
- capturing group of one or more letters, digits, underscores, or '.' (greedy)
The expression is applied repeatedly until no more matches are found.
>> cac = regexpi( d2, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac{:}{:}
ans =
'NC_011985.1' 'CP000628.1'
ans =
'NC_011983.1' 'CP000629.1'
ans =
'NC_002945.4' 'LT708304.1'
>>
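The question mentions that "chromosome" may be followed by nothing at all before the colon. As a minimal sketch of that case: `[^:]+` requires at least one character, so relaxing it to `[^:]*` appears to be needed there. The input string and its accession names below are made up for illustration and are not from the original data.

```matlab
% Hypothetical input: nothing between 'chromosome' and ':' (accessions are invented)
s = 'chromosome:NC_000001.1/CP000001.1; plasmid pX:NC_000002.1/CP000002.1';

% Quantifier from the answer: [^:]+ needs at least one character before the colon
tokPlus = regexpi(s, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

% Relaxed quantifier (an assumption, not part of the answer): [^:]* also accepts none
tokStar = regexpi(s, 'chromosome[^:]*[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

disp(isempty(tokPlus))   % no match with the '+' version
disp(tokStar{1})         % the chromosome pair with the '*' version
```

With real data such as `chromosome 1:` both versions behave the same, so the relaxed quantifier only matters for the bare `chromosome:` form.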
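If d2 contains only one string, pass the string itself (d2{:}) instead of the cell array; the result then has one level of nesting less: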
>> cac = regexpi( d2{:}, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac
cac =
{1x2 cell} {1x2 cell} {1x2 cell}
>>
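Since the question says that keeping all the names in one cell is fine, the nested token cells can be flattened. The following is a small sketch, assuming the single-string form of the call; the variable names `s` and `names` are only illustrative.

```matlab
% The example string from the question, split over lines for readability
s = ['chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; ' ...
     'plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; ' ...
     'plasmid pAtK84c:NC_011987.1/CP000631.1 ' ...
     'chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'];

% One 1x2 cell of tokens per 'chromosome' entry
cac = regexpi(s, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

% Concatenate the 1x2 cells horizontally into a single 1x6 cell of names
names = [cac{:}];
disp(names)
```

Here `[cac{:}]` relies on every match returning a row cell of the same kind, which holds because the expression always has exactly two capturing groups.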
6 comments
per isakson
2017-12-4
Edited: per isakson
2017-12-4
And an alternative that uses @JM's approach. In a first step match "name slash name" between
- look-behind: (?<=chromosome[^:]+[:])
- look-ahead: (?=;|$)
and in a second step split the two names at slash
cac = regexpi( d2{:}, '(?<=chromosome[^:]+[:])[\w\.]+[/][\w\.]+(?=;|$)', 'match' );
cac = regexp( cac, '/', 'split' );
cac{:}
ans =
'NC_011985.1' 'CP000628.1'
ans =
'NC_011983.1' 'CP000629.1'
ans =
'NC_002945.4' 'LT708304.1'
>>
per isakson
2017-12-4
chromosome[^:;] with a semicolon (proposed by @Guillaume) is better than chromosome[^:] without one. If the colon after 'chromosome' is missing from the string, the latter runs on to the next colon and returns a plasmid name pair. With the semicolon, the match stops at the end of the entry, so the pair is missed altogether rather than replaced by a wrong one.
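The difference can be checked with a short sketch. The input below is a deliberately malformed, shortened version of the example string in which the colon after "chromosome 1" has been removed; it is a test case made up for illustration.

```matlab
% Hypothetical malformed input: the colon after 'chromosome 1' is missing
s = 'chromosome 1 NC_011985.1/CP000628.1; plasmid pAgK84:NC_011994.1/CP000632.1';

% Without ';' in the negated class, the match runs past the semicolon
% to the colon of the plasmid entry and captures the plasmid pair
withoutSemi = regexpi(s, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

% With ';' in the negated class, the match cannot cross the entry boundary
withSemi = regexpi(s, 'chromosome[^:;]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

disp(withoutSemi{1})      % the plasmid names, which is not what we want
disp(isempty(withSemi))   % nothing is returned for the malformed entry
```

Neither variant recovers the chromosome names from the malformed entry; the semicolon version simply avoids returning a misleading result.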