How can I specify position and exclude repeated characters using regexp?
In searching gene sequence data, I want to find sequences that have the form NGGGNGGGN
where N = A, C, T, or G, in any order, of length 1-7. However, I do not want to find N with repeated G, for example I don't want N = GG, AGGA, AGGGA. I want to only find N that includes G but does not have consecutive G like GG, and I don't want to find N where G is first or last such that the GGG would be extended by the presence of the G in N.
I want to use something like expr = 'G{3}[ACTG]{1-7}(?!GG)G{3}' but MATLAB does not like this. I'm not very good with conditions in regexp, or regexp in general. Any help is appreciated.
Answers (1)
Nitin Khola
2015-11-5
Edited: Walter Roberson
2015-11-6
Thanks for providing a detailed question.
From what I understand, you are looking for sequences in which the only run of G's is "GGG"; any other repeated-G pattern is unwanted. So I thought we could use "strfind" http://www.mathworks.com/help/matlab/ref/strfind.html to look for the pattern "GGG". As the documentation describes, "strfind" returns the starting indices of the pattern it is searching for. These indices help eliminate sequences that have, for example, "AGGGA" in N.

The idea is simple. First, call "strfind" on each string to locate the indices of the "GGG" pattern. Second, eliminate sequences with indices that are not allowed. For example, if the length of N is 6, only indices 7 and 16 are valid.

You can even come up with a formula for the valid indices, assuming every N has the same length: length of N = (total sequence length - 6)/3, and the valid indices for the "GGG" pattern are (length of N + 1) and (2*length of N + 3 + 1). I apologize in advance if I have made any arithmetic errors in the above formula.
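The index check described above can be sketched roughly as follows. The example sequence and the variable names (seq, nLen, expectedGGG, and so on) are made up for illustration, and the sketch assumes all three N have the same length. The extra "GG" check at the end is an addition beyond the "GGG" check, included because a lone "GG" inside N (as in "AGGA") would otherwise pass.

```matlab
% Hypothetical example: N = 'ACTGAC' (length 6) in all three positions.
seq = 'ACTGACGGGACTGACGGGACTGAC';

% Assumes the form N GGG N GGG N with every N the same length.
nLen = (length(seq) - 6) / 3;

% Expected start indices of the two 'GGG' blocks.
expectedGGG = [nLen + 1, 2*nLen + 4];

% strfind reports overlapping matches, so a 'GGGG' run or a 'GGG'
% inside N produces extra indices and fails the comparison.
foundGGG = strfind(seq, 'GGG');
gggOk = (nLen == floor(nLen)) && isequal(foundGGG, expectedGGG);

% Extension (an assumption, not part of the 'GGG' check): the only 'GG'
% pairs allowed are the two inside each 'GGG' block.
expectedGG = [nLen + 1, nLen + 2, 2*nLen + 4, 2*nLen + 5];
ggOk = isequal(strfind(seq, 'GG'), expectedGG);

isValid = gggOk && ggOk;
```

If the three N may have different lengths, the expected indices cannot be derived from the total length alone, and this particular comparison would need to be adapted.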
Also, you may need to loop through your data for this, or, if all of your data is stored in a cell array, you can take the shorter route of using "cellfun" http://www.mathworks.com/help/matlab/ref/cellfun.html.
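As a rough sketch of the "cellfun" route, the same "GGG" index check can be wrapped in an anonymous function and applied to every entry of a cell array. The sequences and the names seqs, expectedIdx, and hasValidGGG are hypothetical, and the sketch again assumes all three N in a sequence have the same length.

```matlab
% Hypothetical data: only the first entry has 'GGG' at the expected indices.
seqs = {'ACTGACGGGACTGACGGGACTGAC', ...
        'ACTGACGGGAGGGACGGGACTGAC', ...
        'ACTGAGGGGACTGACGGGACTGAC'};

% Expected 'GGG' start indices for a sequence s of the form N GGG N GGG N,
% assuming every N has length (length(s) - 6)/3.
expectedIdx = @(s) [(length(s) - 6)/3 + 1, 2*(length(s) - 6)/3 + 4];

% A sequence passes when strfind finds 'GGG' at exactly those indices.
hasValidGGG = @(s) isequal(strfind(s, 'GGG'), expectedIdx(s));

% cellfun applies the check to each cell and returns one logical per sequence.
isValid = cellfun(hasValidGGG, seqs);

% Keep only the sequences that passed.
validSeqs = seqs(isValid);
```

The second entry fails because its middle N contains "GGG", and the third fails because a G at the end of the first N extends the run to "GGGG".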
Have fun!