Finding the repeated substrings

I have a DNA sequence, AAGTCAAGTCAATCG, which I split into substrings such as AAGT, AGTC, GTCA, TCAA, CAAG, AAGT and so on. I then need to find the repeated substrings and their frequency counts. For example, AAGT is repeated twice here, so I want to get AAGT - 2. How can I do this?

2 Comments

See Andrei Bobrov's answer for an efficient solution.
Thank you Stephen!


Accepted Answer

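str = {'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'} ;
% For each unique string, find the positions where it occurs in str
idx = cellfun(@(x) find(strcmp(str, x)), unique(str), 'UniformOutput', false) ;
L = cellfun(@length,idx) ;   % number of occurrences of each unique string
Ridx = find(L>1) ;           % unique strings that occur more than once
for i = 1:length(Ridx)
    st = str(idx{Ridx(i)}) ; % index with Ridx(i), not the whole Ridx vector
    fprintf('%s string repeated %d times\n',st{1},length(idx{Ridx(i)}))
end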
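The answer above starts from a hand-typed list of substrings. As a minimal sketch, assuming a window length of 4 (inferred from the 4-letter examples in the question), the list can be generated from the sequence itself and counted with unique and accumarray. The names seq, k, nWin and subs are illustrative and not from the original answer.

```matlab
seq = 'AAGTCAAGTCAATCG';   % sequence from the question
k   = 4;                   % window length (assumption: 4-letter substrings)

% One k-letter substring per start position, stored as a cell array
nWin = numel(seq) - k + 1;
subs = arrayfun(@(s) seq(s:s+k-1), 1:nWin, 'UniformOutput', false);

% Unique substrings (sorted) and the group index of every substring
[u, ~, ic] = unique(subs);
counts = accumarray(ic(:), 1);   % occurrences of each unique substring

% Report only the substrings that occur more than once
for j = find(counts > 1).'
    fprintf('%s - %d\n', u{j}, counts(j));
end
```

For the sequence in the question this prints AAGT - 2, AGTC - 2, GTCA - 2 and TCAA - 2, one per line.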

More Answers (2)

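A = 'AAGTCAAGTCAATCG';
% Each row of B is one 4-letter window of A (Hankel matrix of characters)
B = hankel(A(1:end-3),A(end-3:end));
% Unique rows in order of first appearance; c maps each row of B to its group
[a,~,c] = unique(B,'rows','stable');
out = table(a,accumarray(c,1),'VariableNames',{'DNA','counts'});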
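The hankel call may look opaque at first. As a rough sketch of the same idea, the window matrix can also be built from an explicit index matrix, which makes the sliding-window structure visible. The names k, n and idx are introduced here for illustration only, and bsxfun is used so the snippet does not depend on implicit expansion.

```matlab
A = 'AAGTCAAGTCAATCG';
k = 4;                        % window length implied by the end-3 offsets above
n = numel(A) - k + 1;         % number of windows

% Row i of idx holds the positions i, i+1, ..., i+k-1
idx = bsxfun(@plus, (1:n).', 0:k-1);
B   = A(idx);                 % n-by-k char matrix, one substring per row

% Group identical rows; 'stable' keeps the order of first appearance
[a, ~, c] = unique(B, 'rows', 'stable');
counts = accumarray(c, 1);    % occurrences of each unique row
```

Here a and counts should hold the same substrings and counts as the table built by the answer above.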

5 Comments

If it's alright, I had a question about the use of unique. Why not use tabulate? Just curious.
Thanks!
Maybe he didn't know about it - I didn't.
outT = tabulate(B)
out =
  8×2 table
    DNA     counts
    ____    ______
    AAGT      2
    AGTC      2
    GTCA      2
    TCAA      2
    CAAG      1
    CAAT      1
    AATC      1
    ATCG      1
outT =
  8×3 cell array
    {'AAGT'}    {[2]}    {[16.6666666666667]}
    {'AGTC'}    {[2]}    {[16.6666666666667]}
    {'GTCA'}    {[2]}    {[16.6666666666667]}
    {'TCAA'}    {[2]}    {[16.6666666666667]}
    {'CAAG'}    {[1]}    {[8.33333333333333]}
    {'CAAT'}    {[1]}    {[8.33333333333333]}
    {'AATC'}    {[1]}    {[8.33333333333333]}
    {'ATCG'}    {[1]}    {[8.33333333333333]}
Yeah, that's fair. I was just curious, since I was looking at both and wondering why I might want to use one over the other. It seems to come down mainly to whether I want a table or a cell array.
Thanks!
tabulate requires the Statistics and Machine Learning Toolbox, which not everyone has.
Hi.
I have a question. Sometimes I get ladder-like results (nested sequences) like this:
AAAAAAAAA, which will be counted (with frame size 3) as 6 AAAA sequences, which is not correct in some cases (the same applies to ATATATA-type sequences). Is there a solution or algorithm to filter nested repeats?
Thanks a lot.
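One possible way to handle such nested repeats, offered only as a sketch and not as an answer from the thread, is to count non-overlapping occurrences: scan the hits from left to right and accept a hit only if it starts after the previously accepted hit has ended. The window length of 4 and the names kmer, starts, nextFree and nonOverlap are assumptions made for this example.

```matlab
seq = 'AAAAAAAAA';             % example from the comment above
k   = 4;                       % window length (assumption)

kmer   = seq(1:k);             % motif to count, here 'AAAA'
starts = strfind(seq, kmer);   % all start positions, overlapping hits included

% Greedy left-to-right pass: keep a hit only if it does not overlap
% the previously kept hit
nonOverlap = 0;
nextFree   = 1;                % first position not covered by a kept hit
for s = starts
    if s >= nextFree
        nonOverlap = nonOverlap + 1;
        nextFree   = s + k;
    end
end
fprintf('%s: %d overlapping, %d non-overlapping\n', kmer, numel(starts), nonOverlap);
```

For AAAAAAAAA this reports 6 overlapping and 2 non-overlapping occurrences of AAAA. Whether the greedy rule is the right definition of a "nested" repeat depends on the application, so treat it as a starting point.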


For the original question you could convert the char data into a categorical array and call histcounts.
>> C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'})
C =
1×6 categorical array
AAGT AGTC GTCA TCAA CAAG AAGT
>> [counts, uniquevalues] = histcounts(C)
counts =
2 1 1 1 1
uniquevalues =
1×5 cell array
{'AAGT'} {'AGTC'} {'CAAG'} {'GTCA'} {'TCAA'}
