counting pattern frequency in a string
10 次查看(过去 30 天)
显示 更早的评论
Looking for ideas on the fastest way to count the number of occurrences of letter patterns in a string. For example, for the test string 'ZXCVBNMZXCVBAS' with a pattern length of 4, it would generate the following table: ZXCV=2, XCVB=2, CVBN=1, etc... The brute force way to do it is to start with an empty list of 4-letter strings, and pull 4-letters blocks from the test string, moving over one letter at a time. For each block, check if already on the list, if so increment the count; if not, add to the end of the list.
The problem with this is that if you have very long test strings (say 10^8 and higher) and move to higher pattern 'word' lengths, say 8 or 10 letters, the number of unique patterns is huge and so the thing slows down as each 8 letter block is compared against a huge list.
Another idea would be to assign each possible pattern with an index: AAAA=1, AAAB=2, so that you can just calculate the index using base 26 conversion for any given string, and increment the value at that index of a master vector. (eg AABA=27, etc., so increment vector(27) by one). But again, with longer pattern lengths, there is not enough memory to store a vector with that many indices, covering all the words (26^8 or 26^10, etc.).
So, is there some way to do this efficiently when the strings and pattern lengths get big? Something with sparsity, or smart sorting, or indexing?
Thank you
2 个评论
采纳的回答
Stephen23
2017-12-22
编辑:Stephen23
2017-12-22
Hummm, interesting problem. I would approach this slightly differently, taking advantage of two features to try and achieve some reasonably efficient solution:
- The requested substring length num is small and fixed.
- MATLAB'S inbuilt functions are highly optimized and quite efficient.
Given those points, I would simply loop over the string num times: 1st iteration start from position 1, 2nd iteration starting from position 2, etc. Within each iteration you can reshape the largest possible subset of the main string into a matrix and easily find the unique rows. This intermediate step slightly reduces memory requirements and builds up a "semi-unique" set of all substrings of the requested length. Also collect their indices and use histc to count the substring occurrences for that iteration. Then after the loop use efficient unique again to get the final list of unique substrings, together with accumarray to add the histc output counts together.
% Fake Data:
%str = char(randi([65,90],1,1e7));
str = 'ZXCVBNMZXCVBAS';
num = 4;
tot = numel(str);
% Loop over <num> starting indices:
tmp = cell(1,num);
idh = cell(1,num);
for k = 1:num
vec = k-1:num:tot;
[tmp{k},~,idt] = unique(reshape(str(k:vec(end)),num,[]).','rows');
idh{k} = histc(idt,unique(idt)); % save the count
end
% Count unique rows:
[out,~,ido] = unique(vertcat(tmp{:}),'rows');
cnt = accumarray(ido,vertcat(idh{:}));
For that small sample string it returns:
>> out
out =
BNMZ
CVBA
CVBN
MZXC
NMZX
VBAS
VBNM
XCVB
ZXCV
>> cnt
cnt =
1
1
1
1
1
1
1
2
2
which seems to be correct. For the longer string (1e7 elements) it took my laptop
Elapsed time is 9.031 seconds.
to run, and I would imagine it scales roughly linearly with num. How does that compare to the methods you are using now?
更多回答(0 个)
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Characters and Strings 的更多信息
产品
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!