How many times does each number appear?
4 次查看(过去 30 天)
显示 更早的评论
Let a
a =
1 4
3 5
4 7
6 8
be integer intervals
[1 2 3 4]
[3 4 5]
[4 5 6 7]
[6 7 8]
I'm looking for an output b to count the number of times each element appears. It is clear that 1 appears one time, 2 appears one time, 3 appears twice, and so on...
So the output b would list the element and how many times it appears:
b =
1 1
2 1
3 2
4 3
5 2
6 2
7 2
8 1
The challenge is, a is an extremely large nx2 array, so using find even with parallel looping is inefficient. Is there a fast way to yield b given a? If it helps, we can be sure that the elements down the left-hand column of a are sorted, but not necessarily so for the right-hand column. Thanks in advance!
2 个评论
Jan
2013-4-2
Some contributors spend a lot of time here to solve the problem. What about voting for this question to get the interst of others?
采纳的回答
Teja Muppirala
2013-4-2
Assuming that the second column of a is always >= the first column of a, then this is an efficient solution.
mA = max(a(:));
b = (1:mA)';
b = [b cumsum(accumarray(a(:,1),1,[mA 1]) + ...
accumarray(min(a(:,2)+1,mA),-1,[mA 1]))];
b(end) = b(end) +1;
1 个评论
更多回答(6 个)
Brian B
2013-4-2
编辑:Brian B
2013-4-2
Like Cedric, I keep thinking of iterating over interval length. Another way to look at the vector b is as the sum of the outputs of a bunch of FIR filters with rectangular impulse responses to appropriate inputs. If you cut the filter length in half at each iteration, you can avoid iterating over all n interval lengths, and instead loop about log2(n) times.
Here I generate intervals of up to 1000, so the loop runs about 10 times.
N=1e8;
M=1e7;
a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
idx = 1:max(a(:));
d = diff(a, 1, 2) + 1 ; % interval length
s = a(:,1); % interval start point
L = ceil(max(d)/2);
b = 0*idx;
tic
while L>0
iL = (d>=L);
u = hist(s(iL),idx);
%b = b + conv(u, ones(1,L),'same');
b = b + filter(ones(1,L),1,u);
s(iL) = s(iL)+L;
d(iL) = d(iL)-L;
L = ceil(max(d)/2);
toc
end
This ran in 243 seconds on my machine; Cedric's solution exhausted my memory after 577 seconds....
EDIT: Replaced the call to conv() with a call to filter(), since conv() was returning the central part of the convolution, not the initial part, as we need.
Jan
2013-4-2
编辑:Jan
2013-4-2
Are the values in the first and second column of a distinct? If so:
c = zeros(1, max(a(:, 2)) + 1);
c(a(:, 1)) = 1;
a2 = a(:, 2) + 1; % [EDITED: +1 inserted]
c(a2) = c(a2) - 1;
c = cumsum(c);
c(end) = [];
If the values are not unique, instead of 1 and -1 histc determines the values:
q = 1:max(a(:, 2));
c = histc(a(:, 1), q);
c = c - histc(a(:, 2) + 1, q); % [EDITED: +1 inserted]
c = cumsum(c);
How efficient is simple FOR loop?
n = max(a(:, 2));
c = zeros(1, n + 1);
for ia = 1:length(a)
a1 = a(ia, 1);
c(a1) = c(a1) + 1;
a2 = a(ia, 2) + 1; % [EDITED: +1 inserted]
c(a2) = c(a2) - 1;
end
c(end) = [];
c = cumsum(c);
Converting this to C would be interesting also. Most of all the limitation of CUMSUM to the type double would not matter, such that UINT16 might reduce the data size, if sufficient.
2 个评论
Jan
2013-4-2
编辑:Jan
2013-4-2
Interesting timings:
N = 1e7; M = 1e6;
a = [randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
FOR-loop: 1.90 seconds
HISTC: 9.64 seconds (see also Brian B's HISTC solution)
CUMSUM: 1.71 seconds (columns of [a] must be unique!)
CONV: 41.00 seconds (Brian B's loop with HIS and CONV)
ACCUMARRAY: 1.83 seconds (Teja's CUMSUM/ACCUMARRAY)
The relation my change for the real data. But for the test data, the simple FOR loop can compete with the vectorized methods (R2009a/64, Win7).
Image Analyst
2013-4-1
编辑:Image Analyst
2013-4-1
Are they known to be integers? I'd use histc(). You could do this in about 4 lines of code -- 1 (or maybe a 3 line for loop) to construct a 1D list of all numbers you gave in the seconds array you showed, 1 to construct the hist bin edges, one to call histc() and one to construct "b" from the result of histc(). Give it a try and come back if you have trouble.
Cedric
2013-4-1
编辑:Cedric
2013-4-1
Hi Hamad,
Well, how large is "extremely large"? Image Analyst might have given you the optimal solution already. As an alternative, you could go for a solution involving ACCUMARRAY or SPARSE, where you would use these integers as indices and a vector of ones as values. This way you get counts per index/integer. For example:
>> buffer = arrayfun(@(r) a(r,1):a(r,2), 1:size(a,1), 'UniformOutput', false) ;
>> buffer = [buffer{:}].' ;
>> counts = accumarray(buffer, ones(numel(buffer),1))
counts =
1
1
2
3
2
2
2
1
There is still room for improvement though (in particular, I guess that a PARFOR loop would be more efficient than ARRAYFUN for building the vector of all integers, but I kept the latter because my point was to illustrate ACCUMARRAY). If not efficient enough after you improve it, it is probably something that you could optimize writing a few lines of C/MEX, in particular for building the vector of all integers.
7 个评论
Brian B
2013-4-2
编辑:Brian B
2013-4-2
Jan Simon's solution is interesting because cumsum can be though of as an IIR filter, and so this idea can further improve the FIR approach above, without looping:
N=1e8;
M=1e7;
a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
idx = 1:max(a(:));
b = hist(a(:,1),idx);
b = b - hist(a(:,2)+1,idx);
b = cumsum(b);
This works even if the values are not distinct.
2 个评论
Jan
2013-4-2
I've posted a very similar solution, but without the "+1" in the 2nd histc() call - I've fixed my bug now.
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Logical 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!