How many times does each number appear?

Question

Hamad 2013-4-1

5
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/69413-how-many-times-does-each-number-appear

Let a

be integer intervals

I'm looking for an output b to count the number of times each element appears. It is clear that 1 appears one time, 2 appears one time, 3 appears twice, and so on...

So the output b would list the element and how many times it appears:

The challenge is, a is an extremely large nx2 array, so using find even with parallel looping is inefficient. Is there a fast way to yield b given a? If it helps, we can be sure that the elements down the left-hand column of a are sorted, but not necessarily so for the right-hand column. Thanks in advance!

2 个评论
显示无隐藏无

Jan 2013-4-2

Some contributors spend a lot of time here to solve the problem. What about voting for this question to get the interst of others?

Hamad 2013-4-2

编辑：Hamad 2013-4-2

First thing to say is that I'm a little overwhelmed by the breadth of effort I have received with this coding challenge. For that reason, I wish to thank each contributor individually.

Firstly, Image Analyst, thank you for a very speedy initial response. And referring to "why do you need this?"... there are ~1e8 milliseconds in a 24-hour period. Highly-traded assets such as index futures contracts yield price updates every few milliseconds during the day and every few minutes at night, averaging around 10,000 observations per day. The product of observations for a pair of these is therefore also ~1e8.

Brian B, you deserve thanks for offering creative contributions based on filters, and for following up with others' contributions.

Jan Simon, you deserve thanks for offering a wide range of efficient solutions along with performance metrics.

Last but by no means least, Cedric Wannaz deserves thanks for offering very detailed intuitive explanations to his contributions. You make impenetrable code seem easy.

THE RESULTS

I have tested all solutions to the same standard, measured by efficiency (minor coding modifications excepted). The most efficient algorithm which solves the problem on my system setup and to my particular needs is provided by Teja Muppirala, and this is followed closely by the FOR-loop provided by Jan Simon. Thank you!!

For this reason, I gratefully accept Teja Muppirala's contribution. At the same time, I offer +1 votes to every contribution here. You have helped me greatly. My sincerest thanks to you all.

Best wishes,

Hamad

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Teja Muppirala 2013-4-2

5
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/69413-how-many-times-does-each-number-appear#answer_80692

在 MATLAB Online 中打开

Assuming that the second column of a is always >= the first column of a, then this is an efficient solution.

 mA = max(a(:));
 b = (1:mA)';
 b = [b cumsum(accumarray(a(:,1),1,[mA 1]) + ...
               accumarray(min(a(:,2)+1,mA),-1,[mA 1]))];
 b(end) = b(end) +1;

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

Cedric 2013-4-2

编辑：Cedric 2013-4-3

+10! Just needs a small correction on the last element.

请先登录，再进行评论。

Answer 2

Brian B 2013-4-2

3
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/69413-how-many-times-does-each-number-appear#answer_80698

编辑：Brian B 2013-4-2

在 MATLAB Online 中打开

Like Cedric, I keep thinking of iterating over interval length. Another way to look at the vector b is as the sum of the outputs of a bunch of FIR filters with rectangular impulse responses to appropriate inputs. If you cut the filter length in half at each iteration, you can avoid iterating over all n interval lengths, and instead loop about log2(n) times.

Here I generate intervals of up to 1000, so the loop runs about 10 times.

N=1e8;
M=1e7;
a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
idx = 1:max(a(:));
d = diff(a, 1, 2) + 1 ; % interval length
s = a(:,1); % interval start point
L = ceil(max(d)/2);
b = 0*idx;
tic
while L>0
    iL = (d>=L);
    u = hist(s(iL),idx);
    %b = b + conv(u, ones(1,L),'same');
    b = b + filter(ones(1,L),1,u);
    s(iL) = s(iL)+L;
    d(iL) = d(iL)-L;    
    L = ceil(max(d)/2);
    toc
end

This ran in 243 seconds on my machine; Cedric's solution exhausted my memory after 577 seconds....

EDIT: Replaced the call to conv() with a call to filter(), since conv() was returning the central part of the convolution, not the initial part, as we need.

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

Cedric 2013-4-2

+1 !

请先登录，再进行评论。

Answer 3

Jan 2013-4-2

3
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/69413-how-many-times-does-each-number-appear#answer_80753

编辑：Jan 2013-4-2

在 MATLAB Online 中打开

Are the values in the first and second column of a distinct? If so:

c = zeros(1, max(a(:, 2)) + 1);
c(a(:, 1)) = 1;
a2     = a(:, 2) + 1;  % [EDITED: +1 inserted]
c(a2)  = c(a2) - 1; 
c      = cumsum(c);
c(end) = [];

If the values are not unique, instead of 1 and -1 histc determines the values:

q = 1:max(a(:, 2));
c = histc(a(:, 1), q);
c = c - histc(a(:, 2) + 1, q);  % [EDITED: +1 inserted]
c = cumsum(c);

How efficient is simple FOR loop?

n = max(a(:, 2));
c = zeros(1, n + 1);
for ia = 1:length(a)
  a1    = a(ia, 1);
  c(a1) = c(a1) + 1;
  a2    = a(ia, 2) + 1;  % [EDITED: +1 inserted]
  c(a2) = c(a2) - 1;
end
c(end) = [];
c      = cumsum(c);

Converting this to C would be interesting also. Most of all the limitation of CUMSUM to the type double would not matter, such that UINT16 might reduce the data size, if sufficient.

2 个评论
显示无隐藏无

Jan 2013-4-2

编辑：Jan 2013-4-2

在 MATLAB Online 中打开

Interesting timings:

N = 1e7; M = 1e6;
a = [randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
FOR-loop:   1.90 seconds
HISTC:      9.64 seconds (see also Brian B's HISTC solution)
CUMSUM:     1.71 seconds (columns of [a] must be unique!)
CONV:      41.00 seconds (Brian B's loop with HIS and CONV)
ACCUMARRAY: 1.83 seconds (Teja's CUMSUM/ACCUMARRAY)

The relation my change for the real data. But for the test data, the simple FOR loop can compete with the vectorized methods (R2009a/64, Win7).

Hamad 2013-4-2

My rankings concur in general with yours. Teja's method is more efficient, but a for-loop is more flexible. This was a very tough decision. Please see my final comment to the question above. Apologies for the delay in getting back to you and everyone here. My sincerest thanks for your help. Best wishes, Hamad.

请先登录，再进行评论。

Answer 4

Image Analyst 2013-4-1

2
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/69413-how-many-times-does-each-number-appear#answer_80651

编辑：Image Analyst 2013-4-1

Are they known to be integers? I'd use histc(). You could do this in about 4 lines of code -- 1 (or maybe a 3 line for loop) to construct a 1D list of all numbers you gave in the seconds array you showed, 1 to construct the hist bin edges, one to call histc() and one to construct "b" from the result of histc(). Give it a try and come back if you have trouble.

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

Hamad 2013-4-2

编辑：Hamad 2013-4-2

Image Analyst, many thanks for your answer. Please see my comment to Cedric Wannaz's answer below, which also pertains to your answer. Many thanks! Hamad.

请先登录，再进行评论。

Answer 5

Cedric 2013-4-1

2
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/69413-how-many-times-does-each-number-appear#answer_80660

编辑：Cedric 2013-4-1

在 MATLAB Online 中打开

Hi Hamad,

Well, how large is "extremely large"? Image Analyst might have given you the optimal solution already. As an alternative, you could go for a solution involving ACCUMARRAY or SPARSE, where you would use these integers as indices and a vector of ones as values. This way you get counts per index/integer. For example:

 >> buffer = arrayfun(@(r) a(r,1):a(r,2), 1:size(a,1), 'UniformOutput', false) ;
 >> buffer = [buffer{:}].' ;
 >> counts = accumarray(buffer, ones(numel(buffer),1))
 counts =
     1
     1
     2
     3
     2
     2
     2
     1

There is still room for improvement though (in particular, I guess that a PARFOR loop would be more efficient than ARRAYFUN for building the vector of all integers, but I kept the latter because my point was to illustrate ACCUMARRAY). If not efficient enough after you improve it, it is probably something that you could optimize writing a few lines of C/MEX, in particular for building the vector of all integers.

7 个评论
显示 5更早的评论隐藏 5更早的评论

Hamad 2013-4-2

在 MATLAB Online 中打开

Both great and concise answers, many thanks for these. However, both solutions rely on row-wise expansion of a.

Image Analyst, thank you for answering helpfully and promptly.

Cedric, as usual thank you for teaching me a couple of new MATLAB functions. To answer your question, "extremely large" is a 1-2 GB array a of intervals of integers as above, but with max(max(a)) ~ 1e8 and size(a,1) ~ 1e8. The first line of your code, modified with parfor as you suggest takes ~3 mins on my 6-core, and yields an elegant cell array of expanded intervals. Unfortunately, the result occupies 26 GB of memory, making it sluggish and impractical to manipulate.

My alternative is to use a for-loop as follows:

 a = [1 4;3 5;4 7;6 8];
 b = zeros(max(max(a)),1);
 for i = 1 : length(a),
      b( a(i,1) : a(i,2) ) = b( a(i,1) : a(i,2) ) + 1;
 end
 b =
     1
     1
     2
     3
     2
     2
     2
     1

This solution does not occupy vast chunks of memory. But there are two problems with it: 1) It uses a for-loop so it almost surely is not the most efficient way of doing it. 2) it is sequential so I cannot see an obvious way of making it more efficient. Can you guys??

Cedric 2013-4-2

编辑：Cedric 2013-4-3

在 MATLAB Online 中打开

Well, I had 5 minutes and I just tested this, which is not too troublesome with ~3e7 ranges on my poor laptop..

 d = diff(a, 1, 2) + 1 ;
 [d_sort, id_sort] = sort(d) ;
 d_range = min(d) : max(d) ;
 id_buffer = cell(numel(d_range), 1) ;
 for k = 1 : numel(d_range)
    d_select     = d_range(k) ;
    n_rows       = sum(d_sort == d_select) ;
    buffer       = ones(n_rows, d_select) ;
    buffer(:,1)  = a(id_sort(d_sort == d_select)) ;
    buffer       = cumsum(buffer, 2) ;
    id_buffer{k} = reshape(buffer, 1, []) ;
 end
 buffer = [id_buffer{:}].' ;
 counts = accumarray(buffer, ones(numel(buffer),1)) ;

I still used a cell array so you can easily PARFOR it, but I changed the approach thinking that we could operate by blocks of ranges with same size. It allows building blocks of ranges in one shot (with e.g. CUMSUM here), and it should work well if you have significantly fewer range sizes than ranges..

No time to explain further as I really had 10 minutes and I must go now, but let me know if you test it and have questions.

Cheers,

Cedric

Cedric 2013-4-3

编辑：Cedric 2013-4-3

Well, not sure about that, but you're welcome! The thread started with a homework tag and ended up being one of the most animated that I've seen a while .. I'd bet that your next question won't be mistaken for a HW ;-)

Cheers,

Cedric

Hamad 2013-4-3

With careful descriptions including words like {let, suppose, given, solution}... I bet it will be mistaken again! Look out for my next soon! Take care.

请先登录，再进行评论。

Answer 6

Brian B 2013-4-2

2
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/69413-how-many-times-does-each-number-appear#answer_80757

编辑：Brian B 2013-4-2

在 MATLAB Online 中打开

Jan Simon's solution is interesting because cumsum can be though of as an IIR filter, and so this idea can further improve the FIR approach above, without looping:

N=1e8;
M=1e7;
a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
idx = 1:max(a(:));
b = hist(a(:,1),idx);
b = b - hist(a(:,2)+1,idx);
b = cumsum(b);

This works even if the values are not distinct.

2 个评论
显示无隐藏无

Jan 2013-4-2

I've posted a very similar solution, but without the "+1" in the 2nd histc() call - I've fixed my bug now.

Brian B 2013-4-2

编辑：Brian B 2013-4-2

Yes, I posted this before you edited yours to handle the non-distinct case! I gave you a vote, since you obviously had the key idea.

请先登录，再进行评论。

Answer 7

Brian B 2013-4-2

1
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/69413-how-many-times-does-each-number-appear#answer_80687

编辑：Brian B 2013-4-2

在 MATLAB Online 中打开

You can avoid saving all of the intermediate intervals (e.g., [1 2 3 4]) if you just want the count:

 a = [1 4;3 5;4 7;6 8];
 b = zeros(max(max(a)),1);
 parfor k=1:length(b)
    b(k) = sum( a(:,1)<=k & k<=a(:,2) );
 end

3 个评论
显示 1更早的评论隐藏 1更早的评论

Hamad 2013-4-2

Elegant solution but the data overhead costs more than the parallelism saves, due to array of logicals of size a being calculated over each k. Thanks very much for your answer!

Hamad 2013-4-2

编辑：Hamad 2013-4-2

lol, why does every one of my posts get tagged as homework??

Many thanks for your help, Brian. Best wishes, Hamad.

请先登录，再进行评论。

How many times does each number appear?

2 个评论
显示无隐藏无

采纳的回答

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

更多回答（6 个）

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

2 个评论
显示无隐藏无

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

7 个评论
显示 5更早的评论隐藏 5更早的评论

2 个评论
显示无隐藏无

3 个评论
显示 1更早的评论隐藏 1更早的评论

另请参阅

类别

标签

Community Treasure Hunt

How many times does each number appear?

2 个评论 显示 无隐藏 无

采纳的回答

1 个评论 显示 -1更早的评论隐藏 -1更早的评论

更多回答（6 个）

1 个评论 显示 -1更早的评论隐藏 -1更早的评论

2 个评论 显示 无隐藏 无

1 个评论 显示 -1更早的评论隐藏 -1更早的评论

7 个评论 显示 5更早的评论隐藏 5更早的评论

2 个评论 显示 无隐藏 无

3 个评论 显示 1更早的评论隐藏 1更早的评论

另请参阅

类别

标签

Community Treasure Hunt

2 个评论
显示无隐藏无

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

2 个评论
显示无隐藏无

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

7 个评论
显示 5更早的评论隐藏 5更早的评论

2 个评论
显示无隐藏无

3 个评论
显示 1更早的评论隐藏 1更早的评论