randsample issues and generating random numbers from large populations

18 次查看(过去 30 天)
The randsample function is pretty handy -- I supply a population in a vector and a number of samples I'd like drawn uniformly without replacement. Usually my population is a simple, increasing set of numbers, like 1 to 100000.
But I'm working now on a much larger problem where I'd like to generate, say, 500,000 samples from the population 1:2^60, or something huge. The problem is that the first thing randsample does is create a vector x = zeros(1,2^60), which of course breaks immediately as the maximum variable size is violated.
My question is this -- if my needs are simple enough, is it possible to generate the samples without storing the entire population in a vector? Could I generate them one at a time? Or could I somehow get the vector of samples as before but without storing the population?
A cursory idea was to generate smaller random numbers and concatenate them, but then I'm not sure what I'm losing on the theoretical side...
Any advice, statistical or MATLAB would be appreciated.

采纳的回答

Walter Roberson
Walter Roberson 2012-1-29
Warning #1: the below has potentially unlimited running time, with the expected running time increasing exponentially as the sample size approaches the population size.
SS = 500000; %number of samples wanted
Population = 2^60; %see warning #2!!
chunk = ceil(Population * rand(1,SS)); %includes duplicates
[uchunk, ua, ub] = unique(chunk, 'first');
chunk = uchunk(ua); %random order and duplicates after first removed
lc = length(chunk);
while lc < SS
newchunk = ceil(Population * rand(1,SS-lc));
[uchunk, ua, ub] = unique(newchunk, 'first');
newchunk = uchunk(ua);
newchunk( ismember(newchunk, chunk) ) = [];
chunk = [chunk, newchunk];
lc = length(chunk);
end
samples = chunk;
Warning #2: the above code will not work when the population size exceeds 2^53, which is about 10^15. rand() generates from 1 * 2^(-53) to (2^53 - 1) * 2^(-53) -- only (2^53 - 2) different choices.
Generating over a larger population is not just a matter of using a version of rand() with higher resolution: at the upper end of the range, near 1, rand() is already as high a resolution as is supported by double precision floating point numbers. It thus becomes necessary to replace the entire
ceil(Population * rand())
technique. Doing that is practical up to a population size of 2^64 (the range of uint64), which gets you to about a population size of about 1.8E19. If there is not already something in the MATLAB File Exchange to generate random integers, you can add an interface onto MT19937-64.c
If you were to need to go beyond 2^64 for your population size, you would start having to use multiple integers to represent the sample numbers, which would complicate things somewhat.

更多回答(2 个)

Derek O'Connor
Derek O'Connor 2012-2-1
Here are two functions for taking a sample S of size Ns from a large, simply-defined population P of size Np. Both use O(Ns) space.
The first function samples with replacement and is very simple, with time complexity O(Ns).
The second function samples without replacement and this requirement makes things a lot more difficult. The only way I know of doing this is to use the Ulam-von Neumann acceptance-rejection principle: keep generating randomly until you get what you want. This is done by the while-loop that that tests the current random element r from P for membership in S, and rejects if r is in S. This rejection loop means that this method has a random running time, which, in principle, may be infinite.
In the rejection while-loop, we need to test for membership of S, as each random element r of P is generated. This can be done in two obvious ways:
  1. Use a "bit-vector" member of size Np and set "bit" member(r) when r is first chosen. Thus the membership test is O(1), but this solution is ruled out because Np is too big.
  2. Scan S(1:k-1) for r, when r is chosen from P. This uses no extra memorybut requires O(Ns) time for the membership test.
Thus we have the classic trade-off between time and space.
A Bigger Problem
Walter points out that there are (increasingly) problems with random number generators: we have lots of memory which allows us to generate huge samples or permutations (= a sample of size Np, from a population of size Np, without replacement.) However, current (or future?) random number generators cannot possibly access the complete sample space of all permutations of size 10^8, yet we can generate these large permutations in a few seconds or minutes. So, what part of the gigantic (10^8)! permutation space are these generators sampling from?
Related Topics:
  1. http://www.mathworks.co.uk/matlabcentral/answers/27515-random-sample-without-replacement
  2. http://www.mathworks.co.uk/matlabcentral/answers/879-does-matlab-have-a-birthday-problem
  3. http://www.mathworks.co.uk/matlabcentral/answers/12248-what-does-randperm-miss
% ---------------------------------------------------------------
function [S,pdup] = RandSampWR(L,U,Ns);
% ---------------------------------------------------------------
% Generate a random sample S of size Ns from the range of Np
% integers L:U, with replacement. Time and Space Complexity:O(Ns)
% Prob[Duplicates in S] ~ 1 - exp(-Ns^2/(2*Np))
% Derek O'Connor, 12 Mar 2011. derekroconnor@eircom.net
%
% USE: L=-2^50; U=2^50; Ns=2^25; [S,pdup] = RandSampWR(L,U,Ns);
% ---------------------------------------------------------------
S = zeros(1,Ns);
Np = U-L+1;
for k = 1:Ns
S(k) = L + floor(rand*Np);
end
pdup = 1-exp(-Ns^2/(2*Np));
% -----------------------------------------------------------------
function [S,Enw,pdup] = RandSampNR(L,U,Ns);
% -----------------------------------------------------------------
% Generate a random sample of size Ns from the range of Np integers
% L:U, without replacement, using a rejection loop.
% Time Complexity of member is O(Ns) and this is called nw times,
% which is a random variable that ranges from Ns to Inf.
% Thus Time Complexity is at least O(Ns^2) and is suitable for small
% Ns << U-L+1 only. Space Complexity is O(Ns).
% Note that this is a Las Vegas algorithm whose running time is
% random, because of the rejection while-loop.
% Derek O'Connor, 12 Mar 2011. derekroconnor@eircom.net
%
% USE: [S,Enw,pdup] = RandSampNR(-2^29,2^29,2^10)
% -------------------------------------------------------------
S = zeros(1,Ns);
Np = U-L+1;
S(1) = L + floor(rand*Np);
nw = 0;
for k = 2:Ns
r = L + floor(rand*Np);
while member(r,S,k-1)
r = L + floor(rand*Np); nw = nw+1;
end
S(k) = r;
end
Enw = nw/Ns;
pdup = 1-exp(-Ns^2/(2*Np));
% ------------------------------
function answer = member(e,v,k)
% ------------------------------
answer = false;
for i = 1:k
if v(i) == e
answer = true;
return
end
end

Peter Perkins
Peter Perkins 2012-2-1
Michael, if you are using the most recent version of MATLAB, R2011b, try this:
>> tic, randperm(2^53-1,10), toc
ans =
Columns 1 through 9
1.4197e+15 8.7423e+15 8.6214e+15 4.3719e+15 7.2083e+15 1.278e+15 3.7989e+15 8.2482e+15 7.1356e+15
Column 10
8.6423e+15
Elapsed time is 0.000610 seconds.
>> tic, randperm(2^53-1,5e5); toc
Elapsed time is 0.061926 seconds.
  1 个评论
Walter Roberson
Walter Roberson 2012-2-2
That gets to 2^53, but Michael was hoping for 2^60.
Also, one of Derek's Questions a few months ago showed that with populations this large, you would not get a true permutation: you would have too many duplicate values coming out of the internal rand() [which has only 2^53-2 possibilities], and each duplicate value is sorted by order of appearance whereas a permutation would allow for either order.

请先登录,再进行评论。

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by