Fast method to find average pairwise distance of a very large matrix?

8 次查看(过去 30 天)
So I have a matrix that is 330,000 observations = rows x 160 variables = columns. I'd like to compute the average distance between each observation in my matrix, but pdist() fails here, because apparently it would take 405.6 GB to store the distance matrix...
Is there a reasonable vectorized (or fast) way to do something like this?
Any thoughts appreciated.
Thanks.

采纳的回答

John D'Errico
John D'Errico 2020-1-10
You cannot compute the 100 billion pairwise distances, because the matrix won't fit in memeory. (Ok, symmetry reduces it to be only 50 billion or so distances.)
But do you seriously, really, desperately need to compute ALL such pairwise distances? You want to know the mean distance. Do you desperately need to know the EXACT value of that mean distance?
The point is, just use a Monte Carlo sampling to get a very good estimate of the mean pairwise distance.
For example, I'll do this using a smaller matrix, just so we can get some result out in a reasonable time.
X = randn(10000,160);
mean(pdist(X))
ans =
17.868
Now, compute a simple random sample. Then compute just a very small subset of the total distances you would need for an exact solution.
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.864
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.868
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.856
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.872
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.874
So 5 simple random permutations gave 5 different estimates. All were correct to within about 0.1% error. For a matrix with 330000 rows, the error would be even smaller. And if even that much error bothers you, then just take the average or median of several such samples.
I would point out that even for an array of size 330000x160, the computation I describe takes about a half second on my computer, almost as much time as it took to generate the array in the first place.
X = randn(330000,160);
tic,mean(vecnorm(X - X(randperm(size(X,1)),:),2,2));toc
Elapsed time is 0.585973 seconds.

更多回答(1 个)

Matt J
Matt J 2020-1-10
编辑:Matt J 2020-1-10
I don't know about the mean distance, but the mean squared or root mean squared distance is pretty easy. Compare:
A=rand(330000/25,160); %The data
%% Without pdist2
tic;
D=mean(vecnorm(A,2,2).^2);
M=norm(mean(A,1))^2;
avgDist=2*(D-M);
toc %Elapsed time is 0.017454 seconds.
%% With pdist2
tic;
P=pdist2(A,A).^2;
avgPdist2=mean(P(:));
toc %Elapsed time is 1.716697 seconds.
Check:
>> avgDist,
avgDist =
26.6471
>> avgPdist2,
avgPdist2 =
26.6471

类别

Help CenterFile Exchange 中查找有关 Array and Matrix Mathematics 的更多信息

产品


版本

R2018a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by