Faster than pdist for cityblock on integers?

2 次查看(过去 30 天)
Hi folks, I have very large matrices in the order of 100k+ rows and even more columns containing only 3 possible integer values 0, 1, 2, most frequent of which is 0. I have to calculate pairwise distance matrix using cityblock metric. Is there any faster, more efficient or less memory occupant way of calculating the pairwise distance matrix for such case that perhaps exploits the fact there are only 3 integer values involved and cityblock distance, than simply running d=pdist(a,'cityblock'); ?
The following can be used as benchmark:
a=randi(3,5e3,1e4)-1;
tic;d=pdist(a,'cityblock');toc
Which takes 30s on my average home PC
I am aware of the GPU speedup (15x in this case) by:
tic;d=pdist(gpuArray(a),'cityblock');toc
but this is only available for relatively small matrices I can completely fit in GPU.

采纳的回答

John D'Errico
John D'Errico 2023-1-17
This issue is surely not a problem with the distance computation as anything complicated, as much as it would be generating and allocating the huge arrays necessary, and then doing a whole heck of a lot of computations. I'll make a very small example.
a=randi(3,5,10)-1
a = 5×10
2 1 2 0 2 1 1 2 2 1 2 0 2 1 2 2 2 2 1 2 2 1 1 2 0 1 2 0 1 2 2 2 1 0 2 2 2 2 0 0 1 0 1 1 2 1 2 2 0 2
The citybloc distance here requires me to subtract every row from every other row. Then compute the absolute value of the differences, and then sum them. In this case, the result would be of size 5x5. But we can compute the distance matrix in only one line.
sum(abs(reshape(a,[5 1 10]) - reshape(a,[1 5 10])),3)
ans = 5×5
0 6 10 7 8 6 0 8 7 4 10 8 0 11 8 7 7 11 0 7 8 4 8 7 0
As you can see, the result is the same as what pdist produces, at least when converted to a viewable square array..
squareform(pdist(a,'cityblock'))
ans = 5×5
0 6 10 7 8 6 0 8 7 4 10 8 0 11 8 7 7 11 0 7 8 4 8 7 0
The problem is, for a matrix a of size nxm. the line of code I wrote will generate an intermediate (internal) array of size nxnxm.
There are a LOT of computations to be done. That your elements in this array are only small integers is irrelevant. So for an array of size 5000x10000, that means on the order of
5000^2*10000
ans = 2.5000e+11
subtractions, and then as many absolute values, and as many adds. No matter how you do it, this will take some time.
a=randi(3,1e3,1e3)-1;
tic,d = pdist(a,'cityblock');toc
Elapsed time is 0.290712 seconds.
citydist = @(a) sum(abs(reshape(a,[size(a,1) 1 size(a,2)]) - reshape(a,[1 size(a,1) size(a,2)])),3);
tic,d = citydist(a);toc
Elapsed time is 2.292339 seconds.
Anyway, pdist is very much faster than what I could write, even for a moderately small array.

更多回答(0 个)

类别

Help CenterFile Exchange 中查找有关 Matrices and Arrays 的更多信息

产品


版本

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by