Find the K most orthogonal vectors in a set of vectors
1 次查看(过去 30 天)
显示 更早的评论
Hello All,
The context of this particular search is a step in tuning a spectral clustering routine a la Ng et al 2002. The purpose of this search is to give a initialization point for k-means clustering in higher dimensional space. In particular, well separated data ought to sit in K tight clusters on the surface of a hypersphere, so the purpose of this search is to find the locations of these cluster centroids with which to initialize k-means clustering of the same data.
I have a data matrix "Y" with O(10^4) rows and O(10^2-10^3) columns (the columns of this matrix are the [transformed and normalized a couple times] K largest eigenvectors of the affinity matrix).
Ng et al suggest "Briefly, we let the first cluster centroid be a randomly chosen row of Y, and then repeatedly choose as the next centroid the row of Y that is closest to being 90 degrees from all the centroids already picked." I translated this mathspeak to mean I need to take a bunch of dot products and look for values close to zero. This was quoted as computationally cheap (perhaps it is for clustering fewer points into say O(10^1) clusters or perhaps they meant it in the sense that it requires fewer iterations of k-means clustering once initialized), but my CPU is dragging ass at it.
So far I've tried 2 approaches: Approach #1 - Compute everything then search
% "cheap" initialization of k-means
dotProductY = zeros(length(Y)); %preallocate to make the parser turn green
% compute dot product of every row with every other row first
for k = 1:length(Y)
dotProductY(:,k) = sum(bsxfun(@times,Y(k,:),Y),2);
end
dotProductY(logical(eye(length(Y)))) = nan; %exclude dot product of row with self
centroidIdx = randi(length(Y)); %initialize on a random row of Y
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
dotProductY = abs(dotProductY); %use the absolute value because looking for closer to zero
for k = 2:K
[~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
centroidIdx(k) = im; %reassign
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end
Approach #2 - Simultaneous computation and search
%try a cheaper one?
centroidIdx = randi(length(Y)); %initialize on a random row of Y
for k = 1:K-1
dotProductY(:,k) = sum(bsxfun(@times,Y(centroidIdx(k),:),Y),2); %compute inner product
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
[~,im] = min(sum(abs(dotProductY),2)); %find next best centroid
centroidIdx(k+1) = im; %reassign
end
Neither of these approaches seems cheap to me. Anyone else take a stab at this before? Any suggestions?
0 个评论
采纳的回答
Matt J
2016-5-19
编辑:Matt J
2016-5-19
Your computation of dotProductY could be more efficient. The most vectorized may of computing it, I believe is
dotProductY=abs(Y*Y.');
I expect that would have been the main bottleneck.
2 个评论
Matt J
2016-5-19
编辑:Matt J
2016-5-22
This part
for k = 2:K
[~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
centroidIdx(k) = im; %reassign
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end
also looks like it could be incrementalized as follows
im=randi(size(Y,1));
centroidIdx(1)=im;
temp=dotProductY(:,im);
for k = 2:K
[~,im] = min(temp); %find next best centroid
centroidIdx(k) = im; %reassign
temp=temp+dotProductY(:,im); %update temp
temp(im) = inf; %dont pick the same row twice
end
更多回答(0 个)
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Statistics and Machine Learning Toolbox 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!