Find the K most orthogonal vectors in a set of vectors

Question

Peter Cook 2016-5-19

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/285024-find-the-k-most-orthogonal-vectors-in-a-set-of-vectors

Commented: Peter Cook 2016-6-8

Hello All,

The context of this search is a step in tuning a spectral clustering routine à la Ng et al. (2002). Its purpose is to provide an initialization point for k-means clustering in a higher-dimensional space. In particular, well-separated data ought to sit in K tight clusters on the surface of a hypersphere, so the goal is to find the locations of these cluster centroids and use them to initialize k-means clustering of the same data.

I have a data matrix "Y" with O(10^4) rows and O(10^2-10^3) columns (the columns of this matrix are the [transformed and normalized a couple times] K largest eigenvectors of the affinity matrix).

Ng et al suggest "Briefly, we let the first cluster centroid be a randomly chosen row of Y, and then repeatedly choose as the next centroid the row of Y that is closest to being 90 degrees from all the centroids already picked." I translated this mathspeak to mean I need to take a bunch of dot products and look for values close to zero. This was quoted as computationally cheap (perhaps it is for clustering fewer points into say O(10^1) clusters or perhaps they meant it in the sense that it requires fewer iterations of k-means clustering once initialized), but my CPU is dragging ass at it.

So far I've tried 2 approaches:

Approach #1 - Compute everything then search

% "cheap" initialization of k-means
dotProductY = zeros(length(Y)); %preallocate to make the parser turn green
% compute dot product of every row with every other row first
for k = 1:length(Y)
  dotProductY(:,k) = sum(bsxfun(@times,Y(k,:),Y),2);
end
dotProductY(logical(eye(length(Y)))) = nan; %exclude dot product of row with self
centroidIdx = randi(length(Y)); %initialize on a random row of Y
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
dotProductY = abs(dotProductY); %use the absolute value because looking for closer to zero
for k = 2:K
  [~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
  centroidIdx(k) = im; %reassign
  dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end

Approach #2 - Simultaneous computation and search

    %try a cheaper one?
    centroidIdx = randi(length(Y)); %initialize on a random row of Y
    for k = 1:K-1
        dotProductY(:,k) = sum(bsxfun(@times,Y(centroidIdx(k),:),Y),2); %compute inner product 
        dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
        [~,im] = min(sum(abs(dotProductY),2)); %find next best centroid
        centroidIdx(k+1) = im; %reassign
    end

Neither of these approaches seems cheap to me. Anyone else take a stab at this before? Any suggestions?


Answer 1

Matt J 2016-5-19

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/285024-find-the-k-most-orthogonal-vectors-in-a-set-of-vectors#answer_222786

Edited: Matt J 2016-5-19

Your computation of dotProductY could be more efficient. The most vectorized way of computing it, I believe, is

dotProductY=abs(Y*Y.');

I expect that would have been the main bottleneck.
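As a quick sanity check, the single matrix product can be compared against the loop from the question on a small random matrix. This is only a sketch: the sizes and the names G_loop, G_fast, and maxDiff are illustrative assumptions, not part of the original post.

```matlab
% Small random stand-in for Y (sizes are illustrative assumptions)
rng(0);
Y = randn(200, 5);
Y = bsxfun(@rdivide, Y, sqrt(sum(Y.^2, 2)));   % scale each row to unit length

n = size(Y, 1);

% Loop version from the question: one column of dot products per iteration
G_loop = zeros(n);
for k = 1:n
    G_loop(:, k) = sum(bsxfun(@times, Y(k, :), Y), 2);
end

% Single matrix product giving all pairwise dot products at once
G_fast = Y * Y.';

% Largest elementwise difference; should be at round-off level
maxDiff = max(abs(G_loop(:) - G_fast(:)));
```

Note that either version stores a full n-by-n matrix of doubles. For n = 10^4 rows that is 10^8 entries, or about 800 MB, so memory may matter as much as run time at the size described in the question.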

2 comments

Matt J 2016-5-19

Edited: Matt J 2016-5-22

This part

for k = 2:K
    [~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
    centroidIdx(k) = im; %reassign
    dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end

also looks like it could be incrementalized as follows

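im = randi(size(Y,1)); %initialize on a random row of Y
centroidIdx(1) = im;
temp = dotProductY(:,im); %running sum of |dot products| with chosen centroids
temp(im) = inf; %exclude the starting row as well
for k = 2:K
    [~,im] = min(temp); %find next best centroid
    centroidIdx(k) = im; %reassign
    temp = temp + dotProductY(:,im); %update temp
    temp(im) = inf; %dont pick the same row twice
end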
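If the full N-by-N matrix is too large to hold in memory, the same running sum can be kept while computing only one column of dot products per chosen centroid. The following is a minimal sketch under stated assumptions, not a tested replacement for the code above: the synthetic data, the variable names score, labels, and pickedLabels, and the noise level are all hypothetical and chosen only so the example is self-contained.

```matlab
% Hypothetical test data: K tight clusters near orthogonal directions
rng(1);
K = 4;
n = 300;
labels = randi(K, n, 1);                        % true cluster of each row
C = eye(K);                                     % orthogonal cluster centers
Y = C(labels, :) + 0.02 * randn(n, K);          % small noise around the centers
Y = bsxfun(@rdivide, Y, sqrt(sum(Y.^2, 2)));    % scale each row to unit length

centroidIdx = zeros(1, K);
im = randi(n);                                  % random starting row
centroidIdx(1) = im;

% score(i) = sum over chosen centroids c of |Y(i,:) * Y(c,:)'|
score = abs(Y * Y(im, :).');                    % one n-by-1 column, no n-by-n matrix
score(im) = inf;                                % never pick the starting row again

for k = 2:K
    [~, im] = min(score);                       % row closest to orthogonal to all picks
    centroidIdx(k) = im;
    score = score + abs(Y * Y(im, :).');        % add the new centroid's column
    score(im) = inf;                            % inf + x stays inf, so the row stays excluded
end

pickedLabels = labels(centroidIdx).';           % which cluster each pick came from
```

Each step costs one matrix-vector product, so the total work grows with N times the number of columns times K, and only an N-by-1 vector is stored. The chosen rows Y(centroidIdx,:) could then be passed as the starting centroids for k-means.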

Peter Cook 2016-6-8

Thanks for the help, I can't believe I had that boneheaded dotProductY computation in there. The algorithm runtime is still quite slow, but that, I am accepting, is to be expected for most clustering algorithms with this amount of data.
