Finding partial overlaps between sets

9 次查看(过去 30 天)
Let us say I have some equal-sized sets of indices. I want to determine how much these sets overlap in the sense that I want to know how many of the sets contain a given index. I don't know what this is called, but I think of it as a kind of "soft intersection". I imagine doing it in one of the following ways, but I would like suggestions as to whether this is reasonable and possible better ways.
  1. The indices in the sets can be in (1:N). I have K sets. I could create K length-N indicator arrays and add these K arrays together. The resulting length-N array would contain occurrence counts of the indices of their positions in the array.
  2. Alternatively, I could create the union of the K sets. Then I could use find() to find occurrences of each of the members of the union set in a concatenation of the K index sets. The sizes of the results of find() would be occurrence counts of the indices in the sets.
  1 个评论
Thomas Arildsen
Thomas Arildsen 2013-1-10
I should add that the indices in each of the K sets are unique in that set, so the sets do not contain any duplicates.

请先登录,再进行评论。

采纳的回答

Thomas Arildsen
Thomas Arildsen 2013-1-10
OK, that didn't take very long to try... I implemented no. 1 as follows:
function [ idxUnion, idxCounts ] = overlap1( idxSets )
%OVERLAP1 quantifies overlap between sets.
% The function counts in how many of the input sets each index occurs.
% The sets are specified as columns of the input matrix.
K = size(idxSets,2);
N = max(idxSets(:));
indicator = zeros(K,N);
for k = 1:K
indicator(k,idxSets(:,k)) = 1;
end
counts = sum(indicator,1);
idxUnion = find(counts ~= 0);
idxCounts = counts(idxUnion);
end
I implemented no. 2 as follows:
function [ idxUnion, idxCounts ] = overlap2( idxSets )
%OVERLAP2 quantifies overlap between sets.
% The function counts in how many of the input sets each index occurs.
% The sets are specified as columns of the input matrix.
K = size(idxSets,2);
N = max(idxSets(:));
idxUnion = unique(idxSets)';
idxCounts = zeros(size(idxUnion));
for ii = 1:length(idxCounts)
idxCounts(ii) = length(find(idxUnion(ii) == idxSets));
end
end
I test them as follows:
function [ result ] = overlap_test()
%OVERLAY_TEST Tests the overlay1 and overlay2 functions for errors.
% This function runs overlay1 and overlay2 with a random test input and
% performs a couple of sanity checks on the output to validate these
% functions.
K = 100;
M = 100;
N = 1e4;
testSets = zeros(M,K);
for k = 1:K
testSets(:,k) = randperm(N,M)';
end
% Test overlap1 union
tic
[idxUnion1,idxCnts1] = overlap1(testSets);
t1 = toc;
disp(sprintf('overlap1 took %f s\n',t1))
result(1,1) = isempty(setdiff(idxUnion1,unique(testSets)));
% Test overlap2 union
tic
[idxUnion2,idxCnts2] = overlap2(testSets);
t2 = toc;
disp(sprintf('overlap2 took %f s\n',t2))
result(2,1) = isempty(setdiff(idxUnion2,unique(testSets)));
% Test agreement on counts
try
result(3,1) = all(idxCnts1 == idxCnts2);
catch anError
warning(anError.identifier,anError.message);
result(3,1) = 0;
end
end
No. 1 seems clearly faster.

更多回答(0 个)

类别

Help CenterFile Exchange 中查找有关 Creating and Concatenating Matrices 的更多信息

标签

产品

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by