Unique elements in matrix efficiently

Question

Given a matrix A, what is an efficient way to obtain a matrix B that contains only the unique elements of A, where uniqueness is judged within each row (so applying 'ismember' to the whole matrix is not suitable)?

For example:

A = [1 2 3 1; 4 1 6 7; 8 9 8 11]
A =
      1    2    3    1
      4    1    6    7
      8    9    8   11

Since each row may contain a different number of unique elements, the matrix of unique elements would have rows of different lengths, which is not possible. I can therefore replace the repeated (non-unique) elements with a filler value: since I am working only with positive numbers, I can use a negative number (e.g. -1); if the data also contained negative numbers, NaN could be used instead.

The result therefore can be:

B =
      1    2    3   -1
      4    1    6    7
      8    9   -1   11

(where the -1s could alternatively be replaced by NaNs).

Notice that although A(2, 2) has a value (1) that already appears elsewhere in the matrix (at A(1, 1)), it is still unique within its own row, so it is kept. This is why 'ismember' cannot simply be applied to the whole matrix.

I have written a solution, but I imagine there are more elegant and more efficient ones that use vectorization and avoid for loops, which matters because my matrix A is very large:

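B = A(:, 1);                                  % the first column is always kept
for i = 2:size(A, 2)
  % true where A(r, i) does not yet appear in row r of B
  NEWMEMBERS = ~sum(bsxfun(@eq, A(:, i), B), 2);
  NEWCOL = NEWMEMBERS .* A(:, i);             % keep new values, zero the repeats
  FILLER = -1 * ~NEWMEMBERS;                  % -1 where the value is a repeat
  NEWCOL = NEWCOL + FILLER;
  B = [B NEWCOL];                             % append the processed column
end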

(FILLER can be more generally replaced by a vector of 0 and NaNs instead of 0s and -1s)
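
A note on the NaN variant: building FILLER as NaN * ~NEWMEMBERS would not work as hoped, because 0 * NaN is NaN and every entry of the column would then become NaN. A minimal sketch of the same column-by-column idea using logical indexing instead (the names isNew and col are illustrative, not from the original code):

```matlab
A = [1 2 3 1; 4 1 6 7; 8 9 8 11];

B = nan(size(A));            % preallocate with the filler value
B(:, 1) = A(:, 1);           % the first column is always new
for i = 2:size(A, 2)
  col = A(:, i);
  % isNew(r) is true when col(r) does not appear in columns 1..i-1 of row r
  isNew = ~any(bsxfun(@eq, col, A(:, 1:i-1)), 2);
  B(isNew, i) = col(isNew);  % repeats keep the NaN filler
end
disp(B)
```

Preallocating B also avoids growing the matrix inside the loop, which tends to matter when A has many columns.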

Comments

John D'Errico 2016-2-26

Edited: John D'Errico 2016-2-26

Note that this question does not actually contain validly executable MATLAB code, having constructs like +=, and ! in the code.

George Aipal 2016-2-26

Edited

Answer 1

Titus Edelhofer 2016-2-26

Hi,

a rather simple version would be this:

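B = -ones(size(A));                    % prefill with the filler value
for row=1:size(A, 1)
  val = unique(A(row,:));              % distinct values in this row
  [~,idx] = ismember(val, A(row,:));   % where each distinct value occurs in the row
  B(row,idx) = val;                    % write them back; repeats stay as filler
end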

I haven't tested how this behaves when A is large, though.

Titus

Answer 2

Stephen23 2016-2-26

Edited: Stephen23 2016-2-26

Without any loops (and can be easily adapted to use a tolerance):

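A = [1 2 3 1; 4 1 6 7; 8 9 8 11]
S = size(A);
[B,C] = sort(A,2);                     % sort each row; C holds the original column indices
D = [false(S(1),1),diff(B,1,2)==0];    % true where a sorted value repeats its left neighbour
R = (1:S(1))'*ones(1,S(2));            % row index of every element
X = sub2ind(S,R(D),C(D));              % linear indices of the repeated elements in A
A(X) = NaN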

displays this in the command window:

A =
     1     2     3   NaN
     4     1     6     7
     8     9   NaN    11
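
The tolerance adaptation mentioned above could look roughly like the following sketch. The value of tol and the perturbed test entry are assumptions chosen for illustration; the only change to the method is replacing the exact comparison with abs(diff(...)) <= tol.

```matlab
A = [1 2 3 1+1e-12; 4 1 6 7; 8 9 8 11];
tol = 1e-9;                          % hypothetical tolerance, choose to suit the data

S = size(A);
[B, C] = sort(A, 2);                 % B: row-sorted values, C: original column indices
% true where a sorted value lies within tol of its left neighbour
D = [false(S(1), 1), abs(diff(B, 1, 2)) <= tol];
R = repmat((1:S(1))', 1, S(2));      % row index of every element
A(sub2ind(S, R(D), C(D))) = NaN;     % mark the near-duplicates
disp(A)
```

Two caveats: each value is compared only with its left neighbour in sorted order, so a chain of values that are pairwise within tol collapses onto the smallest one, and the value that survives is the smallest of each cluster rather than necessarily the first occurrence in the row.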

Comments

George Aipal 2016-2-26

Thanks for your great solution! It does answer my question as stated. I have followed up in a comment below with a question about performance and about how to eliminate redundant columns that contain only NaNs. I will rephrase the problem so that it looks more like the common case, so from now on I'll refer to rows instead of columns, and to X instead of A:

You have a large dataset Xdata (N x M) containing N samples of dimension M (M parameters), one sample per row. N is very large (many samples), and the task is to reduce the memory footprint by creating an Xreduced that does not contain unnecessary rows. In this particular case I want to store only the unique values of each parameter, so each column of Xreduced should contain only unique values or fillers (NaN or -1), and rows containing only fillers should not appear in Xreduced.

I suppose I could use your solution, sort the result, and then use a for loop to find the first column of A (or first row of X) that contains at least one element other than NaN, and copy everything from there onwards into B (or Xreduced). But this requires loops, and it duplicates the original large dataset A (Xdata). Is there a better way to do this? Or, given this specific problem, is the solution in my original question already about as good as it gets?
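
One loop-free way to drop the all-NaN columns, sketched in the orientation of the original question (unique per row, fillers are NaN). It assumes B already holds the NaN-marked result; the names nanMask, order, rows and packed are illustrative. The idea is to push the NaNs of each row to the right with a stable sort on the NaN mask, then delete the columns that are NaN in every row.

```matlab
B = [1 2 3 NaN; 4 NaN 6 NaN; 8 9 NaN 11];   % example NaN-marked matrix

% Sorting the mask (0 = value, 1 = NaN) is stable, so the non-NaN entries
% move to the front of each row while keeping their original order.
[nanMask, order] = sort(double(isnan(B)), 2);
rows = repmat((1:size(B, 1))', 1, size(B, 2));
packed = B(sub2ind(size(B), rows, order));   % gather each row in the new order
packed(:, all(nanMask, 1)) = [];             % drop columns that are NaN in every row
disp(packed)
```

This does not remove the memory concern: packed is a second array of the same size as B until the trailing columns are deleted, so for very large data it may still be necessary to process the rows in blocks.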

Answer 3

George Aipal 2016-2-26

Thank you all for your answers. The comment is correct that A(2, 2) should not be NaN, because it is the only 1 in that row. The solutions from Titus and Stephen both work, and Stephen's does address my request to avoid the for loops I was using in my own solution in order to increase speed. Interestingly, however, I have just found that some of the solutions without a for loop run slower than the ones with it. Or am I doing something wrong? It seems that running solution 1, then solution 2, then solution 3 affects their individual timings compared with running them in the reverse order. I will therefore restart the environment and run one solution at a time, perhaps with bigger matrices, and report my findings.
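
For comparing timings without restarting the environment, a rough tic/toc harness along the following lines may help. It is only a sketch: the test matrix, its size and the repetition count nRep are arbitrary assumptions. It runs a warm-up call first, repeats each method several times and reports the median, which reduces the influence of run order and first-call overhead.

```matlab
A = randi(50, 500, 40);        % hypothetical test data; size and range are arbitrary
nRep = 5;                      % repetitions per method
S = size(A);

tmp = sort(unique(A(1, :)));   % warm-up call so first-use overhead is not timed

tLoop = zeros(1, nRep);
for k = 1:nRep
  tic;
  B1 = nan(S);                 % row-wise loop, as in Answer 1 (NaN filler)
  for row = 1:S(1)
    val = unique(A(row, :));
    [~, idx] = ismember(val, A(row, :));
    B1(row, idx) = val;
  end
  tLoop(k) = toc;
end

tSort = zeros(1, nRep);
for k = 1:nRep
  tic;
  [Bs, C] = sort(A, 2);        % sort-based version, as in Answer 2
  D = [false(S(1), 1), diff(Bs, 1, 2) == 0];
  R = repmat((1:S(1))', 1, S(2));
  B2 = A;
  B2(sub2ind(S, R(D), C(D))) = NaN;
  tSort(k) = toc;
end

fprintf('loop: %.4f s, sort: %.4f s (median of %d runs)\n', ...
        median(tLoop), median(tSort), nRep);
```

The relative timings are likely to depend on the shape of A (many short rows versus few long rows), so it is worth measuring with dimensions close to the real data.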

Comments

Titus Edelhofer 2016-2-26

Hi George,

the filler is set up front, by initializing B with

B = nan(size(A));

or

B = -ones(size(A));

or whatever.

My advice to customers is usually: write the code in a way that is simple to write and simple to read. Once it has progressed and you can identify bottlenecks, start investigating with tic/toc or the profiler. Don't get me wrong, a good deal of my work is teaching vectorization (one of my favorite underused functions is bsxfun). But I try to avoid writing unreadable vectorized code without need. And when I do write it, I add the simpler loop version as a comment so that someone else (or I myself) can understand what's happening.

Titus

George Aipal 2016-2-26

Edited: George Aipal 2016-2-26

Great, I'll keep your advice and tips in mind. Regarding the pre-filling: that would not eliminate the columns containing only NaNs or -1s, so I guess I would then need a for loop that finds the first column containing at least one element other than NaN or -1, and then copy the matrix from that column onwards. But wouldn't I still need to allocate the very large matrix initially?

That may be the only way to do it. The problem is that in this case each column is a sample from a dataset, and I am extracting only the unique values per dimension across this large dataset. Depending on the size of the dataset, an initial B with that many columns could be too big to store (on top of all the samples already stored in dataset A). Having said that, is the solution in my original question the best one? A vectorized implementation like Stephen's also looks great, but I suppose I would still need to solve this problem?

To summarize my problem: given a large dataset A (M x N), extract the unique values of each dimension/parameter as fast as possible (since this is a large matrix), because the repeated values are redundant, of no use, and simply occupy memory. Samples usually sit in rows, so I could do X=permute(A, [2 1]) to make it look more like the common case, and then produce an Xreduced with fewer rows than X that contains only the unique values of each parameter, padded with fillers, but without any rows (of X, in this case) that consist only of fillers, since the goal is to reduce memory use. I hope that makes sense?
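
If the goal is purely to store the unique values of each parameter, one possibility that avoids fillers altogether is a cell array with one entry per parameter. This is a sketch of an alternative storage layout, not something proposed in the thread; X and U are illustrative names, and X is the small example matrix transposed so that samples sit in rows.

```matlab
X = [1 4 8; 2 1 9; 3 6 8; 1 7 11];   % samples in rows, parameters in columns

% num2cell(X, 1) splits X into one cell per column; unique then runs on each column.
U = cellfun(@unique, num2cell(X, 1), 'UniformOutput', false);

for p = 1:numel(U)
  fprintf('parameter %d: %d unique values\n', p, numel(U{p}));
end
```

Each U{p} is a column vector of the sorted unique values of parameter p, so no row or column of fillers is ever stored. The trade-off is that the result is no longer a rectangular matrix, and num2cell makes a temporary copy of X, so for very large data a plain loop over columns that calls unique(X(:, p)) may be lighter on memory.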
