Unique elements in matrix efficiently

16 views (last 30 days)
Given a matrix A, what is an efficient way to obtain a matrix B that consists of only the unique elements in A (where by unique, I mean unique within each row, so the 'ismember' function is not suitable)?
For example:
A = [1 2 3 1; 4 1 6 7; 8 9 8 11]
A =
1 2 3 1
4 1 6 7
8 9 8 11
Since each row can contain a different number of unique elements than the other rows, the matrix of unique elements would have rows of possibly different lengths, which is not possible. Therefore, I can fill the repeated/non-unique elements with some dummy filler value: since I am working with only positive values I can replace them with a negative number (e.g. -1), or, if the matrix had negative numbers too, they could be replaced by NaN values.
The result therefore can be:
B =
1 2 3 -1
4 1 6 7
8 9 -1 11
(where the negative -1s could be replaced by NaNs alternatively).
Notice that although A(2, 2) has a value (1) that already exists earlier in the matrix (at A(1, 1)), it is still unique within its own row, which is why the 'ismember' function cannot be applied.
I have created a solution, but I imagine there are more elegant and more efficient solutions using vectorization and avoiding for loops, because in my case matrix A happens to be very large:
B = A(:, 1);
for i = 2:size(A, 2)
    NEWMEMBERS = ~sum(bsxfun(@eq, A(:, i), B), 2);  % true where A(:, i) has not appeared in that row of B yet
    NEWCOL = NEWMEMBERS .* A(:, i);
    FILLER = -1 * ~NEWMEMBERS;                      % -1 wherever the value is a repeat
    NEWCOL = NEWCOL + FILLER;
    B = [B NEWCOL];
end
(FILLER can more generally be built from 0s and NaNs instead of 0s and -1s.)
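For reference, the NaN-filler variant mentioned above might look roughly like this (just a sketch of the same loop, writing NaN into the repeated positions instead of -1):
B = A(:, 1);
for i = 2:size(A, 2)
    isNew = ~any(bsxfun(@eq, A(:, i), B), 2);   % true where A(:, i) has not appeared in that row yet
    newCol = A(:, i);
    newCol(~isNew) = NaN;                       % NaN filler instead of -1
    B = [B newCol];                             % grows B column by column, as in the original loop
end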
3 Comments
John D'Errico 2016-2-26
Edited: John D'Errico 2016-2-26
Note that, as originally posted, this question did not contain validly executable MATLAB code, having constructs like += and ! in the code.


Answers (3)

Titus Edelhofer 2016-2-26
Hi,
a rather simple version would be this:
B = -ones(size(A));
for row = 1:size(A, 1)
    val = unique(A(row, :));
    [~, idx] = ismember(val, A(row, :));
    B(row, idx) = val;
end
I haven't tried though what happens if A is large ...
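A quick way to get a feeling for the large-A case might be something like the following (just a sketch; the test size and value range are arbitrary):
A = randi(50, 2000, 200);          % 2000 rows, 200 columns, small value range so repeats occur
tic
B = -ones(size(A));
for row = 1:size(A, 1)
    val = unique(A(row, :));
    [~, idx] = ismember(val, A(row, :));
    B(row, idx) = val;
end
toc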
Titus

Stephen23 2016-2-26
Edited: Stephen23 2016-2-26
Without any loops (and can be easily adapted to use a tolerance):
A = [1 2 3 1; 4 1 6 7; 8 9 8 11]
S = size(A);
[B,C] = sort(A,2);                      % sort each row; C maps back to the original columns
D = [false(S(1),1), diff(B,1,2)==0];    % true where a sorted value repeats its left neighbour
R = (1:S(1))'*ones(1,S(2));             % row index for every element
X = sub2ind(S, R(D), C(D));             % linear indices of the repeats in A
A(X) = NaN
displays this in the command window:
A =
1 2 3 NaN
4 1 6 7
8 9 NaN 11
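The tolerance adaptation mentioned above could look roughly like this (a sketch only; tol is a hypothetical tolerance, and chained near-duplicates are compared pairwise):
tol = 1e-6;                                        % hypothetical tolerance
S = size(A);
[B, C] = sort(A, 2);
D = [false(S(1), 1), abs(diff(B, 1, 2)) <= tol];   % sorted neighbours within tol count as repeats
R = (1:S(1))' * ones(1, S(2));
A(sub2ind(S, R(D), C(D))) = NaN;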
1 Comment
George Aipal 2016-2-26
Thanks for your great solution! It indeed answers my question. I have followed up in a comment below with a question regarding performance, and about how to eliminate redundant columns that contain only NaNs. I will rephrase the problem to make it look more like a common case, so I'll now refer to rows instead of columns, and to X instead of A:
You have a large dataset Xdata (N x M), containing many samples of dimension M (M parameters), one sample per row. N is very large (many samples), and the task is to reduce the memory footprint by creating an Xreduced that does not contain unnecessary rows. In this particular case, I want to store only unique parameter values, so each column of Xreduced should contain only unique values or fillers (NaN or -1), but rows containing only NaN should not appear in Xreduced.
I suppose I could use your solution, sort it, and then use a for loop to find the first column of A (or row of X) that contains at least one element different from NaN, and then copy from there onwards the rest of A (or X) into B (or Xreduced), but this requires loops and duplicates the original large dataset A (Xdata). Is there a better way to do this? Or, given this specific problem, is my initial solution in my original question perhaps as optimal as it gets?
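One loop-free way to drop the all-filler rows might be plain logical indexing (a sketch, reusing the Xreduced and B names from above):
keepRows = ~all(isnan(Xreduced), 2);   % rows that still contain at least one real value
Xreduced = Xreduced(keepRows, :);

% or, in the original column-oriented layout, drop the all-NaN columns of B
keepCols = ~all(isnan(B), 1);
B = B(:, keepCols);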



George Aipal 2016-2-26
Thank you all for your answers. The comment that A(2, 2) should not be NaN is correct, because it is the only 1 in that row. Regarding the solutions from Titus and Stephen: they both work, and Stephen's solution does address my goal of avoiding the for loops I was using in my own solution, aiming to increase speed. Interestingly enough, however, I have just found out that some of these solutions without the for loop perform slower than the one with the for loop, or am I doing something wrong? I will investigate the performance and let you know (it seems that running solution 1 followed by solution 2 and then the 3rd somehow affects their individual timings compared with running them in the inverted order, so I should restart the environment and run one solution at a time, perhaps with bigger matrices, and I shall report on the findings).
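For timings that are less sensitive to run order and warm-up effects than tic/toc, timeit might help; a sketch, where solution1, solution2 and solution3 are hypothetical function wrappers around the three approaches:
A = randi(50, 5000, 200);        % arbitrary test size
t1 = timeit(@() solution1(A));   % e.g. the bsxfun loop from the question, wrapped in a function
t2 = timeit(@() solution2(A));   % e.g. Titus's unique/ismember loop
t3 = timeit(@() solution3(A));   % e.g. Stephen's sort-based version
fprintf('bsxfun loop: %.4f s, unique loop: %.4f s, sort-based: %.4f s\n', t1, t2, t3);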
5 Comments
Titus Edelhofer 2016-2-26
Hi George,
the filler comes in beforehand, by setting
B = nan(size(A));
or
B = -ones(size(A));
or whatever.
My advice to customers is usually: write the code in a way that is simple to write and simple to read. When it has progressed and you identify bottlenecks, then start investigating with tic/toc or the profiler. Don't get me wrong, a good deal of my work is teaching vectorization (one of my favorite underused functions is bsxfun), but I try to avoid writing unreadable vectorized code without need. And if I do, I add the simpler/loop version as a comment so that someone else (or I myself) understands what's happening.
Titus
George Aipal 2016-2-26
Edited: George Aipal 2016-2-26
Great, I'll keep your advice and tips in mind. Regarding the pre-filling: that would not eliminate the columns of only NaNs or -1s, so I guess I would then have to use a for loop in which I somehow find the first column containing at least one element different from NaN or -1, and then copy the matrix from there onwards, but I would still initially need to allocate the very large matrix?
That may be the only way to do it. The problem is that, in this case, each column belongs to a sample from a dataset, and I am extracting only the unique values per dimension across this large dataset. Depending on the size of the dataset, it could be too big to store an initial B with so many columns (on top of all the samples already stored in dataset A). Having said that, is my initial solution in my original question the best solution? A vectorized implementation like Stephen's also looks great, but I would again need to sort out this problem, I suppose?
A summary of my problem would be: given a large dataset A (M x N), extract (efficiently, in the fastest possible way, since this is a large matrix) the unique values in each dimension/parameter, since the repeated values are redundant, of no use, and simply occupy a lot of memory. Usually the samples tend to be on the rows, so I could simply do X = permute(A, [2 1]) to make it look more like a common case, then produce an Xreduced that should have fewer rows than X, containing only unique values per parameter and fillers elsewhere, but without rows (of X, in this case) that consist only of fillers, as the idea is to reduce the memory used. I hope that makes sense?
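Building on Stephen's sort-based approach, a rough end-to-end sketch for this summary might look like the following (assuming parameters on the rows and samples on the columns of A, as above; note that sort still creates one temporary full-size copy):
S = size(A);
[B, C] = sort(A, 2);                        % B is a temporary full-size sorted copy
D = [false(S(1), 1), diff(B, 1, 2) == 0];   % repeats within each row (parameter)
R = (1:S(1))' * ones(1, S(2));
A(sub2ind(S, R(D), C(D))) = NaN;            % mark repeats directly in A
clear B C D R                               % release the temporaries
A(:, all(isnan(A), 1)) = [];                % drop samples (columns) that are entirely filler
Xreduced = A.';                             % samples on rows, parameters on columns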

