Vectorization approach to find matching elements in a matrix.
34 次查看(过去 30 天)
显示 更早的评论
Hi experts, I am new to Matlab and I need your help with vectorization approach to improve speed of my code.
I have a big 10,000,000 x 15 matrix A containing integers in the range of 1-255, each row has 15 different integers. The integers in different rows can be the same, but each row is unique. I also have a smaller matrix B of 240,000 x 15 containing integers in the same range (i.e., 1-255), with the same conditions as of matrix A.
I need to find the rows in matrix A that has at least 2/3 of matching elements to the elements in each row of matrix B. If a row in matrix B has rows in matrix with matched conditions, then store those rows as below, or else just skip it.
- C = rows from matrix A
- D = rows from matrix B
Currently I am using for-loop to do this, but this takes a lot of time and could not even finish.
C = [];
D = [];
for i = 1:1:size(matrix_B,1)
C = [C; A(sum(ismember(matrix_A,matrix_B(i,:)),2)>=10,:)];
D = [D; B(i,:)];
end
Is there a vectorization approach that I can do to solve this problem? Or should I re-define my problem and change the entire solution? I also heard about Parallel Computing which would be a big help to this problem, but I do not know how to use it.
Thank you.
0 个评论
采纳的回答
John D'Errico
2023-3-28
编辑:John D'Errico
2023-3-28
The part of your code that is a serious problem is where you build the arrays by growing them at each step. This is a really bad thing to do in MATLAB, and is why your code will be so abysmally slow. Your code forces MATLAB to reallocate the memory for those arrays millions of times, each time for a slightly larger amount of memeory. Then it is forced to copy over EVERYTHING from the old version of the array to the new one. UGH. Don't do things that way!
How would I solve this? I would probably define the distance between two rows as the number of elements in row 1 that are NOT in row 2. You already do something like that, using ismember.
But now you should be able to use rangesearch, to locate rows in one array that are "close" to rows in the other array, and rangesearch is the perfect tool to use a distance like this on a user provided function.
3 个评论
John D'Errico
2023-3-28
编辑:John D'Errico
2023-3-28
(Note, at first, I was thinking that knnsearch would be the right tool, but of course, it is rangesearch that you need.)
First, you need a function that will compute the "distance" between rows. The distance is just the number of elements in vector 1, that are NOT in vector 2. So with 15 elements, a distance of zero means a perfect match. You used ismember, which should work.
Then use the rangesearch function.
help rangesearch
So the most difficult problem is writing the distance function. I wrote one below, it is vectorized, in the sense that if you have multiple rows for the Y array, it still works. And that is what rangesearch needs. As a test, I'll try it on a small set of rows. I'll sort them so it is easy to see tit works.
X = sort(randi(30,1,10),2)
Y = sort(randi(30,3,10),2)
setInclusionDistance(X,Y)
As you can see, it does work.
Now just use rangesearch.
X = randi(45,[1000,15]);
Y = randi(45,[100,15]);
[IDX,D] = rangesearch(X,Y,5,'Distance',@setInclusionDistance);
So the test I ran here was relatively small, but it returned results. A small distance means there were many hits between rows. A distance of 5 means there were only 5 elements in two rows that were not the same. That will be pretty rare. In order to get a fw hits on such a small sample, I restricted the rows to lie between 1 and 45. I deliberately chose a small set like this so you can see that some rows of X had no hits at all.
IDX{1:5}
D{1:5}
We can see that row 1 of X was "close" to many rows of Y, but there were not hits at all for rows 2 and 3.
function D = setInclusionDistance(X,Y)
% computes the number of elements in the vector X that are not in Y
D = numel(X) - sum(ismember(Y,X),2);
end
In a larger test on my computer of the size you mentioned, it took a few minutes to finish, but this is a large problem.
更多回答(0 个)
另请参阅
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!