How to perform row operations on a large MATRIX or CELL efficiently?

4 次查看(过去 30 天)
I have a large CELL of the order 18000x4 (Or a MATRIX of the similar order) Each CELL element is a 'name' and therefore a string (In case of Matrix it is a 'number') The operation I need to do is explained in the following example (Please find relevant files attached). Example:
CELL INPUT
'1' '3' '4' []
'1' '3' '5' []
'1' '3' '6' []
'1' '4' '4' []
'1' '4' '5' []
'1' '4' '6' []
'2' '3' '4' []
'2' '3' '5' []
'2' '3' '6' []
'2' '4' '4' []
'2' '4' '5' []
'2' '4' '6' []
'5' '4' [] []
'5' '7' [] []
'6' '8' '4' []
'6' '8' '7' []
'9' '10' [] []
CELL OUTPUT
'1' '4' [] []
'1' '3' '5' []
'1' '3' '6' []
'2' '4' [] []
'2' '3' '5' []
'2' '3' '6' []
'4' '5' [] []
'5' '7' [] []
'4' '6' '8' []
'6' '7' '8' []
'10' '9' [] []
There are 2 main operations done to get the output:
1) Delete the duplicates in the rows. (1 4 4) ---> (1 4) and (2 4 4) ---> (2 4) in the above example.
2) After step (1), compare every row and delete the supersets of the row if any. (1 3 4) (1 4 5) (1 4 6) deleted because they are supersets of (1 4). Similarly, (2 3 4) (2 4 5) (2 4 6) are deleted because they are supersets of (2 4)
I have written a MATLAB function called 'minimize' which does this operation. But, when I need to do this for a large CELL of order 8656x4, MATLAB keeps running and there is no solution after 48hours. I need to get this done for even larger orders of the CELL and within a time frame of 1-2 hours.
Could you please provide any method/suggestion to efficiently solve this?
  2 个评论
Carl
Carl 2017-10-10
编辑:Carl 2017-10-10
Hi Ashish. I see from your .mat files that you have analogous matrices of type 'cell' and 'double'. Just to clarify, is it only slow when performing those operations on the 'cell' data? Or do you see the same performance when operating on the 'double' type matrices as well?
Initial suggestions for improving performance would be to take a vectorized, rather than loop-based, approach to copying data. For example, change:
for k = 1:size(temp_name,2)
minimal_row(i,k) = temp_name(k);
end
to
minimal_row(i,:) = temp_name;
Or,
for j = 1:size(number_matrix,2)
if number_matrix(i,j) == 0
break;
else
temp_name(j) = number_matrix(i,j) ;
end
end
to
temp_name = number_matrix(i,:);
temp_name = temp_name(temp_name ~= 0)
This type of syntax can be optimized more efficiently.
Ashish Keshkamat
Ashish Keshkamat 2017-10-11
Hello Carl, It is relatively better when I solve using 'double' type matrices as compared to cells. But,the difference is not appreciable.
Thank you for your feedback. Let me try the vectorised approach.

请先登录,再进行评论。

采纳的回答

Ashish Keshkamat
Ashish Keshkamat 2017-11-14
Hello All,
Thank you for your time and effort to help me solve this problem. I have figured it out to solve the 'minimize' more efficiently. This function is part of a bigger program I am working on, and was the bottle-neck. There is a pattern in which the input to 'minimize' is generated. I track the duplicates in while generating the input and only compare those rows and delete the super-sets among them. This reduces the computations from thousands to hundreds and sometimes even to tens.
Carl, Thanks for the suggestion of vectorization. I used it in many instances.
Sharath, Thank you for pointing out the flaws in my Algorithm. It helped me to think in different ways to reduce the number of computations.
With all the optimization done, I am able to get my output using input of the order 8656x4 in 20 seconds.
Best Regards, Ashish

更多回答(1 个)

Sharath Chandran
Sharath Chandran 2017-10-11
编辑:Sharath Chandran 2017-11-3
Hi Ashish,
I reckon, this has more to do with your Algorithm.
Let us consider a matrix of dimension 'm x n' ('m' rows & 'n' columns). The time complexity of the first operation that you perform is O(m * n). This roughly equals to '34624' number of computations which is reasonable number.
On the other hand the time complexity of the second operation that you are performing is O(m^2 * n), which roughly equals to at least '299705344' number of computations, after plugging in the value 'm=8656' and 'n=4'. This is clearly the bottleneck of your current algorithm and hence it needs to be optimized.
Alternate approach:
1. After performing step 1, store each row in a Hash-Map - O(m) (let us ignore the time-complexity of storing data in Hash-Map for the time being).
2. For every row, generate subsets and check if any subset is present in the Hash-Map. This should not be an overhead as max. no. of subsets for a 4-element row is 2^4-2 = 14. Here we are subtracting '2' as we are not considering the empty set and the set itself - O(2^n).
3. If any subset is already present in the Hash-Map, reject this set (row) since one of its subset already exists.
4. Total time complexity - O(m * 2^n). This roughly equals to at least '138496' number of computations which is a drastic improvement.
I hope this was helpful.
Please comment below in case you have any queries.
_
Regards,
MathWorks Technical Support Team
Note: This is not the exact time complexity analysis as we have not taken into account the complexity of storing data in the Hash-map.
  1 个评论
Ashish Keshkamat
Ashish Keshkamat 2017-11-7
Hello Sharath,
Thank you for your feedback. I understand the proposed algorithm reduces the number of computations drastically. But, I'm not aware of the concepts of using Hash-map (Hash-table).
Let me see what can be done.

请先登录,再进行评论。

类别

Help CenterFile Exchange 中查找有关 Logical 的更多信息

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by