mode with categorical variables and parfor is slow

Question

Andrea 2023-4-20


https://ww2.mathworks.cn/matlabcentral/answers/1950333-mode-with-categorical-variables-and-parfor-is-slow


Hello everybody,

I don't understand why the (sketched) code below is slow.

Consider the following vector, whose elements may be repeated, and the vector of its unique values:

potential_rep_idx = categorical(randi(N, M, 1)); % M-by-1 vector of values in 1..N (M, the number of rows, is assumed here)
unique_idx = unique(potential_rep_idx);

The purpose of the code is to take a table called "table_of_stuff", built from the vector above and a table "table_other_stuff" that has several columns of various types (double, datetime, cell, string), as follows:

table_of_stuff = [array2table(potential_rep_idx), table_other_stuff]

and identify, for each element of unique_idx, all rows of table_of_stuff in which the element appears. Then, from all these rows, build one single row in which each element is the mode of the values in that column.

In other words:

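% table_of_stuff: a long table with columns of various types (double, datetime, cells, strings)
table_of_stuff = categorical(table_of_stuff); % (sketch) every column is converted to categorical
parfor i = 1:numel(unique_idx) % one iteration per unique value
    find_idx = find(potential_rep_idx == unique_idx(i)); % rows that contain this value
    mode_table(i,:) = array2table(mode(table_of_stuff{find_idx, :}, 1)); % column-wise mode of those rows
end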


Answer 1

Raghav 2023-5-5


https://ww2.mathworks.cn/matlabcentral/answers/1950333-mode-with-categorical-variables-and-parfor-is-slow#answer_1229694

Hi,

As I understand the question, your parfor loop is running slowly.

There are a few reasons why the code you provided may be slow:

Using find and indexing with logical operations: In the line find_idx = find(potential_rep_idx == unique_idx(i)), the comparison builds a temporary logical vector as long as potential_rep_idx on every iteration, and find then converts it to indices. For large arrays, this repeated work can be memory-intensive and slow.
Using mode function inside a loop: The mode function is being used inside a loop, which can be inefficient for large datasets. It is generally better to use vectorized operations instead of loops whenever possible.
Creating a new table in each iteration of the loop: Inside the loop, a new table is being created in each iteration using array2table. This can be memory-intensive and slow for large datasets.
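
Which of these costs dominates depends on the data, so it can be worth timing them separately. The following is only a rough sketch on synthetic data, not the original code: the sizes M and N, the numeric column values, and the names loop_mode and row_table are assumptions made for illustration.

```matlab
% Synthetic stand-in for the data in the question (all values are assumed).
rng(0);                                  % make the example reproducible
M = 2000;                                % number of rows
N = 50;                                  % number of possible key values
potential_rep_idx = categorical(randi(N, M, 1));
unique_idx = unique(potential_rep_idx);
values = randi(5, M, 1);                 % one numeric column to take the mode of

% Cost of rebuilding the logical mask and calling mode once per unique value.
tic;
loop_mode = zeros(numel(unique_idx), 1);
for i = 1:numel(unique_idx)
    mask = potential_rep_idx == unique_idx(i);   % temporary logical vector
    loop_mode(i) = mode(values(mask));
end
t_loop = toc;

% Cost of wrapping each result in its own table, as array2table does in the loop.
tic;
for i = 1:numel(unique_idx)
    row_table = array2table(loop_mode(i));
end
t_table = toc;

fprintf('mask + mode loop: %.4f s, array2table loop: %.4f s\n', t_loop, t_table);
```

The absolute timings are not meaningful on data this small; the point is only to compare the two parts on your own table.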

To improve the performance of the code, you can consider the following:

Avoid repeating find and the logical comparison: Instead of comparing inside the loop, you can call the ismember function once, before the loop. Its second output gives, for every element of potential_rep_idx, the position of the matching value in unique_idx.
Use vectorized operations instead of loops: You can use the splitapply function to split the table into groups based on the values in the potential_rep_idx vector, apply the mode function to each group, and then combine the results into a single table. This can be much more efficient than using a loop.
Avoid creating a new table in each iteration of the loop: Instead of creating a new table in each iteration of the loop, you can preallocate a matrix or cell array to store the results and then convert it to a table after the loop is finished.
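
The suggestions above can be sketched as follows. This is a minimal example on synthetic data, not a drop-in replacement for your code: the sizes M and N and the column names num_col and cat_col are assumptions, and findgroups is used here to compute the group numbers that splitapply needs.

```matlab
% Synthetic stand-in for the table in the question (all names and values are assumed).
rng(0);                                  % make the example reproducible
M = 1000;                                % number of rows
N = 20;                                  % number of possible key values
potential_rep_idx = categorical(randi(N, M, 1));
table_other_stuff = table(randi(5, M, 1), ...
    categorical(randi(3, M, 1), 1:3, {'a', 'b', 'c'}), ...
    'VariableNames', {'num_col', 'cat_col'});

% Compute one group number per row, once, instead of comparing inside a loop.
[G, unique_idx] = findgroups(potential_rep_idx);

% Apply mode to each column, group by group, and build the result table once.
vars = table_other_stuff.Properties.VariableNames;
mode_table = table(unique_idx);
for k = 1:numel(vars)
    mode_table.(vars{k}) = splitapply(@mode, table_other_stuff.(vars{k}), G);
end
```

The remaining loop runs over columns rather than over unique values, so it is short, and each column keeps its own type. Columns that mode does not accept directly, such as string columns, would need to be converted first, for example with categorical as in the question. With the grouping done this way, parfor may no longer be needed at all.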

Hope it helps,

Raghav Bansal
