Performance of table data type
Hello!
Is it normal that writing into a table data structure is 1000 times slower than writing into a cell array of the same size? And that reading is 50 times slower?
Try the following code:
%Test:
tic;
A = cell(10000, 50);
'Time for initializing cell array:'
toc
tic;
B = cell2table(A);
'Time for initializing table:'
toc
i = 0; % create variable
tic;
for i = 1 : 2500
A{i, 7} = 'aaa';
end
'Time for writing into cell array:'
toc
tic;
for i = 1 : 2500
B{i, 7} = {'aaa'};
end
'Time for writing into table:'
toc
x = ''; % create variable
tic;
for i = 1 : 2500
x = A{i, 7};
end
'Time for reading from cell array:'
toc
tic;
for i = 1 : 2500
x = B{i, 7};
end
'Time for reading from table:'
toc
2 comments
Oleg Komarov
2016-11-30
Edited: Oleg Komarov
2016-12-1
While tables do have performance issues, this example is particularly pathological.
The initialization of a table with an array of empty cells is problematic. The following initialization is much faster:
tic;
A = repmat({''},1e4,50);
'Time for initializing cell array:'
toc
Also, named reference is preferred to curly brackets, i.e. B.A7(i) instead of B{i,7}.
Answers (6)
Peter Perkins
2014-10-30
Michael, table is currently not as fast as datatypes like double and cell when you are reading or writing individual values in a long loop. However, it's often possible to vectorize your code and read or write entire variables, at which point you probably won't notice a speed difference. You may also find that
B.A7{i} = 'aaa'
is faster than
B{i, 7} = {'aaa'}
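To make the vectorization point concrete, here is a minimal sketch that writes the same 2500 values in one assignment instead of a loop. It assumes the table was built with cell2table(A) as in the question, so the seventh variable is named A7; the timing variable tVec is only illustrative.

```matlab
% Sketch: fill rows 1..2500 of one table variable in a single assignment.
A = repmat({''}, 10000, 50);   % cell array of empty char vectors
B = cell2table(A);             % variables are named A1..A50 (taken from the input name)

n = 2500;
tic;
B.A7(1:n) = {'aaa'};           % the scalar cell expands across the selected rows
tVec = toc;

fprintf('Vectorized write of %d rows: %.4f s\n', n, tVec);
```

Only one table subscripted assignment is executed here, so the per-call overhead of table indexing is paid once rather than 2500 times.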
Hope this helps.
Michael
2014-10-31
1 comment
Nigel Dyer
2015-6-28
Agreed. The table type appeared to be a perfect solution for what I needed to do. I found this question, registered my profile and wrote this while waiting for writetable to complete. The previous code using dlmwrite took a couple of seconds.
Oleg Komarov
2016-11-30
Edited: Oleg Komarov
2016-12-2
I have been using table() since well before it was introduced into the core package, since de facto it is the ported version of the dataset() class from the Statistics Toolbox. I also noticed, a long time ago, many limitations in terms of performance and functionality, and have logged enhancement requests with TMW.
To address the limitations of table() while waiting for the official implementation of my enhancement requests, I created tableutils(). Among the problems, you would be astonished to know that the disp() of a big table can literally freeze your PC until the next ice age (and I am not talking about the movies...). This is something that I fixed with a buffered disp method.
While my tableutils() do not address directly the problems in subsref/subsasgn, anyone is welcome to contribute to this effort to make the table() class better by submitting an issue or a Pull Request on Github.
Addressing some points in the question
- It is 50x faster to initialize with {''} rather than with []
N = 500;
A = cell(N);
sprintf('cell2table() on empty cells: %.3fs', timeit(@()cell2table(A)))
A = repmat({''},(N));
sprintf('cell2table() on {''''} cells: %.3fs', timeit(@()cell2table(A)))
- It is 5x faster to use dot-indexing, i.e. subsasgnDot, than brace-indexing, i.e. subsasgnBraces
S = 1000;
[row,col] = ind2sub([N N],randsample(N^2,S,false)); % size must be given as [rows cols]
% {} assignment
B = cell2table(A);
tic
for ii = 1:S
B{row(ii),col(ii)} = {'aaa'};
end
toc
% . assignment
C = cell2table(A);
vnames = B.Properties.VariableNames;
tic
for ii = 1:S
C.(vnames{col(ii)})(row(ii)) = {'aaa'};
end
toc
LuisCardona
2016-5-5
Tables are the slowest thing I have ever used. I had to rewrite my code to use matrices, coding the names of my columns as integers, because of their poor performance.
Stay away from tables!
3 comments
Victor
2017-6-15
I think the current table datatype is an attempt to support more sophisticated Excel-like functionality, at the cost of optimization.
The problem is that with matrices you can't always remember which column index corresponds to which name, and searching a list of strings on every access to a variable is not a good solution.
I have used two ways to keep variable/column names: a structure of vectors of the same length, and a vector of structures (a.k.a. a nonscalar struct array).
Both have drawbacks: you can't get simple row-wise and column-wise access at the same time without a slow conversion to another data structure.
But I think there could be a simpler, optimized version of the table data type if we just want to combine row-number and column-variable indexing with ordinary arrays and cell arrays. And if we have only numbers (with no cell/string/sparse functionality), it could be even faster.
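For readers unfamiliar with the two layouts mentioned here, a minimal sketch follows. The field names time and value and the sample data are made up for illustration; only the access patterns matter.

```matlab
% Layout 1: scalar struct whose fields are equal-length column vectors.
s = struct('time', [0; 1; 2], 'value', [10; 20; 30]);
col = s.value;                      % column access is direct
row = [s.time(2), s.value(2)];      % row access must visit every field

% Layout 2: nonscalar struct array, one element per row.
r = struct('time', {0; 1; 2}, 'value', {10; 20; 30});
rowRec  = r(2);                     % row access is direct
colVals = [r.value]';               % column access gathers across all elements
```

Each layout makes one direction of access cheap and the other one awkward, which is the trade-off described above.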
LuisCardona
2017-6-28
Hoi Wong, I wanted to clarify that I was talking about tables in MATLAB, not the concept altogether. Thanks for the comment. But I stand by my position that they are terribly slow in MATLAB.
jbpritts
2016-11-24
I have Matlab 2016b. I can confirm that tables are terribly slow. Unless you really need it for heterogeneous data, then avoid them in any performance critical code. I will have to rewrite a fairly complicated section of code using legacy data structures. Matlab should address this extreme performance deficiency.
Peter Perkins
2016-12-2
Edited: Peter Perkins
2016-12-2
As posts on this thread have indicated, while tables are often the right data structure for the job, their performance in scalar indexing is not comparable to that of types such as double and struct. While there have been significant performance improvements since the initial release in R2013b (e.g. writetable), and those improvements will continue, tables are best when operations can be vectorized. That's often true even with plain old double matrices. It's also best to pre-allocate a table rather than growing it row by row, and again, that's true even for double matrices.
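As a rough illustration of pre-allocation, the sketch below creates a table at its final height up front and then fills whole variables, instead of appending one row per iteration. The variable names id and label and the size n are hypothetical.

```matlab
n = 1000;

% Pre-allocate: construct the table once with full-length placeholder columns.
t = table(zeros(n, 1), repmat({''}, n, 1), 'VariableNames', {'id', 'label'});

% Fill entire variables in one assignment each, rather than growing row by row.
t.id    = (1:n)';
t.label = repmat({'aaa'}, n, 1);
```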
In situations where code cannot be vectorized, perhaps because the results of one iteration of a loop affect subsequent iterations, it's often possible to encapsulate the body of a loop into a function that you call by passing it a table's variables using dot subscripting, and assign back to a table's variables, rather than completely rewriting code to not use tables. It often looks something like this:
[t.X,t.Y,t.Z] = fun(t.A,t.B,t.C)
where fun is a loop that works on separate arrays. Even when it's not desirable to encapsulate the code in a function body, it's often possible to "hoist" a small number of variables out of a table and into the workspace before a loop, have the loop work on them, and then put them back in the table. In other words, if performance is an issue, consider replacing the bottlenecks with code that uses lower-level data types rather than completely avoiding tables.
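The "hoisting" pattern might look like the following minimal sketch. The table t, its variables A and X, and the recurrence inside the loop are made up for illustration; the point is that the loop touches only plain double arrays.

```matlab
% Sketch: hoist table variables into plain arrays, loop, then write back once.
t = table((1:1000)', zeros(1000, 1), 'VariableNames', {'A', 'X'});

a = t.A;                 % hoist: plain double column vectors
x = t.X;

x(1) = a(1);
for k = 2:numel(a)
    % each iteration depends on the previous result, so the loop is kept
    x(k) = 0.5 * x(k-1) + a(k);
end

t.X = x;                 % single assignment back into the table
```

Table indexing happens three times in total (two reads, one write), regardless of the number of loop iterations.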
2 comments
Oleg Komarov
2016-12-4
Edited: Oleg Komarov
2016-12-4
Hi Peter, thanks for the suggestion. Is there any particular reason why the table.subsasgnBraces() transforms the RHS into a table?
A lot of overhead is incurred in that operation and subsequent table methods applied to a table-like RHS.
See, e.g., line 121 of @tabular\subsasgnBraces.m, and line 191 of @tabular\subsasgnParens.m, which calls a MATLAB-coded repmat (since the input is the RHS rendered as a table) instead of the builtin repmat.
Peter Perkins
2016-12-5
Your earlier observation that dot-then-parens indexing is faster than braces, for example, B.A7(i) vs B{i,7}, is true. That's one of the "significant performance improvements" I was referring to. It's an ongoing process. Table brace indexing is something we're planning to work on.