Pre-processing tall array / datastore data
Please pardon me if this has been asked before.
What is the most efficient way to pre-process a large wide table (about 10,000,000 rows by 500 columns) that is currently divided into many small tables in separate MAT-files? The data may be quite sparse in areas and is mostly numeric, with a datetime column, some categorical fields, and some text fields.
For example, should I:
- stack() the small tables but leave them in separate files
- combine them into a giant wide table
- combine them into a stacked very tall table
- delete many NaNs, significantly reducing the height of the stacked table
- use sparse() on a wide table
Those are just some thoughts. Please let me know the best way.
Thank You,
Michael
2 Comments
Guillaume
2019-8-25
Edited: Guillaume
2019-8-25
I'm not really clear on your question. Pre-process in order to achieve what?
Note that stack makes a table less wide (fewer variables) but a lot taller. I'm not sure that's what you mean by stacking. Perhaps you mean vertically concatenate, in which case the datastore takes care of that for you.
Also note that sparse is not a function (or a concept) that applies to tables.
I would think that if you use a datastore with tall tables, there's nothing to do. Just use the tables as is (as one big tall table backed by the datastore).
Guillaume
2019-8-26
Michael's comment mistakenly posted as an answer moved here:
Dear Guillaume,
Thank you for responding.
1) I haven't used sparse in a long time and didn't realize it doesn't apply to tables. I guess that's a very good reason to use a tall table. I was worried that stacked tables would be slow relative to wide tables, but perhaps the sparse nature of the stacked table would offset that. Do you know about the relative speed of stacked vs. wide tables for SQL-type lookups?
2) Regarding datastore/tall/mapreduce, I am just starting to read about them and, like most MATLAB docs, they're a little light on examples. Do you know how I can write to a datastore to build a file from scratch rather than just pointing the datastore to existing files?
3) Also, I am unclear on the value of having one datastore table vs. many. I currently have tens of thousands of little mat-files with timetables in them. Is there a benefit to combining them and is there a preferred format that is faster than others, e.g. mat vs. CSV?
Thanks again for your help,
Michael
PS: Yes, by stacked I mean going from wide (unstacked):
Date  Var1  Var2  Var3
Jan   xxxx  xxxx  xxxx
Feb   xxxx  xxxx  xxxx
vs. stacked:
Date  Field  Value
Jan   Var1   xxxx
Jan   Var2   xxxx
Jan   Var3   xxxx
Feb   Var1   xxxx
Feb   Var2   xxxx
Feb   Var3   xxxx
Answers (4)
Guillaume
2019-8-26
"I was worried that stacked tables would be slow relative to wide tables"
Slow for what type of operation? I would think that some things are better suited to wide tables, others to stacked ones. If you are going to be using myfun(mytable.Var1, mytable.Var2), then stacking Var1 and Var2 may not be a good idea. In addition, in the context of tall arrays, fewer rows may be better.
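As a rough illustration of the wide versus stacked layouts discussed here, the following is a minimal sketch using stack, rmmissing, and unstack on a tiny in-memory table. The table contents and the names wide, stacked, wideAgain, Field, and Value are made up for illustration; they are not taken from the original data.

```matlab
% Hypothetical wide table: one row per date, one variable per measurement.
Date = datetime(2019, [1; 2], 1);
wide = table(Date, [1; NaN], [NaN; 4], [5; 6], ...
    'VariableNames', {'Date', 'Var1', 'Var2', 'Var3'});

% Stack Var1..Var3 into a single Value column; Field records the source variable.
stacked = stack(wide, {'Var1', 'Var2', 'Var3'}, ...
    'NewDataVariableName', 'Value', 'IndexVariableName', 'Field');

% Dropping rows whose Value is missing shortens the stacked table.
stacked = rmmissing(stacked, 'DataVariables', 'Value');

% unstack rebuilds the wide layout; the removed entries come back as NaN.
wideAgain = unstack(stacked, 'Value', 'Field');
```

With the wide layout, a call such as myfun(wide.Var1, wide.Var2) works directly, whereas the stacked layout would first need the rows for each Field value to be selected.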
"Do you know how I can write to a datastore to build a file from scratch rather than just pointing the datastore to existing files?"
Datastores are only for reading. If you have a tall table, you can write it directly to a single text file with writetable. If you want to split it into several text files, simply write chunks of rows in a loop with writetable.
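The chunked approach might look something like the sketch below. It assumes the rows are available as an ordinary in-memory table; the table T, the folder name chunks_demo, the file name pattern, and the chunk size are all hypothetical. MATLAB also provides a write function for tall arrays, which is not shown here.

```matlab
% Hypothetical in-memory table standing in for data that has already been loaded.
T = table((1:10)', rand(10, 1), 'VariableNames', {'Id', 'Value'});

outDir = fullfile(tempdir, 'chunks_demo');   % illustrative output folder
if ~exist(outDir, 'dir')
    mkdir(outDir);
end

chunkSize = 4;                               % rows per file, chosen arbitrarily
nChunks = ceil(height(T) / chunkSize);
for k = 1:nChunks
    % Row indices for this chunk; the last chunk may be shorter.
    rows = (k-1)*chunkSize + 1 : min(k*chunkSize, height(T));
    writetable(T(rows, :), fullfile(outDir, sprintf('part_%03d.csv', k)));
end

% A datastore can then read the pieces back as one logical table.
ds = tabularTextDatastore(fullfile(outDir, 'part_*.csv'));
```

Because every chunk has the same variables, the resulting files can all sit behind one datastore.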
"Also, I am unclear on the value of having one datastore table vs. many"
You have just one datastore that is backed by as many files as you want (all files must have the same format and variables, of course). The datastore manages accessing the data from the files as required, and you access the data using a single tall table (or array). There isn't an option to get several tables out of one datastore.
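For many small MAT-files, one possible setup is sketched below with fileDatastore and a custom read function. It assumes each file stores a table in a variable named T, which is a made-up convention, as are the folder and file names. It also assumes a release in which fileDatastore accepts the UniformRead option.

```matlab
% Create a few small MAT-files, each holding a table named T (hypothetical layout).
dataDir = fullfile(tempdir, 'matfiles_demo');
if ~exist(dataDir, 'dir')
    mkdir(dataDir);
end
for k = 1:3
    Date = datetime(2019, k, (1:2)');
    Var1 = k * [1; 2];
    T = table(Date, Var1);
    save(fullfile(dataDir, sprintf('small_%d.mat', k)), 'T');
end

% One datastore over all files; the read function pulls the table out of each file.
readTable = @(file) getfield(load(file), 'T');
ds = fileDatastore(fullfile(dataDir, 'small_*.mat'), ...
    'ReadFcn', readTable, 'UniformRead', true);

% A single tall table backed by every file in the datastore.
tt = tall(ds);
nRows = gather(height(tt));   % total row count across all files
```

The individual files stay as they are; only the datastore definition ties them together, so there is no need to merge them into one large file first.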