Pre-processing tall array / datastore data

6 views (last 30 days)
Please pardon me if this has been asked before.
What is the most efficient way to pre-process a large, wide table (about 10,000,000 rows by 500 columns) currently divided among many small tables in separate mat-files? The data may be quite sparse in areas and is mostly numeric, with a datetime column, some categorical variables, and some text fields.
For example, should I:
  • stack() the small tables but leave them in separate files
  • combine them into a giant wide table
  • combine them into a stacked, very tall table
  • delete many NaNs, significantly reducing the height of the stacked table
  • use sparse() on a wide table
Those are just some thoughts. Please let me know the best way.
Thank You,
Michael
  2 Comments
Guillaume
Guillaume 2019-8-25
Edited: Guillaume 2019-8-25
I'm not really clear on your question. Pre-process in order to achieve what?
Note that stack makes a table less wide (fewer variables) but a lot taller. I'm not sure that's what you mean by stacking. Perhaps you mean vertical concatenation, in which case the datastore takes care of that for you.
Also note that sparse is not a function (or a concept) that applies to tables.
I would think that if you use a datastore with tall tables, there's nothing to do. Just use the tables as-is (as one big tall table backed by the datastore).
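A minimal sketch of that workflow (the file pattern and the Value variable are assumptions):
ds = datastore('small_*.csv');   % one datastore backed by many files
tt = tall(ds);                   % behaves like one big tall table
m = gather(max(tt.Value))        % evaluation is deferred until gather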
Guillaume
Guillaume 2019-8-26
Michael's comment mistakenly posted as an answer moved here:
Dear Guillaume,
Thank you for responding.
1) I haven't used sparse in a long time and didn't realize it doesn't apply to tables. I guess that's a very good reason to use a tall table. I was worried that stacked tables would be slow relative to wide tables, but perhaps the sparse nature of the stacked table would offset that. Would you know about the relative speed of a stacked vs. wide table for SQL-type lookups?
2) Regarding datastore/tall/mapreduce, I am just starting to read about them, and like most Matlab docs, they're a little light on examples. Do you know how I can write to a datastore to build a file from scratch rather than just pointing the datastore to existing files?
3) Also, I am unclear on the value of having one datastore table vs. many. I currently have tens of thousands of little mat-files with timetables in them. Is there a benefit to combining them and is there a preferred format that is faster than others, e.g. mat vs. CSV?
Thanks again for your help,
Michael
PS, Yes, what I mean by stacked is, from wide (unstacked):

Date  Var1  Var2  Var3
Jan   xxxx  xxxx  xxxx
Feb   xxxx  xxxx  xxxx

to stacked:

Date  Field  Value
Jan   Var1   xxxx
Jan   Var2   xxxx
Jan   Var3   xxxx
Feb   Var1   xxxx
Feb   Var2   xxxx
Feb   Var3   xxxx
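A minimal sketch reproducing this layout with stack() and unstack() (placeholder values assumed):
wide = table(["Jan";"Feb"], [1;2], [3;4], [5;6], ...
    'VariableNames', {'Date','Var1','Var2','Var3'});
stacked = stack(wide, {'Var1','Var2','Var3'}, ...
    'NewDataVariableName','Value', 'IndexVariableName','Field');
wide2 = unstack(stacked, 'Value', 'Field');   % reverses the operation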


Answers (4)

Guillaume
Guillaume 2019-8-26
"I was worried that stacked tables would be slow relative to wide tables"
Slow for what type of operation? I would think that some things are better suited to wide tables, others to stacked ones. If you are going to be calling myfun(mytable.Var1, mytable.Var2), then stacking Var1 and Var2 may not be a good idea. In addition, in the context of tall arrays, fewer rows may be better.
"Do you know how I can write to a datastore to build a file from scratch rather than just pointing the datastore to existing files?"
Datastores are only for reading. If you have a tall table, you can write it directly to a single text file with writetable. If you want to split it into several text files, simply write chunks of rows in a loop with writetable (see the sketch below).
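A rough sketch of that loop (the chunk size and file names are assumptions):
ds = datastore('bigtall.csv');            % source data
ds.ReadSize = 100000;                     % rows per chunk
k = 0;
while hasdata(ds)
    chunk = read(ds);                     % next block of rows as a table
    k = k + 1;
    writetable(chunk, sprintf('chunk_%05d.csv', k));
end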
"Also, I am unclear on the value of having one datastore table vs. many"
You have just one datastore that is backed by as many files as you want (all files must have the same format and variables, of course). The datastore manages accessing the data from the files as required, and you access the data through a single tall table (or array). There isn't an option to get several tables out of one datastore.

Michael
Michael 2019-8-26
Dear Guillaume,
Certainly wide tables are simpler for my purpose (generating tables for machine learning inputs), but due to the sparseness and size of the data that I'm selecting from, tall arrays may be more efficient. I'm not sure.
Thank you for pointing out that datastore is read-only. It looked that way, and I was quite frustrated because I could not confirm it. Likewise, writetable doesn't have an append feature, which is disappointing. I've been using writetable to make CSVs and then combining them by piping the output of DOS copy or type commands (sketch below).
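For reference, that concatenation can be driven from MATLAB (assumes the CSVs share one format and have no header rows):
system('copy /b tall*.csv bigtall.csv');   % binary-mode concatenation on Windows
% or equivalently:
system('type tall*.csv > bigtall.csv');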
I will perform some experiments and write back regarding:
  • The relative speed of a large CSV datastore vs. many small ones
  • Access speed of tall vs. wide storage
  • The speed of the 32-bit KDB+ (Q) solution via the Datafeed Toolbox
Thanks,
Michael
  2 Comments
Walter Roberson
Walter Roberson 2019-8-26
dlmwrite() has append mode -- but it is only for numeric values.
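For example (the file name is arbitrary):
dlmwrite('results.csv', rand(3,4), '-append');   % appends numeric rows to an existing file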
Michael
Michael 2019-8-27
Dear Mr. Roberson,
Thank you. That may be helpful in the future. In this case, unfortunately, I have mostly text and categorical data.
It would be great if The MathWorks added some basic I/O, like appending with writetable and writing to a datastore.
It's a bit of a mission to write a whole flexible routine to append a table with many data types using fprintf (rough sketch below).
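A rough sketch of such a routine (the function name and conversion rules are hypothetical):
function appendtable(T, filename)
% Append the rows of a mixed-type table to a text file using low-level I/O.
fid = fopen(filename, 'a');
C = table2cell(T);
for r = 1:size(C, 1)
    % convert each cell (numeric, datetime, categorical, or text) to char
    parts = cellfun(@(v) char(string(v)), C(r,:), 'UniformOutput', false);
    fprintf(fid, '%s\n', strjoin(parts, ','));
end
fclose(fid);
end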
Thanks Again,
Michael



Michael
Michael 2019-8-26
Hello,
I'm trying to evaluate a datastore of about 20,000 CSV files holding about 36 GB of data that I saved with Matlab using writetable. One column contains datetimes, and an example of the files' contents is this:
29-Jul-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.8979
31-Aug-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9029
30-Sep-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9106
31-Oct-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9154
30-Nov-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9227
30-Dec-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9311
I tried the following code and received the subsequent error. Can someone enlighten me on how to get this to work?
Thank You,
Michael
PS, Code:
ds = datastore('tall*.csv');   % one datastore over all 20,000 files
tds = tall(ds);                % lazy tall table backed by the datastore
u = unique(tds.FIELD);         % deferred computation
U = gather(u);                 % triggers the actual read and evaluation
PPS, Error:
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: 0% complete
Evaluation 0% complete
Error using matlab.io.datastore.TabularTextDatastore/readData (line 77)
Unable to read the DATETIME data using the locale setting for your system: 'en_US'
If the data contains month or day names in a language foreign to this locale, use the 'DateLocale' parameter to specify the correct locale.
Learn more about errors encountered during GATHER.
Error in matlab.io.datastore.TabularDatastore/read (line 120)
[t, info] = readData(ds);
Error in tall/gather (line 50)
[varargout{:}, readFailureSummary] = iGather(varargin{:});
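One workaround, used in the timing code later in this thread, is to defer datetime parsing by importing that column as text:
ds = datastore('tall*.csv', 'DatetimeType', 'text');   % import dates as plain text
tds = tall(ds);
U = gather(unique(tds.FIELD));
The 'DateLocale' parameter mentioned in the error message is the other option if the dates must be parsed on import.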

Michael
Michael 2019-8-27
Hello,
I moved the datetime problem to a separate thread.
As for the speed, I ran a little experiment comparing one 36GB file and the same data in 20,000 smaller files.
Here are my unscientific results, which show that the many small files are about 20% slower than one big file in this example:
One Huge File
One large CSV
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 10 min 31 sec
Evaluation completed in 10 min 32 sec
dt = 30
20,000 Small Files
20,000 small CSVs
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 12 min 35 sec
Evaluation completed in 12 min 36 sec
dt = 28.9531
The Code
clear all
fprintf('One large CSV\n')
tcpu(1) = cputime;
ds = datastore('bigtall.csv','DatetimeType','text');   % dates as text to avoid locale parsing
tds = tall(ds);
u = unique(tds.FIELD);
U = gather(u);                                         % forces evaluation
tcpu(2) = cputime;
dt = tcpu(2)-tcpu(1)

clear all
fprintf('\n20,000 small CSVs\n')
tcpu(1) = cputime;
ds = datastore('tall*.csv','DatetimeType','text');
tds = tall(ds);                                        % create the tall table before using it
u = unique(tds.FIELD);
U = gather(u);
tcpu(2) = cputime;
dt = tcpu(2)-tcpu(1)
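Note that cputime only counts CPU time in the MATLAB client process, not in the parallel workers, which is why dt is around 30 seconds while each evaluation took over 10 minutes of wall time. A wall-clock version of the same measurement could use tic/toc instead:
t0 = tic;
ds = datastore('bigtall.csv','DatetimeType','text');
tds = tall(ds);
U = gather(unique(tds.FIELD));
dtWall = toc(t0)   % elapsed wall-clock seconds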
Thanks,
Michael
