What is the best way to work with large "Table" data type variables?

121 views (last 30 days)
Matt
Matt 2016-7-29
Commented: Matt, 2016-9-15
I find the "table" data type very useful and I would like to take advantage of its features when working with very large tables. Essentially, what I want to do is take a number of delimited text files, all with the same number of columns/variables, import each one as a table, and vertically concatenate them into one MATLAB table. The size of the tables presents a problem, as the files get very big, very fast. My initial thought was to use the "matfile" function, but it is not compatible with the table data type; you have to load the entire table variable to add rows to it, which defeats the purpose. As an example, if I have a .mat file called "test.mat" that contains a table variable called "table1," I cannot access it with "matfile."
m = matfile('test.mat','Writable',true);
m.table1(1,1);
The second line produces an error:
The variable 'table1' is of class 'table'. To use 'table1', load the entire variable.
I used that example for simplicity, but the same error is generated if I attempt to add rows from a table in the workspace to table1.
Is there a way to do what I want that does not require the entire table to be loaded? If the entire table has to be loaded, I imagine I'll run out of memory very quickly. I would also like to minimize the time required to process the data, so loading the entire table does not lend itself to that goal. It may well be that tables are not an option, but I wanted to ask to see if anyone else had any ideas. If I have to move away from tables, what would be the most efficient alternative?
Thanks, Matt

Answers (4)

Heather Gorr
Heather Gorr 2016-8-1
Hi Matt,
Depending on the type of data processing you are doing, you may be able to read the data from each of your files into a table and process it incrementally using the datastore function. This also lets you specify the appropriate data type (categorical, for example) before the data is brought into memory, which can save overhead.
For example, if you want to retain only valid information from a select number of columns, you could do something like this (adapted from the datastore doc; see the doc examples for more!):
ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
preview(ds)
% Choose data of interest and data types
ds.SelectedVariableNames = {'Year','UniqueCarrier','ArrDelay','DepDelay'};
ds.SelectedFormats{2} = '%C';
% Set the read size.
ds.ReadSize = 5000;
% Read the first 5000 rows
data = read(ds);
% Read the rest and keep only rows with no missing values
while hasdata(ds)
    t = read(ds);
    idx = ~any(ismissing(t),2);
    t = t(idx,:);
    data = [data;t];
end
This shows working with one file, but the same approach works for a directory of files with the same overall structure. Here is a bit more info on datastore: http://www.mathworks.com/help/matlab/import_export/what-is-a-datastore.html
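For the multi-file case mentioned above, a minimal sketch looks like this; the folder name `mydata` is a placeholder for your own directory of delimited files, which must all share the same columns:

```matlab
% Point the datastore at a folder of delimited text files; 'mydata' is
% a placeholder folder name, and all files must have the same columns.
ds = datastore(fullfile('mydata','*.csv'),'TreatAsMissing','NA');
ds.ReadSize = 5000;     % rows returned by each call to read
data = read(ds);        % first chunk
while hasdata(ds)       % remaining chunks, continuing across files
    t = read(ds);
    data = [data; t];   %#ok<AGROW> vertical concatenation of chunks
end
```

The datastore keeps track of its position across file boundaries, so the loop body never needs to know which file a chunk came from.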
  1 comment
Matt
Matt 2016-8-5
Thanks Heather. I found this video helpful to gain a high level understanding of some of the methods for working with "big data," such as datastores. I'm not sure what the answer will ultimately be for me, but I definitely have some options to look into.



Edric Ellis
Edric Ellis 2016-9-15
Edited: Edric Ellis 2016-9-15
New in R2016b is the ability to create a "tall" table which lets you perform operations on the table as if it were an in-memory table, but the data is only read in on demand from a datastore (which can reference multiple delimited text files). See more in the doc here: http://www.mathworks.com/help/matlab/tall-arrays.html.
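A minimal sketch of that workflow, using the airlinesmall.csv example dataset that ships with MATLAB (the datastore could equally point at a folder of delimited files):

```matlab
% Build a tall table backed by a datastore (R2016b or later).
ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
tt = tall(ds);                           % data is not read in yet
avgDelay = mean(tt.ArrDelay,'omitnan');  % operations are deferred
avgDelay = gather(avgDelay);             % gather triggers the read/compute
```

Until gather is called, operations on the tall table only build up a computation plan, so the full dataset never has to fit in memory at once.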
  1 comment
Matt
Matt 2016-9-15
Wow, very cool. Now I just need to get IT to hurry up and get the newest version so I can use that.



Star Strider
Star Strider 2016-7-29
One possibility is to build the table in your workspace from a cell array (or a double array) each time using cell2table or array2table, and store it in your .mat file as a cell array (or double array), converting with table2cell or table2array (and their inverses) each time you want to work on the entire table.
You can probably add to the arrays without loading the entire table this way.
I don’t have any actual experience doing this with tables (I’ve never needed to) but have with arrays, so I leave it to you to experiment. It’s the only way I can think of to do what you want.
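One way that experiment might look, assuming all columns are numeric so the table converts cleanly to one double matrix; the file names 'data1.txt' and 'bigdata.mat' and the variable name 'alldata' are placeholders:

```matlab
% Sketch: append one imported file's rows to a matfile-backed matrix.
t = readtable('data1.txt');            % import one delimited file as a table
a = table2array(t);                    % table -> double matrix (numeric only)
m = matfile('bigdata.mat','Writable',true);
if isempty(whos(m,'alldata'))
    m.alldata = a;                     % first file initializes the variable
else
    n = size(m,'alldata',1);           % current row count on disk
    m.alldata(n+1:n+size(a,1),:) = a;  % append rows without loading the rest
end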
  1 comment
Matt
Matt 2016-7-29
If I understand correctly, you would import a text file as a table, convert that table to an array, and save it to a .mat file. Then, import the next text file as a table and convert it to an array. Finally, use "matfile" to access the array in the .mat file and concatenate the array in the workspace to the one in the .mat file.
I imagine this would work, but you would only ever be able to work with the tables when you first imported them. After importing data from all desired text files, you could access a part of the final saved array with "matfile" and convert that part to a table, but you would never be able to work on the entire data set as a table. Also, I imagine you would lose all metadata associated with a table when you convert it to an array.
That would still give me a bit more flexibility than importing directly to an array, although I don't know how it would affect memory and speed. I think I'm going to have to make some compromises anyway, given that "matfile" is not compatible with the table data type. It's too bad MathWorks has not expanded the "matfile" function to work with tables.



Image Analyst
Image Analyst 2016-7-29
If a table runs out of memory, then reading it in and converting it to an array will double the memory requirement, at least until you can clear the table from memory. It will also require a separate array for each variable type, e.g. one array for numbers and one for strings. If all you had was numbers, you wouldn't have used a table in the first place; normally you'd only use a table when the columns are a mixture of data types.
Just what kind of data are you talking about? What is the data class of each column in the table, and how many gigabytes are we talking about for the table? How much RAM do you have? What does the "memory" function display for you? Can you buy more RAM?
And don't even think about cell arrays: they can take 15 times or more the memory of a table.
Have you considered memmapfile()?
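For reference, a minimal memmapfile sketch; the file name 'data.bin', the column count, and the row count are all assumptions, and the file must already exist with (at least) that many doubles in it:

```matlab
% Sketch: map a flat binary file of doubles as an nrows-by-ncols matrix.
nrows = 1e6;                            % assumed number of rows in the file
ncols = 4;                              % assumed number of columns
m = memmapfile('data.bin', ...
    'Format',{'double',[nrows ncols],'x'}, ...
    'Writable',true);
firstRows = m.Data(1).x(1:5,:);         % read a slice; only the touched
                                        % pages are paged into memory
```

This only works for fixed-type numeric data laid out in a flat binary file, so table metadata and mixed column types would have to be managed separately.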
