Hi,
I will have two sets of field data -- one taken over six weeks last year, and another taken over two months this year. For EACH dataset I have variables collected from 4-8 different sources, for up to 50 days, at up to 3 different sites. Both datasets together span about 200-300 columns and 8,000-15,000 rows.
Given all that, I'm trying to figure out how to set up my code for analysing both sets of data. I want to do several different things (there's a rough, made-up sketch of the kind of operation I mean after this list) --
- Analyse the data from each source separately to check for errors
- Filter out a large proportion (up to 25%) of the data that is poor quality
- Check all the filtered data from ONE dataset for trends between days (rows) and variables (columns)
- Compare filtered data in one dataset between the three sites (e.g. all collected at the same times, on the same days)
- Compare different (filtered) variables within a single dataset over time, and
- Analyse the changes between the two (filtered) datasets.
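To give a concrete idea of the kind of operation I mean (the column names and numbers below are all made up, not my real variables), filtering out the poor-quality rows and then comparing one variable between the three sites might look something like this if everything lived in one table:

nRows = 300;                                   % dummy stand-in for one dataset
data = table( ...
    randi(3, nRows, 1), ...                    % Site (1-3)
    randi(50, nRows, 1), ...                   % Day (1-50)
    rand(nRows, 1) > 0.25, ...                 % QualityFlag (true = usable row)
    20 + 5*randn(nRows, 1), ...                % Temperature, an example variable
    'VariableNames', {'Site', 'Day', 'QualityFlag', 'Temperature'});

good = data(data.QualityFlag, :);              % drop the poor-quality rows

% day-by-day comparison of one variable between the three sites
bySite = groupsummary(good, {'Site', 'Day'}, 'mean', 'Temperature');
disp(head(bySite))

But I'm not sure how well something like that extends to the rest of the list, especially the comparisons between the two datasets.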
I have no idea how to structure and maintain my code so that I can do all of these things. I know some of the tests I want to do, but others I haven't thought of yet. At the moment I have about 10 different programs that each load and structure my raw data files in a different way (one built around an array of structs, another where the data is subset into separate variables, etc.), which is incredibly confusing and has led to a lot of errors and an enormous amount of repetition. Deeply nested structs became impossible to work with last year.
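For what it's worth, the array-of-structs loader looks very roughly like this (heavily simplified, with made-up file paths and field names -- the real structs are much more deeply nested), and the other programs each reinvent some variation of it:

files = dir(fullfile('raw_data', '*.csv'));    % one raw file per source/day (placeholder path)
records = struct('source', {}, 'day', {}, 'site', {}, 'values', {});

for k = 1:numel(files)
    raw = readtable(fullfile(files(k).folder, files(k).name));
    rec.source = files(k).name;                % which source the file came from
    rec.day    = k;                            % placeholder -- really parsed from the file name
    rec.site   = 1;                            % placeholder site ID
    rec.values = raw;                          % the actual measurements from that file
    records(end+1) = rec;                      %#ok<AGROW>
end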
I will also have a set of images, taken on the same days, that I want to analyse at the same time, so I need to take that into account too.
MATLAB is so powerful and there are so many ways of managing data. Does anyone have any ideas on how to organise such a large dataset so that I can analyse so many different parts of it?