Classification with a huge dataset

I'm trying to do classification with a huge dataset containing 6 persons for training, and I'm getting this error from the data of only 1 person: "Requested 248376x39305 (9.1GB) array exceeds maximum array size preference." I'm trying Bagged Tree and Neural Network classifiers first, and I want to ask how I can do it. Is it possible to train these classifiers on portions of the dataset (i.e., continue training a saved classification model)?

9 Comments

Please explain how 248376 x 39305 constitutes a 1 person data set
[ I N ] = size(input)
[ O N ] = size(target)
Thanks,
Greg
Input matrix size: 248376 x 765
Target matrix size: 248376 x 1
Then when I try to make the TreeBagger mdl, it creates a 248376 x 39305 matrix. P.S. As you can see, each frame has 765 features.
Please show your Tree Bagging code. https://www.mathworks.com/help/stats/treebagger.html does not return matrices.
Right, it doesn't return matrices because it can't even start, due to the following RAM error. The code is simple:
Mdl = TreeBagger(50,Features,FeaturesTarget);
So I'm thinking about decomposing all the training data into smaller files, but I don't know how to train the classifier again and again with those portions of data. I need something that lets me update a classifier with new data without retraining the entire thing from scratch.
Have you considered reducing the number of trees?
Reducing the number of trees doesn't help. I tried splitting the training data between two different models, making them compact, and combining them; at first glance this helps, but I can't reach a high recognition rate. I think I need an "online" algorithm that can continue training a saved model on new data.
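For what it's worth, the compact-and-combine approach mentioned above can be sketched roughly as below (untested; it assumes Features and FeaturesTarget are the full in-memory matrices, and the chunk count and tree counts are illustrative). CompactTreeBagger provides a combine method that appends the trees of one compact ensemble to another:

```matlab
% Sketch (untested): train TreeBagger on chunks of the data, then merge
% the compact ensembles into one larger ensemble. Note each chunk's trees
% only ever see that chunk, which is not identical to bagging over all data.
nChunks = 6;                          % e.g. one chunk per person (illustrative)
idx = round(linspace(0, size(Features,1), nChunks + 1));
combined = [];
for k = 1:nChunks
    rows = idx(k)+1 : idx(k+1);
    mdl  = TreeBagger(10, Features(rows,:), FeaturesTarget(rows));
    cmdl = compact(mdl);              % drop training data to save memory
    if isempty(combined)
        combined = cmdl;
    else
        combined = combine(combined, cmdl);   % CompactTreeBagger.combine
    end
    clear mdl cmdl                    % free memory before the next chunk
end
labels = predict(combined, Features(1:5,:));  % combined ensemble predicts as usual
```

This keeps only one chunk's full TreeBagger in memory at a time; whether the merged ensemble reaches the same recognition rate as one trained on all the data at once would need to be checked empirically.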
I still don't get it
39305/765
ans =
51.3791
Regardless, I think you should use dimensionality reduction via feature extraction.
Hope this helps,
Greg
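One common feature-extraction route for this, rather than averaging features by hand, is PCA. A minimal sketch (untested; the 95% variance threshold is just an example of the "limit on the loss of accuracy" idea):

```matlab
% Sketch (untested): reduce the 765 features with PCA before training,
% keeping enough components to explain e.g. 95% of the variance.
[coeff, score, ~, ~, explained, mu] = pca(Features);
nKeep   = find(cumsum(explained) >= 95, 1);  % components meeting the budget
reduced = score(:, 1:nKeep);                 % 248376 x nKeep matrix
Mdl     = TreeBagger(50, reduced, FeaturesTarget);
% New data must be projected with the same coefficients and mean:
% newReduced = (newFeatures - mu) * coeff(:, 1:nKeep);
```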
One solution is to average some of the features for dimensionality reduction, but it may affect the recognition percentage.
Of course it will affect it. However, the way to choose is to set a limit on the acceptable loss of accuracy.


Answers (1)

Add more memory (RAM) to your computer. Then check or adjust Preferences -> MATLAB -> Workspace -> MATLAB array size limit.
Or, you could set the division ratios so that a much smaller fraction is used for training and validation, with most of the data left for testing. This effectively uses only a small subset of the data, but a different small subset each time it trains.
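For the neural network side, the division-ratio idea can be sketched as follows (untested; the hidden layer size is illustrative, and it assumes FeaturesTarget holds positive-integer class labels so it can be one-hot encoded with ind2vec):

```matlab
% Sketch (untested): keep only a small random fraction for training so the
% network fits in memory; most samples are held out as test data.
% patternnet expects samples in columns and one-hot targets.
net = patternnet(10);               % 10 hidden units (illustrative)
net.divideFcn = 'dividerand';       % random split (the default)
net.divideParam.trainRatio = 0.05;  % 5% used for training
net.divideParam.valRatio   = 0.05;  % 5% for validation
net.divideParam.testRatio  = 0.90;  % 90% held out
X = Features';                      % 765 x 248376
T = full(ind2vec(FeaturesTarget')); % one-hot targets from integer labels
net = train(net, X, T);             % a different random subset each run
```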

6 Comments

More memory is not a solution for this; it would need around 36 GB of RAM with all the training data. With division ratios, would I be able to train the same saved model with small portions of training data again and again?
Amazon Web Services, among other providers, offers machines with more than 36 GB of RAM. If you had that much RAM your program would run; therefore adding RAM is a solution to the problem.
This project is not commercial; it's for a university master's degree. Adding RAM is not a solution for me, but thanks for the answer.
https://www.mathworks.com/products/parallel-computing/matlab-parallel-cloud/ 16 workers, 60 Gigabytes, $US 4.32 per hour educational pricing, including compute services.
Or if you provide your own EC2 instance, https://www.mathworks.com/products/parallel-computing/parallel-computing-on-the-cloud/distriben-ec2.html $0.07 per worker per hour for the software licensing from MathWorks. For example you could use https://aws.amazon.com/ec2/pricing/on-demand/ m4.4xlarge, 16 cores, 64 gigabytes, $US 0.958 per hour for the EC2 service. Between that and the $0.07 per worker from MathWorks, it would come to less than $US 2.50 per hour. About the price of a Starbucks "Grande" coffee.
Remember, your time is not really "free". At the very least you need to take into account "opportunity costs" -- like an hour spent fighting a memory issue is an hour you could have been working on a minimum wage job.
Thanks for the advice; I'll keep this in mind if there is no other solution.
Let me put it this way:
  • You do not wish to reduce the number of trees or the data, because doing so might decrease the recognition rate.
  • We do not have a magic low-memory implementation of TreeBagger available.
  • You do not have enough memory on your system to run the classification using the existing software.
Your choices would seem to be:
  • write the classifier yourself, somehow not using as much memory; or
  • obtain more memory for your own system; or
  • obtain use of a system with more memory
