How to improve regression models for a dataset with too many variables?
Hi. I'm quite new to machine learning, and my problem is to fit a regression model (either linear or non-linear).
My X data is spectrophotometric data with 117 observations and 15956 variables. My Y data is 117-by-1.
I have tried most of the models I could figure out myself, including one-way PLS, N-way PLS, a neural network, a regression tree and a bagging ensemble. However, the PLS models underfit while the others overfit, and my RPD has never exceeded 2.0. I realize that this is probably because I have far too many variables for so few observations. Is there a way to improve the models without reducing the dimensions (for example, without using only PCA coefficients for the regression)?
Thank you.
5 Comments
dpb
2022-8-10
Well, you came here asking for help, and we can't help unless we have some idea of what it is we're trying to help with...
With that many variables and so few observations, there's bound to be correlation just by random chance.
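To illustrate the chance-correlation point, here is a minimal sketch in Python (standard library only). The function names and the use of Gaussian noise are assumptions made for illustration; nothing here comes from the original spectra. It draws a pure-noise response and many independent pure-noise predictors for 117 observations, then reports the largest absolute Pearson correlation it finds.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def max_spurious_correlation(n_obs, n_vars, seed=0):
    """Largest |r| between a noise response and n_vars independent noise predictors."""
    rng = random.Random(seed)
    # Hypothetical response: pure noise, unrelated to any predictor.
    y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        # Each predictor is generated independently of y.
        x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        best = max(best, abs(pearson(x, y)))
    return best


if __name__ == "__main__":
    # 117 observations as in the question; the variable counts are arbitrary.
    for p in (10, 100, 1000, 5000):
        print(p, round(max_spurious_correlation(117, p), 3))
```

Since every predictor is noise, any sizeable correlation in the output is spurious. With a fixed seed the maximum can only stay the same or grow as more variables are added, which is why a model that is free to pick among thousands of variables can look good on the training data by chance alone.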
Accepted Answer
the cyclist
2022-8-10
@NC_, as I expect you realize, your question is not really a MATLAB question, but a generic machine learning question. You are asking how to handle the ML problem of "p >> n", where p is the number of features and n is the number of observations (here p = 15956 and n = 117). It's common in certain domains (e.g. analyses that use gene expression as features).
There are lots of ways to approach the problem, and it doesn't make sense to try to bring all of that into this forum. I'm not really an ML expert myself, so I may not be able to give you the best pointers, but this page gives a good overview of the issue, and has references for more info.
2 Comments
the cyclist
2022-8-10
I don't want to leave you with the impression that MATLAB doesn't have the tools to handle this type of problem. For example, you can read more at this documentation page about feature selection. But you've stated that you don't want to reduce the dimension.
But, given that the biggest risk in p >> n problems is overfitting, you almost always have to do something akin to feature reduction (or at least regularization). So, I continue to think that you don't quite yet have a MATLAB question. Especially since you are working on a research problem, you need to get your arms wrapped around the theory of what you are doing first.
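As one concrete picture of regularization that keeps every variable, here is a minimal ridge regression sketch in Python (standard library only), run on synthetic data. The helper names (`solve`, `ridge_fit`, `make_data`), the data sizes and the penalty grid are all assumptions for illustration, not a recommended setup for the spectra in the question. It uses the dual form of ridge, which solves an n-by-n system instead of a p-by-p one, so all p coefficients are retained.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """Ridge weights via the dual form w = X^T (X X^T + lam I)^-1 y.

    No intercept is fitted, so the data are assumed to be centred.
    """
    n, p = len(X), len(X[0])
    K = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += lam  # the ridge penalty enters on the diagonal
    alpha = solve(K, y)
    return [sum(alpha[i] * X[i][j] for i in range(n)) for j in range(p)]


def predict(X, w):
    return [sum(a * b for a, b in zip(row, w)) for row in X]


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def make_data(n, p, rng, n_informative=5, noise=0.5):
    """Synthetic data: only the first n_informative columns affect y."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(row[:n_informative]) + rng.gauss(0.0, noise) for row in X]
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    X_train, y_train = make_data(30, 200, rng)  # p >> n, as in the question
    X_test, y_test = make_data(30, 200, rng)
    for lam in (1e-6, 1.0, 10.0, 100.0, 1000.0):
        w = ridge_fit(X_train, y_train, lam)
        print(lam,
              round(rmse(y_train, predict(X_train, w)), 3),
              round(rmse(y_test, predict(X_test, w)), 3))
```

With a near-zero penalty the model reproduces the training data almost exactly, which is the overfitting described above; increasing the penalty raises the training error and shrinks the coefficients. The penalty that gives the best held-out error depends on the data and should be chosen by cross-validation rather than read off this toy example. In MATLAB, the Statistics and Machine Learning Toolbox functions `ridge` and `lasso` provide regularized linear regression.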
More Answers (0)