Feature selection is a dimensionality reduction technique that selects a subset of features (predictor variables) that provide the best predictive power in modeling a set of data.
Feature selection can be used to:
- Prevent overfitting: avoid modeling with an excessive number of features, which makes a model more susceptible to rote-learning specific training examples
- Reduce model size: increase computational performance with high-dimensional data, or prepare a model for embedded deployment where memory may be limited
- Improve interpretability: use fewer features, which may help identify those that affect model behavior
There are several common approaches to feature selection.
Iteratively change the feature set to optimize performance or loss
Stepwise regression sequentially adds or removes features until there is no improvement in prediction. It is used with linear regression or generalized linear regression algorithms. Similarly, sequential feature selection builds up a feature set until accuracy (or a custom performance measure) stops improving.
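The forward variant of sequential feature selection can be sketched in a few lines. The following is a minimal illustration in plain Python, not the MATLAB implementation: the names `loo_accuracy` and `sequential_forward_selection` are hypothetical, and the criterion (leave-one-out accuracy of a 1-nearest-neighbor classifier) is an assumption chosen only to keep the example self-contained.

```python
def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier using only `cols`."""
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def sequential_forward_selection(X, y, score=loo_accuracy):
    """Greedily add the feature that most improves `score`; stop when nothing improves."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        # Score every candidate feature when added to the current set.
        candidates = [(score(X, y, selected + [c]), c) for c in remaining]
        # Prefer the highest score; break ties in favor of the lowest column index.
        top_score, top_col = max(candidates, key=lambda t: (t[0], -t[1]))
        if top_score <= best_score:
            break  # no improvement, so stop growing the feature set
        selected.append(top_col)
        remaining.remove(top_col)
        best_score = top_score
    return selected, best_score


# Toy data (made up for illustration): column 0 separates the classes, column 1 is noise.
X = [[0.0, 5], [0.1, 1], [0.2, 4], [0.3, 2], [1.0, 5], [1.1, 1], [1.2, 4], [1.3, 2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(sequential_forward_selection(X, y))
```

Because the search stops at the first step without improvement, the result is a selected subset rather than a ranking of all features.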
Rank features based on intrinsic characteristics
These methods estimate a ranking of the features, which in turn can be used to select the top few ranked features. Minimum redundancy maximum relevance (MRMR) finds features that maximize mutual information between the features and the response variable and minimize mutual information between the features themselves. Related methods rank features according to Laplacian scores or use a statistical test of whether a single feature is independent of the response to determine feature importance.
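To make the relevance and redundancy trade-off concrete, here is a minimal sketch of a greedy MRMR ranking for discrete features in plain Python. It is an illustration under stated assumptions, not the algorithm used by any particular toolbox function: the names `mutual_information` and `mrmr_rank` are hypothetical, and the score is the difference form (relevance minus mean redundancy), whereas other implementations may use a quotient.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log(c * n / (pa[x] * pb[v])) for (x, v), c in pab.items())


def mrmr_rank(columns, response):
    """Rank features (given column-wise) by relevance minus mean redundancy."""
    relevance = [mutual_information(col, response) for col in columns]
    ranked = []
    remaining = list(range(len(columns)))
    while remaining:
        def score(i):
            if not ranked:
                return relevance[i]  # first pick: relevance only
            redundancy = sum(
                mutual_information(columns[i], columns[j]) for j in ranked
            ) / len(ranked)
            return relevance[i] - redundancy

        best = max(remaining, key=score)  # ties go to the lowest index
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Toy data (made up for illustration): feature 1 duplicates feature 0,
# feature 2 carries independent information about the response.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
response = [0, 0, 1, 1, 2, 2, 3, 3]
print(mrmr_rank([a, list(a), b], response))
```

In this toy case all three features are equally relevant on their own, but the duplicate is ranked last because it is fully redundant with a feature already chosen.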
Neighborhood Component Analysis (NCA) and ReliefF
These methods determine feature weights by maximizing prediction accuracy based on pairwise distances and penalizing predictors that lead to misclassification.
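The idea of distance-based feature weights can be illustrated with the original Relief scheme, a simpler predecessor of ReliefF. The sketch below is in plain Python and rests on several assumptions: two classes, features already scaled to a comparable range, one nearest hit and one nearest miss per observation, and a pass over every observation instead of random sampling. The name `relief_weights` is hypothetical, and this is neither ReliefF itself nor NCA.

```python
def relief_weights(X, y):
    """Basic Relief: reward features that differ on the nearest miss,
    penalize features that differ on the nearest hit."""
    n, p = len(X), len(X[0])
    weights = [0.0] * p

    def distance(u, v):
        return sum(abs(s - t) for s, t in zip(u, v))  # Manhattan distance

    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Nearest neighbor with the same class (hit) and with a different class (miss).
        hit = min((j for j in others if y[j] == y[i]), key=lambda j: distance(X[i], X[j]))
        miss = min((j for j in others if y[j] != y[i]), key=lambda j: distance(X[i], X[j]))
        for f in range(p):
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


# Toy data (made up for illustration): column 0 tracks the class, column 1 does not.
X = [[0.0, 0.2], [0.1, 0.9], [0.2, 0.4], [0.8, 0.3], [0.9, 0.8], [1.0, 0.5]]
y = [0, 0, 0, 1, 1, 1]
print(relief_weights(X, y))
```

A larger weight suggests a feature that separates neighboring observations of different classes; a negative weight suggests a feature that varies more within a class than between classes.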
Learn feature importance along with the model
Some supervised machine learning algorithms estimate feature importance during the training process. Those estimates can be used to rank features after the training is completed. Models with built-in feature selection include linear SVMs, decision trees and their ensembles (boosted trees and random forests), and generalized linear models. Similarly, in lasso regularization a shrinkage estimator reduces the weights (coefficients) of redundant features to zero during training.
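The way lasso drives coefficients to exactly zero can be seen in a small coordinate descent sketch. This is a plain Python illustration, not a toolbox implementation: the names `soft_threshold` and `lasso_coordinate_descent` are hypothetical, the objective is assumed to be (1/(2n))·||y − Xw||² + λ·||w||₁ with no intercept, and every feature column is assumed to be nonzero.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; anything inside the band becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for the lasso objective described above."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # feature j times the residual that ignores feature j
            z = 0.0    # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Toy data (made up for illustration): y depends only on the first column.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]
print(lasso_coordinate_descent(X, y, lam=0.5))
```

The soft-thresholding step is what performs the selection: a feature whose correlation with the remaining residual stays below λ receives a coefficient of exactly zero, while the surviving coefficient is shrunk slightly toward zero.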
MATLAB® supports the following feature selection methods:
|Algorithm|Training speed|Types of models|Accuracy|Caveats|
|---|---|---|---|---|
|NCA|Moderate|Better for distance-based models|High|Needs manual tuning of regularization lambda|
|MRMR|Fast|Any|High|Only for classification|
|ReliefF|Moderate|Better for distance-based models|Medium|Unable to differentiate correlated predictors|
|Sequential|Slow|Any|High|Doesn’t rank all features|
|F test|Fast|Any|Medium|For regression. Unable to differentiate correlated predictors.|
|Chi-square|Fast|Any|Medium|For classification. Unable to differentiate correlated predictors.|
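The caveat on the univariate tests follows from how they work: each feature is scored against the response on its own, so two correlated predictors receive similar scores. A minimal sketch of the chi-square statistic for a discrete feature, in plain Python, shows this. The name `chi_square_statistic` is hypothetical, and the example ranks by the raw statistic instead of a p-value, which is only comparable across features with the same number of levels.

```python
from collections import Counter


def chi_square_statistic(feature, labels):
    """Pearson chi-square statistic of the feature-by-class contingency table."""
    n = len(feature)
    f_counts, l_counts = Counter(feature), Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for f_val, f_n in f_counts.items():
        for l_val, l_n in l_counts.items():
            expected = f_n * l_n / n  # count expected under independence
            stat += (observed[(f_val, l_val)] - expected) ** 2 / expected
    return stat


# Toy data (made up for illustration).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0, 0, 0, 0, 1, 1, 1, 1]
duplicate = list(informative)          # perfectly correlated with `informative`
noise = [0, 1, 0, 1, 0, 1, 0, 1]
for name, col in [("informative", informative), ("duplicate", duplicate), ("noise", noise)]:
    print(name, chi_square_statistic(col, labels))
```

The informative feature and its duplicate receive the same score, so a ranking built this way would keep both even though the second adds nothing.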
As an alternative to feature selection, feature transformation techniques transform existing features into new features (predictor variables) with the less descriptive features dropped. Feature transformation approaches include:
- Principal component analysis (PCA), used to summarize data in fewer dimensions by projection onto a unique orthogonal basis
- Factor analysis, used to build explanatory models of data correlations
- Nonnegative matrix factorization, used when model terms must represent nonnegative values such as physical quantities
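As an illustration of how a transformation differs from selection, the sketch below computes the first principal component of a small data set in plain Python using power iteration on the sample covariance matrix. It is a teaching sketch under stated assumptions, not how a numerical library would compute PCA (those typically use a singular value decomposition): the name `first_principal_component` is hypothetical, and the starting vector is assumed not to be orthogonal to the leading eigenvector.

```python
def first_principal_component(X, n_iter=200):
    """Return the leading principal direction and the projected scores."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    v = [1.0] * p  # starting vector (assumed to overlap with the leading eigenvector)
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    # Each observation is summarized by one new feature: its projection onto v.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores


# Toy data (made up for illustration): two strongly correlated columns.
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
direction, scores = first_principal_component(X)
print(direction)
print(scores)
```

The resulting feature is a weighted combination of both original columns, which is why transformed features can be harder to interpret than a selected subset.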
For more information on feature selection with MATLAB, including machine learning, regression, and transformation, see Statistics and Machine Learning Toolbox™.
- Feature selection is an advanced technique to boost model performance (especially on high-dimensional data), improve interpretability, and reduce model size.
- Consider one of the models with “built-in” feature selection first. Otherwise, MRMR works well for classification.
Feature selection can help select a reasonable subset from hundreds of features automatically generated by applying wavelet scattering. The figure below shows the ranking of the top 50 features obtained by applying the MATLAB function fscmrmr to automatically generated wavelet features from human activity sensor data.