Model Building and Assessment

Synthetic data generation, feature selection, feature engineering, model selection, hyperparameter optimization, cross-validation, predictive performance evaluation, and classification accuracy comparison tests

When you build a high-quality predictive classification model, it is important to select the right features (or predictors) and tune hyperparameters (model parameters that are set before training rather than estimated from the data). Feature selection and hyperparameter tuning can yield multiple candidate models. You can compare the k-fold misclassification rates, receiver operating characteristic (ROC) curves, or confusion matrices among the models. Alternatively, you can conduct a statistical test to detect whether one classification model significantly outperforms another.
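As an illustration of such a comparison, the following is a minimal sketch rather than a recommended workflow. It assumes the fisheriris sample data set that ships with the toolbox, and the two candidate models (a decision tree and a 5-nearest-neighbor classifier) are arbitrary choices made only for this example. It computes the k-fold misclassification rate of each model on a shared partition, a confusion matrix from out-of-fold predictions, and a repeated cross-validation test with testckfold.

```matlab
% Sketch: compare two classifiers on the fisheriris sample data.
% The model choices and the seed are illustrative assumptions.
load fisheriris          % meas: 150-by-4 predictors, species: 150-by-1 labels
rng(1)                   % fix the seed so the partition is reproducible

% Use one 5-fold partition for both models so the error rates are comparable.
c = cvpartition(species, 'KFold', 5);

% Cross-validate the two candidate classifiers.
cvTree = fitctree(meas, species, 'CVPartition', c);
cvKnn  = fitcknn(meas, species, 'NumNeighbors', 5, 'CVPartition', c);

% k-fold misclassification rate of each model.
errTree = kfoldLoss(cvTree);
errKnn  = kfoldLoss(cvKnn);

% Confusion matrix of the tree, built from its out-of-fold predictions.
predTree = kfoldPredict(cvTree);
confTree = confusionmat(species, predTree);

% Statistical comparison by repeated cross-validation.
% h = 1 rejects the hypothesis that the two models have equal accuracy.
tTree = templateTree();
tKnn  = templateKNN('NumNeighbors', 5);
[h, p] = testckfold(tTree, tKnn, meas, meas, species);
```

Sharing the partition object between the two fits keeps the fold assignments identical, so differences in errTree and errKnn reflect the models rather than the split.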

You can perform the following actions to build and assess classification models:

  • Generate synthetic data from an existing data set before training a classification model by using synthesizeTabularData or binningTabularSynthesizer.

  • Engineer new features before training a classification model by using gencfeatures.

  • Build and assess classification models interactively by using the Classification Learner app.

  • Automatically select a model with tuned hyperparameters by using fitcauto. This function tries a selection of classification model types with different hyperparameter values and returns a final model that is expected to perform well on new data. Use fitcauto when you are uncertain which classifier types best suit your data.

  • Tune hyperparameters of a specific model by selecting the hyperparameter values and cross-validating the model using those values. For example, to tune an SVM model, choose a set of box constraints and kernel scales, and then cross-validate a model for each pair of values. Certain Statistics and Machine Learning Toolbox™ classification functions offer automatic hyperparameter tuning through Bayesian optimization, grid search, or random search. bayesopt, the main function for implementing Bayesian optimization, is flexible enough for many other applications as well. See Bayesian Optimization Workflow.

  • Interpret a classification model by using lime, shapley, and plotPartialDependence.
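The manual tuning step described in the list above (choosing box constraints and kernel scales for an SVM and cross-validating a model for each pair) can be sketched as follows. This is a minimal example under stated assumptions: it uses the ionosphere sample data set that ships with the toolbox, a Gaussian (RBF) kernel, and a small grid whose values are arbitrary and chosen only for illustration.

```matlab
% Sketch: manual grid search over SVM box constraint and kernel scale.
% The grid values below are illustrative assumptions, not recommendations.
load ionosphere          % X: 351-by-34 predictors, Y: 351-by-1 labels ('b'/'g')
rng(1)                   % fix the seed so the partition is reproducible

% One shared 5-fold partition so every grid point sees the same folds.
c = cvpartition(Y, 'KFold', 5);

boxConstraints = [0.1 1 10];
kernelScales   = [0.5 2 8];
cvErr = zeros(numel(boxConstraints), numel(kernelScales));

for i = 1:numel(boxConstraints)
    for j = 1:numel(kernelScales)
        % Cross-validate an RBF-kernel SVM for this pair of values.
        cvMdl = fitcsvm(X, Y, ...
            'KernelFunction', 'rbf', ...
            'BoxConstraint', boxConstraints(i), ...
            'KernelScale', kernelScales(j), ...
            'Standardize', true, ...
            'CVPartition', c);
        cvErr(i, j) = kfoldLoss(cvMdl);   % k-fold misclassification rate
    end
end

% Pick the pair with the lowest cross-validated error.
[bestErr, idx] = min(cvErr(:));
[iBest, jBest] = ind2sub(size(cvErr), idx);
bestBox   = boxConstraints(iBest);
bestScale = kernelScales(jBest);
```

For the automatic route mentioned in the same list item, functions such as fitcsvm accept the 'OptimizeHyperparameters' name-value argument, which replaces the explicit loops shown here.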

Apps

Classification Learner - Train models to classify data using supervised machine learning

Functions

synthesizeTabularData - Synthesize tabular data (Since R2024b)
binningTabularSynthesizer - Binning-based synthesizer for tabular data synthesis (Since R2024b)
synthesizeTabularData - Synthesize tabular data using binning-based synthesizer (Since R2024b)
mmdtest - Two-sample multivariate hypothesis test using maximum mean discrepancy (MMD) (Since R2024b)
fscchi2 - Univariate feature ranking for classification using chi-square tests (Since R2020a)
fscmrmr - Rank features for classification using minimum redundancy maximum relevance (MRMR) algorithm
fscnca - Feature selection using neighborhood component analysis for classification
oobPermutedPredictorImportance - Out-of-bag predictor importance estimates for random forest of classification trees by permutation
permutationImportance - Predictor importance by permutation (Since R2024a)
predictorImportance - Estimates of predictor importance for classification tree
predictorImportance - Estimates of predictor importance for classification ensemble of decision trees
relieff - Rank importance of predictors using ReliefF or RReliefF algorithm
selectFeatures - Select important features for NCA classification or regression (Since R2023b)
sequentialfs - Sequential feature selection using custom criterion
gencfeatures - Perform automated feature engineering for classification (Since R2021a)
describe - Describe generated features (Since R2021a)
transform - Transform new data using generated features (Since R2021a)
fitcauto - Automatically select classification model with optimized hyperparameters (Since R2020a)
bayesopt - Select optimal machine learning hyperparameters using Bayesian optimization
hyperparameters - Variable descriptions for optimizing a fit function
optimizableVariable - Variable description for bayesopt or other optimizers
learnersize - Compact size of trained machine learning model object (Since R2024b)
plot - Plot aggregated hyperparameter optimization results (Since R2024b)
resume - Resume hyperparameter optimization problems (Since R2024b)
summary - Summary table for AggregateBayesianOptimization object (Since R2024b)
crossval - Estimate loss using cross-validation
cvpartition - Partition data for cross-validation
repartition - Repartition data for cross-validation
test - Test indices for cross-validation
training - Training indices for cross-validation
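
The partition functions listed above can be combined as in the following minimal sketch. It assumes the fisheriris sample data set that ships with the toolbox; the 30% holdout fraction and the k-nearest-neighbor classifier are arbitrary choices made only for illustration.

```matlab
% Sketch: holdout validation with cvpartition, training, and test.
% The holdout fraction and the classifier are illustrative assumptions.
load fisheriris          % meas: 150-by-4 predictors, species: 150-by-1 labels
rng(1)                   % fix the seed so the split is reproducible

% Reserve 30% of the observations for testing.
c = cvpartition(species, 'Holdout', 0.3);
idxTrain = training(c);  % logical index of training observations
idxTest  = test(c);      % logical index of test observations

% Train on the training set only.
mdl = fitcknn(meas(idxTrain, :), species(idxTrain), 'NumNeighbors', 5);

% Misclassification rate on the held-out observations.
predLabels = predict(mdl, meas(idxTest, :));
testErr = mean(~strcmp(predLabels, species(idxTest)));
```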

Local Interpretable Model-Agnostic Explanations (LIME)

lime - Local interpretable model-agnostic explanations (LIME) (Since R2020b)
fit - Fit simple model of local interpretable model-agnostic explanations (LIME) (Since R2020b)
plot - Plot results of local interpretable model-agnostic explanations (LIME) (Since R2020b)

Shapley Values

shapley - Shapley values (Since R2021a)
fit - Compute Shapley values for query points (Since R2021a)
plot - Plot Shapley values using bar graphs (Since R2021a)
boxchart - Visualize Shapley values using box charts (box plots) (Since R2024a)
plotDependence - Plot dependence of Shapley values on predictor values (Since R2024b)
swarmchart - Visualize Shapley values using swarm scatter charts (Since R2024a)

Partial Dependence

partialDependence - Compute partial dependence (Since R2020b)
plotPartialDependence - Create partial dependence plot (PDP) and individual conditional expectation (ICE) plots

Confusion Matrix

confusionchart - Create confusion matrix chart for classification problem
confusionmat - Compute confusion matrix for classification problem
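
To make the confusion matrix layout concrete, here is a small sketch with hand-made labels. The label vectors are invented for illustration and do not come from any model.

```matlab
% Sketch: confusion matrix for nine hand-made observations in three classes.
% The label vectors are invented purely for illustration.
trueLabels = [1 1 1 2 2 2 3 3 3]';
predLabels = [1 1 2 2 2 3 3 3 1]';

% C(i,j) counts observations whose true class is order(i)
% and whose predicted class is order(j).
[C, order] = confusionmat(trueLabels, predLabels);

% Overall accuracy is the fraction of counts on the diagonal.
accuracy = sum(diag(C)) / sum(C(:));

% Here C is [2 1 0; 0 2 1; 1 0 2] and accuracy is 6/9.
```

Passing the same two label vectors to confusionchart displays the matrix as a chart instead of returning the counts.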

Receiver Operating Characteristic (ROC) Curve

rocmetrics - Receiver operating characteristic (ROC) curve and performance metrics for binary and multiclass classifiers (Since R2022a)
addMetrics - Compute additional classification performance metrics (Since R2022a)
auc - Area under ROC curve or precision-recall curve (Since R2024b)
average - Compute performance metrics for average receiver operating characteristic (ROC) curve in multiclass problem (Since R2022a)
modelOperatingPoint - Operating point of rocmetrics object (Since R2024b)
plot - Plot receiver operating characteristic (ROC) curves and other performance curves (Since R2022a)
perfcurve - Receiver operating characteristic (ROC) curve or other performance curve for classifier output
testcholdout - Compare predictive accuracies of two classification models
testckfold - Compare accuracies of two classification models by repeated cross-validation

Objects

FeatureSelectionNCAClassification - Feature selection for classification using neighborhood component analysis (NCA)
FeatureTransformer - Generated feature transformations (Since R2021a)
BayesianOptimization - Bayesian optimization results
HyperparameterOptimizationOptions - Hyperparameter optimization options (Since R2024b)
AggregateBayesianOptimization - Aggregate Bayesian optimization results (Since R2024b)

Properties

ConfusionMatrixChart Properties - Confusion matrix chart appearance and behavior
ROCCurve Properties - Receiver operating characteristic (ROC) curve appearance and behavior (Since R2022a)

Topics

Classification Learner App

Feature Selection

Feature Engineering

Automated Model Selection

Hyperparameter Optimization

Model Interpretation

Cross-Validation

Classification Performance Evaluation