Model Building and Assessment

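Synthetic data generation, feature selection, feature engineering, model selection, hyperparameter optimization, cross-validation, predictive performance evaluation, and classification accuracy comparison tests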

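When you build a high-quality predictive classification model, it is important to select the right features (or predictors) and to tune the hyperparameters (model parameters that are not estimated during fitting). Feature selection and hyperparameter tuning can yield multiple candidate models. You can compare these models by their k-fold misclassification rates, receiver operating characteristic (ROC) curves, or confusion matrices. You can also conduct a statistical test to detect whether one classification model significantly outperforms another.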
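
As a rough illustration of these comparisons, the following sketch evaluates two classifiers on the same cross-validation partition. It assumes the fisheriris sample data set that ships with the toolbox. The choice of a tree and a 5-nearest-neighbor model, the number of folds, and all variable names are illustrative assumptions, not recommendations.

```matlab
% Sketch: compare two candidate classifiers on the Fisher iris sample data.
load fisheriris                         % meas (150-by-4 predictors), species (150-by-1 labels)
rng(0)                                  % fix the seed so the folds are reproducible
cvp = cvpartition(species,'KFold',5);   % stratified 5-fold partition shared by both models

% Cross-validate each model on the same partition
treeCV = fitctree(meas,species,'CVPartition',cvp);
knnCV  = fitcknn(meas,species,'NumNeighbors',5,'CVPartition',cvp);

% k-fold misclassification rate of each model
treeErr = kfoldLoss(treeCV);
knnErr  = kfoldLoss(knnCV);

% Confusion matrix built from the out-of-fold predictions of the tree model
treePred = kfoldPredict(treeCV);
C = confusionmat(species,treePred);

% Statistical comparison by repeated cross-validation (default 5-by-2 paired F test).
% h is true if the test rejects the hypothesis that the two models are equally accurate.
[h,p] = testckfold(templateTree,templateKNN('NumNeighbors',5),meas,meas,species);
```

Using one shared partition means the two error rates are computed on identical folds, so they are directly comparable. testckfold creates its own partitions internally.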

To build and assess classification models, you can:

  • Generate synthetic data from an existing data set by using synthesizeTabularData or binningTabularSynthesizer before training a classification model.

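  • Engineer new features by using gencfeatures before training a classification model.

  • Build and assess classification models interactively by using the Classification Learner app.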

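  • Automatically select a model with tuned hyperparameters by using fitcauto. This function tries a selection of classification model types with different hyperparameter values and returns a final model that is expected to perform well on new data. Use fitcauto when you are unsure which classifier types best suit your data.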

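  • Tune the hyperparameters of a specific model by selecting hyperparameter values and cross-validating the model with those values. For example, to tune an SVM model, choose a set of box constraints and kernel scales, and then cross-validate a model for each pair of values. Certain Statistics and Machine Learning Toolbox™ classification functions offer automatic hyperparameter tuning through Bayesian optimization, grid search, or random search. bayesopt, the main function for implementing Bayesian optimization, is also flexible enough for many other applications. See Bayesian Optimization Workflow.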

  • Interpret a classification model by using lime, shapley, and plotPartialDependence.
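
The manual SVM tuning described in the list above can be sketched as a small grid search. This is a minimal example, assuming the ionosphere sample data set that ships with the toolbox. The candidate box constraints and kernel scales, the RBF kernel, and all variable names are hypothetical choices made only for illustration.

```matlab
% Sketch: grid search over box constraint and kernel scale for a binary SVM.
load ionosphere                         % X (351-by-34 predictors), Y (351-by-1 labels 'b' or 'g')
rng(1)                                  % fix the seed so the folds are reproducible
cvp = cvpartition(Y,'KFold',5);         % one partition reused for every candidate pair

boxValues   = [0.1 1 10];               % hypothetical candidate box constraints
scaleValues = [0.5 1 5];                % hypothetical candidate kernel scales
cvLoss = zeros(numel(boxValues),numel(scaleValues));

for i = 1:numel(boxValues)
    for j = 1:numel(scaleValues)
        % Cross-validate an RBF SVM for this (box constraint, kernel scale) pair
        cvMdl = fitcsvm(X,Y,'KernelFunction','rbf', ...
            'BoxConstraint',boxValues(i),'KernelScale',scaleValues(j), ...
            'Standardize',true,'CVPartition',cvp);
        cvLoss(i,j) = kfoldLoss(cvMdl); % k-fold misclassification rate
    end
end

% Pick the pair with the lowest cross-validated error
[minLoss,idx] = min(cvLoss(:));
[iBest,jBest] = ind2sub(size(cvLoss),idx);
bestBox   = boxValues(iBest);
bestScale = scaleValues(jBest);
```

Reusing the same cvpartition object for every pair keeps the comparison fair, because each candidate is scored on identical folds. For the automated route that the list mentions, fitcsvm also accepts the 'OptimizeHyperparameters' name-value argument.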

Apps

Classification Learner - Train models to classify data using supervised machine learning

Functions

synthesizeTabularData - Synthesize tabular data (since R2024b)
binningTabularSynthesizer - Binning-based synthesizer for tabular data synthesis (since R2024b)
synthesizeTabularData - Synthesize tabular data using binning-based synthesizer (since R2024b)
mmdtest - Two-sample multivariate hypothesis test using maximum mean discrepancy (MMD) (since R2024b)
knntest - Two-sample multivariate hypothesis test using k-nearest neighbors (KNN) (since R2025a)
fscchi2 - Univariate feature ranking for classification using chi-square tests
fscmrmr - Rank features for classification using minimum redundancy maximum relevance (MRMR) algorithm
fscnca - Feature selection using neighborhood component analysis for classification
oobPermutedPredictorImportance - Out-of-bag predictor importance estimates for random forest of classification trees by permutation
permutationImportance - Predictor importance by permutation (since R2024a)
predictorImportance - Estimates of predictor importance for classification tree
predictorImportance - Estimates of predictor importance for classification ensemble of decision trees
relieff - Rank importance of predictors using ReliefF or RReliefF algorithm
selectFeatures - Select important features for NCA classification or regression (since R2023b)
sequentialfs - Sequential feature selection using custom criterion
gencfeatures - Perform automated feature engineering for classification (since R2021a)
describe - Describe generated features (since R2021a)
transform - Transform new data using generated features (since R2021a)
fitcauto - Automatically select classification model with optimized hyperparameters
bayesopt - Select optimal machine learning hyperparameters using Bayesian optimization
hyperparameters - Variable descriptions for optimizing a fit function
optimizableVariable - Variable description for bayesopt or other optimizers
learnersize - Compact size of trained machine learning model object (since R2024b)
plot - Plot aggregated hyperparameter optimization results (since R2024b)
resume - Resume hyperparameter optimization problems (since R2024b)
summary - Summary table for AggregateBayesianOptimization object (since R2024b)
crossval - Estimate loss using cross-validation
cvpartition - Partition data for cross-validation
repartition - Repartition data for cross-validation
summary - Summarize cross-validation partition with stratification or grouping variable (since R2025a)
test - Test indices for cross-validation
training - Training indices for cross-validation

Local Interpretable Model-Agnostic Explanations (LIME)

lime - Local interpretable model-agnostic explanations (LIME)
fit - Fit simple model of local interpretable model-agnostic explanations (LIME)
plot - Plot results of local interpretable model-agnostic explanations (LIME)

Shapley Values

shapley - Shapley values (since R2021a)
fit - Compute Shapley values for query points (since R2021a)
plot - Plot Shapley values using bar graphs (since R2021a)
boxchart - Visualize Shapley values using box charts (box plots) (since R2024a)
plotDependence - Plot dependence of Shapley values on predictor values (since R2024b)
swarmchart - Visualize Shapley values using swarm scatter charts (since R2024a)

Partial Dependence

partialDependence - Compute partial dependence
plotPartialDependence - Create partial dependence plot (PDP) and individual conditional expectation (ICE) plots

Confusion Matrices

confusionchart - Create confusion matrix chart for classification problem
confusionmat - Compute confusion matrix for classification problem

Receiver Operating Characteristic (ROC) Curves

rocmetrics - Receiver operating characteristic (ROC) curve and performance metrics for binary and multiclass classifiers (since R2022a)
addMetrics - Compute additional classification performance metrics (since R2022a)
auc - Area under ROC curve or precision-recall curve (since R2024b)
average - Compute performance metrics for average receiver operating characteristic (ROC) curve in multiclass problem (since R2022a)
modelOperatingPoint - Operating point of rocmetrics object (since R2024b)
plot - Plot receiver operating characteristic (ROC) curves and other performance curves (since R2022a)
perfcurve - Receiver operating characteristic (ROC) curve or other performance curve for classifier output
testcholdout - Compare predictive accuracies of two classification models
testckfold - Compare accuracies of two classification models by repeated cross-validation

Objects

FeatureSelectionNCAClassification - Feature selection for classification using neighborhood component analysis (NCA)
FeatureTransformer - Generated feature transformations (since R2021a)
BayesianOptimization - Bayesian optimization results
HyperparameterOptimizationOptions - Hyperparameter optimization options (since R2024b)
AggregateBayesianOptimization - Aggregate Bayesian optimization results (since R2024b)

Properties

ConfusionMatrixChart Properties - Confusion matrix chart appearance and behavior
ROCCurve Properties - Receiver operating characteristic (ROC) curve appearance and behavior (since R2022a)

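Topics

Classification Learner App

Feature Selection

Feature Engineering

Automated Model Selection

Hyperparameter Optimization

Model Interpretation

Cross-Validation

Classification Performance Evaluation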