crossval

Cross-validate naive Bayes classifier

Description


CVMdl = crossval(Mdl) returns a cross-validated (partitioned) naive Bayes classifier (CVMdl) from a trained naive Bayes classifier (Mdl). By default, crossval uses 10-fold cross-validation on the training data to create CVMdl, a ClassificationPartitionedModel classifier.


CVMdl = crossval(Mdl,Name,Value) returns a partitioned naive Bayes classifier with additional options specified by one or more name-value pair arguments. For example, you can specify the number of folds or a holdout sample proportion.

Examples


Create a cross-validated naive Bayes classifier model for Fisher's iris data set. Then, assess the classification performance by estimating the generalization error of the model.

Load the fisheriris data set. Create X as a numeric matrix that contains four petal measurements for 150 irises. Create Y as a cell array of character vectors that contains the corresponding iris species.

load fisheriris
X = meas;
Y = species;
rng('default') % for reproducibility

Train a naive Bayes classifier using the predictors X and class labels Y. A recommended practice is to specify the class names. By default, fitcnb assumes that the predictors are conditionally independent given the class and fits each predictor using a normal distribution.

Mdl = fitcnb(X,Y,'ClassNames',{'setosa','versicolor','virginica'});

Mdl is a trained ClassificationNaiveBayes classifier.

Cross-validate the classifier using 10-fold cross-validation.

CVMdl = crossval(Mdl)
CVMdl = 
  ClassificationPartitionedModel
    CrossValidatedModel: 'NaiveBayes'
         PredictorNames: {'x1'  'x2'  'x3'  'x4'}
           ResponseName: 'Y'
        NumObservations: 150
                  KFold: 10
              Partition: [1x1 cvpartition]
             ClassNames: {'setosa'  'versicolor'  'virginica'}
         ScoreTransform: 'none'



CVMdl is a ClassificationPartitionedModel cross-validated naive Bayes classifier.

Return the first model of the 10 trained classifiers.

FirstModel = CVMdl.Trained{1}
FirstModel = 
  CompactClassificationNaiveBayes
              ResponseName: 'Y'
     CategoricalPredictors: []
                ClassNames: {'setosa'  'versicolor'  'virginica'}
            ScoreTransform: 'none'
         DistributionNames: {'normal'  'normal'  'normal'  'normal'}
    DistributionParameters: {3x4 cell}



FirstModel is a CompactClassificationNaiveBayes model.

Estimate the cross-validated loss of the classifier. You can estimate the generalization error by passing CVMdl to kfoldLoss.

CVMdlloss = kfoldLoss(CVMdl)
CVMdlloss = 0.0467

The cross-validated loss is approximately 5%. You can expect Mdl to have a similar error rate.
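To see how the error varies from fold to fold, you can also request the per-fold losses. As a sketch (assuming the same CVMdl created above):

```matlab
% Per-fold classification errors: a 10-by-1 vector, one entry per fold
foldLosses = kfoldLoss(CVMdl,'Mode','individual');

% The default kfoldLoss estimate is the average of the per-fold errors
meanLoss = mean(foldLosses);
```

A large spread among the per-fold errors can indicate that the estimate is sensitive to the particular partition.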

Specify a holdout sample proportion for cross-validation. By default, crossval uses 10-fold cross-validation to cross-validate a naive Bayes classifier. However, you have several other options for cross-validation. For example, you can specify a different number of folds or a holdout sample proportion.

Load the ionosphere data set. Remove the first two predictors for stability.

load ionosphere
X = X(:,3:end);
rng('default'); % for reproducibility

Train a naive Bayes classifier using the predictors X and class labels Y. A recommended practice is to specify the class names. 'b' is the negative class and 'g' is the positive class. fitcnb assumes that, given the class, each predictor is conditionally independent and normally distributed.

Mdl = fitcnb(X,Y,'ClassNames',{'b','g'});

Mdl is a trained ClassificationNaiveBayes classifier.

Cross-validate the classifier by specifying a 30% holdout sample.

CVMdl = crossval(Mdl,'Holdout',0.3)
CVMdl = 
  ClassificationPartitionedModel
    CrossValidatedModel: 'NaiveBayes'
         PredictorNames: {1x32 cell}
           ResponseName: 'Y'
        NumObservations: 351
                  KFold: 1
              Partition: [1x1 cvpartition]
             ClassNames: {'b'  'g'}
         ScoreTransform: 'none'



CVMdl is a ClassificationPartitionedModel cross-validated naive Bayes classifier.

Display the properties of the classifier trained using 70% of the data.

TrainedModel = CVMdl.Trained{1}
TrainedModel = 
  CompactClassificationNaiveBayes
              ResponseName: 'Y'
     CategoricalPredictors: []
                ClassNames: {'b'  'g'}
            ScoreTransform: 'none'
         DistributionNames: {1x32 cell}
    DistributionParameters: {2x32 cell}



TrainedModel is a CompactClassificationNaiveBayes classifier.

Estimate the generalization error by passing CVMdl to kfoldLoss.

kfoldLoss(CVMdl)
ans = 0.2095

The out-of-sample misclassification error is approximately 21%.

Reduce the generalization error by selecting the five most important predictors, as ranked by the fscmrmr function.

idx = fscmrmr(X,Y);
Xnew = X(:,idx(1:5));

Train a naive Bayes classifier using the reduced set of predictors.

Mdlnew = fitcnb(Xnew,Y,'ClassNames',{'b','g'});

Cross-validate the new classifier by specifying a 30% holdout sample, and estimate the generalization error.

CVMdlnew = crossval(Mdlnew,'Holdout',0.3);
kfoldLoss(CVMdlnew)
ans = 0.1429

The out-of-sample misclassification error is reduced from approximately 21% to approximately 14%.

Input Arguments


Mdl — Full, trained naive Bayes classifier, specified as a ClassificationNaiveBayes model trained by fitcnb.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: crossval(Mdl,'KFold',5) specifies using five folds in a cross-validated model.

Cross-validation partition, specified as the comma-separated pair consisting of 'CVPartition' and a cvpartition partition object created by cvpartition. The partition object specifies the type of cross-validation and the indexing for the training and validation sets.

To create a cross-validated model, you can use one of these four name-value pair arguments only: CVPartition, Holdout, KFold, or Leaveout.

Example: Suppose you create a random partition for 5-fold cross-validation on 500 observations by using cvp = cvpartition(500,'KFold',5). Then, you can specify the cross-validated model by using 'CVPartition',cvp.
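Putting the pieces together, a minimal sketch (assuming a trained model Mdl with 150 observations, as in the iris example above):

```matlab
% Create a random 5-fold partition and reuse it for cross-validation
cvp = cvpartition(150,'KFold',5);          % partition on 150 observations
CVMdl = crossval(Mdl,'CVPartition',cvp);   % uses the folds defined in cvp
kfoldLoss(CVMdl)                           % estimate the generalization error
```

Passing the same cvp object to several models lets you compare them on identical folds.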

Fraction of the data used for holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range (0,1). If you specify 'Holdout',p, then the software completes these steps:

  1. Randomly select and reserve p*100% of the data as validation data, and train the model using the rest of the data.

  2. Store the compact, trained model in the Trained property of the cross-validated model.

To create a cross-validated model, you can use one of these four name-value pair arguments only: CVPartition, Holdout, KFold, or Leaveout.

Example: 'Holdout',0.1

Data Types: double | single

Number of folds to use in a cross-validated model, specified as the comma-separated pair consisting of 'KFold' and a positive integer value greater than 1. If you specify 'KFold',k, then the software completes these steps:

  1. Randomly partition the data into k equally sized sets.

  2. For each set, reserve the set as validation data, and train the model using the other k – 1 sets.

  3. Store the k compact, trained models in the cells of a k-by-1 cell vector in the Trained property of the cross-validated model.

  4. Combine the generalization statistics from each fold.

To create a cross-validated model, you can use one of these four name-value pair arguments only: CVPartition, Holdout, KFold, or Leaveout.

Example: 'KFold',5

Data Types: single | double
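The steps above can be sketched as follows (assuming a trained model Mdl, as in the examples):

```matlab
% 5-fold cross-validation: trains five compact models
CVMdl = crossval(Mdl,'KFold',5);
numel(CVMdl.Trained)   % 5 compact models, one per fold
kfoldLoss(CVMdl)       % misclassification error averaged over the folds
```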

Leave-one-out cross-validation flag, specified as the comma-separated pair consisting of 'Leaveout' and 'on' or 'off'. If you specify 'Leaveout','on', then, for each of the n observations (where n is the number of observations excluding missing observations, specified in the NumObservations property of the model), the software completes these steps:

  1. Reserve the observation as validation data, and train the model using the other n – 1 observations.

  2. Store the n compact, trained models in the cells of an n-by-1 cell vector in the Trained property of the cross-validated model.

To create a cross-validated model, you can use one of these four name-value pair arguments only: CVPartition, Holdout, KFold, or Leaveout.

Example: 'Leaveout','on'
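As a sketch (assuming a trained model Mdl), leave-one-out cross-validation works like any other partition, but note that it trains n models and can be slow for large data sets:

```matlab
% Leave-one-out cross-validation: one compact model per observation
CVMdl = crossval(Mdl,'Leaveout','on');
kfoldLoss(CVMdl)   % error averaged over the n held-out observations
```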

Tips

  • Assess the predictive performance of Mdl on cross-validated data by using the 'KFold' name-value pair argument and object functions of CVMdl, such as kfoldLoss.

  • Return a partitioned naive Bayes classifier with stratified partitioning by using the name-value pair argument 'KFold' or 'Holdout'.

  • Create a cvpartition object cvp using cvp = cvpartition(n,'KFold',k). Return a partitioned naive Bayes classifier with nonstratified partitioning using the name-value pair 'CVPartition',cvp.
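The two tips above can be sketched side by side (assuming a trained model Mdl and its number of observations n; both calls are hypothetical illustrations, not output from this page):

```matlab
% Stratified: 'KFold' partitions so each fold preserves class proportions
CVMdlStrat = crossval(Mdl,'KFold',5);

% Nonstratified: a cvpartition built from the observation count n
% ignores the class labels when assigning observations to folds
n = Mdl.NumObservations;
cvp = cvpartition(n,'KFold',5);
CVMdlNonstrat = crossval(Mdl,'CVPartition',cvp);
```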

Alternatives

Instead of first creating a naive Bayes classifier and then creating a cross-validation classifier, you can create a cross-validated classifier directly by using fitcnb and specifying any of these name-value pair arguments: 'CrossVal', 'CVPartition', 'Holdout', 'Leaveout', or 'KFold'.
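For example, a one-call sketch using the iris data from the first example (same predictors X and labels Y):

```matlab
% Train and cross-validate in a single call to fitcnb
CVMdl = fitcnb(X,Y,'ClassNames',{'setosa','versicolor','virginica'}, ...
    'KFold',5);
kfoldLoss(CVMdl)   % generalization error of the partitioned model
```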

Introduced in R2014b