Main Content

ClassificationPartitionedLinear

Cross-validated linear model for binary classification of high-dimensional data

Description

ClassificationPartitionedLinear is a set of linear classification models trained on cross-validated folds. You can estimate the quality of classification, or how well the linear classification model generalizes, using one or more kfold functions: kfoldPredict, kfoldLoss, kfoldMargin, and kfoldEdge.

Every kfold object function uses models trained on training-fold (in-fold) observations to predict the response for validation-fold (out-of-fold) observations. For example, suppose that you cross-validate using five folds. The software randomly assigns each observation into five groups of equal size (roughly). The training fold contains four of the groups (roughly 4/5 of the data), and the validation fold contains the other group (roughly 1/5 of the data). In this case, cross-validation proceeds as follows:

  1. The software trains the first model (stored in CVMdl.Trained{1}) by using the observations in the last four groups, and reserves the observations in the first group for validation.

  2. The software trains the second model (stored in CVMdl.Trained{2}) by using the observations in the first group and the last three groups. The software reserves the observations in the second group for validation.

  3. The software proceeds in a similar manner for the third, fourth, and fifth models.

If you validate by using kfoldPredict, the software computes predictions for the observations in group i by using model i. In short, the software estimates a response for every observation by using the model trained without that observation.

Note

ClassificationPartitionedLinear model objects do not store the predictor data set.

Creation

You can create a ClassificationPartitionedLinear object by using the fitclinear function and specifying one of the name-value arguments CrossVal, CVPartition, Holdout, KFold, or Leaveout.

Properties

expand all

Cross-Validation Properties

Cross-validated model name, specified as a character vector.

For example, 'Linear' specifies a cross-validated linear model for binary classification or regression.

Data Types: char

Number of cross-validated folds, specified as a positive integer.

Data Types: double

Cross-validation parameter values, e.g., the name-value pair argument values used to cross-validate the linear model, specified as an object. ModelParameters does not contain estimated parameters.

Access properties of ModelParameters using dot notation.

Number of observations in the training data, specified as a positive numeric scalar.

Data Types: double

Data partition indicating how the software splits the data into cross-validation folds, specified as a cvpartition model.

Linear classification models trained on cross-validation folds, specified as a cell array of ClassificationLinear models. Trained has k cells, where k is the number of folds.

Data Types: cell

Observation weights used to cross-validate the model, specified as a numeric vector. W has NumObservations elements.

The software normalizes W so that the weights for observations within a particular class sum up to the prior probability of that class.

Data Types: single | double

Observed class labels used to cross-validate the model, specified as a categorical or character array, logical or numeric vector, or cell array of character vectors. Y has NumObservations elements, and is the same data type as the input argument Y that you passed to fitclinear to cross-validate the model. (The software treats string arrays as cell arrays of character vectors.)

Each row of Y represents the observed classification of the corresponding observation in the predictor data.

Data Types: categorical | char | logical | single | double | cell

Other Classification Properties

Categorical predictor indices, specified as a vector of positive integers. CategoricalPredictors contains index values indicating that the corresponding predictors are categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. If none of the predictors are categorical, then this property is empty ([]).

Data Types: single | double

This property is read-only.

Unique class labels used in training, specified as a categorical or character array, logical or numeric vector, or cell array of character vectors. ClassNames has the same data type as the class labels Y. (The software treats string arrays as cell arrays of character vectors.) ClassNames also determines the class order.

Data Types: categorical | char | logical | single | double | cell

This property is read-only.

Misclassification costs, specified as a square numeric matrix. Cost has K rows and columns, where K is the number of classes.

Cost(i,j) is the cost of classifying a point into class j if its true class is i. The order of the rows and columns of Cost corresponds to the order of the classes in ClassNames.

Data Types: double

Predictor names in order of their appearance in the predictor data, specified as a cell array of character vectors. The length of PredictorNames is equal to the number of variables in the training data X or Tbl used as predictor variables.

Data Types: cell

This property is read-only.

Prior class probabilities, specified as a numeric vector. Prior has as many elements as classes in ClassNames, and the order of the elements corresponds to the elements of ClassNames.

Data Types: double

Response variable name, specified as a character vector.

Data Types: char

Score transformation function to apply to predicted scores, specified as a function name or function handle.

For linear classification models and before transformation, the predicted classification score for the observation x (row vector) is f(x) = xβ + b, where β and b correspond to Mdl.Beta and Mdl.Bias, respectively.

To change the score transformation function to, for example, function, use dot notation.

  • For a built-in function, enter this code and replace function with a value in the table.

    Mdl.ScoreTransform = 'function';

    ValueDescription
    "doublelogit"1/(1 + e–2x)
    "invlogit"log(x / (1 – x))
    "ismax"Sets the score for the class with the largest score to 1, and sets the scores for all other classes to 0
    "logit"1/(1 + ex)
    "none" or "identity"x (no transformation)
    "sign"–1 for x < 0
    0 for x = 0
    1 for x > 0
    "symmetric"2x – 1
    "symmetricismax"Sets the score for the class with the largest score to 1, and sets the scores for all other classes to –1
    "symmetriclogit"2/(1 + ex) – 1

  • For a MATLAB® function, or a function that you define, enter its function handle.

    Mdl.ScoreTransform = @function;

    function must accept a matrix of the original scores for each class, and then return a matrix of the same size representing the transformed scores for each class.

Data Types: char | function_handle

Object Functions

kfoldEdgeClassification edge for cross-validated linear classification model
kfoldLossClassification loss for cross-validated linear classification model
kfoldMarginClassification margins for cross-validated linear classification model
kfoldPredictClassify observations in cross-validated linear classification model

Examples

collapse all

Load the NLP data set.

load nlpdata

X is a sparse matrix of predictor data, and Y is a categorical vector of class labels. There are more than two classes in the data.

Identify the labels that correspond to the Statistics and Machine Learning Toolbox™ documentation web pages.

Ystats = Y == 'stats';

Cross-validate a binary, linear classification model that can identify whether the word counts in a documentation web page are from the Statistics and Machine Learning Toolbox™ documentation.

rng(1); % For reproducibility 
CVMdl = fitclinear(X,Ystats,'CrossVal','on')
CVMdl = 
  ClassificationPartitionedLinear
    CrossValidatedModel: 'Linear'
           ResponseName: 'Y'
        NumObservations: 31572
                  KFold: 10
              Partition: [1x1 cvpartition]
             ClassNames: [0 1]
         ScoreTransform: 'none'


CVMdl is a ClassificationPartitionedLinear cross-validated model. Because fitclinear implements 10-fold cross-validation by default, CVMdl.Trained contains ten ClassificationLinear models that contain the results of training linear classification models for each of the folds.

Estimate labels for out-of-fold observations and estimate the generalization error by passing CVMdl to kfoldPredict and kfoldLoss, respectively.

oofLabels = kfoldPredict(CVMdl);
ge = kfoldLoss(CVMdl)
ge = 
7.6017e-04

The estimated generalization error is less than 0.1% misclassified observations.

To determine a good lasso-penalty strength for a linear classification model that uses a logistic regression learner, implement 5-fold cross-validation.

Load the NLP data set.

load nlpdata

X is a sparse matrix of predictor data, and Y is a categorical vector of class labels. There are more than two classes in the data.

The models should identify whether the word counts in a web page are from the Statistics and Machine Learning Toolbox™ documentation. So, identify the labels that correspond to the Statistics and Machine Learning Toolbox™ documentation web pages.

Ystats = Y == 'stats';

Create a set of 11 logarithmically-spaced regularization strengths from 10-6 through 10-0.5.

Lambda = logspace(-6,-0.5,11);

Cross-validate the models. To increase execution speed, transpose the predictor data and specify that the observations are in columns. Estimate the coefficients using SpaRSA. Lower the tolerance on the gradient of the objective function to 1e-8.

X = X'; 
rng(10); % For reproducibility
CVMdl = fitclinear(X,Ystats,'ObservationsIn','columns','KFold',5,...
    'Learner','logistic','Solver','sparsa','Regularization','lasso',...
    'Lambda',Lambda,'GradientTolerance',1e-8)
CVMdl = 
  ClassificationPartitionedLinear
    CrossValidatedModel: 'Linear'
           ResponseName: 'Y'
        NumObservations: 31572
                  KFold: 5
              Partition: [1x1 cvpartition]
             ClassNames: [0 1]
         ScoreTransform: 'none'


numCLModels = numel(CVMdl.Trained)
numCLModels = 
5

CVMdl is a ClassificationPartitionedLinear model. Because fitclinear implements 5-fold cross-validation, CVMdl contains 5 ClassificationLinear models that the software trains on each fold.

Display the first trained linear classification model.

Mdl1 = CVMdl.Trained{1}
Mdl1 = 
  ClassificationLinear
      ResponseName: 'Y'
        ClassNames: [0 1]
    ScoreTransform: 'logit'
              Beta: [34023x11 double]
              Bias: [-13.2936 -13.2936 -13.2936 -13.2936 -13.2936 -6.8954 -5.4359 -4.7170 -3.4108 -3.1566 -2.9792]
            Lambda: [1.0000e-06 3.5481e-06 1.2589e-05 4.4668e-05 1.5849e-04 5.6234e-04 0.0020 0.0071 0.0251 0.0891 0.3162]
           Learner: 'logistic'


Mdl1 is a ClassificationLinear model object. fitclinear constructed Mdl1 by training on the first four folds. Because Lambda is a sequence of regularization strengths, you can think of Mdl1 as 11 models, one for each regularization strength in Lambda.

Estimate the cross-validated classification error.

ce = kfoldLoss(CVMdl);

Because there are 11 regularization strengths, ce is a 1-by-11 vector of classification error rates.

Higher values of Lambda lead to predictor variable sparsity, which is a good quality of a classifier. For each regularization strength, train a linear classification model using the entire data set and the same options as when you cross-validated the models. Determine the number of nonzero coefficients per model.

Mdl = fitclinear(X,Ystats,'ObservationsIn','columns',...
    'Learner','logistic','Solver','sparsa','Regularization','lasso',...
    'Lambda',Lambda,'GradientTolerance',1e-8);
numNZCoeff = sum(Mdl.Beta~=0);

In the same figure, plot the cross-validated, classification error rates and frequency of nonzero coefficients for each regularization strength. Plot all variables on the log scale.

figure;
[h,hL1,hL2] = plotyy(log10(Lambda),log10(ce),...
    log10(Lambda),log10(numNZCoeff)); 
hL1.Marker = 'o';
hL2.Marker = 'o';
ylabel(h(1),'log_{10} classification error')
ylabel(h(2),'log_{10} nonzero-coefficient frequency')
xlabel('log_{10} Lambda')
title('Test-Sample Statistics')
hold off

Figure contains 2 axes objects. Axes object 1 with title Test-Sample Statistics, xlabel log_{10} Lambda, ylabel log_{10} classification error contains an object of type line. Axes object 2 with ylabel log_{10} nonzero-coefficient frequency contains an object of type line.

Choose the index of the regularization strength that balances predictor variable sparsity and low classification error. In this case, a value between 10-4 to 10-1 should suffice.

idxFinal = 7;

Select the model from Mdl with the chosen regularization strength.

MdlFinal = selectModels(Mdl,idxFinal);

MdlFinal is a ClassificationLinear model containing one regularization strength. To estimate labels for new observations, pass MdlFinal and the new data to predict.

Extended Capabilities

Version History

Introduced in R2016a

expand all