Improving Discriminant Analysis Models
Deal with Singular Data
Discriminant analysis needs enough data to fit Gaussian models with invertible covariance matrices. If your data is not sufficient to fit such a model uniquely, fitcdiscr fails. This section shows methods for handling failures.
Tip
To obtain a discriminant analysis classifier without failure, set the DiscrimType name-value pair to 'pseudoLinear' or 'pseudoQuadratic' in fitcdiscr.
“Pseudo” discriminants never fail, because they use the pseudoinverse of the covariance matrix Σk (see pinv).
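As a minimal sketch of why the pseudoinverse sidesteps the failure, consider a small singular matrix. The matrix Sigma below is made up for illustration and is not taken from any fitted classifier. An ordinary inverse does not exist for it, but pinv still returns a matrix that satisfies the defining pseudoinverse identity:

```matlab
% Hypothetical 3-by-3 covariance-like matrix with one zero-variance direction
Sigma = [2 1 0; 1 2 0; 0 0 0];

r = rank(Sigma);     % rank is less than 3, so Sigma is singular
P = pinv(Sigma);     % Moore-Penrose pseudoinverse

% The pseudoinverse satisfies Sigma*P*Sigma = Sigma up to rounding error
residual = norm(Sigma*P*Sigma - Sigma);
fprintf('rank = %d, residual = %g\n', r, residual);
```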
Example: Singular Covariance Matrix
When the covariance matrix of the fitted classifier is singular, fitcdiscr can fail:
load popcorn
X = popcorn(:,[1 2]);
X(:,3) = 0; % a zero-variance column
Y = popcorn(:,3);
ppcrn = fitcdiscr(X,Y);

Error using ClassificationDiscriminant (line 635)
Predictor x3 has zero variance. Either exclude this predictor or set 'discrimType' to
'pseudoLinear' or 'diagLinear'.

Error in classreg.learning.FitTemplate/fit (line 243)
            obj = this.MakeFitObject(X,Y,W,this.ModelParameters,fitArgs{:});

Error in fitcdiscr (line 296)
            this = fit(temp,X,Y);
To proceed with linear discriminant analysis, use a pseudoLinear or diagLinear discriminant type:
ppcrn = fitcdiscr(X,Y,...
    'discrimType','pseudoLinear');
meanpredict = predict(ppcrn,mean(X))

meanpredict =
    3.5000
Choose a Discriminant Type
There are six types of discriminant analysis classifiers: linear and quadratic, with diagonal and pseudo variants of each type.
Tip
To see if your covariance matrix is singular, set DiscrimType to 'linear' or 'quadratic'. If the matrix is singular, the fitcdiscr method fails for 'quadratic', and the Gamma property is nonzero for 'linear'.
To obtain a quadratic classifier even when your covariance matrix is singular, set DiscrimType to 'pseudoQuadratic' or 'diagQuadratic'.
obj = fitcdiscr(X,Y,'DiscrimType','pseudoQuadratic') % or 'diagQuadratic'
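The check described in the tip can be sketched as follows. The extra predictor is an assumption made purely for illustration: it is an exact linear combination of two existing predictors, which makes the covariance matrix rank deficient without introducing a zero-variance column. The code only reports what happens and does not assume a particular outcome, because the behavior can depend on the release and on rounding:

```matlab
load fisheriris
% Illustrative assumption: append a column that is a linear combination
% of two existing predictors, so the covariance matrix is singular.
Xs = [meas, meas(:,1) - meas(:,2)];

for type = {'linear','quadratic'}
    try
        mdl = fitcdiscr(Xs, species, 'DiscrimType', type{1});
        % A nonzero Gamma for 'linear' indicates the fit needed regularization
        fprintf('%s: fit succeeded, Gamma = %g\n', type{1}, mdl.Gamma);
    catch err
        fprintf('%s: fit failed (%s)\n', type{1}, err.message);
    end
end
```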
Choose a classifier type by setting the DiscrimType name-value pair to one of:
'linear' (default): Estimate one covariance matrix for all classes.
'quadratic': Estimate one covariance matrix for each class.
'diagLinear': Use the diagonal of the 'linear' covariance matrix, and use its pseudoinverse if necessary.
'diagQuadratic': Use the diagonals of the 'quadratic' covariance matrices, and use their pseudoinverses if necessary.
'pseudoLinear': Use the pseudoinverse of the 'linear' covariance matrix if necessary.
'pseudoQuadratic': Use the pseudoinverses of the 'quadratic' covariance matrices if necessary.
fitcdiscr can fail for the 'linear' and 'quadratic' classifiers. When it fails, it returns an explanation, as shown in Deal with Singular Data.
fitcdiscr always succeeds with the diagonal and pseudo variants. For information about pseudoinverses, see pinv.
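One way to compare the six types on your own data is to fit each in turn and record either the resubstitution loss or the error message. The following is a sketch on the Fisher iris data, where the variable names types and mdl are chosen here for illustration:

```matlab
load fisheriris
types = {'linear','quadratic','diagLinear','diagQuadratic', ...
         'pseudoLinear','pseudoQuadratic'};

for k = 1:numel(types)
    try
        mdl = fitcdiscr(meas, species, 'DiscrimType', types{k});
        fprintf('%-16s resubstitution loss = %.4f\n', types{k}, resubLoss(mdl));
    catch err
        % 'linear' and 'quadratic' are the types that can fail
        fprintf('%-16s failed: %s\n', types{k}, err.message);
    end
end
```

Because resubstitution loss is optimistic, treat this comparison as a first screen rather than a final choice.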
You can set the discriminant type using dot notation after constructing a classifier:
obj.DiscrimType = 'discrimType'
You can change between linear types or between quadratic types, but cannot change between a linear and a quadratic type.
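For instance, a switch between two linear types might look like the following sketch, which records the resubstitution loss before and after the change (the variable names are illustrative):

```matlab
load fisheriris
obj = fitcdiscr(meas, species);     % default 'linear' type
lossLinear = resubLoss(obj);

obj.DiscrimType = 'diagLinear';     % switch from one linear type to another
lossDiag = resubLoss(obj);

fprintf('linear: %.4f, diagLinear: %.4f\n', lossLinear, lossDiag);
```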
Examine the Resubstitution Error and Confusion Matrix
The resubstitution error measures how much the classifier's predictions for the training data differ from the true training responses. In the examples in this section, it equals the fraction of training observations that the classifier misclassifies. If the resubstitution error is high, you cannot expect the predictions of the classifier to be good. However, having low resubstitution error does not guarantee good predictions for new data. Resubstitution error is often an overly optimistic estimate of the predictive error on new data.
The confusion matrix shows how many errors, and which types, arise in resubstitution. When there are K classes, the confusion matrix R is a K-by-K matrix with
R(i,j) = the number of observations of class i that the classifier predicts to be of class j.
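To make the definition concrete, the following sketch builds R directly from the definition by counting, and then compares it with the result of confusionmat. It assumes the class labels are a cell array of character vectors, as they are for the Fisher iris data:

```matlab
load fisheriris
obj  = fitcdiscr(meas, species);
pred = resubPredict(obj);

classes = obj.ClassNames;   % cell array of class names
K = numel(classes);
R = zeros(K);
for i = 1:K
    for j = 1:K
        % Count observations of true class i that are predicted as class j
        R(i,j) = sum(strcmp(obj.Y, classes{i}) & strcmp(pred, classes{j}));
    end
end

% confusionmat with the same class order should give the same matrix
sameAsBuiltin = isequal(R, confusionmat(obj.Y, pred, 'Order', classes));

% Off-diagonal entries are errors, so the misclassification rate is
errRate = 1 - trace(R)/sum(R(:));
```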
Example: Resubstitution Error of a Discriminant Analysis Classifier
Examine the resubstitution error of the default discriminant analysis classifier for the Fisher iris data:
load fisheriris
obj = fitcdiscr(meas,species);
resuberror = resubLoss(obj)

resuberror =
    0.0200
The resubstitution error is very low, meaning obj classifies nearly all the Fisher iris data correctly. The total number of misclassifications is:
resuberror * obj.NumObservations

ans =
    3.0000
To see the details of the three misclassifications, examine the confusion matrix:
R = confusionmat(obj.Y,resubPredict(obj))

R =
    50     0     0
     0    48     2
     0     1    49

obj.ClassNames

ans = 
    'setosa'
    'versicolor'
    'virginica'

R(1,:) = [50 0 0] means obj classifies all 50 setosa irises correctly.
R(2,:) = [0 48 2] means obj classifies 48 versicolor irises correctly, and misclassifies two versicolor irises as virginica.
R(3,:) = [0 1 49] means obj classifies 49 virginica irises correctly, and misclassifies one virginica iris as versicolor.
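The confusion matrix gives counts but not the identity of the misclassified observations. A short sketch for listing them follows, where the table T and its column names are chosen here for illustration:

```matlab
load fisheriris
obj  = fitcdiscr(meas, species);
pred = resubPredict(obj);

% Row indices where the predicted label differs from the true label
wrong = find(~strcmp(obj.Y, pred));

T = table(wrong, obj.Y(wrong), pred(wrong), ...
    'VariableNames', {'Row','TrueClass','Predicted'});
disp(T)
```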
Cross Validation
Typically, discriminant analysis classifiers are robust and do not exhibit overtraining when the number of predictors is much less than the number of observations. Nevertheless, it is good practice to cross validate your classifier to ensure its stability.
Cross Validating a Discriminant Analysis Classifier
This example shows how to perform five-fold cross validation of a quadratic discriminant analysis classifier.
Load the sample data.
load fisheriris
Create a quadratic discriminant analysis classifier for the data.
quadisc = fitcdiscr(meas,species,'DiscrimType','quadratic');
Find the resubstitution error of the classifier.
qerror = resubLoss(quadisc)
qerror = 0.0200
The classifier does an excellent job. Nevertheless, resubstitution error can be an optimistic estimate of the error when classifying new data. So proceed to cross validation.
Create a cross-validation model.
cvmodel = crossval(quadisc,'kfold',5);
Find the cross-validation loss for the model, meaning the error of the out-of-fold observations.
cverror = kfoldLoss(cvmodel)
cverror = 0.0200
The cross-validated loss is as low as the original resubstitution loss. Therefore, you can have confidence that the classifier is reasonably accurate.
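Because crossval partitions the data at random, the cross-validation loss can change slightly from run to run. The following sketch fixes the random seed (the value 1 is arbitrary) and also looks at the per-fold losses and the out-of-fold confusion matrix, which can reveal instability that the average hides:

```matlab
load fisheriris
rng(1)   % arbitrary seed, used only to make the partition repeatable

quadisc = fitcdiscr(meas, species, 'DiscrimType', 'quadratic');
cvmodel = crossval(quadisc, 'KFold', 5);

foldLoss = kfoldLoss(cvmodel, 'Mode', 'individual');  % one loss per fold
cvPred   = kfoldPredict(cvmodel);                     % out-of-fold predictions
Rcv      = confusionmat(species, cvPred);             % out-of-fold confusion matrix

disp(foldLoss')
disp(Rcv)
```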
Change Costs and Priors
Sometimes you want to avoid certain misclassification errors more than others. For example, it might be better to have oversensitive cancer detection instead of undersensitive cancer detection. Oversensitive detection gives more false positives (unnecessary testing or treatment). Undersensitive detection gives more false negatives (preventable illnesses or deaths). The consequences of underdetection can be high. Therefore, you might want to set costs to reflect the consequences.
Similarly, the training data Y can have a distribution of classes that does not represent their true frequency. If you have a better estimate of the true frequency, you can include this knowledge in the classification Prior property.
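Both settings enter the prediction rule. As a sketch in standard notation (the symbols are introduced here for illustration and do not appear elsewhere in this section), let P(k) be the prior probability of class k, P(x|k) the fitted Gaussian density of class k at x, and C(y|k) the cost of predicting class y when the true class is k. The posterior probability and the minimum expected cost prediction are then:

```latex
\hat{P}(k \mid x) = \frac{P(x \mid k)\, P(k)}{\sum_{l=1}^{K} P(x \mid l)\, P(l)},
\qquad
\hat{y}(x) = \arg\min_{y \in \{1,\dots,K\}} \sum_{k=1}^{K} \hat{P}(k \mid x)\, C(y \mid k)
```

Raising C(y|k) for one pair makes the classifier more reluctant to predict y when class k is plausible, and raising P(k) increases the posterior of class k everywhere.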
Example: Setting Custom Misclassification Costs
Consider the Fisher iris data. Suppose that the cost of classifying a versicolor iris as virginica is 10 times as large as making any other classification error. Create a classifier from the data, incorporate this cost, and then view the resulting classifier.
Load the Fisher iris data and create a default (linear) classifier as in Example: Resubstitution Error of a Discriminant Analysis Classifier:
load fisheriris
obj = fitcdiscr(meas,species);
resuberror = resubLoss(obj)

resuberror =
    0.0200

R = confusionmat(obj.Y,resubPredict(obj))

R =
    50     0     0
     0    48     2
     0     1    49

obj.ClassNames

ans = 
    'setosa'
    'versicolor'
    'virginica'

R(2,:) = [0 48 2] means obj classifies 48 versicolor irises correctly, and misclassifies two versicolor irises as virginica.
Change the cost matrix to make fewer mistakes in classifying versicolor irises as virginica:
obj.Cost(2,3) = 10;
R2 = confusionmat(obj.Y,resubPredict(obj))

R2 =
    50     0     0
     0    50     0
     0     7    43
obj now classifies all versicolor irises correctly, at the expense of increasing the number of misclassifications of virginica irises from 1 to 7.
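An equivalent way to express the same change is to build the whole cost matrix explicitly, which makes the row (true class) and column (predicted class) convention visible. The sketch below also totals the resubstitution cost under that matrix. The names C and totalCost are illustrative:

```matlab
load fisheriris
obj = fitcdiscr(meas, species);

C = ones(3) - eye(3);   % unit cost for every error, zero cost on the diagonal
C(2,3) = 10;            % true versicolor (row 2) predicted as virginica (column 3)
obj.Cost = C;

R2 = confusionmat(obj.Y, resubPredict(obj));

% Weight each confusion count by its cost and add them up
totalCost = sum(sum(R2 .* C));
fprintf('Total resubstitution cost: %g\n', totalCost);
```

Comparing totalCost before and after the change shows whether trading versicolor errors for virginica errors actually lowers the cost you care about.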
Example: Setting Alternative Priors
Consider the Fisher iris data. There are 50 irises of each kind in the data. Suppose that, in a particular region, you have historical data that shows virginica are five times as prevalent as the other kinds. Create a classifier that incorporates this information.
Load the Fisher iris data and make a default (linear) classifier as in Example: Resubstitution Error of a Discriminant Analysis Classifier:
load fisheriris
obj = fitcdiscr(meas,species);
resuberror = resubLoss(obj)

resuberror =
    0.0200

R = confusionmat(obj.Y,resubPredict(obj))

R =
    50     0     0
     0    48     2
     0     1    49

obj.ClassNames

ans = 
    'setosa'
    'versicolor'
    'virginica'

R(3,:) = [0 1 49] means obj classifies 49 virginica irises correctly, and misclassifies one virginica iris as versicolor.
Change the prior to match your historical data, and examine the confusion matrix of the new classifier:
obj.Prior = [1 1 5];
R2 = confusionmat(obj.Y,resubPredict(obj))

R2 =
    50     0     0
     0    46     4
     0     0    50
The new classifier classifies all virginica irises correctly, at the expense of increasing the number of misclassifications of versicolor irises from 2 to 4.
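You can also supply the prior when you train the classifier instead of changing it afterward. The sketch below uses the structure form of the 'Prior' name-value pair so that each probability is tied to a class name. The variable names prior and objP are illustrative, and the resulting confusion matrix is not guaranteed to match the one obtained by setting Prior after training:

```matlab
load fisheriris

% Relative class frequencies; they do not need to sum to 1
prior = struct('ClassNames', {{'setosa','versicolor','virginica'}}, ...
               'ClassProbs', [1 1 5]);

objP = fitcdiscr(meas, species, 'Prior', prior);

disp(objP.Prior)   % inspect the prior as stored in the classifier
Rp = confusionmat(objP.Y, resubPredict(objP))
```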