fitcdiscr bug: Why does "ClassNames" now have to be provided in alphanumerical order, otherwise accuracy is terrible?

[Update: There is a known bug in kfoldLoss. See the answer and workaround from MathWorks technical support in the answers section.]
I couldn't work out why I was getting terrible results (as if completely random) with fitcdiscr(), and I've found that it is because I wasn't specifying the ClassNames argument in alphabetical order. Comparing MATLAB R2024a to 2022, this is new behaviour and presumably a bug. One of the reasons to specify ClassNames is to change the order of the classes in the results summary, etc.
Example code that gives terrible accuracy:
load fisheriris
Mdl = fitcdiscr(meas, species, "ClassNames", flip(unique(species)), "KFold", 10);
validationAccuracy = 1 - kfoldLoss(Mdl)
validationAccuracy = 0.3200
By simply removing the function flip(), the above code gives the expected accuracy of 0.98; otherwise it gives 0.32 (which is basically random for three classes). A side-effect of unique() is that it sorts the data into alphanumerical order, which isn't actually required for fisheriris because the observations happen to be in alphabetical order already.
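For comparison, here is the same call without flip(), so that ClassNames stays in the sorted order that unique() returns; as noted above, this gives the expected accuracy of roughly 0.98:
load fisheriris
Mdl = fitcdiscr(meas, species, "ClassNames", unique(species), "KFold", 10); % unique() returns the class names already sorted
validationAccuracy = 1 - kfoldLoss(Mdl)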
Here is some code that will give terrible accuracies if the order of the observations is randomised and the "stable" option of unique() is used to keep that random class order:
load fisheriris
r = randperm(length(meas)); % Randomise the order of the occurrences
Mdl = fitcdiscr(meas(r,:), species(r), "ClassNames", unique(species(r), "stable"), "KFold", 10);
validationAccuracy = 1 - kfoldLoss(Mdl)
validationAccuracy = 0.3267
If run a few times, I get results like:
validationAccuracy =
0.3267
validationAccuracy =
0.0067
validationAccuracy =
0.0067
validationAccuracy =
0.3267
validationAccuracy =
0.9800
Update: I have now been able to test the code on another computer that still has MATLAB 2022 installed, and it gives the correct accuracy with the above code, so this appears to be a bug in the latest version of MATLAB! I have reported it to MathWorks.

Answers (2)

Athanasios Paraskevopoulos
Edited: Athanasios Paraskevopoulos 2024-5-17
  • Code with Issue:
load fisheriris
Mdl = fitcdiscr(meas, species, "ClassNames", flip(unique(species, "stable")), "KFold", 10);
validationAccuracy = 1 - kfoldLoss(Mdl)
validationAccuracy = 0.3200
  • Correct Code:
load fisheriris
Mdl = fitcdiscr(meas, species, "ClassNames", unique(species, "stable"), "KFold", 10);
validationAccuracy = 1 - kfoldLoss(Mdl)
validationAccuracy = 0.9800
Your observation indicates a potential bug in the latest version of MATLAB that should be reported to MathWorks. Until the issue is resolved, always specify the ClassNames argument in alphabetical order to ensure correct behavior.
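A minimal sketch of that workaround, assuming you have some custom class list (built here with unique(..., "stable") purely for illustration): sort it before passing it to fitcdiscr so that ClassNames is always in alphanumerical order.
load fisheriris
classOrder = sort(unique(species, "stable")); % sort whatever class list you have into alphanumerical order
Mdl = fitcdiscr(meas, species, "ClassNames", classOrder, "KFold", 10);
validationAccuracy = 1 - kfoldLoss(Mdl)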
1 Comment
Leon
Edited: Leon 2024-5-17
You've just repeated exactly what I said; that's not an answer. I literally said that removing flip() fixed the problem, and it should be obvious that I only put it there to demonstrate the problem. I also said that sorting the ClassNames alphanumerically corrects the accuracy, and that I believe this is a bug (which I have reported to MathWorks). Furthermore, the actual functionality of ClassNames, to specify the order of the class names for results, etc., remains broken regardless.



Leon
Edited: Leon 2024-5-28
I received the following answer and workaround (using kfoldPredict) from MathWorks technical support:
The bug here lies within the "kfoldLoss" function, rather than within "fitcdiscr". This is a bug that the development team is aware of and is investigating. In the meantime, you can compute the loss by comparing the predicted and true class labels. For example, the following code will always return 0.98 in R2024a (this would not be the case using "kfoldLoss"):
load fisheriris
species = categorical(species); % species is a cell array of character vectors, so convert it to a categorical vector
Mdl = fitcdiscr(meas, species, "ClassNames", flip(unique(species)), "KFold", 10);
predictedLabels = kfoldPredict(Mdl);
correctPredictions = sum(predictedLabels == species);
valAcc = correctPredictions/numel(species)
In my code, only the model is passed to a function, so I no longer have access to the response variable directly, but this works:
predictedLabels = kfoldPredict(Mdl);
correctPredictions = sum(predictedLabels == categorical(Mdl.Y));
valAcc = correctPredictions / numel(Mdl.Y);
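The same pattern can be wrapped in a small helper that only needs the cross-validated model; this is a minimal sketch, and the function name kfoldAccuracy is my own, not from MathWorks:
function valAcc = kfoldAccuracy(Mdl)
% Cross-validated accuracy computed manually from kfoldPredict, avoiding the
% kfoldLoss bug described above. Mdl is a cross-validated model, e.g. the
% output of fitcdiscr(..., "KFold", 10).
predictedLabels = kfoldPredict(Mdl); % cross-validated predicted labels
trueLabels = categorical(Mdl.Y);     % response variable stored in the model
valAcc = sum(predictedLabels == trueLabels) / numel(trueLabels);
end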
