Misclassification Costs in Binary Classification

23 views (last 30 days)
Hello everyone,
I am a student and currently new to the topic of Machine Learning. I have come across the following issue in Binary Classification.
I want to train an Ensemble Boosted Tree Model with a specific cost matrix:
function Mdl = trainBoostedTrees(trainData, trainLabels, costMatrix)
    template = templateTree('MaxNumSplits', 20, ...
        'NumVariablesToSample', 'all');
    Mdl = fitcensemble(trainData, trainLabels, 'Method', 'AdaBoostM1', ...
        'NumLearningCycles', 30, 'Learners', template, 'LearnRate', 0.1, ...
        'Cost', costMatrix);
end
Now, I want to use a cost matrix to shift the operating point down on the ROC curve and minimize the False Negative Rate.
( Kosten = Cost in German :) )
The problem is that I have to set the misclassification cost for this case so extremely high that it seems unrealistic to me (see the last point on the ROC curve). I would appreciate an answer explaining why this is the case and what I might be doing wrong.
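For context, a minimal sketch of how I am calling this (testData and testLabels are placeholders for my held-out set): besides raising costs, the operating point can also be moved after training by thresholding the predicted scores with perfcurve, without retraining:

```matlab
% Illustrative sketch: train with a moderate cost matrix, then pick an
% operating point on the ROC curve directly from the scores.
Mdl = trainBoostedTrees(trainData, trainLabels, [0 1; 2 0]);
[~, scores] = predict(Mdl, testData);              % column 2: score for "true"
[fpr, tpr, thresholds] = perfcurve(testLabels, scores(:, 2), true);

% Pick the smallest threshold whose true positive rate is at least 99%,
% i.e. a false negative rate of at most 1% (0.99 is an illustrative target):
idx = find(tpr >= 0.99, 1, 'first');
t = thresholds(idx);
predTrue = scores(:, 2) >= t;                      % shifted operating point
```

This assumes logical labels with `true` as the positive class; the cost matrix and TPR target are only examples.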
Another question I have in this context is: does the magnitude of my misclassification costs depend on the number of training samples?
Additional information:
My training dataset consists of approximately 49,000 samples with 21 features each, split as follows:
33.3% true labels / 66.7% false labels
Please let me know if you need any further information or clarification. Thank you!

Accepted Answer

Franziska Albers 2023-7-27
Hi Lars,
you are using the "Cost" argument correctly. However, the effect strongly depends on your data: some classes are just inherently hard to differentiate, and you may need a different algorithm or better features to get better performance. You have a lot of training examples, which is good; I would not expect the effect of the cost matrix to depend on that.
As far as I understand, the "false" class is the majority class. Does this mean a false negative is a case where the "true" class is identified as the "false" class? Maybe you can clarify or share the confusion matrix. Also, your false negative rate already seems very low at a cost of 100. What rate are you aiming for?
In general, balanced data is best for classification. If the data is not balanced, you can balance it, use the "RUSBoost" method, or apply a cost matrix. I would recommend trying RUSBoost or balancing your data, either by oversampling the minority class or undersampling the majority class.
There is also some more information on how to handle imbalanced data here: https://www.mathworks.com/help/stats/classification-with-unequal-misclassification-costs.html
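A rough sketch of the two suggestions above, assuming the same trainData/trainLabels as in the question and logical class labels:

```matlab
template = templateTree('MaxNumSplits', 20, 'NumVariablesToSample', 'all');

% Option 1: RUSBoost undersamples the majority class during boosting.
MdlRUS = fitcensemble(trainData, trainLabels, 'Method', 'RUSBoost', ...
    'NumLearningCycles', 30, 'Learners', template, 'LearnRate', 0.1);

% Option 2: manual undersampling of the majority ("false") class
% before training the original AdaBoostM1 ensemble.
idxFalse = find(trainLabels == false);
idxTrue  = find(trainLabels == true);
keep = idxFalse(randperm(numel(idxFalse), numel(idxTrue)));   % match minority size
balancedIdx = [keep; idxTrue];
MdlBal = fitcensemble(trainData(balancedIdx, :), trainLabels(balancedIdx), ...
    'Method', 'AdaBoostM1', 'NumLearningCycles', 30, 'Learners', template);
```

Undersampling discards majority-class examples, so with roughly 49,000 samples there is still plenty of data left; with a small dataset, oversampling the minority class would be safer.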
2 Comments
Lars Kilian 2023-7-28
Hi Franziska,
firstly, thank you very much for your detailed response!
Yes, you're right, the "false" class is the majority class. My data are imbalanced, but this is real-life data with a realistic distribution. That's why I thought it made sense to keep the imbalanced dataset, so that the model reflects real conditions.
My goal is to set the model so that the False Negative Rate is reduced to a minimum (maybe even lower than the current values). I accept that the True Negative Rate and the overall accuracy might drop significantly.
My problem was that I then have to set the cost value so extremely high that it does not seem logical to me.
Moreover, I always thought that the appropriate cost value depends on the number of training samples? Is there a direct relationship between the decision threshold and the cost value?
Franziska Albers 2023-7-31
Hi Kilian,
To answer your question about the cost matrix:
fitcensemble uses Cost to adjust the prior class probabilities specified in Prior, and then trains with the adjusted priors. In effect, the training algorithm trains on more examples of the class with the high misclassification cost. You can find more details here: https://www.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html#mw_4cd1857b-b486-4247-b328-5fd810649696.
The cost does not have to depend on the number of training examples. However, for imbalanced datasets it is often recommended to start with a cost matrix that reflects the imbalance: if the majority class is 5 times larger than the minority class, a good starting point is a cost of 5 for misclassifying the minority class and a cost of 1 for misclassifying the majority class. But that is just a starting point. The one general thing to keep in mind is that classification costs are relative, so if you multiply all costs by a factor of 5, nothing changes.
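To make the prior adjustment concrete, here is a small sketch of the arithmetic with your class frequencies (following the mechanism on the linked doc page: the adjusted prior is proportional to the prior times the row sum of the cost matrix):

```matlab
prior = [2/3; 1/3];          % class frequencies: ~66.7% "false", ~33.3% "true"
C     = [0 1; 2 0];          % starting point reflecting the 2:1 imbalance
w = prior .* sum(C, 2);      % unnormalized adjusted priors
adjPrior = w / sum(w);       % both entries 0.5: training is effectively balanced

% Costs are relative: scaling the whole matrix changes nothing.
C5 = 5 * C;
w5 = prior .* sum(C5, 2);
adjPrior5 = w5 / sum(w5);    % identical to adjPrior
```

This is only an illustration of the adjustment, not the full training algorithm; the cost matrix values are the example from the text.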
However, in your situation the problem is not so much the imbalanced data. Your goal is an extremely low false negative rate even at the expense of overall accuracy, and that is not a typical classification task. I would suggest trying to improve the machine learning model or (if possible) the data. I agree that a cost of more than 1000 seems odd and can make training and testing unstable (this is another point where the size of your training dataset comes into play: for a small dataset, very skewed cost matrices can lead to unstable behavior, as also mentioned in the doc page linked above).
Another idea: Maybe you can frame your problem as anomaly detection and try models from that field? E.g. one-class support vector machines or isolation forests. See here: https://www.mathworks.com/help/stats/anomaly-detection.html
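As a rough illustration of that framing (assuming logical labels with the minority "true" class treated as the anomaly, and a MATLAB release with iforest, R2021b or later):

```matlab
% Fit an isolation forest on the majority ("normal") class only, then
% flag likely anomalies ("true" cases) in new data.
normalData = trainData(trainLabels == false, :);
[forest, ~, trainScores] = iforest(normalData);

% Higher scores indicate more anomalous observations; isanomaly also
% returns a logical flag based on the forest's score threshold.
[isAnom, testScores] = isanomaly(forest, testData);
```

testData is a placeholder for new observations; in practice the score threshold would be tuned on a validation set to hit the desired false negative rate.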


More Answers (1)

shreyash
shreyash about 12 hours ago
In the Classification Learner app, create a new Optimizable KNN model template from the Models section. Click Costs in the Options section of the toolstrip. Modify the matrix so that the costs for (101,111) and (111,101) are 1.5. Click Save and Apply, then train your model.
