Misclassification Costs in Binary Classification
Hello everyone,
I am a student and new to the topic of machine learning. I have come across the following issue in binary classification.
I want to train an Ensemble Boosted Tree Model with a specific cost matrix:
function Mdl = trainBoostedTrees(trainData, trainLabels, costMatrix)
    % Weak learners: shallow trees with at most 20 splits, considering all predictors
    template = templateTree('MaxNumSplits', 20, ...
        'NumVariablesToSample', 'all');
    % AdaBoostM1 ensemble trained with the supplied misclassification cost matrix
    Mdl = fitcensemble(trainData, trainLabels, 'Method', 'AdaBoostM1', ...
        'NumLearningCycles', 30, 'Learners', template, 'LearnRate', 0.1, ...
        'Cost', costMatrix);
end
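For example, a call with a heavy false-negative penalty could look like this (a sketch only; the cost value of 100 and the row/column order are assumptions, so check Mdl.ClassNames for the actual class order):
costMatrix = [0 1; 100 0];            % assumed order [false; true]: misclassifying "true" costs 100
Mdl = trainBoostedTrees(trainData, trainLabels, costMatrix);
pred = predict(Mdl, trainData);
cm = confusionmat(trainLabels, pred); % rows = true class, columns = predicted class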
Now, I want to use a cost matrix to shift the operating point down on the ROC curve and minimize the False Negative Rate.
( Kosten = Cost in German :) )
The problem is that I have to set the misclassification cost for this case so extremely high that it seems unrealistic to me (see the last point on the ROC curve). I would appreciate it if someone could explain why this is the case and what I might be doing wrong.
Another question I have in this context is: does the magnitude of my misclassification costs depend on the number of training samples?
Additional information:
My training dataset consists of approximately 49,000 samples with 21 features, divided as follows:
33.3% True Labels / 66.6% False Labels
Please let me know if you need any further information or clarification. Thank you!
Accepted Answer
Franziska Albers
2023-7-27
Hi Lars,
You are using the "Cost" argument correctly. However, the effect strongly depends on your data. Some classes are just inherently hard to differentiate; you may need a different algorithm or better features to get better performance. You have a lot of training examples, which is good, and I would not expect the effect of the cost matrix to depend on that.
As far as I understand, the "false" class is the majority class. Does this mean a false negative is a case where the "true" class is identified as the "false" class? Maybe you can clarify or share the confusion matrix. Also, your false negative rate seems very low at a cost of 100. What rate are you aiming for?
In general, balanced data is best for classification. If the data is not balanced, you can balance it, use the "RUSBoost" method, or apply a cost matrix. I would recommend that you try RUSBoost (a minimal sketch follows after the link below) or consider balancing your data, for example by oversampling the minority class or undersampling the majority class.
There is also some more information on how to handle imbalanced data here: https://www.mathworks.com/help/stats/classification-with-unequal-misclassification-costs.html
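Keeping the learner settings from your question, a RUSBoost call could look like this (a sketch, not a tuned model; RUSBoost undersamples the majority class in each boosting iteration):
template = templateTree('MaxNumSplits', 20, 'NumVariablesToSample', 'all');
MdlRUS = fitcensemble(trainData, trainLabels, ...
    'Method', 'RUSBoost', ...
    'NumLearningCycles', 30, ...
    'Learners', template, ...
    'LearnRate', 0.1);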
2 Comments
Franziska Albers
2023-7-31
Hi Killian,
To answer your question about the cost matrix:
fitcensemble uses Cost to adjust the prior class probabilities specified in Prior. Then, fitcensemble uses the adjusted prior probabilities for training. So, the training algorithm effectively trains on more examples of the class with the high misclassification cost. You can find more details on that here: https://www.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html#mw_4cd1857b-b486-4247-b328-5fd810649696.
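As a rough illustration for the two-class case (this is a simplification of the adjustment described in the linked doc; the 33.3%/66.6% split from your question and a cost of 100 are the assumed inputs):
prior = [0.666 0.333];            % empirical priors: "false" (majority), "true" (minority)
C     = [0 1; 100 0];             % cost of 100 for classifying an actual "true" as "false"
adjusted = prior .* sum(C, 2).';  % weight each class by its misclassification cost
adjusted = adjusted / sum(adjusted)   % ~[0.02 0.98]: training is dominated by the "true" class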
The cost does not have to depend on the number of training examples. However, for imbalanced datasets it is often recommended to start with a cost matrix that reflects the imbalance. So, if the majority class is 5 times larger than the minority class, it is good practice to start with a classification cost of 5 for misclassifying the minority class and a cost of 1 for misclassifying the majority class. But that is just a recommended starting point. The one general thing to keep in mind is that classification costs are relative, so if you multiply all costs by a factor of 5, nothing will change.
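In your case the majority class is roughly twice the size of the minority class (66.6% vs. 33.3%), so a starting cost matrix might look like this (a sketch; the row/column order is an assumption, check Mdl.ClassNames):
% Rows are the true class, columns the predicted class, assumed order [false; true]
startingCost = [0 1; 2 0];   % misclassifying the minority "true" class costs twice as much
Mdl = trainBoostedTrees(trainData, trainLabels, startingCost);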
However, in your situation the problem is not so much imbalanced data. Your goal is an extremely low false negative rate, even if it means worse overall accuracy – that is not a typical classification task. I would suggest trying to improve the machine learning model or (if possible) the data. I agree that a cost of more than 1000 seems odd and can possibly make training and testing unstable (this is another point where the size of your training dataset comes into play: for a small dataset, very skewed cost matrices can lead to unstable behavior – that is also mentioned in the doc page linked above).
Another idea: maybe you can frame your problem as anomaly detection and try models from that field, e.g. one-class support vector machines or isolation forests. See here: https://www.mathworks.com/help/stats/anomaly-detection.html
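For example, an isolation forest fitted only on the majority ("false") class could look like this (a rough sketch: iforest/isanomaly require Statistics and Machine Learning Toolbox R2021b or newer, and the label comparison, testData variable, and contamination fraction are assumptions):
normalData = trainData(trainLabels == "false", :);         % assumes string/categorical labels
forest = iforest(normalData, 'ContaminationFraction', 0.01);
[isAnomaly, anomalyScore] = isanomaly(forest, testData);   % flag likely "true" cases as anomalies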
More Answers (1)
shreyash
2024-4-29
In the Classification Learner app, create a new Optimizable KNN model from the Models section. Click Costs in the Options section of the toolstrip. Modify the matrix so that the costs for (101,111) and (111,101) are 1.5. Click Save and Apply, then train your model.
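A hedged programmatic equivalent of these app steps (the class labels 101 and 111 and the cost of 1.5 come from this answer; the variable names and the 'auto' optimization setting are assumptions):
C = [0 1.5; 1.5 0];                  % rows/columns ordered as [101 111]
Mdl = fitcknn(trainData, trainLabels, ...
    'ClassNames', [101 111], ...
    'Cost', C, ...
    'OptimizeHyperparameters', 'auto');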