Learning classification with most training samples in one category
My question is not MATLAB-specific but more theoretical.
I'm currently using boosting to build a two-class classifier, with trees as my weak learners. While I have a fairly large number of training examples in both classes, most of them belong to a single class. My intuition is that this imbalance in the training set would skew the resulting classifier away from a "fair" one, toward one that favors the class with more examples.
Am I right? What are the accepted ways to cope with this issue?
Thanks in advance!
0 Comments
Accepted Answer
Ilya
2011-7-14
The answer depends on how you define a "fair" classifier. If the ultimate goal of your analysis is to minimize the overall classification error and if the class proportions in the training set are representative of the real world, you get an optimal classifier from your imbalanced data. If the class proportions in the training set are not what you normally expect or if you want to assign different costs for misclassification of the majority and minority classes, you would need to adjust your learning method accordingly.
In general, there are 4 ways of dealing with skewed data:
1. Adjusting class prior probabilities to reflect realistic proportions.
2. Adjusting misclassification costs to represent realistic penalties.
3. Oversampling the minority class.
4. Undersampling the majority class.
For binary classification, strategies 1 and 2 are equivalent.
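To make the equivalence concrete, here is a minimal sketch with hypothetical numbers (my illustration, not code from the answer; it assumes the empirical class proportions really are 0.9/0.1, that the minority class sorts second, and X and Y are placeholder predictors and labels):
  % Hypothetical class proportions; the minority class is listed second.
  p = [0.9 0.1];
  pr = [p(1), 5*p(2)];  pr = pr / sum(pr);   % scale the minority prior 5x
  % These two calls apply the same effective class weighting: scaled priors
  % with default costs, or default (empirical) priors with scaled costs.
  e1 = fitensemble(X, Y, 'AdaBoostM1', 100, 'Tree', 'prior', pr);
  e2 = fitensemble(X, Y, 'AdaBoostM1', 100, 'Tree', 'cost', [0 1; 5 0]);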
If you use fitensemble or TreeBagger, the easiest approach is to set 'prior' to 'uniform' for an equal mix, or to whatever proportions you like.
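For example, a minimal sketch (X and Y are placeholders; check your release's documentation for the exact TreeBagger parameter name):
  % Boosted trees with equal class priors, so the majority class does not
  % dominate training.
  ens = fitensemble(X, Y, 'AdaBoostM1', 200, 'Tree', 'prior', 'uniform');
  % The same idea with bagged trees.
  bag = TreeBagger(200, X, Y, 'Method', 'classification', 'Prior', 'Uniform');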
If you prefer oversampling or undersampling, nothing is available out of the box in official MATLAB, but it would not be too hard to code.
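A rough sketch of both, for a binary problem (my illustration; minorityLabel is a placeholder for whichever class is rare in Y):
  idxMin = find(Y == minorityLabel);
  idxMaj = find(Y ~= minorityLabel);
  % Undersampling: keep a random majority subset the size of the minority.
  p = randperm(numel(idxMaj));
  keep = [idxMin; idxMaj(p(1:numel(idxMin)))];
  ensUnder = fitensemble(X(keep,:), Y(keep), 'AdaBoostM1', 200, 'Tree');
  % Oversampling: replicate random minority rows (with replacement) until
  % the two classes are the same size.
  extra = idxMin(randi(numel(idxMin), numel(idxMaj) - numel(idxMin), 1));
  ensOver = fitensemble([X; X(extra,:)], [Y; Y(extra)], 'AdaBoostM1', 200, 'Tree');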
For undersampling the majority class, I have personally had good experience with RUSBoost:
Seiffert, C., Khoshgoftaar, T., Van Hulse, J., and Napolitano, A. (2008) RUSBoost: Improving classification performance when training data is skewed. In International Conference on Pattern Recognition, pp. 1–4.
For oversampling the minority class, a popular method is SMOTE. You might also want to look into its boosting extension, SMOTEBoost (a rough sketch of the SMOTE idea follows the references below).
Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, W. (2002) SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chawla, N., Lazarevic, A., Hall, L., and Bowyer, K. (2003) SMOTEBoost: Improving prediction of the minority class in boosting. In 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003), Lecture Notes in Computer Science, vol. 2838, Springer-Verlag, pp. 107–119.
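Regarding SMOTE, here is a rough sketch of its core interpolation step (my illustration with placeholder names Xmin and k, not code from the paper):
  % Synthesize new minority points by interpolating between each minority
  % sample and one of its k nearest minority neighbors. Assumes Xmin holds
  % the minority-class rows of X and has more than k rows; knnsearch is
  % from Statistics Toolbox.
  k = 5;
  idx = knnsearch(Xmin, Xmin, 'K', k+1);        % column 1 is the point itself
  n = size(Xmin, 1);
  nbr = idx(sub2ind(size(idx), (1:n)', randi([2 k+1], n, 1)));  % random neighbor
  lam = rand(n, 1);                             % interpolation weights in [0,1]
  Xsyn = Xmin + bsxfun(@times, lam, Xmin(nbr,:) - Xmin);  % synthetic samples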
3 Comments
Ilya
2012-7-2
As I replied to your other post, the easiest thing to do would be to set 'prior' to 'uniform'.
An ensemble is usually more accurate than a single tree, whether you learn on balanced or imbalanced data. In either case, you have to tune your classifier to get the best result. The key issue is selecting a tree size that gives enough sensitivity to the minority class without overtraining.
If you go with a single decision tree, you can set 'MinParent' to 1 to grow a deep tree and then find the optimal pruning level. If you want to use TreeBagger, you can use it with default parameters: every tree in TreeBagger is grown to the deepest level by default, and the resulting high variance is removed by averaging. If you go with one of the boosting algorithms available from fitensemble, you would need to optimize the tree size through the 'MinLeaf' or 'MinParent' option. The default for boosting is growing stumps (trees with two leaves), and stumps may not have enough sensitivity to the minority class. In that case, I would start by setting 'MinLeaf' to one half of the number of observations in the minority class. It is impossible to say in advance whether bagging or boosting will work best for you.
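A minimal sketch of that starting point (my illustration; ClassificationTree.template was the template constructor in releases of this vintage, while newer releases use templateTree with fitcensemble):
  % Count the observations in the smaller class without hard-coding labels.
  [~, ~, g] = unique(Y);
  nMinority = min(accumarray(g, 1));
  % Weak learners deep enough to see the minority class: 'MinLeaf' set to
  % half the minority-class size, as suggested above.
  t = ClassificationTree.template('MinLeaf', floor(nMinority/2));
  ens = fitensemble(X, Y, 'AdaBoostM1', 300, t, 'prior', 'uniform');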
Ilya
2012-7-3
I should also mention that RUSBoost is one of the fitensemble options in R2012b. Here is how you can get the 12b pre-release: http://www.mathworks.com/support/solutions/en/data/1-5NTATZ/index.html?solution=1-5NTATZ
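For example (a minimal sketch; X and Y are placeholders, and the tree size would still need tuning as described above):
  % RUSBoost undersamples the majority class inside every boosting
  % iteration, so no manual resampling of the training set is needed.
  ens = fitensemble(X, Y, 'RUSBoost', 300, 'Tree');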
More Answers (0)