Decision tree and pruning optimization with imbalanced data
Hi,
I am trying to build a decision tree for a data set with imbalanced class probabilities. Furthermore, my data contains discrete and continuous predictor variables. To account for both, I have opted for the following settings in the fitctree function.
fitctree(tr_Input, tr_Output, 'prior', 'uniform', 'PredictorSelection','interaction-curvature')
After building the tree with the training set data (70%), I prune it by evaluating the loss as a function of the pruning level for both the test (ts) and the training (tr) sets of data.
%PRUNE THE INITIAL TREE
%Explore training and test error rates varying the number of nodes
[lTR,~,~] = loss(treeIni,tr, c_VariableAExplicar{:,:}, 'Subtrees', 'all', 'LossFun','classiferror');
[lTS,~,~] = loss(treeIni,ts, c_VariableAExplicar{:,:}, 'Subtrees', 'all', 'LossFun','classiferror');
To find the best pruning level I plot the error rates
f=figure(i_fig);
hold on;
plot(0:max(treeIni.PruneList), lTR,'.-');
plot(0:max(treeIni.PruneList), lTS,'.-');
set(gca,'Xdir','reverse');
xlabel('Pruning level (1 node - full tree)'); grid on;
legend('TR', 'TS'); ylabel('Error rate');
The following graphs show pruning level 13 to be reasonable for both the training and test sets, yielding an error rate of ~6.7%.
[Figure: error rate vs. pruning level for the training (TR) and test (TS) sets]
After pruning the tree
treeOpt = prune(treeIni,'Level', optPrunLevel);
I plot the confusion matrix for the training and test sets.
[Figure: confusion matrices for the training and test sets]
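A minimal sketch of how such confusion matrices can be plotted, assuming tr/ts hold the predictor data and a test response vector ts_Output exists analogous to tr_Output (confusionchart requires R2018b or later):
% Predicted labels from the pruned tree
predTR = predict(treeOpt, tr);
predTS = predict(treeOpt, ts);
% Confusion matrices for both sets (ts_Output is an assumed variable name)
figure; confusionchart(tr_Output, predTR, 'Title', 'Training set');
figure; confusionchart(ts_Output, predTS, 'Title', 'Test set');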
As you can see, the tree gives good results in identifying the few occurrences of the positive class (1); however, it performs very poorly on the negative class (0). Not percentage-wise with respect to all negative occurrences, but percentage-wise with respect to the few positive occurrences (see red circles).
What am I doing wrong?
I have tried defining different misclassification costs, which does balance the results, but obviously I then get less accuracy in identifying the positive class, which is not desired at all. Additionally, I have tried oversampling, but it also doesn't improve the tree. A sketch of the cost-matrix approach follows.
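The sketch below uses placeholder cost values (not the ones actually tried) and assumes the class order [0 1]:
% Cost(i,j) = cost of predicting class j when the true class is i
% (values are illustrative only)
C = [0 1; 5 0];   % missing a positive (true 1, predicted 0) costs 5x
treeCost = fitctree(tr_Input, tr_Output, 'ClassNames', [0 1], ...
    'Cost', C, 'PredictorSelection', 'interaction-curvature');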
Do I need to change any setting in the tree function or choose a different loss function to improve pruning?
Thanks in advance for any help!
Answers (1)
Ayush Aniket
2025-6-4
Since your dataset is imbalanced, the tree is biased toward the majority class. Even though you tried oversampling, decision trees can still struggle with minority classes if the features don't provide strong separability. Here are a few suggestions you should try:
1. Instead of uniform priors ('Prior', 'uniform'), try setting class-specific priors to emphasize the minority class (see the sketch after this list). The fitctree default is 'empirical', which determines class probabilities from the class frequencies in the response variable. Refer to the following documentation section to read more about this argument: https://www.mathworks.com/help/stats/fitctree.html#bt6cr7t_sep_shared-Prior
2. You are using 'classiferror' in the loss function, which only considers the misclassification rate. Try a probability-sensitive loss instead, such as binomial deviance ('LossFun', 'binodeviance') or logit loss ('LossFun', 'logit'). Note that Gini impurity ('gdi') is a split criterion for growing the tree ('SplitCriterion', 'gdi' in fitctree), not a loss function for pruning.
3. If class 0 and class 1 have overlapping feature distributions, the tree might struggle to separate them. Try adding interaction terms or transforming features to improve separability. Refer to the Classification Learner app for this process: https://www.mathworks.com/help/stats/feature-selection-and-feature-transformation.html
4. Decision trees alone might not be the best choice for imbalanced data. Consider a random forest (fitcensemble with 'Method', 'Bag', or TreeBagger) or a boosted ensemble such as RUSBoost (fitcensemble with 'Method', 'RUSBoost'), which is designed specifically for class imbalance. A combined sketch of these suggestions follows.
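A minimal sketch of suggestions 1, 2, and 4, reusing the variable names from the question (tr_Input, tr_Output); the prior probabilities and ensemble parameters are illustrative assumptions, not tuned values:
% 1. Class-specific priors via a structure (probabilities are an assumed example)
priorS = struct('ClassNames', [0 1], 'ClassProbs', [0.3 0.7]);
treeW = fitctree(tr_Input, tr_Output, 'ClassNames', [0 1], ...
    'Prior', priorS, 'PredictorSelection', 'interaction-curvature');

% 2. Prune with a probability-sensitive loss; the 4th output suggests a level
[lTR, ~, ~, bestLevel] = loss(treeW, tr_Input, tr_Output, ...
    'Subtrees', 'all', 'LossFun', 'binodeviance');
treeOpt = prune(treeW, 'Level', bestLevel);

% 4. RUSBoost ensemble, designed for imbalanced classes
t = templateTree('MaxNumSplits', 20);   % assumed depth control
mdlRUS = fitcensemble(tr_Input, tr_Output, ...
    'Method', 'RUSBoost', 'NumLearningCycles', 200, ...
    'Learners', t, 'LearnRate', 0.1);
Passing bestLevel back into prune gives the pruned tree directly, so the manual sweep over pruning levels becomes a cross-check rather than the primary selection step.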