fitcecoc SVM with categorical predictors not predicting the correct label for multiclass problem.

Building a simple SVM model in MATLAB with fitcecoc does not seem to predict the correct label when the predictors are categorical and the problem is multiclass.
The sample code is as follows:
% first model, train and test data are categorical
% the test data is closest to label 20
trainData = [1 1 1; 2 2 2; 2 3 3];
trainLabel = [10; 20; 30];
testData = [1 2 2];
model = fitcecoc(trainData,trainLabel,'CategoricalPredictors','all');
predictLabel = predict(model,testData);
disp(['predictLabel: ',num2str(predictLabel)]);
% second model, train and test data are same as above but represented as:
% 1 = 1 0 0, 2 = 0 1 0, 3 = 0 0 1
trainData2 = [1 0 0 1 0 0 1 0 0; 0 1 0 0 1 0 0 1 0; 0 1 0 0 0 1 0 0 1];
testData2 = [1 0 0 0 1 0 0 1 0];
model2 = fitcecoc(trainData2,trainLabel);
predictLabel2 = predict(model2,testData2);
disp(['predictLabel2: ',num2str(predictLabel2)]);
The first model should predict label 20, but it chooses label 30 instead. Based on my understanding of how SVM works, it should have chosen label 20. When I transform the first model, per this link, and reduce it to its binary representation as in model2, it predicts the correct label 20. As far as I'm aware, and per the previous link, the two models are logically identical. So either I am using incorrect syntax for the first model, or my understanding of how SVM works under the covers is incorrect (but then the two models above should still give the same result), or perhaps there is a bug in multiclass ECOC models with categorical predictors.
Any help is greatly appreciated - thanks!
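To make the "closest to label 20" claim concrete, here is a minimal sketch that counts, for each training row, how many of the three categorical values differ from the test row (a Hamming distance). It is only a nearest-neighbour sanity check on the data, not what an SVM computes internally; the names nMismatch, idx, and nearestLabel are introduced here for illustration.

```matlab
trainData  = [1 1 1; 2 2 2; 2 3 3];
trainLabel = [10; 20; 30];
testData   = [1 2 2];

% Count mismatching categorical values per training row.
% Implicit expansion compares the 3x3 matrix with the 1x3 row (R2016b+).
nMismatch = sum(trainData ~= testData, 2);

% The training row with the fewest mismatches is the "closest" one.
[~, idx] = min(nMismatch);
nearestLabel = trainLabel(idx);
disp(['nearestLabel: ', num2str(nearestLabel)]);
```

The second training row differs from the test row in only one position, which is why label 20 looks like the natural answer.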

Answers (1)

the cyclist 2020-2-13
I'm pretty sure you've got your dummy encoding wrong.
You are treating 1, 2, and 3 as if they were the same categories in all three columns. But each column is a different explanatory variable, so it could be:
  • 1st col: 1 = Blue, 2 = Red (notice there is no observation of 3 in the 1st column)
  • 2nd col: 1 = Democrat, 2 = Republican, 3 = Libertarian
  • 3rd col: 1 = Ford, 2 = BMW, 3 = Honda
Therefore, the correct dummy encoding is
trainData2 = dummyvar({categorical([1;2;2]),categorical([1;2;3]),categorical([1;2;3])});
trainData2 =
1 0 1 0 0 1 0 0
0 1 0 1 0 0 1 0
0 1 0 0 1 0 0 1
where the first two columns indicate Blue/Red, the next three columns indicate Dem/Rep/Lib, and the last three columns indicate Ford/BMW/Honda.
The correct test data for the dummy-encoded version is then
testData2 = [1 0 0 1 0 0 1 0]; % Because the test is Blue / Rep / BMW
Those inputs give me the same prediction for the dummy-encoded version as the categorical version.
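As a sketch of the same encoding built by hand, the loop below creates the indicator columns one predictor at a time and encodes the test row against the levels seen in training, so the train and test columns always line up. It assumes every test value also appears in the corresponding training column; levels, trainDummy, and testDummy are names made up for this example.

```matlab
trainData = [1 1 1; 2 2 2; 2 3 3];
testData  = [1 2 2];

nPred     = size(trainData, 2);
trainCols = cell(1, nPred);
testCols  = cell(1, nPred);

for j = 1:nPred
    % Levels observed in this predictor only (column 1 has just 1 and 2).
    levels = unique(trainData(:, j));
    % One indicator column per level, via implicit expansion (R2016b+).
    trainCols{j} = double(trainData(:, j) == levels.');
    testCols{j}  = double(testData(:, j)  == levels.');
end

trainDummy = [trainCols{:}];   % 3-by-8: 2 + 3 + 3 indicator columns
testDummy  = [testCols{:}];    % 1-by-8, same column layout as trainDummy
disp(testDummy);
```

Because the first predictor never takes the value 3 in training, it contributes two columns rather than three, which is where the 8 columns (instead of 9) come from.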
  3 Comments
the cyclist 2020-2-16
So, let's call my dummy encoding the third model. Then,
% first model, train and test data are categorical
% the test data is closest to label 20
trainData = [1 1 1;
2 2 2;
2 3 3];
trainLabel = [10;
20;
30];
testData = [1 2 2];
model = fitcecoc(trainData,trainLabel,'CategoricalPredictors','all');
predictLabel = predict(model,testData);
disp(['predictLabel: ',num2str(predictLabel)]);
% second model, train and test data are same as above but represented as:
% 1 = 1 0 0, 2 = 0 1 0, 3 = 0 0 1
trainData2 = [1 0 0 1 0 0 1 0 0;
0 1 0 0 1 0 0 1 0;
0 1 0 0 0 1 0 0 1];
testData2 = [1 0 0 0 1 0 0 1 0];
model2 = fitcecoc(trainData2,trainLabel);
predictLabel2 = predict(model2,testData2);
disp(['predictLabel2: ',num2str(predictLabel2)]);
% third model
trainData3 = dummyvar({categorical([1;2;2]),categorical([1;2;3]),categorical([1;2;3])})
testData3 = [1 0 0 1 0 0 1 0]; % Because the test is Blue / Rep / BMW
model3 = fitcecoc(trainData3,trainLabel);
predictLabel3 = predict(model3,testData3);
disp(['predictLabel3: ',num2str(predictLabel3)]);
Weird thing is that I could have sworn that models #1 and #3 were the ones that gave the same result. I think the reason may be that I was also playing around with the name-value pair 'CategoricalPredictors','all' for the dummy-encoded models as well. When I do, everything gives the same answer.
I'm frankly not sure at the moment if it makes sense to use that for the dummy-encoded models. I'm not able to spend time right now thinking about it, but thought I would toss that idea out there.
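One way to dig into this is to look at the second output of predict, which for an ECOC model is the negated average binary loss per class (the predicted label is the class with the largest value). The sketch below fits model #1, model #3, and model #3 with 'CategoricalPredictors','all', then prints the scores side by side. The names model3cat and negLoss* are introduced here; with only one observation per class the scores may turn out close or tied, which is a guess to check rather than a confirmed explanation.

```matlab
trainLabel = [10; 20; 30];
trainData  = [1 1 1; 2 2 2; 2 3 3];
testData   = [1 2 2];

% Dummy-encoded data from the answer above (2 + 3 + 3 indicator columns).
trainData3 = [1 0 1 0 0 1 0 0;
              0 1 0 1 0 0 1 0;
              0 1 0 0 1 0 0 1];
testData3  = [1 0 0 1 0 0 1 0];

model1    = fitcecoc(trainData,  trainLabel, 'CategoricalPredictors', 'all');
model3    = fitcecoc(trainData3, trainLabel);
model3cat = fitcecoc(trainData3, trainLabel, 'CategoricalPredictors', 'all');

% Second output: negated average binary loss, one column per class.
[label1,    negLoss1]    = predict(model1,    testData);
[label3,    negLoss3]    = predict(model3,    testData3);
[label3cat, negLoss3cat] = predict(model3cat, testData3);

disp('Class order:');
disp(model1.ClassNames.');
disp('Negated loss per class (rows: model1, model3, model3cat):');
disp([negLoss1; negLoss3; negLoss3cat]);
disp('Predicted labels:');
disp([label1, label3, label3cat]);
```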
John Pfeifer 2020-2-16
Thanks for putting some time towards this.
As far as I can tell, for the first model Matlab should be internally converting the data into a categorical representation, similar to your model 3, but it does not seem to be happening correctly.
The workaround is to just explicitly convert the data to a categorical representation, so I'll just go with that.


Release

R2018b
