Why is variance high for high K value in this KNN code?

Question

Vanditha Rao 2019-7-19

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/472553-why-is-variance-high-for-high-k-value-in-this-knn-code

Edited: Ganesh Regoti 2019-7-29

Hello,

Long post, please bear with me

I have a MATLAB dataset (dataset.mat) of size 280*3. The last column holds the labels, and there are 3 classes in total (1, 2 and 3). I am implementing KNN on this dataset. I want to calculate the classification error, and the mean and the variance of the classification error over multiple splits (random, but even). From the plots I want to determine how the k value affects the mean and the variance of the classification error.

I understand the concept of bias and variance. I also know that as the k value increases, the bias will increase and the variance will decrease. When k = 1 the bias will be 0; however, new data (in the test set) has a higher chance of being misclassified, which causes high variance. But the variance isn't decreasing in my plot (please see the attachment).

My code looks like this:

%% Loading the dataset
clear all
clc
load('dataset.mat');
%% Calculating the mean, variance and classification error for multiple splits
m = []; % empty list to store the mean of the classification error
variance = []; % empty list to store the variance of the classification error
error = []; % empty list to store the classification error
for k= 1:20 % different k values
    
    error = [];
    
    for j= 1:10 % This for loop is for random split (note: each time it is split evenly i.e. 50% into a training set and rest in a test set). 
        
        
        % dataset is split evenly (i.e. 50%), but randomly in to a training set and a test set all 10 times
        
        N = size(knn_samples,1);
        idx = randperm(N);
        
        train = knn_samples(idx(1:round(N*0.5)),:);
        test = knn_samples(idx(round(N*0.5)+1:end),:);
        X_train = train(:,1:2); % size 140*2
        y_train = train(:,3); % size 140*1
        X_test = test(:,1:2); % size 140*2
        y_test = test(:,3); % size 140*1
       
        Model = fitcknn(X_train,y_train,'NumNeighbors',k,'Standardize',1); % KNN model
        
        rloss = resubLoss(Model); % the classification loss by resubstitution
        
        [label_test,score_test,cost_test] = predict(Model,X_test);
        L = loss(Model,X_test,y_test); %how well the model classifies the data 
        C_test = confusionmat(y_test,label_test); % confusion matrix 
        off_diag = sum(C_test(:)) - trace(C_test); % total of the off-diagonal entries of the confusion matrix, i.e. the number of misclassified test points
        accuracy = sum(diag(C_test)/sum(C_test(:)));
        
        errorClass = sum(label_test ~= y_test)/length(y_test);
        error = [error, errorClass]; % classification error
        
    end
    
    m = [m, mean(error)]; %mean of the classification error
    variance = [variance, var(error)]; % variance of the classification error
    
end
figure(1)
hold on
colormat1 = y_test;
scatter(X_test(:, 1), X_test(:, 2), [], colormat1); 
l = (label_test ~= y_test); % specify wrong predictions
colormat2 = label_test(l);
mkr = 'x';
scatter(X_test(l, 1), X_test(l, 2), [], colormat2, mkr); % mark the wrong predictions
k = 1:20;
 
figure(2)
plot(k, m, 'b')
xlabel('K values')
ylabel('Mean')
title('Mean of the classification error') % over multiple splits
figure(3)
plot(k, variance, 'k') % 'variance' is the vector filled in the loop above
xlabel('K values')
ylabel('Variance')
title('Variance of the classification error')

Maybe there is a more compact way of writing this code, but I am a beginner. This could be a very basic question, but I am unable to figure it out. I looked online for a solution but didn't find anything. Almost every site talks about the bias-variance trade-off, but I didn't find any code example or any reason why the variance could increase with increasing k. Maybe there is a small glitch in the code that I am unable to spot. I have given up on finding a solution on my own, so I am asking the MATLAB community. You can also suggest a better way to write this code, or share a link that could give me a solution.

Note: Please also have a look at the variance values. Are they too small? (They are in the 10^-3 range.)

Thank you very much

2 Comments

Ganesh Regoti 2019-7-24

Can you provide a section of dataset to test on the model?

Vanditha Rao 2019-7-28

dataset.mat

@Ganesh Regoti: What do you mean by a section of the dataset? Do you want me to attach the dataset? I have attached it.

Answer 1

llueg 2019-7-24

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/472553-why-is-variance-high-for-high-k-value-in-this-knn-code#answer_384606

I agree that more information on the data would be helpful. Also, since your data set is fairly small, you can probably do more than 10 (maybe a hundred) different splits for each k to get a more accurate average. If the current trend is still there, it's probably due to properties specific to your data.
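A minimal sketch of that suggestion, assuming the Statistics and Machine Learning Toolbox is available. The synthetic data at the top is only a hypothetical stand-in so the snippet runs on its own; the class centers, the split count of 100, and the variable names are illustrative choices, not taken from the question. Unlike the original loop, it uses stratified splits via cvpartition and evaluates every k on the same split, so differences between k values are not confounded with differences between splits.

```matlab
% Sketch: mean and variance of the KNN test error over many random 50/50 splits.
% Requires the Statistics and Machine Learning Toolbox (fitcknn, cvpartition).
rng(0);                                   % fixed seed so the run is repeatable

% Synthetic stand-in for knn_samples (280-by-3, last column = label 1..3).
% ASSUMPTION: replace these three lines with load('dataset.mat') for the real data.
labels  = repelem((1:3)', [94; 93; 93]);  % 280 labels in total
centers = [0 0; 2 0; 1 2];                % hypothetical class centers
knn_samples = [centers(labels,:) + randn(280,2), labels];

X = knn_samples(:,1:2);
y = knn_samples(:,3);

kValues   = 1:20;
numSplits = 100;                          % more than 10 splits, as suggested above
testErr   = zeros(numSplits, numel(kValues));

for s = 1:numSplits
    c = cvpartition(y, 'HoldOut', 0.5);   % random, stratified 50/50 split
    trIdx = training(c);                  % logical index of training rows
    teIdx = test(c);                      % logical index of test rows
    for j = 1:numel(kValues)
        mdl = fitcknn(X(trIdx,:), y(trIdx), ...
            'NumNeighbors', kValues(j), 'Standardize', true);
        % fraction of misclassified test points for this split and this k
        testErr(s,j) = loss(mdl, X(teIdx,:), y(teIdx), 'LossFun', 'classiferror');
    end
end

meanErr = mean(testErr, 1);               % 1-by-20, one value per k
varErr  = var(testErr, 0, 1);             % 1-by-20, sample variance across splits
```

Plotting meanErr and varErr against kValues gives the same two curves as in the question, but averaged over more splits, which should make any real trend easier to tell apart from noise.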

Answer 2

Ganesh Regoti 2019-7-29

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/472553-why-is-variance-high-for-high-k-value-in-this-knn-code#answer_385203

Edited: Ganesh Regoti 2019-7-29

In KNN classification, the variance need not decrease as the K value increases. Usually the curve is U-shaped, and we look for the optimal point.

Certain predictors might contribute more to the classification than others. Depending on how those highly contributing predictors behave, there are two cases:

Constant: There will not be much difference in the variance graph over the entire data set.

Values vary and reach an optimum at a certain point: The variance also varies accordingly (probably decreasing as K increases), but once the optimal point is reached, it might start increasing.

So I think that in your case the optimal point is reached along the way, and continuing past it leads to an increase in variance.
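One rough way to sanity-check both the size and the direction of the variance is a binomial approximation. This is a simplifying assumption, not something established in this thread: if each of the n test points is treated as an independent trial that is misclassified with probability p, the error rate over the test set has variance p(1-p)/n. The error rates below are hypothetical values chosen for illustration, not measurements from the dataset.

```matlab
% Binomial approximation for the variance of a test error rate.
% ASSUMPTION: test points are independent and share one error probability p;
% this ignores variability from the training set and overlap between splits.
n = 140;                       % test-set size from the question
p = [0.05 0.10 0.20 0.30];     % hypothetical error rates, not measured values
binomVar = p .* (1 - p) / n;   % variance of an error rate over n independent points
disp([p; binomVar]);           % first row: p, second row: approximate variance
```

Under this approximation the variance lies between roughly 3e-4 and 1.5e-3 for these error rates, so values in the 10^-3 range are not suspicious for 140 test points. It also grows with p as long as p is below 0.5, so if the mean error rises with K, the variance of the error across splits can rise with it.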