Non-reproducible "fitcsvm" Matlab output

load ionosphere
% run number 1
rng(1); % For reproducibility
SVMModel1 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','CacheSize','maximal','Solver','L1QP','KernelScale','auto');
% run number 2
indperm = randperm(size(X,1))';
X=X(indperm,:);
Y=Y(indperm);
SVMModel2 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','CacheSize','maximal','Solver','L1QP','KernelScale','auto');
SVMModel1 and SVMModel2 differ (in their bias and kernel scale values), even though the only change is the row order of the input data X and Y. Any idea what's going on?
Thanks for your help.
  4 comments
Rik 2023-6-9
I'm not sure you fully understand what rng(1) does (or I'm misunderstanding you).
It sets the state of the random number generator, so that the output of any random function is deterministic (though still pseudorandom). An example will help:
rng(1)
A = randi(20,1);
rng(1)
B = randi(20,1);
C = randi(20,1);
% A should now be equal to B, but C may be different
A,B,C
A = 9
B = 9
C = 15
So there are two reasons why the output is not the same despite calling rng: you have already called a random function (randperm, which advances the generator state), and you are changing the input (which could affect the results).
Here is an example of the latter. I don't know how the internals of fitcsvm work, but that doesn't matter for the concept.
rng(1)
data = 5*rand(2000,1);
indperm = randperm(size(data,1))';
SuperFancyMachineLearningMean(data)-mean(data)
ans = 4.4409e-16
SuperFancyMachineLearningMean(data(indperm))-mean(data)
ans = 8.8818e-16
function output = SuperFancyMachineLearningMean(data)
% Calculate (well, approximate, actually) the mean of a vector.
% Split the data in N blocks.
N = min(numel(data),10);
D1 = repmat(ceil(numel(data)/N),1,N);
D2 = 1;
D1(end) = numel(data)-sum(D1(1:(end-1))); % make the last smaller to fit element count
d = mat2cell(reshape(data,[],1),D1,D2);
for n=1:numel(d)
d{n} = mean(d{n});
end
output = mean([d{:}]);
end
Apparently the discrepancy in this example is not as large as I expected (unless you're working with very small numbers), but the idea carries over.
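The tiny differences in the example above come from floating-point rounding: addition of doubles is not associative, so the order in which values are accumulated can change the last bits of the result. A minimal sketch of that effect, using illustrative variable names that are not part of the original example:

```matlab
% Floating-point addition is not associative, so grouping matters.
a = 0.1; b = 0.2; c = 0.3;

leftFirst  = (a + b) + c;   % add a and b first
rightFirst = a + (b + c);   % add b and c first

% The two groupings differ in the last bit in IEEE double precision.
orderMatters = (leftFirst ~= rightFirst)
gap = abs(leftFirst - rightFirst)
```

An iterative solver that accumulates sums over the rows of X will see the same kind of rounding differences when the rows are permuted, and those differences can then propagate through the optimization.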
Mm 2023-6-10
Thanks for the explanation. We are on the same page regarding rng. In any case, your code suggests that the subsampling performed internally for the SVM kernel scale optimization is responsible for the non-reproducibility of the final model when the input order changes. Thanks for your help.
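One way to test the subsampling hypothesis is sketched below. It assumes the Statistics and Machine Learning Toolbox is installed, leaves the solver at its default for brevity, and uses illustrative variable names (mdlA to mdlD). The first part resets the generator before each fit on identical data. The second part fixes 'KernelScale' to a number so that the 'auto' heuristic is skipped entirely, and then compares the original and permuted row order.

```matlab
load ionosphere   % provides the predictor matrix X and the labels Y

% Part 1: same data, same generator state before each fit.
rng(1);
mdlA = fitcsvm(X, Y, 'Standardize', true, 'KernelFunction', 'linear', ...
    'KernelScale', 'auto');
rng(1);           % reset so the heuristic draws the same subsample
mdlB = fitcsvm(X, Y, 'Standardize', true, 'KernelFunction', 'linear', ...
    'KernelScale', 'auto');
sameScale = isequal(mdlA.KernelParameters.Scale, mdlB.KernelParameters.Scale)

% Part 2: fixed kernel scale, original versus permuted row order.
indperm = randperm(size(X, 1))';
mdlC = fitcsvm(X, Y, 'Standardize', true, 'KernelFunction', 'linear', ...
    'KernelScale', 1);
mdlD = fitcsvm(X(indperm, :), Y(indperm), 'Standardize', true, ...
    'KernelFunction', 'linear', 'KernelScale', 1);
biasGap = abs(mdlC.Bias - mdlD.Bias)
```

If the heuristic is the main source of variation, sameScale should be true and biasGap should be small. A residual gap would not be surprising, since the solver itself may be sensitive to row order through rounding and convergence tolerances.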


Accepted Answer

Rik 2023-6-9
I'm not familiar with the internals of what this does exactly, but is this truly unexpected?
Since this is a form of fitting your data to a function, some variation is expected. For small fitting problems you can use the entire dataset in one go, meaning that sorting may or may not affect the result, but with machine learning this is generally not feasible. That means that the order of your samples may affect the training result.
  2 comments
Mm 2023-6-9 (edited: Mm 2023-6-9)
Why should this be expected? I expected differences in the order of the samples to be handled internally by the code, so that the results are reproducible. The slight differences seen with linear kernels and simple optimizers explode when using polynomial or RBF kernels and Bayesian optimization.
Rik 2023-6-9
Would you still expect the code to sort the data internally in some way if we're talking terabytes of data? Because that is essentially what you're asking. Note that I'm not defending the current implementation of this function, I'm merely explaining why I'm not surprised that there are functions in the stats&ML toolbox for which this happens.
This is essentially the same problem as when you make splits for cross-validation: the splits may determine the outcome (I don't recall whether my colleague published this, so you will have to look for it yourself if you want to see a paper). While it is true that small changes in the data may explode when extrapolating, that is not unique to systems that depend on the data input order. Every extrapolation runs this risk.
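The cross-validation point can be illustrated with a short sketch. It assumes the Statistics and Machine Learning Toolbox and uses default SVM settings rather than the options from the question. The only thing that changes between the two runs is the generator state used to draw the folds.

```matlab
load ionosphere   % provides the predictor matrix X and the labels Y

mdl = fitcsvm(X, Y, 'Standardize', true);

% Same model and data, but the folds are drawn from different generator states.
rng(1);
cvA = crossval(mdl, 'KFold', 5);
rng(2);
cvB = crossval(mdl, 'KFold', 5);

lossA = kfoldLoss(cvA)   % estimated misclassification rate, first split
lossB = kfoldLoss(cvB)   % estimated misclassification rate, second split
```

Any difference between lossA and lossB comes purely from how the samples were assigned to folds.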




Release

R2022b
