Why are the results different when using trainNetwork and a custom training loop?

I have defined a custom layer and constructed a simple network with it. When I train the network with trainNetwork and with a custom training loop, the results are different, even though the parameters and data are the same. The code is as follows:
1. This is the network trained with the trainNetwork function:
clear
clc
rng(0)
%% parameters
nMFs = 32;
init_method = 'linespace';
%% data
dataname = 'house';
load([dataname,'.mat'])
% data = xx;
data = data(all(~isnan(data),2),:);  % remove rows with missing values
data = removeconstantrows(data')';   % remove constant features
%% data process
X=data(:,1:end-1); y=data(:,end); y=y./1e5;%y=y-mean(y);
% X=zscore(X);
X=2*(X-min(X))./(max(X)-min(X))-1;
[N0,M]=size(X);
N=round(N0*.7);
idsTrain=datasample(1:N0,N,'replace',false);
XTrain=X(idsTrain,:); yTrain=y(idsTrain);
XTest=X; XTest(idsTrain,:)=[];%XTest={XTest};
yTest=y; yTest(idsTrain)=[];%yTest={yTest};
%% rule list
nRules = nMFs;
% rule = comb(repmat(1:nMFs,M,1));
% rule = repmat([1:nMFs]',1,M);
%% learnable parameters initial method
switch init_method
    case 'FCM'
        % FCM
        [C0,U] = FuzzyCMeans(XTrain,nRules,[2 100 0.001 0]);
        Sigma0 = C0;
        W0 = randn(nRules,M+1);
        for ir = 1:nRules
            Sigma0(ir,:) = std(XTrain,U(ir,:));
            W0(ir,1) = U(ir,:)*yTrain/sum(U(ir,:));
        end
        Sigma0(Sigma0==0) = mean(Sigma0(:));
    case 'random'
        % random
        C0 = randn(nRules,M);
        Sigma0 = rand(nRules,M);
        W0 = randn(nRules,M+1);
        Sigma0(Sigma0==0) = mean(Sigma0(:));
    case 'linespace'
        % linespace
        C0 = zeros(nMFs,M); Sigma0 = C0; W0 = zeros(nMFs,M+1);
        for m = 1:M % Initialization
            C0(:,m) = linspace(min(XTrain(:,m)),max(XTrain(:,m)),nMFs);
            Sigma0(:,m) = std(XTrain(:,m));
        end
        Sigma0 = ones(nMFs,M);
end
%% layers
layers = [
    featureInputLayer(M,'Name','Input','Normalization','none');
    TSKlayer1(C0,Sigma0,W0,'TSK1');
    regressionLayer];
options = trainingOptions(...
    'adam',...
    'GradientDecayFactor',0.9,...
    'SquaredGradientDecayFactor',0.999,...
    'Epsilon',1e-8,...
    'MaxEpochs',50,...
    'MiniBatchSize',128,...
    'InitialLearnRate',0.01,...
    'LearnRateSchedule','piecewise',...
    'LearnRateDropPeriod',100,...
    'LearnRateDropFactor',1,...
    'Shuffle','every-epoch',...
    'ValidationData',{XTest,yTest},...
    'ValidationFrequency',10,...
    'ValidationPatience',1000,...
    'OutputNetwork','best-validation-loss',...
    'L2Regularization',0,...
    'ResetInputNormalization',false,...
    'GradientThreshold',inf,...
    'Plots','training-progress');
%% Train the nn
tic
[net,tinfo] = trainNetwork(XTrain,yTrain,layers,options);
toc
The result: the minimum RMSE is about 0.344.
2. This is the same network trained with a custom training loop:
clear
clc
rng(0)
%% parameters
nMFs = 32;
learnRate = 0.01;
decay = 1;
gradientDecayFactor = 0.9;
squaredGradientDecayFactor = 0.999;
epsilon = 1e-8;
numEpochs = 50;
miniBatchSize = 128;
init_method = 'linespace';
%% data
dataname = 'house';
load([dataname,'.mat'])
data = data(all(~isnan(data),2),:);  % remove rows with missing values
data = removeconstantrows(data')'; % remove constant features
%% data process
X=data(:,1:end-1); y=data(:,end); y=y./1e5;%y=y-mean(y);
X=2*(X-min(X))./(max(X)-min(X))-1;
[N0,M]=size(X);
N=round(N0*.7);
idsTrain=datasample(1:N0,N,'replace',false);
XTrain=X(idsTrain,:); yTrain=y(idsTrain);
XTest=X; XTest(idsTrain,:)=[];%XTest={XTest};
yTest=y; yTest(idsTrain)=[];%yTest={yTest};
XTest = dlarray(XTest','CB');
yTest = dlarray(yTest','CB');
%% rule list
nRules = nMFs;
% rule = comb(repmat(1:nMFs,M,1));
% rule = repmat([1:nMFs]',1,M); % not used
%% initial method
switch init_method
    case 'FCM'
        % FCM
        [C0,U] = FuzzyCMeans(XTrain,nRules,[2 100 0.001 0]);
        Sigma0 = C0;
        W0 = randn(nRules,M+1);
        for ir = 1:nRules
            Sigma0(ir,:) = std(XTrain,U(ir,:));
            W0(ir,1) = U(ir,:)*yTrain/sum(U(ir,:));
        end
        Sigma0(Sigma0==0) = mean(Sigma0(:));
    case 'random'
        % random
        C0 = randn(nRules,M);
        Sigma0 = rand(nRules,M);
        W0 = randn(nRules,M+1);
        Sigma0(Sigma0==0) = mean(Sigma0(:));
    case 'linespace'
        % linespace
        C0 = zeros(nMFs,M); Sigma0 = C0; W0 = zeros(nMFs,M+1);
        for m = 1:M % Initialization
            C0(:,m) = linspace(min(XTrain(:,m)),max(XTrain(:,m)),nMFs);
            % Sigma0(:,m) = std(XTrain(:,m));
        end
        Sigma0 = ones(nMFs,M);
end
%% data format
dsXTrain = arrayDatastore(XTrain);
dsyTrain = arrayDatastore(yTrain);
dsTrain = combine(dsXTrain,dsyTrain);
layers = [
    featureInputLayer(M,'Name','Input','Normalization','none');
    TSKlayer1(C0,Sigma0,W0,'TSK1');
    ];
lgraph = layerGraph(layers);
dlnet = dlnetwork(lgraph);
plots = "training-progress";
% plots = "nan";
% Train Model
% Train the model using a custom training loop.
velocity = []; % unused here; only needed for the SGDM solver
% accfun = dlaccelerate(@modelGradients);
% clearCache(accfun)
%% mini batch
mbq = minibatchqueue(dsTrain,...
    'MiniBatchSize',miniBatchSize,...
    'MiniBatchFcn',@preprocessMiniBatch,...
    'MiniBatchFormat',{'CB','CB'});
% Initialize the training progress plot.
if plots == "training-progress"
    figure
    lineLossTrain = animatedline('Color',[0.85 0.325 0.098]);
    lineLossTest = animatedline('Color',[0 0 0]);
    ylim([0 inf])
    xlabel("Iteration")
    ylabel("Loss")
    grid on
end
averageGrad = [];
averageSqGrad = [];
iteration = 0;
start = tic;
% Loop over epochs.
for epoch = 1:numEpochs
    learnRate = learnRate*decay;
    % Shuffle data.
    shuffle(mbq)
    % Loop over mini-batches.
    while hasdata(mbq)
        iteration = iteration + 1;
        % Read mini-batch of data.
        [dlX1,dlY] = next(mbq);
        % Evaluate the model gradients, state, and loss using dlfeval and the
        % modelGradients function, and update the network state.
        [gradients,state,loss] = dlfeval(@modelGradients,dlnet,dlX1,dlY);
        dlnet.State = state;
        % Update the network parameters using the Adam optimizer.
        % [dlnet, velocity] = adamupdate(dlnet, gradients, velocity, learnRate, momentum);
        [dlnet,averageGrad,averageSqGrad] = ...
            adamupdate(dlnet, gradients, ...
            averageGrad, averageSqGrad, iteration, ...
            learnRate, gradientDecayFactor, squaredGradientDecayFactor, epsilon);
        yPreVal = predict(dlnet,XTest);
        % yPreVal(isnan(yPreVal)) = yTest(isnan(yPreVal));
        test_error = sqrt(mse(yPreVal,yTest))
        if plots == "training-progress"
            % Display the training progress.
            D = duration(0,0,toc(start),'Format','hh:mm:ss');
            % completionPercentage = round(iteration/numIterations*100,0);
            title("Epoch: " + epoch + ", Elapsed: " + string(D));
            addpoints(lineLossTrain,iteration,double(gather(extractdata(sqrt(loss)))))
            addpoints(lineLossTest,iteration,double(extractdata(test_error)))
            drawnow limitrate
        end
    end
end
function [gradients,state,loss] = modelGradients(dlnet,dlX1,Y)
    [dlYPred,state] = forward(dlnet,dlX1);
    loss = mse(dlYPred,Y);
    gradients = dlgradient(loss,dlnet.Learnables);
end

function [X,Y] = preprocessMiniBatch(XCell,YCell)
    % Extract feature data from cell and concatenate.
    X = cat(1,XCell{:});
    X = X';
    % Extract label data from cell and concatenate.
    Y = cat(2,YCell{:});
end
The result: the minimum RMSE is about 0.24.
Why are the results so different?
The TSK1 layer is my custom layer with a backward function.
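For reference, one way to rule out a difference in the trained networks themselves is to evaluate both models on the same held-out data with the same, manually computed metric. This is a minimal sketch; it assumes net from the first script and dlnet from the second script are both in the workspace, and that XTest/yTest are the plain numeric arrays (before the dlarray conversion in the second script):
% Sketch: compare the two trained models with an identical RMSE calculation
yPred1 = predict(net,XTest);                   % model from trainNetwork, N-by-1 numeric
yPred2 = predict(dlnet,dlarray(XTest','CB'));  % model from the custom loop, 1-by-N dlarray
yPred2 = extractdata(yPred2)';                 % back to N-by-1 numeric
rmseTrainNetwork = sqrt(mean((yPred1 - yTest).^2))
rmseCustomLoop   = sqrt(mean((yPred2 - yTest).^2))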
4 Comments
Samuel Somuyiwa on 2 Mar 2022
The RMSE in the training plot of trainNetwork does not include the factor of half, whereas in the custom training loop you used sqrt(mse(x,y)), and mse includes the factor of half. l2loss does not include the factor of half, so it should be the right function to use in this case. Did you try sqrt(l2loss(x,y))?


Accepted Answer

Samuel Somuyiwa on 14 Mar 2022
The RMSE in the training plot of trainNetwork does not include the factor of half, whereas in the custom training loop you used sqrt(mse(x,y)), and mse includes the factor of half. l2loss does not include the factor of half, so the right way to compute RMSE in this case is sqrt(l2loss(x,y)).
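A quick numerical check of that factor: 0.344/sqrt(2) ≈ 0.243, which matches the ~0.24 reported by the custom loop, consistent with the difference coming from the metric rather than from the training itself. A minimal sketch of the three quantities (assuming yPreVal and yTest are the formatted 'CB' dlarrays from the custom loop above):
% Sketch: compare the RMSE definitions discussed above
err       = extractdata(yPreVal) - extractdata(yTest);
plainRmse = sqrt(mean(err.^2))           % comparable to the RMSE in the trainNetwork plot
halfRmse  = sqrt(mse(yPreVal,yTest))     % what the custom loop plots: plainRmse/sqrt(2), since mse is the half-MSE
l2Rmse    = sqrt(l2loss(yPreVal,yTest))  % per the answer above, l2loss omits the 1/2 factor, so this should match plainRmse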
