Why is the calculated Rsquare different between the embedded fit function and the EzyFit function (from File Exchange)?

Question

Sim 2023-5-22

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1970219-why-is-the-calculated-rsquare-different-between-the-embedded-fit-function-and-the-ezyfit-function-f

编辑： Sim 2023-5-29

采纳的回答： John D'Errico

Sorry, I think I have overwritten a new question over an old one... is there a way to restore the old question?

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

John D'Errico 2023-5-22

1
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1970219-why-is-the-calculated-rsquare-different-between-the-embedded-fit-function-and-the-ezyfit-function-f#answer_1241574

编辑：John D'Errico 2023-5-22

在 MATLAB Online 中打开

Do you understand that R^2 is not valid, when computed for a model with no constant term? Instead, I recall there are variations of R^2 that are more valid for models with no constant term.

Did you notice that R^2 for one of those computations was negative? That clearly suggests a problem. That your model has NO constant term also suggests why there is a problem.

That your data has one outlier in it, that would heavily influence the fit is another problem. I won't get into that at all.

Simple one number measures like R^2 are a problem. They are a far larger problem if you don't understand what they are telling you. And if you are worried about R^2 at all, on a problem with no constant term, then you don't understand R^2.

Let me spend some time writing and explaining...

x1 = [
46674110029
86718990168
18495599912
16876863704
25401459449
17734249873
79267910209
395721875175
12705609501
45890484099
60164096334
90470480267
88763478115
87007296104
95766710343
55469767365
08584136511
10668583756];
y1 = [
03428582923
49016736711
98595650498
68253981628
64494736076
09183624669
87690315152
01379244271
91071657769
38381778387
46183991586
49502991896
73593671295
44200564969
35882024405
8042370884
2292759675
18628227003];

Essentially, R^2 is a very simple measure that compares the variability in the data itself, IF we had essentially no model at all. We can get that from the variance.

var(y1)
ans = 1.1056e+06

Note that the variance SUBTRACTS OFF THE MEAN OF THE DATA. It implicitly assumes the model for this process is a constant model, with gaussian noise added. So the implicit model in that variance computation is just

y = a0 + noise

And we can recover the best least square estimate of a0 from the mean.

a0 = mean(y1)
a0 = 3.0491e+03

But what did you fit? You tried to fit a model that lacks a constant term.

y = a1*x

We can get that directly from backslash, or we can use fit. Since we will do these computations essentially by hand, I'll use backslash.

format long g
a1 = x1\y1
a1 = 
          1.62650678745712

I can also use fit though, just to convince you that backslash was correct.

mdl = fittype('a1*x1','indep','x1');
[fittedmdl,G] = fit(x1,y1,mdl)
Warning: Start point not provided, choosing random start point.
fittedmdl = 
     General model:
     fittedmdl(x1) = a1*x1
     Coefficients (with 95% confidence bounds):
       a1 =       1.627  (1.289, 1.964)
G = struct with fields:
           sse: 26350446.5536834
       rsquare: -0.401930869730069
           dfe: 17
    adjrsquare: -0.401930869730069
          rmse: 1245.00050918212
fittedmdl.a1
ans = 
           1.6265067878172

So the same value, fit should be actually a little less accurate here, because fit uses an iterative procedure. So the slop lies in the convergence tolerance.

You will notice that fit returns a NEGATIVE R^2. Again, that should be a hint. NO CONSTANT TERM.

Now, what does R^2 tell us? R^2 compares how well the current model does in terms of reducing the variability in the data, compared to no model more complex than assuming the prcess is simply a constant one, plus noise.

SSbase = sum((y1 - mean(y1)).^2)
SSbase = 
          18795824.4751091
SSmdl = sum((y1 - a1*x1).^2)
SSmdl = 
          26350446.5536833

As you can see, the base sum of squares, where I subtracted off only the mean is SMALLER then the sum of squares when I subtracted off the estimate from this model. The R^2 computation is now a simple one.

R2bad = 1 - SSmdl/SSbase
R2bad = 
        -0.401930869730068

So again, a negative R^2 tells us that your model does not fit the data better than if you had just used a constant approximation for the process.

Finally, we might consider if a better meaure (for THIS process) is how well the fit reduces the simple sum of squares of your data, had we not subtracted off the mean.

R2nocon = 1 - SSmdl/sum(y1.^2)
R2nocon = 
         0.858439152113143

This assume that the default model for the process is

y = noise

with a presumed mean of zero. And that is probably what ezyfit computed, since it apparently knew the model has no constant term.

Finally, I'll plot the various models.

plot(x1,y1,'mo')

hold on

plot(fittedmdl,'r')

yline(mean(y1),'b')

The blue horizontal line is a model of the process where there is no signal at all, just random noise. In fact, that actually fits the data better in terms of explaining the sum of squares of errors, compared to the linear fit with no constant term.

In the end, mono-numerosis is a BAD thing. NEVER RELY ON A SIMPLE NUMBER TO TELL YOU IF YOUR MODEL IS ANY GOOD, certainly not if you don't understand the number in the first place. I would even go further, to tell you to rely on your eyes and your brain, NOT on any number. If the fit looks right, it is right. At the very least, think about what you are doing. Is the fit adequate for what you need?

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

Sim 2023-5-22

编辑：Sim 2023-5-23

在 MATLAB Online 中打开

Thanks a lot @John D'Errico for your great answer!

I tried to understand and reproduce what you did, performing the fitting/linear regression for the following cases:

(a) only slope, with the original equation for R^2 (done manually)
(b) only slope, with the corrected equation for R^2 (done manually)
(c) slope and intercept, with the original equation for R^2 (done manually)
(d) this is an additional case, that I add here, just to compare to the other cases: slope and intercept detected automatically and R^2 calculated automatically with "fitlm" (done automatically)

clc;close all;

% input

x1 = [

1734.46674110029

1721.86718990168

1456.18495599912

1748.16876863704

1262.25401459449

1584.17734249873

1859.79267910209

864.395721875175

1952.12705609501

1503.45890484099

1976.60164096334

3862.90470480267

1914.88763478115

1373.87007296104

1826.95766710343

1515.55469767365

1765.08584136511

1318.10668583756];

y1 = [

4289.03428582923

2246.49016736711

1540.98595650498

2038.68253981628

3541.64494736076

4039.09183624669

3602.87690315152

1747.01379244271

4285.91071657769

2935.38381778387

2432.46183991586

4121.49502991896

2455.73593671295

4682.44200564969

3633.35882024405

1763.8042370884

1803.2292759675

3724.18628227003];

% (1) Manual fitting/regression

% find slopes (and intercept)

b1 = x1\y1; % <-- (a & b) only slope

yfit1 = b1*x1;

x2 = [ones(length(x1),1) x1];

b2 = x2\y1; % <-- (c) slope and intercept

yfit3 = x2*b2;

% draw a plot

hold on

scatter(x1,y1,[],'b','filled')

plot(x1,yfit1,'-.k')

plot(x1,yfit3,'-r')

hold off

xlabel('x1')

ylabel('y1')

legend('Data','(a & b) only slope','(c) slope and intercept','Location','best');

% calculate R^2 and RMSE statistics

Rsq1 = 1 - sum((y1 - yfit1).^2)/sum((y1 - mean(y1)).^2); % <-- (a) only slope, with original equation for R^2

MSE1 = sum((y1 - yfit1) .^ 2) / numel(y1);

RMSE1 = sqrt(MSE1);

Rsq2 = 1 - sum((y1 - yfit1).^2)/sum((y1 ).^2); % <-- (b) only slope, with the corrected equation for R^2

MSE2 = MSE1; % <-- both MSE and RMSE are the same as the previous case since do not depend on mean(y1)

RMSE2 = RMSE1;

Rsq3 = 1 - sum((y1 - yfit3).^2) /sum((y1 - mean(y1)).^2); % <-- (c) slope and intercept, with original equation for R^2

MSE3 = sum((y1 - yfit3) .^ 2) / numel(y1);

RMSE3 = sqrt(MSE3);

% make a summary

t1 = table({'a'},{'NaN'},b1, Rsq1, MSE1, RMSE1);

t2 = table({'b'},{'NaN'},b1, Rsq2, MSE2, RMSE2);

t3 = table({'c'},b2(1),b2(2),Rsq3, MSE3, RMSE3);

t1.Properties.VariableNames = {'case','intercept','slope','R^2','MSE','RMSE'};

t2.Properties.VariableNames = t1.Properties.VariableNames;

t3.Properties.VariableNames = t1.Properties.VariableNames;

[t1;t2;t3]

ans = 3×6 table

case intercept slope R^2 MSE RMSE _____ ______________ ______ ________ __________ ______ {'a'} {'NaN' } 1.6265 -0.40193 1.4639e+06 1209.9 {'b'} {'NaN' } 1.6265 0.85844 1.4639e+06 1209.9 {'c'} {[2.2105e+03]} 0.4832 0.076757 9.6406e+05 981.87

% (2) matlab function for linear regression

mdl = fitlm(x1,y1) % <-- (d) slope, intercept and R^2 automatically calculated

mdl =

Linear regression model: y ~ 1 + x1 Estimated Coefficients: Estimate SE tStat pValue ________ _______ ______ ________ (Intercept) 2210.5 767.46 2.8802 0.010878 x1 0.4832 0.41896 1.1533 0.26571 Number of observations: 18, Error degrees of freedom: 16 Root Mean Squared Error: 1.04e+03 R-squared: 0.0768, Adjusted R-Squared: 0.0191 F-statistic vs. constant model: 1.33, p-value = 0.266

figure

plotAdded(mdl)

In summary:

Yes, in my case, with the original definition of R^2, I got R^2 = -0.4 because the variance around the least-squares fit line is greater than the variance around the mean. This is clear, but I did not know that it was related to the lack of intercept, and I did not understand why Ezyfit was giving a different result.
I did not know that R^2 loses its meaning if there is not the intercept. I understood that, for that case, we need to use a slithgly different definition for R^2, i.e.

Rsq2 = 1 - sum((y1 - yfit1).^2)/sum((y1           ).^2);  % <-- (b) only slope, with the corrected equation for R^2

However, I still do not understand why the mean(y) cannot be subtracted for that case. Something is explained in Removal of statistically significant intercept term increases 𝑅2 in linear model, but I did not grab the idea yet.
I understood that, for my case, and by using the corrected equation for R^2, i.e. that one where the mean(y) is not subtracted, I would get "y = noise" as default model. However, if I do not subtract mean(y) in the equation of R^2, it would be equivalent to say that mean(y)=0. But mean(y) is around 3000, as you have showed in your plot. I do not understand this part well.
It might be that Ezyfit calculates R^2 = 0.84396 without subtracting mean(y) in the equation of R^2, even though, with that small modification, we get R^2 = 0.85844. A light difference.
Cases (c) and (d) produce the same output, i.e. same slope, intercept and R^2. That means that the embedded function "fitlm" considers - obviously - the intercept in the calculation of the linear regression. OK, good to know.
Besides R^2, I could use other statistical metrics, as MSE and RMSE, since they do not depend on "mean(y)", but only on the variance around the least-sqaures fit line. Yes, by adding the intercept in the fitting/linear regression, the MS Error and its square root decrease, which is a good sign. However, how to interpet them...?

请先登录，再进行评论。

Why is the calculated Rsquare different between the embedded fit function and the EzyFit function (from File Exchange)?

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

采纳的回答

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

更多回答（0 个）

另请参阅

类别

标签

Community Treasure Hunt

Why is the calculated Rsquare different between the embedded fit function and the EzyFit function (from File Exchange)?

0 个评论 显示 -2更早的评论隐藏 -2更早的评论

采纳的回答

1 个评论 显示 -1更早的评论隐藏 -1更早的评论

更多回答（0 个）

另请参阅

类别

标签

Community Treasure Hunt

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

1 个评论
显示 -1更早的评论隐藏 -1更早的评论