Removing leverage = infinite loop?
I have a nonlinear model and I used plotDiagnostics(model,'leverage') to find observations with high leverage.
However, after removing the most extreme ones (about 6 points), I noticed that the threshold of the new leverage plot changed, and new points now fall beyond this new threshold. If I were to remove all points lying above the threshold (about 20 of them), it is very likely that some points would exceed the next threshold. Will this whole process of removing high-leverage points and re-plotting the leverage plot eventually become an endless loop?
Accepted Answer
Star Strider
2017-5-29
It would not be ‘endless’, since you would eventually run out of points to exclude!
I would stop with the first iteration. That identifies the most extreme outliers.
Actually, unless I had reason to believe that the outliers were due to some methodological problem in my data collection, which would certainly be a reason to examine them and possibly exclude them, I would keep all of them. This is an argument in favour of defining a protocol first, with exclusion criteria for data, such as sick animals, thermal noise in measuring equipment, equipment nonlinearities such as amplifier saturation or ‘railing’, and similar problems. Ideally, you will always have a valid reason for excluding some data, other than that they are simply ‘outliers’.
16 Comments
wesleynotwise
2017-5-29
Yes, it would not be endless, but my workload definitely would be. You are right, and I also think that I should not remove all of them in the first place, as leverage indicates the presence of extreme x values in an observation, but the observation is not necessarily an outlier.
Modelling a mass of data is like asking someone to find a pattern in a bundle of tangled ropes.
wesleynotwise
2017-5-29
Are there any other methods to improve the model, other than removing the potential outliers?
Star Strider
2017-5-29
Removing potential outliers does not improve the model. It improves the fit of the model to your data. The latter does not imply the former.
In other words, if your model accurately describes the process that created your data, then: no, since removing the outliers simply improves the statistics. If other models (with the same or different numbers of parameters) describe it better, the one with the least variance of the residuals may be correct. This becomes a problem in determining the ‘best’ model. This is not trivial, especially with models with different numbers of parameters, since more parameters may produce lower residual variances without accurately describing the process that created your data. A polynomial with a sufficient number of parameters could be an excellent fit and explain nothing at all about your data.
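To illustrate that last point with a quick simulated example (made-up numbers, not your data):
x = linspace(0, 10, 25).';
y = 3*x + 5 + 4*randn(size(x));                 % true process is linear, plus noise
p1  = polyfit(x, y, 1);                         % the sensible model
p10 = polyfit(x, y, 10);                        % over-parameterised polynomial (polyfit may warn it is badly conditioned)
ssr1  = sum((y - polyval(p1,  x)).^2);          % residual sum of squares, linear fit
ssr10 = sum((y - polyval(p10, x)).^2);          % smaller, yet the polynomial explains nothing about the process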
The variance from the fitted model is noise (from myriad sources) that you simply have to live with, if there was no way for you to reduce the noise when you acquired your data.
You have a very large data set (nearly 600 if I remember correctly), so expect a large variance unless you are extremely lucky and your model perfectly describes the process that created your data. Your statistics on the fit are excellent, so that may be the best you can hope for.
wesleynotwise
2017-5-29
Edited: wesleynotwise
2017-5-29
Thanks for your reply. You've been very helpful.
So far, I have only developed one model, which takes all the important parameters into account. Are you suggesting developing another model (or a few more models) using different numbers of parameters? Yes, my data set is quite large, and I did expect a certain degree of noise, but the adjusted R-squared is a pathetic 40%, which is simply not a good sign.
Can I check with you whether the fitnlm function is programmed in such a way that it standardises/normalises/centres the data itself before the analysis?
Star Strider
2017-5-30
My pleasure.
There is no need to develop any alternative models if your current model accurately describes the process that created your data. That is the best you can possibly do. With your large amount of data, an R² of 0.4 is probably not a problem.
Scaling and centering depends on the nature of your independent variable.
An alternative that might be preferable is weighting your data according to whatever they are and what they represent. You would need to research weighting, and decide based on what you know about your data. For example if you acquired your data with a particular instrument and if you know the instrument measurement error in various ranges (by calibrating it), the weighting would be inversely proportional to the measurement error (specifically, error variance). You could interpolate (using the interp1 function) between measured error variance values to get the estimated variances for the entire range of your data. That could significantly improve your model fit to your data, and actually improve the accuracy of the parameter estimates.
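A rough sketch of the interpolation idea (the calibration numbers, and the names x, y, modelFun and start, are placeholders for your own data and model):
calibX   = [0 25 50 75 100];                     % points where the instrument error was characterised
calibVar = [0.8 0.5 0.4 0.6 1.2];                % error variance measured at each calibration point
% Estimate the error variance at every observed x, then weight inversely.
estVar = interp1(calibX, calibVar, x, 'linear', 'extrap');
w      = 1 ./ estVar;
wnlm   = fitnlm(x, y, modelFun, start, 'Weights', w);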
wesleynotwise
2017-5-30
Edited: wesleynotwise
2017-5-30
I know that scaling and centring depend on the nature of the data (the magnitude, right? For example, one variable can have values to four decimal places while another is in the order of hundreds). I just want to know whether the fitnlm function will auto-detect such a difference in the input and cleverly do the scaling/centring before the analysis?
I think the idea of weighting is brilliant! I am not sure what it looks like, but I assume it is like a coefficient (weight) attached to each piece of data before the regression, so data carrying a higher weight will have a more significant influence on the regression. I will need to do a bit more reading.
Also, I found that one or two groups of my data show a hint of heteroskedasticity in the residual plot. Strangely enough, this suggests that important parameters have been omitted, which is very unlikely in my case. I think I might need to transform my variables into different forms, e.g. square root or log, and see how it goes?
Star Strider
2017-5-30
To the best of my knowledge, none of the MATLAB regression functions (other than polyfit) will automatically sense and warn about a need to centre and scale your data. The easiest way to scale them is to express them in different units (such as kilometres rather than metres). The scaling is usually done to prevent loss of precision that could create matrix (or Jacobian) loss of rank. Centring them could be as straightforward as starting one or both of the independent and dependent variables at zero, rather than, for example, at a date number for the current date. This largely depends on your model and the assumptions underlying it.
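For example, something along these lines (t and d are placeholder names for, say, a date-number predictor and a distance measured in metres):
tCentred = t - t(1);     % centre: start the time axis at zero instead of at a serial date number
dScaled  = d / 1000;     % scale: express metres as kilometres
% The fitted coefficients then have to be interpreted in the shifted/rescaled units.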
Thank you! I routinely use inverse-variance weighting so the more accurate data have higher weights, and the less accurate data have lower weights.
As I interpret it, the heteroscedasticity could be an argument for weighting. I would be extremely hesitant to do data transformations, since they also transform the errors. (This is the reason that, for example, a nonlinear exponential fit will always provide more accurate parameter estimates than a linear log-transformed fit.)
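A quick simulated comparison of the two approaches (made-up data, not yours):
a = 2;  b = 0.5;
x = linspace(0, 5, 100).';
y = a*exp(b*x) + 0.5*randn(size(x));                   % additive noise on the original scale
nlm  = fitnlm(x, y, @(p,x) p(1)*exp(p(2)*x), [1 1]);   % direct nonlinear fit of y = a*exp(b*x)
plog = polyfit(x, log(y), 1);                          % linear fit to the log-transformed data
aLog = exp(plog(2));  bLog = plog(1);                  % back-transformed estimates
% The log transform also transforms the errors, so aLog and bLog are generally
% further from the true values than the fitnlm estimates.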
wesleynotwise
2017-6-1
Hello!
I have to admit that I am fairly slow... I just realised that one can run weighted nonlinear regression in MATLAB using
wnlm = fitnlm(x,y,modelFun,start,'Weight',w) %facepalm!
% Also I assume you're talking about this in your comment?
Since I do not have sufficient information about the instrument measurement error in my data set, I am going to do the weighting based on the quality of the data (e.g. 3 = good, 2 = normal, 1 = bad). I hope this last resort will work a miracle.
Star Strider
2017-6-1
Hi!
I would segment the data over ranges of the independent variable and generate variance estimates for each range, specifying them at the midpoint of each range. I would then interpolate to estimate the inverse-variance weight for each point. You have nearly 600 observations if I remember correctly, so segmenting your data into 30- to 60-point groups and calculating the variances to use for the interpolation would be my approach. Visually assessing the quality of the data and assigning weights on that basis is not the approach I would recommend.
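Something like this sketch (x1 and y are placeholder names for one predictor and the response, assumed to be column vectors; the bin count is arbitrary):
nBins  = 10;                                        % ~600 observations -> ~60 per bin
edges  = linspace(min(x1), max(x1), nBins+1);
binIdx = discretize(x1, edges);                     % which bin each observation falls in
mids   = (edges(1:end-1) + edges(2:end)) / 2;       % bin midpoints
binVar = accumarray(binIdx, y, [nBins 1], @var);    % variance of y within each bin
estVar = interp1(mids, binVar, x1, 'linear', 'extrap');
w      = 1 ./ estVar;                               % inverse-variance weights for the fit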
I don’t use fitnlm as much as I use nlinfit and lsqcurvefit, although I remember that nlinfit has a weighting option, and with a bit of coding (that I would have to look up since I’ve not used it in a while), weighting with lsqcurvefit is also possible.
wesleynotwise
2017-6-1
Edited: wesleynotwise
2017-6-1
Thanks for the suggestion. Sorry to bother you further, but I do not know how this can be done.
In total, there are 4 numerical variables (x1 ... x4) and 4 categorical variables (x5 ... x8) in the model. You are right, I have nearly 600 observations, and they were obtained from different sources, i.e. not from a single laboratory. Are you suggesting that I should:
- Firstly, arrange the data in a certain order, e.g. based on variable x1 in ascending order?
- Secondly, segment them into a few groups, each containing about 30 - 60 data points.
- Thirdly, 'generate variance estimates for each range'. Do the 'estimates' refer to the real measured y in the data set, not the estimated/predicted y from the model? Here is the complication: as the instrument measurement error is not available in this case, if the variance is calculated from the measured or estimated/predicted y using something like V = var(y(1:60)), the variance values will depend on how the data are arranged, right?
- Lastly, do the regression with the weights.
I know the visual assessment is highly biased, and I'm hesitant to do it too. Out of desperation, I see it as the only solution.
Before I started the modelling, I came across your reply to a post saying that there isn't any difference between fitnlm and nlinfit, but that the former is slightly easier to use (the important keywords for a newbie like me). That is why fitnlm is used. Given that I am now slightly better off than where I was, if the inverse-variance method only works for nlinfit, I can definitely give it a try.
Sidetrack: if these two functions are essentially the same, why does MATLAB keep both of them in the Statistics Toolbox?
Star Strider
2017-6-1
My pleasure.
Since the data were from different sources (by definition different instrumentation, and new information), I would first consider weighting based on the variance of the sources, or (probably preferably) consider the sources themselves as separate predictors. (I am guessing here. This is getting a bit complicated for me to follow.)
When you have decided how to deal with those, I would then see if the weighting scheme I described earlier is necessary. Weighting a dependent variable is usually with respect to an independent variable, so if you have more than one independent (predictor) variable, this becomes very complicated very quickly, and probably beyond my expertise.
You probably need to discuss this with a statistician in your institution who can guide you through these complexities. I will of course help as I can.
wesleynotwise
2017-6-1
Quick reply :) Sorry I didn't make it clear in my comment: different sources here means different publications. The data were mainly extracted from journal papers. That is why I was thinking of ranking them based on their quality (of course, based on a set of criteria, but still subject to potential bias).
The background: the dependent variable (y) is affected by several independent variables, but one of them, let say x1, is the research focus. In general, y increases as x1 is increased. The magnitude of the increase depends on other independent variables (x2 ... xn).
So when it comes to segmenting/grouping the data, I see two problems; perhaps I need your input here:
- Finding the variance within each individual source (publication) may not be right, because different x1 values will result in different y values.
- However, I think I can first group the data based on x1, e.g. data where x1 ranges from 20 - 40 as group one, 41 - 60 as group two, and let's say five main groups are created. Then, within each of these groups, another three subgroups are created based on another variable, let's say x2. No sub-subgroups are created, as this would make the data too thin. So, in total, I will have 5 main x 3 sub = 15 groups, something like the sketch below. I think this is the most appropriate level at which to find the variance for these 15 groups?
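Perhaps something like this (x1, x2 and y are placeholder names; the bin edges are just examples):
e1  = linspace(min(x1), max(x1), 6);     % five x1 ranges (e.g. 20-40, 41-60, ...)
e2  = linspace(min(x2), max(x2), 4);     % three x2 subgroups
grp = findgroups(discretize(x1, e1), discretize(x2, e2));   % up to 5 x 3 = 15 groups
groupVar = splitapply(@var, y, grp);     % variance of y within each group
w        = 1 ./ groupVar(grp);           % inverse-variance weight for each observation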
I have spoken with a statistician before; it did not prove to be of any use.
Star Strider
2017-6-1
No worries.
You are doing a meta-analysis. There are specific statistical tools to do such studies, none of which I am familiar with.
I would agree with (2.). However, I would discuss this with your statistician. At the very least, consult papers and textbooks on meta-analysis techniques.
wesleynotwise
2017-6-1
Thanks for your advice. I do think Method (2) works. I will see if there are any specific techniques. So once the variance is determined, then weight = 1/variance, and it will be used in this function, I assume?
[beta,R,J,CovB] = nlinfit(X,y,@hougen,beta0,'Weights',weights)
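I assume the whole chain would then look something like this (groupVar and grp as in my grouping sketch above; X, y, modelFun and beta0 are placeholders)?
weights = 1 ./ groupVar(grp);                                   % weight = 1/variance
[beta, R, J, CovB] = nlinfit(X, y, modelFun, beta0, 'Weights', weights);
ci = nlparci(beta, R, 'covar', CovB);                           % 95% confidence intervals for beta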
More Answers (0)