Removing leverage = infinite loop?
I have a nonlinear model and I used plotDiagnostics(model,'leverage') to find observations with high leverage.
However, after removing the most extreme ones (about 6 points), I noticed that the threshold of the new leverage plot changed, and new points now fall beyond this new threshold. If I were to remove all points lying above the threshold (about 20 of them), it is very likely that some points would exceed the next threshold. Will this whole process of removing high-leverage points and re-plotting the leverage plot eventually become an endless loop?
Accepted Answer
Star Strider
2017-5-29
It would not be ‘endless’, since you would eventually run out of points to exclude!
I would stop with the first iteration. That identifies the most extreme outliers.
Actually, unless I had reason to believe that the outliers were due to some methodological problem in my data collection, which would certainly be a reason to examine them and possibly exclude them, I would keep all of them. This is an argument in favour of defining a protocol first, with exclusion criteria for data, such as sick animals, thermal noise in measuring equipment, equipment nonlinearities such as amplifier saturation or ‘railing’, and similar problems. Ideally, you will always have a valid reason for excluding some data, other than that they are simply ‘outliers’.
16 Comments
wesleynotwise
2017-5-29
Yes, it would not be endless, but my workload definitely would be. You are right, and I also think that I should not remove all of them in the first place, as leverage indicates the presence of extreme x values in an observation, but the observation is not necessarily an outlier.
Modelling a mass of data is like asking someone to find a pattern in a bundle of tangled ropes.
wesleynotwise
2017-5-29
Are there any other methods to improve the model, other than removing the potential outliers?
Star Strider
2017-5-29
Removing potential outliers does not improve the model. It improves the fit of the model to your data. The latter does not imply the former.
In other words, if your model accurately describes the process that created your data, then: no, since removing the outliers simply improves the statistics. If other models (with the same or different numbers of parameters) describe it better, the one with the least variance of the residuals may be correct. This becomes a problem in determining the ‘best’ model. This is not trivial, especially with models with different numbers of parameters, since more parameters may produce lower residual variances without accurately describing the process that created your data. A polynomial with a sufficient number of parameters could be an excellent fit and explain nothing at all about your data.
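To illustrate that last point with a quick simulated example (made-up numbers, not your data):
x = linspace(0, 10, 25).';
y = 3*x + 5 + 4*randn(size(x));                 % true process is linear, plus noise
p1  = polyfit(x, y, 1);                         % the sensible model
p10 = polyfit(x, y, 10);                        % over-parameterised polynomial (polyfit may warn it is badly conditioned)
ssr1  = sum((y - polyval(p1,  x)).^2);          % residual sum of squares, linear fit
ssr10 = sum((y - polyval(p10, x)).^2);          % smaller, yet the polynomial explains nothing about the process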
The variance from the fitted model is noise (from myriad sources) that you simply have to live with, if there was no way for you to reduce the noise when you acquired your data.
You have a very large data set (nearly 600 if I remember correctly), so expect a large variance unless you are extremely lucky and your model perfectly describes the process that created your data. Your statistics on the fit are excellent, so that may be the best you can hope for.
wesleynotwise
2017-5-29
Edited: wesleynotwise
2017-5-29
Thanks for your reply. You've been very helpful.
So far, I have only developed one model, which takes all the important parameters into account. Are you suggesting developing another model (or a few more models) using different numbers of parameters? Yes, my data set is quite large, and I did expect a certain degree of noise, but the adjusted R-squared is a pathetic 40%, which is simply not a good sign.
Can I check with you whether the fitnlm function is programmed in such a way that it standardises/normalises/centres the data itself before the analysis?
Star Strider
2017-5-30
My pleasure.
There is no need to develop any alternative models if your current model accurately describes the process that created your data. That is the best you can possibly do. With your large amount of data, an R² of 0.4 is probably not a problem.
Scaling and centering depends on the nature of your independent variable.
An alternative that might be preferable is weighting your data according to whatever they are and what they represent. You would need to research weighting, and decide based on what you know about your data. For example if you acquired your data with a particular instrument and if you know the instrument measurement error in various ranges (by calibrating it), the weighting would be inversely proportional to the measurement error (specifically, error variance). You could interpolate (using the interp1 function) between measured error variance values to get the estimated variances for the entire range of your data. That could significantly improve your model fit to your data, and actually improve the accuracy of the parameter estimates.
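A rough sketch of the interpolation idea (the calibration numbers, and the names x, y, modelFun and start, are placeholders for your own data and model):
calibX   = [0 25 50 75 100];                     % points where the instrument error was characterised
calibVar = [0.8 0.5 0.4 0.6 1.2];                % error variance measured at each calibration point
% Estimate the error variance at every observed x, then weight inversely.
estVar = interp1(calibX, calibVar, x, 'linear', 'extrap');
w      = 1 ./ estVar;
wnlm   = fitnlm(x, y, modelFun, start, 'Weights', w);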
wesleynotwise
2017-5-30
Edited: wesleynotwise
2017-5-30
I know that scaling and centring depend on the nature of the data (the magnitude, right? For example, one variable can have values to four decimal places while another is in the order of hundreds). I just want to know whether the fitnlm function will auto-detect such a difference in the input and cleverly do the scaling/centring before the analysis?
I think the idea of weighting is brilliant! I am not sure what it looks like, but I assume it is like a coefficient (weight) attached to each piece of data before the regression, so data carrying a higher weight will have a more significant influence on the regression. I will need to do a bit more reading.
Also, I found that one or two groups of my data show a hint of heteroskedasticity in the residual plot. Strangely enough, this suggests that important parameters have been omitted, which is very unlikely in my case. I think I might need to transform my variables into different forms, e.g. square root or log, and see how it goes?
Star Strider
2017-5-30
To the best of my knowledge, none of the MATLAB regression functions (other than polyfit) will automatically sense and warn about a need to centre and scale your data. The easiest way to scale them is to express them in different units (such as kilometres rather than metres). The scaling is usually done to prevent loss of precision that could create matrix (or Jacobian) loss of rank. Centring them could be as straightforward as starting one or both of the independent and dependent variables at zero, rather than, for example, at a date number for the current date. This largely depends on your model and the assumptions underlying it.
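For example, something along these lines (t and d are placeholder names for, say, a date-number predictor and a distance measured in metres):
tCentred = t - t(1);     % centre: start the time axis at zero instead of at a serial date number
dScaled  = d / 1000;     % scale: express metres as kilometres
% The fitted coefficients then have to be interpreted in the shifted/rescaled units.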
Thank you! I routinely use inverse-variance weighting so the more accurate data have higher weights, and the less accurate data have lower weights.
As I interpret it, the heteroscedasticity could be an argument for weighting. I would be extremely hesitant to do data transformations, since they also transform the errors. (This is the reason that, for example, a nonlinear exponential fit will always provide more accurate parameter estimates than a linear log-transformed fit.)
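A quick simulated comparison of the two approaches (made-up data, not yours):
a = 2;  b = 0.5;
x = linspace(0, 5, 100).';
y = a*exp(b*x) + 0.5*randn(size(x));                   % additive noise on the original scale
nlm  = fitnlm(x, y, @(p,x) p(1)*exp(p(2)*x), [1 1]);   % direct nonlinear fit of y = a*exp(b*x)
plog = polyfit(x, log(y), 1);                          % linear fit to the log-transformed data
aLog = exp(plog(2));  bLog = plog(1);                  % back-transformed estimates
% The log transform also transforms the errors, so aLog and bLog are generally
% further from the true values than the fitnlm estimates.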
wesleynotwise
2017-6-1
Hello!
I have to admit that I am fairly slow... I just realised that one can run weighted nonlinear regression in MATLAB using
wnlm = fitnlm(x,y,modelFun,start,'Weight',w) %facepalm!
% Also I assume you're talking about this in your comment?
Since I do not have sufficient information about the instrument measurement error in my data set, I am going to do the weighting based on the quality of the data (e.g. 3 = good, 2 = normal, 1 = bad). I hope this last resort will work a miracle.
Star Strider
2017-6-1
Hi!
I would segment the data over ranges of the independent variable and generate variance estimates for each range, specifying them at the midpoint of each range. I would then interpolate to estimate the inverse-variance weight for each point. You have nearly 600 observations if I remember correctly, so segmenting your data into 30- to 60-point groups and calculating the variances to use for the interpolation would be my approach. Visually assessing the quality of the data and assigning weights on that basis is not the approach I would recommend.
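Something like this sketch (x1 and y are placeholder names for one predictor and the response, assumed to be column vectors; the bin count is arbitrary):
nBins  = 10;                                        % ~600 observations -> ~60 per bin
edges  = linspace(min(x1), max(x1), nBins+1);
binIdx = discretize(x1, edges);                     % which bin each observation falls in
mids   = (edges(1:end-1) + edges(2:end)) / 2;       % bin midpoints
binVar = accumarray(binIdx, y, [nBins 1], @var);    % variance of y within each bin
estVar = interp1(mids, binVar, x1, 'linear', 'extrap');
w      = 1 ./ estVar;                               % inverse-variance weights for the fit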
I don’t use fitnlm as much as I use nlinfit and lsqcurvefit, although I remember that nlinfit has a weighting option, and with a bit of coding (that I would have to look up since I’ve not used it in a while), weighting with lsqcurvefit is also possible.
wesleynotwise
2017-6-1
Edited: wesleynotwise
2017-6-1
Thanks for the suggestion. Sorry to bother you further, but I do not know how this can be done.
In total, there are 4 numerical variables (x1 ... x4) and 4 categorical variables (x5 ... x8) in the model. You are right, I have nearly 600 observations, and they were obtained from different sources, i.e. not from a single laboratory. Are you suggesting that I should:
- Firstly, arrange the data in a certain order, e.g. based on variable x1 in ascending order?
- Secondly, segment them into a few groups, each containing about 30 - 60 data points.
- Thirdly, 'generate variance estimates for each range'. Do the 'estimates' refer to the real measured y in the data set, not the estimated/predicted y from the model? Here is the complication: as the instrument measurement error is not available in this case, if the variance is calculated from the measured or estimated/predicted y using something like V = var(y(1:60)), the variance values will depend on how the data are arranged, right?
- Lastly, do the regression with the weights.
I know the visual assessment is highly biased, and I'm hesitant to do it too. Out of desperation, I see it as the only solution.
Before I started the modelling, I came across your reply to a post saying that there isn't any difference between fitnlm and nlinfit, but that the former is slightly easier to use (the important keywords for a newbie like me). That is why fitnlm is used. Given that I am now slightly better off than where I was, if the inverse-variance method only works for nlinfit, I can definitely give it a try.
Sidetrack: if these two functions are essentially the same, why does MATLAB keep both of them in the Statistics Toolbox?
Star Strider
2017-6-1
My pleasure.
Since the data were from different sources (by definition different instrumentation, and new information), I would first consider weighting based on the variance of the sources, or (probably preferably) consider the sources themselves as separate predictors. (I am guessing here. This is getting a bit complicated for me to follow.)
When you have decided how to deal with those, I would then see if the weighting scheme I described earlier is necessary. Weighting a dependent variable is usually with respect to an independent variable, so if you have more than one independent (predictor) variable, this becomes very complicated very quickly, and probably beyond my expertise.
You probably need to discuss this with a statistician in your institution who can guide you through these complexities. I will of course help as I can.
wesleynotwise
2017-6-1
Quick reply :) Sorry I didn't make it clear in my comment: different sources here means different publications. The data were mainly extracted from journal papers. That is why I was thinking of ranking them based on their quality (of course, based on a set of criteria, but still subject to potential bias).
The background: the dependent variable (y) is affected by several independent variables, but one of them, let say x1, is the research focus. In general, y increases as x1 is increased. The magnitude of the increase depends on other independent variables (x2 ... xn).
So when it comes to segmenting/grouping the data, I see two problems; perhaps I need your input here:
- Finding the variance within each individual source (publication) may not be right, because different x1 values will result in different y values.
- However, I think I can first group the data based on x1, e.g. data where x1 ranges from 20 - 40 as group one, 41 - 60 as group two, and let's say five main groups are created. Then, within each of these groups, another three subgroups are created based on another variable, let's say x2. No sub-subgroups are created, as this would make the data too thin. So, in total, I will have 5 main x 3 sub = 15 groups, something like the sketch below. I think this is the most appropriate level at which to find the variance for these 15 groups?
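Perhaps something like this (x1, x2 and y are placeholder names; the bin edges are just examples):
e1  = linspace(min(x1), max(x1), 6);     % five x1 ranges (e.g. 20-40, 41-60, ...)
e2  = linspace(min(x2), max(x2), 4);     % three x2 subgroups
grp = findgroups(discretize(x1, e1), discretize(x2, e2));   % up to 5 x 3 = 15 groups
groupVar = splitapply(@var, y, grp);     % variance of y within each group
w        = 1 ./ groupVar(grp);           % inverse-variance weight for each observation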
I have spoken with a statistician before; it did not prove to be of any use.
Star Strider
2017-6-1
No worries.
You are doing a meta-analysis. There are specific statistical tools to do such studies, none of which I am familiar with.
I would agree with (2.). However, I would discuss this with your statistician. At the very least, consult papers and textbooks on meta-analysis techniques.
wesleynotwise
2017-6-1
Thanks for your advice. I do think Method (2) works. I will see if there are any specific techniques. So once the variance is determined, then weight = 1/variance, and it will be used in this function, I assume?
[beta,R,J,CovB] = nlinfit(X,y,@hougen,beta0,'Weights',weights)
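I assume the whole chain would then look something like this (groupVar and grp as in my grouping sketch above; X, y, modelFun and beta0 are placeholders)?
weights = 1 ./ groupVar(grp);                                   % weight = 1/variance
[beta, R, J, CovB] = nlinfit(X, y, modelFun, beta0, 'Weights', weights);
ci = nlparci(beta, R, 'covar', CovB);                           % 95% confidence intervals for beta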
More Answers (0)