How reliable is fitdist with regard to the gamma-distribution?
3 次查看(过去 30 天)
显示 更早的评论
Hello everyone,
I stumbled across the following issue concerning the reliability of the fitdist-function with regard to the gamma-distribution and could not find anything in the answers. The code-snippet below can be used by everyone for testing purposes. I generated a virtual dataset Y1 from a gamma-distribution with parameters p and b. The variable scaling simulates the virtual amount of measurements/observations. In order to test the dataset Y1 for gamma-distrubtion using the chi-squared-test, the procedure highlighted here was used.
The issue is that the fitted distribution pd has a very different shape than the original (gamma-distributed) dataset. Consequently, this yields to a rejection of null-hypothesis of the chi-squared test (h=1), although the null-hypothesis (Y1 is gamma-distributed) is correct.
Is that a known issue, or does someone has a hint on how to challenge this problem? As far as I know, fitdist utilizes the Maximum Likelihood estimator. Is there any option to modify the estimator or have an insight on how the parameters of the gamma-distribution are guessed/optimized?
X= 1:1:20;
p = 1;
b = 9;
scaling = 1000;
Y1 = scaling*gampdf(X,p,b);
pd = fitdist(X','Gamma','Frequency',round(y1',0));
expCounts = scaling * pdf(pd,X');
[h,pc,st] = chi2gof(X','Ctrs',X',...
'Frequency',y1', ...
'Expected',expCounts,...
'NParams',2, ...
'Alpha', 0.05);
0 个评论
采纳的回答
Jeff Miller
2020-8-21
If your overall goal is to check on the accuracy of fitdist, then I think your method is flawed because the data you construct (i.e., a set of X's with frequencies given by y1) do not actually represent the true distribution very well. Consider this example:
% Examine the true distribution
p = 0.7457;
b = 9.129764;
pd = makedist('Gamma', 'a', p, 'b', b);
disp([' True mean & variance are ' num2str(mean(pd)) ' and ' num2str(var(pd))])
% construct the data that supposedly represent that distribution
lower = 1;
upper = 20;
X= lower:1:upper;
scaling = 1000;
y1 = scaling*gampdf(X,p,b);
constructed_mean = sum(X.*y1)/sum(y1);
constructed_EXsqr = sum(X.^2.*y1)/sum(y1);
constructed_var = constructed_EXsqr - constructed_mean^2;
disp(['Constructed mean & variance are ' num2str(constructed_mean) ' and ' num2str(constructed_var)])
Running this code gives
True mean & variance are 6.8081 and 62.156
Constructed mean & variance are 6.0679 and 23.9548
Since the data in y1 have the wrong mean and variance, you can't really expect fitdist to return the true parameter values as estimates.
If you actually want to check fitdist, then a better way to generate data to represent the distribution is to use the cdf, as John said. For example:
% define equally spaced points along the cdf
cdf_grain = 0.01;
cdf_points = cdf_grain/2:cdf_grain:(1-cdf_grain/2);
% define true distribution
p = 0.7457;
b = 9.129764;
pd = makedist('Gamma', 'a', p, 'b', b);
disp([' True parameters ' num2str(pd.a) ' and ' num2str(pd.b)])
% look up the data points
X = icdf(pd,cdf_points);
% use fitdist with those data points
pd = fitdist(X','Gamma');
disp(['Estimate parameters ' num2str(pd.a) ' and ' num2str(pd.b)])
Here you will find that the estimated parameters match the true parameters rather well, and the match improves as cdf_grain is reduced.
(Check the mean and variance of these generate data points if you want; all are equally likely. They will not be perfect, but they will be closer to the true distribution than those obtained with your method.)
FWIW, I would not use chi2gof at all in this situation. As I see it, the only issue is how well the estimated parameters match the known true values.
2 个评论
Jeff Miller
2020-8-21
With that overall goal, you might also find it illuminating to:
- generate random datasets (of the size you have) from each of your candidate distributions.
- fit each dataset with all of the candidate distributions, and then
- see how often the known true underlying distribution provides a better fit to the dataset than any of the candidate distributions.
When I have done this, I have always been disappointed in the results. With (say) 4-5 similar candidate distributions, it is rarely the case that the true distribution provides the best fit to more than about half of the datasets (at least with datasets of the size I work with). Bottom line: identifying the true underlying distribution family is really hard, and in some cases it may be an unrealistic goal.
更多回答(1 个)
Jeff Miller
2020-8-21
I think fitdist is doing the best it can, but you gave it a non-gamma sample to fit because you truncated the X range at 20. Note that y1 has lots of predicted scores at 18, 19, 20 but then stops suddenly, which would not be expected in a true random sample from this gamma distribution. Try X=1:1:50. Or, if you really want to consider a truncated distribution, look at MATLAB's truncate command for truncating probability distributions.
4 个评论
John D'Errico
2020-8-21
编辑:John D'Errico
2020-8-21
I think part of the problem is you are not truly sending in true gamma samples into fitdist.
You are scaling the values of a pdf. So even the re-scaling is not truly correct, because to be correct, you would need to scale those effective histogram counts by the number of samples that would have fallen into a histogram bin.
X = 1:1:20;
p = 1;
b = 9;
scaling = 1000;
y1 = gamcdf(x + 0.5,p,b) - gamcdf(x - 0.5,p,b);
y1 = round(scaling*y1);
Even at that, this is still effectively censored data. If you were generating gamma data, and then binning it as if from a histogram, then:
>> gamcdf(1,1,9)
ans =
0.10516068318563
>> gamcdf(20,1,9)
ans =
0.891631976778104
we see that you have implicitly discarded all samples representing roughly 20% of your data. So you were effectively using a truncated distribution. I suppose censoring is another way to look at it.
So a better result might be gained if you did this:
y1 = gamcdf(x + 0.5,p,b) - gamcdf(x - 0.5,p,b);
y1(1) = gamcdf(x(1) + .5,p,b) - 0;
y1(end) = 1 - gamcdf(x(end) - .5,p,b);
y1 = round(scaling*y1);
This is NOT the same thing as just using pdf to scale your data into frequencies! A pdf does NOT return a probability!
Jeff's suggestion of extending the range will help some I think, but even there, use of a cdf is correct, not using the pdf.
另请参阅
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!