How reliable is fitdist with regard to the gamma-distribution?

Question

Markus Franke 2020-8-20

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/582413-how-reliable-is-fitdist-with-regard-to-the-gamma-distribution

评论： Jeff Miller 2020-8-21

Hello everyone,

I stumbled across the following issue concerning the reliability of the fitdist-function with regard to the gamma-distribution and could not find anything in the answers. The code-snippet below can be used by everyone for testing purposes. I generated a virtual dataset Y1 from a gamma-distribution with parameters p and b. The variable scaling simulates the virtual amount of measurements/observations. In order to test the dataset Y1 for gamma-distrubtion using the chi-squared-test, the procedure highlighted here was used.

The issue is that the fitted distribution pd has a very different shape than the original (gamma-distributed) dataset. Consequently, this yields to a rejection of null-hypothesis of the chi-squared test (h=1), although the null-hypothesis (Y1 is gamma-distributed) is correct.

Is that a known issue, or does someone has a hint on how to challenge this problem? As far as I know, fitdist utilizes the Maximum Likelihood estimator. Is there any option to modify the estimator or have an insight on how the parameters of the gamma-distribution are guessed/optimized?

X= 1:1:20;
p = 1;
b = 9;
scaling = 1000;
Y1 = scaling*gampdf(X,p,b);
pd = fitdist(X','Gamma','Frequency',round(y1',0));
expCounts = scaling * pdf(pd,X');
[h,pc,st] = chi2gof(X','Ctrs',X',...
                        'Frequency',y1', ...
                        'Expected',expCounts,...
                        'NParams',2, ...
                        'Alpha', 0.05);

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Jeff Miller 2020-8-21

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/582413-how-reliable-is-fitdist-with-regard-to-the-gamma-distribution#answer_483266

在 MATLAB Online 中打开

If your overall goal is to check on the accuracy of fitdist, then I think your method is flawed because the data you construct (i.e., a set of X's with frequencies given by y1) do not actually represent the true distribution very well. Consider this example:

% Examine the true distribution
p = 0.7457;
b = 9.129764;
pd = makedist('Gamma', 'a', p, 'b', b);
disp(['       True mean & variance are ' num2str(mean(pd)) ' and ' num2str(var(pd))])
% construct the data that supposedly represent that distribution
lower = 1;
upper = 20;
X= lower:1:upper;
scaling = 1000;
y1 = scaling*gampdf(X,p,b);
constructed_mean = sum(X.*y1)/sum(y1);
constructed_EXsqr = sum(X.^2.*y1)/sum(y1);
constructed_var = constructed_EXsqr - constructed_mean^2;
disp(['Constructed mean & variance are ' num2str(constructed_mean) ' and ' num2str(constructed_var)])

Running this code gives

       True mean & variance are 6.8081 and 62.156
Constructed mean & variance are 6.0679 and 23.9548

Since the data in y1 have the wrong mean and variance, you can't really expect fitdist to return the true parameter values as estimates.

If you actually want to check fitdist, then a better way to generate data to represent the distribution is to use the cdf, as John said. For example:

% define equally spaced points along the cdf
cdf_grain = 0.01;
cdf_points = cdf_grain/2:cdf_grain:(1-cdf_grain/2);
% define true distribution
p = 0.7457;
b = 9.129764;
pd = makedist('Gamma', 'a', p, 'b', b);
disp(['    True parameters ' num2str(pd.a) ' and ' num2str(pd.b)])
% look up the data points
X = icdf(pd,cdf_points);
% use fitdist with those data points
pd = fitdist(X','Gamma');
disp(['Estimate parameters ' num2str(pd.a) ' and ' num2str(pd.b)])

Here you will find that the estimated parameters match the true parameters rather well, and the match improves as cdf_grain is reduced.

(Check the mean and variance of these generate data points if you want; all are equally likely. They will not be perfect, but they will be closer to the true distribution than those obtained with your method.)

FWIW, I would not use chi2gof at all in this situation. As I see it, the only issue is how well the estimated parameters match the known true values.

2 个评论
显示无隐藏无

Markus Franke 2020-8-21

Thank you very much for your answer, Jeff!

The overall goal is to check for given (large) dataset(s) which distribution fits best. In that process, I stumbled across the outlined issue. To check the goodness of fit, I used the chi-squared test. As a rule of thumb, the amount of bins to use is sqrt(n), with n being the sample size of the data set, limiting the resolution.

You are totally right. The only issue is how well the estimated parameters match the true values. However, it is a bit problematic, if the whole procedure (especially fitdist) is so sensitive with regard to the resolution.

I will use the cdf, as John mentioned, in combination with the Kolmogorov-Smirnov test in the next tries. But this is another topic, I guess.

Therefore, I will mark this topic as answered.

Thanks again Jeff and John.

Jeff Miller 2020-8-21

With that overall goal, you might also find it illuminating to:

generate random datasets (of the size you have) from each of your candidate distributions.
fit each dataset with all of the candidate distributions, and then
see how often the known true underlying distribution provides a better fit to the dataset than any of the candidate distributions.

When I have done this, I have always been disappointed in the results. With (say) 4-5 similar candidate distributions, it is rarely the case that the true distribution provides the best fit to more than about half of the datasets (at least with datasets of the size I work with). Bottom line: identifying the true underlying distribution family is really hard, and in some cases it may be an unrealistic goal.

请先登录，再进行评论。

Answer 2

Jeff Miller 2020-8-21

1
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/582413-how-reliable-is-fitdist-with-regard-to-the-gamma-distribution#answer_482876

I think fitdist is doing the best it can, but you gave it a non-gamma sample to fit because you truncated the X range at 20. Note that y1 has lots of predicted scores at 18, 19, 20 but then stops suddenly, which would not be expected in a true random sample from this gamma distribution. Try X=1:1:50. Or, if you really want to consider a truncated distribution, look at MATLAB's truncate command for truncating probability distributions.

4 个评论
显示 2更早的评论隐藏 2更早的评论

John D'Errico 2020-8-21

编辑：John D'Errico 2020-8-21

在 MATLAB Online 中打开

I think part of the problem is you are not truly sending in true gamma samples into fitdist.

You are scaling the values of a pdf. So even the re-scaling is not truly correct, because to be correct, you would need to scale those effective histogram counts by the number of samples that would have fallen into a histogram bin.

X = 1:1:20;
p = 1;
b = 9;
scaling = 1000;
y1 = gamcdf(x + 0.5,p,b) - gamcdf(x - 0.5,p,b);
y1 = round(scaling*y1);

Even at that, this is still effectively censored data. If you were generating gamma data, and then binning it as if from a histogram, then:

>> gamcdf(1,1,9)
ans =
          0.10516068318563
>> gamcdf(20,1,9)
ans =
         0.891631976778104

we see that you have implicitly discarded all samples representing roughly 20% of your data. So you were effectively using a truncated distribution. I suppose censoring is another way to look at it.

So a better result might be gained if you did this:

y1 = gamcdf(x + 0.5,p,b) - gamcdf(x - 0.5,p,b);
y1(1) = gamcdf(x(1) + .5,p,b) - 0;
y1(end) = 1 - gamcdf(x(end) - .5,p,b);
y1 = round(scaling*y1);

This is NOT the same thing as just using pdf to scale your data into frequencies! A pdf does NOT return a probability!

Jeff's suggestion of extending the range will help some I think, but even there, use of a cdf is correct, not using the pdf.

Markus Franke 2020-8-21

在 MATLAB Online 中打开

To me, this feels like a work-around that is not really reliable. If you change p to p = 0.5, the null-hypothesis gets rejected again and the parameters of the fitted distribution are a bit off.

I just found out, that more reliable and better results can be obtained by doing the parameters estimation using the the least-squares approach.

p = 0.7457;
b = 9.129764;
lower = 1;
upper = 20;
X= lower:1:upper;
scaling = 1000;
y1 = scaling*gampdf(X,p,b);
fun = @(x)sseval(x,y1,X, scaling);
x0 = [0.5, 3];
bestx = fminsearch(fun,x0)
pd = makedist('Gamma', 'a', bestx(1), 'b', bestx(2));
expCounts = scaling * pdf(pd,X');
[h,pc,st] = chi2gof(X','Ctrs',X',...
                        'Frequency',y1', ...
                        'Expected',expCounts,...
                        'NParams',2, ...
                        'Alpha', 0.05);
                    
                    
function sse = sseval(x, ydata, bins, scaling)
p = x(1);
b = x(2);
pd = makedist('Gamma', 'a', p, 'b', b);
expC = scaling* pdf(pd,bins');
sse = sum((ydata' - expC).^2);
end