Goodness of fit for a Zipf distribution

Question

z8080 2023-3-30

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1938269-goodness-of-fit-for-a-zipf-distribution

编辑： Torsten 2023-4-4

I have several vectors of numbers and would, for each one, like to fit a Zipf distribution, and estimate the goodness of fit relative to some standard benchmark.

From here, I got the formula for the PMF/PDF of the Zipf distribution, which I used as the third argument for fit in the following example:

x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)'); 
plot(f,'k') % plot model curve

However, this returns

rho = 28.05 (-Inf, Inf)

and therefore no model curve is plotted, and I'm not sure how to then continue with the goodness of fit test.

My questions:

1) Assuming the Zipf formula above is correct and rho is the only model parameter, why is this estimated with infinite confidence intervals?

2) This suggests using a Kolmogorov-Smirnov test (with bootstrapping to check significance) to compare the data to the theoretical Zipf's law distribution. However, the Matlab function kstest seems to refer to normal distributions only, with no option to specify another distribution type (e.g. Zipf). Others suggest other tests such as Anderson-Darling, but that too refers only to normal distributions.

3) Which of the goodness of fit parameters in gof_info output (SSE, R square etc.) needs to be passed to the significance test (Kolmogorov-Smirnov or Anderson-Darling)?

Many thanks for any help!

2 个评论
显示无隐藏无

Torsten 2023-3-30

I don't understand why you use curve fitting when you have to do distribution fitting.

Read about the difference and the adequate functions:

https://de.mathworks.com/help/stats/curve-fitting-and-distribution-fitting.html

z8080 2023-3-31

Thanks for pointing this out: indeed, seems I was oblivious about the difference between those two, and it's really distribution fitting that I need, since my x axis is merely rank.

The documentation you pointed to suggests to me it's fitdist that I should use, however that doesn't have the distribution I need (Zipf) among its possible distname values. Any more advice?

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Torsten 2023-3-31

1
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1938269-goodness-of-fit-for-a-zipf-distribution#answer_1205984

编辑：Torsten 2023-4-2

在 MATLAB Online 中打开

I set up the maximum likelihood function for your data and came up with an estimated value of s = 2.116 for the distribution parameter s of the Zeta distribution with p_s(k) = 1/k^s / zeta(s) (k=1,2,3...). I don't know if this is the discrete probability density function we are talking about or if the distribution you have in mind has finite support and is the Zipf's distribution (i.e. p_s(k) = 1/k^s / sum_{i=1}^{i=N} 1/i^s (k=1,2,...,N)):

https://en.wikipedia.org/wiki/Zeta_distribution

https://en.wikipedia.org/wiki/Zipf%27s_law

If you want the version with finite support (Zipf's distribution) (e.g. N = 6 in your case), replace

fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / zeta(smax);

by

fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / sum(1./x.^(smax));

x = [1:6]';

y = [80 40 20 10 5 2]';

X = [];

for i=1:numel(x)

X = [X,x(i)*ones(1,y(i))];

end

hold on

histogram(X,'Normalization','pdf')

%fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);

fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);

s = 1.5:0.001:3;

fs = fun(s);

[~,index] = max(fs);

smax = s(index)

smax = 2.1160

zipf = 1./x.^smax / zeta(smax);

%zipf = 1./x.^smax / sum(1./x.^(smax));

plot(x,zipf)

hold off

grid on

11 个评论
显示 9更早的评论隐藏 9更早的评论

Torsten 2023-4-2

编辑：Torsten 2023-4-2

1) My data points are discrete, but I'm still not sure which of the two expressions you suggested (Zipf of Zeta) I'd need to choose. Using one or other when running the code, their respective plots look very similar to me.

If all of the positive integers can be taken as values for the rank, choose zeta, if only ranks up to a maximum value can be taken, use Zipf with N being this maximal rank.

2) As I change values in my example x vector, smax stays the same regardless of how good a fit the Zipf curve is. Which variable/parameter tells me how well the Zipf distribution fits the values in x?

If you change x, you also have to change the maximum likelihood function I gave you.

There is no direct parameter that can tell you how well the fitting was done. You might want to make a graph of the empirical and the theoretical cumulative distribution functions to get at least a visual impression.

3) Can fitdist not help here?

How should fitdist help if the probability distribution you want cannot be selected ? Maybe your distribution is a special case of one of the distributions that can be chosen - I did not look into it more deeply. But fitdist does the same as I did: Define the maximum likelihood function with your data and determine the parameter(s) for which it is maximized. fitdist also does not return a parameter on how well the fitting was done, as far as I know. Maybe confidence intervals for the parameters - I'm not sure.

Torsten 2023-4-2

编辑：Torsten 2023-4-2

I think I understand how it works in principle, but I'm still not sure (other than by attempting some hacky & probably buggy code with lots of for loops) how to obtain that expression given an input vector x that will have a different number of elements each time.

Yes, make an attempt with hacky and probably buggy code. You know what function should come out from the example above. If you have problems, include your code, and I will try to unhack and debug.

I wonder if there is some other (higher-level) way of obtaining the parameters of the fitted distribution (`s`, if that's the only one).

This is the usual way. Often the logarithm is taken of the maximum likelihood function in order to make numerics easier. Look it up under Wikipedia.

Also, ultimately I need to numerically assess how well Zipf fits to each of my vectors, as those numbers are then used further downstream. Assessing goodness of fit visually via the plot for each one, as you suggested, is unfortunately not an option. s merely tells us the shape of the Zipf if I understand well, but not how well it fits the data per se.

The visualization of your empirical distribution together with the theoretical approximation should be the starting point. I understand that this cannot be used in computations, but if one has problems interpreting theoretical results, a plot often helps.

But for theoretically estimating goodness of fit for distributions I reach my statistical limits. You mentionned Anderson-Darling - this could be a point to start.

You may find the discussion and the links given here of interest for your application:

https://de.mathworks.com/matlabcentral/answers/1913185-how-to-calculate-r-2-using-1-ssr-sst-for-normal-fit-distribution?s_tid=srchtitle

z8080 2023-4-4

编辑：z8080 2023-4-4

在 MATLAB Online 中打开

Here's my attempt so far (for some reason code highlighting doesn't work):

% Define some empirical frequency distribution

x = 1:10;

freq = 100 ./ x; % textbook zipf!

% Define the Zipf distribution

alpha = 1.5; % Shape parameter, 1.5 is apparently a good all-round value to start with

N = sum(freq); % Total number of observations

k = 1:length(x); % Rank of each observation

zipf_dist = N ./ (k.^alpha); % Compute the Zipf distribution

% Plot our empirical frequency distribution alongside the Zipf distribution

figure;

bar(x, freq); % or freq\N

hold on;

plot(x, zipf_dist, 'r--');

xlabel('Rank');

ylabel('Frequency');

legend('Observed', 'Zipf');

% Compute the goodness of fit using the chi-squared test

expected_freq = zipf_dist .* N;

chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);

dof = length(freq) - 1;

p_value = 1 - chi2cdf(chi_squared, dof);

% Display the results

fprintf('Chi-squared statistic = %.4f\n', chi_squared);

Chi-squared statistic = 170591.8168

fprintf('p-value = %.4f\n', p_value);

p-value = 0.0000

if p_value < 0.05

fprintf('Conclusion: The data is not from a Zipf distribution.\n');

else

fprintf('Conclusion: The data is from a Zipf distribution.\n');

end

Conclusion: The data is not from a Zipf distribution.

As a sanity check, I used a textbook-Zipf rank distribution, which unfortunately doesn't pass the statistical test. If that doesn't, nothing will. Clearly something is wrong, but I don't know what. Using the Kolmogorov-Smirnoff test, or the Anderson-Darling test with a custom-built (non-normal) distribution in place of the chi-squared test does not change this.

THoughts?

Torsten 2023-4-4

编辑：Torsten 2023-4-4

在 MATLAB Online 中打开

Why is "freq" non-integer for x = 3,6,7,8,9 ? I thought it is the absolute frequency that x is chosen.

To compare with the zipf distribution, you must calculate the empirical pdf from your (x,freq) data. It's given as epdf = freq/sum(freq).

After this, zipf must be defined as

zipf = 1./x.^alpha / sum(1./x.^alpha);

% Define some empirical frequency distribution

x = 1:10;

freq = 100 ./ x; % textbook zipf!

epdf = freq/sum(freq);

% Define the Zipf distribution

alpha = 1.0; % Shape parameter, 1.5 is apparently a good all-round value to start with

zipf_dist = 1./x.^alpha ./ sum(1./x.^alpha); % Compute the Zipf distribution

% Plot our empirical frequency distribution alongside the Zipf distribution

figure(1);

bar(x, epdf); % or freq\N

hold on;

plot(x, zipf_dist, 'r--');

xlabel('Rank');

ylabel('Probability');

legend('Observed', 'Zipf');

hold off

fun = @(alpha)prod(1./x.^(alpha*freq))./((sum(1./x.^alpha))^(sum(freq)));

alpha = 0.5:0.001:3;

falpha = arrayfun(@(alpha)fun(alpha),alpha);

figure(2)

plot(alpha,falpha)

[~,index] = max(falpha);

alphamax = alpha(index)

alphamax = 1

% Compute the goodness of fit using the chi-squared test

expected_freq = zipf_dist .* sum(freq);

chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);

dof = length(freq) - 1;

p_value = 1 - chi2cdf(chi_squared, dof);

% Display the results

fprintf('Chi-squared statistic = %.4f\n', chi_squared);

Chi-squared statistic = 0.0000

fprintf('p-value = %.4f\n', p_value);

p-value = 1.0000

if p_value < 0.05

fprintf('Conclusion: The data is not from a Zipf distribution.\n');

else

fprintf('Conclusion: The data is from a Zipf distribution.\n');

end

Conclusion: The data is from a Zipf distribution.

请先登录，再进行评论。

Answer 2

Walter Roberson 2023-3-30

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1938269-goodness-of-fit-for-a-zipf-distribution#answer_1205209

The zeta() function is 0 for all even negative integers, so zeta(rho+1) is 0 for all odd negative integers smaller than -1. You divide by that, so your function oscillates infinitely when rho is negative.

4 个评论
显示 2更早的评论隐藏 2更早的评论

z8080 2023-3-31

编辑：z8080 2023-3-31

Thanks, but since the task I'm aiming for is fairly simple (see how well a Zipf distribution fits a bunch of numbers), surely there's a straightforward way to achieve that in Matlab? Edge cases such as the one you demonstrated suggest, if anything, that what I want cannot be achieved, which I doubt.

In sum, if restricting rho to positives doesn't get the job done, what does get it done?

Walter Roberson 2023-3-31

One possibility is that the particular example numbers you posted is not well fit by a Zipf distribution.

请先登录，再进行评论。

Goodness of fit for a Zipf distribution

2 个评论
显示无隐藏无

采纳的回答

11 个评论
显示 9更早的评论隐藏 9更早的评论

更多回答（1 个）

4 个评论
显示 2更早的评论隐藏 2更早的评论

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

Goodness of fit for a Zipf distribution

2 个评论 显示 无隐藏 无

采纳的回答

11 个评论 显示 9更早的评论隐藏 9更早的评论

更多回答（1 个）

4 个评论 显示 2更早的评论隐藏 2更早的评论

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

2 个评论
显示无隐藏无

11 个评论
显示 9更早的评论隐藏 9更早的评论

4 个评论
显示 2更早的评论隐藏 2更早的评论