Goodness of fit for a Zipf distribution
11 次查看(过去 30 天)
显示 更早的评论
I have several vectors of numbers and would, for each one, like to fit a Zipf distribution, and estimate the goodness of fit relative to some standard benchmark.
From here, I got the formula for the PMF/PDF of the Zipf distribution, which I used as the third argument for fit in the following example:
x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)');
plot(f,'k') % plot model curve
However, this returns
rho = 28.05 (-Inf, Inf)
and therefore no model curve is plotted, and I'm not sure how to then continue with the goodness of fit test.
My questions:
1) Assuming the Zipf formula above is correct and rho is the only model parameter, why is this estimated with infinite confidence intervals?
2) This suggests using a Kolmogorov-Smirnov test (with bootstrapping to check significance) to compare the data to the theoretical Zipf's law distribution. However, the Matlab function kstest seems to refer to normal distributions only, with no option to specify another distribution type (e.g. Zipf). Others suggest other tests such as Anderson-Darling, but that too refers only to normal distributions.
3) Which of the goodness of fit parameters in gof_info output (SSE, R square etc.) needs to be passed to the significance test (Kolmogorov-Smirnov or Anderson-Darling)?
Many thanks for any help!
2 个评论
Torsten
2023-3-30
I don't understand why you use curve fitting when you have to do distribution fitting.
Read about the difference and the adequate functions:
采纳的回答
Torsten
2023-3-31
编辑:Torsten
2023-4-2
I set up the maximum likelihood function for your data and came up with an estimated value of s = 2.116 for the distribution parameter s of the Zeta distribution with p_s(k) = 1/k^s / zeta(s) (k=1,2,3...). I don't know if this is the discrete probability density function we are talking about or if the distribution you have in mind has finite support and is the Zipf's distribution (i.e. p_s(k) = 1/k^s / sum_{i=1}^{i=N} 1/i^s (k=1,2,...,N)):
If you want the version with finite support (Zipf's distribution) (e.g. N = 6 in your case), replace
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / zeta(smax);
by
fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / sum(1./x.^(smax));
x = [1:6]';
y = [80 40 20 10 5 2]';
X = [];
for i=1:numel(x)
X = [X,x(i)*ones(1,y(i))];
end
hold on
histogram(X,'Normalization','pdf')
%fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
s = 1.5:0.001:3;
fs = fun(s);
[~,index] = max(fs);
smax = s(index)
zipf = 1./x.^smax / zeta(smax);
%zipf = 1./x.^smax / sum(1./x.^(smax));
plot(x,zipf)
hold off
grid on
11 个评论
Torsten
2023-4-4
编辑:Torsten
2023-4-4
Why is "freq" non-integer for x = 3,6,7,8,9 ? I thought it is the absolute frequency that x is chosen.
To compare with the zipf distribution, you must calculate the empirical pdf from your (x,freq) data. It's given as epdf = freq/sum(freq).
After this, zipf must be defined as
zipf = 1./x.^alpha / sum(1./x.^alpha);
% Define some empirical frequency distribution
x = 1:10;
freq = 100 ./ x; % textbook zipf!
epdf = freq/sum(freq);
% Define the Zipf distribution
alpha = 1.0; % Shape parameter, 1.5 is apparently a good all-round value to start with
zipf_dist = 1./x.^alpha ./ sum(1./x.^alpha); % Compute the Zipf distribution
% Plot our empirical frequency distribution alongside the Zipf distribution
figure(1);
bar(x, epdf); % or freq\N
hold on;
plot(x, zipf_dist, 'r--');
xlabel('Rank');
ylabel('Probability');
legend('Observed', 'Zipf');
hold off
fun = @(alpha)prod(1./x.^(alpha*freq))./((sum(1./x.^alpha))^(sum(freq)));
alpha = 0.5:0.001:3;
falpha = arrayfun(@(alpha)fun(alpha),alpha);
figure(2)
plot(alpha,falpha)
[~,index] = max(falpha);
alphamax = alpha(index)
% Compute the goodness of fit using the chi-squared test
expected_freq = zipf_dist .* sum(freq);
chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
dof = length(freq) - 1;
p_value = 1 - chi2cdf(chi_squared, dof);
% Display the results
fprintf('Chi-squared statistic = %.4f\n', chi_squared);
fprintf('p-value = %.4f\n', p_value);
if p_value < 0.05
fprintf('Conclusion: The data is not from a Zipf distribution.\n');
else
fprintf('Conclusion: The data is from a Zipf distribution.\n');
end
更多回答(1 个)
Walter Roberson
2023-3-30
The zeta() function is 0 for all even negative integers, so zeta(rho+1) is 0 for all odd negative integers smaller than -1. You divide by that, so your function oscillates infinitely when rho is negative.
4 个评论
Walter Roberson
2023-3-31
One possibility is that the particular example numbers you posted is not well fit by a Zipf distribution.
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Hypothesis Tests 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!