Histograms - why does the smallest binsize always give the smallest mean integrated squared error?

Question

Neuropragmatist 2020-7-23


Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/569460-histograms-why-does-the-smallest-binsize-always-give-the-smallest-mean-integrated-squared-error


Hi all,

I have a bit of a specialised question involving histograms and mean integrated squared error (MISE). I want to find the 'best' way to construct a histogram for some data using a quantifiable method. I thought that MISE would provide a good way to do this as I can simulate data and then compare different histograms to the underlying probability distribution.

However, surprisingly (to me at least) I keep finding that histograms made with tiny bins always have a smaller MISE than ones with larger bins, even though the latter seem to reflect the data more accurately.

For an example I have made some mock code (below) which just simulates some random numbers from a normal distribution, bins them into a histogram and compares this histogram to the actual underlying distribution.

If we look at MISE with respect to bin size we get this:

So the tiny bins have the smallest error. However, if we plot some examples:

The red is the actual probability distribution, the blue line is a histogram built using the smallest bin size (with the smallest MISE) and the green is a histogram built using a larger binsize (with a larger MISE but 'looks' closer to the real distribution).

So what's going on? Is this just a property of MISE or am I making a mistake? Some areas where I think I might have made a mistake:

Before calculating MISE I sum-normalise both distributions. This makes sense to me, as I should be comparing probability distributions, but maybe they should be normalised in a different manner?
Sometimes when MISE is expressed there is also an 'Expected Value' operator which I have not been able to identify. This seems to be an average, but an average of what? I think this might fix the problem by scaling the MISE according to the average bin contents, but I'm not sure how to apply it.

Any help would be greatly appreciated,

NP.

% distribution mean
mu = 0;
% distribution standard deviation
std = 2;
% binsizes we want to test
bin_size = 0.1:0.1:3;
% random values from this distribution
vals = normrnd(mu,std,100,1);
    
% preallocate
mise_values = NaN(length(bin_size),1);
% run through every bin size
for bb = 1:length(bin_size)
    % values to evaluate distributions at
    xi = -10:bin_size(bb):10;
    
    % histogram of values
    kpdf = histcounts(vals,xi);
    
    % locations of bin centers (so PDF will match histogram)
    xi2 = movmean(xi,2,'Endpoints','discard');
    % underlying probability distribution
    updf = normpdf(xi2,mu,std);
    % mean integrated squared error between the histogram and the PDF
    % first normalise both
    updf = updf ./ nansum(updf);
    kpdf = kpdf ./ nansum(kpdf);
    % calculate MISE
    mise_values(bb) = sum( sum( (updf - kpdf).^2 ) ) .* bin_size(bb);
end
% plot MISE vs bin size
figure
scatter(bin_size,mise_values,'k');
refline
% plot different distributions
figure
xi = -10:0.1:10;
plot(xi,normpdf(xi,mu,std),'r'); hold on;
% plot 'best' binsize
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'b')
% plot a better one
xi = -10:0.8:10;
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'g')


Answer 1

John D'Errico 2020-7-23


Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/569460-histograms-why-does-the-smallest-binsize-always-give-the-smallest-mean-integrated-squared-error#answer_469803

Edited: John D'Errico 2020-7-23


Your error is a subtle one, but it is important to understand why it happens.

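x = randn(1000,1);                           % sample from a standard normal
xi = -5:0.1:5;                               % bin edges, width 0.1
histogram(x,xi,'Normalization','pdf')        % histogram scaled to unit area
hold on
fplot(@(x) normpdf(x))                       % true pdf
xi2 = movmean(xi,2,'Endpoints','discard');   % bin centers
f = histcounts(x,xi);                        % raw counts per bin
plot(xi2,f./nansum(f),'r')                   % relative counts: sum to 1, not unit area
legend('histogram - pdf normalization','true pdf','histogram - relative counts')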

The red curve at the bottom (look carefully, it is hard to see there) is the one you plotted. It is a simple relative number of counts per bin, so normalized to sum to 1. However, a pdf is normalized to have unit area.

Instead, see the difference here:

figure
dx = 0.1;                                    % bin width used for xi above
plot(xi2,f./nansum(f)/dx,'r')                % divide by bin width so the area is 1
hold on
fplot(@(x) normpdf(x))                       % true pdf
legend('histogram - pdf normalization','true pdf')

Do you see the difference? I used your same data, but now the histogram is properly normalized, in a way that is consistent with a pdf.

While you think it makes sense for the simple frequency histogram to sum to 1, it was NOT normalized to INTEGRATE to have an area of 1. That only happened when I scaled it by dividing by dx.
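
The two normalizations can also be checked numerically. This is a minimal sketch, assuming the same kind of standard-normal sample and 0.1-wide bins as above; the fixed seed and the variable names `relFreq` and `density` are illustrative and not from the original code.

```matlab
% Sketch: compare sum-to-1 and area-to-1 normalizations of a histogram.
rng(1);                           % fixed seed (an assumption, for repeatability)
x  = randn(1000,1);               % sample from a standard normal
dx = 0.1;                         % bin width
xi = -5:dx:5;                     % bin edges

counts  = histcounts(x,xi);       % raw counts per bin
relFreq = counts ./ sum(counts);  % relative frequencies: these SUM to 1
density = relFreq ./ dx;          % density estimate: this INTEGRATES to 1

fprintf('sum of relative frequencies : %.4f\n', sum(relFreq));
fprintf('area under relative freqs   : %.4f\n', sum(relFreq)*dx);
fprintf('area under density estimate : %.4f\n', sum(density)*dx);
```

The relative frequencies sum to 1, but the area under them is only the bin width (0.1 here), which is why that curve sits so far below the true pdf. `histcounts` also accepts `'Normalization','pdf'`, which applies the bin-width scaling for you.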

As far as the smaller bin size being better, that should just reflect the idea that a smaller bin size can better approximate the true distribution.


Neuropragmatist 2020-7-23

Of course, this is exactly the problem! This is extremely helpful, thank you. Now, when I normalise for integration the MISE is actually high for small bins and then decreases to a plateau of nice bin sizes.
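
A minimal sketch of the corrected calculation, assuming the same setup as the code in the question (100 draws from a normal distribution with mean 0 and standard deviation 2, bin sizes 0.1 to 3). The fixed seed and the names `sigma`, `khat`, `ftrue` and `ise` are illustrative. For a single sample this is strictly an integrated squared error, and the integral is only approximated by evaluating the true PDF at the bin centers.

```matlab
% Sketch: integrated squared error vs. bin size, with the histogram
% normalised to unit AREA rather than unit sum.
rng(1);                                   % fixed seed (assumption, for repeatability)
mu = 0; sigma = 2;                        % 'sigma' avoids shadowing the std() function
vals = mu + sigma .* randn(100,1);        % same distribution as normrnd(mu,sigma,100,1)
bin_size = 0.1:0.1:3;
ise = NaN(numel(bin_size),1);

for bb = 1:numel(bin_size)
    dx = bin_size(bb);
    xi = -10:dx:10;                       % bin edges
    xc = xi(1:end-1) + dx/2;              % bin centers
    counts = histcounts(vals,xi);
    khat  = counts ./ sum(counts) ./ dx;  % histogram as a density (area = 1)
    % true normal PDF at the bin centers; it is NOT renormalised
    ftrue = exp(-(xc-mu).^2 ./ (2*sigma^2)) ./ (sigma*sqrt(2*pi));
    ise(bb) = sum((khat - ftrue).^2) .* dx;   % Riemann-sum approximation of the integral
end

figure
plot(bin_size,ise,'ko-')
xlabel('bin size'); ylabel('integrated squared error')
```

The PDF is written out explicitly here so the sketch does not depend on `normpdf`; substituting `normpdf(xc,mu,sigma)` should give the same values.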

Do you also have any insight on the 'expected value' in the MISE formula?

https://en.wikipedia.org/wiki/Mean_integrated_squared_error

I have seen a few papers where this was omitted, so I'm not sure what purpose it serves.

Thanks,

NP.
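
On the 'expected value' question: in the standard definition, which the linked Wikipedia article also uses, the expectation is taken over the random sample used to build the estimate. A sketch in LaTeX, where \hat{f}_n is assumed to denote the histogram (or any other density estimate) built from n samples and f the true density:

```latex
\operatorname{ISE}(\hat{f}_n) = \int \bigl(\hat{f}_n(x) - f(x)\bigr)^2 \, dx,
\qquad
\operatorname{MISE}(\hat{f}_n) = \mathbb{E}\bigl[\operatorname{ISE}(\hat{f}_n)\bigr]
  = \mathbb{E}\!\left[\int \bigl(\hat{f}_n(x) - f(x)\bigr)^2 \, dx\right]
```

Under this reading, one simulated data set gives a single realisation of the ISE, and averaging the ISE over many independently simulated data sets would approximate the MISE. Papers that omit the expectation may simply be reporting the ISE for one sample, though that is an assumption about those papers.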
