Hi all,
I have a somewhat specialised question about histograms and mean integrated squared error (MISE). I want to find the 'best' way to construct a histogram for some data using a quantifiable method. I thought MISE would be a good way to do this, since I can simulate data and then compare different histograms to the underlying probability distribution.
However, surprisingly (to me at least), I keep finding that histograms made with tiny bins always have a smaller MISE than ones with larger bins, even though the latter seem to reflect the underlying distribution more accurately.
As an example, I have written some mock code (below) that simulates random numbers from a normal distribution, bins them into a histogram and compares this histogram to the actual underlying distribution.
If we look at MISE with respect to bin size we get this:
So the tiny bins have the smallest error. However, if we plot some examples:
The red line is the actual probability distribution, the blue line is a histogram built using the smallest bin size (with the smallest MISE) and the green line is a histogram built using a larger bin size (with a larger MISE, but it 'looks' closer to the real distribution).
So what's going on? Is this just a property of MISE or am I making a mistake? Some areas where I think I might have made a mistake:
- Before calculating MISE I sum-normalise both distributions (divide each by its own sum). This makes sense to me, as I should be comparing probability distributions, but maybe they should be normalised in a different manner?
- Sometimes when MISE is written out there is also an 'Expected Value' term which I have not been able to interpret. This seems to be an average, but an average of what? I think this might fix the problem by scaling the MISE according to the average bin contents, but I'm not sure how to apply it.
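For reference, here is the form of the definition I have seen, next to the quantity my code actually computes. The notation is my own labelling and is an assumption, not taken from a specific source: h is the bin width, \hat{f}_h is the histogram estimate, f is the true density, n_j is the count in bin j and c_j is the centre of bin j.

```latex
% Textbook-style definition (expectation and integral as usually written)
\mathrm{MISE}(h) = \mathbb{E}\left[ \int \bigl( \hat{f}_h(x) - f(x) \bigr)^2 \, dx \right]

% What the code below computes for a single simulated sample
Q(h) = h \sum_j \bigl( \hat{p}_j - p_j \bigr)^2,
\qquad
\hat{p}_j = \frac{n_j}{\sum_k n_k},
\qquad
p_j = \frac{f(c_j)}{\sum_k f(c_k)}
```

The second expression uses sum-normalised bin values rather than densities and contains no expectation, which is where I suspect my two doubts above come in.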
Any help would be greatly appreciated,
NP.
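mu = 0;                                  % mean of the true distribution
std = 2;                                 % standard deviation (note: shadows the built-in std function)
bin_size = 0.1:0.1:3;                    % bin widths to test
vals = normrnd(mu,std,100,1);            % 100 simulated samples
mise_values = NaN(length(bin_size),1);
for bb = 1:length(bin_size)
    xi = -10:bin_size(bb):10;                    % bin edges
    kpdf = histcounts(vals,xi);                  % counts per bin
    xi2 = movmean(xi,2,'Endpoints','discard');   % bin centres
    updf = normpdf(xi2,mu,std);                  % true pdf evaluated at the bin centres
    updf = updf ./ nansum(updf);                 % sum-normalise the true distribution
    kpdf = kpdf ./ nansum(kpdf);                 % sum-normalise the histogram
    mise_values(bb) = sum( (updf - kpdf).^2 ) .* bin_size(bb);
end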
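% Error as a function of bin size
figure
scatter(bin_size,mise_values,'k');
refline

% Example histograms against the true distribution
figure
xi = -10:0.1:10;                             % smallest bin size
plot(xi,normpdf(xi,mu,std),'r'); hold on;    % true pdf (red)
xi2 = movmean(xi,2,'Endpoints','discard');   % bin centres
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'b')                   % sum-normalised histogram (blue)
xi = -10:0.8:10;                             % larger bin size
xi2 = movmean(xi,2,'Endpoints','discard');
f = histcounts(vals,xi);
plot(xi2,f./nansum(f),'g')                   % sum-normalised histogram (green)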