What is the origin of the discrepancy in binning caused by the built-in Freedman–Diaconis method in the histcounts function?

Question

Drew 2024-6-13

https://ww2.mathworks.cn/matlabcentral/answers/2128486-what-is-the-origin-of-the-discrepancy-in-binning-caused-by-the-built-in-freedman-diaconis-method-in

Edited: Deep 2024-8-6, 8:38

I am attempting to use the Freedman–Diaconis (FD) rule to determine appropriate bins for a right-skewed dataset. In MATLAB's histcounts function, the 'BinMethod' parameter offers a built-in FD rule via the value 'fd'. The MATLAB reference page for histcounts gives the following formula for the bin width that is supposedly applied when one selects 'fd':

2*iqr(X(:))*numel(X)^(-1/3)

In my testing, I have found that the actual bins produced by the 'fd' method vary significantly from what one expects to observe when manually applying the FD rule. Even accounting for the fact that "histcounts adjusts the number of bins slightly so that the bin edges fall on 'nice' numbers, rather than using these exact formulas" as stated on the MATLAB reference page, the differences in binning are substantial. While I have not been able to discern an exact pattern regarding the differences between MATLAB and manual FD binning, I have observed that the MATLAB FD typically reduces the number of total bins by a factor of 2-10 across various datasets.

To illustrate this issue, I’ve attached a MATLAB script that re-creates this discrepancy. With this seed, the manual FD rule is generating 239 bins and the MATLAB FD rule is generating 106 bins. I’ve also attached the output figures. Similar discrepancies occur regardless of seed or population size.

clear
clc
%Data Generation
rng(1);
data = randn(10000, 1);
skewed_data = exp(data);
%Manual FD
bin_length = 2*iqr(skewed_data(:))*numel(skewed_data)^(-1/3); %Formula from MATLAB references
edges = min(skewed_data):bin_length:max(skewed_data);
b = histcounts(skewed_data, edges);
figure
bar(edges(1:end-1), b, 'histc');
xlabel('Value');
ylabel('Frequency');
title('Manual FD');
disp('Number of bins for manual FD:')
disp(length(b));
%MATLAB FD
[b, edges]   = histcounts(skewed_data,'BinMethod','fd');
figure
bar(edges(1:end-1), b, 'histc');
xlabel('Value');
ylabel('Frequency');
title('MATLAB FD');
disp('Number of bins for MATLAB FD:')
disp(length(b));

Regarding my interest in this discrepancy: I used both MATLAB FD and manual FD binning on a dataset before conducting non-linear optimization, and I found that the resulting models were best when the MATLAB FD was applied. I am preparing to publish, and I want to be able to explain exactly how my data is being binned to maximize performance. As a result, I would like to know how the MATLAB FD method, and broadly histcounts is binning my data.

Of note: I've found similar differences in binning between the MATLAB and manual formulas for Sturges and Scott as well. While I imagine that the origin of this difference may be similar, I am primarily concerned with Freedman–Diaconis currently.

Answer 1

Deep 2024-8-6，8:38

https://ww2.mathworks.cn/matlabcentral/answers/2128486-what-is-the-origin-of-the-discrepancy-in-binning-caused-by-the-built-in-freedman-diaconis-method-in#answer_1495196

Edited: Deep 2024-8-6, 8:38

Hi Drew,

I see the same discrepancy you describe: the bin width produced by MATLAB's built-in Freedman–Diaconis rule can differ from the one a manual calculation gives.

If you set a breakpoint on your histcounts call, MATLAB will let you step into the function. There, you can find a guard on the IQR that explains the observed discrepancy.

MATLAB's implementation of the Freedman–Diaconis method guards against a small IQR (interquartile range). The IQR value used to calculate the bin width is taken to be at least one tenth of the range of the data. This prevents the bin width from becoming extremely narrow when the IQR is small relative to the range, which can happen when the data contain extreme values.
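
Written as a formula, the guarded bin width described above would look as follows. This is a sketch based only on the description in this answer; the symbols h (bin width), x (the data) and n (the number of data points) are introduced here for illustration, and histcounts may still adjust the result so that the bin edges fall on 'nice' numbers.

```latex
h = 2 \cdot \max\!\left( \operatorname{IQR}(x),\ \frac{\max(x) - \min(x)}{10} \right) \cdot n^{-1/3}
```

The guard changes the result only when the IQR is smaller than one tenth of the range; otherwise the expression reduces to the formula given on the reference page.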

The following manual calculation of the FD rule includes this guard and should match the output of MATLAB's histcounts:

%% Data generation 
rng(1); 
data = randn(10000, 1); 
skewed_data = exp(data); 
%% MATLAB FD 
[bins, edges]   = histcounts(skewed_data,'BinMethod','fd','BinLimits', [min(skewed_data) max(skewed_data)]); 
disp('Number of bins for MATLAB FD:'); 
disp(length(bins)) 
%% Manual FD calculation (with an IQR guard) 
iq = max(iqr(skewed_data(:)), range(skewed_data(:))/10);  % This is a guard against small IQR 
bin_length = 2*iq*numel(skewed_data)^(-1/3); 
nbins = ceil((max(skewed_data)-min(skewed_data))/bin_length); 
disp('Number of bins for Manual FD:'); 
disp(nbins);
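
To check whether the guard is what changes the bin width for a given dataset, one can compare the unguarded and guarded widths with the spacing of the edges that histcounts returns. The sketch below assumes the guard behaves as described in this answer. The variable names are illustrative, and the range is computed as max minus min so that only iqr and histcounts are needed. Because histcounts rounds the edges to 'nice' numbers, and its internal IQR calculation may differ slightly from iqr, its width should be close to the guarded value but not necessarily identical.

```matlab
%% Data generation (same data as above)
rng(1);
skewed_data = exp(randn(10000, 1));

%% Unguarded and guarded FD bin widths
n          = numel(skewed_data);
raw_iqr    = iqr(skewed_data);
data_range = max(skewed_data) - min(skewed_data);

guard_active    = raw_iqr < data_range/10;                 % true when the guard replaces the IQR
unguarded_width = 2*raw_iqr*n^(-1/3);                      % formula from the reference page
guarded_width   = 2*max(raw_iqr, data_range/10)*n^(-1/3);  % formula with the assumed guard

%% Bin width actually used by histcounts
[~, edges]   = histcounts(skewed_data, 'BinMethod', 'fd');
matlab_width = edges(2) - edges(1);

fprintf('Guard active:     %d\n', double(guard_active));
fprintf('Unguarded width:  %g\n', unguarded_width);
fprintf('Guarded width:    %g\n', guarded_width);
fprintf('histcounts width: %g\n', matlab_width);
```

If guard_active is false for a dataset, the two manual widths coincide, and any remaining difference from histcounts comes from the edge rounding alone.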