# How to calculate the mean of values based on bins created from corresponding values?

Palash Dhande on 13 Nov 2019
Commented: Adam Danz on 13 Nov 2019
I have two column vectors, let's call them A and B, and I have created ordered pairs from the values in these two vectors.
I would like to make bins from the A values.
Then I would like to calculate the mean, max, and standard deviation of the corresponding B values in the bins created from the A values.
I have tried using histcounts, splitapply, and accumarray, but I haven't been able to find a correct solution. Any hints?
The A and B vectors are distance and intensity, respectively.
range_intensity is the combined matrix of these two column vectors.
range_intensity =
[NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
26.040001 0.011764706
26.080000 0.019607844
26.112000 0.023529412
26.232000 0.023529412
26.184000 0.031372551
26.240000 0.027450981
26.260000 0.031372551
26.271999 0.031372551
26.275999 0.031372551
26.316000 0.035294119
26.312000 0.035294119
26.351999 0.031372551
26.351999 0.031372551
26.372000 0.031372551
26.424000 0.031372551
26.424000 0.031372551
26.452000 0.031372551
26.480000 0.039215688
26.496000 0.035294119
26.572001 0.031372551
26.552000 0.035294119
26.604000 0.031372551
26.620001 0.035294119
26.680000 0.035294119
26.684000 0.035294119
26.719999 0.035294119
26.747999 0.027450981
26.784000 0.031372551
26.820000 0.031372551
26.848000 0.027450981
26.875999 0.031372551
26.872000 0.031372551
26.920000 0.027450981
26.944000 0.027450981
26.972000 0.031372551
27.020000 0.031372551
27.044001 0.027450981
27.115999 0.035294119
27.132000 0.031372551
27.164000 0.031372551
27.184000 0.035294119]
edges = [0:0.5:250];
[distance_count, indx] = histc(range_intensity(:,1), edges);
% function res=my_mean_omitnan(in)
% res=mean(in,'omitnan');
% end
mean = accumarray(indx+1, range_intensity(indx+1,2), [],@(x)mean(x,'omitnan'));
max = accumarray(indx+1, range_intensity(indx+1,2), [], @max);
std = accumarray(indx+1, range_intensity(indx+1,2), [], @std);
bar(max);
hold on;
plot(mean);
grid on;
One problem is that the length of the edges vector doesn't match the length of the mean and max vectors, so I can't plot the mean and max against the edges.
There are also NaN values in the two vectors, which should be discarded for the mean, max, and standard deviation calculations.
Furthermore, what would be the best way to visualize this data?

KALYAN ACHARJYA on 13 Nov 2019
Can you share A & B examples?
Looking for>>
1.....
2.....
Guillaume on 13 Nov 2019
It's a bit unclear what binning method you want to use. An example would indeed be useful.
accumarray or groupsummary is probably the easiest way to do what you want. mean has an 'omitnan' option, so ignoring NaNs is not a problem.
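As a minimal sketch of the accumarray route, assuming 0.5-wide bins and a small made-up sample in place of range_intensity (the variable names A, B, valid, and nBins are illustrative, not from the question): rows without a valid bin are filtered out first, because accumarray needs positive integer subscripts, and the output size is fixed to the number of bins so empty bins show up as NaN.

```matlab
% Hypothetical sample standing in for range_intensity(:,1) and (:,2)
A = [NaN; 26.04; 26.08; 26.60; 26.62; 27.10];   % distance
B = [NaN; 0.010; 0.020; 0.030; 0.050; 0.040];   % intensity

edges = 26:0.5:27.5;               % bin edges chosen to span the sample
bins  = discretize(A, edges);      % NaN distances get a NaN bin index

valid = ~isnan(bins) & ~isnan(B);  % keep only rows with a bin and a value
nBins = numel(edges) - 1;          % one fewer bin than edges

% Fixing the output size to nBins keeps one row per bin; empty bins are NaN
meanB = accumarray(bins(valid), B(valid), [nBins 1], @mean, NaN);
maxB  = accumarray(bins(valid), B(valid), [nBins 1], @max,  NaN);
stdB  = accumarray(bins(valid), B(valid), [nBins 1], @std,  NaN);
```

Because each result has numel(edges)-1 rows, it can be plotted directly against bin centers.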

Adam Danz on 13 Nov 2019
Edited: Adam Danz on 13 Nov 2019
Generally, the edges should cover the span of your data, no more and no less, with the exception that the final edge should be slightly larger than your maximum value to ensure that the final bin isn't absorbing extra values.
I suggest using discretize() to group the values in column 1 into discrete groups. The line below uses the range of your data to determine the range of bin edges.
edges = floor(min(range_intensity(:,1))) : .5 : ceil((max(range_intensity(:,1))+.001)*10/5)*5/10;
bins = discretize(range_intensity(:,1), edges);
The code above uses floor() to define the minimum bin edge. Bins are 0.5 units wide. It uses ceil() to define the maximum bin edge, but to ensure that the max edge doesn't fall on your maximum data value, it adds 0.001 and then rounds up to the nearest 0.5 (hence the *10/5 inside ceil() and the *5/10 outside it).
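To make the rounding concrete, here is a small worked sketch using only the smallest and largest non-NaN distances from the posted data (the names x, lo, and hi are just for illustration). Note that min() and max() skip NaN values by default, so the NaN rows do not affect the edges.

```matlab
x  = [26.040001; 27.184000];              % min and max distance in the question's data
lo = floor(min(x));                       % 26
hi = ceil((max(x) + 0.001)*10/5)*5/10;    % 27.185*2 = 54.37, ceil gives 55, 55*5/10 = 27.5
edges = lo : 0.5 : hi;                    % [26 26.5 27 27.5], i.e. three bins
bins  = discretize(x, edges);             % [1; 3]
```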
Computing group statistics
If you have the statistics and machine learning toolbox, use grpstats() to compute grouped statistics.
[meanVal, maxVal, stdVal] = grpstats(range_intensity(:,2),bins,{@mean, @max, @std});
If you do not have access to the stats and ML toolbox, use splitapply() (or accumarray or other alternatives) to compute your grouped stats.
meanVal = splitapply(@mean,range_intensity(:,2), bins); % Repeat for other stats
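A sketch of the remaining statistics with splitapply, under two assumptions worth stating: rows with a NaN bin are dropped explicitly before grouping, and findgroups() is used to renumber the bins because splitapply requires every group number from 1 to N to occur at least once. The sample data is made up so that one bin is empty; binId records which original bin each output row belongs to.

```matlab
% Hypothetical sample; no distance falls in the middle bin [26.5, 27)
A = [NaN; 26.04; 26.08; 27.10; 27.20];
B = [NaN; 0.010; 0.020; 0.040; 0.060];
edges = 26:0.5:27.5;
bins  = discretize(A, edges);             % [NaN; 1; 1; 3; 3]

valid = ~isnan(bins);                     % drop rows that have no bin
[g, binId] = findgroups(bins(valid));     % g = [1;1;2;2], binId = [1;3]

meanVal = splitapply(@mean, B(valid), g);
maxVal  = splitapply(@max,  B(valid), g);
stdVal  = splitapply(@std,  B(valid), g);

% x-positions for plotting only the bins that actually contain data
binCenters = edges(1:end-1) + diff(edges)/2;
xPlot = binCenters(binId);
```

With this layout, meanVal, maxVal, and stdVal have one row per populated bin, and xPlot gives the matching bin centers.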
Plotting the results
By definition, there is always one more bin edge than there are bins. One way to plot binned data is to compute the bin centers and use them as the x-values.
binCenters = edges(2:end) - (edges(2)-edges(1))/2;
If the bin edges were set up correctly following the steps above, you should end up with a vector of binCenters that is the same size as your grouped stats values. Plotting is then as simple as
figure()
bar(binCenters,maxVal)
hold on
plot(binCenters, meanVal,'ms')
grid on
Adam Danz on 13 Nov 2019
I wonder if there are bins that do not contain any data and whether groupsummary is merely skipping over those bins.
Guillaume on 13 Nov 2019
"I don't know why I keep forgetting about groupsummary()"
Probably because you have the stats toolbox and I don't.
groupsummary will return as many rows as numel(unique(bins)), so if some bin indices are not present, these will indeed be skipped. The second output of groupsummary gives you the bins matching the rows of the first output, so:
[meanmaxstd, bin] = groupsummary(range_intensity(:,2), bins, {'mean', 'max', 'std'});
edit: Or put the whole lot (range_intensity and bins) into a table and you'll get everything as one neat table as output (including number of elements used for each bin).
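The table variant could look roughly like the sketch below. The sample data and the variable names distance, intensity, and bin are assumptions for illustration; rows without a bin are removed first so that no group of missing values appears in the output.

```matlab
% Hypothetical sample standing in for range_intensity
A = [NaN; 26.04; 26.08; 27.10; 27.20];    % distance
B = [NaN; 0.010; 0.020; 0.040; 0.060];    % intensity
bins = discretize(A, 26:0.5:27.5);

T = table(A, B, bins, 'VariableNames', {'distance', 'intensity', 'bin'});
T = T(~isnan(T.bin), :);                  % drop rows that have no bin

% One row per populated bin: bin, GroupCount, mean_intensity, max_intensity, std_intensity
S = groupsummary(T, 'bin', {'mean', 'max', 'std'}, 'intensity');
```

The GroupCount column is the number of elements used for each bin.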
Adam Danz on 13 Nov 2019
This gives me the idea of building a function recommendation engine that skims all available command history and custom functions/scripts to get a sense of what functions a user typically uses and then recommends related functions that have rarely been used recently. The goal would be to expose the user to new functions outside of their repertoire.
Suppose the engine would search the user's content and list the top 500 most commonly used functions. For each function recognized by Matlab (i.e., not custom functions), the engine could reference the function's official "see also" section of the documentation page to list related functions and would eliminate ones that are already in the top 500. The ones that are left over can be ranked in order of the number of times they appeared across all of the 500 functions (or at least those that had a documentation page that included a "see also" section).
For example, if size() and length() are frequently used, both of those functions list numel() in their documentation pages which would then be recommended to the user to check out. It would be a dirty, ugly mess but machine learning algorithms have made products out of uglier messes.
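A toy sketch of just the ranking step, under heavy assumptions: topUsed and the seeAlso lookup are hand-written placeholders, not extracted from any real command history or documentation page, and a real engine would need to build both.

```matlab
% Hypothetical inputs (placeholders, not scraped from real documentation)
topUsed = {'size', 'length', 'histc'};
seeAlso = containers.Map();
seeAlso('size')   = {'length', 'ndims', 'numel'};
seeAlso('length') = {'ndims', 'numel', 'size'};
seeAlso('histc')  = {'histcounts', 'discretize'};

% Collect every "see also" entry of the frequently used functions
candidates = {};
for k = 1:numel(topUsed)
    if isKey(seeAlso, topUsed{k})
        candidates = [candidates, seeAlso(topUsed{k})]; %#ok<AGROW>
    end
end

% Remove functions the user already uses often
candidates = candidates(~ismember(candidates, topUsed));

% Rank the rest by how many "see also" lists mention them
[names, ~, idx] = unique(candidates);
counts = accumarray(idx(:), 1);
[counts, order] = sort(counts, 'descend');
recommended = names(order);
```

In this toy input, ndims and numel are each mentioned twice, so they rank ahead of histcounts and discretize.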
I'll put that on my (long) list of spare-time ideas....