How to calculate the confidence interval
Hi
I have a vector x with, e.g., 100 data points. I can easily calculate the mean, but now I want the 95% confidence interval. I can calculate the 95% confidence interval as follows:
CI = mean(x) ± t * (s / sqrt(n))
where s is the standard deviation and n the sample size (= 100).
Is there a function in MATLAB where I can just feed in the vector and get the confidence interval back?
Alternatively, I can write my own function, but I need at least the value of t (the critical value of the t-distribution), because it depends on the number of samples and I don't want to look it up in a table every time. Is this possible?
Would be very nice if somebody could give an example.
Last but not least, I want 95% confidence in a 5% interval around the mean. For checking that I just have to calculate the 95% confidence interval and then check if the retrieved value is less than 5% of my mean, right?
4 Comments
Andrei Keino
2017-7-18
How to calculate the confidence interval for a linear model: it's the same as for the mean value, but the number of degrees of freedom (dof) is equal to 1.
Adam Danz
2019-8-5
Jennifer Wade
2022-2-15
I use something like this for a generic data vector, A:
N = length(A)
STDmean = std(A)/sqrt(N) % Standard error of the mean: std(A), not mean(A), divided by sqrt(N)
dof = N - 1; % Depends on the problem, but this is standard for a CI around a mean.
studentst = tinv([.025 0.975],dof) % tinv returns the Student's t critical values for the two-tailed 95% CI
CI = mean(A) + studentst*STDmean % [lower, upper] bounds around the mean
I'm looking into bootci now!
Accepted Answer
Star Strider
2014-10-20
This works:
x = randi(50, 1, 100); % Create Data
SEM = std(x)/sqrt(length(x)); % Standard Error
ts = tinv([0.025 0.975],length(x)-1); % T-Score
CI = mean(x) + ts*SEM; % Confidence Intervals
You have to have the Statistics Toolbox to use the tinv function. If you do not have it, I can provide you with a few lines of my code that will calculate the t-probability and its inverse.
25 Comments
Star Strider
2014-10-20
Edited: Star Strider
2014-10-20
My pleasure!
It will give you the 95% confidence interval using a two-tailed t-distribution. This is the centre 95%, so the lower and upper 2.5% tails of the distribution are not included.
Note that it also considers that you are only estimating one parameter (the mean) and so has n-1 degrees-of-freedom.
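Written out, the interval computed by the code in the answer is the usual two-sided t-interval for a mean. The symbols \bar{x}, s and n below are shorthand for mean(x), std(x) and length(x); they are notation introduced here, not part of the original answer:

```latex
\mathrm{CI}_{95\%} \;=\; \bar{x} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
```

Because the t-distribution is symmetric, tinv([0.025 0.975], n-1) returns the pair (-t, +t), which is why a single addition in the code yields both bounds at once.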
Sepp
2014-10-21
Thank you for the clarification. So CI now has two values, one above the mean and one below. So if I want to plot the confidence interval, I just add ts*SEM to the mean (upper bound) and subtract it from the mean (lower bound) and plot that, right?
And if I want to check whether my measurements (one parameter) are within a 5% interval, I just calculate ts*SEM and check if it is less than 5% of the mean. Is this right?
Star Strider
2014-10-21
My pleasure.
Yes. That is what I did in my calculation of the ‘CI’ variable.
If you want to determine if a value is within the 95% confidence limits, test to see if it is >=CI(1) and <=CI(2).
Also, it is termed a 95% confidence interval, not 5%. I mention this to avoid confusion.
Sepp
2014-10-21
Edited: Sepp
2014-10-21
Hmmm ok, but I have to use the following: "For experiments, fix a target (typically 95% confidence in a 5 - 10% interval around the mean) and repeat the experiments until the level of confidence is reached."
Does this not mean just checking if CI(2) - CI(1) is < 5-10% of mean?
Star Strider
2014-10-21
I’d have to know more about what you’re doing. The statement "For experiments, fix a target (typically 95% confidence in a 5 - 10% interval around the mean) and repeat the experiments until the level of confidence is reached." makes no sense to me. The confidence interval is defined by the parameter (or parameters) you are estimating. You can’t play fast and loose with the definition!
Sepp
2014-10-21
If I collect more data points, i.e. if N increases, then the confidence interval will most likely get narrower. I have to do experiments where I'm measuring the throughput over e.g. 10 minutes. If I ran it for 30 minutes, I would get more data points and thus a narrower CI.
Star Strider
2014-10-21
True, but that also assumes the standard deviation (SD) does not change. If you collect 3x as many samples, and your SD remains the same, your standard error (SE) would decrease by about 40%, 1-1/sqrt(3). The t-distribution approaches the normal distribution above about 30 degrees-of-freedom, so the t-statistic would not change significantly.
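The "about 40%" figure follows directly from the standard-error formula. As a short worked step, assuming the standard deviation s is unchanged when the sample grows from n to 3n:

```latex
\frac{\mathrm{SE}_{3n}}{\mathrm{SE}_{n}}
  \;=\; \frac{s/\sqrt{3n}}{s/\sqrt{n}}
  \;=\; \frac{1}{\sqrt{3}} \;\approx\; 0.577,
\qquad
1 - \frac{1}{\sqrt{3}} \;\approx\; 0.423
```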
Sepp
2014-10-22
So, this means that I can just check if CI(2) - CI(1) is < 5-10% of the mean for some number of data points, right?
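A minimal sketch of that check, under some loud assumptions: the data are synthetic (a hypothetical throughput sample), the target is read as "full CI width no more than 5% of the mean", and the critical value comes from the toolbox-free t_inv idea that Star Strider posts later in this thread rather than from tinv. The names target, ciWidth, relWidth and enoughData are illustrative only.

```matlab
% Toolbox-free t critical value (same idea as the t_inv posted later in this thread)
tdist2T = @(t,v) 1 - betainc(v/(v+t^2), v/2, 0.5);    % P(|T| < t), scalar t
tdist1T = @(t,v) 1 - (1 - tdist2T(t,v))/2;            % P(T < t) for t >= 0
t_inv   = @(p,v) fzero(@(tval) max(p,1-p) - tdist1T(tval,v), 5);

rng(1)                               % fixed seed so the sketch is repeatable
x = 100 + 20*randn(1,400);           % hypothetical throughput measurements
target = 0.05;                       % assumed target: CI width <= 5% of the mean

n = numel(x);
SEM = std(x)/sqrt(n);                % standard error of the mean
tcrit = t_inv(0.975, n-1);           % two-tailed 95% critical value
ciWidth = 2*tcrit*SEM;               % CI(2) - CI(1)
relWidth = ciWidth/abs(mean(x));     % width relative to the mean
enoughData = relWidth <= target;     % true once the target is reached

fprintf('relative CI width = %.4f, enough data: %d\n', relWidth, enoughData);
```

In an experiment loop, the same test would be repeated after each batch of new measurements. Whether the target refers to the full width or the half-width is a convention that has to be fixed beforehand.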
Star Strider
2014-10-22
You could certainly do that, but I’m not sure how meaningful it would be. Consider two vectors of random numbers with the same (normal) distribution, the only difference between them being a fixed offset:
x1 = randn(1,100);
x2 = randn(1,100)+10;
SEM1 = std(x1)/sqrt(length(x1));
SEM2 = std(x2)/sqrt(length(x2));
RR1 = SEM1/mean(x1);
RR2 = SEM2/mean(x2);
Taking the ratio as in ‘RR1’ and ‘RR2’ would produce a significantly lower ratio for ‘RR2’ (in this illustration) in spite of the data themselves being essentially the same.
The way I understand your latest comment (no promises that I do), you might want to compare CI values with increasing numbers of data points and compare them. I suspect they will become asymptotic to some non-zero value and not decrease further.
To illustrate:
x = randn(1,1E+6);
st = 10000;
for k1 = 1:st:length(x)
    xs = x(1:st+(k1-1));                                   % first k1+st-1 samples
    SE = std(xs)/sqrt(length(xs));                         % standard error
    xss(1+fix((k1-1)/st)) = length(xs);                    % sample size
    CI(1+fix((k1-1)/st)) = SE*tinv(0.975,length(xs)-1)*2;  % full CI width
end
figure(1)
stairs(xss,CI)
grid
xlabel('Sample Size')
ylabel('CI')
I’m certain there is an analytic proof of this available, but I’m not up to looking for it just now.
Sepp
2014-10-24
Edited: Sepp
2014-10-24
Hi. Sorry for disturbing you again. I have now played around with the confidence interval and noticed some very strange behaviour.
I have the following values: length(x) = 4508338, std(x) = 2036.04818, mean(x) = 1246.88844.
With your formula I get ts*SEM = ± 1.879438088697457, which is incredibly small compared to the standard deviation.
Why? I thought that a very large standard deviation must also give a large confidence interval.
Star Strider
2014-10-24
That is a correct result.
This is the confidence interval for the mean, indicating that these are the limits based on the sample that would include the mean of the population. So the larger your sample, the more likely you are to estimate the mean of the population, and therefore the confidence interval decreases with increasing sample size. As the sample size approaches infinity, the standard error and therefore the confidence interval approach zero.
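To make the numbers concrete, here is a small sketch that reproduces the half-width from the values reported above and shows how it shrinks with sample size. One assumption is made: with millions of degrees of freedom the t critical value is practically the normal one, so erfinv (base MATLAB) stands in for tinv. The sample sizes in nList are arbitrary illustrations.

```matlab
% Values reported in the comment above
n = 4508338;
s = 2036.04818;

SEM = s/sqrt(n);                 % standard error of the mean
z = sqrt(2)*erfinv(0.95);        % normal 97.5% quantile, about 1.96 (stand-in for tinv)
halfWidth = z*SEM;               % should be close to the reported 1.8794
fprintf('SEM = %.4f, half-width = %.4f\n', SEM, halfWidth);

% Same standard deviation, different (hypothetical) sample sizes:
nList = [1e2 1e4 1e6];
halfWidths = z*s./sqrt(nList);   % shrinks like 1/sqrt(n)
disp([nList; halfWidths])
```

The standard deviation describes the spread of individual measurements and stays near 2036 however many samples are taken. The half-width describes the uncertainty of the mean and falls by a factor of 10 for every 100-fold increase in n.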
Sepp
2014-10-25
Edited: Sepp
2014-10-25
OK, but why is the standard deviation then twice as high as the mean? Do the standard deviation and the confidence interval not correlate? If I look only at the standard deviation, it tells me that the measurement is very bad (inaccurate), but if I look only at the confidence interval, it tells me that the measurement is very good (accurate).
Star Strider
2014-10-25
The standard deviation tells you the dispersion of your data. The confidence interval on the mean is calculated from the standard deviation, so in that sense they definitely correlate. However the confidence interval on the mean is an estimate of the dispersion of the true population mean, and since you are usually comparing means of two or more populations to see if they are different, or to see if the mean of one population is different from zero (or some other constant), that is appropriate.
A standard deviation twice the mean indicates that the data can go negative a large part of the time (about 27% based on my normcdf calculation). If they cannot — if they are always positive — then the normal distribution is not appropriate and you have to use an alternate distribution, depending on the nature of your data and the distribution that best describes it (lognormal for instance).
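The "about 27%" figure can be checked without the Statistics Toolbox. This sketch assumes the mean and standard deviation quoted earlier in the thread and uses the identity between the normal CDF and erfc (base MATLAB) in place of normcdf:

```matlab
mu    = 1246.88844;    % mean(x) reported above
sigma = 2036.04818;    % std(x) reported above

% P(X < 0) for X ~ N(mu, sigma^2); equivalent to normcdf(0, mu, sigma)
pNegative = 0.5*erfc((mu/sigma)/sqrt(2));
fprintf('P(X < 0) under a normal model: %.3f\n', pNegative);
```

If the measured quantity cannot be negative, a value this large is a sign that the normal model fits the raw data poorly.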
rihab
2015-8-20
Hi, I am facing the same problem. Could you please share MATLAB code to calculate the t-probability and its inverse? (I don't have the Statistics Toolbox.)
Adam Danz
2019-8-21
Edited: Adam Danz
2022-7-12
Here's an anonymous function based on Star Strider's answer. It uses tinv() which means the stats toolbox is required. This function also uses "omitnan" flags so that NaN values are ignored which requires r2016a or later. Note that the t-distribution method assumes the data form an approximately normal distribution but this can be fairly robust to skewed data.
% x is a vector, matrix, or any numeric array of data. NaNs are ignored.
% p is the confidence level (i.e., 95 for a 95% CI)
% The output is 1x2 vector showing the [lower,upper] interval values.
CIFcn = @(x,p)std(x(:),'omitnan')/sqrt(sum(~isnan(x(:)))) * tinv(abs([0,1]-(1-p/100)/2),sum(~isnan(x(:)))-1) + mean(x(:),'omitnan');
Alternatively, you could compute CI of the mean using bootstrapping along with the percentile method. This approach does not assume a normal distribution and is more robust than the t-distribution method.
Here's a demo comparing both methods to show a small difference in CI.
Generate skewed data
rng('default')
x = raylrnd(5,[1,2000]); % requires stats & machine learning toolbox
Compute CI using the t-distribution method
CIFcn = @(x,p)std(x(:),'omitnan')/sqrt(sum(~isnan(x(:)))) * tinv(abs([0,1]-(1-p/100)/2),sum(~isnan(x(:)))-1) + mean(x(:),'omitnan');
p = 95;
CItdist = CIFcn(x,p)
CItdist = 1×2
6.1236 6.4105
Compute CI using bootstrapping & percentile method
bootci requires the stats & machine learning toolbox. However, this is fairly easy to compute without the bootci function. Simply create a for-loop with n iterations for n bootstraps (I've chosen 1000 here). In each iteration of the for-loop, sample your data with replacement (use the randi function) and store the mean of the resampled data. After you have n means, compute the 95% CI of the means using prctile.
Note, I would not use mean as the statistic for a non-normal distribution. The median would be a much better approach.
[CIbsMean, CImeans] = bootci(1000, {@mean, x}, 'type','per','alpha', 0.05);
disp(CIbsMean')
6.1355 6.4138
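The for-loop alternative described above can be sketched as follows. This is an illustration under stated assumptions, not the bootci algorithm itself: the data are a hypothetical skewed sample generated with base MATLAB only, and the percentiles are read off a sorted vector by index (a simplification of what prctile does) so that no toolbox function is needed.

```matlab
rng(0)                                  % fixed seed so the sketch is repeatable
x = -5*log(rand(1,2000));               % hypothetical skewed data (exponential, mean 5)

nBoot = 1000;                           % number of bootstrap resamples
n = numel(x);
bootMeans = zeros(nBoot,1);             % preallocate
for k = 1:nBoot
    idx = randi(n, 1, n);               % sample indices with replacement
    bootMeans(k) = mean(x(idx));        % statistic of the resampled data
end

% Percentile method: take the 2.5th and 97.5th percentiles of the bootstrap means.
sortedMeans = sort(bootMeans);
loIdx = max(1, round(0.025*nBoot));     % index of roughly the 2.5th percentile
hiIdx = min(nBoot, round(0.975*nBoot)); % index of roughly the 97.5th percentile
CIboot = [sortedMeans(loIdx), sortedMeans(hiIdx)];
disp(CIboot)
```

Swapping mean for median inside the loop gives a bootstrap CI of the median, in line with the note above about non-normal data.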
Plot the results.
The first axes show the distribution of the raw data and both sets of CIs. You can see that they are so close that they nearly overlap. The second axes show the same sets of CIs, magnified to reveal the difference. The last axes show the distribution of the means from the bootstrap. Notice that they are approximately normally distributed even though the underlying data are not. Herein lies the magic of bootstrapping with the percentile method. Thanks to the central limit theorem, the distribution of bootstrapped means will be approximately normal for reasonably large samples, whatever the underlying distribution of the raw data (provided its variance is finite).
figure()
tiledlayout(3,1,'TileSpacing','Compact');
nexttile
histogram(x)
x1 = xline(CItdist,'k:','LineWidth',1,'DisplayName','tinv');
x2 = xline(CIbsMean,'m--','LineWidth',1,'DisplayName','BootMean');
x3 = xline(mean(x),'k-','DisplayName','mean');
legend([x1(1),x2(1),x3],'Location','EastOutside')
title('CI and underlying data')
nexttile
x1 = xline(CItdist,'k:','LineWidth',1,'DisplayName','tinv');
x2 = xline(CIbsMean,'m--','LineWidth',1,'DisplayName','BootMean');
x3 = xline(mean(x),'k-','DisplayName','mean');
legend([x1(1),x2(1),x3],'Location','EastOutside')
title('CIs')
box on
nexttile
histogram(CImeans)
xline(CIbsMean,'m--','LineWidth',1,'DisplayName','BootMean');
title('Bootstrapped means')
Lastly, for people looking for the interval that contains the central p% of the data themselves, rather than a confidence interval on the mean of the distribution, you can simply use the prctile function:
% x is a vector, matrix, or any numeric array of data. NaNs are ignored.
% p is the confidence level (ie, 95 for 95% CI)
% The output is 1x2 vector showing the [lower,upper] interval values.
CIFcn = @(x,p)prctile(x,abs([0,100]-(100-p)/2));
Demo:
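figure
x = pearsrnd(0,1,1,4,100,1); % skewed sample; requires the stats & machine learning toolbox
histogram(x);
% CI of the mean, using the t-distribution function defined earlier in this comment
CIMeanFcn = @(x,p)std(x(:),'omitnan')/sqrt(sum(~isnan(x(:)))) * tinv(abs([0,1]-(1-p/100)/2),sum(~isnan(x(:)))-1) + mean(x(:),'omitnan');
CItdist = CIMeanFcn(x,95);
xline(CItdist,'k:','LineWidth',1,'DisplayName','CI Mean')
% Percentile interval of the data themselves
CIFcn = @(x,p)prctile(x,abs([0,100]-(100-p)/2));
CIDist = CIFcn(x,95);
xline(CIDist,'m--','LineWidth',1,'DisplayName','CI Distribution')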
Phi Phan
2022-2-18
Hi @Star Strider, thank you for the useful comments. I'd like to calculate the CI, but I do not have the Statistics Toolbox. It would be a really great help if you could share the code you offered to provide. Thank you very much in advance!
Star Strider
2022-2-18
Phi Phan —
Sure!
I am posting them as commented code in order to include the ‘documentation’ for them (such as it is), however it is only necessary to remove the comments in the appropriate lines (anonymous functions) to use the functions —
% % % % % T-DISTRIBUTIONS —
% % Variables:
% % t: t-statistic
% % v: degrees of freedom
%
% tdist2T = @(t,v) (1-betainc(v/(v+t^2),v/2,0.5)); % 2-tailed t-distribution
% tdist1T = @(t,v) 1-(1-tdist2T(t,v))/2; % 1-tailed t-distribution
%
% % This calculates the inverse t-distribution (the t-statistic given the
% % probability ‘alpha’ and degrees of freedom ‘v’):
% t_inv = @(alpha,v) fzero(@(tval) (max(alpha,(1-alpha)) - tdist1T(tval,v)), 5); % T-Statistic Given Probability ‘alpha’ & Degrees-Of-Freedom ‘v’
These use only basic MATLAB functions. Another option for ‘t_inv’ could be interp1, however I have never used it in this context.
.
Ishmaal Erekson
2022-7-12
@Adam Danz, in your example provided 21 Aug 2019, you compare the tinv method to the prctile method, stating that the prctile method is more robust because it doesn't assume a normal distribution. I am somewhat new to these statistical methods, so please correct me if I am wrong, but aren't those methods calculating two different things? The tinv method you use provides the confidence interval of the mean and as explained by Star Strider, will decrease with increased number of samples. The prctile method you use will simply tell you the bounds in which lies 95% of your samples -- assuming sufficient samples are taken, the confidence intervals using this method will not change much as the number of samples increases.
Is there a way to calculate the confidence interval on the mean (such as is done in the tinv method), but without assuming a normal distribution?
Adam Danz
2022-7-12
Thanks for the comment, @Ishmaal Erekson. You're correct. My 2019 comment was misleading and I'll update it to avoid confusion.
> Is there a way to calculate the confidence interval on the mean (such as is done in the tinv method), but without assuming a normal distribution?
Yes. I'll update my comment in a moment to include this.
Agata Oskroba
2022-11-26
@Star Strider could you explain how to get the t-score without the toolbox?
Star Strider
2022-11-26
Try this —
% Variables:
% t: t-statistic
% v: degrees of freedom
tdist2T = @(t,v) (1-betainc(v/(v+t^2),v/2,0.5)); % 2-tailed t-distribution
tdist1T = @(t,v) 1-(1-tdist2T(t,v))/2; % 1-tailed t-distribution
% This calculates the inverse t-distribution (the t-statistic given the
% probability ‘alpha’ and degrees of freedom ‘v’):
t_inv = @(alpha,v) fzero(@(tval) (max(alpha,(1-alpha)) - tdist1T(tval,v)), 5); % T-Statistic Given Probability ‘alpha’ & Degrees-Of-Freedom ‘v’
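As a usage sketch, the functions above can replace tinv in the accepted answer's calculation. The test data and the seed are assumptions chosen for illustration. Note that t_inv as written returns the positive critical value only, so both bounds are built with [-1 1]:

```matlab
% Toolbox-free t-distribution helpers, as posted above
tdist2T = @(t,v) (1-betainc(v/(v+t^2),v/2,0.5));      % 2-tailed t-distribution
tdist1T = @(t,v) 1-(1-tdist2T(t,v))/2;                % 1-tailed t-distribution
t_inv = @(alpha,v) fzero(@(tval) (max(alpha,(1-alpha)) - tdist1T(tval,v)), 5);

rng(0)                                 % fixed seed so the sketch is repeatable
x = randi(50, 1, 100);                 % same kind of test data as the accepted answer
n = numel(x);
SEM = std(x)/sqrt(n);                  % standard error of the mean
tcrit = t_inv(0.975, n-1);             % positive two-tailed 95% critical value
CI = mean(x) + [-1 1]*tcrit*SEM;       % [lower, upper] 95% CI of the mean

tCheck = t_inv(0.975, 9);              % textbook tables list about 2.262 for 9 dof
fprintf('CI = [%.3f, %.3f], t(0.975, 9) = %.4f\n', CI(1), CI(2), tCheck);
```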
.
Niraj Desai
2023-8-25
@Star Strider Thank you so much for your answers (over the course of eight years !!!) I realize this thread started in 2014, but I only found it today. It clarified something that I had been confused about. I'm grateful.
More Answers (0)