Which Anova test and how to use it?

Question

Franck paulin Ludovig pehn Mayo 2022-7-12

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1758000-which-anova-test-and-how-to-use-it

评论： Franck paulin Ludovig pehn Mayo 2022-7-20

Good afernoon everyone,

I would like to use an anova test but i unfortunately does not know which one to use.

I have attached an excel file with the datas.

For instance,

I would like to know the relevance when Thickness and orientation are involved. These are the data of 9 individuals with 5 repetitions.

The correct/not represent whether the participants have found the correct answer or not. correct =1 and Not =0

28 个评论
显示 26更早的评论隐藏 26更早的评论

Adam Danz 2022-7-12

I assume the "recognition rate" is the same as accuracy which is indicated by a "1" in the correct/not column. Let me rephrase your goal with how I interpret it and you can let me know if my interpretation is incorrect.

You've got two independent variables (thickness and orientation). Thickness is on a continuous scale and has 6 levels, orientation is categorical and has 3 levels (horz, vert, control). You've got 1 independent variable which is binary, true/false, that describes some kind of decision so true (1) means correct and false (0) mean not-correct.

There are 9 participants with 5 reps and 18 conditions (6*3) which would result in 810 data points (rows of table) if all participants repeated all conditions 5 times but I only see 721 rows of data.

I still don't know the research question that motivated this design so I can only guess at the null hypothesis. In general, the order of events in a research project is

define the question (sometimes the hardest part)
define the null hypothesis given the question
decide on methodology given the quesiton and null hyp.
collect data
analyze and interpret

For example, perhaps thickness or orientaiton is the main variable under question while the other one is a control condition that is not expected to have an effect. Or perhaps you're wondering whether horizontal and vertical orientation statistically differ from the control orientation condition. Another question might involve individual differences between participants. Each of those may have completely different statistical tests.

It sounds like you want to know if there is a statistical difference between some groups and, given the groups are similar, the difference might be small, but you need to find out if the small differences is significant. If you provide more detail on the question you're asking (and the null hypothesis would be nice to know, too), I could help further.

Adam Danz 2022-7-12

在 MATLAB Online 中打开

I don't know how you made this bar plot or how you computed the means. The only thickness that nests with horizontal and vertical orientations is 0.4. You mentioned in a previous comment that some thickness values could be combined but that explanation was confusing. For example, you mentioned 0.02=0.04 but those are treated as separate conditions and the data in your plots how that they have different summary values.

T = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1062725/Anovan_Ptestdata.xlsx','VariableNamingRule','preserve');
Tc = groupcounts(T,["thicknesss","orientation"])
Tc = 7×4 table
    thicknesss     orientation      GroupCount    Percent
    __________    ______________    __________    _______

          0       {'Control'   }       180           25  
       0.02       {'vertical'  }        90         12.5  
       0.03       {'vertical'  }        90         12.5  
       0.04       {'horizontal'}        90         12.5  
       0.04       {'vertical'  }        90         12.5  
       0.06       {'horizontal'}        90         12.5  
       0.08       {'horizontal'}        90         12.5  

Franck paulin Ludovig pehn Mayo 2022-7-18

编辑：Franck paulin Ludovig pehn Mayo 2022-7-18

@Scott MacKenzie for you to have an idea of my data let me explain the aim of the experiment. I designed a ring with haptic motors embedded in it. The motors will be running with just two amplitudes (Level 1 and Level 2 which are the lowest). On a touschreen, each participant (blindfolded) will have to find out the line (vertical and horizontal orientaions). The lines have different thicknesses 0.02 , 0.03 and 0.04 and 0(control) . The latter corresponds to blank , nothing is displayed.

So i was aiming to find the impact of amplitudes, thicknesses and orientation in the recognition rate. There were 5 repetitions as i said before but everything was ramdomly done. To find the start and the end of each partcipant is easy but to dertermine the repetitions rank or order is likely impossible.

To draw the graphs, I used the MEAN of all participants and seeing the graphs there were not a tremendous difference among the bars that's why i wanted to go for an anova test to check on the variance. I came accross the Anova, T-test just like a week ago.

Adam Danz 2022-7-18

@Franck paulin Ludovig pehn Mayo, your question is about applying a statistic to the data but the majority of this thread is back-and-forth questions trying to understand your data. It's really confusing to say that some conditions are actually other conditions. This thread currently has 267 views and almost 30 comments since it was posted 6 days ago which suggests a lot of time has been put into this. It shouldn't be this difficult to explain 12 data points (the number of bars in your figure).

I want to see you succeed in this goal so please let me gives some advice.

In the future, it would benefit you to spend some time cleaning up the data so it's very easy to explain and understand before you ask the question. Also, whenever you generate a plot, provide the code so we don't have to figure out what you're doing. That adds additional tasks we must figure out before we even get to your question. It looks like those bar plots were done outside of MATLAB but taking the time to figure out how to do it in MATLAB so you can ask a clearer question would help out a lot.

Franck paulin Ludovig pehn Mayo 2022-7-20

@Adam Danz okay i understand. Just came across this statitics methods like a week ago, trying to understand what is what and which one can suits my issue.

The few info i got seems to lead me towards confidence interval.

Adam Danz 2022-7-20

编辑：Adam Danz 2022-7-20

I came across these statistical methods 15 years ago and am still trying to understand which ones suit different sets of data and questions. It wasn't until about 5 years ago that I realized my long-tem confusion wasn't a problem with my understanding -- it's a problem in the field of statistics in general. So many peer-reviewed articles apply statistics incorrectly or do not show that the data are fit for the selected statistics. Worse yet, some people keep applying different statistics until they get the results they want which is p-hacking. Three years ago hundreds of scientists and statisticians around the globe supported a movement to change how we think about and practice statistics (see list of articles at the bottom of this answer). What's nice about bootstrapped CIs is that they can be used to visualize how closely related are two distributions rather than just providing a number such as p<0.005.

I'm not swaying you away from using an ANOVA method - but I am arguing that the movement mentioned is a big step forward in statistics.

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Adam Danz 2022-7-13

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1758000-which-anova-test-and-how-to-use-it#answer_1006865

在 MATLAB Online 中打开

I recommend using bootstrapped confidence intervals. The idea is to resample your accuracy data with replacement and compute the mean on the sample for each condition. If you repeat this many times (1000, for example), you'll have a distribution of means which can be used to compute the middle 95% interval. Fortunately MATLAB has a function that does most of the work: bootci which is demo'd in this comment. After you have the CIs for each condition, you can plot them using errorbar. If the CIs do not overlap between two conditions, it is likely that the data from those condtions come from different distributions.

Here's a demo that performs bootstrapped CIs for a single condition in your data. I would set up the loop to compute CIs for all conditions but I still do not understand which conditions to compare since the data do not appear to be nested. Perhaps if the 'thickness' values were corrected in some way, it would be clearer. But first you give it a shot.

T = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1062725/Anovan_Ptestdata.xlsx','VariableNamingRule','preserve');

thickIdx = T.thicknesss == 0.04;

orientIdx = strcmp(T.orientation, 'vertical');

CI = bootci(1000, {@mean, T.("correct/not")(thickIdx & orientIdx)}, 'Type', 'per')

CI = 2×1

0.7667 0.9111

mu = mean(T.("correct/not")(thickIdx & orientIdx));

bar(mu)

hold on

errorbar(1, mu, mu-CI(1), mu-CI(2), 'k-','LineWidth',1)

14 个评论
显示 12更早的评论隐藏 12更早的评论

Scott MacKenzie 2022-7-17

@Franck paulin Ludovig pehn Mayo, I'm just seeing your question now. There are comments and an answer from @Adam Danz, so perhaps we're done here. However, let me add a comment.

To me, the most informative part of your question is the grouped bar chart. It shows the relationship between two independent variables (x-axis) and a dependent variable (y-axis). The independent variables are orientation with 3 levels (horizontal, vertical, control) and thickness with 4 levels (0.02, 0.03, 0.04, and control). The dependent variable is recognition rate (%). This looks appropriate for an analysis of variance. And you are not alone in wondering how to do this in MATLAB: There are at least 5 MATLAB anova functions! The anova will help answer three questions:

Is there a significant effect of orientation on recognition rate?
Is there a significant effect of thickness on recognition rate?
Is there a significant Orientation x Thickness interaction effect on recognition rate?

This can be setup fairly easily in MATLAB, but, first, there are some issues that need to be clarified. The experiment engaged nine participants ("individuals" in the question) with five repetitions of the measurements for each participant on each condition. But, there is no column in the data set indicating which rows correspond to which participants. Ditto for repetition. Can you add columns for the participant codes and repetition numbers?

Also, I assume "0" in the thickness column corresponds to the "control" level for thickness, but please confirm.

Finally, note that there is a small labelling error in the bar chart. The x-axis label corresponds to the bar groups. This should be "Thickness", not "Orientation". Orientation appears via the bars within groups. So, if you wish to include "Orientation" in the chart, it should appear as the title for the legend entries.

Franck paulin Ludovig pehn Mayo 2022-7-17

Anovan_Ptestdata_k.xlsx

@Scott MacKenzie

"The experiment engaged nine participants ("individuals" in the question) with five repetitions of the measurements for each participant on each condition. But, there is no column in the data set indicating which rows correspond to which participants. Ditto for repetition"

I thought the particpant's information were not needed. Secondly, concerning the partcipants, after each 80 rows comes a new participant. (I have attached a file; A = first partcipant ,B= second...)

ex: From row 2 to row 81 , First partcipant

From row 82 to row 161 , second particpants and so on...

Concerning the repetitions, it will be impossible actually to detect since the experiment was ramdomly done each time .

ex : so from row 2 to row 81 , there are 5 repetitions within, but i cannot tell you the order.

So what i did , i copied all the results from each participants and i just pasted them in excel.

Yes, indeed the labbeling in the error bar is wrong thank you for me letting me know, indeed it is thickness instead of orientation.

The fact is that , there are lot of parameters that i will have to check the "relevance" such as:

Thickness - Orientation (Recognition Rate)
Amplitude - Orientation (Recognition Rate)
Thickness- amplitude (Recognition Rate)
Recognition Time when amplitude is involved
Recognition Time when Thickness is involved.

Yes "0" thickness corresponds to the Control

Franck paulin Ludovig pehn Mayo 2022-7-19

编辑：Franck paulin Ludovig pehn Mayo 2022-7-19

在 MATLAB Online 中打开

@Adam Danz i did something but i dont know quite how to interpret the results . I did it based on your previous works

T = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1068365/Newfile.xlsx');

thickIdx = T.Var2 == 0.04;

orientIdx = strcmp(T.Var4, 'vertical');

data = T.("Var5")(thickIdx & orientIdx);

%number of bootstapps

nBoot = 1000;

[bci,bmeans] = bootci(nBoot, {@mean,data}, 'Type', 'per');

% bootstrap sample mean

bmu = mean(bmeans);

%mu = mean(data);

% Now repeat that process with lower-level bootstrapping

% using the same sampling proceedure and the same data.

bootMeans = nan(1,nBoot);

for i = 1:nBoot

bootMeans(i) = mean(data(randi(numel(data),size(data))));

end

CI = prctile(bootMeans,[5,95]);

mu = mean(bootMeans);

% Plot

figure()

ax1 = subplot(2,1,1);

histogram(bmeans);

hold on

xline(bmu, 'k-', sprintf('mu = %.2f',bmu),'LineWidth',2)

xline(bci(1),'k-',sprintf('%.1f',bci(1)),'LineWidth',2)

xline(bci(2),'k-',sprintf('%.1f',bci(2)),'LineWidth',2)

title('bootci()')

% plot the lower-level, direct computation results

ax2 = subplot(2,1,2);

histogram(bootMeans);

hold on

xline(mu, 'k-', sprintf('mu = %.2f',mu),'LineWidth',2)

xline(CI(1),'k-',sprintf('%.1f',CI(1)),'LineWidth',2)

xline(CI(2),'k-',sprintf('%.1f',CI(2)),'LineWidth',2)

title('Lower level')

linkaxes([ax1,ax2], 'xy')

% bar(bmu)

% hold on

% errorbar(1, mu, mu-CI(1), mu-CI(2), 'k-','LineWidth',1)

Adam Danz 2022-7-19

在 MATLAB Online 中打开

I'll break down your code below.

Here, you're looking at data in column "Var5" of your table from rows that from conditions Var2==0.4 and Var4="vertical".

T = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1068365/Newfile.xlsx');
thickIdx = T.Var2 == 0.04;
orientIdx = strcmp(T.Var4, 'vertical'); 
data =  T.("Var5")(thickIdx & orientIdx)
data = 90×1
     0
     0
     1
     1
     1
     1
     0
     1
     1
     1

Then you're bootstrapping the mean from that selection of data. "bci" is the 95% confidence interval (CI) of the mean and "bmeans" are the 1000 bootstrapped means. See bootci for details.

%number of bootstapps
nBoot = 1000;
[bci,bmeans] = bootci(nBoot, {@mean,data}, 'Type', 'per')
bci = 2×1
7667
9111
bmeans = 1000×1
8444
8889
8889
8556
9000
8556
8111
8556
8556
8111

I don't know why you want the mean of the bootstrapped means. Maybe you have good reason for this. The line you commented out computes the mean of the raw data.

% bootstrap sample mean
bmu = mean(bmeans); 
%mu = mean(data);

I'm not sure what "lower level bootstrapping" is. Is that a term I used somewhere in another thread? The for-loop merely implements the same type of bootstrapping that the bootci function does above. I was probably comparing the bootci functionality to another method of directly implementing bootstrapping (still not sure where you saw this but it does look like mine). The randi function resamples the data with replacement which is important to do in bootstrapping. Then the prctile line computes the CIs with the percentile method in the same way bootci does when type='per'.

% Now repeat that process with lower-level bootstrapping
% using the same sampling proceedure and the same data.
bootMeans = nan(1,nBoot); 
for i = 1:nBoot
    bootMeans(i) = mean(data(randi(numel(data),size(data)))); 
end
CI = prctile(bootMeans,[5,95]); 
mu = mean(bootMeans); 

This part plots the distribution of bootstrapped means from bootci

% Plot
figure()
ax1 = subplot(2,1,1);
histogram(bmeans); 

this adds the mean of the bootstrap means. Maybe you want to show the mean of the data instead.

hold on
xline(bmu, 'k-', sprintf('mu = %.2f',bmu),'LineWidth',2)

Here you add the bootci CIs

xline(bci(1),'k-',sprintf('%.1f',bci(1)),'LineWidth',2)
xline(bci(2),'k-',sprintf('%.1f',bci(2)),'LineWidth',2)
title('bootci()')

Then you repeat with the lower level bootstrapping method which unsurprisingly has the same results.

% plot the lower-level, direct computation results

ax2 = subplot(2,1,2);

histogram(bootMeans);

hold on

xline(mu, 'k-', sprintf('mu = %.2f',mu),'LineWidth',2)

xline(CI(1),'k-',sprintf('%.1f',CI(1)),'LineWidth',2)

xline(CI(2),'k-',sprintf('%.1f',CI(2)),'LineWidth',2)

title('Lower level')

linkaxes([ax1,ax2], 'xy')

% bar(bmu)

% hold on

% errorbar(1, mu, mu-CI(1), mu-CI(2), 'k-','LineWidth',1)

But this isn't what your initial goal is. This is useful to compute the CIs (use one method or the other, no need to do both). Your initial goal is to compute the CI, not to plot the distributions and such.

Once you have the CIs for each condition, you can add them to your bar plot using the errorbar function.

Franck paulin Ludovig pehn Mayo 2022-7-20

编辑：Franck paulin Ludovig pehn Mayo 2022-7-20

在 MATLAB Online 中打开

@Adam Danz i wanted to do a loop for all the conditions but i was just having mistakes. so i went like "manually".

Also, concerning the graphs how can i have the error bars to be more distincts. Also , concerning the "o" thickness, i couldnt implement it . Also, how can i add "horizontal"?

was having this:

"BOOTFUN returns a NaN or Inf." , I guess it is because the mean cannot be computed with 0. Is there anyway , i can sort it out ?

T = readtable('Newfile.xlsx');
%Var2 = thickness
%thickIdx1 = T.Var2 == 0;
thickIdx2 = T.Var2 == 0.02;
thickIdx3 = T.Var2 == 0.03;
thickIdx4 = T.Var2 == 0.04;
%Var4= orientation
%orientIdx1 = strcmp(T.Var4, 'vertical'); 
orientIdx2 = strcmp(T.Var4, 'vertical'); 
orientIdx3 = strcmp(T.Var4, 'vertical'); 
orientIdx4 = strcmp(T.Var4, 'vertical'); 
%var5= correct/not
%data1 =  T.("Var5")(thickIdx1 & orientIdx1);
data2 =  T.("Var5")(thickIdx2 & orientIdx2);
data3 =  T.("Var5")(thickIdx3 & orientIdx3);
data4 =  T.("Var5")(thickIdx4 & orientIdx4);
%number of bootstapps
nBoot = 1000;
%CI1 = bootci(nBoot, {@mean,data1}, 'Type', 'per');
CI2 = bootci(nBoot, {@mean,data2}, 'Type', 'per')
CI3 = bootci(nBoot, {@mean,data3}, 'Type', 'per')
CI4 = bootci(nBoot, {@mean,data4}, 'Type', 'per')
% mu1 = mean(data1);
mu2 = mean(data2);
mu3 = mean(data3);
mu4 = mean(data4);
% bar(mu1)
bar(mu2)
bar(mu3)
bar(mu4)
hold on
% errorbar(1, mu1, mu1-CI1(1), mu2-CI2(2), 'k-','LineWidth',1)
errorbar(1, mu2, mu2-CI2(1), mu2-CI2(2), 'k-','LineWidth',1)
errorbar(1, mu3, mu3-CI3(1), mu3-CI3(2), 'k-','LineWidth',1)
errorbar(1, mu4, mu4-CI4(1), mu4-CI4(2), 'k-','LineWidth',1)

Adam Danz 2022-7-20

To "fix it" can be hairy.

Sometimes an insuffient amount of data is collected such that the sample of data does not reflect the unobservable full population of data. For example, if I'm calling people randomly to ask what their favorite ice cream is, maybe I accidentally called a disproportionaly high number of lactose intolerance people. In that case, then yes, collecting more data can reveal a more accurate picture of the population.

But if your sample of data already relfects the population, collecting more data will not change the outcome.

Most importantly, the amount of data you collect should not be decided from the resultant statistic. In other words, you should decide how much data to collect independtly from the results. Otherwise, that p-hacking and it's really bad science.

If your data reflect the underlying population, and if your bars overlap, then that's the result, that's the answer to your quesiton, that's reality. In that case, you cannot conclude that these two populations of means come from different distributions.

I did a study for 4 years and had those unexpected results - that two groups did not differ even though everyone expected them to differ. This is an opportunity to investigate why. Maybe previous studies had a different methodology or maybe the model should be viewed differently.

About comparing different conditions, all you have to do is change your indexing.

BTW I just noticed that your variables orientIdx2 orientIdx3 orientIdx4 are all the same thing. You only need one of those. Take some time to understand what these lines are doing.

Adam Danz 2022-7-20

This is more of an art form than a science. There are lots of bits of advice out there to know when enough is enough. It's been obvoius to me when I don't have enough data but less obvious when I've collected enough. I have use cross validation to help make that decision. The main idea is, if I remove something like 10-20% of my data and get approximately the same results, then I have enough data.
It wouldn't be surprising if the CIs differ by a very small amount between runs. bootci uses and random selection of your data so the results can differ by a very small amount. If you're getting noticable different results between runs, someting is wrong. Either you're not runing enough boot straps (1000 should be enough but you could try more) or you're not providing the same exact input data between runs. This is definitely something you want to investigate.
I still don't understand your dataset enough to imagine this comparison. If any given data point has a thickness property and an orientation property and you want to know whether thickness or orientation has a stronger effect, then I don't think you can do that with this bootstrapping method which makes me fear that this entire multiple-day thread has nothing to do with your actual goals. The main lesson, if this is the case, is that the data and the goals must be crystal clear to you and to the readers before a useful answer can be written.

I realized you previously asked about NaNs in your bootci results but I forgot to address that question. By default, mean does not ignore NaNs and if there is a NaN in the data, the mean will be NaN. You want to omit nans using

___ = bootci(nBoot, {@(x)mean(x,'omitnan'),data}, 'Type', 'per')

That's all the time I have for this thread @Franck paulin Ludovig pehn Mayo. I hope these ideas will be helpful to you even if you don't end up needing them.

Franck paulin Ludovig pehn Mayo 2022-7-20

在 MATLAB Online 中打开

@Adam Danz Thank you very much , i have grasped the concept. I have an idea how i will go from here.

The last input i would like to know is to fix the Nan . i have implemented it but unfortunately i am still having the same error.

BOOTFUN returns a NaN or Inf.

T = readtable('Newfile.xlsx');
%Var2 = thickness
thickIdx1 = T.Var2 == 0;
thickIdx2 = T.Var2 == 0.02;
thickIdx3 = T.Var2 == 0.03;
thickIdx4 = T.Var2 == 0.04;
%Var4= orientation 
orientIdx = strcmp(T.Var4, 'vertical');  
%var5= correct/not
data1 =  T.("Var5")(thickIdx1 & orientIdx);
data2 =  T.("Var5")(thickIdx2 & orientIdx);
data3 =  T.("Var5")(thickIdx3 & orientIdx);
data4 =  T.("Var5")(thickIdx4 & orientIdx);
%number of bootstapps
nBoot = 1000;
CI1 =bootci(nBoot, {@(x)mean(x,'omitnan'),data1}, 'Type', 'per')
CI2 = bootci(nBoot, {@mean,data2}, 'Type', 'per')
CI3 = bootci(nBoot, {@mean,data3}, 'Type', 'per')
CI4 = bootci(nBoot, {@mean,data4}, 'Type', 'per')
mu1 = mean(data1);
mu2 = mean(data2);
mu3 = mean(data3);
mu4 = mean(data4);
bar(mu1)
bar(mu2)
bar(mu3)
bar(mu4)
hold on
bar([1 2 3 4])
hold on
errorbar([1 2 3 4], 1:4, rand(1,4), rand(1,4),'k-','LineStyle','none','LineWidth',1)

请先登录，再进行评论。

Which Anova test and how to use it?

28 个评论
显示 26更早的评论隐藏 26更早的评论

回答（1 个）

14 个评论
显示 12更早的评论隐藏 12更早的评论

另请参阅

类别

标签

Community Treasure Hunt

Which Anova test and how to use it?

28 个评论 显示 26更早的评论隐藏 26更早的评论

回答（1 个）

14 个评论 显示 12更早的评论隐藏 12更早的评论

另请参阅

类别

标签

Community Treasure Hunt

28 个评论
显示 26更早的评论隐藏 26更早的评论

14 个评论
显示 12更早的评论隐藏 12更早的评论