All Pairwise Comparisons Among Means
David M. Lane
Prerequisites: Difference Between Two Means (Independent Groups)
- Define pairwise comparison
- Describe the problem with doing t tests among all pairs of means
- Calculate the Tukey HSD test
- Explain why the Tukey test should not necessarily be considered a follow-up test
Many experiments are designed to compare more
than two conditions. We will take as an example the case study
"Smiles and Leniency."
In this study, the effect of different types of smiles on the
leniency shown to a person was investigated. An obvious way
to proceed would be to do a t test of the difference between each
group mean and each of the other group means. This procedure would lead
to the six comparisons shown in Table 1.
The problem with this approach is that if you
did this analysis, you would have six chances to make a Type
I error. Therefore, if you were using the 0.05 significance level,
the probability that you would make a Type I error on at least
one of these comparisons is greater than 0.05. The more means
that are compared, the more the Type I error rate is inflated.
Figure 1 shows the number of possible comparisons between pairs
of means (pairwise comparisons) as a
function of the number of means. If there are only two means,
then only one comparison can be made. If there are 12 means, then
there are 66 possible comparisons.
Figure 1. Number of pairwise comparisons as a function
of the number of means.
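The counts plotted in Figure 1 follow from the fact that k means form k(k - 1)/2 distinct pairs. A minimal sketch in R, where the helper name `n_comparisons` is made up for illustration:

```r
# Number of pairwise comparisons among k means: k * (k - 1) / 2
n_comparisons <- function(k) k * (k - 1) / 2

k <- 2:12
comparisons <- n_comparisons(k)
data.frame(means = k, comparisons = comparisons)

# choose(k, 2) counts the same pairs
stopifnot(all(comparisons == choose(k, 2)))
```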
Figure 2 shows the probability of a Type I error
as a function of the number of means. As you can see, if you have an
experiment with 12 means, the probability is about 0.70 that at least
one of the 66 comparisons among means would be significant even if all
12 population means were the same.
Figure 2. Probability of a Type I error
as a function of the number of means.
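One way to approximate a curve like the one in Figure 2 is sketched below. It rests on assumptions that are not stated in the text: equal group sizes, and pairwise t tests that all use the pooled error term. Under those assumptions at least one t test is significant exactly when the largest difference between means is, so the probability can be read from the studentized range distribution with `ptukey`. The function name `familywise_rate` and the default of 34 scores per group (borrowed from the leniency example) are illustrative choices.

```r
# Probability that at least one pairwise t test is significant when all
# population means are equal (assumes equal n and a pooled error term).
familywise_rate <- function(k, n_per_group = 34, alpha = 0.05) {
  df <- k * (n_per_group - 1)        # error degrees of freedom
  t_crit <- qt(1 - alpha / 2, df)    # two-tailed critical t
  # |t| > t_crit for some pair  <=>  studentized range > sqrt(2) * t_crit
  ptukey(sqrt(2) * t_crit, nmeans = k, df = df, lower.tail = FALSE)
}

k <- 2:12
rates <- sapply(k, familywise_rate)
round(data.frame(means = k, p_type_I = rates), 3)
```

With two means the function returns the nominal 0.05, and the rate climbs as more means are added.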
The Type I error rate can be controlled using a
test called the Tukey Honestly Significant Difference test, or Tukey
HSD for short. The Tukey HSD is based on a variation of the t
distribution that takes into account the number of means being
compared. This distribution is called the studentized range
distribution.
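The link between the two distributions can be checked numerically. The sketch below uses R's `qtukey` and `qt`; the error degrees of freedom (132) are borrowed from the leniency example, and dividing by sqrt(2) to put the studentized range on the t scale is an illustrative choice.

```r
df <- 132        # error df borrowed from the leniency example
alpha <- 0.05

# With two means, the studentized range critical value is sqrt(2) times
# the two-tailed critical t
q2 <- qtukey(1 - alpha, nmeans = 2, df = df)
t2 <- qt(1 - alpha / 2, df)
c(q2 / sqrt(2), t2)

# Critical values on the t scale grow with the number of means
k <- 2:6
crit <- data.frame(
  means = k,
  critical_t_scale = sapply(k, function(m) qtukey(1 - alpha, m, df)) / sqrt(2)
)
crit
```

The larger critical values for more means are what hold the overall Type I error rate at 0.05.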
Let's return to the leniency study to see how to
compute the Tukey HSD test. You will see that the computations
are very similar to those of an independent-groups
t test. The steps are outlined below:
- Compute the means and variances of each group. They are shown
- Compute MSE, which is simply the mean of the variances. It
is equal to 2.65.
- Compute Q = (Mi - Mj) / sqrt(MSE / n)
for each pair of means, where Mi is
one mean, Mj is the other mean, and
n is the number of scores in each group. For these data,
there are 34 observations per group. The value in the denominator
is sqrt(2.65 / 34) = 0.279.
- Compute p for each comparison using the Studentized
Range Calculator. The degrees of freedom is equal to the
total number of observations minus the number of means. For
this experiment, df = 136 - 4 = 132.
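The steps above can be sketched in R. This is an illustration rather than the calculator the text refers to: `ptukey` stands in for the Studentized Range Calculator, the statistic is taken to be the difference between means divided by sqrt(MSE / n), and the differences are copied (as absolute values) from the TukeyHSD output at the end of this section, labeled with the group codes used there.

```r
mse <- 2.65      # mean of the four group variances (from the text)
n   <- 34        # observations per group
k   <- 4         # number of means
df  <- 136 - k   # total observations minus number of means = 132

# Absolute differences between means, copied from the TukeyHSD output
# at the end of the section (labels are the group codes used there)
diffs <- c("2-1" = 0.4558824, "3-1" = 0.4558824, "4-1" = 1.2500000,
           "3-2" = 0.0000000, "4-2" = 0.7941176, "4-3" = 0.7941176)

denominator <- sqrt(mse / n)
Q <- diffs / denominator

# Upper-tail probability of the studentized range for each comparison
p <- sapply(Q, function(q) ptukey(q, nmeans = k, df = df, lower.tail = FALSE))

round(data.frame(diff = diffs, Q = Q, p = p), 3)
```

Only the comparison coded 4-1 falls below 0.05, in line with the TukeyHSD output.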
The tests for these data are shown in Table 2.
Table 2. Six Pairwise Comparisons.
False - Felt
False - Miserable
False - Neutral
Felt - Miserable
Felt - Neutral
Miserable - Neutral
The only significant comparison is between the false smile and
the neutral smile.
It is not unusual to obtain results that on the
surface appear paradoxical. For example, these results appear
to indicate that (a) the false smile is the same as the miserable
smile, (b) the miserable smile is the same as the neutral control,
and (c) the false smile is different from the neutral control.
This apparent contradiction is avoided if you are careful not
to accept the null hypothesis when you fail to reject it. The
finding that the false smile is not significantly different from
the miserable smile does not mean that they are really the same.
Rather it means that there is not convincing evidence that they
are different. Similarly, the non-significant difference between
the miserable smile and the control does not mean that they are
the same. The proper conclusion is that the false smile is higher
than the control and that the miserable smile is either (a) equal
to the false smile, (b) equal to the control, or (c) somewhere
in between.
The assumptions of the Tukey test are essentially
the same as for an independent-groups
t test: normality, homogeneity of variance, and independent
observations. The test is quite robust to violations of normality.
Violating homogeneity of variance can be more problematical than
in the two-sample case since the MSE is based on data from all
groups. The assumption of independence of observations is important
and should not be violated.
For most computer programs, you should format
your data the same way you do for an independent-groups
t test. The only difference is that if you have, say, four
groups, you would code each group as 1, 2, 3, or 4 rather than
just 1 or 2.
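As an illustration of this layout, the sketch below builds a small data frame with one row per observation and a column of group codes. The column names and the scores are made up for the example.

```r
# Hypothetical scores for four groups of three observations each
scores <- data.frame(
  group = rep(1:4, each = 3),                    # group codes 1, 2, 3, 4
  score = c(5, 6, 4, 5, 4, 6, 3, 5, 4, 2, 4, 3)  # made-up values
)

# Most R modeling functions expect the grouping variable to be a factor
scores$group <- factor(scores$group)

head(scores)
table(scores$group)
```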
Although full-featured statistics programs such
as SAS, SPSS, R, and others can compute Tukey's test, smaller
programs (including Analysis Lab) may not. However, these programs
are generally able to compute a procedure known as Analysis
of Variance (ANOVA). This procedure will be described in detail
in a later chapter. Its
relevance here is that an ANOVA computes the MSE that is used
in the calculation of Tukey's test. For example, the following
shows the ANOVA summary table for the "Smiles and Leniency" data.
The column labeled MS stands for
"Mean Square" and therefore the value 2.6489 in the
"Error" row and the MS column is the "Mean Square
Error" or MSE. Recall that this is the same value computed earlier
(2.65) when rounded off.
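In R, the MSE can be read from the ANOVA table along the following lines. The data here are simulated purely for illustration (four groups of 34 scores, as in the example, but with made-up values), so the numbers will not match the summary table for the leniency data.

```r
set.seed(1)
# Hypothetical data: four groups of 34 scores each
sim <- data.frame(
  group = factor(rep(1:4, each = 34)),
  score = rnorm(4 * 34, mean = 5, sd = 1.6)
)

fit <- aov(score ~ group, data = sim)
anova_table <- anova(fit)
anova_table

# MSE is the "Mean Sq" entry in the Residuals (error) row
mse_anova <- anova_table["Residuals", "Mean Sq"]

# With equal group sizes it matches the mean of the group variances
mse_by_hand <- mean(tapply(sim$score, sim$group, var))
c(mse_anova, mse_by_hand)
```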
Tukey's Test Need Not be a Follow-Up
Some textbooks introduce the Tukey
test only as a follow-up to an analysis of variance. There is
no logical or statistical reason why you should not use the Tukey
test even if you do not compute an ANOVA (or even know what one
is). If you or your instructor do not wish to take our word for
this, see the excellent article on this and
other issues in statistical analysis by Leland Wilkinson and
the APA Board of Scientific Affairs' Task Force on Statistical Inference,
published in the American Psychologist, August 1999, Vol. 54,
No. 8, 594–604.
Computations for Unequal Sample Sizes (optional)
The calculation of MSE for unequal sample sizes
is similar to its calculation in an independent-groups
t test. Here are the steps:
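- Compute a Sum of Squares Error (SSE) using the following formula
SSE = Σ(X - M1)^2 + Σ(X - M2)^2 + ... + Σ(X - Mk)^2
where Mi is the mean of the ith group,
k is the number of groups, and each sum is taken over the scores
in that group.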
- Compute the degrees of freedom error (dfe) by subtracting
the number of groups (k) from the total number of observations (N):
dfe = N - k.
- Compute MSE by dividing SSE by dfe:
MSE = SSE/dfe.
- For each comparison of means, use the harmonic mean of the
n's for the two means (nh), where nh = 2 / (1/ni + 1/nj) and ni and
nj are the two group sizes.
All other aspects of the calculations are the
same as when you have equal sample sizes.
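The steps above can be sketched in R as follows. The data are made up (three groups of sizes 4, 6, and 5), only one comparison is shown, and the statistic is taken to be the difference between means divided by sqrt(MSE / nh), with `ptukey` supplying the probability.

```r
# Hypothetical scores for three groups with unequal sample sizes
scores <- data.frame(
  group = factor(rep(c("a", "b", "c"), times = c(4, 6, 5))),
  score = c(5, 7, 6, 8,  4, 5, 3, 6, 4, 5,  2, 4, 3, 3, 5)
)

means <- tapply(scores$score, scores$group, mean)
ns    <- tapply(scores$score, scores$group, length)
k     <- length(means)

# SSE: squared deviations of each score from its own group mean
sse <- sum((scores$score - ave(scores$score, scores$group))^2)
dfe <- nrow(scores) - k          # dfe = N - k
mse <- sse / dfe                 # MSE = SSE / dfe

# One comparison (a vs. b): harmonic mean of the two n's, then Q and p
nh <- 2 / (1 / ns[["a"]] + 1 / ns[["b"]])
Q  <- abs(means[["a"]] - means[["b"]]) / sqrt(mse / nh)
p  <- ptukey(Q, nmeans = k, df = dfe, lower.tail = FALSE)
c(MSE = mse, nh = nh, Q = Q, p = p)
```

For this comparison the result should agree with the adjusted p value that `TukeyHSD(aov(score ~ group, data = scores))` reports for the same pair.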
Make sure to put the data files in the default directory.
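# Read the data; the code below expects columns named smile and leniency
leniency <- read.csv(file = "leniency.CSV")
# Treat the smile codes (1-4) as a factor
leniency.f <- factor(leniency$smile, levels = c("1", "2", "3", "4"))
# Fit the one-way model and run all pairwise Tukey HSD comparisons
leniency_model <- lm(leniency ~ leniency.f, data = leniency)
leniency_aov <- aov(leniency_model)
TukeyHSD(leniency_aov, ordered = FALSE)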

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = leniency_model)

          diff       lwr        upr     p adj
2-1 -0.4558824 -1.483012  0.5712478 0.6562329
3-1 -0.4558824 -1.483012  0.5712478 0.6562329
4-1 -1.2500000 -2.277130 -0.2228699 0.0102192
3-2  0.0000000 -1.027130  1.0271301 1.0000000
4-2 -0.7941176 -1.821248  0.2330125 0.1888804
4-3 -0.7941176 -1.821248  0.2330125 0.1888804