Box Plots
Author(s)
David M. Lane
Prerequisites
Percentiles
Learning Objectives
- Define basic terms including hinges, H-spread, step, adjacent value,
outside value, and far out value
- Create a box plot
- Create parallel box plots
- Determine whether a box plot is appropriate for a given dataset
In this section we present an important graph called a box plot.
Box plots are useful for
identifying outliers and for comparing distributions. We will
explain box plots with the help of data from an in-class experiment.
Students in Introductory Statistics were presented with a page
containing 30 colored rectangles. Their task was to name the
colors as quickly as possible, and their times were recorded.
We'll compare the scores for the 16 men and 31 women who participated
in the experiment by making separate box plots for each gender.
(Such a display is said to involve parallel
box plots.)
There are several steps in constructing a box plot.
The first relies on the 25th, 50th, and
75th percentiles in the distribution of scores.
The bottom of each box is the 25th percentile,
the top is the 75th percentile,
and the line in the middle is the 50th percentile.
Before proceeding, the terminology in Table 1 is
helpful.
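The terms can also be sketched in Python. The sketch assumes the conventional definitions: the hinges are the 25th and 75th percentiles, the H-spread is the distance between the hinges, a step is 1.5 times the H-spread, the inner fences lie one step beyond the hinges, and the outer fences lie two steps beyond. The function name box_plot_terms and the times are hypothetical (they are not the class data), and statistics.quantiles follows one of several percentile conventions, so other software may report slightly different hinges.

```python
from statistics import quantiles


def box_plot_terms(scores):
    """Return hinges, H-spread, step, and fences for a list of scores."""
    # quantiles(..., n=4) returns the 25th, 50th, and 75th percentiles.
    lower_hinge, median, upper_hinge = quantiles(scores, n=4)
    h_spread = upper_hinge - lower_hinge
    step = 1.5 * h_spread
    return {
        "lower_hinge": lower_hinge,
        "median": median,
        "upper_hinge": upper_hinge,
        "h_spread": h_spread,
        "step": step,
        # Inner fences: one step beyond each hinge.
        "inner_fences": (lower_hinge - step, upper_hinge + step),
        # Outer fences: two steps beyond each hinge.
        "outer_fences": (lower_hinge - 2 * step, upper_hinge + 2 * step),
    }


# Hypothetical naming times in seconds, made up for illustration.
times = [13, 14, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 23, 29, 39]
terms = box_plot_terms(times)
for name, value in terms.items():
    print(name, value)
```

For these made-up times the hinges are 16 and 20, so the H-spread is 4 and one step is 6.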
Continuing with the box plots, we put "whiskers" above
and below each box to give additional information about the spread
of data. Whiskers are vertical lines that end in a horizontal
stroke. Whiskers are drawn from the upper and lower hinges to
the upper and lower adjacent values. Although we don't draw whiskers
all the way to outside or far out values, we still wish to represent
them in our box plots. This is achieved by adding additional marks
beyond the whiskers. Specifically, outside values are indicated
by small "o"s, and far out values are indicated by asterisks.
In our data, there are no far out values, and just one outside
value. There is one more mark to include in box plots (although
sometimes it is omitted). We indicate the mean score for a group
by inserting a plus sign. Figure 1 shows the result of adding
means to our box plots.
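The choice of marks can be sketched in Python as follows. This is a rough illustration under assumed conventional definitions: adjacent values are the most extreme scores that are not beyond the inner fences, outside values are beyond an inner fence but not beyond an outer fence, and far out values are beyond an outer fence. The function name box_plot_marks and the times are hypothetical, not the class data.

```python
from statistics import mean, quantiles


def box_plot_marks(scores):
    """Return whisker ends and the values drawn as 'o', '*', and '+'."""
    lower_hinge, _, upper_hinge = quantiles(scores, n=4)
    step = 1.5 * (upper_hinge - lower_hinge)
    inner_low, inner_high = lower_hinge - step, upper_hinge + step
    outer_low, outer_high = lower_hinge - 2 * step, upper_hinge + 2 * step

    # Scores that are not beyond the inner fences; their extremes are
    # the adjacent values, where the whiskers end.
    inside = [x for x in scores if inner_low <= x <= inner_high]
    # Beyond an inner fence but not beyond an outer fence.
    outside = sorted(
        x for x in scores
        if outer_low <= x < inner_low or inner_high < x <= outer_high
    )
    # Beyond an outer fence.
    far_out = sorted(x for x in scores if x < outer_low or x > outer_high)

    return {
        "whisker_ends": (min(inside), max(inside)),
        "o": outside,        # outside values
        "*": far_out,        # far out values
        "+": mean(scores),   # group mean
    }


# Hypothetical naming times in seconds, made up for illustration.
times = [13, 14, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 23, 29, 39]
marks = box_plot_marks(times)
print(marks)
```

With these made-up times the whiskers would end at 13 and 23, the score of 29 would be drawn as an "o", and the score of 39 would be drawn as an asterisk.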
Figure 1 provides a revealing summary of the data.
Since half the scores in a distribution are between the hinges
(recall that the hinges are the 25th and 75th
percentiles), we see that half the women's times are between 17
and 20 whereas half the men's times are between 19 and 25. We
also see that women generally named the colors faster than the
men did, although one woman was slower than almost all of the
men. Figure 2 shows the box plot for the women's data with detailed
labels.
Variations on box plots
Statistical analysis programs may offer options
on how box plots are created. For example, the box plot in Figure
2 is constructed from our data but differs from the previous box
plot in several ways.
- First, it does not mark outliers.
- Second, the means are indicated by green lines rather than
plus signs.
- The mean of all scores is indicated by a grey line.
- Individual scores are represented by dots. Since the scores
have been rounded to the nearest second, any given dot might
represent more than one score.
- The box for the women is wider than the box for the men because
the widths of the boxes are proportional to the number of subjects
of each gender (31 women and 16 men).
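As a small illustration of the last point, box widths proportional to group size could be computed as below. The function name box_widths and the max_width value are assumptions made for this sketch, and a given statistics program may scale widths differently.

```python
def box_widths(group_sizes, max_width=0.8):
    """Scale each box width in proportion to its group size.

    The largest group gets max_width (an arbitrary plotting unit).
    """
    largest = max(group_sizes.values())
    return {group: n / largest * max_width for group, n in group_sizes.items()}


# Group sizes from the experiment: 31 women and 16 men.
widths = box_widths({"women": 31, "men": 16})
print(widths)
```

Under this scheme the men's box is 16/31 as wide as the women's box, a little over half.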
Each dot in Figure 2 represents a group of subjects
with the same score (rounded to the nearest second). An alternative
graphing technique is to jitter the points.
This means drawing one dot for each subject and spreading apart,
horizontally, dots that would otherwise sit on top of one another. The exact
horizontal position of a dot is determined randomly (under the constraint that
different dots don’t overlap). Spreading out the dots
allows you to see multiple occurrences of a given score. Figure
3 shows what jittering looks like.
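Jittering can be sketched in a few lines of Python. This is one simple way to do it, not necessarily how any particular statistics program does: each dot gets a random horizontal position within a band, and a position is redrawn if it falls too close to another dot with the same score. The function name jitter, its parameters, and the scores are all hypothetical.

```python
import random
from collections import defaultdict


def jitter(scores, center=0.0, half_width=0.4, min_gap=0.05, seed=0):
    """Return one (x, score) pair per subject with a random x position.

    Dots sharing a score are kept at least min_gap apart horizontally,
    a simple stand-in for the constraint that dots must not overlap.
    """
    rng = random.Random(seed)       # seeded so the layout is repeatable
    placed = defaultdict(list)      # score -> x positions already used
    points = []
    for score in scores:
        for _ in range(1000):       # retry cap for the random draws
            x = center + rng.uniform(-half_width, half_width)
            if all(abs(x - other) >= min_gap for other in placed[score]):
                break
        else:
            raise ValueError("band too narrow for this many tied scores")
        placed[score].append(x)
        points.append((x, score))
    return points


# Hypothetical rounded times in seconds; three subjects share a score of 17.
points = jitter([17, 17, 17, 18, 20, 20])
for x, score in points:
    print(round(x, 2), score)
```

The vertical position of each dot is still the score itself; only the horizontal position is randomized, so tied scores become visible as separate dots.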
Different styles of box plots are best for different
situations, and there are no firm rules for which to use. When
exploring your data you should try several ways of visualizing
them. Which graph you include in your report should depend on
how well different graphs reveal the aspects of the data you consider
most important.