Identify errors of prediction in a scatterplot with a regression line

In simple linear regression, we predict
scores on one variable from the scores on a second variable.
The variable we are predicting is called the criterion
variable and is referred to
as Y. The variable we are basing our predictions on is called
the predictor variable and is
referred to as X. When there is only one predictor variable,
the prediction method is called simple regression.
In simple linear regression, the topic of this section, the predictions
of Y when plotted as a function of X form a straight line.

The
example data in Table 1 are plotted in Figure 1. You can see
that there is a positive relationship between X and Y. If you
were going to predict Y from X, the higher the value of X,
the higher your prediction of Y.

Table 1. Example data.

   X       Y
1.00    1.00
2.00    2.00
3.00    1.30
4.00    3.75
5.00    2.25

Figure 1. A scatterplot of the example
data.

Linear regression consists of finding the best-fitting
straight line through the points. The best-fitting line is called
a regression line. The black diagonal
line in Figure 2 is the regression line and consists of the predicted
score on Y for each possible value of X. The vertical lines from
the points to the regression line represent the errors of prediction.
As you can see, the red point is very near the regression line;
its error of prediction is small. By contrast, the yellow point
is much higher than the regression line and therefore its error
of prediction is large.

Figure 2. A scatterplot of the example
data. The black line consists of the predictions, the points
are the actual data, and the vertical lines between the
points and the black line represent errors of prediction.

The error of prediction for a point is the value of the point
minus the predicted value (the value on the line). Table 2 shows
the predicted values (Y') and the errors of prediction (Y-Y').
For example, the first point has a Y of 1.00 and
a predicted Y of 1.21. Therefore its error of prediction is -0.21.
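
The subtraction described above can be sketched in a few lines of Python. This is only an illustration: the observed and predicted values are copied from Table 2, and the variable names are arbitrary choices rather than part of the text.

```python
# Observed Y values and predicted values (Y') copied from Table 2.
y_observed = [1.00, 2.00, 1.30, 3.75, 2.25]
y_predicted = [1.210, 1.635, 2.060, 2.485, 2.910]

# Error of prediction = observed value minus predicted value (Y - Y').
# Rounding to three decimals matches the precision used in Table 2.
errors = [round(y - y_hat, 3) for y, y_hat in zip(y_observed, y_predicted)]

for y, y_hat, e in zip(y_observed, y_predicted, errors):
    print(f"Y = {y:.2f}   Y' = {y_hat:.3f}   Y - Y' = {e:+.3f}")
```

A negative error means the point lies below the regression line; a positive error means it lies above the line.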

Table 2. Example data.

   X       Y      Y'     Y-Y'   (Y-Y')^{2}
1.00    1.00   1.210   -0.210        0.044
2.00    2.00   1.635    0.365        0.133
3.00    1.30   2.060   -0.760        0.578
4.00    3.75   2.485    1.265        1.600
5.00    2.25   2.910   -0.660        0.436

You may have noticed that we did not specify what is meant by
"best-fitting line." By far the most commonly used criterion
for the best-fitting line is the line that minimizes the sum of
the squared errors of prediction. That is the criterion that was
used to find the line in Figure 2. The last column in Table 2
shows the squared errors of prediction. The sum of the squared
errors of prediction shown in Table 2 is lower than it would be
for any other regression line.
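
The least-squares criterion can be checked numerically. The sketch below computes the sum of squared errors for the line in Figure 2, using the slope 0.425 and intercept 0.785 stated in this section, and compares it with a few alternative lines. The alternative slopes and intercepts are hypothetical values chosen only for illustration.

```python
# Example data from Table 1.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [1.00, 2.00, 1.30, 3.75, 2.25]


def sum_squared_errors(slope, intercept):
    """Sum of squared errors of prediction for the line Y' = slope * X + intercept."""
    return sum(
        (y - (slope * x + intercept)) ** 2
        for x, y in zip(x_values, y_values)
    )


# The regression line from this section.
best = sum_squared_errors(0.425, 0.785)

# Hypothetical alternative lines (illustrative values, not from the text).
alternatives = [(0.500, 0.785), (0.425, 1.000), (0.300, 1.200)]
alternative_sse = [sum_squared_errors(b, a) for b, a in alternatives]

print(f"Regression line: SSE = {best:.3f}")
for (b, a), sse in zip(alternatives, alternative_sse):
    print(f"Y' = {b:.3f}X + {a:.3f}: SSE = {sse:.3f}")
```

The value for the regression line is about 2.791, the total of the last column of Table 2, and each alternative line gives a larger sum.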

The formula for a regression line is

Y' = bX + A

where Y' is the predicted score, b is the slope
of the line, and A is the Y intercept. The equation for the line
in Figure 2 is

Y' = 0.425X + 0.785

For X = 1,

Y' = (0.425)(1) + 0.785 = 1.21.

For X = 2,

Y' = (0.425)(2) + 0.785 = 1.635, which rounds to 1.64.
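
The same arithmetic can be applied to every X value at once. The following is a minimal sketch; the function name predict and the constant names are illustrative choices, while the slope and intercept are the values given for the line in Figure 2.

```python
# Slope (b) and intercept (A) of the regression line given in this section.
SLOPE = 0.425
INTERCEPT = 0.785


def predict(x):
    """Return the predicted score Y' = bX + A for a given X."""
    return SLOPE * x + INTERCEPT


# Predicted scores for the five X values in the example data.
predictions = [predict(x) for x in [1.0, 2.0, 3.0, 4.0, 5.0]]

for x, y_hat in zip([1.0, 2.0, 3.0, 4.0, 5.0], predictions):
    print(f"X = {x:.2f}   Y' = {y_hat:.3f}")
```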

Computing the Regression Line

In the age of computers, the regression line
is typically computed with statistical software. However, the
calculations are relatively easy and are given here for anyone who
is interested. The calculations are based on the statistics shown
in Table 3. M_X is the mean of X, M_Y is the mean of Y, s_X is
the standard deviation of X, s_Y is the
standard deviation of Y, and r is the correlation between
X and Y.
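
As a rough illustration, the statistics defined above can be computed directly from the Table 1 data and combined into a slope and intercept. The sketch assumes the standard least-squares formulas, slope = r(s_Y / s_X) and intercept = M_Y - (slope)(M_X); these formulas are an assumption supplied here, as are the variable names. Standard deviations use the sample (n - 1) form.

```python
from statistics import mean, stdev

# Example data from Table 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.00, 2.00, 1.30, 3.75, 2.25]
n = len(x)

# Means and sample standard deviations (stdev divides by n - 1).
m_x, m_y = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Pearson correlation: sum of cross-products of deviations,
# divided by (n - 1) times the product of the standard deviations.
r = sum((xi - m_x) * (yi - m_y) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Assumed standard least-squares formulas for slope (b) and intercept (A).
b = r * s_y / s_x
a = m_y - b * m_x

print(f"M_X = {m_x:.3f}, M_Y = {m_y:.3f}")
print(f"s_X = {s_x:.3f}, s_Y = {s_y:.3f}, r = {r:.3f}")
print(f"b = {b:.3f}, A = {a:.3f}")
```

Under these assumptions the result reproduces the slope of 0.425 and intercept of 0.785 of the line in Figure 2.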

Note that the calculations have all been shown
in terms of sample statistics rather than population parameters.
The formulas are the same; simply use the parameter values
for means, standard deviations, and the correlation.

Assumptions

It may surprise you, but the calculations shown
in this section are assumption-free. Of course, if the relationship
between X and Y is not linear, a differently shaped function could
fit the data better. Inferential statistics in
regression are based on several assumptions, and these assumptions
are presented in a separate section of this chapter.