Introduction to Regression
Correlation is often only a first step in multivariate
analysis. Suppose that we have some reason to believe that A
is a cause of B, and we would like to predict B when we know
A. Correlation can tell us whether there is an association,
but that is all it can do. Regression analysis, however, can
take us much further.
Regression analysis is a way to fit a line through a
scatter of points. For example, in the diagram below some
upward-sloping line would fit through the points better than
a downward-sloping line. However, there are an infinite
number of upward-sloping lines that we could draw. Which one
is the best one?
Regression gives us a line, expressed as an equation of
the form Y = a + bX, that has three desirable properties.
First, when we find the errors between what the line
predicts for Y and the actual Ys, and we sum these errors,
they will sum to zero. This is also a property of the mean
when we have only one variable, and it is desirable because
it tells us that the regression line does not on average
predict too high or too low.
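As a quick numerical check of this first property, the Python sketch below (with data invented purely for illustration) fits a least-squares line using the standard formulas and sums the errors:

```python
# Invented example data; not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# Standard least-squares slope and intercept for Y = a + bX
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Errors (residuals): actual Y minus the line's prediction
errors = [y - (a + b * x) for x, y in zip(xs, ys)]
print(sum(errors))  # zero, up to floating-point rounding
```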
A second property is a bit harder to appreciate. If we
look at the correlation between the X values (which are the
independent values that we are using to make the predictions
of Y) and the errors, we find that the correlation is zero.
This means that we have extracted all the information about
Y from the X values. If there were still information
available in the X values, there would be some other line
using this information that would predict better.
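This property can also be verified numerically. The sketch below (again with invented data) fits the line and computes the covariance between the X values and the errors, which is the numerator of their correlation:

```python
# Invented example data; not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
errors = [y - (a + b * x) for x, y in zip(xs, ys)]

# Covariance of X with the errors; the correlation is zero
# exactly when this is zero (its denominator is positive).
mean_e = sum(errors) / n
cov_xe = sum((x - mean_x) * (e - mean_e) for x, e in zip(xs, errors))
print(cov_xe)  # zero, up to floating-point rounding
```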
A third property of the line is that if we square the
error terms and sum them, we get the smallest possible sum
of squared errors. (For this reason, the procedure is often
called "least-squares regression.") This is also a property
of the mean when we deal with one variable: the mean also
minimizes the sum of squared errors.
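A sketch of this third property, once more with invented data: nudging the fitted slope or intercept in any direction only increases the sum of squared errors.

```python
# Invented example data; not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def sse(a_, b_):
    """Sum of squared errors for the line Y = a_ + b_ * X."""
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in zip(xs, ys))

best = sse(a, b)
# Every perturbed line has a strictly larger sum of squared errors.
perturbed = [sse(a + da, b + db)
             for da in (-0.1, 0.0, 0.1)
             for db in (-0.1, 0.0, 0.1)
             if (da, db) != (0.0, 0.0)]
print(all(best < s for s in perturbed))  # True
```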
Most textbooks include formulas for finding a and b in
the equation Y = a + bX. However, in practice no one
computes regression lines manually (though it makes a good
exercise for students to do once or twice). The computations
for realistic problems are too laborious and prone to error.
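The textbook formulas referred to above are the standard ones: b is the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and a = Ȳ − bX̄. A hand-computation in Python, with invented data (the function name fit_line is ours):

```python
def fit_line(xs, ys):
    """Least-squares a and b for Y = a + bX (textbook formulas)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared X-deviations
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    # Intercept: forces the line through the point of means
    a = mean_y - b * mean_x
    return a, b

# Invented data for illustration
a, b = fit_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
print(a, b)  # 0.15 and 1.95, up to rounding
```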
One of the first uses of computers when they entered
universities and businesses in the 1950s and 1960s was for
statistical computation. Without computers most modern
statistical analysis would not be done.
Statistical programs not only compute the regression
line but also a host of other numbers. One of the most important
is the R-square. In simple, two-variable regressions, it is
the square of the correlation. It also represents the amount
of variation in Y that the regression line can explain. So
if the R-square is .10, then the line explains only 10% of
the original variation, which in most cases is not a lot.
However, if it is .96, then the line explains 96% of the
original variation, which in most cases is considered to be
good. The R-square is often the first number that
researchers look at, and it is almost always included with
any published results.
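In the two-variable case, the claim that R-square equals the squared correlation and measures the share of variation explained can be checked directly. A sketch with invented data:

```python
import math

# Invented example data; not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

b = sxy / sxx
a = mean_y - b * mean_x
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / syy           # share of Y's variation explained
r = sxy / math.sqrt(sxx * syy)      # correlation between X and Y
print(r_squared, r ** 2)            # the two agree
```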