Introduction to Regression
Correlation is often only a first step in multivariate
analysis. Suppose that we have some reason to believe that A
is a cause of B, and we would like to predict B when we know
A. Correlation can tell us whether there is an association,
but that is all it can do. Regression analysis, however, can
take us much further.
Regression analysis is a way to fit a line through a
scatter of points. For example, in the diagram below some
upward-sloping line would fit through the points better than
a downward-sloping line. However, there are an infinite
number of upward-sloping lines that we could draw. Which one
is the best one?
Regression gives us a line, expressed as an equation of
the form Y = a + bX, that has three desirable properties.
First, when we find the errors between what the line
predicts for Y and the actual Ys, and we sum these errors,
they will sum to zero. This is also a property of the mean
when we have only one variable, and it is desirable because
it tells us that the regression line does not on average
predict too high or too low.
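As a quick numerical check of this first property, the Python sketch below (with data invented purely for illustration) fits a least-squares line using the standard formulas and sums the errors:

```python
# Invented example data; not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# Standard least-squares slope and intercept for Y = a + bX
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Errors (residuals): actual Y minus the line's prediction
errors = [y - (a + b * x) for x, y in zip(xs, ys)]
print(sum(errors))  # zero, up to floating-point rounding
```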
A second property is a bit harder to appreciate. If we
look at the correlation between the X values (which are the
independent values that we are using to make the predictions
of Y) and the errors, we find that the correlation is zero.
This means that we have extracted all the information about
Y from the X values. If there were still information
available in the X values, there would be some other line
using this information that would predict better.
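This property can also be verified numerically. The sketch below (again with invented data) fits the line and computes the covariance between the X values and the errors, which is the numerator of their correlation:

```python
# Invented example data; not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
errors = [y - (a + b * x) for x, y in zip(xs, ys)]

# Covariance of X with the errors; the correlation is zero
# exactly when this is zero (its denominator is positive).
mean_e = sum(errors) / n
cov_xe = sum((x - mean_x) * (e - mean_e) for x, e in zip(xs, errors))
print(cov_xe)  # zero, up to floating-point rounding
```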
A third property of the line is that if we square the
error terms and sum them, we get the smallest possible sum
of squared errors. (For this reason, the procedure is often
called "least-squares regression.") This is also a property
of the mean when we deal with one variable: the mean also
minimizes the sum of squared errors.
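A sketch of this third property, once more with invented data: nudging the fitted slope or intercept in any direction only increases the sum of squared errors.

```python
# Invented example data; not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def sse(a_, b_):
    """Sum of squared errors for the line Y = a_ + b_ * X."""
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in zip(xs, ys))

best = sse(a, b)
# Every perturbed line has a strictly larger sum of squared errors.
perturbed = [sse(a + da, b + db)
             for da in (-0.1, 0.0, 0.1)
             for db in (-0.1, 0.0, 0.1)
             if (da, db) != (0.0, 0.0)]
print(all(best < s for s in perturbed))  # True
```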
Most textbooks include formulas for finding a and b in
the equation Y = a + bX. However, in practice no one
computes regression lines manually (though it makes a good
exercise for students to do once or twice). The computations
for realistic problems are too laborious and prone to error.
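The textbook formulas referred to above are the standard ones: b is the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and a = Ȳ − bX̄. A hand-computation in Python, with invented data (the function name fit_line is ours):

```python
def fit_line(xs, ys):
    """Least-squares a and b for Y = a + bX (textbook formulas)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared X-deviations
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    # Intercept: forces the line through the point of means
    a = mean_y - b * mean_x
    return a, b

# Invented data for illustration
a, b = fit_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
print(a, b)  # 0.15 and 1.95, up to rounding
```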
One of the first uses of computers when they entered
universities and businesses in the 1950s and 1960s was for
statistical computation. Without computers most modern
statistical analysis would not be done.
Statistical programs not only compute the regression
line but also a host of other numbers. One of the most important
is the R-square. In simple, two-variable regressions, it is
the square of the correlation. It also represents the amount
of variation in Y that the regression line can explain. So
if the R-square is .10, then the line explains only 10% of
the original variation, which in most cases is not a lot.
However, if it is .96, then the line explains 96% of the
original variation, which in most cases is considered to be
good. The R-square is often the first number that
researchers look at, and it is almost always included with
any published results.
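In the two-variable case, the claim that R-square equals the squared correlation and measures the share of variation explained can be checked directly. A sketch with invented data:

```python
import math

# Invented example data; not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

b = sxy / sxx
a = mean_y - b * mean_x
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / syy           # share of Y's variation explained
r = sxy / math.sqrt(sxx * syy)      # correlation between X and Y
print(r_squared, r ** 2)            # the two agree
```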