Regression

Multivariate Regression

Regression analysis is a way of explaining the variation in one variable, the dependent variable, in terms of the variation in one or more other variables, the independent variables. Because it is not limited to two variables as correlation is, it is a much more powerful statistical tool. Regression with multiple independent variables allows us to see associations that would be obscured in the raw data and might not show up in bivariate analysis. Regression analysis is the main statistical tool used in a great deal of social science and physical science research.

When researchers use regression for hypothesis testing, the procedure is another use of the t-test that we met in univariate statistics. With univariate statistics we computed a sample mean and a standard error, and then asked how many standard errors the sample mean was from the claimed mean. This gave us a t-statistic that allowed us to compute the probability of getting a result at least that far from the claimed mean by random chance alone. In regression we use the sample to compute the coefficients for the regression line and also standard errors for those regression coefficients. We know that with a different sample those regression coefficients would be a bit different, and the standard error tells us how much variation from sample to sample we could expect. We then ask how many standard errors each regression coefficient is from zero, which is the default claim used in regression analysis. This gives us t-statistics that are interpreted in exactly the same way as t-statistics are interpreted when testing claims about means.
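To make the procedure concrete, here is a minimal sketch in Python of a regression with one independent variable. It computes the slope, the standard error of the slope, and the t-statistic that measures how many standard errors the slope is from zero. The data values and the function name simple_regression are made up for illustration; they do not come from the text.

```python
import math

def simple_regression(x, y):
    """Fit y = a + b*x by least squares; return (a, b, se_b, t_b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope of the regression line
    a = mean_y - b * mean_x       # intercept
    # Sum of squared residuals; two coefficients were estimated,
    # so the residual variance uses n - 2 degrees of freedom.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2) / sxx)
    t_b = b / se_b                # distance of the slope from zero, in standard errors
    return a, b, se_b, t_b

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, se_b, t_b = simple_regression(x, y)
print(f"slope = {b:.3f}, standard error = {se_b:.3f}, t = {t_b:.3f}")
# prints: slope = 0.600, standard error = 0.283, t = 2.121
```

The t-statistic would then be compared with a t-distribution with n - 2 degrees of freedom, just as a t-statistic for a mean is compared with a t-distribution with n - 1 degrees of freedom.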

When there is only one independent variable, the regression equation describes a line, which we can draw on a graph. When there are two independent variables, the regression equation describes a plane, which is still possible to visualize. With more than two independent variables it is best not to try to visualize the equation.
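The two-variable case can be sketched in Python as follows. The example fits a plane y = b0 + b1*x1 + b2*x2 by solving the normal equations, which is one standard way to compute least-squares coefficients. The helper names solve and fit_plane and the data are assumptions made for this illustration. The data were generated exactly from the plane y = 1 + 2*x1 + 3*x2, so the fit should recover those coefficients up to rounding.

```python
def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):                      # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def fit_plane(x1, x2, y):
    """Least-squares coefficients (b0, b1, b2) for y = b0 + b1*x1 + b2*x2."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]        # column of ones gives the intercept
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)

# Hypothetical data lying exactly on the plane y = 1 + 2*x1 + 3*x2.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 0]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]
b0, b1, b2 = fit_plane(x1, x2, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Each coefficient is the slope of the plane in the direction of one independent variable while the other is held fixed.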

It is important to realize that regression finds linear relationships, but real-world relationships are often not linear. Statisticians attack these situations by applying mathematical transformations to the variables, such as taking logarithms, so that the transformed relationship is closer to linear and the procedure can be used. This is a topic beyond introductory statistics.
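As one illustration of such a transformation, the sketch below assumes data that grow exponentially, y = 3 * exp(0.5 * x). A straight line fits these values poorly, but the logarithm of y is exactly linear in x, so regressing log(y) on x recovers the growth rate. The data and the function name fit_line are hypothetical, and taking logarithms is only one of many transformations a statistician might try.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical data following y = 3 * exp(0.5 * x), which is not linear in x.
x = [0, 1, 2, 3, 4]
y = [3.0 * math.exp(0.5 * xi) for xi in x]

# Regress log(y) on x: log(y) = log(3) + 0.5 * x is a straight line.
a_log, b_log = fit_line(x, [math.log(yi) for yi in y])
print(f"growth rate = {b_log:.3f}, starting value = {math.exp(a_log):.3f}")
```

The slope of the transformed regression estimates the growth rate 0.5, and exponentiating the intercept gives back the starting value 3.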

In univariate statistics we can count on the central limit theorem to justify our use of the t-distribution as long as we have taken a reasonably large random sample. In regression there are a variety of situations in which the t-distribution will give inaccurate results, and statisticians have procedures for trying to identify and remedy these problems. Something as simple as omitting an important explanatory variable will distort the t-values, which is one reason that researchers want high R-squares. With a high R-square there is little variation left unexplained, so it seems doubtful that some vital explanatory variable was omitted. Other problems have exotic names such as heteroscedasticity, multicollinearity, and serial correlation, all things that more advanced statistics courses study.
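The effect of an omitted variable can be seen in a small sketch. The hypothetical data below are generated from y = 1 + 2*x1 + 3*x2, where x2 tends to rise with x1. If x2 is left out and y is regressed on x1 alone, the coefficient on x1 absorbs part of the effect of x2 and lands far from the true value of 2. A distorted coefficient in turn feeds into a distorted t-value. The data and the function name slope are assumptions for illustration only.

```python
def slope(x, y):
    """Least-squares slope from regressing y on a single variable x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data: x2 is correlated with x1, and y depends on both.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

# Regression that omits x2 entirely.
slope_short = slope(x1, y)
print(f"true coefficient on x1: 2, estimate with x2 omitted: {slope_short:.3f}")
# prints: true coefficient on x1: 2, estimate with x2 omitted: 4.486
```

The size of the distortion depends on how strongly the omitted variable is related to both the dependent variable and the included variable; if x2 were unrelated to x1, the estimate would stay near 2.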
