Multivariate Regression
Regression analysis is a way of explaining the variations
in one variable in terms of variations in another variable
or variables. Because it is not limited to two variables as
correlation is, it is a much more powerful statistical tool.
Regression with multiple independent variables allows us to
see associations that would be obscured in the raw data and
might not show up in bivariate analysis. Regression analysis
is the main statistical tool used in a great deal of social
science and physical science research.
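To see how an association can be hidden in bivariate analysis,
consider the following minimal sketch. The data are made up for
illustration and constructed so that y = 10 + x1 - x2 exactly;
the helper names are hypothetical. It compares the slope from a
one-variable regression of y on x1 with the slopes from a
two-variable regression.

```python
# Made-up data, constructed so that y = 10 + x1 - x2 exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 3, 4, 4, 7]
y = [9, 11, 10, 10, 11, 9]

def mean(values):
    return sum(values) / len(values)

def cross(u, v):
    """Sum of products of deviations of u and v from their means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Bivariate analysis: slope of y on x1 alone.
bivariate_slope = cross(x1, y) / cross(x1, x1)

# Two independent variables: least-squares slopes from the
# normal equations for centered data.
s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

print("y on x1 alone, slope:", bivariate_slope)
print("y on x1 and x2:", b0, b1, b2)
```

In this constructed example the bivariate slope on x1 is zero,
because x1 and x2 move together and their effects cancel, while
the two-variable regression recovers the slopes of 1 and -1.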
When researchers use regression for hypothesis testing,
the procedure is another use of the t-test that we met in
univariate statistics. With univariate statistics we computed
a sample mean and a
standard error, and then asked how many standard errors the
sample mean was from the claimed mean. This gave us a
t-statistic that allowed us to compute the probability of
getting this type of result by random chance. In regression
we use the sample to compute the coefficients for the
regression line and also standard errors for those
regression coefficients. We know that with a different
sample those regression coefficients would be a bit
different, and the standard error tells us how much
variation from sample to sample we could expect. We then ask
how many standard errors the regression coefficients are
from zero, which is the default claim used in regression
analysis. This gives us t-statistics that are interpreted in
exactly the same way as t-statistics are interpreted when
testing claims about means.
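The calculation described above can be sketched as follows. To
keep the arithmetic short, the sketch uses a single independent
variable and a made-up sample of five points; the variable names
are illustrative assumptions, and statistical software would
normally do this work.

```python
import math

# Made-up sample for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Coefficients of the regression line.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual variance, with n - 2 degrees of freedom because two
# coefficients were estimated from the sample.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sigma2 = sum(r ** 2 for r in residuals) / (n - 2)

# Standard error of the slope, then the number of standard
# errors the slope is from the default claim of zero.
se_slope = math.sqrt(sigma2 / sxx)
t_slope = (slope - 0) / se_slope

print(f"slope={slope:.3f}  se={se_slope:.4f}  t={t_slope:.1f}")
```

Here the slope is roughly 33 standard errors from zero, so a
result like this would be very unlikely to arise by random
chance if the true slope were zero.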
The graphical interpretation of the regression equation
when there is only one independent variable is that the
equation shows a line. We can draw this line on a graph.
When there are two independent variables, the regression
equation describes a plane, which is still possible to
visualize. With more than two independent variables it is
best not to try to visualize the equation.
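In symbols, the three cases can be written as below. The
notation is an assumption made here for illustration: b_0 is the
intercept, b_1 through b_k are the regression coefficients, and
x_1 through x_k are the independent variables.

```latex
% One independent variable: a line
\hat{y} = b_0 + b_1 x_1

% Two independent variables: a plane
\hat{y} = b_0 + b_1 x_1 + b_2 x_2

% k independent variables: same form, but no simple picture
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
```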
It is important to realize that regression finds linear
relationships but often real-world relationships are not
linear. Statisticians try to attack these situations by
using various mathematical transformations to make the
procedure work with the data. This is a topic beyond
introductory statistics.
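As one illustration of such a transformation, the sketch below
takes logarithms of both variables. The data are made up and
follow the curve y = 3x^2 exactly, and fit_line is a
hypothetical helper. A curve of this form becomes a straight
line in the logs, so ordinary regression can then be applied.

```python
import math

# Made-up data following the curve y = 3 * x**2.
x = [1, 2, 3, 4, 5, 6]
y = [3 * xi ** 2 for xi in x]

def fit_line(u, v):
    """Least-squares slope and intercept for v = intercept + slope * u."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    slope = (sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
             / sum((a - u_bar) ** 2 for a in u))
    return slope, v_bar - slope * u_bar

# log(y) = log(3) + 2 * log(x) is linear in log(x).
log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]
slope, intercept = fit_line(log_x, log_y)

print("exponent:", slope)
print("multiplier:", math.exp(intercept))
```

The fitted slope recovers the exponent 2, and the exponential of
the intercept recovers the multiplier 3. Other transformations
suit other shapes of curve.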
In univariate statistics we can count on the central
limit theorem to justify our use of the t-distribution as
long as we have taken a reasonably large random sample. In
regression there
are a variety of situations in which the t-distribution will
give inaccurate results, and statisticians have procedures
for trying to identify and remedy these problems. Something
as simple as omitting an important explanatory variable will
distort the t-values, which is a reason that researchers
want high R-squares. With a high R-square there is little
variation that is left unexplained, so it seems doubtful
that some vital explanatory variable was omitted. Other
problems have exotic names such as heteroscedasticity,
multicollinearity, and serial correlation, all things that
more advanced statistics courses study.
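The R-square mentioned above is the share of the variation in
the dependent variable that the regression explains. A minimal
sketch of the usual calculation follows; the observed and fitted
values are made up for illustration and r_squared is a
hypothetical helper name.

```python
def r_squared(y, fitted):
    """One minus unexplained variation over total variation."""
    y_bar = sum(y) / len(y)
    unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - unexplained / total

# Made-up observed values and fitted values from a regression line.
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.0]

r2 = r_squared(y, fitted)
print(f"R-square = {r2:.4f}")
```

A value this close to 1 means very little variation is left
unexplained.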