Correlation
If statistics were limited to estimation and testing of
single variables, it would be a niche subject taught to
graduate students. Fortunately, statistics can also look at
the relationship between or among variables. The most basic
statistical tool for doing this is the correlation
coefficient, which measures the strength of the linear
relationship between two variables.
To explain what a linear relationship is, let us suppose that a
college wants to know if there is any relationship between
an entrance test it gives and how well students do in their
freshman year. After collecting the data, it probably
would start with a scatter diagram. It could find a picture
that would suggest that there is no relationship between how
the students did on the test and their results from a year
of college.
The college probably would expect a chart like the one
below, which indicates that those who did well on the test
did well on grades and those who did poorly on the test had
poor grades.
However, it is also possible that the college could find a
result indicating that those who did well on the test
tended to do poorly in classes, and those who did poorly on
the test were successful in classes. If that happened, the
chart might look like this one:
What the correlation coefficient does is measure the
strength and direction of the relationship. The strongest
possible connection would be for the points to line up on a
straight line. The weakest would be for the points to form a
circular cloud. A straight line that has a negative slope
would give a correlation of -1. As the points get more and
more variable around that line, the correlation coefficient
moves toward zero. At a zero correlation we have a blob of
points that shows no relationship. As we move away from the
blob toward a scatter that shows a positive relationship,
the correlation becomes positive. The stronger the
relationship, the closer to +1 the correlation, until at a
correlation of +1 all the points form a straight,
upward-sloping line.
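The three extreme cases described above can be checked with a
few lines of Python. This is only a minimal sketch: the helper
name pearson_r and all of the data points are made up for
illustration and are not taken from the college example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]

# Points exactly on a downward-sloping line: correlation of -1.
r_falling = pearson_r(xs, [8, 6, 4, 2, 0])

# Points exactly on an upward-sloping line: correlation of +1.
r_rising = pearson_r(xs, [4, 7, 10, 13, 16])

# Four points spaced evenly around a circle: correlation of 0.
r_cloud = pearson_r([1, 0, -1, 0], [0, 1, 0, -1])

print(r_falling, r_rising, r_cloud)
```

Adding scatter to either straight line would pull its value
toward zero, which is the progression the paragraph describes.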
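The correlation coefficient is found by taking z-scores
of all the x-values and of all the y-values. Each pair of
z-scores is then multiplied together, the products are
summed, and the sum is divided by the number of
observations. (If the z-scores are computed with the sample
standard deviation, the sum is divided by one less than the
number of observations instead.)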
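The recipe above translates almost word for word into code.
The sketch below assumes the z-scores use the population
standard deviation (dividing by n), which is what makes the
final division by the number of observations come out right.
The function names and the test scores and grade point
averages are hypothetical.

```python
import math

def z_scores(values):
    """Standardize values using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Average of the products of paired z-scores."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    zx = z_scores(xs)
    zy = z_scores(ys)
    # Multiply each pair of z-scores, sum the products, divide by n.
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical entrance-test scores and freshman grade point averages.
test_scores = [52, 61, 70, 78, 85, 93]
gpas = [2.1, 2.4, 2.9, 3.0, 3.4, 3.8]

r = correlation(test_scores, gpas)
print(round(r, 3))
```

With these made-up numbers the students who scored higher on
the test also earned higher grades, so the result is close to
+1.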
Correlation says nothing about causation. Things A and B
can be correlated if A in some way influences B, or B in
some way influences A, or if both lines of influence are
present, or if some other variable, C, influences A and
B.
There are tests of significance for the correlation
coefficient. If we compute the correlation coefficient for
two random sets of numbers, we will rarely get a correlation
that is zero. Almost always it will be a small number close
to zero. What the level of significance (the p-value) tells
us is the probability of getting, by random chance alone, a
correlation coefficient at least as far from zero as the one
we got. If we get a level of significance of .4, then there
is a 40% chance of getting a result like the one we have
even if there is no relationship. Hence, we would not reject
the default claim that the variables are not linearly
related. On the
other hand, if we get a level of significance of .001, there
is only one chance in a thousand that we would get a
correlation like the one we obtained by random chance, so we
usually would reject the claim that there is no relationship
and argue that our variables are in some way connected.
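One way to make the idea of "getting a correlation this far
from zero by random chance" concrete is a permutation test:
shuffle one variable many times, which destroys any real
relationship, and count how often the shuffled data give a
correlation at least as extreme as the observed one. This is
a sketch of that one approach, not necessarily the test a
statistics package would report (a t-test is the usual
choice). The function names, the number of trials, the seed,
and the data are all assumptions made for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, trials=10_000, seed=0):
    """Estimate the chance of a correlation this far from zero
    when the pairing of xs and ys is purely random."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real link between x and y
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / trials

# Hypothetical entrance-test scores and freshman grade point averages.
test_scores = [52, 61, 70, 78, 85, 93]
gpas = [2.1, 2.4, 2.9, 3.0, 3.4, 3.8]

p = permutation_p_value(test_scores, gpas)
print(p)
```

A small value of p here plays the same role as the level of
significance of .001 in the text: random pairings almost never
produce a correlation this strong, so we would reject the
claim that there is no relationship.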
