Correlation

If statistics were limited to estimation and testing of single variables, it would be a niche subject taught to graduate students. Fortunately, statistics can also look at the relationship between or among variables. The most basic statistical tool for doing this is the correlation coefficient, which measures the strength and direction of a linear relationship between two variables.

To explain what a linear relationship is, suppose that a college wants to know whether there is any relationship between an entrance test it gives and how well students do in their freshman year. After collecting the data, the college would probably start with a scatter diagram. It could find a picture that suggests there is no relationship between how the students did on the test and their results from a year of college.

The college would probably expect to find a chart like the one below, which indicates that those who did well on the test earned good grades and those who did poorly on the test earned poor grades.

However, the college could also find that those who did well on the test tended to do poorly in classes, and those who did poorly on the test were successful in classes. If that were what happened, it might get a chart like this one:

What the correlation coefficient does is measure the strength and direction of the relationship. The strongest possible connection would be for the points to line up on a straight line. The weakest would be for the points to form a circular cloud. Points that fall exactly on a straight line with a negative slope give a correlation of -1. As the points become more and more scattered around that line, the correlation coefficient moves toward zero. At a correlation of zero we have a blob of points that shows no linear relationship. As we move away from the blob toward a scatter that shows a positive relationship, the correlation becomes positive. The stronger the relationship, the closer the correlation is to +1, until at a correlation of +1 all the points fall on a straight, upward-sloping line.

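The correlation coefficient is found by converting all the x-values and all the y-values to z-scores. Each pair of z-scores is then multiplied together, the products are summed, and the sum is divided by the number of observations. This works when the standard deviations used for the z-scores are computed by dividing by the number of observations, n. If they are computed by dividing by n - 1, as sample standard deviations usually are, the sum of products is divided by n - 1 instead.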
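The calculation just described can be sketched in a few lines of Python. This is only an illustration: the function names and the data are made up for the example, and the z-scores here use the standard deviation that divides by n, so the final division is by n as well.

```python
from math import sqrt

def z_scores(values):
    """Standardize values using the standard deviation that divides by n."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Multiply paired z-scores, sum the products, and divide by n."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical data: entrance-test scores and freshman grade averages.
test_scores = [52, 61, 70, 78, 85, 93]
grades = [2.1, 2.4, 2.9, 2.8, 3.4, 3.7]

r = correlation(test_scores, grades)
r_up = correlation([1, 2, 3, 4], [3, 5, 7, 9])    # points on a rising line
r_down = correlation([1, 2, 3, 4], [9, 7, 5, 3])  # points on a falling line
print(round(r, 3), round(r_up, 3), round(r_down, 3))
```

The two small four-point data sets lie exactly on straight lines, so they should come out at +1 and -1 apart from rounding error, while the made-up test data give a strong positive correlation somewhat below +1.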

Correlation says nothing about causation. Things A and B can be correlated if A in some way influences B, or B in some way influences A, or if both lines of influence are present, or if some other variable, C, influences A and B.
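The last case, in which a third variable C influences both A and B, can be illustrated with a small simulation. The sketch below uses made-up random data: A and B are each built from C plus their own independent noise, so neither one influences the other, yet they come out clearly correlated. The variable names, sample size, and seed are arbitrary choices for the example.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# C is the hidden common influence; A and B each depend on C but not on each other.
c = [rng.gauss(0, 1) for _ in range(2000)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]

r_ab = correlation(a, b)
print(round(r_ab, 2))
```

With this setup the correlation between A and B should land near 0.5, even though changing A would do nothing to B.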

There are tests of significance for the correlation coefficient. If we compute the correlation coefficient for two random sets of numbers, we will rarely get a correlation that is exactly zero. Almost always it will be a small number close to zero. The level of significance reported for a correlation (often called the p-value) tells us the probability of getting, by random chance alone, a correlation coefficient at least as far from zero as the one we got, assuming the variables are not linearly related. If we get a level of significance of .4, then even with no relationship there would be a 40% chance of getting a result like the one we have. Hence, we would not reject the default claim that the variables are not linearly related. Strictly speaking, this does not prove the claim; it only means the data give us no good reason to doubt it. On the other hand, if we get a level of significance of .001, there is only one chance in a thousand that random chance would produce a correlation like the one we obtained, so we usually would reject the claim that there is no relationship and argue that our variables are in some way connected.
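One way to see what this probability means is to estimate it by shuffling. The sketch below is a permutation test, which is not necessarily the test a statistics package would report (packages commonly use a formula based on the t distribution). It repeatedly scrambles the y-values to destroy any real relationship and counts how often the scrambled data give a correlation at least as far from zero as the observed one. The function names, the data, the number of shuffles, and the seed are all assumptions made for the example.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(xs, ys, trials=5000, seed=0):
    """Estimate how often chance alone gives a correlation this far from zero."""
    rng = random.Random(seed)
    observed = abs(correlation(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(correlation(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical data with a strong relationship.
test_scores = [52, 61, 70, 78, 85, 93]
grades = [2.1, 2.4, 2.9, 2.8, 3.4, 3.7]
p_strong = permutation_p_value(test_scores, grades)

# Hypothetical data with almost no relationship.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 3, 6, 2, 7, 4, 5, 3]
p_weak = permutation_p_value(xs, ys)

print(p_strong, p_weak)
```

For the first data set very few shuffles match the observed correlation, so the estimated probability is small and we would reject the claim of no relationship. For the second, most shuffles do at least as well, so the estimate is large and we would not.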
