
## Measuring Dispersion

Consider these three sets of numbers:

- 1, 2, 3, 4, 5
- 2, 2, 3, 4, 4
- 0, 3, 3, 3, 6

All have the same average (mean) and median, 3, but they are different in how spread out they are. Clearly the second set of numbers is more compact than the first set, but is it also more compact than the third set?

There are a variety of ways to measure the variability or dispersion of a set of numbers. The easiest to compute is the range, which is the difference between the largest number in the set and the smallest. The ranges for these three sets of numbers are 4, 2, and 6.
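This computation can be checked with a short Python sketch:

```python
# The three sets from above; each has mean and median 3.
sets = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 4, 4],
    [0, 3, 3, 3, 6],
]
for numbers in sets:
    # Range: largest value minus smallest value.
    print(max(numbers) - min(numbers))  # prints 4, then 2, then 6
```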

A problem with the range and its close relatives (such as the interdecile or interquartile range) is that only two numbers determine the measure. A way to give each number a vote would be to subtract the mean from each number and find an average of the differences. However, no matter what set of numbers we have, this procedure will always give zero as the answer; this is a property of the mean.

We need to get rid of the negative deviations, and the easiest way to do this is to use absolute values. This computation for the first set of numbers is shown below:

| x | \|x - mean\| |
|---|--------------|
| 1 | 2 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| 5 | 2 |

Adding the absolute deviations gives six, and dividing by five gives 1.2. The results for the other two sets of numbers are .8 and 1.2. According to this measure of dispersion, which is called the mean deviation, the first and third sets of numbers are equally dispersed.
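The mean deviation can be sketched in Python as follows:

```python
def mean_deviation(numbers):
    # Average of the absolute deviations from the mean.
    m = sum(numbers) / len(numbers)
    return sum(abs(x - m) for x in numbers) / len(numbers)

print(mean_deviation([1, 2, 3, 4, 5]))  # 1.2
print(mean_deviation([2, 2, 3, 4, 4]))  # 0.8
print(mean_deviation([0, 3, 3, 3, 6]))  # 1.2
```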

The mean deviation may seem a sensible and obvious way to measure dispersion, but another measure that is less obvious and intuitive has better mathematical properties, so statisticians rarely use the mean deviation. The better measure gets rid of the negative signs by squaring the deviations from the mean. For the first set of numbers the computation is:

| x | (x - mean)² |
|---|-------------|
| 1 | 4 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| 5 | 4 |

Adding the squared deviations gives 10, and dividing by n, the number of observations, gives 10/5 = 2. This measure is called the variance. Taking the square root, which puts us back into the units of the original data, gives us the standard deviation, the most commonly used measure of dispersion in statistics.
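Dividing by n, as done here, gives what is called the population variance; a minimal sketch:

```python
import math

def population_variance(numbers):
    # Mean of the squared deviations from the mean (division by n).
    m = sum(numbers) / len(numbers)
    return sum((x - m) ** 2 for x in numbers) / len(numbers)

data = [1, 2, 3, 4, 5]
print(population_variance(data))             # 2.0
print(math.sqrt(population_variance(data)))  # about 1.414
```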

However, we have one more complication to toss into this story. If the numbers are a sample, that is, they are generated by some process, the division is not by n but by n-1. Hence, the variance is not 2, but 2.5, and the standard deviation of these numbers is the square root of 2.5, or approximately 1.58. The variances for the other two sets of numbers are 1 and 4.5, so their standard deviations are 1 and approximately 2.12. Notice that the standard deviation gives more weight to large deviations than the mean deviation does.
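Python's standard-library statistics module uses the n-1 divisor in statistics.variance and statistics.stdev, so it reproduces the sample figures above:

```python
import statistics

for numbers in ([1, 2, 3, 4, 5], [2, 2, 3, 4, 4], [0, 3, 3, 3, 6]):
    # Both functions divide by n - 1 (the sample formulas).
    print(statistics.variance(numbers), statistics.stdev(numbers))
# variances: 2.5, 1, 4.5; standard deviations: ~1.58, 1, ~2.12
```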

Why divide by n-1 rather than n? When we are dealing with a sample, we want an estimate of the real standard deviation: the standard deviation of the population or the process. Whether we divide by n or n-1, any single estimate may be too big or too small, but dividing by n tends, on average, to give an estimate that is a bit too small, while dividing by n-1 gives an estimate that will be, on the average, correct. In other words, dividing by n gives us a biased estimator, while dividing by n-1 gives us an unbiased estimator.
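This bias can be seen in a small simulation (a sketch, assuming a standard normal population, whose true variance is 1):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
n, trials = 5, 20000
biased_total = unbiased_total = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]  # true variance is 1
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    biased_total += ss / n          # divide by n
    unbiased_total += ss / (n - 1)  # divide by n - 1
print(biased_total / trials)    # falls noticeably below 1 (near 0.8)
print(unbiased_total / trials)  # stays close to the true variance, 1
```

With n = 5 the n-divisor estimate averages about (n-1)/n = 0.8 of the true variance, which is exactly the shortfall the n-1 divisor corrects.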

Here is another way of thinking about it. When we have a sample, we use up some of the information in the sample to estimate the mean of the population. So we do not have n independent pieces of information left--we have only n-1.
