

                  Sum of Squares    df    Mean Square    F        Sig
Between Groups    12046.467         3     4015.489       6.226    .001
The probability of getting a pattern this different from
what we would expect by random chance is only one in a
thousand. In other words, it is highly unlikely that we
would get this pattern randomly, so it appears that the
groups really are different.
One of the great things about levels of significance is that
they let you interpret a statistical result even if you do
not understand the test. Analysis of Variance, or ANOVA, is
discussed in the next section. Even if you do not know how
this test is performed or what an F value is, you can still
understand that the test is saying that the result does not
look random.
c) We can do exactly the same analysis with regression.
We do it with what are called dummy variables, which are
variables that indicate off/on (off = 0, on = 1). We will use
three of them, labeled fc (Franklin Central), man
(Manchester) and np (New Prairie). How does the ANOVA table
you get here compare with the ANOVA table in part a?
(They are different because in part a we are
analyzing rank, or how the girls placed, and below we are
analyzing their running time in seconds.)
What does the R-squared tell us?
According to the regression coefficients, girls from which
semistate (FC, NP, MAN, or Terre Haute, which is not listed
separately) are slowest?
The ANOVA results have the
same level of significance, .001, which says that the
chance of getting a result like this at random is
only one in a thousand.
The R-squared says that about 25% of the variation is
explained by the groupings, which in this case seems to be
quite a lot.
The girls from Terre Haute averaged 933.5 seconds. The girls
from the other three semistates had faster times because the
regression coefficients are all negative. The girls from NP,
for example, averaged 933.5 - 37.3 seconds, or 896.2
seconds.
R SQUARE = .247
ADJUSTED R SQUARE = .207

              Sum of Squares    df    Mean Square    F        Sig
Regression    15075.650         3     5025.217       6.136    .001

            Regression Coefficient    StdError    t          Sig
Constant    933.467                   7.389       126.327    .000
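The dummy-variable reading in part c can be checked with a few lines of Python. This is only a minimal sketch: the constant (933.467) and the NP coefficient (about -37.3, taken from the worked answer) are the only values reported here, so the fc and man coefficients are left out, and the function name and dictionary layout are illustrative assumptions.

```python
# Group means implied by a dummy-variable regression.
# The constant is the mean time (seconds) of the omitted group, Terre Haute.
CONSTANT = 933.467
# Only NP's coefficient is quoted in the text; fc and man are not reported.
COEFFICIENTS = {"np": -37.3}


def predicted_time(semistate=None):
    """Predicted time in seconds; semistate=None means the baseline group."""
    if semistate is None:
        return CONSTANT
    # Switching a dummy "on" (= 1) adds its coefficient to the constant.
    return CONSTANT + COEFFICIENTS[semistate]


print(round(predicted_time(), 1))      # baseline (Terre Haute)
print(round(predicted_time("np"), 1))  # New Prairie
```

Because each dummy is either 0 or 1, a coefficient is simply the gap between that group's average and the baseline group's average.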
d) If we look at the scatter diagram of state times and semistate times, we get the picture below. If we look at the correlation, we get the correlation below. What do they tell us?
Correlation between state time and semistate time = .628; significance is .000 (or less than 1 in 1000).
It appears that the girls who ran fast in the semistate ran fast at state, and the girls who were slower at semistate were slower at the state meet.
e) If we use regression to explain the time that a girl ran in the state meet, with her time in the semistate meet as the independent variable, we get the following. How well does the semistate time predict the state time? What would we predict for a girl who ran a semistate time of 900 seconds (that is, 15 minutes)? Do we see regression toward the mean?
We can explain almost 40% of the variation in state times using semistate times. A girl who ran 900 seconds in the semistate would be expected to run 243.037 + .742*900 = 910.837 seconds at state.
Regression toward the mean means that girls who excelled at the semistate meet also did well at the state meet, but not quite as well, and girls who were slower at the semistate meet were also slower at state, but not by as much. In other words, some of those who did well at the semistate had the race of their lives there and did not repeat it at the state level, while some others who were a bit off at the semistate redeemed themselves at the state meet. We can tell this because the regression coefficient is less than one. If the regression coefficient were greater than one, there would be no regression toward the mean. It would mean, rather, that those who did well at the semistate did even better at the state level compared with those who were a bit slower.
R SQUARE = .395
ADJUSTED R SQUARE = .384

              Sum of Squares    df    Mean Square    F         Sig
Regression    24059.334         1     24059.334      37.836    .000

            Regression Coefficient    StdError    t        Sig
Constant    243.037                   109.121     2.227    .030
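The prediction in part e and the idea of regression toward the mean can be sketched in Python. The constant (243.037) and the slope (.742) come from the worked answer; the function name and the 10-second comparison are assumptions added for illustration.

```python
# Simple regression: state time predicted from semistate time (seconds).
INTERCEPT = 243.037
SLOPE = 0.742  # less than one, which is what produces regression toward the mean


def predict_state_time(semistate_seconds):
    """Predicted state-meet time for a given semistate time."""
    return INTERCEPT + SLOPE * semistate_seconds


print(round(predict_state_time(900), 3))

# Two girls 10 seconds apart at semistate are predicted to be
# only SLOPE * 10 seconds apart at state: the gap shrinks.
gap_at_state = predict_state_time(910) - predict_state_time(900)
print(round(gap_at_state, 2))
```

A slope below one compresses differences between runners, so both the fastest and the slowest are pulled toward the middle of the field.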
f) We might do better than this if we also include the variables that tell which semistate a runner came from. Those results are below. By how much has our R-squared increased? Which semistate has the fastest course? Which seems to have the slowest course? What time would we predict for a girl who runs 900 seconds in the NP semistate?
This is a good illustration of how including more variables can change results. The regression coefficients are very different from those in the other regressions. The R-squared is now .736, up from .395, so we can explain 73.6% of the variation in times based on these variables. Manchester has the fastest course. The girls from Manchester ran slower by over 49 seconds at the state meet when we take their semistate times into account, which indicates that their semistate course was a fast one. The slowest course was New Prairie, but not by much over the default of Terre Haute. We would predict that a girl who runs 900 seconds at NP would run -162.846 + 1.174*900 - 7.529 = 886.225 seconds. By the way, there is no regression toward the mean in these results. The girls who were the best at the semistate stepped it up a notch at the state meet. Tougher competition for the lead improved their performances.
R SQUARE = .736
ADJUSTED R SQUARE = .716

              Sum of Squares    df    Mean Square    F         Sig
Regression    44830.269         4     11207.567      38.262    .000

            Regression Coefficient    StdError    t         Sig
Constant    -162.846                  108.865     -1.496    .140
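The part f prediction can be reproduced the same way. This is a hedged sketch: the minus signs on the constant and on the NP coefficient are inferred from the worked total of 886.225, only the coefficients quoted in the answer are used, and the function and variable names are assumptions.

```python
# Multiple regression: state time from semistate time plus a semistate dummy.
CONSTANT = -162.846          # sign inferred from the worked total
B_SEMISTATE_TIME = 1.174     # greater than one: no regression toward the mean
B_NP = -7.529                # New Prairie dummy; sign inferred as well


def predict_state_time(semistate_seconds, ran_at_np=False):
    """Predicted state time; ran_at_np switches the NP dummy on (1) or off (0)."""
    np_dummy = 1 if ran_at_np else 0
    return CONSTANT + B_SEMISTATE_TIME * semistate_seconds + B_NP * np_dummy


print(round(predict_state_time(900, ran_at_np=True), 3))
```

With the dummy off, the same function gives the prediction for the baseline semistate, Terre Haute.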
3. Two alumni of the University of Wisconsin wrote a research paper trying to explain variations in salary levels of 45 members of the University of Wisconsin economics department. Here were their published results:
REGRESSION OF FACULTY SALARY LEVELS ON EXPERIENCE, PUBLISHING PERFORMANCE, TEACHING PERFORMANCE AND ADMINISTRATIVE DUTIES

Independent Variables             Coefficient (in dollars)    Standard Error (in dollars)    t
Experience                        253.28                      59.71                          4.24*
Monographs                        5.72                        162.01                         0.04
Articles in National Journals     392.46                      90.64                          4.33*
Articles in Specialty Journals    344.59                      90.45                          3.81*
Other Publications                76.49                       24.31                          3.15*
Transformed Teaching Score        731.67                      429.82                         1.70
Administrative Duties             5,208.90                    807.46                         6.45*
Intercept                         12,127.10

*Significant at the .01 level

Source: American Economic Review, May 1973 (Vol. 63 No. 2) p 313 
a) It means that we have
strong reason to believe that the variables involved actually
have a real impact. Two are not significant
at this level: monographs and transformed teaching score. We
have no strong evidence that they matter in determining
pay.
b) By $392.46.
c) $12127.10 + 3*$253.28 + 2*$344.59 = $13576.12
d) Roughly $76.49 ± 2*$24.31, or $76.49 ±
$48.62.
e) We can explain 88.1% of the variation in salaries using
these variables. We cannot account for 11.9% of the
variation. Some of that may be random, and some of it may
reflect a variable or two that we should have included but
either could not measure or do not know
about.
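The arithmetic in answers c and d can be checked with a short Python sketch. The coefficients and the standard error are taken from the published table; reading answer c as 3 units of experience and 2 articles in specialty journals follows the formula shown there, and the function and dictionary names are assumptions.

```python
# Salary regression: values from the published table (dollars).
INTERCEPT = 12127.10
COEFFICIENT = {
    "experience": 253.28,
    "specialty_articles": 344.59,
    "other_publications": 76.49,
}
STD_ERROR = {"other_publications": 24.31}


def predicted_salary(experience=0, specialty_articles=0):
    """Predicted salary with every other variable set to zero."""
    return (INTERCEPT
            + COEFFICIENT["experience"] * experience
            + COEFFICIENT["specialty_articles"] * specialty_articles)


def rough_interval(name, multiplier=2):
    """Coefficient plus or minus `multiplier` standard errors."""
    margin = multiplier * STD_ERROR[name]
    return COEFFICIENT[name] - margin, COEFFICIENT[name] + margin


print(round(predicted_salary(experience=3, specialty_articles=2), 2))
low, high = rough_interval("other_publications")
print(round(low, 2), round(high, 2))
```

The multiplier of 2 is the usual rule of thumb for a roughly 95% interval; a table value such as 1.96 could be passed instead.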
4. Several different groups attempt to measure how conservative or liberal congressmen are. Among these groups were the Americans for Democratic Action (ADA), the AFL-CIO Committee on Political Education (COPE), the National Farm Union (NFU), and the Americans for Constitutional Action (ACA). Below is a matrix showing correlations between ratings given by these various interest groups many years ago. (Data from Journal of Law and Economics, Dec. 1979, p 369.)


Groups: ADA, COPE, NFU, ACA (correlation matrix entries not recoverable)
*Significant at the .01 level. 
a) It appears that there are
real relationships between the various pairs of variables,
that what we are seeing is more than just random chance.
b) The ACA measures how conservative a congressman was. If a
congressman gets a high liberal score, he or she also gets a
low conservative score.
Here is a regression that tried to explain the ADA rating of over 400 congressmen by looking at the party of the congressman (1 = Democrat, 0 = Republican) and whether the congressman was from the North (= 0) or South (= 1).
ADA = 
Corrected R^{2} = .55
c) We can explain 55% of the
variation in the ADA score by party and area of the country
from which the congressman comes.
d) The t-value is huge. Southern congressmen were definitely
less liberal as ranked by the ADA than northern congressmen
after we take into account their party.
e) That a Democrat on average and after taking region into
account was rated 44.278 points more liberal than a
Republican.
f) First we need to compute the standard error of the
regression coefficient. t = (b - 0)/se;
-13.04 = -30.903/se; se = 30.903/13.04 = 2.37.
The confidence interval will be roughly -30.9 ±
1.64*2.37, or -30.9 ± 3.9.
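Answer f backs the standard error out of the t-value and then builds an interval. A minimal Python sketch of that calculation follows, assuming the coefficient on South is negative (as answer d implies) and using the 1.64 multiplier from the answer; the variable names are illustrative.

```python
# Recover the standard error from t = (b - 0) / se, then build the interval.
coefficient = -30.903   # South dummy; negative sign assumed from answer d
t_value = -13.04
multiplier = 1.64       # multiplier used in the answer

std_error = coefficient / t_value      # se = b / t
margin = multiplier * std_error
low, high = coefficient - margin, coefficient + margin

print(round(std_error, 2))
print(round(margin, 1))
print(round(low, 1), round(high, 1))
```

The interval stays well below zero, which matches the conclusion that the Southern effect is not just random chance.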
5. Below are some data for used Cadillacs from several years ago. In addition to the price of the car, they include the age of the car, the year of the car, and the number of miles on the car. Using regression to see how price depends on age and miles (miles are measured in thousands, i.e. 1 = 1,000 miles) gives the results below:
Model Summary: Predictors: (Constant), AGE, MILES; Dependent Variable: PRICE
R       R Square    Adjusted R Square    Std. Error of the Estimate
.928    .861        .852                 3010.4572

ANOVA
              Sum of Squares    df    Mean Square      F         Sig.
Regression    1742354183.829    2     871177091.915    96.126    .000
Residual      280948430.641     31    9062852.601
Total         2023302614.471    33

              Unstandardized Coefficients     Standardized Coefficients
              B            Std. Error         Beta                         t         Sig.
(Constant)    28213.880    1185.683                                        23.795    .000
MILES         -117.329     24.242             -.402                        -4.840    .000
AGE           -1125.172    148.238            -.631                        -7.590    .000
a) The regression explains
86.1% of the original variation in car prices; about 14% is
noise or due to variables that are not included in this
regression.
b) It looks like it is much better than random. The
significance of the F-statistic, which measures this, says
that we would get this much explanation of variation (86.1%)
much less than one time in a thousand just by random chance.
Something more than random chance seems to be involved
here.
c) $28213.88
d) $28213.88 - $117.33*5 - $1125.17 = $26502.06
e) $28213.88 - $117.33*30 - $1125.17*6 = $17942.96
f) For each year older, the car drops in value by
$1125.172.
g) When an extra thousand miles is added, the value drops by
$117.329.
h) Because the regression coefficients have those signs. The
t-value tests the hypothesis that the true regression
coefficient is zero, so it is computed as (b - 0)/(standard
error).
i) We should reject the hypothesis that age has nothing to
do with price. We have "proven" that age
matters.
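The answers to problem 5 can be checked against the regression output with a short Python sketch. The coefficients are rounded as in answers d and e, and the sums of squares and degrees of freedom come from the ANOVA table; the function and variable names are assumptions.

```python
# Cadillac price regression: coefficients rounded as in the worked answers.
CONSTANT = 28213.88
B_MILES = -117.33    # dollars per thousand miles
B_AGE = -1125.17     # dollars per year of age


def predict_price(miles_thousands, age_years):
    """Predicted price in dollars."""
    return CONSTANT + B_MILES * miles_thousands + B_AGE * age_years


print(round(predict_price(5, 1), 2))    # answer d
print(round(predict_price(30, 6), 2))   # answer e

# R-squared and F recomputed from the ANOVA table.
ss_regression, ss_residual = 1742354183.829, 280948430.641
df_regression, df_residual = 2, 31
r_squared = ss_regression / (ss_regression + ss_residual)
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)
print(round(r_squared, 3), round(f_statistic, 3))
```

R-squared is the regression sum of squares as a share of the total, and F is the ratio of the two mean squares, which is why both can be rebuilt from the ANOVA table alone.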
