| | Sum of Squares | df | Mean Square | F | Sig |
| Between Groups | 12046.467 | 3 | 4015.489 | 6.226 | .001 |
The probability of getting a
pattern this different from what we would expect by random
chance alone is only one in a thousand. In other words, it is
highly unlikely that we would get this pattern randomly, so it
appears that the different groups are in fact different.
One of the great things about levels of significance is that
they let you interpret a statistical result even if you do
not understand the test. Analysis of Variance or ANOVA is
discussed in the next section. Even if you do not know how
this test is performed or what an F value is, you can still
understand that the test is saying that the result does not
look like something that is random.
c) We can do exactly the same analysis with regression.
We do it with what are called Dummy Variables, which are
variables that indicate off/on. Off = 0, On = 1. We will use
three of them, which are labeled fc (Franklin Central), man
(Manchester) and np (New Prairie). How does the ANOVA table
you get here compare with the ANOVA table in part a?
(They are different because in part a we are
analyzing rank or how they placed, and below we are
analyzing their running time in seconds.)
What does the R-squared tell us?
According to the regression coefficients, girls from which
semistate (FC, NP, MAN, or Terre Haute, which is not listed
separately) are slowest?
The ANOVA results have the
same level of significance, .001, which says that the
chance of getting a result like this at random is
only one in a thousand.
The R squared says that about 25% of the variation is
explained by the groupings, which in this case seems to be
quite a lot.
The girls from Terre Haute averaged 933.5 seconds. The girls
from the other three semistates had faster times because the
regression coefficients are all negative. The girls from NP,
for example, averaged 933.5 - 37.3 seconds or 896.2
seconds.
R SQUARE = .247
ADJUSTED R SQUARE = .207
| | Sum of Squares | df | Mean Square | F | Sig |
| Regression | 15075.650 | 3 | 5025.217 | 6.136 | .001 |
| | Regression Coefficient | StdError | t | Sig |
| Constant | 933.467 | 7.389 | 126.327 | .000 |
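The way a dummy-variable regression turns into group averages can be sketched in a few lines. This is only an illustration: it uses the two values quoted in the answer (the constant 933.5 and the NP coefficient -37.3), and the function name `predicted_time` is made up for this sketch.

```python
# Values quoted in the answer to part c: the constant is the Terre Haute
# average, and -37.3 is the coefficient on the np (New Prairie) dummy.
CONSTANT = 933.5
COEFFICIENTS = {"np": -37.3}


def predicted_time(dummies):
    """Constant plus the coefficient of every dummy that is switched on (1)."""
    return CONSTANT + sum(COEFFICIENTS[name] * on for name, on in dummies.items())


terre_haute = predicted_time({"np": 0})  # all dummies off: the baseline group
new_prairie = predicted_time({"np": 1})  # np dummy on

print(round(terre_haute, 1), round(new_prairie, 1))
```

With every dummy set to 0 the prediction is just the constant, which is why the group without its own dummy serves as the baseline that the other coefficients are measured against.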
d) If we look at the scatter diagram of state times and semistate times, we get the picture below. If we look at the correlation, we get the correlation below. What do they tell us?
Correlation between state time and semi-state time = .628; significance is .000 (or less than 1 in 1000).
It appears that the girls who ran fast in the semi-state ran fast at state, and the girls who were slower at semi-state were slower at the state meet.
e) If we try to explain the time that a girl ran in the state meet with a regression that uses her time in the semi-state meet as the independent variable, we get the following. How well does the semi-state time predict the state time? What would we predict for a girl who ran a semistate time of 900 seconds (that is 15 minutes)? Do we see regression toward the mean?
We can explain almost 40% of the variation in state times using semi-state times. A girl who ran 900 seconds in the semi-state would be expected to run 243.037 + .742*900 = 910.837 seconds at state.
Regression toward the mean means that girls who excelled at the semi-state meet did well at the state meet, but not quite as well as they did at the semi-state meet, and girls who were slower at the semi-state meet were also slower, but not as much slower. In other words, some of those who did well at the semi-state had the race of their lives there and did not repeat it at the state level, while some others who were a bit off at the semi-state level redeemed themselves at the state meet. We can tell this because the regression coefficient is less than one. If the regression coefficient were greater than one, there would be no regression toward the mean. It would mean, rather, that those who did well at the semi-state did even better at the state level as compared to those who were a bit slower.
R SQUARE = .395
ADJUSTED R SQUARE = .384
| | Sum of Squares | df | Mean Square | F | Sig |
| Regression | 24059.334 | 1 | 24059.334 | 37.836 | .000 |
| | Regression Coefficient | StdError | t | Sig |
| Constant | 243.037 | 109.121 | 2.227 | .030 |
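The prediction in part e and the effect of a slope below one can be checked with a short sketch. The intercept 243.037 and slope .742 come from the answer above; the function name and the 100-second comparison are illustrative choices, not part of the original problem.

```python
# Coefficients quoted in the answer to part e.
INTERCEPT = 243.037
SLOPE = 0.742


def predict_state(semistate_seconds):
    """Predicted state-meet time from a semi-state time, in seconds."""
    return INTERCEPT + SLOPE * semistate_seconds


fast = predict_state(900)   # the runner asked about in the question
slow = predict_state(1000)  # a hypothetical runner 100 seconds slower

# A 100-second gap at the semi-state shrinks to SLOPE * 100 seconds at state.
gap_at_state = slow - fast

# The semi-state time at which the predicted state time equals the input.
crossover = INTERCEPT / (1 - SLOPE)

print(round(fast, 3), round(gap_at_state, 1), round(crossover, 1))
```

Runners faster than the crossover time are predicted to run a little slower at state, and runners slower than it are predicted to run a little faster, which is the regression toward the mean described in the answer.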
f) We might do better than this if we also include the variables that tell which semistate a runner came from. Those results are below. By how much has our R-squared increased? Which semistate has the fastest course? Which seems to have the slowest course? What time would we predict for a girl who runs 900 seconds in the NP semistate?
This is a good illustration of how including more variables can change results. The regression coefficients are very different from what they looked like in the other regressions. The R-squared is now .736, up from .395, so we can explain 73.6% of the variation in times based on these variables. Manchester has the fastest course. The girls from that semistate ran slower by over 49 seconds at the state meet when we take their semi-state times into account, which indicates that their semi-state course was a fast one. The slowest course was New Prairie, but not by much over the default of Terre Haute. We would predict that a girl who runs 900 seconds at NP would run -162.846 + 1.174*900 - 7.529 = 886.225 seconds. By the way, there is no regression toward the mean in these results. The girls who were the best at the semi-state stepped it up a notch at the state meet. Tougher competition for the lead improved their performances.
R SQUARE = .736
ADJUSTED R SQUARE = .716
| | Sum of Squares | df | Mean Square | F | Sig |
| Regression | 44830.269 | 4 | 11207.567 | 38.262 | .000 |
| | Regression Coefficient | StdError | t | Sig |
| Constant | -162.846 | 1008.865 | -1.496 | .140 |
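The part f prediction combines a continuous variable with a dummy, which the following sketch spells out. It uses only the numbers quoted in the answer (constant -162.846, slope 1.174, NP coefficient -7.529); the fc and man coefficients are not quoted there, so they are left out, and the function name is hypothetical.

```python
# Coefficients quoted in the answer to part f.
INTERCEPT = -162.846
SEMISTATE_SLOPE = 1.174
NP_COEF = -7.529


def predict_state_time(semistate_seconds, from_np):
    """Predicted state time; from_np is the 0/1 dummy for New Prairie."""
    return INTERCEPT + SEMISTATE_SLOPE * semistate_seconds + NP_COEF * from_np


np_runner = predict_state_time(900, 1)
baseline_runner = predict_state_time(900, 0)  # Terre Haute, the default group

# Gain in R-squared from adding the semistate dummies (.395 -> .736).
r2_gain = 0.736 - 0.395

print(round(np_runner, 3), round(baseline_runner, 3), round(r2_gain, 3))
```

Because the slope is above one here, a 100-second difference at the semi-state becomes a 117.4-second difference at state, which is why the answer says there is no regression toward the mean in this model.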
3. Two alumni of the University of Wisconsin wrote a research paper trying to explain variations in salary levels of 45 members of the University of Wisconsin economics department. Here were their published results:
REGRESSION OF FACULTY SALARY LEVELS OF EXPERIENCE, PUBLISHING PERFORMANCE, TEACHING PERFORMANCE AND ADMINISTRATIVE DUTIES |
| Independent Variables | Coefficient (in dollars) | Standard Error (in dollars) | t |
| Experience | 253.28 | 59.71 | 4.24* |
| Monographs | -5.72 | 162.01 | -0.04 |
| Articles in National Journals | 392.46 | 90.64 | 4.33* |
| Articles in Specialty Journals | 344.59 | 90.45 | 3.81* |
| Other Publications | 76.49 | 24.31 | 3.15* |
| Transformed Teaching Score | 731.67 | 429.82 | 1.70 |
| Administrative Duties | 5,208.90 | 807.46 | 6.45* |
| Intercept | 12,127.10 | | |
*Significant at the .01 level
Source: American Economic Review, May 1973 (Vol. 63 No. 2) p 313 |
a) It means that we have
strong reason to believe that the variables involved actually
have a real impact. There are two that are not significant
at this level, monographs and transformed teaching score. We
have no strong evidence that they matter in determining
pay.
b) By $392.46.
c) $12127.10 + 3*$253.28 + 2*$344.59 = $13576.12
d) Roughly $76.49 ± 2*$24.31 or $76.49 ±
$48.62, that is, from about $27.87 to $125.11.
e) We can explain 88.1% of the variation in salaries using
these variables. We cannot account for 11.9% of the
variation. Some of that may be random, and some of it may
reflect a variable or two that we should have included but
either could not measure or do not know
about.
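The arithmetic in answers c) and d), and the t column of the table, can be reproduced with a short sketch. The coefficients and standard errors are copied from the published table; reading the 3 and the 2 in answer c) as years of experience and specialty-journal articles is an assumption based on which coefficients they multiply.

```python
# (coefficient, standard error) pairs copied from the published table.
TABLE = {
    "experience": (253.28, 59.71),
    "monographs": (-5.72, 162.01),
    "national_articles": (392.46, 90.64),
    "specialty_articles": (344.59, 90.45),
    "other_publications": (76.49, 24.31),
    "administrative_duties": (5208.90, 807.46),
}
INTERCEPT = 12127.10

# Each t value is the coefficient divided by its standard error.
t_values = {name: round(b / se, 2) for name, (b, se) in TABLE.items()}

# Answer c): intercept + 3 * experience + 2 * specialty articles.
salary = INTERCEPT + 3 * TABLE["experience"][0] + 2 * TABLE["specialty_articles"][0]

# Answer d): rough interval of coefficient +/- 2 standard errors.
b, se = TABLE["other_publications"]
low, high = b - 2 * se, b + 2 * se

print(t_values)
print(round(salary, 2), round(low, 2), round(high, 2))
```

The interval for other publications stays well above zero, which matches the asterisk on that row of the table.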
4. Several different groups attempt to measure how conservative or liberal congressmen are. Among these groups were the Americans for Democratic Action (ADA), the AFL-CIO Committee on Political Education (COPE), the National Farm Union (NFU), and the Americans for Constitutional Action (ACA). Below is a matrix showing correlations between ratings given by these various interest groups many years ago. (Data from Journal of Law and Economics, Dec. 1979, p 369.)
Groups: ADA | COPE | NFU | ACA
*Significant at the .01 level. |
a) It appears that there are
real relationships between the various pairs of variables,
that what we are seeing is more than just random chance.
b) The ACA measures how conservative a congressman was. If a
congressman gets a high liberal score, he or she also gets a
low conservative score.
Here is a regression that tried to explain the ADA rating of over 400 congressmen by looking at the party of the congressman (1 = Democrat, 0 = Republican) and whether the congressman was from the North (= 0) or South (= 1).
ADA = [constant] + 44.278*Party - 30.903*South   (t for South = -13.04)
Corrected R2 = .55
c) We can explain 55% of the
variation in the ADA score by party and area of the country
from which the congressman comes.
d) The t-value is huge. Southern congressmen were definitely
less liberal as ranked by the ADA than northern congressmen
after we take into account their party.
e) That a Democrat on average and after taking region into
account was rated 44.278 points more liberal than a
Republican.
f) First we need to compute the standard error of the
regression coefficient. t = (b-0)/se;
-13.04 = -30.903/se; se = 30.903/13.04 = 2.37.
The confidence interval will be roughly -30.9 ±
1.64*2.37 or -30.9 ± 3.9 (1.64 is the multiplier for a 90% interval).
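The two steps in answer f, backing the standard error out of the t-value and then building the interval, can be sketched as follows. The coefficient -30.903 and t-value -13.04 come from the answer; 1.64 is the multiplier used there, and the variable names are illustrative.

```python
# Values used in answer f.
COEFFICIENT = -30.903   # coefficient on the South dummy
T_VALUE = -13.04
MULTIPLIER = 1.64       # the multiplier used in the answer

# t = (b - 0) / se, so se = b / t.
std_error = COEFFICIENT / T_VALUE

margin = MULTIPLIER * std_error
interval = (COEFFICIENT - margin, COEFFICIENT + margin)

print(round(std_error, 2), round(margin, 1))
print(tuple(round(x, 1) for x in interval))
```

Since the coefficient itself is negative, the interval is centered on -30.9 and both ends stay well below zero.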
5. Below are some data for used Cadillacs from several years ago. In addition to the price of the car, they include the age of the car, the year of the car, and the number of miles the car has. Using regression to see how price depended on age and miles (Miles are measured in thousands--i.e. 1 = 1k), gives the results below:
Model Summary: Predictors: (Constant), AGE, MILES; Dependent Variable: PRICE
| R | R Square | Adjusted R Square | Std. Error of the Estimate |
| .928 | .861 | .852 | 3010.4572 |
ANOVA
| | Sum of Squares | df | Mean Square | F | Sig. |
| Regression | 1742354183.829 | 2 | 871177091.915 | 96.126 | .000 |
| Residual | 280948430.641 | 31 | 9062852.601 | | |
| Total | 2023302614.471 | 33 | | | |
| | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
| (Constant) | 28213.880 | 1185.683 | | 23.795 | .000 |
| MILES | -117.329 | 24.242 | -.402 | -4.840 | .000 |
| AGE | -1125.172 | 148.238 | -.631 | -7.590 | .000 |
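The summary statistics in this output all follow from the sums of squares and degrees of freedom in the ANOVA table, which a short sketch can confirm. The inputs are copied from the table; the formulas are the standard textbook ones, not necessarily how the original software computed them.

```python
import math

# Values copied from the ANOVA table.
SS_REGRESSION = 1742354183.829
SS_RESIDUAL = 280948430.641
DF_REGRESSION = 2
DF_RESIDUAL = 31

ss_total = SS_REGRESSION + SS_RESIDUAL
df_total = DF_REGRESSION + DF_RESIDUAL

# R-squared: share of the total variation explained by the regression.
r_squared = SS_REGRESSION / ss_total

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = SS_REGRESSION / DF_REGRESSION
ms_residual = SS_RESIDUAL / DF_RESIDUAL

# F is the ratio of the two mean squares.
f_statistic = ms_regression / ms_residual

# Adjusted R-squared penalizes for the number of predictors.
adjusted_r_squared = 1 - (1 - r_squared) * df_total / DF_RESIDUAL

# The standard error of the estimate is the square root of the residual mean square.
std_error_estimate = math.sqrt(ms_residual)

print(round(r_squared, 3), round(adjusted_r_squared, 3))
print(round(f_statistic, 3), round(std_error_estimate, 4))
```

Seeing R Square, Adjusted R Square, F, and the standard error of the estimate all fall out of four table entries is a useful check that the output has been read correctly.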
a) The regression explains
86.1% of the original variation in car prices; about 14% is
noise or due to variables that are not included in this
regression.
b) It looks like it is much better than random. The
significance of the F-statistic, which measures this, says
that we would get this much explanation of variation (86.1%)
much less than one time in a thousand just by random chance.
Something more than random chance seems to be involved
here.
c) $28213.88
d) $28213.88 - $117.33*5 - $1125.17 = $26502.06
e) $28213.88 - $117.33*30 - $1125.17*6 = $17942.96
f) For each year older, the car drops in value by
$1125.172.
g) When an extra thousand miles is added, the value drops by
$117.329.
h) Because the regression coefficients have those signs. The
t-value is testing the hypothesis that the true regression
coefficient is zero, so it is computed as (b - 0)/(standard
error).
i) We should reject the hypothesis that age has nothing to
do with price. We have "proven" that age
matters.
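The price predictions in answers d) and e) and the t-values discussed in answer h) can be reproduced with a brief sketch. The predictions use the coefficients rounded to cents, as the answers do, and the t-values use the full coefficients and standard errors from the table; the function name `predict_price` is made up for this example.

```python
# Coefficients rounded to cents, as in answers d) and e).
CONSTANT = 28213.88
MILES_COEF = -117.33   # dollars per thousand miles
AGE_COEF = -1125.17    # dollars per year of age


def predict_price(miles_thousands, age_years):
    """Predicted price in dollars for a car with the given mileage and age."""
    return CONSTANT + MILES_COEF * miles_thousands + AGE_COEF * age_years


price_d = predict_price(5, 1)    # answer d): 5 thousand miles, 1 year old
price_e = predict_price(30, 6)   # answer e): 30 thousand miles, 6 years old

# Answer h): t = (b - 0) / standard error, using the full table values.
t_miles = -117.329 / 24.242
t_age = -1125.172 / 148.238

print(round(price_d, 2), round(price_e, 2))
print(round(t_miles, 2), round(t_age, 2))
```

Both t-values are negative simply because both coefficients are negative: more miles and more years each lower the predicted price.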