Answers: Regression


Answers: Multivariate Regression 1

1. I wanted to see what determined the gas mileage of two-wheel-drive sports utility vehicles. I found a publication on the Internet, and entered some of the data it had into SPSS. I ran a regression, choosing miles per gallon as my dependent variable and engine size (measured in liters) and the type of transmission as independent variables. I got a bunch of output, but now I need some help making sense of it.

a) There were two types of transmission, manual and automatic. I put in a 1 for automatic and a zero for manual. There is a name for this type of variable, but I cannot remember what it was. Can you help me?

b) I remember hearing that R Square was an important number. Mine is .825. But I am not sure if this is good or bad. Can you help me make sense of this number?

c) Is there anything in the ANOVA table that I should pay attention to?

d) I seem to have negative unstandardized coefficients. What do they mean? Should I worry that they are negative, or be happy about that?

e) I seem to remember that the t and Sig were used to test hypotheses. What hypotheses are we testing here? And what do we conclude?

f) I was hoping that the data I found would have included information about the weight of the car because I have often heard that heavy cars get much fewer miles per gallon. But there was nothing about weight in the publication I found. Should I worry about not having this variable in my regression? What do you think the effects of ignoring it are?

a) Dummy variable
b) The R-squared says that for this sample 82.5% of the variation in the gas mileage can be explained using the variables engine size and type of transmission.
c) The F-statistic tells you whether your overall results look like they are random or not. In effect it tests the hypothesis that the true R-squared is zero, and that what you have in the sample is merely random chance.
d) Negative regression coefficients mean that when the independent variable in question increases, the dependent variable tends to decrease. They are common and neither good nor bad.
e) The t-values are used to test the hypotheses that the regression coefficients that you found are simply random, that is, that the true regression coefficients for the population are zero and that what you are finding may simply be the result of the peculiarities of the sample you were using. If the level of significance is high (greater than .05 or .1, for example), you should be concerned that you may not be finding a real relationship. If it is very low (less than .01), you should be dubious of the claim that there is no real relationship.
f) If you miss variables in a regression that should be there, your results may be misleading. In this case large engine size is probably correlated with weight of the car, so that some of what the regression is attributing to engine size may actually be the effect of weight.
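
To see how leaving out weight could distort the engine-size coefficient, here is a minimal sketch. Every number in it is made up for illustration (the engine sizes, the weights, and the assumed "true" equation); none of it comes from the SUV data.

```python
# Hypothetical numbers only: none of these values come from the SUV data.
engine = [2.0, 3.0, 4.0, 5.0]   # engine size in liters (made up)
weight = [3.0, 4.0, 4.0, 5.0]   # weight in thousands of pounds (made up)

# Assume the "true" relationship is mpg = 40 - 2*engine - 3*weight, with no noise.
mpg = [40 - 2 * e - 3 * w for e, w in zip(engine, weight)]


def simple_slope(x, y):
    """Least-squares slope from regressing y on x alone."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Regress mpg on engine size only, ignoring weight.
short_slope = simple_slope(engine, mpg)
print(f"true engine effect: -2.0, estimated without weight: {short_slope:.1f}")
```

In this made-up example the regression that ignores weight reports -3.8 miles per gallon per liter instead of -2.0, because the bigger engines also sit in the heavier cars and the engine coefficient picks up part of the weight effect.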

2. To get to the state meet in cross-country in Indiana in 2001, a team had to compete to advance to the semi-state meet. Twenty teams and assorted individuals (who had already survived sectionals and regionals) competed, and to advance, a runner either had to be on one of the top five teams or in the top 15 individuals. There were four semi-state competitions, and they left a trail of numbers.

a) The data have 60 observations representing the top 15 finishers from each semi-state competition.

One question of some interest is whether all four semi-state competitions are equally tough. Below is a graph that shows the placing of each of these 60 girls grouped by the semi-state they came from. What do you see? Do all semi-states look equally tough? (There were more than 60 girls in the race—the data only include those who were in the top 15 in each semi-state.)

The first and second categories seem to be much tougher than the fourth category.

b) Analysis of Variance is a way to statistically test what we are eyeballing in part a. The null hypothesis is that all the groups are the same, or that the means of the subgroups are the same as the overall mean. If we run ANOVA, we get the following table. What does the level of significance tell you? (Your null hypothesis is that all the semi-states are the same. Do you keep this hypothesis, or do you have substantial evidence that it is wrong?)

                  Sum of Squares    df    Mean Square    F        Sig
Between Groups    12046.467          3    4015.489       6.226    .001
Within Groups     36115.467         56     644.919
Total             48161.933         59

If all four semi-states really were equally tough, the probability of getting differences this large by random chance would be only about one in a thousand. In other words, it is highly unlikely that we would get this pattern randomly, so we reject the null hypothesis: it appears that the different groups are in fact different.
One of the great things about levels of significance is that they let you interpret a statistical result even if you do not understand the test. Analysis of Variance or ANOVA is discussed in the next section. Even if you do not know how this test is performed or what an F value is, you can still understand that the test is saying that the result does not look like something that is random.
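
For readers who do want to see where the F value comes from, the arithmetic can be sketched from the sums of squares and degrees of freedom in the table. This only reproduces the mean squares and the F ratio; the level of significance itself comes from the F distribution, which is not computed here.

```python
# Values copied from the ANOVA table for placing by semi-state.
ss_between, df_between = 12046.467, 3
ss_within, df_within = 36115.467, 56

# A mean square is a sum of squares divided by its degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F compares variation between groups with variation within groups.
f_ratio = ms_between / ms_within

print(round(ms_between, 3), round(ms_within, 3), round(f_ratio, 3))
```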

c) We can do exactly the same analysis with regression. We do it with what are called dummy variables, which are variables that indicate off/on: off = 0, on = 1. We will use three of them, which are labeled fc (Franklin Central), man (Manchester) and np (New Prairie). How does the ANOVA table you get here compare with the ANOVA table in part b? (They are different because in part b we are analyzing rank, or how they placed, and below we are analyzing their running time in seconds.)
What does the R-squared tell us?
According to the regression coefficients, girls from which semi-state (FC, NP, MAN, or Terre Haute, which is not listed separately) are slowest?

The ANOVA results have the same level of significance, .001, which says that the chance of getting a result like this randomly is only one in a thousand.
The R-squared says that about 25% of the variation is explained by the groupings, which in this case seems to be quite a lot.
The girls from Terre Haute averaged 933.5 seconds, and they were the slowest. The girls from the other three semi-states had faster times because the regression coefficients are all negative. The girls from NP, for example, averaged 933.5 - 37.3 seconds or 896.2 seconds.

R SQUARE = .247

ADJUSTED R SQUARE = .207

              Sum of Squares    df    Mean Square    F        Sig
Regression    15075.650          3    5025.217       6.136    .001
Residual      45865.200         56     819.021
Total         60940.850         59

            Regression Coefficient    StdError    t          Sig
Constant    933.467                    7.389      126.327    .000
NP          -37.267                   10.450       -3.566    .001
FC          -32.800                   10.450       -3.139    .003
Man          -8.00                    10.450        -.766    .447
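
With dummy variables, the constant is the average for the group that has no dummy (Terre Haute), and each coefficient is that group's difference from it. A short sketch of that arithmetic, using the coefficients from the table above (the dictionary names are mine, not SPSS output):

```python
# Coefficients copied from the regression table above.
constant = 933.467                      # average time for the baseline group
coefficients = {"NP": -37.267, "FC": -32.800, "Man": -8.00}

# Baseline group: all three dummies are 0, so the prediction is the constant.
group_means = {"Terre Haute": constant}

# Every other group: constant plus that group's dummy coefficient.
for name, b in coefficients.items():
    group_means[name] = constant + b

# The slowest group is the one with the largest average time.
slowest = max(group_means, key=group_means.get)

for name, mean_time in group_means.items():
    print(f"{name}: {mean_time:.1f} seconds")
print("slowest:", slowest)
```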

d) If we look at the scatter diagram of state times and semistate times, we get the picture below. If we look at the correlation, we get the correlation below. What do they tell us?

Correlation between state time and semi-state time = .628; significance is .000 (or less than 1 in 1000).

It appears that the girls who ran fast in the semi-state ran fast at state, and the girls who were slower at semi-state were slower at the state meet.

e) If we try to explain the time that a girl ran in the state meet with a regression using her time in the semi-state meet as the independent variable, we get the following. How well does the semi-state time predict the state time? What would we predict for a girl who ran a semi-state time of 900 seconds (that is 15 minutes)? Do we see regression toward the mean?

We can explain almost 40% of the variation in state times using semi-state times. A girl who ran in 900 seconds in the semi-state would be expected to run in 243.037 + .742*900 = 910.837. Regression toward the mean means that girls who excelled at the semi-state meet did well at the state meet, but not quite as well as they did at the semi-state meet, and girls who were slower at the semi-state meet were also slower, but not as much slower. In other words, some of those who did well at the semi-state had the race of their lives at semi-state and did not repeat at the state level, while there were some others who were a bit off at the semi-state level and who redeemed themselves at the state meet. We can tell this because the regression coefficient is less than one. If the regression coefficient were greater than one, there would be no regression toward the mean. It would mean, rather, that those who did well at the semi-state did even better at the state level as compared to those who were a bit slower.

R SQUARE = .395

ADJUSTED R SQUARE = .384

              Sum of Squares    df    Mean Square    F         Sig
Regression    24059.334          1    24059.334      37.836    .000
Residual      36881.516         58      635.888
Total         60940.850         59

             Regression Coefficient    StdError    t        Sig
Constant     243.037                   109.121     2.227    .030
Semi-time       .742                      .121     6.151    .000
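
The prediction and the regression toward the mean can both be checked with a few lines. The coefficients come from the table above; the two semi-state times of 880 and 980 seconds are hypothetical runners chosen only to show how a gap shrinks.

```python
# Coefficients copied from the simple regression of state time on semi-state time.
intercept, slope = 243.037, 0.742


def predict_state_time(semi_time):
    """Predicted state-meet time (seconds) from a semi-state time (seconds)."""
    return intercept + slope * semi_time


print(round(predict_state_time(900), 3))

# Two hypothetical runners who finished 100 seconds apart at semi-state.
fast, slow = 880.0, 980.0
gap_at_state = predict_state_time(slow) - predict_state_time(fast)
print(round(gap_at_state, 1))
```

Because the slope is less than one, the 100-second gap at semi-state becomes a predicted gap of only 74.2 seconds at state, which is the regression toward the mean described in the answer.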

f) We might do better than this if we also include the variables that tell which semistate a runner came from. Those results are below. By how much has our R-squared increased? Which semistate has the fastest course? Which seems to have the slowest course? What time would we predict for a girl who runs 900 seconds in the NP semistate?

This is a good illustration of how including more variables can change results. The regression coefficients are very different from what they looked like in the other regressions. The R-squared is now .736, so we can explain 73.6% of the variation in times based on these variables. Manchester has the fastest course. The girls from Manchester ran slower by over 49 seconds at the state meet when we take their semi-state times into account. This indicates that the semi-state course was a fast course. The slowest course was New Prairie, but not by much over the default of Terre Haute. We would predict that a girl who runs 900 seconds at NP would run -162.846 + 1.174*900 - 7.529 = 886.225. By the way, there is no regression toward the mean in these results. The girls who were the best at the semi-state stepped it up a notch at the state meet. Tougher competition for the lead improved their performances.

R SQUARE = .736

ADJUSTED R SQUARE = .716

              Sum of Squares    df    Mean Square    F         Sig
Regression    44830.269          4    11207.567      38.262    .000
Residual      16110.581         55      292.920
Total         60940.850         59

             Regression Coefficient    StdError    t         Sig
Constant     -162.846                  108.865     -1.496    .140
Semi-time       1.174                     .116     10.079    .000
NP             -7.529                    6.911     -1.089    .281
FC             18.928                    8.087      2.341    .023
Man            49.363                    8.452      5.840    .000
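
As a check on the numbers in this part, here is a sketch that recomputes the prediction for a 900-second runner from NP and rebuilds R-squared and adjusted R-squared from the sums of squares. The function name and the "TH" label for the Terre Haute baseline are my own; the adjusted R-squared line uses the usual formula of one minus the ratio of residual mean square to total mean square.

```python
# Coefficients copied from the regression that includes the semi-state dummies.
constant = -162.846
b_semi_time = 1.174
dummy_coef = {"NP": -7.529, "FC": 18.928, "Man": 49.363, "TH": 0.0}  # TH = baseline


def predict_state_time(semi_time, semistate):
    """Predicted state time for a runner from the given semi-state."""
    return constant + b_semi_time * semi_time + dummy_coef[semistate]


prediction_np = predict_state_time(900, "NP")

# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression, ss_residual, ss_total = 44830.269, 16110.581, 60940.850
df_residual, df_total = 55, 59

r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (ss_residual / df_residual) / (ss_total / df_total)

print(round(prediction_np, 3), round(r_squared, 3), round(adjusted_r_squared, 3))
```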

3. Two alumni of the University of Wisconsin wrote a research paper trying to explain variations in salary levels of 45 members of the University of Wisconsin economics department. Here were their published results:

REGRESSION OF FACULTY SALARY LEVELS ON EXPERIENCE, PUBLISHING PERFORMANCE, TEACHING PERFORMANCE AND ADMINISTRATIVE DUTIES

Independent Variables             Regression Coefficient    Standard Error    t-ratio
                                  (in dollars)              (in dollars)
Experience                           253.28                   59.71           4.24*
Monographs                            -5.72                  162.01          -0.04
Articles in National Journals        392.46                   90.64           4.33*
Articles in Specialty Journals       344.59                   90.45           3.81*
Other Publications                    76.49                   24.31           3.15*
Transformed Teaching Score           731.67                  429.82           1.70
Administrative Duties              5,208.90                  807.46           6.45*
Intercept                         12,127.10

*Significant at the .01 level

Corrected R² = .881, standard error = $1,735, n = 45

Source: American Economic Review, May 1973 (Vol. 63 No. 2) p 313

a) What does the little note at the bottom, "Significant at the .01 level" mean?

b) How much does salary go up, on the average, as a result of an additional publication in a national journal?

c) How much salary would this equation predict for a professor who had three years of experience, two articles in specialty journals but no publications elsewhere (i.e., national journals, monographs, other publications), has no administrative duties and a teaching score of zero?

d) Find a 95% confidence interval for the regression coefficient on Other Publications.

e) What information does the R² = .881 give you?

a) It means that we have strong reason to believe that the variables involved actually have a real impact. There are two that are not significant at this level, monographs and transformed teaching score. We have no strong evidence that they matter in determining pay.
b) By $392.46.
c) $12127.10 + 3*$253.28 + 2*$344.59 = $13576.12
d) Roughly $76.49 ± 2*$24.31 or $76.49 ± $48.62, that is, from about $27.87 to $125.11.
e) We can explain 88.1% of the variation in salaries using these variables. We cannot account for 11.9% of the variation. Some of that may be random, and some of it may reflect a variable or two that we should have included but either could not measure or do not know about.
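
The arithmetic in parts c and d can be sketched as follows. Only the coefficients that matter for this professor are included, since every other variable is zero; the multiplier of 2 is the same rough rule of thumb for a 95% interval used in the answer, not an exact t-value.

```python
# Coefficients copied from the published salary regression (dollars).
intercept = 12127.10
b_experience = 253.28
b_specialty_articles = 344.59

# Part c: three years of experience, two specialty-journal articles, all else zero.
predicted_salary = intercept + 3 * b_experience + 2 * b_specialty_articles
print(round(predicted_salary, 2))

# Part d: rough 95% confidence interval, coefficient plus or minus 2 standard errors.
b_other, se_other = 76.49, 24.31
margin = 2 * se_other
lower, upper = b_other - margin, b_other + margin
print(round(lower, 2), round(upper, 2))
```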

4. Several different groups attempt to measure how conservative or liberal congressmen are. Among these groups were the Americans for Democratic Action (ADA), the AFL-CIO Committee on Political Education (COPE), the National Farmers Union (NFU), and the Americans for Constitutional Action (ACA). Below is a matrix showing correlations between ratings given by these various interest groups many years ago. (Data from Journal of Law and Economics, Dec. 1979, p 369.)

Correlation of Congressional Ratings, 1973

Groups:    ADA        COPE       NFU        ACA
ADA
COPE        0.850*
NFU         0.715*     0.819*
ACA        -0.897*    -0.891*    -0.800*

*Significant at the .01 level.

a) What information are you given when you are told that all of these correlations are significant at the .01 level?

b) How can you explain the negative correlation between ACA and ADA? (Hint: the ADA measured how "liberal" a congressman was.)

a) It appears that there are real relationships between the various pairs of variables, that what we are seeing is more than just random chance.
b) The ACA measures how conservative a congressman was. If a congressman gets a high liberal score, he or she also gets a low conservative score.

Here is a regression that tried to explain the ADA rating of over 400 congressmen by looking at the party of the congressman (1 = Democrat, 0 = Republican) and whether the congressman was from the North (= 0) or South (= 1).

ADA = 23.890 - 30.903 DNS + 44.278 Party        Corrected R² = .55
      (1.67)   (-13.04)     (2.15)

c) What information does the R2 = .55 give you? (It is actually a corrected R2, with a little line over the R, but I cannot do that for this page)

d) The values in parentheses are t-values. Can we conclude with confidence that northern congressmen were more liberal than southern congressmen? Explain carefully.

e) The ADA rating is on a scale from 0 to 100. What does the regression coefficient in front of Party tell us?

f) Compute a 90% confidence interval for the regression coefficient of DNS.

c) We can explain 55% of the variation in the ADA score by party and area of the country from which the congressman comes.
d) The t-value is huge. Southern congressmen were definitely less liberal as ranked by the ADA than northern congressmen after we take into account their party.
e) That a Democrat on average and after taking region into account was rated 44.278 points more liberal than a Republican.
f) First we need to compute the standard error of the regression coefficient. t = (b - 0)/se;
-13.04 = -30.903/se; se = 30.903/13.04 = 2.37.
The confidence interval will be roughly -30.9 ± 1.64*2.37 or -30.9 ± 3.9, that is, from about -34.8 to -27.0.
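
The same steps, backing the standard error out of the t-value and then building the 90% interval around the DNS coefficient, can be sketched as below. The multiplier 1.64 is the rounded normal value used in the answer; with over 400 observations the t and normal values are nearly the same.

```python
# Coefficient and t-value for DNS, copied from the regression above.
b_dns = -30.903
t_dns = -13.04

# t = (b - 0) / se, so se = b / t.
se_dns = b_dns / t_dns

# 90% confidence interval: coefficient plus or minus 1.64 standard errors.
margin = 1.64 * se_dns
lower, upper = b_dns - margin, b_dns + margin

print(round(se_dns, 2), round(lower, 1), round(upper, 1))
```

The whole interval lies below zero, which agrees with the conclusion in part d that southern congressmen were rated less liberal.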

5. Below are some data for used Cadillacs from several years ago. In addition to the price of the car, they include the age of the car, the year of the car, and the number of miles the car has. Using regression to see how price depended on age and miles (Miles are measured in thousands--i.e. 1 = 1k), gives the results below:

Model Summary: Predictors: (Constant), AGE, MILES; Dependent Variable: PRICE

R       R Square    Adjusted R Square    Std. Error of the Estimate
.928    .861        .852                 3010.4572

ANOVA

              Sum of Squares    df    Mean Square      F         Sig.
Regression    1742354183.829     2    871177091.915    96.126    .000
Residual       280948430.641    31      9062852.601
Total         2023302614.471    33

              Unstandardized Coefficients    Standardized Coefficients
              B             Std. Error       Beta                         t         Sig.
(Constant)    28213.880     1185.683                                      23.795    .000
MILES          -117.329       24.242         -.402                        -4.840    .000
AGE           -1125.172      148.238         -.631                        -7.590    .000

Dependent Variable: PRICE

a) How much of the variation in price in this sample can I explain?

b) Is my explanation of variation in price just random, or do I seem to have something more than randomness? Explain.

c) Based on my results, how much should a Caddy be worth that is zero years old and with zero miles?

d) If we have a Caddy that is one year old and has 5000 miles on it, what would we predict for its price?

e) If we have a Cadillac that is six years old and has 30,000 miles, what would we predict for its price?

f) According to these results, when age increases by a year, by how much should price change?

g) According to these results, when mileage increases by 1k, by how much should price change?

h) The t-value for the Constant is positive (23.795), while those for miles and age are negative. Why?

i) We test the hypothesis that age has nothing to do with the price of these cars. What should we conclude and why?

a) The regression explains 86.1% of the original variation in car prices; about 14% is noise or due to variables that are not included in this regression.
b) It looks like it is much better than random. The significance of the F-statistic, which measures this, says that we would get this much explanation of variation (86.1%) much less than one time in a thousand just by random chance. Something more than random chance seems to be involved here.
c) $28213.88
d) $28213.88 - $117.33*5 - $1125.17 = $26502.06
e) $28213.88 - $117.33*30 - $1125.17*6 = $17942.96
f) For each year older, the car drops in value by $1125.172.
g) When an extra thousand miles is added, the value drops by $117.329.
h) Because the regression coefficients have those signs. The t-value is testing the hypothesis that the true regression coefficient is zero, so it is computed as (b - 0)/(standard error).
i) We should reject the hypothesis that age has nothing to do with price. We have "proven" that age matters.
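
The predictions in parts c through e, and the signs of the t-values in part h, can be reproduced with a short sketch. It uses the unrounded coefficients from the SPSS table, so the results may differ by a cent or two from the answers above, which round the coefficients to two decimals first. The function and variable names are mine.

```python
# Unstandardized coefficients and standard errors copied from the SPSS output.
b_constant, se_constant = 28213.880, 1185.683
b_miles, se_miles = -117.329, 24.242      # miles measured in thousands
b_age, se_age = -1125.172, 148.238        # age measured in years


def predict_price(age_years, miles_thousands):
    """Predicted price in dollars for a used Cadillac."""
    return b_constant + b_miles * miles_thousands + b_age * age_years


print(round(predict_price(0, 0), 2))    # part c: brand new, zero miles
print(round(predict_price(1, 5), 2))    # part d: one year old, 5,000 miles
print(round(predict_price(6, 30), 2))   # part e: six years old, 30,000 miles

# Part h: t = (b - 0) / standard error, so t has the same sign as b.
t_constant = b_constant / se_constant
t_miles = b_miles / se_miles
t_age = b_age / se_age
print(round(t_constant, 3), round(t_miles, 3), round(t_age, 3))
```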
