Problems: Regression


Problems: Multivariate Regression 1

1. I wanted to see what determined the gas mileage of two-wheel-drive sports utility vehicles. I found a publication on the Internet, and entered some of the data it had into SPSS. I ran a regression, choosing miles per gallon as my dependent variable and engine size (measured in liters) and the type of transmission as independent variables. I got a bunch of output, but now I need some help making sense of it.

a) There were two types of transmission, manual and automatic. I put in a 1 for automatic and a zero for manual. There is a name for this type of variable, but I cannot remember what it was. Can you help me?

b) I remember hearing that R Square was an important number. Mine is .825. But I am not sure if this is good or bad. Can you help me make sense of this number?

c) Is there anything in the ANOVA table that I should pay attention to?

d) I seem to have negative unstandardized coefficients. What do they mean? Should I worry that they are negative, or be happy about that?

e) I seem to remember that the t and Sig were used to test hypotheses. What hypotheses are we testing here? And what do we conclude?

f) I was hoping that the data I found would have included information about the weight of the car because I have often heard that heavy cars get far fewer miles per gallon. But there was nothing about weight in the publication I found. Should I worry about not having this variable in my regression? What do you think the effects of ignoring it are?

2. To get to the state meet in cross-country in Indiana in 2001, a team had to compete to advance to the semi-state meet. Twenty teams and assorted individuals (who had already survived sectionals and regionals) competed, and to advance, a runner had to be either on one of the top five teams or among the top 15 individuals. There were four semi-state competitions, and they left a trail of numbers.

a) The data have 60 observations representing the top 15 finishers from each semi-state competition.

One question of some interest is whether all four semi-state competitions are equally tough. Below is a graph that shows the placing of each of these 60 girls grouped by the semi-state they came from. What do you see? Do all semi-states look equally tough? (There were more than 60 girls in the race—the data only include those who were in the top 15 in each semi-state.)

b) Analysis of Variance is a way to statistically test what we are eyeballing in part a. The null hypothesis is that all the groups are the same, or that the means of the subgroups are the same as the overall mean. If we run ANOVA, we get the following table. What does the level of significance tell you? (Your null hypothesis is that all the semi-states are the same. Do you keep this hypothesis, or do you have substantial evidence that it is wrong?)

                  Sum of Squares    df    Mean Square      F       Sig
Between Groups        12046.467      3       4015.489    6.226    .001
Within Groups         36115.467     56        644.919
Total                 48161.933     59
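The entries of an ANOVA table are tied together by simple arithmetic, so a quick consistency check is possible. The sketch below, in Python with variable names chosen only for illustration, recomputes the mean squares and F from the sums of squares and degrees of freedom in the table above. It does not compute the significance level, which needs an F table or a statistics package.

```python
# Values copied from the ANOVA table for placing by semi-state.
ss_between, df_between = 12046.467, 3
ss_within, df_within = 36115.467, 56

# A mean square is a sum of squares divided by its degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F compares variation between groups with variation within groups.
f_stat = ms_between / ms_within

print(f"MS between = {ms_between:.3f}")
print(f"MS within  = {ms_within:.3f}")
print(f"F          = {f_stat:.3f}")
```

The larger F is, the harder it is to believe that the group means differ only by chance; the Sig column turns that into a probability.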

c) We can do exactly the same analysis with regression. We do it with what are called dummy variables, which are variables that indicate off/on: off = 0, on = 1. We will use three of them, which are labeled fc (Franklin Central), man (Manchester) and np (New Prairie). How does the ANOVA table you get here compare with the ANOVA table in part b? (They are different because in part b we are analyzing rank, or how the girls placed, and below we are analyzing their running time in seconds.)
What does the R-squared tell us?
According to the regression coefficients, girls from which semi-state (FC, NP, MAN, or Terre Haute, which is not listed separately) are slowest?

R SQUARE = .247

ADJUSTED R SQUARE = .207

              Sum of Squares    df    Mean Square      F       Sig
Regression        15075.650      3       5025.217    6.136    .001
Residual          45865.200     56        819.021
Total             60940.850     59

            Regression Coefficient    StdError         t       Sig
Constant                  933.467       7.389    126.327      .000
NP                        -37.267      10.450     -3.566      .001
FC                        -32.800      10.450     -3.139      .003
Man                        -8.00       10.450      -.766      .447
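When a regression uses only dummy variables, the constant is the predicted value for the group that has no dummy of its own, and each coefficient is that group's difference from it. The sketch below is one way to lay that out in Python. It assumes Terre Haute is the omitted group (the one semi-state not listed separately), and the variable names are illustrative only. It also recomputes R Square as the regression sum of squares divided by the total.

```python
# Coefficients copied from the regression output above.
constant = 933.467
dummy_coefficients = {"NP": -37.267, "FC": -32.800, "Man": -8.00}

# The omitted group (assumed here to be Terre Haute, the one semi-state
# without its own dummy) is predicted by the constant alone.
predicted_time = {"Terre Haute": constant}
for group, coefficient in dummy_coefficients.items():
    predicted_time[group] = constant + coefficient

for group, seconds in predicted_time.items():
    print(f"{group:12s} {seconds:8.3f} seconds")

# R Square is the regression sum of squares over the total sum of squares.
r_squared = 15075.650 / 60940.850
print(f"R Square = {r_squared:.3f}")
```

Each predicted value is simply the average time for that group, which is why this regression and a one-way ANOVA on the same variable test the same thing.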

d) The scatter diagram of state times against semi-state times is shown below, along with the correlation between the two. What do they tell us?

Correlation between state time and semi-state time = .628; significance is .000 (or less than 1 in 1000).

e) If we use regression to explain the time that a girl ran in the state meet, with her time in the semi-state meet as the independent variable, we get the following. How well does the semi-state time predict the state time? What would we predict for a girl who ran a semi-state time of 900 seconds (that is, 15 minutes)? Do we see regression toward the mean?

R SQUARE = .395

ADJUSTED R SQUARE = .384

              Sum of Squares    df    Mean Square       F       Sig
Regression        24059.334      1      24059.334    37.836    .000
Residual          36881.516     58        635.888
Total             60940.850     59

             Regression Coefficient    StdError        t       Sig
Constant                   243.037     109.121     2.227      .030
Semi-time                     .742        .121     6.151      .000
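A fitted regression line becomes a prediction by plugging a value into constant + slope × x. The sketch below does that in Python for a few semi-state times; the helper name `predict_state_time` and the times 880 and 960 are made up for illustration. It also squares the correlation from part d, since with a single predictor R Square is the squared correlation.

```python
# Coefficients copied from the regression of state time on semi-state time.
CONSTANT = 243.037
SLOPE = 0.742

def predict_state_time(semi_time_seconds):
    """Predicted state-meet time (seconds) from a semi-state time (seconds)."""
    return CONSTANT + SLOPE * semi_time_seconds

# With a single predictor, R Square equals the squared correlation,
# up to rounding in the printed output.
correlation = 0.628
print(f"r squared = {correlation ** 2:.3f}")

# 880 and 960 are hypothetical times added for comparison with 900.
for semi_time in (880, 900, 960):
    print(semi_time, round(predict_state_time(semi_time), 3))
```

Comparing the gap between two semi-state times with the gap between their predicted state times is one way to look for regression toward the mean.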

f) We might do better than this if we also include the variables that tell which semi-state a runner came from. Those results are below. By how much has our R-squared increased? Which semi-state has the fastest course? Which seems to have the slowest course? What time would we predict for a girl who runs 900 seconds in the NP semi-state?

R SQUARE = .736

ADJUSTED R SQUARE = .716

              Sum of Squares    df    Mean Square       F       Sig
Regression        44830.269      4      11207.567    38.262    .000
Residual          16110.581     55        292.920
Total             60940.850     59

             Regression Coefficient    StdError        t       Sig
Constant                  -162.846     108.865    -1.496      .140
Semi-time                    1.174        .116    10.079      .000
NP                          -7.529       6.911    -1.089      .281
FC                          18.928       8.087     2.341      .023
Man                         49.363       8.452     5.840      .000
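With both a continuous variable and dummies in the equation, a prediction is the constant, plus the slope times the semi-state time, plus the shift for the runner's semi-state. The Python sketch below applies this to a 900-second semi-state time in each of the four meets. It assumes Terre Haute is the omitted group with a shift of zero; the function and variable names are illustrative.

```python
# Coefficients copied from the regression output above.
CONSTANT = -162.846
SEMI_TIME_SLOPE = 1.174
# Shift for each semi-state relative to the omitted one (assumed Terre Haute).
SEMI_STATE_SHIFT = {"Terre Haute": 0.0, "NP": -7.529, "FC": 18.928, "Man": 49.363}

def predict_state_time(semi_time_seconds, semi_state):
    """Predicted state time: constant + slope * semi-time + dummy shift."""
    return (CONSTANT
            + SEMI_TIME_SLOPE * semi_time_seconds
            + SEMI_STATE_SHIFT[semi_state])

for name in SEMI_STATE_SHIFT:
    print(f"{name:12s} {predict_state_time(900, name):8.3f}")

# Gain in R Square over the model with semi-state time alone.
r_squared_gain = 0.736 - 0.395
print(f"R Square gain = {r_squared_gain:.3f}")
```

Because every runner here has the same semi-state time, the differences between the four predictions come entirely from the dummy coefficients.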

3. Two alumni of the University of Wisconsin wrote a research paper trying to explain variations in salary levels of 45 members of the University of Wisconsin economics department. Here were their published results:

REGRESSION OF FACULTY SALARY LEVELS ON EXPERIENCE, PUBLISHING PERFORMANCE, TEACHING PERFORMANCE AND ADMINISTRATIVE DUTIES

Independent Variables             Regression Coefficient    Standard Error    t-ratio
                                       (in dollars)           (in dollars)
Experience                                253.28                  59.71         4.24*
Monographs                                 -5.72                 162.01        -0.04
Articles in National Journals             392.46                  90.64         4.33*
Articles in Specialty Journals            344.59                  90.45         3.81*
Other Publications                         76.49                  24.31         3.15*
Transformed Teaching Score                731.67                 429.82         1.70
Administrative Duties                   5,208.90                 807.46         6.45*
Intercept                              12,127.10

*Significant at the .01 level

Corrected R² = .881    standard error = $1,735    n = 45

Source: American Economic Review, May 1973 (Vol. 63 No. 2) p 313

Source: American Economic Review, May 1973 (Vol. 63 No. 2) p 313

a) What does the little note at the bottom, "Significant at the .01 level" mean?

b) How much does salary go up, on the average, as a result of an additional publication in a national journal?

c) How much salary would this equation predict for a professor who had three years of experience and two articles in specialty journals but no publications elsewhere (i.e., national journals, monographs, other publications), no administrative duties, and a teaching score of zero?

d) Find a 95% confidence interval for the regression coefficient on Other Publications.

e) What information does the R² = .881 give you?
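Both the prediction and the confidence interval asked for above are short calculations from the published table. The Python sketch below shows one way to set them up; the function names are illustrative and not from the paper. For the interval it uses the normal distribution as a large-sample approximation. With only 45 observations and 8 estimated coefficients, a t table with 37 degrees of freedom would give a slightly larger multiplier and a slightly wider interval.

```python
from statistics import NormalDist

# Coefficients (dollars) copied from the published table.
INTERCEPT = 12127.10
COEFFICIENTS = {
    "experience": 253.28,
    "monographs": -5.72,
    "national_articles": 392.46,
    "specialty_articles": 344.59,
    "other_publications": 76.49,
    # Consistent with its t-ratio of 1.70 and standard error of 429.82.
    "teaching_score": 731.67,
    "administrative_duties": 5208.90,
}

def predict_salary(**values):
    """Intercept plus coefficient * value; anything not given counts as 0."""
    return INTERCEPT + sum(COEFFICIENTS[name] * x for name, x in values.items())

def confidence_interval(coefficient, std_error, level=0.95):
    """Large-sample (normal) interval: coefficient +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coefficient - z * std_error, coefficient + z * std_error

print(round(predict_salary(experience=3, specialty_articles=2), 2))
low, high = confidence_interval(76.49, 24.31)
print(round(low, 2), round(high, 2))
```

If the interval for a coefficient does not include zero, that matches the coefficient being statistically significant at the corresponding level.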

4. Several different groups attempt to measure how conservative or liberal congressmen are. Among these groups were the Americans for Democratic Action (ADA), the AFL-CIO Committee on Political Education (COPE), the National Farmers Union (NFU), and the Americans for Constitutional Action (ACA). Below is a matrix showing correlations between ratings given by these various interest groups many years ago. (Data from Journal of Law and Economics, Dec. 1979, p 369.)

Correlation of Congressional Ratings, 1973

Groups:     ADA        COPE       NFU        ACA
ADA
COPE        0.850*
NFU         0.715*     0.819*
ACA        -0.897*    -0.891*    -0.800*

*Significant at the .01 level.

a) What information are you given when you are told that all of these correlations are significant at the .01 level?

b) How can you explain the negative correlation between ACA and ADA? (Hint: the ADA measured how "liberal" a congressman was.)

Here is a regression that tried to explain the ADA rating of over 400 congressmen by looking at the party of the congressman (1 = Democrat, 0 = Republican) and whether the congressman was from the North (= 0) or South (= 1), labeled DNS below.

ADA = 23.890 - 30.903 DNS + 44.278 Party        Corrected R² = .55
      (1.67)   (-13.04)     (2.15)

c) What information does the R² = .55 give you? (It is actually a corrected R², with a little line over the R, but I cannot do that for this page.)

d) The values in parentheses are t-values. Can we conclude with confidence that northern congressmen were more liberal than southern congressmen? Explain carefully.

e) The ADA rating is on a scale from 0 to 100. What does the regression coefficient in front of Party tell us?

f) Compute a 90% confidence interval for the regression coefficient of DNS.
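The equation reports t-values rather than standard errors, but since a t-value is the coefficient divided by its standard error, the standard error can be recovered. The Python sketch below does that for DNS and then builds a 90% interval. It uses the normal distribution in place of the t distribution, an approximation that is reasonable with over 400 congressmen; variable names are illustrative.

```python
from statistics import NormalDist

coefficient = -30.903   # coefficient on DNS
t_value = -13.04        # t-value printed beneath it

# t = coefficient / standard error, so the standard error can be recovered.
std_error = coefficient / t_value

# A 90% interval leaves 5% in each tail; with over 400 congressmen the
# normal distribution is used here as an approximation to the t.
z = NormalDist().inv_cdf(0.95)
low = coefficient - z * std_error
high = coefficient + z * std_error

print(f"standard error = {std_error:.3f}")
print(f"90% interval   = ({low:.3f}, {high:.3f})")
```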

5. Below are some data for used Cadillacs from several years ago. In addition to the price of the car, they include the age of the car, the year of the car, and the number of miles on the car. Using regression to see how price depended on age and miles (miles are measured in thousands, i.e., 1 = 1,000 miles) gives the results below:

Model Summary: Predictors: (Constant), AGE, MILES; Dependent Variable: PRICE

R       R Square    Adjusted R Square    Std. Error of the Estimate
.928    .861        .852                 3010.4572

ANOVA

              Sum of Squares     df    Mean Square        F        Sig.
Regression    1742354183.829      2    871177091.915    96.126    .000
Residual       280948430.641     31      9062852.601
Total         2023302614.471     33

              Unstandardized Coefficients    Standardized Coefficients
              B              Std. Error      Beta                            t        Sig.
(Constant)    28213.880      1185.683                                     23.795    .000
MILES          -117.329        24.242        -.402                        -4.840    .000
AGE           -1125.172       148.238        -.631                        -7.590    .000

Dependent Variable: PRICE

a) How much of the variation in price in this sample can I explain?

b) Is my explanation of variation in price just random, or do I seem to have something more than randomness? Explain.

c) Based on my results, how much should a Caddy that is zero years old and has zero miles be worth?

d) If we have a Caddy that is one year old and has 5000 miles on it, what would we predict for its price?

e) If we have a Cadillac that is six years old and has 30,000 miles, what would we predict for its price?

f) According to these results, when age increases by a year, by how much should price change?

g) According to these results, when mileage increases by 1k, by how much should price change?

h) The t-value for the Constant is positive (23.795), while those for miles and age are negative. Why?

i) We test the hypothesis that age has nothing to do with the price of these cars. What should we conclude and why?
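Predictions from this output follow the pattern constant + B for MILES × miles + B for AGE × age, with the catch that MILES is in thousands. The Python sketch below wraps that in a small helper (the name `predict_price` is illustrative) that takes ordinary miles and converts them, and it recomputes R Square from the ANOVA sums of squares.

```python
# Unstandardized coefficients copied from the output above.
CONSTANT = 28213.880
B_MILES = -117.329   # dollars per thousand miles
B_AGE = -1125.172    # dollars per year of age

def predict_price(age_years, miles):
    """Predicted price in dollars; miles are converted to thousands first."""
    return CONSTANT + B_MILES * (miles / 1000) + B_AGE * age_years

# R Square is the regression sum of squares divided by the total.
r_squared = 1742354183.829 / 2023302614.471
print(f"R Square = {r_squared:.3f}")

for age, miles in [(1, 5000), (6, 30000)]:
    print(age, miles, round(predict_price(age, miles), 2))
```

Entering 5000 rather than 5 for MILES is an easy mistake when working by hand; doing the conversion inside the helper avoids it.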
