
## Problems: Multivariate Regression 1

1. I wanted to see what determined the gas mileage of two-wheel-drive sports utility vehicles. I found a publication on the Internet, and entered some of the data it had into SPSS. I ran a regression, choosing miles per gallon as my dependent variable and engine size (measured in liters) and the type of transmission as independent variables. I got a bunch of output, but now I need some help making sense of it.

a) There were two types of transmission, manual and automatic. I put in a 1 for automatic and a zero for manual. There is a name for this type of variable, but I cannot remember what it was. Can you help me?
b) I remember hearing that R Square was an important number. Mine is .825. But I am not sure if this is good or bad. Can you help me make sense of this number?
c) Is there anything in the ANOVA table that I should pay attention to?
d) I seem to have negative unstandardized coefficients. What do they mean? Should I worry that they are negative, or be happy about that?
e) I seem to remember that the t and Sig were used to test hypotheses. What hypotheses are we testing here? And what do we conclude?
f) I was hoping that the data I found would have included information about the weight of the car, because I have often heard that heavy cars get far fewer miles per gallon. But there was nothing about weight in the publication I found. Should I worry about not having this variable in my regression? What do you think the effects of ignoring it are?

2. To get to the state meet in cross-country in Indiana in 2001, a team had to compete to advance to the semi-state meet. Twenty teams and assorted individuals (who had already survived sectionals and regionals) competed, and to advance, a runner either had to be on one of the top five teams or in the top 15 individuals. There were four semi-state competitions, and they left a trail of numbers.

a) The data have 60 observations representing the top 15 finishers from each semi-state competition.

One question of some interest is whether all four semi-state competitions are equally tough. Below is a graph that shows the placing of each of these 60 girls grouped by the semi-state they came from. What do you see? Do all semi-states look equally tough? (There were more than 60 girls in the race; the data only include those who were in the top 15 in each semi-state.)

b) Analysis of Variance is a way to statistically test what we are eyeballing in part a. The null hypothesis is that all the groups are the same, or that the means of the subgroups are the same as the overall mean. If we run ANOVA, we get the following table. What does the level of significance tell you? (Your null hypothesis is that all the semi-states are the same. Do you keep this hypothesis, or do you have substantial evidence that it is wrong?)

| | Sum of Squares | df | Mean Square | F | Sig |
|---|---|---|---|---|---|
| Between Groups | 12046.467 | 3 | 4015.489 | 6.226 | .001 |
| Within Groups | 36115.467 | 56 | 644.919 | | |
| Total | 48161.933 | 59 | | | |
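The columns of an ANOVA table are tied together by simple arithmetic, and a short sketch can make that visible. The Python below is only an illustration: the numbers are copied from the table above, and the function names are made up for this example.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_between, df_between = 12046.467, 3   # 4 semi-states - 1
ss_within, df_within = 36115.467, 56    # 60 runners - 4 semi-states


def mean_square(sum_of_squares, df):
    """A mean square is a sum of squares divided by its degrees of freedom."""
    return sum_of_squares / df


def f_ratio(ss_b, df_b, ss_w, df_w):
    """F compares between-group variation to within-group variation."""
    return mean_square(ss_b, df_b) / mean_square(ss_w, df_w)


ms_between = mean_square(ss_between, df_between)
ms_within = mean_square(ss_within, df_within)
f_stat = f_ratio(ss_between, df_between, ss_within, df_within)

print(f"Mean Square between = {ms_between:.3f}")
print(f"Mean Square within  = {ms_within:.3f}")
print(f"F = {f_stat:.3f}")
```

The Sig column is the probability of an F this large or larger if the null hypothesis were true; the sketch stops short of computing it because that needs the F distribution, which is not in Python's standard library.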

c) We can do exactly the same analysis with regression. We do it with what are called dummy variables, which are variables that indicate off/on: off = 0, on = 1. We will use three of them, which are labeled fc (Franklin Central), man (Manchester) and np (New Prairie). How does the ANOVA table you get here compare with the ANOVA table in part b? (They are different because in part b we are analyzing rank, or how they placed, and below we are analyzing their running time in seconds.)
What does the R-squared tell us?
According to the regression coefficients, girls from which semistate (FC, NP, MAN, or Terre Haute, which is not listed separately) are slowest?

 R SQUARE = .247 ADJUSTED R SQUARE = .207

| | Sum of Squares | df | Mean Square | F | Sig |
|---|---|---|---|---|---|
| Regression | 15075.650 | 3 | 5025.217 | 6.136 | .001 |
| Residual | 45865.200 | 56 | 819.021 | | |
| Total | 60940.850 | 59 | | | |

| | Regression Coefficient | StdError | t | Sig |
|---|---|---|---|---|
| Constant | 933.467 | 7.389 | 126.327 | .000 |
| NP | -37.267 | 10.450 | -3.566 | .001 |
| FC | -32.800 | 10.450 | -3.139 | .003 |
| Man | -8.00 | 10.450 | -.766 | .447 |
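With only dummy variables on the right-hand side, each predicted value is just the constant plus the coefficient of whichever dummy is switched on. The sketch below works that out for the coefficients reported above; the label `TH` for the omitted Terre Haute group and the function name are assumptions made for this illustration.

```python
# Coefficients copied from the regression table above (times in seconds).
CONSTANT = 933.467
DUMMY_COEF = {"NP": -37.267, "FC": -32.800, "MAN": -8.00}


def predicted_mean_time(semistate):
    """Constant plus the dummy coefficient; the omitted group adds 0."""
    return CONSTANT + DUMMY_COEF.get(semistate, 0.0)


# "TH" is a hypothetical label for Terre Haute, the omitted baseline group.
group_means = {g: predicted_mean_time(g) for g in ("TH", "NP", "FC", "MAN")}

# R squared is the regression sum of squares over the total sum of squares.
r_squared = 15075.650 / 60940.850

for group, mean in group_means.items():
    print(f"{group}: {mean:.3f} seconds")
print(f"R squared = {r_squared:.3f}")
```

Because the baseline group has every dummy set to zero, its predicted time is the constant itself, and each coefficient is the difference between that group and the baseline.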

d) If we look at the scatter diagram of state times and semistate times, we get the picture below. If we look at the correlation, we get the correlation below. What do they tell us? Correlation between state time and semi-state time = .628; significance is .000 (or less than 1 in 1000).

e) If we try to explain the time that a girl ran in the state meet with a regression that uses her time in the semi-state meet as the independent variable, we get the following. How well does the semi-state time predict the state time? What would we predict for a girl who ran a semistate time of 900 seconds (that is, 15 minutes)? Do we see regression toward the mean?

 R SQUARE = .395 ADJUSTED R SQUARE = .384

| | Sum of Squares | df | Mean Square | F | Sig |
|---|---|---|---|---|---|
| Regression | 24059.334 | 1 | 24059.334 | 37.836 | .000 |
| Residual | 36881.516 | 58 | 635.888 | | |
| Total | 60940.850 | 59 | | | |

| | Regression Coefficient | StdError | t | Sig |
|---|---|---|---|---|
| Constant | 243.037 | 109.121 | 2.227 | .030 |
| Semi-time | .742 | .121 | 6.151 | .000 |

f) We might do better than this if we also include the variables that tell which semistate a runner came from. Those results are below. By how much has our R-squared increased? Which semistate has the fastest course? Which seems to have the slowest course? What time would we predict for a girl who runs 900 seconds in the NP semistate?

 R SQUARE = .736 ADJUSTED R SQUARE = .716

| | Sum of Squares | df | Mean Square | F | Sig |
|---|---|---|---|---|---|
| Regression | 44830.269 | 4 | 11207.567 | 38.262 | .000 |
| Residual | 16110.581 | 55 | 292.920 | | |
| Total | 60940.850 | 59 | | | |

| | Regression Coefficient | StdError | t | Sig |
|---|---|---|---|---|
| Constant | -162.846 | 108.865 | -1.496 | .140 |
| Semi-time | 1.174 | .116 | 10.079 | .000 |
| NP | -7.529 | 6.911 | -1.089 | .281 |
| FC | 18.928 | 8.087 | 2.341 | .023 |
| Man | 49.363 | 8.452 | 5.840 | .000 |
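Predicting from a fitted regression means plugging values into the equation. The following sketch does this for the two models in parts e and f, using the coefficients reported above; the function names and the way the semistate is passed in are assumptions for illustration only.

```python
# Part e: state time explained by semi-state time alone.
SIMPLE = {"constant": 243.037, "semi_time": 0.742}

# Part f: semi-state time plus dummies (Terre Haute is the omitted group).
FULL = {"constant": -162.846, "semi_time": 1.174}
FULL_DUMMIES = {"NP": -7.529, "FC": 18.928, "MAN": 49.363}


def predict_simple(semi_time):
    """Predicted state time (seconds) from the one-variable model."""
    return SIMPLE["constant"] + SIMPLE["semi_time"] * semi_time


def predict_full(semi_time, semistate):
    """Predicted state time (seconds) from the model with dummies."""
    shift = FULL_DUMMIES.get(semistate, 0.0)  # omitted group adds nothing
    return FULL["constant"] + FULL["semi_time"] * semi_time + shift


simple_900 = predict_simple(900)
full_900_np = predict_full(900, "NP")
r_squared_gain = 0.736 - 0.395

print(f"Part e, 900 s at semi-state: {simple_900:.3f} s")
print(f"Part f, 900 s at NP:         {full_900_np:.3f} s")
print(f"Increase in R squared:       {r_squared_gain:.3f}")
```

Comparing the part e prediction with the 900-second input is one way to start thinking about regression toward the mean, though the interpretation still depends on where 900 seconds sits relative to the average semi-state time.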

3. Two alumni of the University of Wisconsin wrote a research paper trying to explain variations in salary levels of 45 members of the University of Wisconsin economics department. Here were their published results:

REGRESSION OF FACULTY SALARY LEVELS ON EXPERIENCE, PUBLISHING PERFORMANCE, TEACHING PERFORMANCE AND ADMINISTRATIVE DUTIES

| Independent Variables | Regression Coefficient (in dollars) | Standard Error (in dollars) | t-ratio |
|---|---|---|---|
| Experience | 253.28 | 59.71 | 4.24* |
| Monographs | -5.72 | 162.01 | -0.04 |
| Articles in National Journals | 392.46 | 90.64 | 4.33* |
| Articles in Specialty Journals | 344.59 | 90.45 | 3.81* |
| Other Publications | 76.49 | 24.31 | 3.15* |
| Transformed Teaching Score | 731.67 | 429.82 | 1.70 |
| Administrative Duties | 5,208.90 | 807.46 | 6.45* |
| Intercept | 12,127.10 | | |

*Significant at the .01 level

Corrected R2 = .881; standard error = \$1,735; n = 45

Source: American Economic Review, May 1973 (Vol. 63 No. 2) p 313

a) What does the little note at the bottom, "Significant at the .01 level" mean?
b) How much does salary go up, on the average, as a result of an additional publication in a national journal?
c) How much salary would this equation predict for a professor who had three years of experience, two articles in specialty journals but no publications elsewhere (i.e., national journals, monographs, other publications), has no administrative duties and a teaching score of zero?
d) Find a 95% confidence interval for the regression coefficient on Other Publications.
e) What information does the R2 =.881 give you?
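Parts c and d both come down to arithmetic on the published table. The sketch below shows one way to set that arithmetic up. Several things in it are assumptions rather than part of the paper: the dictionary keys, the function names, the reading of the teaching coefficient as 731.67 (which is consistent with its t-ratio of 1.70), and the critical value of about 2.026, which is the two-sided 95% t value for 45 - 8 = 37 degrees of freedom taken from a t table.

```python
# Coefficients in dollars, copied from the published table.
INTERCEPT = 12127.10
COEF = {
    "experience": 253.28,
    "monographs": -5.72,
    "national_articles": 392.46,
    "specialty_articles": 344.59,
    "other_publications": 76.49,
    "teaching_score": 731.67,   # assumed reading, consistent with t = 1.70
    "admin_duties": 5208.90,
}


def predict_salary(**values):
    """Intercept plus coefficient times value; unnamed variables count as 0."""
    return INTERCEPT + sum(COEF[name] * x for name, x in values.items())


def confidence_interval(coef, std_error, critical_value):
    """Coefficient plus or minus critical value times standard error."""
    margin = critical_value * std_error
    return coef - margin, coef + margin


salary = predict_salary(experience=3, specialty_articles=2)

# 2.026 is an assumed table value for t with 37 degrees of freedom.
low, high = confidence_interval(76.49, 24.31, critical_value=2.026)

print(f"Predicted salary: ${salary:,.2f}")
print(f"95% CI for Other Publications: ({low:.2f}, {high:.2f})")
```

Using 1.96 from the normal distribution instead of 2.026 would give a slightly narrower interval; with only 45 observations the t value is the more careful choice.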

4. Several different groups attempt to measure how conservative or liberal congressmen are. Among these groups were the Americans for Democratic Action (ADA), the AFL-CIO Committee on Political Education (COPE), the National Farm Union (NFU), and the Americans for Constitutional Action (ACA). Below is a matrix showing correlations between ratings given by these various interest groups many years ago. (Data from Journal of Law and Economics, Dec. 1979, p 369.)

Correlation of Congressional Ratings, 1973

| Groups | ADA | COPE | NFU | ACA |
|---|---|---|---|---|
| ADA | | | | |
| COPE | 0.850* | | | |
| NFU | 0.715* | 0.819* | | |
| ACA | -0.897* | -0.891* | -0.800* | |

*Significant at the .01 level.

a) What information are you given when you are told that all of these correlations are significant at the .01 level?
b) How can you explain the negative correlation between ACA and ADA? (Hint: the ADA measured how "liberal" a congressman was.)

Here is a regression that tried to explain the ADA rating of over 400 congressmen by looking at the party of the congressman (1 = Democrat, 0 = Republican) and whether the congressman was from the North (= 0) or South (= 1):

ADA = 23.890 - 30.903 DNS + 44.278 Party    (Corrected R2 = .55)

Values in parentheses: (1.67) for the constant, (-13.04) for DNS, (2.15) for Party

c) What information does the R2 = .55 give you? (It is actually a corrected R2, with a little line over the R, but I cannot do that for this page)
d) The values in parentheses are t-values. Can we conclude with confidence that northern congressmen were more liberal than southern congressmen? Explain carefully.
e) The ADA rating is on a scale from 0 to 100. What does the regression coefficient in front of Party tell us?
f) Compute a 90% confidence interval for the regression coefficient of DNS.
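Part f gives a coefficient and a t-value but no standard error. Since a t-value is the coefficient divided by its standard error, the standard error can be recovered by division. The sketch below shows that step and the resulting interval; the function names are made up here, and the critical value 1.645 is an assumption (the two-sided 90% value from the normal distribution, which is a reasonable stand-in for t with more than 400 observations).

```python
def std_error_from_t(coef, t_value):
    """Since t = coef / SE, the standard error is coef / t."""
    return coef / t_value


def confidence_interval(coef, std_error, critical_value):
    """Coefficient plus or minus critical value times standard error."""
    margin = critical_value * std_error
    return coef - margin, coef + margin


dns_coef = -30.903   # from the regression equation above
dns_t = -13.04       # t-value reported in parentheses

dns_se = std_error_from_t(dns_coef, dns_t)

# 1.645 is an assumed large-sample critical value for 90% confidence.
low, high = confidence_interval(dns_coef, dns_se, critical_value=1.645)

print(f"Standard error of DNS: {dns_se:.3f}")
print(f"90% CI for DNS: ({low:.3f}, {high:.3f})")
```

Whether the interval includes zero connects this part back to part d.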

5. Below are some data for used Cadillacs from several years ago. In addition to the price of the car, they include the age of the car, the year of the car, and the number of miles the car has. Using regression to see how price depended on age and miles (Miles are measured in thousands--i.e. 1 = 1k), gives the results below:

Model Summary: Predictors: (Constant), AGE, MILES; Dependent Variable: PRICE

| R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|
| .928 | .861 | .852 | 3010.4572 |

ANOVA

| | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 1742354183.829 | 2 | 871177091.915 | 96.126 | .000 |
| Residual | 280948430.641 | 31 | 9062852.601 | | |
| Total | 2023302614.471 | 33 | | | |

| | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 28213.880 | 1185.683 | | 23.795 | .000 |
| MILES | -117.329 | 24.242 | -.402 | -4.840 | .000 |
| AGE | -1125.172 | 148.238 | -.631 | -7.590 | .000 |
Dependent Variable: PRICE

a) How much of the variation in price in this sample can I explain?
b) Is my explanation of variation in price just random, or do I seem to have something more than randomness? Explain.
c) Based on my results, how much should a Caddy be worth that is zero years old and with zero miles?
d) If we have a Caddy that is one year old and has 5000 miles on it, what would we predict for its price?
e) If we have a Cadillac that is six years old and has 30,000 miles, what would we predict for its price?
f) According to these results, when age increases by a year, by how much should price change?
g) According to these results, when mileage increases by 1k, by how much should price change?
h) The t-value for the Constant is positive (23.795), while those for miles and age are negative. Why?
i) We test the hypothesis that age has nothing to do with the price of these cars. What should we conclude and why?
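Several of the parts above are plug-in calculations on the SPSS output. The sketch below sets up the fitted equation as a function and checks how R Square and F follow from the ANOVA table. The function and variable names are assumptions for this illustration; the numbers are copied from the output, with miles entered in thousands as the problem states.

```python
# Unstandardized coefficients (B) copied from the coefficients table above.
CONSTANT = 28213.880
B_MILES = -117.329   # dollars per thousand miles
B_AGE = -1125.172    # dollars per year of age


def predict_price(age_years, miles_thousands):
    """Fitted price in dollars; miles must be given in thousands."""
    return CONSTANT + B_AGE * age_years + B_MILES * miles_thousands


price_new = predict_price(0, 0)
price_1yr_5k = predict_price(1, 5)      # 5000 miles is entered as 5
price_6yr_30k = predict_price(6, 30)    # 30,000 miles is entered as 30

# R Square and F recomputed from the ANOVA table.
ss_regression, ss_total = 1742354183.829, 2023302614.471
ms_regression, ms_residual = 871177091.915, 9062852.601
r_squared = ss_regression / ss_total
f_stat = ms_regression / ms_residual

print(f"New car:           ${price_new:,.2f}")
print(f"1 year, 5k miles:  ${price_1yr_5k:,.2f}")
print(f"6 years, 30k miles: ${price_6yr_30k:,.2f}")
print(f"R squared = {r_squared:.3f}, F = {f_stat:.3f}")
```

A common slip in parts d and e is entering 5000 or 30000 for miles; because the variable is measured in thousands, that would subtract a thousand times too much.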