### Computer Exercises: Regression

(This exercise was designed to work with SPSS. It may be modified to work with other programs, and it may have to be modified to work with the current version of SPSS.)

The table below is from an EPA report showing gas mileage for 1978 automobiles.

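| City gas mileage | Highway gas mileage | Engine size (cubic inch displacement) | Transmission |
|---|---|---|---|
| 29 | 40 | 85 | M |
| 24 | 30 | 85 | A |
| 12 | 19 | 273 | M |
| 11 | 16 | 273 | A |
| 25 | 34 | 140 | M |
| 21 | 29 | 140 | A |
| 15 | 20 | 250 | A |
| 27 | 36 | 98 | M |
| 22 | 29 | 98 | A |
| 9 | 12 | 412 | A |
| 24 | 36 | 89 | M |
| 23 | 32 | 89 | A |
| 15 | 20 | 250 | A |
| 16 | 24 | 250 | A |
| 15 | 21 | 231 | A |
| 20 | 30 | 140 | M |
| 18 | 23 | 200 | A |
| 13 | 21 | 318 | A |
| 11 | 18 | 400 | A |
| 14 | 21 | 305 | A |
| 10 | 16 | 360 | A |
| 13 | 18 | 350 | A |
| 12 | 19 | 360 | A |
| 18 | 26 | 120 | M |
| 21 | 31 | 79 | M |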

a) How closely associated are city gas mileage and highway gas mileage?
b) How well can we explain highway gas mileage using engine size? How well can we explain city mileage using engine size?
c) What is the average difference between city and highway gas mileage?
d) Each time engine size goes up by one cubic inch, by how much on average does highway gas mileage change?
e) How well can we explain city gas mileage using only type of transmission?
f) Regress city gas mileage on both engine size and transmission. Can you conclude that the true regression coefficients (the parameters) are different from zero? Explain.
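Because the exercise may be adapted to programs other than SPSS, here is a minimal Python sketch of the calculations behind parts (a) through (e), using only the standard library and the 25 cars in the table. The helper names (`mean`, `corr`, `simple_ols`) are illustrative and are not part of SPSS. Coding the transmission as a dummy variable (M = 1, A = 0, read here as manual and automatic) is an assumption of this sketch.

```python
from math import sqrt

# Rows from the table: (city mpg, highway mpg, engine size in cubic inches, transmission)
cars = [
    (29, 40, 85, "M"), (24, 30, 85, "A"), (12, 19, 273, "M"), (11, 16, 273, "A"),
    (25, 34, 140, "M"), (21, 29, 140, "A"), (15, 20, 250, "A"), (27, 36, 98, "M"),
    (22, 29, 98, "A"), (9, 12, 412, "A"), (24, 36, 89, "M"), (23, 32, 89, "A"),
    (15, 20, 250, "A"), (16, 24, 250, "A"), (15, 21, 231, "A"), (20, 30, 140, "M"),
    (18, 23, 200, "A"), (13, 21, 318, "A"), (11, 18, 400, "A"), (14, 21, 305, "A"),
    (10, 16, 360, "A"), (13, 18, 350, "A"), (12, 19, 360, "A"), (18, 26, 120, "M"),
    (21, 31, 79, "M"),
]

city = [c[0] for c in cars]
highway = [c[1] for c in cars]
engine = [c[2] for c in cars]
# Assumed dummy coding: 1 for "M", 0 for "A".
manual = [1 if c[3] == "M" else 0 for c in cars]


def mean(xs):
    return sum(xs) / len(xs)


def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simple_ols(xs, ys):
    """Return (intercept, slope, r_squared) for the regression of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    # With one predictor, R-squared is the squared correlation.
    return intercept, slope, corr(xs, ys) ** 2


r_city_hwy = corr(city, highway)                           # part (a)
_, hwy_slope, hwy_r2 = simple_ols(engine, highway)         # parts (b) and (d)
_, _, city_r2 = simple_ols(engine, city)                   # part (b)
mean_gap = mean([h - c for c, h in zip(city, highway)])    # part (c)
_, manual_effect, trans_r2 = simple_ols(manual, city)      # part (e)

print("correlation, city vs. highway:", round(r_city_hwy, 3))
print("highway on engine size: slope", round(hwy_slope, 4), "R-squared", round(hwy_r2, 3))
print("city on engine size: R-squared", round(city_r2, 3))
print("mean highway minus city mileage:", round(mean_gap, 2))
print("city on transmission dummy: R-squared", round(trans_r2, 3))
```

With a 0/1 predictor, the slope (`manual_effect`) is simply the difference between the two group means, which is one way to read the answer to part (e).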

### More Car Problems

We did not finish our analysis of the market for used Lincolns.

We ran a regression trying to explain the price of the car using only age. (You need to redo that regression, saving the unstandardized residuals and predicted values.) The results overestimated the importance of age because miles were also important, but miles and age were correlated, so when we included age alone, it also captured some of the influence of miles.

1. In running the regression, we have tried to squeeze out all the information that there is about price from age. Suppose that there was a correlation between age and the residual, the leftovers from the regression. (In other words, knowing age would help predict the value of the residuals.) Would that suggest that we had been successful in squeezing out all the information that was there? ________ What should the correlation between age and the residuals be if we have extracted all the information they have to offer? _________ What is the correlation?

2. Look at the correlation between the residuals and miles. What is the sign on the correlation? What does this suggest? Part of the effect of miles was included in age, but did we capture all the information that is here in explaining price? (If we know miles, will it help predict residuals?)

3. What should the mean of the residuals be? (Suppose that it is greater than zero. What does this suggest about our predictions for price?)

4. What should the mean of the predicted value be? (Suppose that it is greater than the mean of price. Would this be good or bad? Why?) Compute the means. Do you get what you think you should get?

5. Run a regression with price as the dependent variable and both age and miles as independent variables. Set it up so that you get confidence intervals for the regression coefficients (click the Statistics button at the bottom) and also the unstandardized residuals and unstandardized predicted values (click the Save button at the bottom).

a) By how much has your R-squared increased from the regression without miles?

b) How much of the variation in prices are you not explaining with these two variables? What might explain that leftover variation?

c) You can test the hypothesis that the true value of R-squared is zero using ANOVA. The level of significance is what is important. Does it tell you that you can be very convinced that the true value of R-squared is not zero, and that what you have is not just the effect of random noise (chance variation)?

d) What has happened to the coefficient for age when you included miles?

e) What price does your output predict for a five-year-old Lincoln with 50,000 miles?

f) Are you pretty sure that increasing mileage and age both lower the value of a car? Explain how the level of significance you get shows you that.

g) Based on this sample, what is a 95% confidence interval for the true coefficient of mileage?

h) Look at your residuals. According to them, which car is the most overpriced? Which car is the best bargain?

i) What should the correlations be between the residuals and age and the residuals and mileage? Check to see if you are right.

j) What is the predicted value of a 40-year-old car with 100,000 miles? How do you explain the result that you get?
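The Lincoln data are not reproduced here, so the following Python sketch uses made-up numbers to illustrate the ideas behind the questions above: what happens to the age coefficient when miles is added, what the residuals and predicted values of a least-squares fit must look like, and why extrapolation is risky. Every number, the assumed "true" model, and the helper names (`solve`, `ols`, `predict`) are hypothetical.

```python
# Hypothetical used-car data (NOT the Lincoln data set from the exercise).
age = [1, 2, 3, 4, 5, 6, 7, 8]                       # years
miles = [12, 30, 28, 55, 60, 70, 95, 90]             # thousands of miles
noise = [0.3, -0.5, 0.2, 0.4, -0.6, 0.1, -0.2, 0.3]  # arbitrary disturbances
# Price in $1000s, built from an assumed model plus the disturbances.
price = [30 - 1.0 * a - 0.1 * m + e for a, m, e in zip(age, miles, noise)]


def mean(xs):
    return sum(xs) / len(xs)


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] from the normal equations."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


def predict(coefs, values):
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], values))


b_age_only = ols([age], price)
b_both = ols([age, miles], price)

predicted = [predict(b_both, (a, m)) for a, m in zip(age, miles)]
residuals = [p - ph for p, ph in zip(price, predicted)]

print("age coefficient, age only:     ", round(b_age_only[1], 3))
print("age coefficient, age and miles:", round(b_both[1], 3))
print("mean residual:", round(mean(residuals), 6))
print("mean price, mean predicted:", round(mean(price), 3), round(mean(predicted), 3))
# Extrapolating far beyond the sampled ages can give a nonsensical price.
print("predicted price, 40 years and 100,000 miles:", round(predict(b_both, (40, 100)), 2))
```

In this made-up sample, age and miles are strongly correlated, so the age-only regression assigns part of the effect of miles to age. The residual checks hold for any least-squares fit that includes an intercept, not just for these numbers.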

### Predicting College Success (and Failure)

Many years ago I received data from the Admissions Office showing the information that students submitted to SJC as part of their application and also the GPA that these students had gotten at SJC after a year or two. The Admissions Office was interested in how well they could predict student success at SJC and was spending a fair amount of money sending this data to consultants for analysis. (Somehow sending it off and spending lots of money made more sense than having someone on the faculty analyze it, though there were several people on the faculty who were quite capable of doing for free what they were spending lots of money on.) Included were SAT verbal and math scores, rank in high school class, adjusted high-school GPA, and college GPA (after two semesters?). These data are here.

1. Find the correlation between college cumulative GPA and SAT verbal.__________ Is the sign (positive or negative) what you expected it to be?

2. Is it significant at the .05 level?_________ at the .01 level?__________

3. What is the level of significance telling you?

4. Run a linear regression with college cumulative GPA as the dependent variable and SAT verbal as the independent variable. Set it up so that it saves the predicted and residual results. (In the Regression set-up, click on the Save button on the bottom, and select Predicted Values unstandardized and Residuals Unstandardized.) What is your R-squared? _________

5. What does this R-squared result mean?

6. What is the intercept or constant term _____________ and what is your slope term?___________

7. Write the equation of the regression line that you found: _______________________________

8. We can test the hypothesis that the true value of the slope is actually zero, and that the slope that you actually found is just a random chance result. (If the slope is zero, then information about SAT verbal tells us nothing useful about how well a student will do in college. Do you see this?) To do this we divide each unstandardized coefficient by its standard error. Make the division. What do you notice?______________________

9. Is the regression coefficient you have found for the slope statistically significant at the .01 level?____________ As a result, do you accept or reject the null hypothesis that beta (note the Greek letter) is actually zero? _________

10. According to your residuals, which student was furthest away from his or her predicted value? Looking at the other data on the student, can you venture a guess as to why the predicted value was so far off?
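The division described in question 8 can be sketched in Python. The SAT and GPA values below are invented for illustration and are not the SJC admissions data. The sketch computes the slope, its standard error, and their ratio, which is the t statistic for the hypothesis that the true slope is zero.

```python
from math import sqrt

# Hypothetical (SAT verbal, college GPA) pairs; NOT the SJC admissions data.
sat = [420, 450, 480, 500, 520, 550, 580, 600, 640, 680]
gpa = [2.1, 2.6, 2.3, 2.9, 2.7, 3.0, 2.8, 3.4, 3.3, 3.6]

n = len(sat)
mx, my = sum(sat) / n, sum(gpa) / n
sxx = sum((x - mx) ** 2 for x in sat)
syy = sum((y - my) ** 2 for y in gpa)
sxy = sum((x - mx) * (y - my) for x, y in zip(sat, gpa))

slope = sxy / sxx
intercept = my - slope * mx

residuals = [y - (intercept + slope * x) for x, y in zip(sat, gpa)]
sse = sum(e ** 2 for e in residuals)   # sum of squared residuals
r_squared = 1 - sse / syy

# Standard error of the slope: sqrt( (SSE / (n - 2)) / Sxx )
se_slope = sqrt(sse / (n - 2) / sxx)
t_stat = slope / se_slope              # coefficient divided by its standard error

# Index of the student with the largest absolute residual (question 10).
worst = max(range(n), key=lambda i: abs(residuals[i]))

print("GPA =", round(intercept, 3), "+", round(slope, 5), "* SAT verbal")
print("R-squared:", round(r_squared, 3))
print("t statistic for the slope:", round(t_stat, 2))
print("least well predicted student (row index):", worst)
```

With a single predictor, the squared t statistic equals R-squared times (n - 2) divided by (1 - R-squared), so the test on the slope and the test on R-squared carry the same information.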

**************

Now that we have had about as much fun as we can have with bivariate statistics, let's give multivariate stats a try.

11. Run a linear regression with college cumulative GPA as the dependent variable and the other four variables as independent variables. Set it up so that it saves the predicted and residual results. What is your R-squared? _________

12. By how much has your R-squared improved by adding the other three variables?___________ Are you surprised that R-squared has increased? Explain.

13. If you had to explain what this R-squared means to someone who does not know anything about statistics (you a few weeks ago?), what would you tell him or her?

14. Write the equation of the regression line that you found: _______________________________

15. Looking at the levels of significance for the independent variables, how many would be classified as not due to random chance? _______. Explain:

16. The problem statisticians worry about in a regression like this is called multicollinearity. See if you can find out what it is.

17. Look at your residuals. Is the same person still the one who is least well predicted?

18. Do we live in a deterministic world, where past foretells future? Explain.

19. At the time, the admissions office claimed that the adjusted high school average was by far the best predictor of college GPA. Run a regression using only the high-school GPA as an independent variable. By how much does the R-squared drop from the equation with four independent variables? Does this result support the claim of the admissions office?
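As a starting point for the multicollinearity question, a simple first check is to look at how strongly the independent variables are correlated with one another. The Python sketch below does this for two invented predictors (the values are hypothetical, not the SJC data) and computes the variance inflation factor, which for the two-predictor case reduces to 1 / (1 - r²).

```python
from math import sqrt

# Hypothetical applicants; NOT the SJC admissions data.
hs_gpa = [2.5, 2.8, 3.0, 3.1, 3.3, 3.5, 3.6, 3.8, 3.9, 4.0]   # adjusted high-school GPA
rank_pct = [40, 48, 55, 60, 62, 70, 75, 85, 88, 95]            # class-rank percentile


def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = corr(hs_gpa, rank_pct)
# Variance inflation factor when there are exactly two predictors.
vif = 1 / (1 - r ** 2)

print("correlation between the two predictors:", round(r, 3))
print("variance inflation factor:", round(vif, 1))
```

A factor near 1 means the predictors carry largely separate information. A large factor means the regression has trouble separating their individual effects, which shows up as large standard errors on the coefficients.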

One of the fun things that happens to me as the result of having lots of stuff on the Internet (see ingrimayne.saintjoe.edu) is that I sometimes get strange communications from far-distant places. A few summers ago I received an inquiry from a gentleman from Colorado who was interested in bidding on a contract for the State of Colorado to construct a cost-of-living index for the school districts. Nothing ever came of it, except that during several rounds of e-mail I was directed to the past studies. And you are the beneficiaries of that e-mail exchange.

1. The State of Colorado grants state funds to school districts based on the cost of living in each school district. If the cost of living is high, the school district gets more state funding than if the cost of living is low. Do you think that this is a good program? Explain why you support or oppose this program before we learn more about it.

2. Here are data on the past cost of living for Colorado school districts. We need to check for mistakes, because I had to scan in these data and errors may have been introduced in that process. The total column should be the sum of housing, transport, goods, other, and taxes. Using the Transform > Compute command, let us compute a check by adding those five columns and then subtracting the total column. If we get only zeros, we are OK. What happens?

3. We have learned that regression is a way to explain one column in terms of others. We could try to explain the total by regressing it on the parts, but this would actually be pretty stupid. Can you see why? If so explain.

If you cannot see why this is stupid, do the regression. What is your R-squared? What is happening here? Now do you see why this was a stupid thing to do?

4. Let's take a look at the descriptive statistics. Find the mean and standard deviations for all the columns. Which of the subparts has the biggest standard deviation? What do you suspect from a big standard deviation?

5. If you have looked at the data and if you have any feel for numbers, you may be suspecting that housing is very important in determining the total cost of living. Let us check this out. Construct a new variable called "non-hous" by adding up the four non-housing components. (Use Transform, compute.) Compute the means and standard deviations of this variable and the housing variable. Which has the bigger mean? Which has the bigger standard deviation? Any suspicions as a result?

6. Compute the correlations among housing, non-housing, and total. What do you find? (The results are pretty startling. Explain what they mean.)

7. Do a scatter diagram of housing and total. There is one dot that looks like it may be a mistake. Check the original data. What school district is it? Do you think it is a mistake? Why do you think that some school districts have housing costs that are so much bigger than other districts?

8. Do a scatter diagram of non-housing and total. What do you get?

9. Think about these results. What determines how much money the state of Colorado is giving local school districts under this program? Are you sure this is a good program? Explain.
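For readers working outside SPSS, the Transform > Compute steps in questions 2 and 5 can be sketched in Python. The rows below are invented (they are not the Colorado figures), and one total contains a deliberate transposition error so that the check has something to catch. The district names and variable names are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical rows: (district, housing, transport, goods, other, taxes, total).
# These are NOT the Colorado figures; the total for "District C" is deliberately wrong.
rows = [
    ("District A", 9000, 4000, 7000, 3000, 2500, 25500),
    ("District B", 15000, 4200, 7100, 3100, 3000, 32400),
    ("District C", 11000, 4100, 6900, 3050, 2700, 27570),
    ("District D", 22000, 4300, 7200, 3150, 3500, 40150),
]

# Question 2: add the five parts and subtract the total; a clean row gives zero.
checks = [sum(r[1:6]) - r[6] for r in rows]
flagged = [r[0] for r, c in zip(rows, checks) if c != 0]

# Question 5: a new variable holding the four non-housing components.
housing = [r[1] for r in rows]
non_hous = [sum(r[2:6]) for r in rows]

print("check column:", checks)
print("rows that need a second look:", flagged)
print("housing:     mean", mean(housing), "sd", round(stdev(housing), 1))
print("non-housing: mean", mean(non_hous), "sd", round(stdev(non_hous), 1))
```

A nonzero entry in the check column points to the row that should be compared against the original source before any further analysis.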