Lab: Simple Regression

Computer Exercises: Simple Regression

(These exercises were designed for use with SPSS. They may need to be modified to work with other programs or even with the current version of SPSS.)

Statistics Fun: Car Problems

1. Here are data taken from an issue of the Chicago Tribune from several years ago. Included are prices of used Lincoln automobiles along with their ages and mileage. Enter these data in SPSS and label your columns.

Let us begin with some graphs. Pull down the Graphs menu to "Legacy Dialogs" to "Scatter/Dot..." Select the simple graph. Enter Age as the Y axis, and Mileage as the X axis. What do you see?

2. Try looking at Age and Price, and Miles and Price. Also see if you can make any sense of the other options of scatter graphs.

3. Pull down the Analyze menu to Correlate. Select Bivariate. Enter all three variables. There is a level of significance given, which means that a hypothesis has been tested. What was the hypothesis that was being tested?

4. Was the hypothesis being tested accepted or rejected? How do you know?

5. What does "Significant at .000" mean?

6. It is now time to look at regression, a statistical technique that goes far beyond correlation. Regression tries to estimate the best fitting line through the data. That is, if we know the age of the car, what would our best guess of its price be?

Pull down the Analyze menu to Regression and select "Linear." Enter Price as the dependent variable. This is the variable you are trying to explain. Let's see if we can explain it using the age of the cars. Enter age as the independent variable. Run the regression.

7. There are some numbers which are important, and which we will spend time exploring in the next few weeks. One is the R-square. It is the square of the correlation coefficient in our case, and represents the proportion of the variation in price that can be explained by age. What is your R-square?

8. Regression fits a line through a set of points. The form of the line is Y=a + bX. What line have you fit? Price = __________ + __________(age)

(Hint: you find this under Coefficients, in the B column.)

9. When the age of a car increases by one year, what does this equation say happens to price?

10. What does this equation say is the value of a new Lincoln? (Hint: How old would a new Lincoln be?)

11. What does this equation predict as the value of a 20-year-old Lincoln? Does this answer make any sense? How do you explain it?

12. Are you surprised to find our old friends, the t-values or t-statistics, are still with us? This means that we are testing a hypothesis. The hypothesis we are testing is that age has no effect on the price of the car. Do we accept or reject this hypothesis? How do you know?

13. The equation you have found overstates the importance of age on price. Both age and mileage affect the price of a used car, but we have not put in mileage. However, mileage is correlated with age, so when we only put in age, we capture not just the effect of age, but some of the effect of mileage. What do you think we could do to solve this problem?

14. Rerun the regression, but this time save the unstandardized predicted values and residuals. (In the regression setup dialog, click the save option.) After you run the regression, see if you can plot the regression line along with the original age-price points. (Use scatter-overlay; the pairs should be price-age and unstandardized predicted-age.)

15. In running the regression, we have tried to squeeze out all the information that there is about price from age. Suppose that there was a correlation between age and the residual, the leftovers from the regression. Would that suggest that we had been successful in squeezing out all the information that was there? ________ What should the correlation between age and the residuals be if we have extracted all the information they have to offer? _________ Run the correlation. Do you get what you expected to get?

16. There are two other columns of data labeled "one" and "two." Do a correlation, and then do a regression explaining one with two. You should have three levels of significance--one with the correlation, one with an F-test, and one with a t-test. Which of them is biggest? Which is smallest? Are you surprised?
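For anyone working outside SPSS, here is a rough Python sketch of the calculations behind items 7, 8, 12, 15, and 16. It is only an illustration under stated assumptions: the data are simulated stand-ins (not the Tribune listings, and not the columns "one" and "two"), the constants used to build them are arbitrary, and the same simulated pair is reused for every step. It reports test statistics only, not levels of significance.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated stand-in data, NOT the Tribune listings: x plays the role of age
# in years and y the role of price. The constants 30000, -2500 and 3000 are
# arbitrary illustration values.
n = 40
x = [random.uniform(1, 12) for _ in range(n)]
y = [30000 - 2500 * xi + random.gauss(0, 3000) for xi in x]

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Correlation coefficient between x and y.
r = sxy / (sxx * syy) ** 0.5

# Item 8: least-squares line y = a + b*x.
b = sxy / sxx
a = mean_y - b * mean_x

# Item 7: R-square as 1 minus (unexplained variation / total variation).
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
r_square = 1 - sse / syy

# Item 15: correlation between x and the residuals.
mean_e = sum(residuals) / n
sxe = sum((xi - mean_x) * (e - mean_e) for xi, e in zip(x, residuals))
see = sum((e - mean_e) ** 2 for e in residuals)
r_x_resid = sxe / (sxx * see) ** 0.5

# Items 12 and 16: the t-statistic for the slope, the t-statistic for the
# correlation, and the F-statistic for the regression (1 and n-2 degrees of freedom).
mse = sse / (n - 2)
t_slope = b / (mse / sxx) ** 0.5
t_corr = r * ((n - 2) / (1 - r * r)) ** 0.5
f_stat = (syy - sse) / mse

print(f"fitted line: y = {a:.1f} + {b:.1f}x")
print(f"r squared = {r * r:.4f}, R-square = {r_square:.4f}")
print(f"corr(x, residuals) = {r_x_resid:.2e}")
print(f"t (slope) = {t_slope:.4f}, t (correlation) = {t_corr:.4f}")
print(f"F = {f_stat:.4f}, t squared = {t_slope ** 2:.4f}")
```

Compare the printed pairs with each other, and then with what SPSS gives you for the real data.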

Introduction to Linear Regression

1. Today we are going to meet linear regression analysis, the big brother to correlation. We will again start by generating three columns of numbers. Columns 1 and 2 are to be random numbers. We have been generating them with a uniform distribution, but today, just to change things, let us generate them with a normal distribution. We use the same Transform to Compute menu, but instead of finding Uniform(?), we find Normal(?). The question mark is for standard deviation, so let us put in something in the 50 to 100 range. After you get two columns of numbers, label them X and error. Form a third column (which we will label Y) so that Y=200 + 4X + error.
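If you are not using SPSS, the three columns can be built along these lines in Python. This is a sketch under assumptions: the number of cases (200) is not specified in the lab, the standard deviation of 75 is just one choice from the suggested range, and the normal draws are taken to have mean 0.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

n = 200   # number of cases; the lab does not fix this, so 200 is an assumption
sd = 75   # standard deviation, chosen from the 50 to 100 range suggested above

# Two columns of normal random numbers, mean 0 (assumed) and standard deviation sd.
X = [random.gauss(0, sd) for _ in range(n)]
error = [random.gauss(0, sd) for _ in range(n)]

# Third column, built from the other two: Y = 200 + 4X + error.
Y = [200 + 4 * x + e for x, e in zip(X, error)]

for name, column in (("X", X), ("error", error), ("Y", Y)):
    print(f"{name:>5}: mean {statistics.mean(column):8.2f}, "
          f"sd {statistics.stdev(column):7.2f}")
```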

Let us check to see if X and error look like normal distributions. Look at the histogram of them and have a normal curve plotted with them. Do they look like they are normally distributed?

Do a scatter diagram of Y and X. What do you see?

Now it is time to try some linear regression. Pull down Analyze to Regression, move over to Linear, and click it. In the dialog box that follows, make your dependent variable Y and your independent variable X. Click OK.

Now you get a big mess of results. We will try to make sense of almost all of them as time goes on. But for now, let us go to the bottom, to the box entitled "Coefficients." Look at unstandardized coefficients, under the B. If regression has done its job right, you should have numbers very close to 200 and 4. What is the equation that regression found?

Y = _______ + __________X
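The same check can be sketched without SPSS. The Python below regenerates the three columns (with an assumed 200 cases and a standard deviation of 75) and computes the intercept and slope from the usual least-squares formulas, which should be the numbers reported in the B column.

```python
import random

random.seed(4)  # fixed seed so the sketch is repeatable

n = 200  # assumed number of cases
X = [random.gauss(0, 75) for _ in range(n)]
error = [random.gauss(0, 75) for _ in range(n)]
Y = [200 + 4 * x + e for x, e in zip(X, error)]

mean_x = sum(X) / n
mean_y = sum(Y) / n

# Slope: sum of cross-products of deviations over sum of squared deviations of X.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
     / sum((x - mean_x) ** 2 for x in X))
# Intercept: the fitted line passes through the point of means.
a = mean_y - b * mean_x

print(f"Y = {a:.2f} + {b:.3f}X")
```

Try changing the seed or the number of cases and see how far the estimates wander from 200 and 4.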

What are your thoughts at this time?

If you have time, redo the entire problem, but instead of getting random numbers from normal(?), get them from uniform(?). Do the results change much?

Statistics Lab: Random Regressions

Part 1

1. Let's go back to our old friend, random numbers generated by the UNIFORM(?) item in the compute menu item. Let us begin with a set of 120 random numbers between 0 and 1000. Call this column of numbers INDEPEND (which stands for the independent variable).

2. Next we will use these numbers to help construct a second column of numbers. Go back to compute, but this time replace the question mark in UNIFORM with INDEPEND. What this will do is have the computer look at whatever number is in the INDEPEND column and then construct a new random number that is between zero and that number. We will call this column DEPEND. Do you see that the numbers in this column must always be less than the corresponding number in the INDEPEND column?

3. What is the mean of INDEPEND? What is the true mean of DEPEND? Can you explain why? Do a t-test and see if you can reject these means. (If you can reject them, make sure you have constructed them properly.)

4. Do a scatter diagram of these two columns with DEPEND on the vertical axis and INDEPEND on the horizontal axis. If you think really hard about what we have done, you should be able to figure out what equation will fit best through these points. (If you were really bad at algebra, this may be beyond you, but if you were OK at it, this is a challenge you should be able to solve. Hint: when the variable INDEPEND has a value of 1000, the variable DEPEND will randomly take a value somewhere between 0 and 1000. What is the expected value of DEPEND? When the variable INDEPEND has a value of 200, the variable DEPEND will randomly take a value somewhere between 0 and 200. What is the expected value of DEPEND? Can you see the equation that emerges?)
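The construction in steps 1 and 2 can be sketched in Python as follows, for anyone without SPSS. The band width of 250 used for the summary is my own choice for illustration, and with only 120 cases the band averages will be noisy.

```python
import random
import statistics

random.seed(5)  # fixed seed so the sketch is repeatable

n = 120
# Step 1: 120 uniform random numbers between 0 and 1000.
INDEPEND = [random.uniform(0, 1000) for _ in range(n)]
# Step 2: for each case, a uniform random number between 0 and that case's INDEPEND.
DEPEND = [random.uniform(0, top) for top in INDEPEND]

print(f"sample mean of INDEPEND: {statistics.mean(INDEPEND):.1f}")
print(f"sample mean of DEPEND:   {statistics.mean(DEPEND):.1f}")

# Average DEPEND within four bands of INDEPEND, to show how the typical
# value of DEPEND moves as INDEPEND grows.
band_means = {}
for lo in range(0, 1000, 250):
    in_band = [d for i, d in zip(INDEPEND, DEPEND) if lo <= i < lo + 250]
    if in_band:  # a band could be empty in a small sample
        band_means[lo] = statistics.mean(in_band)
        print(f"INDEPEND in [{lo}, {lo + 250}): mean DEPEND = {band_means[lo]:.1f}")
```

Compare each band average with the middle of its band before you settle on an equation.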

Part 2

1. After you have savored this question for a while, you can move on to the next step, which is to do a linear regression with INDEPEND as the independent variable and DEPEND as the dependent variable. (Subtle, huh?) (You pull down the Analyze menu to Regression to Linear.) Before you hit the OK button, click on Statistics and make sure confidence intervals are checked. We want to focus on the unstandardized coefficients, which tell us the equation that best fits the points we have generated. The equation for a line is Y = a + bX. Y is the dependent variable and X is the independent variable. The value of a is the value under B and next to Constant. The value for b is the value under B and next to INDEPEND. Write out the equation that the computer found. How close is it to the true values?

2. There is a problem with what we have done. The variables we have generated leave us with the problem of heteroscedasticity. See if you can use the internet to find out what that is. It is a bad thing--it messes up the t-values and makes the levels of significance misleading. The good case is called homoscedasticity. (Doesn't statistics have some neat words and terms? My favorite has always been normal deviate. That was probably named by a psychologist.)

3. I think we can reconstruct the problem without the heteroscedasticity. We can use the same column of random numbers that we started with, but we must create the dependent variable in a different way. Let us generate it with the normal(stdev) item under compute. For stdev put in 170. Then add this to the expression: + .5*INDEPEND.

4. Repeat step 4 of Part 1 and step 1 of Part 2. What do you get? How do your results differ from the previous results? (Look at the scatter diagram. What is the difference?)

5. If you have extra time, you can try to plot the line you have computed relative to the points. When you do the regression, click the save button, and make sure the unstandardized predicted value is checked. This will add a new column to your data page, the predicted values from the regression. Then do a scatter diagram, but with overlay as the choice. You need to have two series, independ-depend and independ-predicted. If you do everything right, you will get a really interesting graph.
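To compare the two ways of building DEPEND (the one from Part 1 and the one from step 3 above) outside SPSS, something like the Python sketch below may help. It fits a line to each version and then measures how spread out the residuals are for small and for large values of INDEPEND. Splitting the cases at 500 is my own choice for illustration, and the normal draws are assumed to have mean 0.

```python
import random
import statistics

random.seed(6)  # fixed seed so the sketch is repeatable

n = 120
INDEPEND = [random.uniform(0, 1000) for _ in range(n)]

# Part 1 construction: uniform between 0 and INDEPEND.
DEPEND_OLD = [random.uniform(0, x) for x in INDEPEND]
# Part 2, step 3 construction: normal noise (sd 170, mean 0 assumed) + .5*INDEPEND.
DEPEND_NEW = [random.gauss(0, 170) + 0.5 * x for x in INDEPEND]


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def residual_spread(xs, ys, cut=500):
    """Standard deviation of the residuals for x below cut and for x at or above cut."""
    a, b = fit_line(xs, ys)
    low = [y - (a + b * x) for x, y in zip(xs, ys) if x < cut]
    high = [y - (a + b * x) for x, y in zip(xs, ys) if x >= cut]
    return statistics.pstdev(low), statistics.pstdev(high)


old_low, old_high = residual_spread(INDEPEND, DEPEND_OLD)
new_low, new_high = residual_spread(INDEPEND, DEPEND_NEW)

print(f"Part 1 version: residual sd {old_low:.1f} (low half), {old_high:.1f} (high half)")
print(f"Step 3 version: residual sd {new_low:.1f} (low half), {new_high:.1f} (high half)")
```

Which version has residuals whose spread depends on INDEPEND? That is the thing to look for in your two scatter diagrams.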

Think about what you have done. We will talk about these results in class on Monday.

Grading Fun

(This exercise was designed for use with SPSS, but can be modified to run with other statistical programs or with statistical calculators such as those found at http://www.wessa.net/slr.wasp and http://www.wessa.net/corr.wasp)

Here are grade-book data from an introductory economics class in several forms. At the top you have it as comma delimited, and then the two columns are presented separately. In the first column you have the total points for the first eight weeks, and in the second column you have total points from the last eight weeks. If you add the two together, you will get total points for the course.

What kind of relationship would you expect between these two columns? Should they be positively or negatively correlated? Should the correlation be strong or weak? Explain what you expect and why you expect it.

After you have formed your expectations, let us see what the numbers say. First, do a scatter diagram of the two columns. Do the results support your expectations or are they considerably different?

Compute the correlation. What does it tell you? Are the results in line with what you expected, or are they different?

Finally, compute a regression, saving the predicted values and the unstandardized residuals. To what extent could we predict the score after midsemester if we knew the midsemester total?

Find the students who performed best and worst compared to their predicted performance.

Let's compute some descriptive statistics on these data. Pull down Analyze to Descriptive Statistics, then move over to Descriptives. Click the Options button, and check Variance. Then hit OK until the program computes the results. Now take the variance for the residuals and subtract it from the variance for the second eight weeks. This should be the variance that has disappeared because of the regression. Take this number and divide it by the variance of the second eight weeks. Compare the results to the R-Square. What do you see? How do you explain what you get?
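The residual-based steps above (finding the students furthest from their predicted scores, and comparing variances with the R-Square) can be sketched in Python for anyone not using SPSS. The scores here are simulated stand-ins, not the grade-book data, and every constant used to generate them is arbitrary.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Simulated stand-in scores, NOT the grade-book data; all constants are arbitrary.
n = 60
first_half = [random.gauss(300, 40) for _ in range(n)]
second_half = [50 + 0.8 * f + random.gauss(0, 25) for f in first_half]

# Least-squares regression of second-half points on first-half points.
mx = statistics.mean(first_half)
my = statistics.mean(second_half)
sxx = sum((f - mx) ** 2 for f in first_half)
syy = sum((s - my) ** 2 for s in second_half)
sxy = sum((f - mx) * (s - my) for f, s in zip(first_half, second_half))
b = sxy / sxx
a = my - b * mx

predicted = [a + b * f for f in first_half]
residuals = [s - p for s, p in zip(second_half, predicted)]

# Students furthest above and furthest below their predicted second-half score.
best = max(range(n), key=lambda i: residuals[i])
worst = min(range(n), key=lambda i: residuals[i])
print(f"best vs. prediction:  case {best}, residual {residuals[best]:.1f}")
print(f"worst vs. prediction: case {worst}, residual {residuals[worst]:.1f}")

# Share of the second-half variance that "disappears" once the regression is used.
var_second = statistics.variance(second_half)
var_resid = statistics.variance(residuals)
explained_share = (var_second - var_resid) / var_second

# R-square computed separately, as the squared correlation.
r_square = sxy ** 2 / (sxx * syy)

print(f"(var(second) - var(residuals)) / var(second) = {explained_share:.4f}")
print(f"R-square                                     = {r_square:.4f}")
```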

Look at the correlation between the independent variable, midsemester totals, and the residuals. What do you discover?

We can try to see the regression line when we have only one independent variable, as we have in this case. Pull down the Graphs menu to Scatter, and select Overlay. You will want two pairs: final eight-week totals with midsem, and pre-1 (the saved predicted values) with midsem, in that order. (If they are not in this order, highlight them and click the swap button.) Hit OK and see what you get.

(There are many possible assignments using grades, point total, and student evaluations. Most of those that I have used are specific to the college I taught at so are not included here.)
