Computer Exercises: Simple Regression
(These exercises were designed for use with SPSS. They
may need to be modified to work with other programs or even
with the current version of SPSS.)
Statistics Fun: Car Problems
1. Here are data taken from an issue of the Chicago Tribune
from several years ago.
Included are prices of used Lincoln automobiles along with
their ages and mileage. Enter these data in SPSS and label
your columns.
Let us begin with some graphs. Pull down the Graphs menu
to "Legacy Dialogs" to "Scatter/Dot..." Select the simple
graph. Enter Age as the Y axis, and Mileage as the X axis.
What do you see?
2. Try looking at Age and Price, and Miles and Price.
Also see if you can make any sense of the other options of
scatter graphs.
3. Pull down the Analyze menu to Correlate. Select
Bivariate. Enter all three variables. There is a level of
significance given, which means that a hypothesis has been
tested. What was the hypothesis that was being tested?
4. Was the hypothesis being tested accepted or rejected?
How do you know?
5. What does "Significant at .000" mean?
6. It is now time to look at regression, a statistical
technique that goes far beyond correlation. Regression tries
to estimate the best fitting line through the data. That is,
if we know the age of the car, what would our best guess of
its price be?
Pull down the Analyze menu to Regression and select
"Linear." Enter Price as the dependent variable. This is the
variable you are trying to explain. Let's see if we can
explain it using the age of the cars. Enter age as the
independent variable. Run the regression.
7. There are some numbers which are important, and which
we will spend time exploring in the next few weeks. One is
the R-square. It is the square of the correlation
coefficient in our case, and represents the amount of
variation in price that can be explained by age. What is
your R-square?
8. Regression fits a line through a set of points. The
form of the line is Y=a + bX. What line have you fit? Price
= __________ + __________(age)
(Hint, you find this under coefficients--the Bs.)
9. When the age of a car increases by one year, what does
this equation say happens to price?
10. What does this equation say is the value of a new
Lincoln? (Hint: How old would a new Lincoln be?)
11. What does this equation predict as the value of a
20-year-old Lincoln? Does this answer make any sense? How do
you explain it?
12. Are you surprised to find our old friends, the
t-values or t-statistics, are still with us? This means that
we are testing a hypothesis. The hypothesis we are testing
is that age has no effect on the price of the car. Do we
accept or reject this hypothesis? How do you know?
13. The equation you have found overstates the importance
of age on price. Both age and mileage affect the price of a
used car, but we have not put in mileage. However, mileage
is correlated with age, so when we only put in age, we
capture not just the effect of age, but some of the effect
of mileage. What do you think we could do to solve this
problem?
14. Rerun the regression, but this time save the
unstandardized predicted values and residuals. (In the
regression setup dialog, click the save option.) After you
run the regression, see if you can plot the regression line
along with the original age-price points. (Use
scatter-overlay; the pairs should be price-age and
unstandardized predicted-age.)
15. In running the regression, we have tried to squeeze
out all the information that there is about price from age.
Suppose that there was a correlation between age and the
residual, the leftovers from the regression. Would that
suggest that we had been successful in squeezing out all the
information that was there? ________ What should the
correlation between age and the residuals be if we have
extracted all the information they have to offer? _________
Run the correlation. Do you get what you expected to
get?
16. There are two other columns of data labeled "one" and
"two." Do a correlation, and then do a regression explaining
one with two. You should have three levels of
significance--one with the correlation, one with an F-test,
and one with a t-test. Which of them is biggest? Which is
smallest? Are you surprised?
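If you are working outside SPSS, the steps in items 6 through 16 can be sketched in plain Python. This is only a rough illustration: the ages and prices below are made up (they are not the Tribune data), and the helper names are my own. It fits the line, computes R-square, predicts the price of a new and a 20-year-old car, checks the correlation between age and the residuals, and computes the three test statistics mentioned in item 16.

```python
import math

# Hypothetical (age in years, price in dollars) pairs, invented for
# illustration; they are NOT the Chicago Tribune data from the exercise.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
prices = [24000, 21500, 19800, 17000, 15500, 12900, 11000, 9500, 7000, 6200]


def mean(values):
    return sum(values) / len(values)


def cross_products(x, y):
    """Sum of (x - mean of x) * (y - mean of y)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))


def correlation(x, y):
    return cross_products(x, y) / math.sqrt(
        cross_products(x, x) * cross_products(y, y)
    )


n = len(ages)

# Least-squares line: Price = a + b(age)
b = cross_products(ages, prices) / cross_products(ages, ages)
a = mean(prices) - b * mean(ages)

predicted = [a + b * age for age in ages]
residuals = [price - fit for price, fit in zip(prices, predicted)]

# R-square: share of the variation in price explained by age
ss_total = cross_products(prices, prices)
ss_resid = sum(e ** 2 for e in residuals)
r_square = 1 - ss_resid / ss_total
r = correlation(ages, prices)

# Three test statistics for the hypothesis "age has no effect on price"
se_b = math.sqrt(ss_resid / (n - 2)) / math.sqrt(cross_products(ages, ages))
t_slope = b / se_b                                        # t-test on the slope
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # test on the correlation
f_stat = (ss_total - ss_resid) / (ss_resid / (n - 2))     # F-test

price_new = a              # predicted price at age 0
price_age_20 = a + b * 20  # predicted price at age 20
corr_age_resid = correlation(ages, residuals)

print(f"Price = {a:.0f} + ({b:.0f})(age)")
print(f"R-square = {r_square:.4f}; r squared = {r ** 2:.4f}")
print(f"Predicted price, new car: {price_new:.0f}")
print(f"Predicted price, 20-year-old car: {price_age_20:.0f}")
print(f"Correlation of age with residuals: {corr_age_resid:.6f}")
print(f"t (slope) = {t_slope:.3f}, t (from r) = {t_from_r:.3f}, F = {f_stat:.3f}")
```

Compare the two t-values with each other, and square one of them and compare it with F. That comparison bears on the question in item 16.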
Introduction to Linear Regression
1. Today we are going to meet linear regression analysis,
the big brother to correlation. We will again start by
generating three columns of numbers. Columns 1 and 2 are to
be random numbers. We have been generating them with a
uniform distribution, but today, just to change things, let
us generate them with a normal distribution. We use the same
Transform to Compute menu, but instead of finding
Uniform(?), we find Normal(?). The question mark is for
standard deviation, so let us put in something in the 50 to
100 range. After you get two columns of numbers, label them
X and error. Form a third column (which we will label Y) so
that Y=200 + 4X + error.
Let us check to see if X and error look like normal
distributions. Look at the histogram of them and have a
normal curve plotted with them. Do they look like they are
normally distributed?
Do a scatter diagram of Y and X. What do you see?
Now it is time to try some linear regression. Pull down
Analyze to Regression, move over to Linear, and click it. In
the dialog box that follows, make your dependent variable Y
and your independent variable X. Click OK.
Now you get a big mess of results. We will try to make
sense of almost all of them as time goes on. But for now,
let us go to the bottom, to the box entitled "Coefficients."
Look at unstandardized coefficients, under the B. If
regression has done its job right, you should have numbers
very close to 200 and 4. What is the equation that
regression found?
Y = _______ + __________X
What are your thoughts at this time?
If you have time, redo the entire problem, but instead of
getting random numbers from normal(?), get them from
uniform(?). Do the results change much?
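The same experiment can be sketched outside SPSS. The Python below is a minimal version under a few assumptions of mine: 1000 cases (the exercise does not fix a count), a standard deviation of 75, and an arbitrary seed. Like SPSS's Normal(?), random.gauss(0, s) draws from a normal distribution with mean 0 and standard deviation s.

```python
import random

random.seed(1)   # arbitrary seed so the run is repeatable

n = 1000         # number of cases (an assumption; pick your own)
std_dev = 75     # "something in the 50 to 100 range"

# Two columns of normal random numbers with mean 0
x = [random.gauss(0, std_dev) for _ in range(n)]
error = [random.gauss(0, std_dev) for _ in range(n)]

# Third column built from the known equation Y = 200 + 4X + error
y = [200 + 4 * xi + ei for xi, ei in zip(x, error)]

# Least-squares estimates of the intercept a and slope b
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx
a = y_bar - b * x_bar

print(f"Y = {a:.2f} + {b:.3f}X")
```

To try the uniform version, one option is to replace random.gauss(0, std_dev) with random.uniform(0, std_dev) in both columns.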
Statistics Lab: Random Regressions
Part 1
1. Let's go back to our old friend, random numbers
generated by the UNIFORM(?) item in the Compute menu.
Let us begin with a set of 120 random numbers between 0 and
1000. Call this column of numbers
INDEPEND (which stands for the independent variable).
2. Next we will use these numbers to help construct a
second column of numbers. Go back to compute, but this time
replace the question mark in UNIFORM with INDEPEND. What
this will do is have the computer look at whatever number is
in the INDEPEND column and then construct a new random
number that is between zero and that number. We will call
this column DEPEND. Do you see that the numbers in this
column must always be less than the corresponding number in
the INDEPEND column?
3. What is the mean of INDEPEND? What is the true mean of
DEPEND? Can you explain why? Do a t-test on each column and
see if you can reject the hypothesis that its mean equals
the true value. (If you can reject it, make sure you have
constructed the columns properly.)
4. Do a scatter diagram of these two columns with DEPEND
on the vertical axis and INDEPEND on the horizontal axis. If
you think really hard about what we have done, you should be
able to figure out what equation will fit best through these
points. (If you were really bad at algebra, this may be
beyond you, but if you were OK at it, this is a challenge
you should be able to solve. Hint, when the variable
INDEPEND has a value of 1000, the variable DEPEND will
randomly take a value somewhere between 0 and 1000. What is
the expected value of DEPEND? When the variable INDEPEND has
a value of 200, the variable DEPEND will randomly take a
value somewhere between 0 and 200. What is the expected
value of DEPEND? Can you see the equation that emerges?)
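Once you have worked out your own answers, you can check them with a short simulation. The sketch below builds the two columns in Python the same way Part 1 does in SPSS. The seed is arbitrary and the variable names are my own. It reports the column means and the average of DEPEND divided by INDEPEND, which you can compare with the equation you worked out from the hint.

```python
import random

random.seed(2)  # arbitrary seed so the run is repeatable

n = 120

# INDEPEND: uniform random numbers between 0 and 1000
independ = [random.uniform(0, 1000) for _ in range(n)]

# DEPEND: for each case, a uniform random number between 0 and INDEPEND
depend = [random.uniform(0, value) for value in independ]

mean_independ = sum(independ) / n
mean_depend = sum(depend) / n

# Is every DEPEND value no larger than its INDEPEND value?
always_smaller = all(d <= i for d, i in zip(depend, independ))

# Average of DEPEND / INDEPEND across cases
mean_ratio = sum(d / i for d, i in zip(depend, independ)) / n

print(f"Mean of INDEPEND: {mean_independ:.1f}")
print(f"Mean of DEPEND:   {mean_depend:.1f}")
print(f"DEPEND never exceeds INDEPEND: {always_smaller}")
print(f"Average DEPEND/INDEPEND: {mean_ratio:.3f}")
```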
Part 2
1. After you have savored this question for a while, you
can move on to the next step, which is to do a linear
regression with INDEPEND as the independent variable and
DEPEND as the dependent variable. (Subtle, huh?) (Pull down
Analyze to Regression to Linear.) Before you hit
the OK button, click on Statistics and make sure confidence
intervals are checked. We want to focus on the
unstandardized coefficients, which tell us the equation that
best fits the points we have generated. The equation for a
line is Y = a + bX. Y is the dependent variable and X is the
independent variable. The value of a is the value under B
and next to Constant. The value for b is the value under B
and next to INDEPEND. Write out the equation that the
computer found. How close is it to the true values?
2. There is a problem with what we have done. The
variables we have generated leave us with the problem of
heteroscedasticity. See if you can use the internet to find
out what that is. It is a bad thing--it messes up the
t-values and makes the levels of significance misleading.
The good case is called homoscedasticity. (Doesn't
statistics have some neat words and terms? My favorite has
always been normal deviate. That was probably named by a
psychologist.)
3. I think we can reconstruct the problem without the
heteroscedasticity. We can use the same column of random
numbers that we started with, but we must create the
dependent variables in a different way. Let us generate them
with the normal(stdev) item under compute. For stdev put in
170. Then add this to the expression: + .5*INDEPEND.
4. Repeat Part 1, step 4 and Part 2, step 1. What do you
get? How do your results differ from the previous results?
(Look at the scatter diagram. What is the difference?)
5. If you have extra time, you can try to plot the line
you have computed relative to the points. When you do the
regression, click the save button, and make sure the
unstandardized predicted value is checked. This will add a
new column to your data page, the predicted values from the
regression. Then do a scatter diagram, but with overlay as
the choice. You need to have two series, independ-depend and
independ-predicted. If you do everything right, you will get
a really interesting graph.
Think about what you have done. We will talk about these
results in class on Monday.
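One way to put a number on what the two scatter diagrams show is to compare the spread of the residuals for small and large values of INDEPEND. The Python sketch below does this for both ways of building the dependent variable. It is only an illustration: the seed, the helper names, and the idea of splitting the cases into a lower and an upper half are my own choices, not a formal test for heteroscedasticity.

```python
import math
import random

random.seed(3)  # arbitrary seed so the run is repeatable

n = 120
independ = [random.uniform(0, 1000) for _ in range(n)]

# Part 1 construction: uniform between 0 and INDEPEND
depend_part1 = [random.uniform(0, value) for value in independ]

# Part 2, step 3 construction: normal(170) + .5*INDEPEND
depend_part2 = [random.gauss(0, 170) + 0.5 * value for value in independ]


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def std_dev(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def spread_ratio(x, y):
    """Residual std. dev. for the larger-x half over the smaller-x half."""
    a, b = fit_line(x, y)
    ordered = sorted(zip(x, y))  # sort the cases by x
    residuals = [yi - (a + b * xi) for xi, yi in ordered]
    half = len(residuals) // 2
    return std_dev(residuals[half:]) / std_dev(residuals[:half])


a1, b1 = fit_line(independ, depend_part1)
a2, b2 = fit_line(independ, depend_part2)
ratio_part1 = spread_ratio(independ, depend_part1)
ratio_part2 = spread_ratio(independ, depend_part2)

print(f"Part 1: DEPEND = {a1:.1f} + {b1:.3f}(INDEPEND), spread ratio {ratio_part1:.2f}")
print(f"Part 2: DEPEND = {a2:.1f} + {b2:.3f}(INDEPEND), spread ratio {ratio_part2:.2f}")
```

A ratio near 1 says the residuals are about equally spread out across the range of INDEPEND. A ratio well above 1 says the spread grows as INDEPEND grows.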
Grading Fun
(This exercise was designed for use with SPSS, but can be
modified to run with other statistical programs or with
statistical calculators such as those found at http://www.wessa.net/slr.wasp
and http://www.wessa.net/corr.wasp)
Here are grade-book data
from an introductory economics class in several forms. At
the top you have it as comma delimited, and then the two
columns are presented separately. In the first column you
have the total points for the first eight weeks, and in the
second column you have total points from the last eight
weeks. If you add the two together, you will get total
points for the course.
What kind of relationship would you expect between these
two columns? Should they be positively or negatively
correlated? Should the correlation be strong or weak?
Explain what you expect and why you expect it.
After you have formed your expectations, let us see what
the numbers say. First, do a scatter diagram of the two
columns. Do the results support your expectations or are
they considerably different?
Compute the correlation. What does it tell you? Are the
results in line with what you expected, or are they
different?
Finally, compute a regression, saving the predicted
values and the unstandardized residuals. To what extent
could we predict the score after midsemester if we knew the
midsemester total?
Find the students who performed best and worst compared
to their predicted performance.
Let's compute some descriptive statistics on these data.
Pull down Analyze to Descriptive Statistics, then move over
to Descriptives. Click the options button, and check
variance. Then hit OK until the program computes the
results. Now take the variance for the residuals and
subtract it from the variance for the second eight weeks.
This should be the variance that has disappeared because of
the regression. Take this number and divide it by the
variance of the second eight weeks. Compare the results to
the R-Square. What do you see? How do you explain what you
get?
Look at the correlation between the independent variable,
midsemester totals, and the residuals. What do you
discover?
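The variance arithmetic and the residual correlation above can also be checked outside SPSS. Here is a minimal Python sketch. The point totals are invented for illustration (they are not the class grade book), and the variable names are my own. It fits the regression, finds the students who did best and worst relative to prediction, and compares the share of variance that "disappeared" with R-square.

```python
import math

# Hypothetical point totals, invented for illustration; NOT the class data.
midsem = [310, 285, 342, 298, 260, 325, 301, 278, 335, 290, 315, 270]
second = [305, 270, 350, 310, 255, 300, 315, 265, 340, 280, 300, 285]


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)


def variance(values):
    return covariance(values, values)


# Regression of second-half points on midsemester points
b = covariance(midsem, second) / variance(midsem)
a = mean(second) - b * mean(midsem)
predicted = [a + b * x for x in midsem]
residuals = [y - fit for y, fit in zip(second, predicted)]

# Students furthest above and below their predicted score (positions in the list)
best = max(range(len(residuals)), key=residuals.__getitem__)
worst = min(range(len(residuals)), key=residuals.__getitem__)

# R-square from sums of squares
ss_total = sum((y - mean(second)) ** 2 for y in second)
ss_resid = sum(e ** 2 for e in residuals)
r_square = 1 - ss_resid / ss_total

# Variance that disappeared because of the regression, as a share of the total
explained_share = (variance(second) - variance(residuals)) / variance(second)

# Correlation between the independent variable and the residuals
corr_x_resid = covariance(midsem, residuals) / math.sqrt(
    variance(midsem) * variance(residuals)
)

print(f"Second half = {a:.1f} + {b:.3f}(midsem)")
print(f"Best vs. prediction: case {best}; worst: case {worst}")
print(f"R-square = {r_square:.4f}; explained share of variance = {explained_share:.4f}")
print(f"Correlation of midsem with residuals: {corr_x_resid:.6f}")
```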
We can try to see the regression line when we have only
one independent variable, as we have in this case. Pull down
the Graphs menu to Scatter, and select Overlay. You will
want two pairs: final eight-week totals with midsem, and
pre_1 (the saved predicted values) with midsem, in that
order. (If they are not in this order, highlight them and
click the swap button.) Hit OK and see
what you get.
(There are many possible assignments using grades, point
totals, and student evaluations. Most of those that I have
used are specific to the college I taught at so are not
included here.)