 | X | Y | X² | XY |
Sum | 25 | 35 | 175 | 135 |
Sum the X and Y columns, square each X and sum the squares, then multiply each X by its Y and sum the products. The results are in the table above. The formula for the slope of the regression line is:
b = (n∑XY - ∑X∑Y) ÷ (n∑X² - (∑X)²)
= (5*135 - 25*35) ÷ (5*175 - 25*25)
= (675 - 875) ÷ (875 - 625) = -200 ÷ 250 = -0.8
The intercept is:
a = (∑Y - b(∑X)) ÷ n = (35 + .8*25) ÷ 5 = 55 ÷ 5 = 11.
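The hand calculation can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the totals are the ones used in the formulas above.

```python
# Column totals used in the worked example (n = 5 observations).
n = 5
sum_x, sum_y = 25, 35
sum_xy, sum_x2 = 135, 175

# Slope: b = (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - sum(X)^2)
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = (sum(Y) - b*sum(X)) / n
intercept = (sum_y - slope * sum_x) / n

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```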
(It may be useful to do one of these calculations by hand rather than letting a computer program do all the work, but once one has done one or two, it is best to let the machine do the calculation.)
Find the R² for the least squares regression line that you found.
Using a regression calculator, the total sum of squares was 34 and the amount explained by the regression was 32, so R² = 32 ÷ 34 = .94.
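As a small illustration, the same ratio can be written in Python. The two sums of squares are the calculator values quoted above; the variable names are assumptions made for this sketch. The second line shows the equivalent "one minus the unexplained share" form.

```python
total_ss = 34       # total sum of squares
explained_ss = 32   # sum of squares explained by the regression
residual_ss = total_ss - explained_ss  # what is left unexplained

r_squared = explained_ss / total_ss
r_squared_alt = 1 - residual_ss / total_ss  # same number, other route

print(f"R squared = {r_squared:.2f}")
```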
2. A regression is run using 100 observations to determine the relationship between price and the number of pages in a book. The regression yields this equation:
Price = 1.41 + 1.32(Number of pages)
1.41 + 1.32*500 = 1.41 + 660 = 661.41.
Using 2 as the t-value, the interval for the regression coefficient will be 1.32 ± 2*.13, or 1.32 ± .26.
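A short Python sketch of both steps follows. The function name predict_price is made up for illustration, and the .13 is taken to be the standard error of the coefficient, as it is used in the answer above.

```python
intercept, slope = 1.41, 1.32
std_error = 0.13   # standard error of the slope, as used in the answer
t_value = 2        # rough t-value for an interval of about 95%

def predict_price(pages):
    """Predicted price from the fitted equation."""
    return intercept + slope * pages

price_500 = predict_price(500)

# Interval for the slope: coefficient plus or minus t times its standard error.
margin = t_value * std_error
interval = (slope - margin, slope + margin)

print(f"predicted price for 500 pages: {price_500:.2f}")
print(f"interval for the slope: {interval[0]:.2f} to {interval[1]:.2f}")
```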
3. The regression equation for the numbers in the following table is Y = 8 + .5X. What is the standard error of estimate?
Predicted Y | Squared error |
10 | 1 |
10.5 | 0.25 |
13 | 0 |
11 | 1 |
10.5 | 2.25 |
We compute a standard error of estimate in the same way as we compute a standard deviation, except our degrees of freedom are n-2 instead of n-1. (We are estimating two things, not one. We cannot have an error term unless we have at least three observations, so the first two do not count for degrees of freedom.) The squared errors sum to 4.5, so the result is the square root of (4.5/3), or approximately 1.22.
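The calculation can be sketched in Python from the squared errors listed in the table (1, 0.25, 0, 1, and 2.25). The variable names are illustrative only.

```python
import math

# Squared errors from the table: (actual Y - predicted Y) squared.
squared_errors = [1, 0.25, 0, 1, 2.25]

n = len(squared_errors)
sum_squared_errors = sum(squared_errors)

# Divide by n - 2 because two parameters (intercept and slope) were estimated.
std_error_of_estimate = math.sqrt(sum_squared_errors / (n - 2))

print(f"standard error of estimate = {std_error_of_estimate:.2f}")
```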
4. Suppose we have run a regression with five observations and we have the following results:
X | error |
5 | -1 |
4 | 1 |
1 | 0 |
2 | ? |
0 | ? |
What are the last two values for the residuals? (Hint: They must sum to zero, and the correlation of the error terms and the independent variables must be zero.)
This is a hard problem requiring a bit of algebra.
All five errors must sum to zero. Calling the missing values v1 and v2, we have 0 = -1 + 1 + 0 + v1 + v2, or v1 = -v2.
The correlation coefficient between X and the errors must also be zero. We can ignore the denominator of the formula because the correlation can be zero only if the numerator is zero. We can also ignore the ∑X∑e term in the numerator because ∑e = 0. So ∑X*e must equal zero, or:
0 = -5 + 4 + 0 + 2*v1 + 0*v2;
1 = 2*v1;
v1 = .5; v2 = -.5
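The two conditions form a pair of linear equations in v1 and v2, so they can also be solved mechanically. The sketch below uses Cramer's rule with exact fractions; the variable names are illustrative, and the approach assumes the last two X values differ (otherwise the system has no unique solution).

```python
from fractions import Fraction

x = [5, 4, 1, 2, 0]
known_errors = [-1, 1, 0]      # residuals for the first three observations

# Condition 1: all residuals sum to zero     ->  v1 + v2 = rhs_sum
rhs_sum = -sum(known_errors)
# Condition 2: sum of X*e is zero            ->  x4*v1 + x5*v2 = rhs_cross
rhs_cross = -sum(xi * ei for xi, ei in zip(x, known_errors))

x4, x5 = x[3], x[4]
det = x5 - x4                  # determinant of [[1, 1], [x4, x5]]

v1 = Fraction(rhs_sum * x5 - rhs_cross, det)
v2 = Fraction(rhs_cross - x4 * rhs_sum, det)

print(f"v1 = {v1}, v2 = {v2}")
```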
5. Two researchers were interested in what relationship, if any, existed between a teacher's teaching effectiveness (measured by student evaluations) and his/her research ability (measured by the number of books or articles published over a three-year period). Taking a sample of 69, they obtained this result:
Teaching Effectiveness = 387.22 + 3.137(Research Ability)
R² = .155; t-value for the regression coefficient = 3.51
a) More research ability is associated with better teaching evaluations.
b) Only 15% of the variation in teaching evaluations is explained, which is not very much. We may be missing some important variables, and once they are included, the results might be very different. For example, if age matters for both, the regression coefficient may simply be capturing the effect of age.
c) The t-value is used to test the hypothesis that what we are getting is simply a random chance result and there is no true connection between these variables. The high t-value says that this result is very unlikely to happen randomly, so it seems that there is something real here.
d) The correlation coefficient in a simple, two-variable regression is the square root of the R², or .39.
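In a simple two-variable regression, the t-value for the slope can also be recovered from R² and the sample size, which gives a quick consistency check on the reported numbers. The sketch below assumes the standard identity t = r√(n-2) ÷ √(1-r²); the variable names are illustrative.

```python
import math

r_squared = 0.155
n = 69

# The correlation takes the sign of the slope, which is positive here.
r = math.sqrt(r_squared)

# t-value for the slope in a simple regression, from r and n.
t_value = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

print(f"r = {r:.2f}, t = {t_value:.2f}")
```

The result agrees with the reported t-value of 3.51 to within rounding.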
6. A teacher used a series of problems in a class that came from a variety of sources. After each set of problems, the students evaluated it in terms of usefulness, with 1 meaning very helpful and 5 meaning useless. The teacher wondered if the material from a prestigious school was better than the rest. He ran a regression using as the dependent variable the average student rating of the set of problems (remember, higher numbers mean less useful) and as an independent variable whether or not the problems came from the prestigious school (0 if from an ordinary school, 1 if from the prestigious school). Below are his results.
Variable | Coefficient | std error | t-statistic |
constant | 2.285 | .034 | 67.071 |
Prestige? | .214 | .057 | 3.780 |
R² = .212
n > 40
(Comment: This is a problem of comparing whether or not two means are the same. Here it is done with regression. It can also be done without regression using a two-sample t-test, a test that some introductory texts explain but which I have not included on this site. The results will be the same regardless of which method is used.)
(Use of a zero-one coding is common when we have an off-on situation. Variables with this coding are called dummy variables.)
a) 2.285
b) 2.285 + .214 = 2.499
c) The teacher thought that the questions from the prestige
school would be rated better, and since better is a lower
score, these expectations were not supported. It appears
that the lessons were rated as being inferior.
d) Random chance does not seem like a good explanation of
this difference because the t-value is 3.78. It is extremely
unlikely that we would get such a big gap by random chance,
so we conclude that there are real differences in the
material as far as the students are concerned.
e) Only a little more than 20% of the variation is
explained, which is not a great deal. Almost 79% of the
variation is left unexplained.
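To see why a regression on a dummy variable compares two means, here is a small Python sketch. The ratings are invented purely for illustration (they are not the teacher's data), and least_squares is a hypothetical helper that applies the usual slope and intercept formulas.

```python
from statistics import mean

# Hypothetical ratings, invented for illustration only.
ratings = [2.2, 2.4, 2.3, 2.1, 2.6, 2.5, 2.4]
prestige = [0, 0, 0, 0, 1, 1, 1]   # 1 = from the prestigious school

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope

intercept, slope = least_squares(prestige, ratings)
mean_ordinary = mean([r for r, p in zip(ratings, prestige) if p == 0])
mean_prestige = mean([r for r, p in zip(ratings, prestige) if p == 1])

# The intercept is the mean of the 0 group; intercept + slope is the mean of the 1 group.
print(f"intercept = {intercept:.3f}, mean of ordinary group = {mean_ordinary:.3f}")
print(f"intercept + slope = {intercept + slope:.3f}, mean of prestige group = {mean_prestige:.3f}")
```

This mirrors answers a) and b): the constant is the average rating for ordinary-school problems, and the constant plus the coefficient is the average for the prestigious school.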
7. Below are the results from a regression trying to predict the asking price of Cadillacs based on their mileage (measured in thousands of miles). (These data were taken from an issue of the Chicago Tribune a number of years ago.)
R Square | .603 |
Adjusted R Square | .591 |

Variable | Regression Coefficient | Std. Error | t | Significance |
Constant | 26303.415 | 1928.098 | 13.642 | .000 |
miles | -226.465 | 32.478 | -6.973 | .000 |
a) How successful is our attempt to explain the prices of
these cars? (Hint: Use R Square.)
b) If we have a Caddy that has 10,000 miles on it, what
would we predict for its price?
c) The level of significance for miles is .000. What is the
hypothesis being tested?
d) There is a problem with the regression. Miles and age
tend to go together, with older cars having more miles.
Perhaps we are capturing some of the effects of age when we
include only miles. How do you think we could fix this
problem?
a) We can explain 60% of the
variation in price based on the miles that the car has.
b) Price = $26303.415 - $226.465(10) = 26303.42 - 2264.65 =
24038.77.
c) The hypothesis being tested is that mileage does not
matter for the price of the car or, in other words, that the
true regression coefficient is zero and what we are getting
here is the result of random chance variation. The level of
significance says that getting a result like this by random
chance would happen (much) less than one time in one
thousand.
d) We can fix the problem by moving on to the next section,
where we find that we can have more than one independent
variable. We can have both age and miles explaining price.
It is the ability to have multiple independent variables
that makes regression such a powerful and commonly-used tool
in research.
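The prediction in b) and the t-value behind c) can be reproduced from the regression output with a few lines of Python. This is a sketch using the coefficients in the table; predicted_price is a name chosen for illustration.

```python
constant = 26303.415
coef_miles = -226.465   # change in price per thousand miles
se_miles = 32.478       # standard error of the miles coefficient

def predicted_price(miles_in_thousands):
    """Predicted asking price; mileage is measured in thousands of miles."""
    return constant + coef_miles * miles_in_thousands

price_10k = predicted_price(10)   # 10,000 miles is entered as 10

# The t-value is the coefficient divided by its standard error.
t_miles = coef_miles / se_miles

print(f"predicted price at 10,000 miles: {price_10k:.2f}")
print(f"t-value for miles: {t_miles:.3f}")
```

Note that the mileage is entered as 10, not 10,000, because the variable is measured in thousands of miles.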
8. For the data below, compute the correlation coefficient for X and Y.
Using a regression calculator, I get slope = .31818 and intercept = .76364, so the regression equation is:
Y = .76 + .32X