Answers: Simple Regression

(It is useful to calculate a regression coefficient by hand at least once. However, no one doing useful work calculates by hand; they use statistical programs. Here is a simple regression coefficient calculator from the Internet: http://www.easycalculation.com/statistics/regression.php.)

1. For the data below, compute regression coefficients a and b.

Y     X     X²    XY
5     7     49    35
9     4     16    36
10    1     1     10
3     10    100   30
8     3     9     24
35    25    175   135   (column sums)

Sum the X and Y columns, square X and sum, cross-multiply X and Y and sum. The results are in the table above, with the column sums in the last row. The formula for the slope of the regression line is:

b = (n∑XY - ∑X∑Y) ÷ (n∑X² - (∑X)²)
= (5*135 - 25*35) ÷ (5*175 - 25*25)
= (675 - 875) ÷ (875 - 625) = -200 ÷ 250 = -0.8

a = (∑Y - b(∑X)) ÷ n = (35 + .8*25) ÷ 5 = 55 ÷ 5 = 11.

(It may be useful to work through one of these calculations without a computer program, but once you have done one or two, it is best to let the machine do the calculation.)
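
For readers who want to let the machine do it, here is a minimal Python sketch of the same slope and intercept formulas. The function name simple_regression is only an illustrative choice, not part of any statistical package.

```python
def simple_regression(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)               # sum of X squared
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of X times Y
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Data from problem 1
X = [7, 4, 1, 10, 3]
Y = [5, 9, 10, 3, 8]
a, b = simple_regression(X, Y)
print(f"a = {a:.2f}, b = {b:.2f}")
```

The function works from the same four sums that appear in the last row of the table, so each intermediate value can be compared with the hand calculation.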

Find the R² for the least squares regression line that you found.

A regression calculator gives a total sum of squares of 34 and a sum of squares explained by the regression of 32, so R² = 32 ÷ 34 = .94.
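
The sums of squares can also be computed directly. This is a small sketch, assuming the coefficients a = 11 and b = -0.8 found in the first part; the variable names are my own.

```python
# Data from problem 1 and the fitted line Y = 11 - 0.8X
X = [7, 4, 1, 10, 3]
Y = [5, 9, 10, 3, 8]
a, b = 11.0, -0.8

mean_y = sum(Y) / len(Y)
predicted = [a + b * x for x in X]

# Total variation in Y around its mean
ss_total = sum((y - mean_y) ** 2 for y in Y)
# Variation of the predicted values around the mean of Y
ss_explained = sum((p - mean_y) ** 2 for p in predicted)
# Variation left over in the residuals
ss_residual = sum((y - p) ** 2 for y, p in zip(Y, predicted))

r_squared = ss_explained / ss_total
print(f"total = {ss_total:.1f}, explained = {ss_explained:.1f}, R squared = {r_squared:.2f}")
```

The explained and residual sums of squares add up to the total, which is a handy check on the arithmetic.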

2. A regression is run using 100 observations to determine the relationship between price and the number of pages in a book. The regression yields this equation:

Price = 1.41 + 1.32(Number of pages)

a) What price does this equation predict for a book with 500 pages?

b) If the standard deviation of the regression coefficient for pages is .13, what is a 95% confidence interval for the true coefficient?

a) 1.41 + 1.32*500 = 1.41 + 660 = 661.41.
b) Using 2 as the t-value, the interval is 1.32 ± 2*.13, or 1.32 ± .26 (from 1.06 to 1.58).
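
Both parts are simple arithmetic, sketched here in Python. The t-value of 2 is the rough figure used in the answer, not an exact critical value, and the variable names are illustrative.

```python
# Estimated equation: Price = 1.41 + 1.32 * (Number of pages)
intercept = 1.41
slope = 1.32
se_slope = 0.13   # standard deviation of the regression coefficient
t_value = 2       # rough 95% value used in the answer

# a) Predicted price for a 500-page book
predicted_price = intercept + slope * 500

# b) Approximate 95% confidence interval for the true coefficient
ci_low = slope - t_value * se_slope
ci_high = slope + t_value * se_slope

print(f"predicted price: {predicted_price:.2f}")
print(f"95% interval: {ci_low:.2f} to {ci_high:.2f}")
```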

3. The regression equation for the numbers in the following table is Y = 8 + .5X. What is the standard error of estimate?

X     Y     Predicted Y   e²
4     9     10            1
5     11    10.5          0.25
10    13    13            0
6     10    11            1
5     12    10.5          2.25

We compute a standard error of estimate in the same way as we compute a standard deviation, except that our degrees of freedom are n-2 instead of n-1. (We are estimating two things, not one. We cannot have an error term unless we have at least three observations, so the first two do not count for degrees of freedom.) The squared errors sum to 4.5 and n-2 = 3, so the result is the square root of (4.5/3), or approximately 1.22.
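
The same steps can be sketched in Python, using the X and Y values from the table and the given equation Y = 8 + .5X. The variable names are my own.

```python
import math

# Data from problem 3
X = [4, 5, 10, 6, 5]
Y = [9, 11, 13, 10, 12]

# Predicted values from the given equation Y = 8 + 0.5X
predicted = [8 + 0.5 * x for x in X]

# Squared errors (the e squared column) and their sum
squared_errors = [(y - p) ** 2 for y, p in zip(Y, predicted)]
sse = sum(squared_errors)

# Degrees of freedom are n - 2 because two coefficients are estimated
degrees_of_freedom = len(X) - 2
standard_error = math.sqrt(sse / degrees_of_freedom)

print(f"sum of squared errors = {sse}, standard error of estimate = {standard_error:.2f}")
```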

4. Suppose we have run a regression with five observations and we have the following results:

X     error
5     -1
4     1
1     0
2     ?
0     ?

What are the last two values for the residuals? (Hint: They must sum to zero, and the correlation of the error terms and the independent variables must be zero.)

This is a hard problem requiring a bit of algebra.

All five errors must sum to zero. Calling the missing values v1 and v2, we have 0 = -1 + 1 + 0 + v1 + v2 or v1 = - v2.

The correlation between X and the errors must also be zero. We can ignore the denominator of the correlation formula because the correlation is zero only if the numerator is zero. We can also ignore the ∑X∑e term in the numerator because ∑e = 0. So ∑X*e must equal zero, or:
0 = -5 + 4 + 0 + 2*v1 + 0*v2;
1 = 2*v1;
v1 = .5; v2 = -.5
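
The two conditions are a pair of linear equations in v1 and v2, and a short Python sketch can solve them and confirm the answer. The variable names s and c are my own shorthand.

```python
# Data from problem 4: all five X values and the three known residuals
X = [5, 4, 1, 2, 0]
known = [-1, 1, 0]

# Condition 1: the residuals sum to zero, so v1 + v2 = s
s = -sum(known)
# Condition 2: the sum of X*e is zero, so X[3]*v1 + X[4]*v2 = c
c = -sum(x * e for x, e in zip(X, known))

# Solve the two equations by substitution
v1 = (c - X[4] * s) / (X[3] - X[4])
v2 = s - v1

residuals = known + [v1, v2]
print(f"v1 = {v1}, v2 = {v2}")
print("sum of residuals:", sum(residuals))
print("sum of X*e:", sum(x * e for x, e in zip(X, residuals)))
```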

5. Two researchers were interested in what relationship, if any, existed between a teacher's teaching effectiveness (measured by student evaluations) and his/her research ability (measured by the number of books or articles published over a three-year period). Taking a sample of 69, they obtained this result:

Teaching Effectiveness = 387.22 + 3.137(Research Ability)
R² = .155; t-value for the regression coefficient = 3.51

a) What does the coefficient on Research Ability tell you?

b) What does the R² tell you?

c) You are given a t-value. What does it mean?

d) It is possible to find the correlation coefficient of the two variables from the information above. What is it?

a) The coefficient is positive: each additional book or article published goes with a teaching evaluation that is 3.137 points higher, so more research ability means better teacher evaluations.
b) Only about 15.5% of the variation in teaching evaluations is explained, which is not very much. We may be missing some important variables, and once they are included, the results might be very different. For example, if age matters for both, the regression coefficient may simply be capturing the effect of age.
c) The t-value is used to test the hypothesis that what we are getting is simply a random chance result and there is no true connection between these variables. The high t-value says that this result is very unlikely to happen randomly, so it seems that there is something real here.
d) The correlation coefficient in a simple, two-variable regression is the square root of the R², or .39. It is positive because the slope is positive.
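
As a check on part d, here is a small Python sketch that recovers the correlation from R² and then rebuilds the t-value from it. The identity t = r√(n-2) ÷ √(1-r²) for the slope in a simple regression is standard, but it is not stated in the problem, so treat its use here as my addition.

```python
import math

# Figures reported in problem 5
r_squared = 0.155
n = 69
slope = 3.137

# The correlation takes the sign of the slope
r = math.copysign(math.sqrt(r_squared), slope)

# In a simple regression the t-value for the slope can be rebuilt from r and n
t = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

print(f"r = {r:.2f}, t = {t:.2f}")
```

The rebuilt t-value agrees with the reported 3.51 to two decimal places, which suggests the reported figures are consistent with each other.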

6. A teacher used a series of problems in a class that came from a variety of sources. After each set of problems, the students evaluated it in terms of usefulness, with 1 meaning very helpful and 5 meaning useless. The teacher wondered if the material from a prestigious school was better than the rest. He ran a regression using as the dependent variable the average student rating of the set of problems (remember, higher numbers mean less useful) and as an independent variable whether or not the problems came from the prestigious school (0 if from an ordinary school, 1 if from the prestigious school). Below are his results.

Variable    Coefficient   std error   t-statistic
constant    2.285         .034        67.071
Prestige?   .214          .057        3.780

R² = .212
n > 40

a) What was the average rating of the lessons from the ordinary schools?

b) What was the average rating of the lessons from the prestigious school?

c) Was the expectation of the teacher confirmed?

d) Suppose the claim was that the lessons from the prestigious school were just like the other lessons and that any differences are due to random chance. Does random chance look like a good explanation of the differences in the quality of the lessons as perceived by students? What number do we use to answer this?

e) How much of the variation in student evaluations did the teacher explain with this regression? Is this a lot or a little?

(Comment: This is a problem of comparing whether or not two means are the same. Here it is done with regression. It can also be done without regression using a two-sample t-test, a test that some introductory texts explain but which I have not included on this site. The results will be the same regardless of which method is used.)

(Use of a zero-one coding is common when we have an off-on situation. Variables with this coding are called dummy variables.)

a) 2.285
b) 2.285 + .214 = 2.499
c) The teacher thought that the problems from the prestigious school would be rated better. Since better means a lower score and the coefficient is positive, this expectation was not supported. It appears that those lessons were rated as inferior.
d) Random chance does not seem like a good explanation of this difference because the t-value is 3.78. It is extremely unlikely that we would get such a big gap by random chance, so we conclude that there are real differences in the material as far as the students are concerned.
e) Only a little more than 20% of the variation is explained, which is not a great deal. Almost 79% of the variation is left unexplained.
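
With a dummy variable, the group averages come straight from the coefficients. The Python sketch below uses only the figures reported in the table; the variable names are my own. The ratio of the rounded coefficient to its rounded standard error comes out a little below the reported 3.780, presumably because the printed figures are rounded.

```python
# Figures reported in problem 6
constant = 2.285
prestige_coef = 0.214
prestige_se = 0.057
r_squared = 0.212

# Dummy = 0 for ordinary schools, 1 for the prestigious school
mean_ordinary = constant + prestige_coef * 0
mean_prestige = constant + prestige_coef * 1

# t-statistic is the coefficient divided by its standard error
t_stat = prestige_coef / prestige_se

# Share of the variation left unexplained
unexplained = 1 - r_squared

print(f"ordinary = {mean_ordinary:.3f}, prestigious = {mean_prestige:.3f}")
print(f"t = {t_stat:.2f}, unexplained share = {unexplained:.3f}")
```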

7. Below are the results from a regression trying to predict the asking price of Cadillacs based on their mileage (measured in thousands of miles). (These data were taken from an issue of the Chicago Tribune a number of years ago.)

R Square            .603
Adjusted R Square   .591

Variable   Regression Coefficient   Std. Error   t        Significance
Constant   26303.415                1928.098     13.642   .000
miles      -226.465                 32.478       -6.973   .000

a) How successful is our attempt to explain the prices of these cars? (Hint: Use R Square.)
b) If we have a Caddy that has 10,000 miles on it, what would we predict for its price?
c) The level of significance for miles is .000. What is the hypothesis being tested?
d) There is a problem with the regression. Miles and age tend to go together, with older cars having more miles. Perhaps we are capturing some of the effects of age when we include only miles. How do you think we could fix this problem?

a) We can explain 60% of the variation in price based on the miles that the car has.
b) Price = $26303.415 - $226.465(10) = 26303.42 - 2264.65 = 24038.77.
c) The hypothesis being tested is that mileage does not matter for the price of the car or, in other words, that the true regression coefficient is zero and what we are getting here is the result of random chance variation. The level of significance says that getting a result like this by random chance would happen (much) less than one time in one thousand.
d) We can fix the problem by moving on to the next section, where we find that we can have more than one independent variable. We can have both age and miles explaining price. It is the ability to have multiple independent variables that makes regression such a powerful and commonly-used tool in research.
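
The prediction in part b and the t column in the table can be reproduced with a few lines of Python. This sketch uses only the reported coefficients and standard errors; the function name predicted_price is hypothetical.

```python
# Figures reported in problem 7 (miles are measured in thousands)
constant = 26303.415
constant_se = 1928.098
miles_coef = -226.465
miles_se = 32.478

def predicted_price(miles_in_thousands):
    """Predicted asking price from the estimated regression line."""
    return constant + miles_coef * miles_in_thousands

# b) A car with 10,000 miles has miles = 10
price = predicted_price(10)

# Each t-value is the coefficient divided by its standard error
t_constant = constant / constant_se
t_miles = miles_coef / miles_se

print(f"predicted price: {price:.2f}")
print(f"t for constant = {t_constant:.3f}, t for miles = {t_miles:.3f}")
```

Because miles are in thousands, the function takes 10 rather than 10,000 for a car with 10,000 miles.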

8. For the data below, compute the correlation coefficient for X and Y.

Y     X
-1    0
1     1
0     2
2     1
3     6

Then compute the values of a and b in the regression equation Y = a + bX.

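For the data as printed, ∑X = 10, ∑Y = 5, ∑XY = 21, ∑X² = 42 and ∑Y² = 15. The correlation coefficient is r = (5*21 - 10*5) ÷ √((5*42 - 10²)(5*15 - 5²)) = 55 ÷ √(110*50) = .74. The slope is b = 55 ÷ 110 = .5 and the intercept is a = (5 - .5*10) ÷ 5 = 0, so the regression equation is:

Y = 0 + .5X

(A regression calculator gives slope = .31818 and intercept = .76364, or Y = .76 + .32X, when the first Y value is entered as 1 rather than -1.)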
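
The correlation and the regression coefficients can be checked together with a short Python sketch. The function name correlation_and_line is hypothetical. The first call uses the data exactly as printed in the table; the second shows how the results change if the first Y value is entered as 1 instead of -1.

```python
import math

def correlation_and_line(xs, ys):
    """Return (r, a, b): the correlation and the line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return r, a, b

# Data as printed in problem 8
X = [0, 1, 2, 1, 6]
Y = [-1, 1, 0, 2, 3]
r, a, b = correlation_and_line(X, Y)
print(f"as printed: r = {r:.2f}, a = {a:.2f}, b = {b:.2f}")

# Same data with the first Y value entered as 1 instead of -1
Y_alt = [1, 1, 0, 2, 3]
r_alt, a_alt, b_alt = correlation_and_line(X, Y_alt)
print(f"first Y as 1: r = {r_alt:.2f}, a = {a_alt:.5f}, b = {b_alt:.5f}")
```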
