Computer Exercises: Box Model
(This exercise was designed to work with SPSS. It may be
modified to work with other programs, and it may have to be
modified to work with the current version of SPSS. You might
be able to do most of this exercise using a random number
generator from the Internet (http://www.random.org/integers/
is a good one) and doing the computations with a simple
statistics calculator such as http://25yearsofprogramming.com/javascript/descriptivestatistics.htm.)
1. We begin by simulating dice rolls. In terms of the box
model, rolling a die 144 times and adding up the number of
spots showing is like taking ________ draws from the box
with tickets: __________________________________.
The average of this box is: _________________.
The standard deviation of this box is about 1.71. If you
have time at the end, come back and see if you can compute
this.
If we roll a die 144 times, we expect a sum of _______
give or take about _________. (The number in the first blank
is n times the average of the box. The number in the second
blank is the standard deviation of the box multiplied by the
square root of n.)
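The two hints in the parentheses can be checked with a short
calculation. What follows is a minimal sketch in Python, which is not
part of the original exercise; the names box, n, box_average, and
box_sd are made up for illustration.

```python
import math

# Hypothetical names; the box holds one ticket per face of the die.
box = [1, 2, 3, 4, 5, 6]
n = 144  # number of draws (rolls)

box_average = sum(box) / len(box)
# SD of the box: root-mean-square deviation from the average
# (divide by the number of tickets, not by one less).
box_sd = math.sqrt(sum((t - box_average) ** 2 for t in box) / len(box))

expected_sum = n * box_average   # n times the average of the box
se_sum = math.sqrt(n) * box_sd   # square root of n times the SD of the box

print(f"box SD       = {box_sd:.2f}")
print(f"expected sum = {expected_sum:.0f}")
print(f"SE of sum    = {se_sum:.1f}")
```

Changing n to 900 gives the figures needed when the exercise is
repeated with more rolls.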
2. To roll the die 144 times, scroll down in SPSS to 144
and put a number in the row. Then, pull down the Transform
menu to Random Number Generators. Set the starting point to
Random. Click OK.
Second, pull down the Transform menu to Compute. You will
get a complex dialog box. In the Target Variable spot, type in
a name (rnd1 is a good name). Then in the scrolling field
called Function group scroll down to Random Numbers and
highlight it. Then in the scrolling field below that, scroll
down to Rv.Uniform and click it up to the top. You will get
two question marks. You want the first to be 1 and the
second to be 7. Hit the OK button. You should now have a
column of random numbers.
All the random numbers should be between what two
values?______ and ________
3. Let us create another column in the same way and call
it rnd2. We are not done yet, because our numbers are real
numbers, and dice rolls are integers. To make our real
numbers into integers, we will truncate them (which means we
will chop off everything after the decimal point). Go back
to the Compute command (under Transform), and empty the
fields you so carefully completed in the step above. Now
scroll down in the first scroll field to Arithmetic, choose
Trunc in the second field, and click it up. (Or you can just
type TRUNC(rnd1) in the Numeric Expression field.) Let us
call this new variable die1 and put that name on the left.
Then replace ? with rnd1, so you have the equation
die1=TRUNC(rnd1). Hit OK. You should see that there is a new
column created with just digits from 1 to 6.
Repeat the step above to create a new variable die2 that
will be the truncated version of rnd2.
(Your new variables die1 and die2 should have numbers
between 1 and 6.)
Finally, let us add die1 and die2. Go back to Compute,
name the new variable SUM, and on the right put
die1+die2
Now we are ready to have some fun. If we add up either
column die1 or die2, we should get about ______ give or take
about ____________. If we find the average for either column
die1 or die2, we should get about ____ give or take about
_______.
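For readers without SPSS, the column-building steps above can be
imitated in a few lines of Python. This is only a sketch under stated
assumptions: Python's random module stands in for RV.UNIFORM, int()
stands in for TRUNC, the seed 2024 is arbitrary, and the variable
names simply echo the SPSS column names.

```python
import random

rng = random.Random(2024)  # fixed seed (an arbitrary choice) so a run can be repeated
n = 144

# rnd1, rnd2: real numbers from 1 up to (but not including) 7, like RV.UNIFORM(1, 7)
rnd1 = [1 + 6 * rng.random() for _ in range(n)]
rnd2 = [1 + 6 * rng.random() for _ in range(n)]

# die1, die2: chop off the decimals, like TRUNC(rnd1) and TRUNC(rnd2)
die1 = [int(x) for x in rnd1]
die2 = [int(x) for x in rnd2]

# SUM: the two dice added row by row
total = [a + b for a, b in zip(die1, die2)]

print("sum of die1:", sum(die1), " average:", sum(die1) / n)
print("sum of die2:", sum(die2), " average:", sum(die2) / n)
```

The printed sums and averages are what the blanks in this step ask
you to predict before looking.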
4. We can add the columns using the Descriptives command:
pull down the Analyze menu to Descriptive Statistics and slide
over to Descriptives. Click the Options button and make sure Sum is
checked (otherwise you will not get a sum). Let us add both
the die1 and die2 columns.
On the first die you got a sum of ________. This is a
total of an expected value of _______ plus a chance error of
__________. If we compute a z-score for this sum, it will be
the (sum minus the expected value) divided by the standard
error of the sum. The z-score on the first die is _______.
About _____% of the class should get z-scores between -1 and
+1. About _______% of the class should get z-scores between
-2 and 2.
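The z-score recipe in this step is short enough to write as a
function. The sketch below is in Python; the function name
z_score_of_sum and the observed sum of 530 are invented for
illustration and are not results from the exercise.

```python
import math

def z_score_of_sum(observed_sum, n, box_average, box_sd):
    """(sum minus the expected value) divided by the standard error of the sum."""
    expected = n * box_average
    se = math.sqrt(n) * box_sd
    return (observed_sum - expected) / se

# Made-up observed sum of 530 for 144 rolls of a fair die
z = z_score_of_sum(530, n=144, box_average=3.5, box_sd=1.71)
print(round(z, 2))
```

Substitute your own sum for 530 to check the z-score you computed by
hand.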
5. Repeat for the second die. On the second die you got a
sum of ________. This is a total of an expected value of
_______ plus a chance error of _______. The z-score for the
second die is _______.
6. If we do a histogram of die1 or die2, what should it
look like? How close are the actual histograms to what you
expect?
7. If we do a bar chart of the sum, we should get a
pyramidal shape similar to what we discussed in class. Do
it. How close is it to the expected shape?
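The pyramidal shape can be worked out exactly by listing all 36
equally likely pairs of faces. The following Python sketch (an
addition, not part of the original SPSS exercise) prints a rough text
version of the probability histogram for the sum of two dice.

```python
from collections import Counter
from fractions import Fraction

# Count how many of the 36 equally likely (die1, die2) pairs give each sum.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))

for total in range(2, 13):
    prob = Fraction(counts[total], 36)
    # One '#' per pair, so the bars form the pyramid.
    print(f"{total:2d}  {'#' * counts[total]:<6}  {prob}")
```

Comparing these bars with the SPSS bar chart shows how far 144 pairs
of rolls stray from the ideal shape.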
8. Repeat steps 1 to 7 using 900 rolls instead of
144.
- Expected Sums:
- Standard error of sums:
- Actual Sums:
- Chance Errors:
- Z-scores of sums:
9. Does the bar chart of the sum of two rolls look more
like the probability histogram with 900 rolls or with
144?
10. As a final problem, let us simulate a box model that
is a little different. Generate 900 random numbers between 0
and 1.25. (That will be RV.UNIFORM(0, 1.25).) Then generate
a column by truncating these numbers. You should get a
column of zeros and ones. This column simulates drawing from
a box with replacement of ______ tickets with a value of
zero and _______ tickets with a value of one.
11. The expected value of this box is ______. The standard
deviation of this box is ______ (You can use the nifty
short-cut method described earlier to get the standard
deviation. The calculation is so simple you should not need
a calculator!)
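The short-cut method is not restated in this handout. Assuming it is
the usual rule for a box with only two different values, (big value
minus small value) times the square root of (fraction of big tickets
times fraction of small tickets), the Python sketch below compares it
with the long way. The function names and the example box are
illustrative and are deliberately not the box from this problem.

```python
import math

def sd_long(box):
    """SD of a box: root-mean-square deviation from the average."""
    avg = sum(box) / len(box)
    return math.sqrt(sum((t - avg) ** 2 for t in box) / len(box))

def sd_shortcut(box):
    """Assumed short-cut for a box that holds only two different values."""
    small, big = min(box), max(box)
    frac_big = box.count(big) / len(box)
    frac_small = box.count(small) / len(box)
    return (big - small) * math.sqrt(frac_big * frac_small)

example_box = [0, 0, 0, 1]  # illustrative box, not the one in the exercise
print(sd_long(example_box), sd_shortcut(example_box))
```

The two functions should agree for any two-valued box, which is the
point of the short-cut.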
12. Use the Frequencies command (under
Analyze-->Descriptive Statistics) to find the percentage
of zeros and ones in the column. Is it close to what you
expect? Explain.
13. The expected sum of the column is ______ give or take
about _________.
14. The actual sum of the column is ______. This is a
total of an expected value of _____ plus a chance error of
_______.
15. (A peek ahead) Suppose you did not know the box from
which these numbers were drawn, but were interested in
knowing what the average and standard deviation of the box
were. You could estimate them by looking at the average and
standard deviation of your column of 900 numbers. The sum of
the column could be between ____ and ____, which means the
average could be between ______ and ______. The average of
my sample is in fact _____, which is _______ away from the
true average. Would your estimate be pretty good or pretty
bad?
16. Suppose you wanted to simulate a box model with two
tickets with a two and one ticket with a zero. How could you
make SPSS do that?
We have been talking about taking repeated draws from a box
with replacement. Today we are going to do exercises similar
to those in the book, but with sample data. We will use the
ability of Excel to generate random numbers.
1. We begin by simulating dice rolls. In terms of the box
model, rolling a die 144 times and adding up the number of
spots showing is like taking ________ draws from the box
with tickets: __________________________________.
The average of this box is: _________________.
The standard deviation of this box is about 1.71. If you
have time at the end, come back and see if you can compute
this.
If we roll a die 144 times, we expect a sum of _______
give or take about _________. (The number in the first blank
is n times the average of the box. The number in the second
blank is the standard deviation of the box multiplied by the
square root of n.)
2. To simulate rolling a die in Excel, use the formula
=CEILING(6*RAND(),1). If you enter this 144 times (it helps to
cut and paste) you will have simulated 144 rolls of dice.
Check your results. Are all the numbers between 1 and 6
inclusive?
I leave it to you to get the sum and standard deviation
of these 144 rolls. Explain how you did it.
On your 144 rolls you got a sum of ________. This is a
total of an expected value of _______ plus a chance error of
__________. If we compute a z-score for this sum, it will be
the (sum minus the expected value) divided by the standard
error of the sum. The z-score on the sum is _______. About
_____% of the class should get z-scores between -1 and +1.
About _______% of the class should get z-scores between -2
and 2.
3. If we do a histogram of our dice rolls, what should it
look like? How close are the actual histograms to what you
expect?
Repeat using 900 rolls instead of 144.
Expected Sums:
Standard error of sums:
Actual Sums:
Chance Errors:
Z-scores of sums:
A Mystery Box
1. Let us simulate a box model that is a little
different. Generate 900 random numbers between 0 and 1.25.
(That will be RV.UNIFORM(0, 1.25).) Then generate a column
by truncating these numbers. You should get a column of
zeros and ones. This column simulates drawing from a box
with replacement of ______ tickets with a value of zero and
_______ tickets with a value of one.
2. The expected value of this box is ______. The standard
deviation of this box is ______. (You can use the nifty
short-cut method described earlier to get the standard
deviation. The calculation is so simple you should not need
a calculator!)
3. Use the Frequencies command (under
Analyze-->Descriptive Statistics) to find the percentage
of zeros and ones in the column. Is it close to what you
expect? Explain.
4. The expected sum of the column is ______ give or take
about _________.
5. The actual sum of the column is ______. This is a
total of an expected value of _____ plus a chance error of
_______.
6. (A peek ahead) Suppose you did not know the box from
which these numbers were drawn, but were interested in
knowing what the average and standard deviation of the box
were. You could estimate them by looking at the average and
standard deviation of your column of 900 numbers. The sum of
the column could be between ____ and ____, which means the
average could be between ______ and ______. The average of
my sample is in fact _____, which is _______ away from the
true average. Would your estimate be pretty good or pretty
bad?
Marbles in a Box
1. Suppose we have a box that has a trillion marbles. One
third of them are red. If we draw out 200 marbles, how many
red marbles should we expect to get? ________
In terms of the box model in the book, suppose we count
drawing a red marble as a one and any other marble as a
zero. We can view this situation as drawing with replacement
from a box with these three tickets: _____________
2. The standard deviation of the box in this case is
about .471. (How did I get that? ______________) If we
compute the standard error of the count (or sum), it is
6.67. (How was this computed?) If we compute the standard
error of the percent, it is about 3.33%. Hence, we would
expect that 95% of the time the sum should be in the range
__________ to ___________.
3. We expect that 95% of the time, the percent should be
in the range _________ to ___________.
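The figures quoted in part 2 (.471, 6.67, and 3.33%) can be
reproduced with the sketch below. It is written in Python as an
illustration only; the variable names are invented, and the 95% range
uses the rule of two standard errors on either side of the expected
value.

```python
import math

n = 200
p_red = 1 / 3  # fraction of tickets marked 1 (red marbles)

box_sd = math.sqrt(p_red * (1 - p_red))  # SD of a zero-one box
se_count = math.sqrt(n) * box_sd         # standard error of the count (sum)
se_percent = se_count / n * 100          # standard error of the percent

expected_count = n * p_red
low = expected_count - 2 * se_count
high = expected_count + 2 * se_count

print(f"box SD = {box_sd:.3f}")
print(f"SE of count = {se_count:.2f}, SE of percent = {se_percent:.2f}%")
print(f"about 95% of sums fall between {low:.1f} and {high:.1f}")
```

Dividing low and high by n and multiplying by 100 gives the matching
range for the percent.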
4. Now it is time to let the computer simulate drawing
200 marbles from this box. There are a couple ways of doing
this, but let us use the easiest one. Go to Compute under
the Transform menu. You need the equation
v1=TRUNC(RV.UNIFORM(0,1.5)). Why does this equation generate
a series similar to what we would get drawing red marbles
out of the box described above?
5. Once you are doing it correctly, it is easy to add
columns to see what happens if we draw many different
samples of 200 from the box. All you have to do is change
the v1 to v2, v3, etc. Construct 20 columns of these
numbers. Then look at the descriptive statistics. Pull down
Analyze-->Descriptive Statistics-->Descriptives.
Click options and ask for sums. Click all the v1 to v20
variables over and then click OK. Are your sums close to
what you expected to get up in part 1? What is the biggest
chance error?
6. Are your standard deviations close to .471? What is
the biggest error?
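Parts 4 to 6 can be imitated outside SPSS. The Python sketch below
builds 20 columns of 200 draws the same way the SPSS equation does
and reports the sums, the biggest chance error, and the spread of the
standard deviations. The seed and all names are arbitrary choices for
illustration, and the numbers it prints will not match your SPSS run.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary fixed seed so the run is repeatable
n, n_columns = 200, 20
expected_sum = n / 3    # one third of the marbles are red

# Each column mimics v1 = TRUNC(RV.UNIFORM(0, 1.5)):
# values below 1 (two thirds of the range) become 0, the rest become 1.
columns = [[int(rng.uniform(0, 1.5)) for _ in range(n)]
           for _ in range(n_columns)]

sums = [sum(col) for col in columns]
chance_errors = [s - expected_sum for s in sums]
# statistics.stdev divides by n - 1, as the SPSS Descriptives output does.
sds = [statistics.stdev(col) for col in columns]

biggest = max(chance_errors, key=abs)
print("sums:", sums)
print(f"biggest chance error: {biggest:+.1f}")
print(f"SDs range from {min(sds):.3f} to {max(sds):.3f}")
```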
7. We know what was in the box from which these samples
were drawn. But suppose we did not. How close could we
estimate the average of the box? (Note: Statistics more
often deals with sample averages than sample sums, but we
can easily jump from one to the other.) We can see by
creating confidence intervals. To do this, pull down
Analyze-->Compare Means-->One-Sample T Test. Click
over the v1 to v20 variables. Click options and you should
see that the confidence interval is set to 95%. Put 0 in the
test value box. And then click OK.
In the output you get, how close are the standard errors
of the mean to the value that was given in part 2 above? Why
are they not all the same?
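To see where the numbers in the output come from, here is a rough
Python sketch of one confidence interval. It assumes the interval is
the sample mean plus or minus a t multiplier times the standard error
of the mean, and it uses 1.97 as an approximation to the 95%
multiplier for 199 degrees of freedom. The seed and names are
illustrative.

```python
import math
import random
import statistics

rng = random.Random(11)  # arbitrary fixed seed
sample = [int(rng.uniform(0, 1.5)) for _ in range(200)]  # one column of 200 draws

mean = statistics.mean(sample)
# Standard error of the mean: the SD of this sample divided by the square root of n.
se_mean = statistics.stdev(sample) / math.sqrt(len(sample))

# Assumption: 1.97 approximates the 95% t multiplier for 199 degrees of freedom.
t_multiplier = 1.97
lower = mean - t_multiplier * se_mean
upper = mean + t_multiplier * se_mean

print(f"mean = {mean:.4f}, SE of mean = {se_mean:.4f}")
print(f"95% interval: {lower:.4f} to {upper:.4f}")
print("contains 1/3:", lower <= 1 / 3 <= upper)
```

Note that se_mean is computed from the SD of the sample itself, not
from the SD of the box.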
8. (Are you thinking about what you are doing? Or are you
just mechanically answering questions trying to get out of
here?) About 1 in 20 confidence intervals that we have
created should not contain the true mean of .3333. (Why one
in twenty? ______________________) Do you have any? If so,
what is it? If not, what is the one that is closest to not
containing .3333?
9. We can also create 99% confidence intervals. Redo the
steps in part 7, but when you click Options, put in 99
instead of the 95 that is there. What happens to your
confidence intervals?
10. The primary purpose of the One Sample T-test is not
to construct confidence intervals, but to test claims. I
claim that the true mean is .333. Let us test that. Go back
and redo part 7, but put .333 in the Test Value box. The t-values
you get in the output are like z-scores. A t-score of 2
would mean that the average of the sample is 2 standard
errors away from .333. Do you have any over 2? If you
do, the confidence interval that you constructed above
probably does not have .3333 in it. Does it? (Or if all your
confidence intervals had .333 in them, the t-values you get
should all be less than 2. Are they?) We will explain this
more later.
11. Suppose we did not know anything about the box but
believed that 50% of the marbles in it were red. Redo part
10 above, but with .5 as the test value. You should get
t-values that are negative and some quite a lot less than
-2. What is that telling you about the probability of
getting this sample if the true mean of the box was .5?
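The t-values in parts 10 and 11 can be checked by hand with the
sketch below, written in Python. It assumes the t-value is the sample
mean minus the test value, divided by the standard error of the mean.
The function name t_value and the sample of 60 reds in 200 draws are
made up for illustration.

```python
import math
import statistics

def t_value(sample, test_value):
    """(sample mean minus test value) divided by the standard error of the mean."""
    se_mean = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - test_value) / se_mean

# Made-up sample: 60 reds (ones) and 140 others (zeros) in 200 draws
sample = [1] * 60 + [0] * 140

print("test value 1/3:", round(t_value(sample, 1 / 3), 2))
print("test value 0.5:", round(t_value(sample, 0.5), 2))
```

With this made-up sample the first t-value is small and the second is
far below -2, which is the pattern part 11 describes.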
Coin Flips
The textbook gives lots of examples of taking repeated
draws from a box with replacement. Today we are going to do
exercises similar to those in the book, but with sample data
that we will generate using a program called SPSS.
Let us begin with simulating flipping a coin, with the
wager that you will win a dollar if heads comes up and lose
a dollar if tails comes up.
This is equivalent to taking draws from a box with two
tickets. What amounts do we want on the two tickets?
What is the average of the box?
What is the standard deviation of the box?
If we draw 100 tickets from this box (play our
coin-flipping game 100 times), what would we expect to win?
(Hint: it is the average of the box multiplied by the number
of draws.)
What standard error would we expect? (The standard error
is the give-or-take number the book mentions.) (Hint: it is
the standard deviation of the box multiplied by the square
root of the number of draws.)
The most we could gain from this game would be ________;
the most we could lose is _______.
However, 95% of the time we expect to be within
two standard errors of the expected value. Hence, we should
expect that 95% of the time our gains from the game will be
between $_________ and $__________.
Now that we have figured out what should happen, it is
time to see what does happen. Open up the program SPSS.
Click through the introductory screens until you get a
window that looks like a spreadsheet window. Scroll down
until you get to 100 and put a number in the row. (I know it
only goes to 40. Put your cursor in that row, and then hit
return until you get to 100.)
Go to the Transform menu. Pull down to Random Number Seed
and make sure it is set to Random Seed. Now for the hard
part. We want to generate a column of numbers that simulate
our coin flip. We will do this by generating random real
numbers between 0 and 2. We will then chop off everything
after the decimal, leaving us with zeros and ones. We will
multiply by two, then subtract one, and we will be left with
a column of positive and negative ones.
Pull down Transform to Compute. In the Target Variable
box put in a name (v1 works.) Then in the Numeric Expression
box put in UNIFORM(2). Click the OK box. You should now have
a column of 100 numbers between 0 and 2 complete with
decimals. To get rid of the decimals, pull down Transform
again to Compute and change the Numeric Expression box to
"TRUNC(v1)". (This assumes that your variable is called
v1.) You should now see that all your numbers are without
decimals, but we are still not finished. Pull down Transform
again to Compute and change the Numeric Expression box to
v1*2-1. This, finally, should give you the results you want,
a column that only has -1 or 1 in it.
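The three Compute steps can be traced in Python as a check on the
logic. This is a sketch, not the SPSS procedure itself: 2 *
rng.random() stands in for UNIFORM(2), int() stands in for TRUNC, and
the seed is arbitrary.

```python
import random

rng = random.Random(5)  # arbitrary fixed seed

v1 = [2 * rng.random() for _ in range(100)]  # like UNIFORM(2): reals from 0 up to 2
v1 = [int(x) for x in v1]                    # like TRUNC(v1): zeros and ones
v1 = [x * 2 - 1 for x in v1]                 # like v1*2-1: -1 for tails, +1 for heads

print("values that occur:", sorted(set(v1)))
print("net winnings:", sum(v1))
```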
Now let us see if we got what we expected. Pull down
Analyze to Descriptive Statistics, slide over to
Descriptives. Click the option box and make sure sum is
checked. Then slide v1 over to the variable list and click
OK.
What sum do you get? Is it within the range you expected
it to be in? Explain. (Your sum cannot be an odd number.
Why?)
(You will see that SPSS works with two windows, an output
window and a data window. You have to flip back to the data
window to do some things. It is not the most friendly
design, but you can get used to it.)
Let's do a bunch of columns of coin flips, but to do that
quickly, we will take a short cut.
Pull down Transform again to Compute and change the
Numeric Expression box to trunc(uniform(2))*2-1. Put v2 in
the target variable box. Click OK. This one command does all
of what we did before. Now you can easily create a new
column by simply changing the name of the target variable.
Do so, creating 20 columns.
Look at the descriptive statistics for all the columns.
How many are within one standard error of the expected
value? How many are within two standard errors of the
expected value? Do you have any results that are more than
2.5 standard errors from the expected value?
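The counting asked for here can also be done by program. The Python
sketch below generates 20 columns of 100 flips with the one-line
formula and counts how many sums land within one and two standard
errors of the expected value. The seed and names are arbitrary, and
the counts will differ from those in your SPSS session.

```python
import math
import random
import statistics

rng = random.Random(99)  # arbitrary fixed seed
n_flips, n_columns = 100, 20

box = [-1, 1]
expected = n_flips * statistics.mean(box)
# pstdev divides by the number of tickets, which is the SD of a box.
se = math.sqrt(n_flips) * statistics.pstdev(box)

# Each column mimics trunc(uniform(2))*2-1 applied to 100 rows, then summed.
sums = [sum(int(2 * rng.random()) * 2 - 1 for _ in range(n_flips))
        for _ in range(n_columns)]

within_1 = sum(abs(s - expected) <= 1 * se for s in sums)
within_2 = sum(abs(s - expected) <= 2 * se for s in sums)
beyond_2_5 = sum(abs(s - expected) > 2.5 * se for s in sums)

print("sums:", sums)
print(f"within 1 SE: {within_1}, within 2 SE: {within_2}, beyond 2.5 SE: {beyond_2_5}")
```

Setting n_flips to 400 reruns the comparison for the longer game.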
If you have time, redo the assignment with 400 flips
instead of 100. How do your results compare?
Just Sixes
1. Suppose we roll a die 144 times, and we are interested
in predicting how many sixes we will get in those rolls. The
problem is like drawing from a box with six tickets, five of
them zeros and one of them a one. (Do you see why?) The
average of the box is _____ and the standard deviation of
the box is ____. If we pick 144 times from the box, we
expect to get a sum of about ____ give or take _____. 95% of
the time we should get a sum between ____ and ____.
2. We can simulate drawing from this box of tickets by
creating random numbers between 0 and 6, taking the square
root twice, then truncating them, and finally subtracting
the result from 1. (When we take the square root, things get
closer to 1. If we take it twice, anything between 1 and 6
is squeezed between 1 and 2, while things below 1 will stay
below 1. Truncating them will give us about 1/6 that are
zero and 5/6 that are 1. Subtracting from one will flip
this, giving us about 1/6 as ones and about 5/6 as zeros.)
The formula that works in SPSS is
v1 = 1 -(TRUNC(SQRT(SQRT(UNIFORM(6)))))
(You have to get it exactly right, including all the
parentheses.)
Use this to fill up ten columns of 144 numbers. Then,
using descriptive statistics, see what the sums of these
columns are. Are they close to what you predicted in part
1?
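The double square root trick is easy to test on a few values. The
Python sketch below mirrors the SPSS formula, with 6 * rng.random()
standing in for UNIFORM(6) and int() standing in for TRUNC. The
function name six_indicator and the seed are made up for
illustration.

```python
import math
import random

rng = random.Random(3)  # arbitrary fixed seed

def six_indicator(u):
    """Mirror of 1 - TRUNC(SQRT(SQRT(u))) for u from 0 up to 6."""
    return 1 - int(math.sqrt(math.sqrt(u)))

# Spot checks: numbers below 1 give a one, numbers from 1 up to 6 give a zero.
for u in (0.5, 1.0, 3.0, 5.99):
    print(u, "->", six_indicator(u))

# One column of 144 simulated rolls, with a one marking each six.
column = [six_indicator(6 * rng.random()) for _ in range(144)]
print("number of sixes:", sum(column))
```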
3. Usually in statistics we get data without knowing the
process that generates the data. We work backward and try to
infer what the average of the process is. We can do this by
pulling down Analyze to Compare Means to One-Sample-t-test.
Click over all your new variables and click OK. You have
generated something called confidence intervals. They tell
you, in the form of an interval, what the true mean seems
to be based on your sample. The true mean is .1667 (which is
one-sixth). My first interval has a lower bound of .1230
and an upper bound of .2520. If I did not know what the true
mean was and only had the data to go on, I would be fairly
sure that the true mean was in that interval. In this case,
I would be right, but one time in twenty I will be wrong.
(You are saying that this is not very precise, and you are
right. Bigger samples will give more precision.) How many of
your intervals contain the true mean?
One More
1. Suppose we draw from a box with two tickets, a zero
and a one. What is the average of the box? What is the
standard deviation of the box?
2. If we draw 331 times from the box, we should expect a
sum around 331 times the average of the box. What will that
be?
3. Our sums will vary from trial to trial. The standard
deviation of the sums will be 18.2 times the standard
deviation of the box. What will that be?
4. 95% of the time we should be within two standard
deviations of the expected value you found in 2. What range
should we expect?
5. Let us do it. Go to SPSS and pull down Transform to
Compute. Call the target variable "toss." In the numeric
expression field put TRUNC(RV.UNIFORM(0,2)). Click OK.
(What this does is randomly pick a number between 0 and
2. It will be a decimal like 0.7865... Then it truncates,
which means it chops off all the digits after the decimal
point, so 0.7865... gets turned into a zero. About half the
numbers should be zero and half one.)