Lab: Box Model

Computer Exercises: Box Model

(This exercise was designed to work with SPSS. It may be modified to work with other programs, and it may have to be modified to work with the current version of SPSS. You might be able to do most of this exercise using a random number generator from the Internet (http://www.random.org/integers/ is a good one) and doing the computations with a simple statistics calculator such as http://25yearsofprogramming.com/javascript/descriptivestatistics.htm.)

1. We begin by simulating dice rolls. In terms of the box model, rolling a die 144 times and adding up the number of spots showing is like taking ________ draws from the box with tickets: __________________________________.

The average of this box is: _________________.

The standard deviation of this box is about 1.71. If you have time at the end, come back and see if you can compute this.

If we roll a die 144 times, we expect a sum of _______ give or take about _________. (The number in the first blank is n times the average of the box. The number in the second blank is the standard deviation of the box multiplied by the square root of n.)
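If you want to check a box average and standard deviation by computer rather than by hand, the two rules in the parentheses above can be sketched in Python. This is only an illustration: the function names are made up for this sketch, and the two-ticket box (a 1 and a 3) is a hypothetical box, not the die box from this exercise.

```python
import math

def box_stats(tickets):
    """Return (average, SD) of a box, treating the tickets as the whole population."""
    n = len(tickets)
    avg = sum(tickets) / n
    sd = math.sqrt(sum((t - avg) ** 2 for t in tickets) / n)
    return avg, sd

def sum_of_draws(tickets, draws):
    """Return (expected sum, standard error of the sum) for draws made with replacement."""
    avg, sd = box_stats(tickets)
    return draws * avg, math.sqrt(draws) * sd

# Hypothetical box holding a 1 and a 3, drawn from 25 times.
avg, sd = box_stats([1, 3])
expected, se = sum_of_draws([1, 3], 25)
print(avg, sd, expected, se)  # 2.0 1.0 50.0 5.0
```

Replacing the list of tickets with the tickets of your own box gives the numbers for that box.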

2. To roll the die 144 times, scroll down in SPSS to 144 and put a number in the row. Then, pull down the Transform menu to Random Number Generators. Set the starting point to Random. Click OK.

Second, pull down the Transform menu to Compute. You will get a complex dialog box. In the Target Variable box, type in a name (rnd1 is a good name). Then in the scrolling field called Function group, scroll down to Random Numbers and highlight it. In the scrolling field below that, scroll down to Rv.Uniform and click it up to the top. You will get two question marks. You want the first to be 1 and the second to be 7. Hit the OK button. You should now have a column of random numbers.

All the random numbers should be between what two values?______ and ________

3. Let us create another column in the same way and call it rnd2. We are not done yet, because our numbers are real numbers, and dice rolls are integers. To make our real numbers into integers, we will truncate them (which means we will chop off everything after the decimal point). Go back to the Compute command (under Transform), and empty the fields you so carefully completed in the step above. Now scroll the first field to Arithmetic, choose Trunc in the second field, and click it up. (Or you can just type TRUNC(rnd1) in the Numeric Expression field.) Let us call this new variable die1 and put that name on the left. Then replace ? with rnd1, so you have the equation die1=TRUNC(rnd1). Hit OK. You should see that there is a new column created with just digits from 1 to 6.

Repeat the step above to create a new variable die2 that will be the truncated version of rnd2.

(Your new variables die1 and die2 should have numbers between 1 and 6.)

Finally, let us add die1 and die2. Go back to Compute, name the new variable SUM, and on the right put die1+die2.

Now we are ready to have some fun. If we add up either column die1 or die2, we should get about ______ give or take about ____________. If we find the average for either column die1 or die2, we should get about ____ give or take about _______.
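If you do not have SPSS at hand, the truncation trick in step 3 can be imitated in Python. This is a rough sketch, not the SPSS procedure itself: it assumes Python's uniform generator is an acceptable stand-in for RV.UNIFORM, and the fixed seed and the function name roll_column are choices made only for this illustration.

```python
import math
import random

rng = random.Random(2024)  # fixed seed so the run is repeatable; in SPSS we used a random start

def roll_column(n, rng):
    """Imitate TRUNC(RV.UNIFORM(1, 7)): a real number from 1 up to 7 with its decimals chopped off."""
    return [math.trunc(rng.uniform(1, 7)) for _ in range(n)]

die1 = roll_column(144, rng)
die2 = roll_column(144, rng)
total = [a + b for a, b in zip(die1, die2)]  # plays the role of the SUM variable

print(min(die1), max(die1))  # smallest and largest roll in the first column
print(sum(die1), sum(die2))  # the two column sums
```

Changing 144 to 900 gives the longer run asked for later in the exercise.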

4. We can add the columns using the Descriptives command, found under Descriptive Statistics in the Analyze menu. Click the Options button and make sure Sum is checked (otherwise you will not get a sum). Let us add both the die1 and die2 columns.

On the first die you got a sum of ________. This is a total of an expected value of _______ plus a chance error of __________. If we compute a z-score for this sum, it will be the (sum minus the expected value) divided by the standard error of the sum. The z-score on the first die is _______. About _____% of the class should get z-scores between -1 and +1. About _______% of the class should get z-scores between -2 and 2.
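The z-score arithmetic described above is short enough to write as a small Python function. The numbers in the example are made up for illustration and are not results from the die exercise.

```python
def z_score(observed_sum, expected_sum, se_sum):
    """Chance error (observed minus expected) measured in standard errors."""
    chance_error = observed_sum - expected_sum
    return chance_error / se_sum

# Made-up numbers: a sum of 212 when 200 was expected, give or take 8.
print(z_score(observed_sum=212, expected_sum=200, se_sum=8))  # 1.5
```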

5. Repeat for the second die. On the second die you got a sum of ________. This is a total of an expected value of _______ plus a chance error of _______. The z-score for the second die is _______.

6. If we do a histogram of die1 or die2, what should it look like? How close are the actual histograms to what you expect?

7. If we do a bar chart of the sum, we should get a pyramidal shape similar to what we discussed in class. Do it. How close is it to the expected shape?

8. Repeat steps 1 to 7 using 900 rolls instead of 144.

Expected Sums:

Standard error of sums:

Actual Sums:

Chance Errors:

Z-scores of sums:

9. Does the bar chart of the sum of two rolls look more like the probability histogram with 900 rolls or with 144?

10. As a final problem, let us simulate a box model that is a little different. Generate 900 random numbers between 0 and 1.25. (That will be RV.UNIFORM(0, 1.25).) Then generate a column by truncating these numbers. You should get a column of zeros and ones. This column simulates drawing from a box with replacement of ______ tickets with a value of zero and _______ tickets with a value of one.

11. The expected value of this box is______. The Standard deviation of this box is ______ (You can use the nifty short-cut method described earlier to get the standard deviation. The calculation is so simple you should not need a calculator!)
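As a reminder of the short-cut, here is a sketch in Python that compares it with the long way of computing a standard deviation. The box used (one ticket marked 1 and nine marked 0) is a hypothetical example, not the box from step 10, and the function names are invented for this sketch.

```python
import math

def shortcut_sd(big, small, frac_big):
    """Short-cut for a box with only two kinds of tickets:
    (big - small) * sqrt(fraction with big * fraction with small)."""
    return (big - small) * math.sqrt(frac_big * (1 - frac_big))

def full_sd(tickets):
    """The long way: root-mean-square of the deviations from the average."""
    avg = sum(tickets) / len(tickets)
    return math.sqrt(sum((t - avg) ** 2 for t in tickets) / len(tickets))

box = [1] + [0] * 9  # hypothetical box: one 1 and nine 0s
print(round(shortcut_sd(1, 0, 0.1), 4), round(full_sd(box), 4))  # the two answers agree
```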

12. Use the frequency command (Under Analyze-->Descriptive Statistics) to find the percentage of zeros and ones in the column. Is it close to what you expect? Explain.

13. The expected sum of the column is ______ give or take about _________.

14. The actual sum of the column is ______. This is a total of an expected value of _____ plus a chance error of _______.

15. (A peek ahead) Suppose you did not know the box from which these numbers were drawn, but were interested in knowing what the average and standard deviation of the box were. You could estimate them by looking at the average and standard deviation of your column of 900 numbers. The sum of the column could be between ____ and ____, which means the average could be between ______ and ______. The average of my sample is in fact _____, which is _______ away from the true average. Would your estimate be pretty good or pretty bad?

16. Suppose you wanted to simulate a box model with two tickets with a two and one ticket with a zero. How could you make SPSS do that?

We have been talking about taking repeated draws from a box with replacement. Today we are going to do exercises similar to those in the book, but with sample data. We will use the ability of Excel to generate random numbers.

The average of this box is: _________________.

The standard deviation of this box is about 1.71. If you have time at the end, come back and see if you can compute this.

2. To simulate rolling a die in Excel, use the formula =CEILING(6*RAND(),1). If you enter this 144 times (it helps to cut and paste) you will have simulated 144 rolls of dice. Check your results. Are all the numbers between 1 and 6 inclusive?
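The rounding-up idea behind the Excel formula can also be tried in Python. This is a sketch under one assumption worth stating: the random number is taken from the interval above 0 up to and including 1, so that rounding up can never produce a 0. The seed and the function name ceiling_roll are arbitrary choices for this illustration.

```python
import math
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

def ceiling_roll(rng):
    """Multiply a random number by 6 and round up to the next whole number."""
    u = 1.0 - rng.random()  # greater than 0 and at most 1, so the roll is never 0
    return math.ceil(6 * u)

rolls = [ceiling_roll(rng) for _ in range(144)]
print(min(rolls), max(rolls))  # smallest and largest roll
```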

I leave it to you to get the sum and standard deviation of these 144 rolls. Explain how you did it.

On your 144 rolls you got a sum of ________. This is a total of an expected value of _______ plus a chance error of __________. If we compute a z-score for this sum, it will be the (sum minus the expected value) divided by the standard error of the sum. The z-score on the sum is _______. About _____% of the class should get z-scores between -1 and +1. About _______% of the class should get z-scores between -2 and 2.

3. If we do a histogram of our dice rolls, what should it look like? How close are the actual histograms to what you expect?

Repeat using 900 rolls instead of 144.

Expected Sums:

Standard error of sums:

Actual Sums:

Chance Errors:

Z-scores of sums:

A Mystery Box

1. Let us simulate a box model that is a little different. Generate 900 random numbers between 0 and 1.25. (That will be RV.UNIFORM(0, 1.25).) Then generate a column by truncating these numbers. You should get a column of zeros and ones. This column simulates drawing from a box with replacement of ______ tickets with a value of zero and _______ tickets with a value of one.

2. Use the frequency command (Under Analyze-->Descriptive Statistics) to find the percentage of zeros and ones in the column. Is it close to what you expect? Explain.

3. The expected sum of the column is ______ give or take about _________.

4. The actual sum of the column is ______. This is a total of an expected value of _____ plus a chance error of _______.

Marbles in a Box

1. Suppose we have a box that has a trillion marbles. One third of them are red. If we draw out 200 marbles, how many red marbles should we expect to get? ________

In terms of the box model in the book, suppose we count drawing a red marble as a one and any other marble as a zero. We can view this situation as drawing with replacement from a box with these three tickets: _____________

2. The standard deviation of the box in this case is about .471. (How did I get that? ______________) If we compute the standard error of the count (or sum), it is 6.67. (How was this computed?) If we compute the standard error of the percent, it is about 3.33%. Hence, we would expect that 95% of the time the sum should be in the range __________ to ___________.

3. We expect that 95% of the time, the percent should be in the range _________ to ___________.
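Once you have worked out where the 6.67 and 3.33% come from, you can check the arithmetic with a few lines of Python. The sketch starts from the rounded box standard deviation of .471 quoted in part 2, so the results differ slightly in the last digit from values computed with the unrounded standard deviation. The function names are invented for this illustration.

```python
import math

def se_of_count(box_sd, draws):
    """Standard error of the sum (or count): square root of the number of draws times the box SD."""
    return math.sqrt(draws) * box_sd

def se_of_percent(box_sd, draws):
    """Standard error of the percent: the SE of the count as a percentage of the number of draws."""
    return se_of_count(box_sd, draws) / draws * 100

print(se_of_count(0.471, 200))
print(se_of_percent(0.471, 200))
```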

4. Now it is time to let the computer simulate drawing 200 marbles from this box. There are a couple ways of doing this, but let us use the easiest one. Go to Compute under the Transform menu. You need the equation v1=TRUNC(RV.UNIFORM(0,1.5)). Why does this equation generate a series similar to what we would get drawing red marbles out of the box described above?

5. Once you are doing it correctly, it is easy to add columns to see what happens if we draw many different samples of 200 from the box. All you have to do is change the v1 to v2, v3, etc. Construct 20 columns of these numbers. Then look at the descriptive statistics. Pull down Analyze-->Descriptive Statistics-->Descriptives. Click Options and ask for sums. Click all the v1 to v20 variables over and then click OK. Are your sums close to what you expected to get up in part 1? What is the biggest chance error?
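The twenty-column experiment in part 5 can be imitated in Python as a rough cross-check. This sketch assumes Python's uniform generator behaves like RV.UNIFORM for our purposes; the seed and the names draw_column, columns, and worst are choices made for this illustration only.

```python
import math
import random

rng = random.Random(11)  # fixed seed so the run is repeatable

def draw_column(n, rng):
    """Imitate TRUNC(RV.UNIFORM(0, 1.5)) for n rows: a 1 when the random number is 1 or more."""
    return [math.trunc(rng.uniform(0, 1.5)) for _ in range(n)]

# Twenty samples of 200 draws each, named v1 to v20 as in SPSS.
columns = {f"v{i}": draw_column(200, rng) for i in range(1, 21)}
sums = {name: sum(col) for name, col in columns.items()}

expected = 200 / 3  # one third of 200 draws
errors = {name: s - expected for name, s in sums.items()}
worst = max(errors, key=lambda name: abs(errors[name]))

print(sums)
print(worst, round(errors[worst], 2))  # the column with the biggest chance error
```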

6. Are your standard deviations close to .471? What is the biggest error?

7. We know what was in the box from which these samples were drawn. But suppose we did not. How close could we estimate the average of the box? (Note: Statistics more often deals with sample averages than sample sums, but we can easily jump from one to the other.) We can see by creating confidence intervals. To do this, pull down Analyze-->Compare Means-->One Sample T test. Click over the v1 to v20 variables. Click Options and you should see that the confidence interval is set to 95%. Put 0 in the test value box. And then click OK.

In the output you get, how close are the standard errors of the mean to the value that was given in part 2 above? Why are they not all the same?
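To see roughly what SPSS is doing when it builds these intervals, here is a Python sketch. It is an approximation and not the SPSS calculation: it uses the normal curve instead of the t distribution, so its intervals come out slightly narrower than the ones SPSS reports. The function name confidence_interval and the seed are invented for this illustration.

```python
import math
import random
import statistics

def confidence_interval(sample, level=0.95):
    """Approximate interval for the box average: sample mean plus or minus z standard errors."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se_mean = statistics.stdev(sample) / math.sqrt(n)  # sample SD (n - 1 divisor) over root n
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return mean - z * se_mean, mean + z * se_mean, se_mean

rng = random.Random(5)
sample = [math.trunc(rng.uniform(0, 1.5)) for _ in range(200)]  # one column of 200 draws

low, high, se_mean = confidence_interval(sample)
print(round(low, 4), round(high, 4), round(se_mean, 4))
```

Because the standard error is computed from each sample's own standard deviation, it changes a little from column to column.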

8. (Are you thinking about what you are doing? Or are you just mechanically answering questions trying to get out of here?) About 1 in 20 confidence intervals that we have created should not contain the true mean of .3333. (Why one in twenty? ______________________) Do you have any? If so, what is it? If not, what is the one that is closest to not containing .3333?

9. We can also create 99% confidence intervals. Redo the steps in part 7, but when you click Options, put in 99 instead of the 95 that is there. What happens to your confidence intervals?

10. The primary purpose of the One Sample T-test is not to construct confidence intervals, but to test claims. I claim that the true mean is .333. Let us test that. Go back and redo part 7, but put .333 in the test box. The t-values you get in the output are like z-scores. A t-score of 2 would mean that the average of the sample is 2 standard errors away from .333. Do you have any over 2? If you do, the confidence interval that you constructed above probably does not have .3333 in it. Does it? (Or if all your confidence intervals had .333 in them, the t-values you get should all be less than 2. Are they?) We will explain this more later.

11. Suppose we did not know anything about the box but believed that 50% of the marbles in it were red. Redo part 10 above, but with .5 as the test value. You should get t-values that are negative and some quite a lot less than -2. What is that telling you about the probability of getting this sample if the true mean of the box was .5?
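The t-value in the SPSS output can be sketched in Python as the distance between the sample average and the test value, measured in standard errors of the mean. The sample below (60 ones and 140 zeros) is a made-up example, not output from the exercise, and the function name t_value is invented for this sketch.

```python
import math
import statistics

def t_value(sample, test_value):
    """(sample mean minus test value) divided by the standard error of the mean."""
    se_mean = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.fmean(sample) - test_value) / se_mean

sample = [1] * 60 + [0] * 140  # made-up sample of 200 draws with 60 red marbles

print(round(t_value(sample, 0.3), 2))  # test value equal to the sample mean
print(round(t_value(sample, 0.5), 2))  # test value well above the sample mean
```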

Coin Flips

The textbook gives lots of examples of taking repeated draws from a box with replacement. Today we are going to do exercises similar to those in the book, but with sample data that we will generate using a program called SPSS.

Let us begin with simulating flipping a coin, with the wager that you will win a dollar if heads comes up and lose a dollar if tails comes up.

This is equivalent to taking draws from a box with two tickets. What amounts do we want on the two tickets?

What is the average of the box?

What is the standard deviation of the box?

If we draw 100 tickets from this box (play our coin-flipping game 100 times), what would we expect to win? (Hint: it is the average of the box multiplied by the number of draws.)

What standard error would we expect? (The standard error is the give-or-take number the book mentions.) (Hint: it is the standard deviation of the box multiplied by the square root of the number of draws.)

The most we could gain from this game would be ________; the most we could lose is _______.

However, 95% of the time we expect to be within two standard errors of the expected value. Hence, we should expect that 95% of the time our gains from the game will be between $_________ and $__________.

Now that we have figured out what should happen, it is time to see what does happen. Open up the program SPSS. Click through the introductory screens until you get a window that looks like a spreadsheet window. Scroll down until you get to 100 and put a number in the row. (I know it only goes to 40. Put your cursor in that row, and then hit return until you get to 100.)

Go to the Transform menu. Pull down to Random Number Seed and make sure it is set to Random Seed. Now for the hard part. We want to generate a column of numbers that simulate our coin flip. We will do this by generating random real numbers between 0 and 2. We will then chop off everything after the decimal, leaving us with zeros and ones. We will multiply by two, then subtract one, and we will be left with a column of positive and negative ones.

Pull down Transform to Compute. In the Target Variable box put in a name (v1 works). Then in the Numeric Expression box put in UNIFORM(2). Click the OK box. You should now have a column of 100 numbers between 0 and 2 complete with decimals. To get rid of the decimals, pull down Transform again to Compute and change the Numeric Expression box to "TRUNC(v1)". (This assumes that your variable is called v1.) You should now see that all your numbers are without decimals, but we are still not finished. Pull down Transform again to Compute and change the Numeric Expression box to v1*2-1. This, finally, should give you the results you want, a column that only has -1 or 1 in it.

Now let us see if we got what we expected. Pull down Analyze to Descriptive Statistics, slide over to Descriptives. Click the option box and make sure sum is checked. Then slide v1 over to the variable list and click OK.

What sum do you get? Is it within the range you expected it to be in? Explain. (Your sum cannot be an odd number. Why?)

(You will see that SPSS works with two windows, an output window and a data window. You have to flip back to the data window to do some things. It is not the most friendly design, but you can get used to it.)

Let's do a bunch of columns of coin flips, but to do that quickly, we will take a short cut.

Pull down Transform again to Compute and change the Numeric Expression box to trunc(uniform(2))*2-1. Put v2 in the target variable box. Click OK. This one command does all of what we did before. Now you can easily create a new column by simply changing the name of the target variable. Do so, creating 20 columns.
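The same one-line short cut can be tried in Python if you want to see many columns at once. This is a sketch, assuming Python's uniform generator is a fair stand-in for UNIFORM(2); the seed and the names flip_column and winnings are choices made for this illustration.

```python
import math
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

def flip_column(n, rng):
    """Imitate TRUNC(UNIFORM(2))*2-1: each row is -1 (tails) or +1 (heads)."""
    return [math.trunc(rng.uniform(0, 2)) * 2 - 1 for _ in range(n)]

# Net winnings from 20 separate games of 100 flips each.
winnings = [sum(flip_column(100, rng)) for _ in range(20)]
print(winnings)
```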

Look at the descriptive statistics for all the columns. How many are within one standard error of the expected value? How many are within two standard errors of the expected value? Do you have any results that are more than 2.5 standard errors from the expected value?

If you have time, redo the assignment with 400 flips instead of 100. How do your results compare?

Just Sixes

1. Suppose we roll a die 144 times, and we are interested in predicting how many sixes we will get in those rolls. The problem is like drawing from a box with six tickets, five of them zeros and one of them a one. (Do you see why?) The average of the box is _____ and the standard deviation of the box is ____. If we pick 144 times from the box, we expect to get a sum of about ____ give or take _____. 95% of the time we should get a sum between ____ and ____.

2. We can simulate drawing from this box of tickets by creating random numbers between 0 and 6, taking the square root twice, then truncating them, and finally subtracting the result from 1. (When we take the square root, things get closer to 1. If we take it twice, anything between 1 and 6 is squeezed between 1 and 2, while things below 1 will stay below 1. Truncating them will give us about 1/6 that are zero and 5/6 that are 1. Subtracting from one will flip this, giving us about 1/6 as ones and about 5/6 as zeros.) The formula that works in SPSS is

v1 = 1 -(TRUNC(SQRT(SQRT(UNIFORM(6)))))

(You have to get it exactly right, including all the parentheses.)
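If the double square root seems mysterious, it may help to try the formula on a few hand-picked numbers. The Python sketch below does that and then fills one column of 144. It assumes Python's uniform generator can stand in for UNIFORM(6); the function name six_indicator and the seed are invented for this illustration.

```python
import math
import random

def six_indicator(u):
    """Imitate 1 - TRUNC(SQRT(SQRT(u))): gives 1 when u is below 1 and 0 when u is from 1 up to 6."""
    return 1 - math.trunc(math.sqrt(math.sqrt(u)))

# Three hand-picked inputs: one below 1, one exactly 1, one near 6.
print(six_indicator(0.5), six_indicator(1.0), six_indicator(5.9))  # 1 0 0

rng = random.Random(6)  # fixed seed so the run is repeatable
column = [six_indicator(rng.uniform(0, 6)) for _ in range(144)]
print(sum(column))  # number of "sixes" in this column
```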

Use this to fill up ten columns of 144 numbers. Then, using descriptive statistics, see what the sums of these columns are. Are they close to what you predicted in part 1?

3. Usually in statistics we get data without knowing the process that generates the data. We work backward and try to infer what the average of the process is. We can do this by pulling down Analyze to Compare Means to One-Sample T Test. Click over all your new variables and click OK. You have generated something called confidence intervals. They tell you, in the form of an interval, what the true mean seems to be based on your sample. The true mean is .1667 (which is one-sixth). My first interval has a lower bound of .1230 and an upper bound of .2520. If I did not know what the true mean was and only had the data to go on, I would be fairly sure that the true mean was in that interval. In this case, I would be right, but one time in twenty I will be wrong. (You are saying that this is not very precise, and you are right. Bigger samples will give more precision.) How many of your intervals contain the true mean?

One More

1. Suppose we draw from a box with two tickets, a zero and a one. What is the average of the box? What is the standard deviation of the box?

2. If we draw 331 times from the box, we should expect a sum around 331 times the average of the box. What will that be?

3. Our sums will vary from trial to trial. The standard deviation of the sums will be 18.2 times the standard deviation of the box. What will that be?

4. 95% of the time we should be within two standard deviations of the expected value you found in 2. What range should we expect?

5. Let us do it. Go to SPSS and pull down Transform to Compute. Call the target variable "toss." In the numeric expression field put trunc(RV.UNIFORM(0,2)) Click OK.

(What this does is randomly find a number between 0 and 2. It will be a decimal like 0.7865... Then it truncates, which means it chops off all the numbers after the decimal, which means it gets turned into a zero. About half the numbers should be zero and half one.)
