Whut ur all dem zee scores?
Posted on September 14th, 2011
In statistics, getting numbers to say what they mean is a very difficult task. A number with no context is next to meaningless. If I told you I got a 30 on my statistics exam, you’d probably assume that I didn’t do so well. The context is that the score was a 30 out of 30 (go me!). Alternatively, if I told you I got a 30 percent, then you know I did badly because the word “percent” sets the expectation of a total maximum of 100. Context helps us identify what the heck the numbers mean.
A z-score is a statistical maneuver that helps us learn just how typical or unusual a number really is. We do that by calculating where the value lies in relation to all of the other values in the data set.
The standard deviation tells us how widely spread the numbers are from the average value, or the mean. (We’re assuming you know how to calculate the standard deviation from previous lessons, but really, they just kind of give it to you from here on out.) In a normal distribution, a majority of all the values, about 68%, will be within 1 standard deviation of the mean. Within 2 standard deviations, that jumps to about 95%. So if the average test score is a 75, and the standard deviation is 10, that score of 80 you got is nothing special: 80 is within 1 standard deviation of 75. However, if you got a 100, you are now outside 2 standard deviations. Only about 5% of your classmates were that far away from the mean, counting both the high and the low end! (Because the other 95% are within 2 standard deviations.)
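If you want to double-check the 68% and 95% figures without a table, here is a minimal Python sketch. It assumes Python 3.8 or later for the standard library's `statistics.NormalDist`. The function name `proportion_within` is made up for this illustration.

```python
from statistics import NormalDist


def proportion_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


print(round(proportion_within(1), 4))  # 0.6827
print(round(proportion_within(2), 4))  # 0.9545
```

So "68%" and "95%" are handy roundings of the exact proportions.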
STOP! SURPRISE! I totally just tricked you into doing z-scores. All we wanted to know was how many standard deviations away we were from the mean. We took the difference between your score and the mean (80 – 75 = 5, 100 – 75 = 25), then divided it by the standard deviation to see how many standard deviations fit into that difference (5 ÷ 10 = 0.5, 25 ÷ 10 = 2.5). We know from above that if the score is within 1 standard deviation, you’re in the majority. Between 1 and 2, you’re in a smaller group, and so on. That final number is the z-score!
So, the formula makes sense: Subtract the mean from the value, then divide by the standard deviation:
$$z = \frac{\text{X score} - \text{Mean}}{\text{Standard Deviation}}$$
Or, in statistic-ese:
$$z = \frac{x - \mu}{\sigma} \qquad \text{or} \qquad z = \frac{x - M}{s}$$
Where x is the value, μ (Greek mu) and M are the population and sample mean respectively, and σ (Greek sigma) and s are the population and sample standard deviation respectively. You also may see $\bar{x}$ (pronounced “x bar”) instead of M for the sample mean; it’s the same thing.
So, if we use the formula to find the z-scores of the above examples, we simply plug the numbers in and solve:
$$z = \frac{80 - 75}{10} = \frac{5}{10} = 0.5$$
$$z = \frac{100 - 75}{10} = \frac{25}{10} = 2.5$$
Those are z-scores. With these, we know that the test score that has a z-score of .5 is very common (within one standard deviation), and the one with a z-score of 2.5 is very rare (more than 2 standard deviations away). BAM!
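The same plug-and-solve can be written as a few lines of Python. This is just a sketch of the formula, and the function name `z_score` is made up for this illustration.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """How many standard deviations x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


print(z_score(80, 75, 10))   # 0.5
print(z_score(100, 75, 10))  # 2.5
```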
Finding x when you already have z
One curveball the stat teacher threw at us was finding X by giving us the z. In other words, instead of seeing the test score and finding out how many standard deviations away it is, she told us how many standard deviations away it is (the z-score) and asked us to find X. She gave us the following formula:
$$X = M + zs$$
While it is easy enough to memorize a second formula and plug in the given mean, standard deviation, and z-scores, I heard some confusion about this. So, let me quell the fears by showing you something neat. First, subtract M from both sides:
$$X - M = zs$$
Then, divide by s:
$$\frac{X - M}{s} = z$$
Look familiar? That’s because it is exactly the same formula! It was just solved for X, since that was what we were looking for. Therefore, those of you who have formula-memorization anxiety, never fear, because if you just remember the one formula, you can easily solve for any variable within it.
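To see that the two formulas really are the same thing run in opposite directions, here is a small Python sketch. The function name `x_from_z` is hypothetical, and the numbers are the test-score example from earlier (mean 75, standard deviation 10).

```python
def x_from_z(z: float, mean: float, sd: float) -> float:
    """Recover the raw value from a z-score: X = M + z * s."""
    return mean + z * sd


print(x_from_z(0.5, 75, 10))  # 80.0
print(x_from_z(2.5, 75, 10))  # 100.0

# Round trip: turning the recovered value back into a z-score returns the z we started with.
print((x_from_z(0.5, 75, 10) - 75) / 10)  # 0.5
```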
Proportions, or, the Area Under the Curve
So, now you have a z-score. Yay? Fortunately, there is much more we can learn from a z-score than just how rare something is. Let’s take a look at our previous examples, two test scores of 80 and 100, where the mean is 75 and the standard deviation is 10. Let’s also assume that the data is normally distributed—cause you can’t do much of anything unless it is.
So, a normal curve:
Then we plot on it the z-score we found for the first test, an 80, which was 0.5:
Now, with this, we can find out what proportion of our group here happened to get a score higher than you. In essence, we want to know, what are the odds someone has a higher z-score than you? Let’s fill in the area we are looking for.
What we need to do is find the proportion that corresponds to the red area in the graph (also known as the “area under the curve”). To do this, you will use a Z distribution table. A Z distribution table lists a lot of z-scores and their corresponding proportions. Each one is a little different, but I will explain the one we use for this class. Ours has four columns:
A) the list of z-scores
B) Proportion in Body
C) Proportion in Tail
D) Proportion Between Mean and Z
It can be kind of confusing to choose between them. Fortunately, the book gives us three specific points to keep in mind:
1) The BODY always corresponds to the LARGER part of the distribution WHETHER it is on the right side or the left side. Similarly, the TAIL is always the SMALLER section.
2) Because the normal distribution is symmetrical, the proportions on the right-hand side are exactly the same as the corresponding proportions on the left-hand side. For example, the proportion in the right-hand-tail beyond Z = 1 is the same as the proportion in the left-hand tail beyond Z = –1.
3) Although the z-score values will change signs (+ and –) from one side to the other, the proportions are ALWAYS positive. Thus, column C in the table always lists the proportion in the tail whether it is the right-hand tail or the left-hand tail.
There. So, since the red area we want to figure out is the SMALLER part of the graph, we will call it the tail. Therefore, we will look up 0.5 as the z-score in the table and read across to the column with the Tail proportions, column C. The Proportion in Tail listed for a z-score of 0.5 is .3085. This means, when the values are normally distributed like this, there is a 30.85% chance of finding someone who has a score higher than 80, or a z-score higher than 0.5.
Let’s do the same thing, but for the score of 100. Again, with a mean of 75 and a standard deviation of 10, we find the z-score to be 2.5. Let’s draw it and fill it in:
Yipes! That’s a really tiny part of the curve! Fortunately, that makes sense, because we know that values occurring more than 2 standard deviations away from the mean are really rare and shouldn’t happen all that often anyways. So again, we are going to refer in the table to the z-score value of 2.5 and find the corresponding Tail proportion. We find the proportion to be .0062. This means that there is a very very very tiny chance of finding anyone with a higher score! (A 0.62% chance, specifically.)
In review: finding proportions is simple. Using the z-score as your dividing point, you can find the proportions of the curve on EITHER side (the larger is the body, the smaller is the tail) using the z distribution chart, referring to the appropriate column.
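If you don't have the table handy, the body and tail columns can be approximated in Python. This is a sketch that assumes Python 3.8 or later for `statistics.NormalDist`. The function names `tail_proportion` and `body_proportion` are made up to mirror the table's columns.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def body_proportion(z: float) -> float:
    """Larger piece of the curve split at z (like the Proportion in Body column)."""
    return STANDARD_NORMAL.cdf(abs(z))


def tail_proportion(z: float) -> float:
    """Smaller piece of the curve split at z (like the Proportion in Tail column)."""
    return 1 - STANDARD_NORMAL.cdf(abs(z))


print(round(tail_proportion(0.5), 4))  # 0.3085
print(round(tail_proportion(2.5), 4))  # 0.0062
print(round(body_proportion(0.5), 4))  # 0.6915
```

Using `abs(z)` reflects the symmetry point above: the tail beyond z = –1 is the same size as the tail beyond z = 1.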
And now, for something tricky
Using the same situation as above, let’s find out the proportion of students who got a B on the test, defined as between 80 and 90. Instead of just one z-score as the starting point and then continuing off to one side, we need two z-scores, one each for the starting and stopping point. Let’s calculate them now for 80 and 90. Remember, 75 is the mean, and 10 is the standard deviation:
$$z = \frac{80 - 75}{10} = \frac{5}{10} = 0.5$$
$$z = \frac{90 - 75}{10} = \frac{15}{10} = 1.5$$
Well, we did kind of know the z-score for 80 already, but whatever. The z-scores of 80 and 90 are .5 and 1.5 respectively. Let’s plot them on our curve.
Unfortunately, we don’t have a “From z=.5 to z=1.5” column in our book to tell us what the proportion is here. We have to get creative. There are a number of different ways to solve this problem. Mainly we do it by breaking up the pieces to get just the parts we want. How you break it up is up to you. I’ll give you two examples.
First: One total we can get is the area from the mean (which is z = 0) to the z-score of 1.5. We do this by finding 1.5 in the z-score table and using column D, Proportion Between Mean and Z, which gives us .4332. That covers the original area we want (in red) plus the part shaded in black below.
However, we DON’T WANT the black part. So we’re going to subtract just the black part from the .4332. What is the proportion of just the black part? Well, it starts at the mean (zero) and goes to 0.5. So, I’m going to look up the z-score of 0.5 and look up the Proportion Between Mean and Z. According to the book, that is .1915. Now we subtract:
Proportion of 0 to 1.5 – Proportion of 0 to 0.5 = Proportion of 0.5 to 1.5
.4332 – .1915 = .2417
Proportion of 0.5 to 1.5 = .2417
Second: The other value we can get is the value from 0.5 to the end, and then subtract out the part we don’t want:
If we look up the z-score 0.5 and the corresponding Proportion in Tail, we get .3085. That is both the red and black parts. Now, we look up the z-score 1.5 to get THAT Proportion in Tail, which is .0668. Now subtract:
Proportion of Tail starting at 0.5 – Proportion of tail starting at 1.5 = Proportion of 0.5 to 1.5
.3085 – .0668 = .2417
Proportion of 0.5 to 1.5 = .2417
Using either method, we arrive at the answer: A proportion of students equal to .2417 got B’s on this test.
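Both subtraction tricks boil down to "area up to the higher z minus area up to the lower z". Here is a minimal Python sketch of that idea, assuming Python 3.8 or later for `statistics.NormalDist`. The function name `proportion_between` is hypothetical.

```python
from statistics import NormalDist


def proportion_between(z_low: float, z_high: float) -> float:
    """Proportion of a normal distribution lying between two z-scores."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)


# Students scoring between 80 (z = 0.5) and 90 (z = 1.5)
print(round(proportion_between(0.5, 1.5), 4))  # 0.2417
```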
Distribution of Sample Means and the Standard Error
This one was slightly more confusing, but we’ll try our best to explain it. So far, we’ve been dealing with a single data set: either a single sample or the one whole population. The sample mean is just that, the mean of the items in that ONE SAMPLE. The basic idea behind statistics is that if you experiment with enough samples out of a population, you should start getting overall scores that match the population as a whole. The sample mean for each experiment will likely be different, but get enough sample means, and you should get an idea of what the population mean is. We should be able to say that certain scores are expected as we do more and more experiments.
The question is CONFIDENCE. Confidence tells you how often a statement will be true. When given a problem, the wording will usually look like “Given the average SAT score of 500, what is the range of SAT scores that we can say with 80% confidence will contain the sample mean for any given sample?” Alternatively, it might say, “Given the average SAT score of 500, what is the range of SAT scores that will show up as sample means at least 80% of the time?” In this case, instead of comparing a specific value to the mean to find the z-score, we are comparing the sample mean itself to the whole population mean. So, instead of the deviation of actual values compared to the mean, we have to find how far a sample mean typically strays from the population mean. This is called the standard error.
We use the standard error instead of the standard deviation when we are comparing the sample mean to the population mean. Otherwise, the z-score formula is nearly the same. We find the difference between the sample mean (M) and the population mean (µ), and then divide by the standard error (written σ_M here) instead of the standard deviation:
$$z = \frac{M - \mu}{\sigma_M}$$
Let’s do that example and see where we get: “Given the average SAT score of 500 (µ), what is the range of SAT scores that will show up as sample means at least 80% of the time? The sample size is 25 (n = 25) and the standard deviation is 100 (σ = 100).”
First, let’s set up the normal curve:
We are looking for the values that are common enough that they occur at least 80% of the time when you take a sample from this population. We know that values closest to the mean are the most common, going in both directions from there. Therefore, we want to choose the 80% of values that are closest to the mean. We do this by choosing the 40% closest values to the mean on either side:
This area contains all of the SAT sample means that will occur 80% of the time. The next step is to find the z-scores associated with these areas. Since we are going from the mean outwards in each direction, we will use the Z distribution table and take a look at the Proportion Between Mean and Z column. We want to find the value that is closest to 40%, or .4000. As we look down that column, we find that .4000 falls between two values: .3997 and .4015. We will choose the one closest to .4000, which is .3997. The z-score associated with that value is 1.28.
Now, this 1.28 is only for one side, from the mean out to 40%. Since we know that the proportions are the same on both sides, that means the other 40% goes out to –1.28. This gives us the boundaries of our total 80 percent: from –1.28 to +1.28.
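As a quick check on that table lookup, the same z-score can be found in Python by working backwards from the proportion. This is a sketch assuming Python 3.8 or later for `statistics.NormalDist`. The function name `z_for_central_proportion` is made up for this illustration.

```python
from statistics import NormalDist


def z_for_central_proportion(p: float) -> float:
    """Positive z such that the middle proportion p of a normal curve lies between -z and +z."""
    if not 0 < p < 1:
        raise ValueError("p must be between 0 and 1")
    # The middle p leaves (1 - p) / 2 in each tail, so the area below +z is 0.5 + p / 2.
    return NormalDist().inv_cdf(0.5 + p / 2)


print(round(z_for_central_proportion(0.80), 2))  # 1.28
```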
But, we’re not quite done yet. We still need to figure out what the SAT scores are that go with each z-score. Starting with the z formula that we have for this, where σ_M stands for the standard error:
$$z = \frac{M - \mu}{\sigma_M}$$
If we try to fill in what we know (the average SAT score of 500, so µ = 500; the z-scores of –1.28 and 1.28; and the standard deviation of 100, so σ = 100), we see that we do NOT have the standard error. We better go find that first:
$$\sigma_M = \frac{\sigma}{\sqrt{n}}$$
Now, let’s plug in that σ = 100 and n = 25:
$$\sigma_M = \frac{100}{\sqrt{25}} = \frac{100}{5} = 20$$
NOW we can plug in the z-score (starting with 1.28), the average SAT score (µ = 500), and the standard error (20) into the z formula and solve:
$$1.28 = \frac{M - 500}{20}$$
Multiply both sides by 20:
$$25.6 = M - 500$$
Then add 500:
$$525.6 = M$$
One of the range boundaries is an SAT score of 525.6. Let’s go find the other score for the other z-score:
$$-1.28 = \frac{M - 500}{20}$$
$$-25.6 = M - 500$$
$$474.4 = M$$
The other score is 474.4. Therefore, we know that we are 80% confident that the value of the sample mean in a sample of 25 students with a population average of 500 and a standard deviation of 100 will be between 474.4 and 525.6.
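The whole SAT example can be sketched end to end in Python. This assumes Python 3.8 or later for `statistics.NormalDist`. The function name `sample_mean_range` is hypothetical, and the z-score is rounded to two decimals on purpose so the result matches what you get from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def sample_mean_range(mu: float, sigma: float, n: int, confidence: float):
    """Range around mu that should contain the given proportion of sample means."""
    standard_error = sigma / sqrt(n)
    # Two-decimal z, mimicking a z table lookup (an assumption made for this sketch).
    z = round(NormalDist().inv_cdf(0.5 + confidence / 2), 2)
    return mu - z * standard_error, mu + z * standard_error


low, high = sample_mean_range(mu=500, sigma=100, n=25, confidence=0.80)
print(round(low, 1), round(high, 1))  # 474.4 525.6
```

Without the rounding step, the unrounded z-score gives a range that is a hair wider, which is why software and table answers can differ slightly in the last decimal.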