We estimate the variability by estimating what percentage of the population will fall into the most important group. For example, suppose we are trying to estimate the result of a presidential election, and we know that the pollsters have been calling it a close race. We might then assume that the potential vote was evenly split, with 50% going to each candidate; it would make little difference if we said 52% for one party and 48% for the other. In this case we would tell the program to use 50% to estimate the variability. The important thing is to distinguish such an estimate from one that gave one party 25% and the other 75%. If we were studying undiagnosed reading disorders in children, we might guess that 6% of all grade school students suffered from such problems. Here we would tell the program to use 6% as the estimate of variability.

If we had no idea what the split would be, the most conservative course would be to use an estimate of 50%. This can be demonstrated mathematically, but it can also be explained intuitively. Suppose you were playing a game in which you had to predict the outcome of a coin flip, and you could choose between an honest coin with a 50-50 chance of heads or tails and a dishonest coin that came down heads 90% of the time. It is fairly obvious that the dishonest coin would be easier to predict than the 50-50 coin. This is, after all, what a dishonest gambler wants: a situation in which it is easier to predict the outcome correctly. The same logic applies to estimating the variance: a variable with a fifty-fifty split is harder to guess (has a higher variance) than one with a ten-to-ninety split (the dishonest coin above).
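The coin-flip intuition can be checked directly. For a yes/no (Bernoulli) variable with proportion p in one group, the variance is p(1 - p), which is largest at p = 0.5. A minimal sketch, using the splits mentioned above:

```python
# Variance of a yes/no (Bernoulli) response as a function of the split p.
# p*(1-p) is maximized at p = 0.5, which is why 50% is the conservative choice.
def bernoulli_variance(p):
    return p * (1 - p)

# The even split, the dishonest coin, and the reading-disorder guess:
for p in (0.50, 0.90, 0.06):
    print(f"p = {p:.2f}  variance = {bernoulli_variance(p):.4f}")
# p = 0.50 gives 0.2500, the maximum; 0.90 and 0.06 give smaller variances.
```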

In the examples above we assumed that our measurements split the population into two groups, but sometimes this is not the case. Suppose, for example, that we are surveying a forest damaged by acid rain. Following an evaluation procedure, we can rank trees as very healthy, healthy, somewhat damaged, damaged, or dead. We expect 30% each in the groups very healthy and healthy, 20% somewhat damaged, and 10% each in the groups damaged and dead. In most cases we would use 30% as the estimate of variability, since it is the value closest to 50%. If we were certain that we were only interested in estimating the size of the categories damaged and dead, we might choose to use 10% instead. Remember that the most conservative course, the one least likely to lead us to underestimate the potential error in our sample, is always to use the value closest to 50%. We will see below that when we use the program to calculate the sample size, we are asked to enter an estimate of variability.
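The rule for several categories reduces to picking the expected fraction closest to 50%. A small sketch, using the hypothetical forest-survey fractions from the paragraph above:

```python
# Pick the conservative variability estimate: the expected category
# fraction closest to 0.50 (expected fractions are the ones assumed above).
def conservative_p(fractions):
    return min(fractions, key=lambda p: abs(p - 0.5))

categories = {"very healthy": 0.30, "healthy": 0.30,
              "somewhat damaged": 0.20, "damaged": 0.10, "dead": 0.10}
print(conservative_p(categories.values()))  # 0.3
```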

That is, if after the study 6% of our sample of students had learning disabilities, then we would believe that the value for the population as a whole was between 2% and 10%. On the other hand, if this study were to be used to decide whether to go forward with an expensive program of testing and remediation, we might want a precision as fine as 1%.

Notice that precision values are expressed as a fraction of the entire population, not as a fraction of the observed percentage. One consequence is that, as the fraction of the population in the group of greatest interest gets smaller, we generally choose a smaller value for the precision. For example, if we expect to find that 50% of a sample of supermarket shoppers would buy strawberry-mint ice cream, then a sample size that gives a precision of 5% is no problem. If we observe a sample response of 50%, then knowing that the population value is probably between 45% and 55% does not seem to put our estimate very far off. On the other hand, if the sample value were 10%, then the range of 5% to 15% might seem very large. If the population value were 15%, then strawberry-mint might be a worthwhile specialty flavor; if only 5% would buy strawberry-mint ice cream, it might deteriorate before an entire batch could be sold.
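The ranges quoted above come from the usual confidence interval for a proportion, whose half-width is z·sqrt(p(1 - p)/n). A minimal sketch, assuming a 95% confidence level (z ≈ 1.96) and an illustrative sample size of 385:

```python
import math

# Approximate half-width of a confidence interval for a sample proportion:
#   half_width = z * sqrt(p * (1 - p) / n)
# z = 1.96 corresponds to 95% confidence; n = 385 is an assumed sample size.
def half_width(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

print(round(half_width(0.50, 385), 3))  # 0.05 -> the 45%-55% range above
print(round(half_width(0.10, 385), 3))  # 0.03 -> narrower, but large relative to 10%
```

Note that the same sample gives a narrower absolute range at 10% than at 50%, yet that range is much larger relative to the observed value, which is exactly why a smaller precision is usually chosen for rarer groups.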

Even if our results are wrong and wind up outside of the region set by the precision, they are still more likely to be close to the population value than they are to be far away.

One of the first things to remember in setting a confidence level is that it covers only one special class of errors, namely the chance that a random sample is not representative of the whole population. A researcher needs to consider how accurate the other parts of the research are. For example, opinion surveys are likely to be inherently less precise than, say, the weights of a sample of rats. Data on rats might reasonably be reported at a 99% confidence level, while the fraction of voters who think that the president is doing a "good job" might be described with 95% or 90% confidence. The researcher must also consider matching the precision range to the confidence level: in general, a small precision range is paired with a higher confidence level, while a large precision range is paired with a lower one.
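The three quantities discussed so far, the variability estimate, the precision, and the confidence level, combine in the standard sample-size formula for a proportion, n = z²p(1 - p)/e². This is a sketch of the kind of calculation the program described in the text performs, not its actual code; the z-values are the standard ones for each confidence level:

```python
import math

# Standard sample-size formula for estimating a proportion:
#   n = z^2 * p * (1 - p) / e^2
# p = variability estimate, e = precision, z = standard normal value
# for the chosen confidence level.
Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def sample_size(p, e, confidence=0.95):
    z = Z[confidence]
    return math.ceil(z * z * p * (1 - p) / (e * e))

print(sample_size(0.50, 0.05))        # 385: even split, 5% precision, 95% confidence
print(sample_size(0.06, 0.01, 0.99))  # the reading-disorder guess at tighter settings
```

The first call shows the conservative 50% assumption with a 5% precision; the second pairs the small 6% variability estimate with a correspondingly smaller precision, as recommended above.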


this page is at http://testbed.cis.drexel.edu/sample/variability.html