# Statistics Done Wrong: The Woefully Complete Guide (2015)

### Chapter 1. An Introduction to Statistical Significance

Much of experimental science comes down to measuring differences. Does one medicine work better than another? Do cells with one version of a gene synthesize more of an enzyme than cells with another version? Does one kind of signal processing algorithm detect pulsars better than another? Is one catalyst more effective at speeding a chemical reaction than another?

We use statistics to make judgments about these kinds of differences. We will always observe *some* difference due to luck and random variation, so statisticians talk about *statistically significant* differences when the difference is larger than could easily be produced by luck. So first we must learn how to make that decision.

**The Power of p Values**

Suppose you’re testing cold medicines. Your new medicine promises to cut the duration of cold symptoms by a day. To prove this, you find 20 patients with colds, give half of them your new medicine, and give the other half a placebo. Then you track the length of their colds and find out what the average cold length was with and without the medicine.

But not all colds are identical. Maybe the average cold lasts a week, but some last only a few days. Others might drag on for two weeks or more. It’s possible that the group of 10 patients who got the genuine medicine in your study all came down with really short colds. How can you prove that your medicine works, rather than just proving that some patients got lucky?

Statistical hypothesis testing provides the answer. If you know the distribution of typical cold cases—roughly how many patients get short colds, long colds, and average-length colds—you can tell how likely it is that a random sample of patients will all have longer or shorter colds than average. By performing a *hypothesis test* (also known as a *significance test*), you can answer this question: “Even if my medication were completely ineffective, what are the chances my experiment would have produced the observed outcome?”

If you test your medication on only one person, it’s not too surprising if her cold ends up being a little shorter than usual. Most colds aren’t perfectly average. But if you test the medication on 10 million patients, it’s pretty unlikely that all those patients will just happen to get shorter colds. More likely, your medication actually works.

Scientists quantify this intuition with a concept called the *p value*. The *p* value is the probability, under the assumption that there is no true effect or no true difference, of collecting data that shows a difference equal to or more extreme than what you actually observed.

So if you give your medication to 100 patients and find that their colds were a day shorter on average, then the *p* value of this result is the chance that if your medication didn’t actually do anything, their average cold would be a day shorter than the control group’s by luck alone. As you might guess, the *p* value depends on the size of the effect—colds that are shorter by four days are less common than colds that are shorter by just one day—as well as on the number of patients you test the medication on.
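To make this concrete, here is a minimal Monte Carlo sketch with invented numbers: suppose untreated cold lengths are roughly normally distributed around seven days with a two-day standard deviation (hypothetical values for illustration). The *p* value is then just the fraction of no-effect experiments in which the sample averages a full day shorter than usual:

```python
import random
import statistics

random.seed(42)

# Hypothetical numbers for illustration: untreated colds average 7 days,
# with a patient-to-patient standard deviation of 2 days.
def simulated_p_value(n_patients=100, observed_mean=6.0,
                      pop_mean=7.0, pop_sd=2.0, n_sims=20_000):
    """Monte Carlo estimate of the one-sided p value: the chance that
    n_patients drawn from the no-effect distribution average a cold
    as short as (or shorter than) the observed mean."""
    hits = 0
    for _ in range(n_sims):
        sample = [random.gauss(pop_mean, pop_sd) for _ in range(n_patients)]
        if statistics.mean(sample) <= observed_mean:
            hits += 1
    return hits / n_sims

p = simulated_p_value()                  # 100 patients: tiny p value
p10 = simulated_p_value(n_patients=10)   # 10 patients: p is much larger
print(p, p10)
```

With 100 patients, a day-long average shortening essentially never happens by luck, so *p* is nearly zero; with only 10 patients, the same observed difference is unsurprising (*p* ≈ 0.06), illustrating how *p* depends on sample size.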

Remember, a *p* value is not a measure of how right you are or how important a difference is. Instead, think of it as a measure of surprise. If you assume your medication is ineffective and there is no reason other than luck for the two groups to differ, then the smaller the *p* value, the more surprising and lucky your results are—or your assumption is wrong, and the medication truly works.

How do you translate a *p* value into an answer to this question: “Is there really a difference between these groups?” A common rule of thumb is to say that any difference where *p* < 0.05 is statistically significant. The choice of 0.05 has no special logical or statistical justification; it simply became scientific convention through decades of common use.

Notice that the *p* value works by assuming there is no difference between your experimental groups. This is a counterintuitive feature of significance testing: if you want to prove that your drug works, you do so by showing the data is *in*consistent with the drug *not* working. Because of this, *p* values can be extended to any situation where you can mathematically express a hypothesis you want to knock down.

But *p* values have their limitations. Remember, *p* is a measure of surprise, with a smaller value suggesting that you should be more surprised. It’s not a measure of the size of the effect. You can get a tiny *p* value by measuring a huge effect—“This medicine makes people live four times longer”—or by measuring a tiny effect with great certainty. And because any medication or intervention usually has *some* real effect, you can always get a statistically significant result by collecting so much data that you detect extremely tiny but relatively unimportant differences. As Bruce Thompson wrote,

*Statistical significance testing can involve a tautological logic in which tired researchers, having collected data on hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired. This tautology has created considerable damage as regards the cumulation of knowledge.*^{1}

In short, statistical significance does not mean your result has any *practical* significance. As for statistical *in*significance, it doesn’t tell you much. A statistically insignificant difference could be nothing but noise, or it could represent a real effect that can be pinned down only with more data.
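To see how a trivial effect becomes “significant” with enough data, here is a small sketch with invented numbers: a drug that truly shortens colds by just 0.1 day (against a 2-day standard deviation), evaluated with a simple one-sided z-test at several sample sizes:

```python
from statistics import NormalDist

# Hypothetical numbers for illustration: the drug truly shortens colds
# by only 0.1 day; cold lengths vary with standard deviation 2 days.
true_effect, sd = 0.1, 2.0

def one_sided_p(n):
    """p value for observing the true 0.1-day difference in the sample
    mean of n patients, under the null hypothesis of no effect."""
    se = sd / n ** 0.5          # standard error of the sample mean
    z = true_effect / se
    return 1 - NormalDist().cdf(z)

for n in (100, 1_000, 10_000):
    print(n, round(one_sided_p(n), 4))
```

The same 0.1-day effect is statistically invisible with 100 patients (*p* ≈ 0.31) but highly “significant” with 10,000, even though its practical importance never changed.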

There’s no mathematical tool to tell you whether your hypothesis is true or false; you can see only whether it’s consistent with the data. If the data is sparse or unclear, your conclusions will be uncertain.

**Psychic Statistics**

Hidden beneath these limitations are some subtler issues with *p* values. Recall that a *p* value is calculated under the assumption that luck (not your medication or intervention) is the only factor in your experiment, and that *p* is defined as the probability of obtaining a result equal to *or more extreme* than the one observed. This means *p* values force you to reason about results that never actually occurred—that is, results more extreme than yours. The probability of obtaining such results depends on your experimental design, which makes *p* values “psychic”: two experiments with different designs can produce identical data but different *p* values because the *unobserved* data is different.

Suppose I ask you a series of 12 true-or-false questions about statistical inference, and you correctly answer 9 of them. I want to test the hypothesis that you answered the questions by guessing randomly. To do this, I need to compute the chances of you getting *at least* 9 answers right by simply picking true or false randomly for each question. Assuming you pick true and false with equal probability, I compute *p* = 0.073.^{[3]} And since *p* > 0.05, it’s plausible that you guessed randomly. If you did, you’d get 9 or more questions correct 7.3% of the time.^{2}

But perhaps it was not my original plan to ask you only 12 questions. Maybe I had a computer that generated a limitless supply of questions and simply asked questions until you got 3 wrong. Now I have to compute the probability of you getting 3 questions wrong after being asked 15 or 20 or 47 of them. I even have to include the remote possibility that you made it to 175,231 questions before getting 3 questions wrong. Doing the math, I find that *p* = 0.033. Since *p* < 0.05, I conclude that random guessing would be unlikely to produce this result.

This is troubling: two experiments can collect identical data but result in different conclusions. Somehow, the *p* value can read your intentions.
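Both calculations are easy to reproduce. Here is a short sketch that scores the same 9-right, 3-wrong data under each design, using the binomial and negative binomial tail probabilities:

```python
from math import comb

# The same 9-right / 3-wrong data, scored under two different designs.

# Design 1: exactly 12 questions were planned. Binomial tail:
# P(at least 9 of 12 correct when guessing with probability 1/2).
p_binomial = sum(comb(12, k) for k in range(9, 13)) / 2 ** 12

# Design 2: questions continue until the 3rd wrong answer. Negative
# binomial tail: P(9 or more correct answers before the 3rd wrong one),
# i.e. 1 minus the chance of stopping after 8 or fewer correct answers.
p_neg_binomial = 1 - sum(comb(k + 2, 2) / 2 ** (k + 3) for k in range(9))

print(round(p_binomial, 3))      # 0.073
print(round(p_neg_binomial, 3))  # 0.033
```

Identical answers, identical data, different *p* values: the only thing that changed was the stopping rule, which determines which unobserved outcomes count as “more extreme.”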

**Neyman-Pearson Testing**

To better understand the problems of the *p* value, you need to learn a bit about the history of statistics. There are two major schools of thought in statistical significance testing. The first was popularized by R.A. Fisher in the 1920s. Fisher viewed *p* as a handy, informal method to see how surprising a set of data might be, rather than part of some strict formal procedure for testing hypotheses. The *p* value, when combined with an experimenter’s prior experience and domain knowledge, could be useful in deciding how to interpret new data.

After Fisher’s work was introduced, Jerzy Neyman and Egon Pearson tackled some unanswered questions. For example, in the cold medicine test, you can choose to compare the two groups by their means, medians, or whatever other formula you might concoct, so long as you can derive a *p*value for the comparison. But how do you know which is best? What does “best” even mean for hypothesis testing?

In science, it is important to limit two kinds of errors: *false positives*, where you conclude there is an effect when there isn’t, and *false negatives*, where you fail to notice a real effect. In some sense, false positives and false negatives are flip sides of the same coin. If we’re too ready to jump to conclusions about effects, we’re prone to get false positives; if we’re too conservative, we’ll err on the side of false negatives.

Neyman and Pearson reasoned that although it’s impossible to eliminate false positives and negatives entirely, it *is* possible to develop a formal decision-making process that will ensure false positives occur only at some predefined rate. They called this rate α, and their idea was for experimenters to set an α based upon their experience and expectations. So, for instance, if we’re willing to put up with a 10% rate of false positives, we’ll set α = 0.1. But if we need to be more conservative in our judgments, we might set α at 0.01 or lower. To determine which testing procedure is best, we see which has the lowest false negative rate for a given choice of α.

How does this work in practice? Under the Neyman-Pearson system, we define a *null hypothesis*—a hypothesis that there is no effect—as well as an *alternative hypothesis*, such as “The effect is greater than zero.” Then we construct a test that compares the two hypotheses, and determine what results we’d expect to see were the null hypothesis true. We use the *p* value to implement the Neyman-Pearson testing procedure by rejecting the null hypothesis whenever *p* < α. Unlike Fisher’s procedure, this method deliberately does not address the strength of evidence in any one particular experiment; now we are interested in only the decision to reject or not. The size of the *p* value isn’t used to compare experiments or draw any conclusions besides “The null hypothesis can be rejected.” As Neyman and Pearson wrote,

*We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.*

*But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.*^{3}

Although Neyman and Pearson’s approach is conceptually distinct from Fisher’s, practicing scientists often conflate the two.^{4,5,6} The Neyman-Pearson approach is where we get “statistical significance,” with a prechosen *p* value threshold that guarantees the long-run false positive rate. But suppose you run an experiment and obtain *p* = 0.032. If your threshold was the conventional *p* < 0.05, this is statistically significant. But it’d also have been statistically significant if your threshold was *p* < 0.033. So it’s tempting—and a common misinterpretation—to say “My false positive rate is 3.2%.”

But that doesn’t make sense. A single experiment does not have a false positive rate. The false positive rate is determined by your *procedure*, not the result of any single experiment. You can’t claim each experiment had a false positive rate of exactly *p*, whatever that turned out to be, when you were using a procedure to get a long-run false positive rate of α.
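The distinction shows up in a short simulation (a sketch with arbitrary parameters): run many experiments in which the null hypothesis is true and reject whenever *p* < α. The long-run false positive rate tracks α, regardless of the particular *p* each individual experiment happened to produce:

```python
import random
from statistics import NormalDist, mean

random.seed(1)

# Sketch: the false positive rate belongs to the procedure. Simulate many
# null experiments (two groups drawn from the same distribution) and
# reject whenever p < alpha. Parameters here are arbitrary.
alpha, n, n_experiments = 0.05, 20, 10_000
false_positives = 0
for _ in range(n_experiments):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = (1 / n + 1 / n) ** 0.5          # known sd = 1 in both groups
    z = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p value
    if p < alpha:
        false_positives += 1

print(false_positives / n_experiments)   # close to alpha = 0.05
```

Individual experiments in this simulation produce *p* values scattered all over the unit interval, but the fraction of false rejections hovers near α. That rate is a property of the decision rule, not of any single result.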

**Have Confidence in Intervals**

Significance tests tend to receive lots of attention, with the phrase “statistically significant” now part of the popular lexicon. Research results, especially in the biological and social sciences, are commonly presented with *p* values. But *p* isn’t the only way to evaluate the weight of evidence. *Confidence intervals* can answer the same questions as *p* values, with the advantage that they provide more information and are more straightforward to interpret.

A confidence interval combines a point estimate with the uncertainty in that estimate. For instance, you might say your new experimental drug reduces the average length of a cold by 36 hours and give a 95% confidence interval between 24 and 48 hours. (The confidence interval is for the *average* length; individual patients may have wildly varying cold lengths.) If you run 100 identical experiments, about 95 of the confidence intervals will include the true value you’re trying to measure.
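The coverage claim is easy to check by simulation. Here is a sketch with made-up numbers: a true 36-hour average reduction, a 30-hour patient-to-patient standard deviation, and 25 patients per trial:

```python
import random
from statistics import NormalDist, mean

random.seed(7)

# Hypothetical numbers for illustration: the drug truly shortens colds
# by 36 hours on average, with a 30-hour standard deviation per patient.
true_mean, sd, n, n_trials = 36.0, 30.0, 25, 1_000
z95 = NormalDist().inv_cdf(0.975)        # about 1.96

covered = 0
for _ in range(n_trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m, se = mean(sample), sd / n ** 0.5
    lo, hi = m - z95 * se, m + z95 * se  # 95% confidence interval
    if lo <= true_mean <= hi:
        covered += 1

print(covered / n_trials)                # close to 0.95
```

Each trial produces a different interval, but in the long run about 95% of them contain the true effect. Checking whether an interval includes zero is the confidence-interval analogue of a significance test, with the estimate’s precision thrown in for free.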

A confidence interval quantifies the uncertainty in your conclusions, providing vastly more information than a *p* value, which says nothing about effect sizes. If you want to test whether an effect is significantly different from zero, you can construct a 95% confidence interval and check whether the interval includes zero. In the process, you get the added bonus of learning how precise your estimate is. If the confidence interval is too wide, you may need to collect more data.

For example, if you run a clinical trial, you might produce a confidence interval indicating that your drug reduces symptoms by somewhere between 15 and 25 percent. This effect is statistically significant because the interval doesn’t include zero, and now you can assess the importance of this difference using your clinical knowledge of the disease in question. As when you were using *p* values, this step is important—you shouldn’t trumpet this result as a major discovery without evaluating it in context. If the symptom is already pretty innocuous, maybe a 15–25% improvement isn’t too important. Then again, for a symptom like spontaneous human combustion, you might get excited about *any* improvement.

If you can write a result as a confidence interval instead of as a *p* value, you should.^{7} Confidence intervals sidestep most of the interpretational subtleties associated with *p* values, making the resulting research that much clearer. So why are confidence intervals so unpopular? In experimental psychology research journals, 97% of research papers involve significance testing, but only about 10% ever report confidence intervals—and most of those don’t use the intervals as supporting evidence for their conclusions, relying instead on significance tests.^{8} Even the prestigious journal *Nature* falls short: 89% of its articles report *p* values without any confidence intervals or effect sizes, making their results impossible to interpret in context.^{9} One journal editor noted that “*p* values are like mosquitoes” in that they “have an evolutionary niche somewhere and [unfortunately] no amount of scratching, swatting or spraying will dislodge them.”^{10}

One possible explanation is that confidence intervals go unreported because they are often embarrassingly wide.^{11} Another is that the peer pressure of peer-reviewed science is too strong—it’s best to do statistics the same way everyone else does, or else the reviewers might reject your paper. Or maybe the widespread confusion about *p* values obscures the benefits of confidence intervals. Or the overemphasis on hypothesis testing in statistics courses means most scientists don’t know how to calculate and use confidence intervals.

Journal editors have sometimes attempted to enforce the reporting of confidence intervals. Kenneth Rothman, an associate editor at the *American Journal of Public Health* in the mid-1980s, began returning submissions with strongly worded letters:

*All references to statistical hypothesis testing and statistical significance should be removed from the paper. I ask that you delete p values as well as comments about statistical significance. If you do not agree with my standards (concerning the inappropriateness of significance tests), you should feel free to argue the point, or simply ignore what you may consider to be my misguided view, by publishing elsewhere.*^{12}

During Rothman’s three-year tenure as associate editor, the fraction of papers reporting solely *p* values dropped precipitously. Significance tests returned after his departure, although subsequent editors successfully encouraged researchers to report confidence intervals as well. But despite reporting confidence intervals, few researchers discussed them in their articles or used them to draw conclusions, preferring instead to treat them merely as significance tests.^{12}

Rothman went on to found the journal *Epidemiology*, which had a strong statistical reporting policy. Early on, authors familiar with significance testing preferred to report *p* values alongside confidence intervals, but after 10 years, attitudes had changed, and reporting only confidence intervals became common practice.^{12}

Perhaps brave (and patient) journal editors can follow Rothman’s example and change statistical practices in their fields.

^{[3]} I used a probability distribution known as the *binomial distribution* to calculate this result. In the next paragraph, I’ll calculate *p* using a different distribution, called the *negative binomial distribution*. A detailed explanation of probability distributions is beyond the scope of this book; we’re more interested in how to interpret *p* values rather than how to calculate them.