Statistics Done Wrong: The Woefully Complete Guide (2015)
Chapter 9. Researcher Freedom: Good Vibrations?
There’s a common misconception that statistics is boring and monotonous. Collect lots of data; plug numbers into Excel, SPSS, or R; and beat the software with a stick until it produces colorful charts and graphs. Done! All the statistician must do is enter some commands and read the results.
But one must choose which commands to use. Two researchers attempting to answer the same question can and often do perform entirely different statistical analyses. There are many decisions to make.
What do I measure?
§ This isn’t as obvious as it sounds. If I’m testing a psychiatric medication, I could use several different scales to measure symptoms: various brain function tests, reports from doctors, or all sorts of other measurements. Which will be most useful?
Which variables do I adjust for?
§ In a medical trial, I might control for patient age, gender, weight, BMI, medical history, smoking, or drug use, or for the results of medical tests done before the start of the study. Which of these factors are important? Which can be ignored? How do I measure them?
Which cases do I exclude?
§ If I’m testing diet plans, maybe I want to exclude test subjects who came down with diarrhea during the trial, since their results will be abnormal. Or maybe diarrhea is a side effect of the diet and I must include it. There will always be some results that are out of the ordinary, for reasons known or unknown. I may want to exclude them or analyze them specially. Which cases count as outliers? What do I do with them?
How do I define groups?
§ For example, I may want to split patients into “overweight,” “normal,” and “underweight” groups. Where do I draw the lines? What do I do with a muscular bodybuilder whose BMI is in the “overweight” range?
What about missing data?
§ Perhaps I’m testing cancer remission rates with a new drug. I run the trial for five years, but some patients will have tumors reappear after six years or eight years. My data does not include their recurrence. Or perhaps some patients dropped out because of side effects or personal problems. How do I account for this when measuring the effectiveness of the drug?
How much data should I collect?
§ Should I stop when I have a definitive result or continue as planned until I’ve collected all the data? What if I have trouble enrolling as many patients as desired?
It can take hours of exploration to see which procedures are most appropriate. Papers usually explain the statistical analysis performed but don’t always explain why researchers chose one method over another or what the results would have been had they chosen a different method. Researchers are free to choose whatever methods they feel appropriate—and though they may make good choices, what happens if they analyze the data differently?
This statistical freedom allows bias to creep into analysis undetected, even when analysts have the best of intentions. A few analysis decisions can change results dramatically, suggesting that perhaps analysts should make the decisions before they see the data. Let’s start with the outsized impact of small analysis decisions.
A Little Freedom Is a Dangerous Thing
In simulations, it’s possible to get effect sizes that differ by a factor of two simply by adjusting for different variables, excluding different sets of cases, and handling outliers differently.[1] Even reasonable practices, such as remeasuring patients with strange laboratory test results or removing clearly abnormal patients, can bring a statistically insignificant result to significance.[2] Apparently, being free to analyze how you want gives you enormous control over your results!
A group of researchers demonstrated this phenomenon with a simple experiment. Twenty undergraduates were randomly assigned to listen to either “When I’m Sixty-Four” by the Beatles or “Kalimba,” a song that comes with the Windows 7 operating system. Afterward, they were asked their age and their father’s age. The two groups were compared, and it was found that “When I’m Sixty-Four” listeners were a year and a half younger on average, controlling for their father’s age, with p < 0.05. Since the groups were randomly assigned, the only plausible source of the difference was the music.
Rather than publishing The Musical Guide to Staying Young, the researchers explained the tricks they used to obtain this result. They didn’t decide in advance how much data to collect; instead, they recruited students and ran statistical tests periodically to see whether a significant result had been achieved. (You saw earlier that such stopping rules can inflate false-positive rates significantly.) They also didn’t decide in advance to control for the age of the subjects’ fathers, instead asking how old they felt, how much they would enjoy eating at a diner, the square root of 100, their mother’s age, their agreement with “computers are complicated machines,” whether they would take advantage of an early-bird special, their political orientation, which of four Canadian quarterbacks they believed won an award, how often they refer to the past as “the good old days,” and their gender.
Only after looking at the data did the researchers decide on which outcome variable to use and which variables to control for. (Had the results been different, they might have reported that “When I’m Sixty-Four” causes students to, say, be less able to calculate the square root of 100, controlling for their knowledge of Canadian football.) Naturally, this freedom allowed the researchers to make multiple comparisons and inflated their false-positive rate. In a published paper, they wouldn’t need to mention the other insignificant variables; they’d be free to discuss the apparent antiaging benefit of the Beatles. The fallacy would not be visible to the reader.
Further simulation by the researchers suggested that if scientists try different statistical analyses until one works (say, by controlling for different combinations of variables and trying different sample sizes), false-positive rates can jump to more than 50% for a given dataset.[3]
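The inflation is easy to see in a toy simulation. The sketch below is not the researchers' simulation; it is a minimal illustration under stated assumptions: both groups are pure noise drawn from a unit-variance normal distribution (so a z-test is appropriate), the outcome variables are independent of each other, and the function names and parameters (`one_study`, `false_positive_rate`, `n_outcomes`, `step`) are invented for this example. It compares a single planned test against an analyst who measures four outcomes and checks for significance after every ten additional subjects per group.

```python
import math
import random

def z_test_p(a, b):
    """Two-sided p-value for a difference in means, assuming unit variance."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    se = math.sqrt(1 / len(a) + 1 / len(b))
    return math.erfc(abs(diff) / se / math.sqrt(2))

def one_study(rng, n_outcomes, start_n, max_n, step):
    """Return True if any look at any outcome of pure-noise data gives p < 0.05."""
    # One list of measurements per outcome variable, for each group.
    group_a = [[rng.gauss(0, 1) for _ in range(start_n)] for _ in range(n_outcomes)]
    group_b = [[rng.gauss(0, 1) for _ in range(start_n)] for _ in range(n_outcomes)]
    n = start_n
    while True:
        # Test every outcome; report success as soon as one is "significant".
        if any(z_test_p(a, b) < 0.05 for a, b in zip(group_a, group_b)):
            return True
        if n >= max_n:
            return False
        # Not significant yet: recruit more subjects and look again.
        for outcome in group_a + group_b:
            outcome.extend(rng.gauss(0, 1) for _ in range(step))
        n += step

def false_positive_rate(n_outcomes, start_n, max_n, step, trials=2000, seed=1):
    """Fraction of simulated null studies that report a significant result."""
    rng = random.Random(seed)
    hits = sum(one_study(rng, n_outcomes, start_n, max_n, step) for _ in range(trials))
    return hits / trials

fixed = false_positive_rate(n_outcomes=1, start_n=20, max_n=20, step=10)
flexible = false_positive_rate(n_outcomes=4, start_n=20, max_n=50, step=10)
print(f"one outcome, fixed sample size: {fixed:.3f}")
print(f"four outcomes, peeking every 10 subjects: {flexible:.3f}")
```

The first rate should land near the nominal 5%, while the second should be several times larger, even though no individual step looks unreasonable. Real outcome variables are usually correlated, which would soften the effect somewhat; the independence assumption here is a simplification.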
This example sounds outlandish, and most scientists would protest that they don’t intentionally tinker with the data until a significant result appears. They construct a hypothesis, collect data, explore the data a bit, and run a reasonable statistical analysis to test the hypothesis. Perhaps we could have tried 100 analyses until we got a fantastic result, they say, but we didn’t. We picked one analysis that seemed appropriate for the data and stuck with it.
But the choice of analysis strategy is always based on the data. We look at our data to decide which variables to include, which outliers to remove, which statistical tests to use, and which outcomes to examine. We do this not with the explicit goal of finding the most statistically significant result but to design an analysis that accounts for the peculiarities that arise in any dataset. Had we collected different data—had that one patient suffered from chronic constipation instead of acute diarrhea—we would choose a different statistical analysis. We bias the analysis to produce results that “make sense.”
Furthermore, a single prespecified scientific hypothesis does not necessarily correspond to a single statistical hypothesis. Many different statistical results could all be interpreted to support a hypothesis. You may believe that one drug has fewer side effects than another, but you will accept statistically significant drops in any of a dozen side effects as evidence. You may believe that women are more likely to wear red or pink during ovulation, but you will accept statistically significant effects for red shirts, pink shirts, or the combination of both. (Or perhaps you will accept effects for shirts, pants, hats, socks, or other kinds of clothing.) If you hypothesize that ovulation makes single women more liberal, you will accept changes in any of their voting choices, religious beliefs, and political values as evidence. The choices that produce interesting results will attract our attention and engage our human tendency to build plausible stories for any outcome.
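A rough calculation shows how quickly this adds up. Assuming, for illustration only, that the dozen side-effect tests are independent and each uses a 0.05 significance threshold, the chance of at least one false positive is 1 − (1 − 0.05)^12. The helper name `familywise_error_rate` is invented for this sketch.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 3, 12):
    print(k, round(familywise_error_rate(k), 3))
```

With twelve acceptable outcomes, the chance of finding at least one "significant" effect when none exists is roughly 46%, not 5%. Correlated outcomes would give a smaller figure, but still well above the nominal rate.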
The most worrying consequence of this statistical freedom is that researchers may unintentionally choose the statistical analysis most favorable to them. Their resulting estimates of uncertainty—standard errors, confidence intervals, and so on—will be biased. The false-positive rate will be inflated because the data guided their statistical design.
Avoiding Bias
In physics, unconscious biases have long been recognized as a problem. Measurements of physical constants, such as the speed of light or subatomic particle properties, tend to cluster around previous measurements rather than the eventually accepted “truth.”[8] It seems an experimentalist, obtaining results that disagree with earlier studies, “searches for the source or sources of such errors, and continues to search until he gets a result close to the accepted value. Then he stops!”[9]
Seeking to eliminate this bias, particle physicists have begun performing blind analyses: the scientists analyzing the data avoid calculating the value of interest until after the analysis procedure is finalized. Sometimes this is easy: Frank Dunnington, measuring the electron’s charge-to-mass ratio in the early 1930s, had his machinist build the experimental apparatus with the detector close to, but not exactly at, the optimal angle. Without the precise angle measurement, Dunnington could not calculate his final answer, so he devised his analysis procedures while unable to subconsciously bias the results. Once he was ready, he measured the angle and calculated the final ratio.
Blind analysis isn’t always this straightforward, of course, but particle physicists have begun to adopt it for major experiments. Other blinding techniques include adding a constant to all measurements, keeping this constant hidden from analysts until the analysis is finalized; having independent groups perform separate parts of the analysis and only later combining their results; or using simulations to inject false data that is later removed. Results are unblinded only after the research group is satisfied that the analysis is complete and appropriate.
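The hidden-constant technique can be sketched in a few lines. This is a toy illustration, not a description of any experiment's actual tooling: the class name `BlindedMeasurements`, its parameters, and the measurement values are all invented, and in practice the offset would be held by someone outside the analysis team rather than stored alongside the data.

```python
import random
import statistics

class BlindedMeasurements:
    """Hypothetical helper: shifts every value by a secret constant."""

    def __init__(self, values, seed=None, scale=10.0):
        # The offset is drawn once and never shown to the analysts.
        self._offset = random.Random(seed).uniform(-scale, scale)
        self.blinded = [v + self._offset for v in values]

    def unblind(self, estimate):
        """Remove the hidden offset from a location estimate (mean, median)."""
        return estimate - self._offset

raw = [9.79, 9.82, 9.80, 9.85, 9.78]  # made-up measurements
data = BlindedMeasurements(raw, seed=42)

# Analysts choose cuts, outlier rules, and estimators using only data.blinded,
# so they cannot tell whether the answer is drifting toward the accepted value.
blinded_estimate = statistics.mean(data.blinded)

# Only after the procedure is frozen is the offset removed.
final_estimate = data.unblind(blinded_estimate)
print(final_estimate)
```

An additive offset only works cleanly for estimates that shift along with the data, such as means and medians. Quantities like ratios or variances would need a different blinding scheme.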
In some medical studies, triple blinding is performed as a form of blind analysis: the patients, doctors, and statisticians are all kept unaware of which group is the control group until the analysis is complete. This does not eliminate all sources of bias. For example, the statistician may not be able to unconsciously favor the treatment group, but she may be biased toward a larger difference between groups. More extensive blinding techniques are not in frequent use, and significant methodological research is required to determine how common statistical techniques can be blinded without making analysis impractical.
Instead of triple blinding, one option is to remove the statistician’s freedom of choice. A limited form of this, covering the design and execution of the experiment rather than its analysis, is common in medicine. Doctors are required to draft a clinical trial protocol explaining how the data will be collected, including the planned sample size and measured outcome variables, and then the protocol is reviewed by an ethics committee to ensure it adequately protects patient safety and privacy. Because the protocol is drafted before data is collected, doctors can’t easily tinker with the design to obtain favorable results. Unfortunately, many studies depart from their protocols, allowing researcher bias to creep in.[10, 11] Journal editors often don’t compare submitted papers to original protocols and don’t require authors to explain why their protocols were violated, so there is no way to determine the motivation for the changes.
Many scientific fields have no protocol publication requirement, and in sciences such as psychology, psychiatry, and sociology, there is often no single agreed-upon methodology to use for a particular experiment. Appropriate designs for medical trials or physics experiments have been analyzed to death, but it’s often unclear how to handle less straightforward behavioral studies. The result is an explosion of diversity in study design, with every new paper using a different combination of methods. When there is intense pressure to produce novel results, as there usually is in the United States, researchers in these fields tend to produce biased and extreme results more frequently because of their freedom in experimental design and data analysis.[12] In response, some have proposed allowing protocol registration for confirmatory research, lending subsequent results greater credibility.
Of course, to paraphrase Helmuth von Moltke, no analysis plan survives contact with the data. There may be complications and problems you did not anticipate. Your assumptions about the distribution of measurements, the correlation between variables, and the likely causes of outliers—all essential to your choice of analysis—may be entirely wrong. You might have no idea what assumptions to make before collecting the data. When that happens, it’s better to correct your analysis than to proceed with an obviously wrong preplanned analysis.
It may not even be possible to prespecify an analysis before seeing the data. Perhaps you decide to test a new hypothesis using a common dataset that you have used for years, perhaps you aren’t sure what hypothesis is relevant until you see the data, or perhaps the data suggests interesting hypotheses you hadn’t thought of before collecting it. For some fields, prepublication replication can solve this problem: collect a new, independent dataset and analyze it using exactly the same methods. If the effect remains, you can be confident in your results. (Be sure your new sample has adequate statistical power.) But for economists studying a market crash, it’s not possible (or at least not ethical) to arrange for another one. For a doctor studying a cancer treatment, patients may not be able to wait for replication.
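For the parenthetical about power, a back-of-the-envelope sample size can be computed before collecting the replication data. The sketch below uses the standard normal-approximation formula for comparing two group means, n = 2((z₁₋α/₂ + z_power) / d)² per group, where d is the standardized effect size. The function name `subjects_per_group` and the default of 80% power are choices made for this example, and an exact t-test calculation would give slightly larger numbers.

```python
import math
from statistics import NormalDist

def subjects_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(subjects_per_group(0.5))  # medium standardized effect
print(subjects_per_group(0.2))  # small standardized effect
```

If the effect size is taken from the original exploratory study, it is probably an overestimate, so it may be wise to plan the replication around a smaller effect than the one first observed.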
The proliferation of statistical techniques has given us useful tools, but it seems they’ve been put to use as blunt objects with which to beat the data until it confesses. With preregistered analyses, blinding, and further research into experimental methods, we can start to treat our data more humanely.
Tips
§ Before collecting data, plan your data analysis, accounting for multiple comparisons and including any effects you’d like to look for.
§ Register your clinical trial protocol if applicable.
§ If you deviate from your planned protocol, note this in your paper and provide an explanation.
§ Don’t just torture the data until it confesses. Have a specific statistical hypothesis in mind before you begin your analysis.
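As one way to account for multiple comparisons in a planned analysis, the sketch below applies Holm's step-down correction to a set of p-values. The choice of Holm's method is an assumption made for this example, not a recommendation from the text; the function name `holm_reject` and the p-values are invented for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects the null."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The threshold loosens at each step: alpha/m, alpha/(m-1), ...
        if p_values[i] > alpha / (m - rank):
            break  # this and all larger p-values are not rejected
        reject[i] = True
    return reject

p_values = [0.004, 0.03, 0.02, 0.30]  # made-up results for four planned tests
print(holm_reject(p_values))
```

Three of these four p-values fall below 0.05 on their own, but only the smallest survives the correction. Deciding on the set of tests and the correction before seeing the data is what keeps the stated error rate honest.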
Note on the red-or-pink example: this was a real study, claiming that women at peak fertility were three times more likely to wear red or pink.[4] Columbia University statistician Andrew Gelman wrote an article in Slate criticizing the many degrees of freedom in the study, using it as an example to attack statistical methods in psychology in general.[5]
Note on the ovulation-and-politics example: I am not making up this study either. It also found that “ovulation led married women to become more conservative.”[6] A large replication attempt found no evidence for either claim.[7]