Statistics Done Wrong: The Woefully Complete Guide (2015)
In the final chapter of his famous book How to Lie with Statistics, Darrell Huff tells us that “anything smacking of the medical profession” or backed by scientific laboratories and universities is worthy of our trust: not unconditional trust, but certainly more trust than we’d afford the media or politicians. (After all, Huff’s book is filled with the misleading statistical trickery used in politics and the media.) But few people complain about statistics done by trained scientists. Scientists seek understanding, not ammunition to use against political opponents.
Statistical data analysis is fundamental to science. Open a random page in your favorite medical journal and you’ll be deluged with statistics: t tests, p values, proportional hazards models, propensity scores, logistic regressions, least-squares fits, and confidence intervals. Statisticians have provided scientists with tools of enormous power to find order and meaning in the most complex of datasets, and scientists have embraced them with glee.
They have not, however, embraced statistics education, and many undergraduate programs in the sciences require no statistical training whatsoever.
Since the 1980s, researchers have described numerous statistical fallacies and misconceptions in the popular peer-reviewed scientific literature and have found that many scientific papers—perhaps more than half—fall prey to these errors. Inadequate statistical power renders many studies incapable of finding what they’re looking for, multiple comparisons and misinterpreted p values cause numerous false positives, flexible data analysis makes it easy to find a correlation where none exists, and inappropriate model choices bias important results. Most errors go undetected by peer reviewers and editors, who often have no specific statistical training, because few journals employ statisticians to review submissions and few papers give sufficient statistical detail to be accurately evaluated.
The problem isn’t fraud but poor statistical education—poor enough that some scientists conclude that most published research findings are probably false.1 Review articles and editorials appear regularly in leading journals, demanding higher statistical standards and tougher review, but few scientists hear their pleas, and journal-mandated standards are often ignored. Because statistical advice is scattered between frequently misleading textbooks, review articles in assorted journals, and statistical research papers difficult for scientists to understand, most scientists have no easy way to improve their statistical practice.
The methodological complexity of modern research means that scientists without extensive statistical training may not be able to understand most published research in their fields. In medicine, for example, a doctor who took one standard introductory statistics course would have sufficient knowledge to fully understand only about a fifth of research articles published in the New England Journal of Medicine.2 Most doctors have even less training; many medical residents learn statistics informally through journal clubs or short courses, rather than through required courses.3 The content that is taught to medical students is often poorly understood, with residents averaging less than 50% correct on tests of statistical methods commonly used in medicine.4 Even medical school faculty with research training score less than 75% correct.
The situation is so bad that even the authors of surveys of statistical knowledge lack the necessary statistical knowledge to formulate survey questions—the numbers I just quoted are misleading because the survey of medical residents included a multiple-choice question asking residents to define a p value and gave four incorrect definitions as the only options.5 We can give the authors some leeway because many introductory statistics textbooks also poorly or incorrectly define this basic concept.
When the designers of scientific studies don’t employ statistics with sufficient care, they can sink years of work and thousands of dollars into research that cannot possibly answer the questions it is meant to answer. As psychologist Paul Meehl complained,
Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.6
Perhaps it is unfair to accuse most scientists of intellectual infertility, since most scientific fields rest on more than a few misinterpreted p values. But these errors have massive impacts on the real world. Medical clinical trials direct our health care and determine the safety of powerful new prescription drugs, criminologists evaluate different strategies to mitigate crime, epidemiologists try to slow down new diseases, and marketers and business managers try to find the best way to sell their products—it all comes down to statistics. Statistics done wrong.
Anyone who’s ever complained about doctors not making up their minds about what is good or bad for you understands the scope of the problem. We now have a dismissive attitude toward news articles claiming some food or diet or exercise might harm us—we just wait for the inevitable second study some months later, giving exactly the opposite result. As one prominent epidemiologist noted, “We are fast becoming a nuisance to society. People don’t take us seriously anymore, and when they do take us seriously, we may unintentionally do more harm than good.”7 Our instincts are right. In many fields, initial results tend to be contradicted by later results. It seems the pressure to publish exciting results early and often has surpassed the responsibility to publish carefully checked results supported by a surplus of evidence.
Let’s not judge so quickly, though. Some statistical errors result from a simple lack of funding or resources. Consider the mid-1970s movement to allow American drivers to turn right at red lights, saving gas and time; the evidence suggesting this would cause no more crashes than before was statistically flawed, as you will soon see, and the change cost many lives. The only factor holding back traffic safety researchers was a lack of data. Had they the money to collect more data and perform more studies—and the time to collate results from independent researchers in many different states—the truth would have been obvious.
While Hanlon’s razor directs us to “never attribute to malice that which is adequately explained by incompetence,” there are some published results of the “lies, damned lies, and statistics” sort. The pharmaceutical industry seems particularly tempted to bias evidence by neglecting to publish studies that show their drugs do not work; subsequent reviewers of the literature may be pleased to find that 12 studies indicate a drug works, without knowing that 8 other unpublished studies suggest it does not. Of course, it’s likely that such results would not be published by peer-reviewed journals even if they were submitted—a strong bias against unexciting results means that studies saying “it didn’t work” never appear and other researchers never see them. Missing data and publication bias plague science, skewing our perceptions of important issues.
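To make the arithmetic of the 12-versus-8 example concrete, here is a minimal sketch. The function name and its parameters are hypothetical, chosen only for illustration, and it assumes the simplest possible case: every favorable study is published and every unfavorable one is not.

```python
def success_rates(published_positive, unpublished_negative):
    """Return (apparent, actual) fractions of studies favoring the drug.

    Assumption for illustration: all positive studies are published
    and all negative studies stay in the file drawer.
    """
    total = published_positive + unpublished_negative
    # A reviewer who sees only the published literature sees only successes.
    apparent = published_positive / published_positive
    # The full record includes the studies nobody saw.
    actual = published_positive / total
    return apparent, actual


apparent, actual = success_rates(published_positive=12, unpublished_negative=8)
print(f"Apparent success rate: {apparent:.0%}")
print(f"Actual success rate:   {actual:.0%}")
```

Under these assumptions a literature reviewer sees unanimous support, while the complete record shows that only 12 of 20 studies favored the drug.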
Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to “torture the data until it confesses.” Just try several different analyses offered by your statistical software until one of them turns up an interesting result, and then pretend this is the analysis you intended to do all along. Without psychic powers, it’s almost impossible to tell when a published result was obtained through data torture.
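A rough calculation shows why trying several analyses is so dangerous. The sketch below assumes each analysis is an independent test at the conventional 0.05 significance level and that no real effect exists. Real analyses of the same dataset are correlated, so treat the numbers as an illustration rather than an exact figure; the function name is hypothetical.

```python
def chance_of_false_positive(num_analyses, alpha=0.05):
    """Probability that at least one of num_analyses independent tests
    comes out significant at level alpha when there is no real effect."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them must avoid one for the researcher to come up empty.
    return 1 - (1 - alpha) ** num_analyses


for k in (1, 5, 10, 20):
    rate = chance_of_false_positive(k)
    print(f"{k:2d} analyses: {rate:.0%} chance of at least one spurious result")
```

Under the independence assumption, ten analyses give roughly a 40% chance of at least one "significant" finding even when there is nothing to find.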
In “softer” fields, where theories are less quantitative, experiments are difficult to design, and methods are less standardized, this additional freedom causes noticeable biases.8 Researchers in the United States must produce and publish interesting results to advance their careers; with intense competition for a small number of available academic jobs, scientists cannot afford to spend months or years collecting and analyzing data only to produce a statistically insignificant result. Even without malicious intent, these scientists tend to produce exaggerated results that more strongly favor their hypotheses than the data should permit.
In the coming pages, I hope to introduce you to these common errors and many others. Many of the errors are prevalent in vast swaths of the published literature, casting doubt on the findings of thousands of papers.
In recent years there have been many advocates for statistical reform, and naturally there is disagreement among them on the best method to address these problems. Some insist that p values, which I will show are frequently misleading and confusing, should be abandoned altogether; others advocate a “new statistics” based on confidence intervals. Still others suggest a switch to new Bayesian methods that give more-interpretable results, while others believe statistics as it’s currently taught is just fine but used poorly. All of these positions have merits, and I am not going to pick one to advocate in this book. My focus is on statistics as it is currently used by practicing scientists.
 Incidentally, I think this is why conspiracy theories are so popular. Once you believe you know something nobody else does (the government is out to get us!), you take every opportunity to show off that knowledge, and you end up reacting to all news with reasons why it was falsified by the government. Please don’t do the same with statistical errors.
 Readers interested in the pharmaceutical industry’s statistical misadventures may enjoy Ben Goldacre’s Bad Pharma (Faber & Faber, 2012), which caused a statistically significant increase in my blood pressure while I read it.