Statistics Done Wrong: The Woefully Complete Guide (2015)
Chapter 12. What Can Be Done?
I’ve painted a grim picture. But anyone can pick out small details in published studies and produce a tremendous list of errors. Do these problems matter?
Well, yes. If they didn’t, I wouldn’t have written this book.
John Ioannidis’s famous article “Why Most Published Research Findings Are False”1 was grounded in mathematical concerns rather than in an empirical test of research results. Since most research articles have poor statistical power, researchers are free to choose among analysis methods until they get favorable results, most tested hypotheses are false, and most true hypotheses correspond to very small effects, we are mathematically guaranteed a plethora of false positives.
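Ioannidis’s point can be made concrete with a little arithmetic. The sketch below computes the chance that a statistically significant finding is actually true; the input numbers are illustrative assumptions of mine, not figures from his paper.

```python
# Sketch of Ioannidis's argument: the positive predictive value of a
# "significant" finding, given the prior odds that a tested hypothesis
# is true, the study's power, and the significance threshold.
# The inputs below are illustrative assumptions, not from his paper.

def positive_predictive_value(prior, power, alpha):
    """Probability that a statistically significant result is true."""
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

# Suppose 10% of tested hypotheses are true, studies have 50% power,
# and the usual threshold alpha = 0.05 is used:
ppv = positive_predictive_value(prior=0.1, power=0.5, alpha=0.05)
print(round(ppv, 2))  # about 0.53: nearly half of "discoveries" are false
```

Raise the power or test better-motivated hypotheses and the situation improves; test many long-shot hypotheses with underpowered studies and it gets far worse.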
But if you want empiricism, you can have it, courtesy of Jonathan Schoenfeld and John Ioannidis. They studied the question “Is everything we eat associated with cancer?”2 After choosing 50 common ingredients out of a cookbook, they set out to find studies linking them to cancer rates—and found 216 studies on 40 different ingredients. Of course, most of the studies disagreed with each other. Most ingredients had multiple studies alternately claiming they increased and decreased the risk of getting cancer. (Sadly, bacon was one of the few foods consistently found to increase the risk of cancer.) Most of the statistical evidence was weak, and meta-analyses usually showed much smaller effects on cancer rates than the original studies.
Perhaps this is not a serious problem, given that we are already conditioned to ignore news stories about common items causing cancer. Consider, then, a comprehensive review of all research articles published from 2001 to 2010 in the New England Journal of Medicine, one of the most prestigious medical research journals. Out of the 363 articles that tested a current standard medical practice, 146 of them—about 40%—concluded that the practice should be abandoned in favor of previous treatments. Only 138 of the studies reaffirmed the current practice.3
The astute reader may wonder whether these figures are influenced by publication bias. Perhaps the New England Journal of Medicine is biased toward publishing rejections of current standards since they are more exciting. But tests of the current standard of care are genuinely rare and would seem likely to attract an editor’s eye. Even if bias does exist, the sheer quantity of these reversals in medical practice should be worrisome.
Another review compared meta-analyses to subsequent large randomized controlled trials. In more than a third of cases, the randomized trial’s outcome did not correspond well to the meta-analysis, indicating that even the careful aggregation of numerous small studies cannot be trusted to give reliable evidence.4 Other comparisons of meta-analyses found that most results were inflated, with effect sizes decreasing as they were updated with more data. Perhaps a fifth of meta-analysis conclusions represented false positives.5
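The aggregation a meta-analysis performs is, at its core, a weighted average. Here is a minimal fixed-effect sketch (the study numbers are invented for illustration) that also shows why a single large, precise trial can overturn several small, inflated estimates:

```python
# A minimal fixed-effect meta-analysis: pool each study's effect
# estimate, weighting by inverse variance, so precise studies count
# more. All numbers here are invented for illustration.

def pooled_effect(effects, variances):
    """Inverse-variance-weighted mean effect and its variance."""
    weights = [1 / v for v in variances]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, effects)) / total
    return mean, 1 / total

# Three small, noisy studies report large effects; one large trial
# reports a small one. The precise trial dominates the pooled estimate.
mean, variance = pooled_effect([0.8, 0.7, 0.9, 0.1], [0.25, 0.3, 0.2, 0.01])
print(round(mean, 2))  # about 0.18, far below the small studies' 0.7-0.9
```

Of course, this weighting only helps if the small studies are honest samples; if negative small studies were never published, the pooled estimate inherits their bias.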
Of course, being contradicted by follow-up studies and meta-analyses doesn’t prevent a paper from being used as though it were true. Even effects that have been contradicted by massive follow-up trials with unequivocal results are frequently cited 5 or 10 years later, with scientists apparently not noticing that the results are false.6 Worse, new findings get widely publicized in the press, while contradictions and corrections are hardly ever mentioned.7 You can hardly blame the scientists for not keeping up.
Let’s not forget the merely biased results. Poor reporting standards in medical journals mean studies testing new treatments for schizophrenia can neglect to include the scales they used to evaluate symptoms—a handy source of bias because trials using homemade unpublished scales tend to produce better results than those using previously validated tests.8 Other medical studies simply omit particular results if they’re not favorable or interesting, biasing subsequent meta-analyses to include only positive results. A third of meta-analyses are estimated to suffer from this problem.9
Multitudes of physical-science papers misuse confidence intervals.10 And there’s a peer-reviewed psychology paper allegedly providing evidence for psychic powers on the basis of uncontrolled multiple comparisons in exploratory studies.11 Unsurprisingly, the results failed to be replicated—by scientists who appear not to have calculated the statistical power of their tests.12
So what can we do? How do we prevent these errors from reaching print? A good starting place would be in statistical education.
Most American science students have minimal statistical education—one or two required courses at best and none at all for most students. Many of these courses do not cover important concepts such as statistical power and multiple comparisons. And even when students have taken statistics courses, professors report that they can’t apply statistical concepts to scientific questions, having never fully understood—or having forgotten—the appropriate techniques. This needs to change. Almost every scientific discipline depends on statistical analysis of experimental data, and statistical errors waste grant funding and researcher time.
It would be tempting to say, “We must introduce a new curriculum adapted to the needs of practicing scientists and require students to take courses in this material” and then assume the problem will be solved. A great deal of research in science education shows that this is not the case. Typical lecture courses teach students little, simply because lectures are a poor way to teach difficult concepts.
Unfortunately, most of this research is not specifically aimed at statistics education. Physicists, however, have done a great deal of research on a similar problem: teaching introductory physics students the basic concepts of force, energy, and kinematics. An instructive example is a large-scale survey of 14 physics courses, including 2,084 students, using the Force Concept Inventory to measure student understanding of basic physics concepts before and after taking the courses. The students began the courses with holes in their knowledge; at the end of the semester, they had filled only 23% of those holes, despite the Force Concept Inventory being regarded as too easy by their instructors.13
The results are poor because lectures do not suit how students learn. Students have preconceptions about basic physics from their everyday experience—for example, everyone “knows” that something pushed will eventually come to a stop because every object in the real world does so. But we teach Newton’s first law, in which an object in motion stays in motion unless acted upon by an outside force, and expect students to immediately replace their preconception with the new understanding that objects stop only because of frictional forces. Interviews of physics students have revealed numerous surprising misconceptions developed during introductory courses, many not anticipated by instructors.14,15 Misconceptions are like cockroaches: you have no idea where they came from, but they’re everywhere—often where you don’t expect them—and they’re impervious to nuclear weapons.
We hope that students will learn to solve problems and reason with this new understanding, but usually they don’t. Students who watch lectures contradicting their misconceptions report greater confidence in their misconceptions afterward and do no better on simple tests of their knowledge. Often they report not paying attention because the lectures cover concepts they already “know.”16 Similarly, practical demonstrations of physics concepts make little improvement in student understanding because students who misunderstand find ways to interpret the demonstration in light of their misunderstanding.17 And we can’t expect them to ask the right questions in class, because they don’t realize they don’t understand.
At least one study has confirmed this effect in the teaching of statistical hypothesis testing. Even after reading an article explicitly warning against misinterpreting p values and hypothesis test results in general, only 13% of students correctly answered a questionnaire on hypothesis testing.18

Obviously, assigning students a book like this one will not be much help if they fundamentally misunderstand statistics. Much of basic statistics is not intuitive (or, at least, not taught in an intuitive fashion), and the opportunity for misunderstanding and error is massive. How can we best teach our students to analyze data and make reasonable statistical inferences?
Again, methods from physics education research provide the answer. If lectures do not force students to confront and correct their misconceptions, we will have to use a method that does. A leading example is peer instruction. Students are assigned readings or videos before class, and class time is spent reviewing the basic concepts and answering conceptual questions. Forced to choose an answer and discuss why they believe it is true before the instructor reveals the correct answer, students immediately see when their misconceptions do not match reality, and instructors spot problems before they grow.
Peer instruction has been successfully implemented in many physics courses. Surveys using the Force Concept Inventory found that students typically double or triple their learning gains in a peer instruction course, filling in 50% to 75% of the gaps in their knowledge revealed at the beginning of the semester.13,19,20 And despite the focus on conceptual understanding, students in peer instruction courses perform just as well as, or better than, their lectured peers on quantitative and mathematical questions.
So far there is relatively little data on the impact of peer instruction in statistics courses. Some universities have experimented with statistics courses integrated with science classes, with students immediately applying statistical knowledge to problems in their field. Preliminary results suggest this works: students learn and retain more statistics, and they spend less time complaining about being forced to take a statistics course.21 More universities should adopt these techniques and experiment with peer instruction using conceptual tests such as the Comprehensive Assessment of Outcomes in Statistics22 along with trial courses to see what methods work best. Students will be better prepared for the statistical demands of everyday research if we simply change existing courses, rather than introducing massive new education programs.
But not every student learns statistics in a classroom. I was introduced to statistics when I needed to analyze data in a laboratory and didn’t know how; until strong statistics education is more widespread, many students and researchers will find themselves in the same position, and they need resources. The masses of aspiring scientists who Google “how to do a t test” need freely available educational material developed with common errors and applications in mind. Projects like OpenIntro Statistics, an open source and freely redistributable introductory statistics textbook, are promising, but we’ll need many more. I hope to see more progress in the near future.
Scientific journals are slowly making progress toward solving many of the problems I have discussed. Reporting guidelines, such as CONSORT for randomized trials, make it clear what information is required for a published paper to be reproducible; unfortunately, as you’ve seen, these guidelines are infrequently enforced. We must continue to pressure journals to hold authors to more rigorous standards.
Premier journals need to lead the charge. Nature has begun to do so, announcing a new checklist that authors are required to complete before articles can be published.23 The checklist requires reporting of sample sizes, statistical power calculations, clinical trial registration numbers, a completed CONSORT checklist, adjustment for multiple comparisons, and sharing of data and source code. The guidelines address most issues covered in this book, except for stopping rules, preferential use of confidence intervals over p values, and discussion of reasons for departing from the trial’s registered protocol. Nature will also make statisticians available to consult for papers when requested by peer reviewers.
The popular journal Psychological Science has recently made similar moves, exempting methods and results sections from article word-count limits and requiring full disclosure of excluded data, insignificant results, and sample-size calculations. Preregistering study protocols and sharing data are strongly encouraged, and the editors have embraced the “new statistics,” which emphasizes confidence intervals and effect-size estimates over endless p values.24 But since confidence intervals are not mandatory, it remains to be seen if their endorsement will make a dent in the established practices of psychologists.
Regardless, more journals should do the same. As these guidelines are accepted by the community, enforcement can follow, and the result will be much more reliable and reproducible research.
There is also much to be said about the unfortunate incentive structures that pressure scientists to rapidly publish small studies with slapdash statistical methods. Promotions, tenure, raises, and job offers are all dependent on having a long list of publications in prestigious journals, so there is a strong incentive to publish promising results as soon as possible. Tenure and hiring committees, composed of overworked academics pushing out their own research papers, cannot extensively review each publication for quality or originality, relying instead on prestige and quantity as approximations. University rankings depend heavily on publication counts and successful grant funding. And because negative or statistically insignificant results will not be published by top journals, it’s often not worth the effort to prepare them for publication—publication in lower-tier journals may be seen by other academics as a bad sign.
But prestigious journals keep their prestige by rejecting the vast majority of submissions; Nature accepts fewer than 10%. Ostensibly this is done because of page limits in the printed editions of journals, though the vast majority of articles are read online. Journal editors attempt to judge which papers will have the greatest impact and interest and consequently choose those with the most surprising, controversial, or novel results. As you’ve seen, this is a recipe for truth inflation, as well as outcome reporting and publication biases, and strongly discourages replication studies and negative results.
Online-only journals, such as the open-access PLOS ONE or BioMed Central’s many journals, are not restricted by page counts and have more freedom to publish less obviously exciting articles. But PLOS ONE is sometimes seen as a dumping ground for papers that couldn’t cut it at more prestigious journals, and some scientists fear publishing in it will worry potential employers. (It’s also the single largest academic journal, now publishing more than 30,000 articles annually, so clearly its stigma is not too great.) More prestigious online open-access journals, such as PLOS Biology or BMC Biology, are also highly selective, encouraging the same kind of statistical lottery.
To spur change, Nobel laureate Randy Schekman announced in 2013 that he and students in his laboratory will no longer publish in “luxury” scientific journals such as Science and Nature, focusing instead on open-access alternatives (such as eLife, which he edits) that do not artificially limit publication by rejecting the vast majority of articles.25 Of course, Schekman and his students are protected by his Nobel prize, which says more for the quality of his work than the title of the journal it is published in ever could. Average graduate students in average non-Nobel-winning laboratories could not risk damaging their careers with such a radical move.
Perhaps Schekman, shielded by his Nobel, can make the point the rest of us are afraid to make: the frenzied quest for more and more publications, with clear statistical significance and broad applications, harms science. We fixate on statistical significance and do anything to achieve it, even when we don’t understand the statistics. We push out numerous small and underpowered studies, padding our résumés, instead of taking the time and money to conduct larger, more definitive ones.
One proposed alternative to the tyranny of prestigious journals is the use of article-level metrics. Instead of judging an article on the prestige of the journal it’s published in, judge it on rough measures of its own impact. Online-only journals can easily measure the number of views of an article, the number of citations it has received in other articles, and even how often it is discussed on Twitter or Facebook. This is an improvement over using impact factors, which are a journal-wide average number of citations received by all research articles published in a given year—a self-reinforcing metric since articles from prestigious journals are cited more frequently simply because of their prestige and visibility.
I doubt the solution will be so simple. In open-access journals, article-level metrics reward articles popular among the general public (since open-access articles are free for anyone to read), so an article on the unpleasant composition of chicken nuggets would score better than an important breakthrough in some arcane branch of genetics. There is no one magic solution; academic culture will have to slowly change to reward the thorough, the rigorous, and the statistically sound.
The demands placed on the modern scientist are extreme. Besides mastering their own rapidly advancing fields, most scientists are expected to be good at programming (including version control, unit testing, and good software engineering practices), designing statistical graphics, writing scientific papers, managing research groups, mentoring students, managing and archiving data, teaching, applying for grants, and peer-reviewing other scientists’ work, along with the statistical skills I’m demanding here. People dedicate their entire careers to mastering one of these skills, yet we expect scientists to be good at all of them to be competitive.
This is nuts. A PhD program can last five to seven years in the United States and still not have time to teach all these skills, except via trial and error. Tacking on a year or two of experimental design and statistical analysis courses seems unrealistic. Who will have time for it besides statisticians?
Part of the answer is outsourcing. Use the statistical consulting services likely offered by your local statistics department, and rope in a statistician as a collaborator whenever your statistical needs extend beyond a few hours of free advice. (Many statisticians are susceptible to nerd sniping. Describe an interesting problem to them, and they will be unable to resist an attempt at solving it.) In exchange for coauthorship on your paper, the statistician will contribute valuable expertise you can’t pick up from two semesters of introductory courses.
Nonetheless, if you’re going to do your own data analysis, you’ll need a good foundation in statistics, if only to understand what the statistical consultant is telling you. A strong course in applied statistics should cover basic hypothesis testing, regression, statistical power calculation, model selection, and a statistical programming language like R. Or at the least, the course should mention that these concepts exist—perhaps a full mathematical explanation of statistical power won’t fit in the curriculum, but students should be aware of power and should know to ask for power calculations when they need them. Sadly, whenever I read the syllabus for an applied statistics course, I notice it fails to cover all of these topics. Many textbooks cover them only briefly.
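As a minimal example of the kind of calculation students should know to ask for, here is a rough two-sample power computation using the normal approximation. This is only a sketch of the concept, with made-up inputs; a real analysis would use an exact t-test procedure and software built for the job.

```python
# A rough two-sample power calculation via the normal approximation.
# A real analysis would use an exact t-test procedure; this is only a
# sketch of the concept, with made-up inputs.
from statistics import NormalDist

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test to detect a
    standardized effect size d (Cohen's d)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # How far the true effect shifts the test statistic on average:
    shift = d * (n_per_group / 2) ** 0.5
    return 1 - NormalDist().cdf(z_crit - shift)

# A "medium" effect (d = 0.5) with 30 subjects per group:
print(round(two_sample_power(0.5, 30), 2))  # about 0.49: badly underpowered
```

Even a student who never derives this formula should recognize what it says: a typical small study has roughly coin-flip odds of detecting a real, medium-sized effect.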
Beware of false confidence. You may soon develop a smug sense of satisfaction that your work doesn’t screw up like everyone else’s. But I have not given you a thorough introduction to the mathematics of data analysis. There are many ways to foul up statistics beyond these simple conceptual errors. If you’re designing an unusual experiment, running a large trial, or analyzing complex data, consult a statistician before you start. A competent statistician can recommend an experimental design that mitigates issues such as pseudoreplication and helps you collect the right data—and the right quantity of data—to answer your research question. Don’t commit the sin, as many do, of showing up to your statistical consultant’s office with data in hand, asking, “So how do I tell if this is statistically significant?” A statistician should be a collaborator in your research, not a replacement for Microsoft Excel. You can likely get good advice in exchange for some chocolates or a beer or perhaps coauthorship on your next paper.
Of course, you will do more than analyze your own data. Scientists spend a great deal of time reading papers written by other scientists whose grasp of statistics is entirely unknown. Look for important details in a statistical analysis, such as the following:
§ The statistical power of the study or any other means by which the appropriate sample size was determined
§ How variables were selected or discarded for analysis
§ Whether the statistical results presented support the paper’s conclusions
§ Effect-size estimates and confidence intervals accompanying significance tests, showing whether the results have practical importance
§ Whether appropriate statistical tests were used and, if necessary, how they were corrected for multiple comparisons
§ Details of any stopping rules
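To see why effect sizes and confidence intervals belong on that list, consider how an interval conveys practical importance where a bare p value cannot. A small sketch, with invented numbers:

```python
# A confidence interval makes practical importance visible where a
# bare p value does not. All numbers are invented for illustration.
from statistics import NormalDist

def mean_diff_ci(mean_a, mean_b, se_diff, level=0.95):
    """Normal-approximation confidence interval for a difference in
    means, given the standard error of the difference."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = mean_a - mean_b
    return diff - z * se_diff, diff + z * se_diff

# A drug lowers blood pressure by 2 mmHg with standard error 0.9.
# The result is "significant" (the interval excludes zero), but the
# interval shows the true effect may be clinically negligible.
low, high = mean_diff_ci(2.0, 0.0, 0.9)
print(round(low, 2), round(high, 2))  # 0.24 3.76
```

A paper reporting only “p < 0.05” hides exactly this distinction between statistically detectable and practically meaningful.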
If you work in a field for which a set of reporting guidelines has been developed (such as the CONSORT checklist for medical trials), familiarize yourself with it and read papers with it in mind. If a paper omits some of the required items, ask yourself what impact that has on its conclusions and whether you can be sure of its results without knowing the missing details. And, of course, pressure journal editors to enforce the guidelines to ensure future papers improve. In fields without standard reporting guidelines, work to create some so that every paper includes all the information needed to evaluate its conclusions.
In short, your task can be expressed in four simple steps.
1. Read a statistics textbook or take a good statistics course. Practice.
2. Plan your data analyses carefully in advance, avoiding the misconceptions and errors I’ve talked about. Talk to a statistician before you start collecting data.
3. When you find common errors in the scientific literature—such as a simple misinterpretation of p values—hit the perpetrator over the head with your statistics textbook. It’s therapeutic.
4. Press for change in scientific education and publishing. It’s our research. Let’s do it right.
 An important part of the ongoing Oncological Ontology Project to categorize everything into two categories: that which cures cancer and that which causes it.
 A yet-more-astute reader will ask why we should trust these studies suggesting current practice is wrong, given that so many studies are flawed. That’s a fair point, but we are left with massive uncertainty: if we don’t know which studies to trust, what are the best treatments?
 Mostly fat, bone, nerve, and connective tissue, though this article was sadly not actually open-access.26 The brand of chicken nuggets was not specified.
 Professional programmers often trade stories about the horrible code produced by self-taught academic friends.