# Statistics Done Wrong: The Woefully Complete Guide (2015)

### Chapter 7. Continuity Errors

So far in this book, I’ve focused on comparisons between groups. Is the placebo or the drug more effective? Do intersections that allow right turns on red kill more people than those that don’t? You produce a single statistic for each group—such as an average number of traffic accidents—and see whether these statistics are significantly different between groups.

But what if you can’t separate test subjects into clear groups? A study of the health impacts of obesity might measure the body mass index of each participant, along with blood pressure, blood sugar, resting heart rate, and so on. But there aren’t two clear groups of patients; there’s a spectrum, from underweight to obese. Say you want to spot health trends as you move from one end of this spectrum to the other.

One statistical technique to deal with such scenarios is called *regression modeling*. It estimates the *marginal* effect of each variable—the health impact of each additional pound of weight, not just the difference between groups on either side of an arbitrary cutoff. This gives much finer-grained results than a simple comparison between groups.

But scientists frequently simplify their data to avoid the need for regression analysis. The statement “Overweight people are 50% more likely to have heart disease” has far more obvious clinical implications than “Each additional unit of Metropolitan Relative Weight increases the log-odds of heart disease by 0.009.” Even if it’s possible to build a statistical model that captures every detail of the data, a statistician might choose a simpler analysis over a technically superior one for purely practical reasons. As you’ve seen, simple models can still be used incorrectly, and the process of simplifying the data introduces yet more room for error. Let’s start with the simplification process; in the next chapter, I’ll discuss common errors when using full regression models instead.

**Needless Dichotomization**

A common simplification technique is to *dichotomize* variables by splitting a continuous measurement into two separate groups. In the example study on obesity, for example, you might divide patients into “healthy” or “overweight” groups. By splitting the data, you don’t need to fuss over choosing the correct regression model. You can just compare the two groups using a *t* test.
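The simplification itself is easy to picture. In this hedged sketch (simulated data; the cutoff of 25, the risk formula, and all numbers are invented for illustration), the exact measurements are discarded and only group membership is kept, after which a Welch two-sample *t* statistic compares the groups:

```python
# Hypothetical illustration of dichotomization: split a continuous
# measurement at a cutoff, discard the exact values, compare the groups.
# All numbers are invented for this simulation.
import random
import statistics

random.seed(0)

bmi = [random.uniform(18, 38) for _ in range(200)]
risk = [0.5 * b + random.gauss(0, 5) for b in bmi]  # risk rises with BMI

# Dichotomize at an illustrative cutoff of 25; the exact BMIs are gone.
normal = [r for b, r in zip(bmi, risk) if b < 25]
overweight = [r for b, r in zip(bmi, risk) if b >= 25]

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(b) - statistics.fmean(a)) / (
        va / len(a) + vb / len(b)
    ) ** 0.5

print(f"t = {welch_t(normal, overweight):.1f}")
```

The analysis is now a one-line group comparison, which is exactly the appeal. The cost, as the rest of this chapter shows, is everything the discarded numbers could have told you.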

This raises the question: how do you decide where to split the data? Perhaps there’s a natural cutoff or a widely accepted definition (as with obesity), but often there isn’t. One common solution is to split the data along the median of the sample, which divides the data into two equal-size groups—a so-called *median split*. A downside to this approach is that different researchers studying the same phenomenon will arrive at different split points, making their results difficult to compare or aggregate in meta-analyses.

An alternative to a median split is to select the cutoff that gives you the smallest *p* value between groups. You can think of this as choosing to separate the groups so they are the “most different.” As you might imagine, this approach makes false positives more likely. Searching for the cutoff with the best *p* value means effectively performing many hypothesis tests until you get the result you want. The result is the same as you saw previously with multiple comparisons: a false positive rate increased by as much as a factor of 10.^{1} Your confidence intervals for the effect size will also be misleadingly narrow.

Dichotomization problems cropped up in a number of breast cancer research papers in the early 1990s studying the S-phase fraction, the fraction of cells in a tumor that are busy copying and synthesizing new DNA. Oncologists believe this fraction may predict the ultimate course of a cancer, allowing doctors to target their patients’ treatments more effectively. Researchers studying the matter divided patients into two groups: those with large S-phase fractions and those with small ones.

Of course, each study chose a different cutoff between “large” and “small,” picking either the median or the cutoff that gave the best *p* value. Unsurprisingly, the studies that chose the “optimal” cutoff had statistically significant results. But when those results were corrected to account for the multiple comparisons, not one of them was statistically significant.

Further studies have suggested that the S-phase fraction is indeed related to tumor prognosis, but the evidence was poor for many years. The method continued to be used in cancer studies for several years after its flaws were publicized, and a 2005 set of reporting guidelines for cancer prognostic factor studies noted the following: “Despite years of research and hundreds of reports on tumor markers in oncology, the number of markers that have emerged as clinically useful is pitifully small.”^{2} Apart from poor statistical power, incomplete reporting of results, and sampling biases, the choice of “optimal” cut points was cited as a key reason for this problem.

**Statistical Brownout**

A major objection to dichotomization is that it throws away information. Instead of using a precise number for every patient or observation, you split observations into groups and throw away the numbers. This reduces the statistical power of your study—a major problem when so many studies are already underpowered. You’ll get less precise estimates of the correlations you’re trying to measure and will often underestimate effect sizes. In general, this loss of power and precision is the same as you’d get by throwing away a third of your data.^{3}

Let’s go back to the example study measuring the health impacts of obesity. Say you split patients into “normal” and “overweight” groups based on their *body mass index*, taking a BMI of 25 to be the maximum for the normal range. (This is the standard cutoff used in clinical practice.) But then you’ve lost the distinction between all BMIs above this cutoff. If the heart-disease rate increases with weight, it’s much more difficult to tell *how much* it increases because you didn’t record the difference between, say, mildly overweight and morbidly obese patients.

To put this another way, imagine if the “normal” group consisted of patients with BMIs of exactly 24, while the “overweight” group had BMIs of 26. A major difference between the groups would be surprising since they’re not very different. On the other hand, if the “overweight” group all had BMIs of 36, a major difference would be much less surprising and indicate a much smaller difference per BMI unit. Dichotomization eliminates this distinction, dropping useful information and statistical power.

Perhaps it was a silly choice to use only two groups—what about underweight patients?—but increasing the number of groups means the number of patients in each group decreases. More groups might produce a more detailed analysis, but the heart disease rate estimates for each group will be based on less data and have wider confidence intervals. And splitting data into more groups means making more decisions about *where* to split the data, making different studies yet more difficult to compare and making it even easier for researchers to generate false positives.

**Confounded Confounding**

You may wonder: if I have enough data to achieve statistical significance after I’ve dichotomized my data, does the dichotomization matter? As long as I can make up for the lost statistical power with extra data, why not dichotomize to make the statistical analysis easy?

That’s a legitimate argument. But analyzing data without dichotomizing isn’t that hard. Regression analysis is a common procedure, supported by nearly every statistical software package and covered in numerous books. Regression doesn’t involve dichotomization—it uses the full data, so there is no cutoff to choose and no loss of statistical power. So why water down your data? But more importantly, dichotomization does more than cut power. Counterintuitively, it also introduces false positives.

We are often interested in controlling for confounding factors. You might measure two or three variables (or two or three dozen) along with the outcome variable and attempt to determine the unique effect of each variable on the outcome after the other variables have been “controlled for.” If you have two variables and one outcome, you could easily do this by dichotomizing the two variables and using a two-way analysis of variance (ANOVA) table, a simple, commonly performed procedure supported by every major statistical software package.

Unfortunately, a false negative isn’t the worst that can happen. By dichotomizing and throwing away information, you eliminate the ability to distinguish between confounding factors.^{4}

Consider an example. Say you’re measuring the effect of a number of variables on the quality of health care a person receives. Health-care quality (perhaps measured using a survey) is the outcome variable. For predictor variables, you use two measurements: the subject’s personal net worth in dollars and the length of the subject’s personal yacht.

You would expect a good statistical procedure to deduce that wealth impacts quality of health care but yacht size does not. Even though yacht size and wealth tend to increase together, it’s not your yacht that gets you better health care. With enough data, you would notice that people of the same wealth can have differently sized yachts—or no yachts at all—but still get a similar quality of care. This indicates that wealth is the primary factor, not yacht length.

But by dichotomizing the variables, you’ve effectively cut the data down to four points. Each predictor can be only “above the median” or “below the median,” and no further information is recorded. You no longer have the data needed to realize that yacht length has nothing to do with health care. As a result, the ANOVA procedure falsely claims that yachts and health care are related. Worse, this false correlation doesn’t turn up statistically significant only 5% of the time, as an ordinary false positive would—from the ANOVA’s perspective, it’s a *true* correlation, and it is detected as often as the statistical power of the test allows.

Of course, you could have figured out that yacht size wouldn’t matter, even without data. You could have left it out of the analysis and saved a lot of trouble. But you don’t usually know in advance which variables are most important—you depend on your statistical analysis to tell you.

Regression procedures can easily fit this data without any dichotomization, while producing false-positive correlations only at the rate you’d expect. (Of course, as the correlation between wealth and yacht size becomes stronger, it becomes more difficult to distinguish between their effects.) While the mathematical theory of regression with multiple variables can be more advanced than many practicing scientists care to understand, involving a great deal of linear algebra, the basic concepts and results are easy to understand and interpret. There’s no good reason not to use it.

**TIPS**

§ Don’t arbitrarily split continuous variables into discrete groups unless you have good reason. Use a statistical procedure that can take full advantage of the continuous variables.

§ If you do need to split continuous variables into groups for some reason, don’t choose the groups to maximize your statistical significance. Define the split in advance, use the same split as in previous similar research, or use outside standards (such as a medical definition of obesity or high blood pressure) instead.