Model Abuse - Statistics Done Wrong: The Woefully Complete Guide (2015)

Chapter 8. Model Abuse

Let’s move on to regression. Regression in its simplest form is fitting a straight line to data: finding the equation of the line that best predicts the outcome from the data. With this equation, you can use a measurement, such as body mass index, to predict an outcome like blood pressure or medical costs.
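To make the idea concrete, here is a minimal sketch of fitting a straight line by least squares. The body mass index and blood pressure numbers are invented for illustration, and `fit_line` is a hypothetical helper name, not something from a particular statistics package.

```python
# Hypothetical measurements, invented purely for illustration.
bmi = [20, 22, 25, 27, 30, 33]
systolic = [112, 115, 121, 124, 131, 136]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(bmi, systolic)
predicted = intercept + slope * 28  # predicted blood pressure at a BMI of 28
print(f"slope = {slope:.2f}, prediction at BMI 28 = {predicted:.1f}")
```

The slope is the predicted change in the outcome for a one-unit change in the predictor, which is all the "equation of the line" amounts to in this simple case.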

Usually regression uses more than one predictor variable. Instead of just body mass index, you might add age, gender, amount of regular exercise, and so on. Once you collect medical data from a representative sample of patients, the regression procedure would use the data to find the best equation to represent the relationship between the predictors and the outcome.

As we saw in Chapter 7, regression with multiple variables allows you to control for confounding factors in a study. For example, you might study the impact of class size on students’ performance on standardized tests, hypothesizing that smaller classes improve test scores. You could use regression to find the relationship between size and score, thus testing whether test scores rise as class size falls—but there’s a confounding variable.

If you find a relationship, then perhaps you’ve shown that class size is the cause, but the cause could also be another factor that influences class size and scores together. Perhaps schools with bigger budgets can afford more teachers, and hence smaller classes, and can also afford more books, higher teacher salaries, more support staff, better science labs, and other resources that help students learn. Class size could have nothing to do with it.

To control for the confounding variable, you record each school’s total budget and include it in your regression equation, thus separating the effect of budget from the effect of class size. If you examine schools with similar budgets and different class sizes, regression produces an equation that lets you say, “For schools with the same budget, increasing class size by one student lowers test scores by this many points.” The confounding variable is hence controlled for. Of course, there may be confounding variables you aren’t aware of or don’t know how to measure, and these could influence your results; only a truly randomized experiment eliminates all confounding variables.
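A small simulation shows what controlling for budget does. Everything here is assumed for illustration: the schools are simulated, budget drives both class size and scores, and class size is given no effect at all. The helpers `solve` and `least_squares` are hypothetical names for a bare-bones least-squares fit.

```python
import random

def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix (copies the input)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(columns, y):
    """Fit y on the predictor columns plus an intercept, using the normal equations."""
    cols = [[1.0] * len(y)] + columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)  # [intercept, one coefficient per column]

random.seed(0)
budget, class_size, score = [], [], []
for _ in range(2000):
    b = random.gauss(0, 1)                               # school budget (standardized)
    budget.append(b)
    class_size.append(25 - 3 * b + random.gauss(0, 2))   # richer schools have smaller classes
    score.append(70 + 5 * b + random.gauss(0, 3))        # budget helps scores; class size does not

naive_slope = least_squares([class_size], score)[1]
adjusted_slope = least_squares([class_size, budget], score)[1]
print(f"points per extra student, ignoring budget:        {naive_slope:+.2f}")
print(f"points per extra student, with budget in the fit: {adjusted_slope:+.2f}")
```

In this simulated world the naive fit blames class size for a gap that budget created, and the estimate shrinks toward zero once budget enters the equation. The fix works only because the confounder was measured.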

There are many more versions of regression than the simple one presented here. Often the relationship between two variables isn’t a simple linear equation. Or perhaps the outcome variable isn’t quantitative, like blood pressure or a test score, but categorical. Maybe you want to predict whether a patient will suffer complications after a surgery, using his or her age, blood pressure, and other vital signs. There are many varieties of procedures to account for these possibilities.

All kinds of regression procedures are subject to common problems. Let’s start with the simplest problem: overfitting, which is the result of excessive enthusiasm in data analysis.

Fitting Data to Watermelons

A common watermelon selection strategy is to knock on the melons and pick those with a particularly hollow sound, which apparently results from desirable characteristics of watermelon flesh. With the right measurement equipment, it should be possible to use statistics to find an algorithm that can predict the ripeness of any melon from its sound.

I am particularly interested in this problem because I once tried to investigate it, building a circuit to connect a fancy accelerometer to my computer so I could record the thump of watermelons. But I tested only eight melons—not nearly enough data to build an accurate ripeness-prediction system. So I was understandably excited when I came across a paper that claimed to predict watermelon ripeness with fantastic accuracy: acoustic measurements could predict 99.9% of the variation in ripeness.1

But let’s think. In this study, panelists tasted and rated 43 watermelons using a five-point ripeness scale. Regression was used to predict the ripeness rating from various acoustic measurements. How could the regression equation’s accuracy be so high? If you had the panelists rerate the melons, they probably wouldn’t agree with their own ratings with 99.9% accuracy. Subjective ratings aren’t that consistent. No procedure, no matter how sophisticated, could predict them with such accuracy.

Something is wrong. Let’s evaluate their methods more carefully.

Each watermelon was vibrated at a range of frequencies, from 1 to 1,000 hertz, and the phase shift (essentially, how long it took the vibration to travel through the melon) was measured at each frequency. There were 1,600 tested frequencies, so there were 1,600 variables in the regression model. Each one’s relationship to ripeness has to be estimated.

Now, with more variables than watermelons, I could fit a perfect regression model. Just as a straight line can be made to fit perfectly between any two data points, an equation with 43 variables can be used to perfectly fit the measurements of 43 melons. This is serious overkill. Even if there is no relationship whatsoever between acoustics and ripeness, I can fit a regression equation that gives 100% accuracy on the 43 watermelons. It will account for not just the true relationship between acoustics and ripeness (if one exists) but also random variation in individual ratings and measurements. I will believe the model fits perfectly, but tested on new watermelons with their own measurement errors and subjective ratings, it may be useless.
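Here is a minimal sketch of that perfect fit, using eight imaginary melons. The measurements and ratings are random numbers with no relationship to each other, and `solve` is a hypothetical helper that solves a square system of equations.

```python
import random

def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def noise_row(width):
    """An intercept term followed by pure-noise 'acoustic measurements'."""
    return [1.0] + [random.gauss(0, 1) for _ in range(width - 1)]

def predict(coef, row):
    return sum(c * x for c, x in zip(coef, row))

random.seed(1)
n_melons = 8
X = [noise_row(n_melons) for _ in range(n_melons)]
ripeness = [random.randint(1, 5) for _ in range(n_melons)]  # random ratings, unrelated to X

# Eight unknown coefficients, eight melons: the equations can be satisfied exactly.
coef = solve(X, ripeness)
worst_training_error = max(abs(predict(coef, row) - r) for row, r in zip(X, ripeness))

# Fresh melons generated the same way expose the problem.
new_X = [noise_row(n_melons) for _ in range(1000)]
new_ripeness = [random.randint(1, 5) for _ in range(1000)]
mean_new_error = sum(abs(predict(coef, row) - r)
                     for row, r in zip(new_X, new_ripeness)) / len(new_X)

print(f"worst error on the fitted melons: {worst_training_error:.2e}")
print(f"average error on new melons:      {mean_new_error:.2f}")
```

The fit to the original melons is exact up to rounding error, yet the equation has learned nothing it can use on melons it hasn't seen.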

The authors of the study attempted to sidestep this problem by using stepwise regression, a common procedure for selecting which variables are the most important in a regression. In its simplest form, it goes like this: start by using none of the 1,600 frequency measurements. Perform 1,600 hypothesis tests to determine which of the frequencies has the most statistically significant relationship with the outcome. Add that frequency and then repeat with the remaining 1,599. Continue the procedure until there are no statistically significant frequencies.

Stepwise regression is common in many scientific fields, but it’s usually a bad idea.2 You probably already noticed one problem: multiple comparisons. Hypothetically, by adding only statistically significant variables, you avoid overfitting, but running so many significance tests is bound to produce false positives, so some of the variables you select will be bogus. Stepwise regression procedures provide no guarantees about the overall false positive rate, nor are they guaranteed to select the “best” combination of variables, however you define “best.” (Alternative stepwise procedures use other criteria instead of statistical significance but suffer from many of the same problems.)

So despite the veneer of statistical significance, stepwise regression is susceptible to egregious overfitting, producing an equation that fits the data nearly perfectly but that may prove useless when tested on a separate dataset. As a test, I simulated random watermelon measurements with absolutely zero correlation with ripeness, and nonetheless stepwise regression fit the data with 99.9% accuracy. With so many variables to choose from, it would be more surprising if it didn’t.
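The following is not the simulation described above, only a minimal sketch of the same idea under stated assumptions: 43 random ratings, 1,600 random predictors, and forward selection that adds whichever variable most reduces the residual sum of squares. The stopping rule (an F statistic of at least 4) is a rough stand-in for a significance test, chosen here for simplicity.

```python
import random

def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(2)
n_melons, n_freqs = 43, 1600
F_CUTOFF = 4.0  # assumed cutoff, a crude proxy for "p < 0.05"

# Pure noise: no predictor has any real relationship with the ratings.
cols = [centered([random.gauss(0, 1) for _ in range(n_melons)]) for _ in range(n_freqs)]
resid = centered([float(random.randint(1, 5)) for _ in range(n_melons)])  # centering absorbs the intercept
total_ss = dot(resid, resid)

selected, remaining = [], set(range(n_freqs))
while remaining:
    df = n_melons - len(selected) - 2  # residual degrees of freedom after one more variable
    if df < 1:
        break
    rss = dot(resid, resid)
    # Reduction in the residual sum of squares that each candidate would give.
    gains = {}
    for j in remaining:
        norm = dot(cols[j], cols[j])
        if norm > 1e-9:
            gains[j] = dot(cols[j], resid) ** 2 / norm
    if not gains:
        break
    best = max(gains, key=gains.get)
    f_stat = gains[best] / (max(rss - gains[best], 1e-12) / df)
    if f_stat < F_CUTOFF:
        break
    selected.append(best)
    remaining.discard(best)
    # Remove the chosen direction from the residual and from every remaining candidate.
    q = cols[best]
    qq = dot(q, q)
    shrink = dot(q, resid) / qq
    resid = [r - shrink * x for r, x in zip(resid, q)]
    for j in remaining:
        k = dot(cols[j], q) / qq
        cols[j] = [x - k * y for x, y in zip(cols[j], q)]

r_squared = 1 - dot(resid, resid) / total_ss
print(f"selected {len(selected)} noise variables, R^2 = {r_squared:.4f}")
```

Every selected variable passed its test at the moment it was added, and every one of them is noise.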

Most uses of stepwise regression are not in such extreme cases. Having 1,600 variables to choose from is extraordinarily rare. But even in modest cases with 100 observations of a few dozen variables, stepwise regression produces inflated estimates of accuracy and statistical significance.3,4

Truth inflation is a more insidious problem. Remember, “statistically insignificant” does not mean “has no effect whatsoever.” If your study is underpowered—you have too many variables to choose from and too little data—then you may not have enough data to reliably distinguish each variable’s effect from zero. You’ll include variables only if you are unlucky enough to overestimate their effect on the outcome. Your model will be heavily biased. (Even when not using a formal stepwise regression procedure, it’s common practice to throw out “insignificant” variables to simplify a model, leading to the same problem.)
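The same filter can be seen in a stripped-down simulation. This sketch assumes a single small true effect, measured over and over by underpowered studies, and keeps only the estimates that reach significance with a simple z-test. The effect size, sample size, and number of studies are arbitrary choices.

```python
import random

random.seed(3)
TRUE_EFFECT = 0.2   # a small but real effect (hypothetical)
N = 20              # observations per study: too few to detect it reliably
SE = 1 / N ** 0.5   # standard error of the mean of N draws with unit variance

all_estimates, significant = [], []
for _ in range(5000):
    estimate = sum(random.gauss(TRUE_EFFECT, 1) for _ in range(N)) / N
    all_estimates.append(estimate)
    if abs(estimate) / SE > 1.96:  # two-sided z-test at the 5% level
        significant.append(estimate)

power = len(significant) / len(all_estimates)
mean_all = sum(all_estimates) / len(all_estimates)
mean_significant = sum(significant) / len(significant)

print(f"fraction of studies reaching significance: {power:.2f}")
print(f"average estimate, all studies:             {mean_all:.2f}")
print(f"average estimate, significant ones only:   {mean_significant:.2f}")
```

Averaged over every study, the estimates are unbiased. Averaged over only the ones that cleared the significance bar, they overstate the effect, because with so little data an estimate has to be too large to clear it.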

There are several variations of stepwise regression. The version I just described is called forward selection since it starts from scratch and starts including variables. The alternative, backward elimination, starts by including all 1,600 variables and excludes those that are statistically insignificant, one at a time. (This would fail, in this case: with 1,600 variables but only 43 melons, there isn’t enough data to uniquely determine the effects of all 1,600 variables. You would get stuck on the first step.) It’s also possible to change the criteria used to include new variables; instead of statistical significance, more-modern procedures use metrics like the Akaike information criterion and the Bayesian information criterion, which reduce overfitting by penalizing models with more variables. Other variations add and remove variables at each step according to various criteria. None of these variations is guaranteed to arrive at the same answer, so two analyses of the same data could arrive at very different results.

For the watermelon study, these factors combined to produce implausibly accurate results. How can a regression model be fairly evaluated, avoiding these problems? One option is cross-validation: fit the model using only a portion of the melons and then test its effectiveness at predicting the ripeness of the other melons. If the model overfits, it will perform poorly during cross-validation. One common cross-validation method is leave-one-out cross-validation, where the model is fit using all but one data point and then evaluated on its ability to predict that point; the procedure is repeated with each data point left out in turn. The watermelon study claims to have performed leave-one-out cross-validation but obtained similarly implausible results. Without access to the data, I’m not sure whether the method genuinely works.
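As a minimal sketch of the procedure, here is leave-one-out cross-validation for a one-variable regression on simulated data. The data and the helper names `fit_line` and `loocv_mse` are assumptions made for illustration.

```python
import random

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope

def loocv_mse(x, y):
    """Mean squared error when each point is predicted by a fit that excluded it."""
    errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]   # every point except the i-th
        y_train = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        errors.append((y[i] - (intercept + slope * x[i])) ** 2)
    return sum(errors) / len(errors)

random.seed(4)
x = [random.gauss(0, 1) for _ in range(43)]
y = [2 + 0.5 * xi + random.gauss(0, 1) for xi in x]

intercept, slope = fit_line(x, y)
in_sample_mse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)
cv_mse = loocv_mse(x, y)
print(f"in-sample MSE: {in_sample_mse:.3f}   leave-one-out MSE: {cv_mse:.3f}")
```

The cross-validated error is never smaller than the in-sample error for a least-squares fit, and the gap widens as the model grows more flexible. One caution: if variables are chosen by a selection procedure, the selection has to be repeated inside every fold. Selecting variables on all the data first and cross-validating afterward lets the held-out point influence the model that is supposed to predict it.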

Despite these drawbacks, stepwise regression continues to be popular. It’s an intuitively appealing algorithm: select the variables with statistically significant effects. But choosing a single model is usually foolishly overconfident. With so many variables to choose from, there are often many combinations of variables that predict the outcome nearly as well. Had I picked 43 more watermelons to test, I probably would have selected a different subset of the 1,600 possible acoustic predictors of ripeness. Stepwise regression produces misleading certainty—the claim that these 20 or 30 variables are “the” predictors of ripeness, though dozens of others could do the job.

Of course, in some cases there may be a good reason to believe that only a few of the variables have any effect on the outcome. Perhaps you’re identifying the genes responsible for a rare cancer, and though you have thousands of candidates, you know only a few are the cause. Now you’re not interested in making the best predictions—you just want to identify the responsible genes. Stepwise regression is still not the best tool; the lasso (short for least absolute shrinkage and selection operator, an inspired acronym) has better mathematical properties and doesn’t fool the user with claims of statistical significance. But the lasso is not bulletproof, and there is no perfect automated solution.
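For a sense of how the lasso selects variables, here is a minimal coordinate-descent sketch. It assumes standardized predictors, simulated data in which only the first 2 of 20 variables matter, and a penalty of 0.5 picked by hand. A real analysis would use an established library and choose the penalty by cross-validation.

```python
import random

def standardize(values):
    """Shift to mean 0 and rescale so the mean square is 1."""
    n = len(values)
    mean = sum(values) / n
    shifted = [v - mean for v in values]
    scale = (sum(v * v for v in shifted) / n) ** 0.5
    return [v / scale for v in shifted]

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly zero if it would cross."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso(cols, y, lam, sweeps=100):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1 with standardized columns."""
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            # Correlation of this column with the residual, adding back its own contribution.
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                step = new - beta[j]
                resid = [r - step * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta

random.seed(5)
n, p = 100, 20
cols = [standardize([random.gauss(0, 1) for _ in range(n)]) for _ in range(p)]
y = [3 * cols[0][i] - 2 * cols[1][i] + random.gauss(0, 1) for i in range(n)]
mean_y = sum(y) / n
y = [v - mean_y for v in y]  # center the outcome so no intercept is needed

beta = lasso(cols, y, lam=0.5)
kept = [j for j, b in enumerate(beta) if b != 0.0]
print("variables kept:", kept)
print("their coefficients:", [round(beta[j], 2) for j in kept])
```

The penalty sets most coefficients to exactly zero, which is how the lasso does its selecting. It also shrinks the surviving coefficients toward zero, so they are deliberately biased estimates rather than claims of statistical significance.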

Correlation and Causation

When you have used multiple regression to model some outcome—like the probability that a given person will suffer a heart attack, given that person’s weight, cholesterol, and so on—it’s tempting to interpret each variable on its own. You might survey thousands of people, asking whether they’ve had a heart attack and then doing a thorough physical examination, and produce a model. Then you use this model to give health advice: lose some weight, you say, and make sure your cholesterol levels fall within this healthy range. Follow these instructions, and your heart attack risk will decrease by 30%!

But that’s not what your model says. The model says that people with cholesterol and weight within that range have a 30% lower risk of heart attack; it doesn’t say that if you put an overweight person on a diet and exercise routine, that person will be less likely to have a heart attack. You didn’t collect data on that! You didn’t intervene and change the weight and cholesterol levels of your volunteers to see what would happen.

There could be a confounding variable here. Perhaps obesity and high cholesterol levels are merely symptoms of some other factor that also causes heart attacks; exercise and statin pills may fix them but perhaps not the heart attacks. The regression model says lower cholesterol means fewer heart attacks, but that’s correlation, not causation.

One example of this problem occurred in a 2010 trial testing whether omega-3 fatty acids, found in fish oil and commonly sold as a health supplement, can reduce the risk of heart attacks. The claim that omega-3 fatty acids reduce heart attack risk was supported by several observational studies, along with some experimental data. Fatty acids have anti-inflammatory properties and can reduce the level of triglycerides in the bloodstream—two qualities known to correlate with reduced heart attack risk. So it was reasoned that omega-3 fatty acids should reduce heart attack risk.5

But the evidence was observational. Patients with low triglyceride levels had fewer heart problems, and fish oils reduce triglyceride levels, so it was spuriously concluded that fish oil should protect against heart problems. Only in 2013 was a large randomized controlled trial published, in which patients were given either fish oil or a placebo (olive oil) and monitored for five years. There was no evidence of a beneficial effect of fish oil.6

Another problem arises when you control for multiple confounding factors. It’s common to interpret the results by saying, “If weight increases by one pound, with all other variables held constant, then heart attack rates increase by . . .” Perhaps that is true, but it may not be possible to hold all other variables constant in practice. You can always quote the numbers from the regression equation, but in reality the act of gaining a pound of weight also involves other changes. Nobody ever gains a pound with all other variables held constant, so your regression equation doesn’t translate to reality.

Simpson’s Paradox

When statisticians are asked for an interesting paradoxical result in statistics, they often turn to Simpson’s paradox.[15] Simpson’s paradox arises whenever an apparent trend in data, caused by a confounding variable, can be eliminated or reversed by splitting the data into natural groups. There are many examples of the paradox, so let me start with the most popular.

In 1973, the University of California, Berkeley, received 12,763 applications for graduate study. In that year’s admissions process, 44% of male applicants were accepted but only 35% of female applicants were. The university administration, fearing a gender discrimination lawsuit, asked several of its faculty to take a closer look at the data.[16]

Graduate admissions, unlike undergraduate admissions, are handled by each academic department independently. The initial investigation led to a paradoxical conclusion: of 101 separate graduate departments at Berkeley, only four departments showed a statistically significant bias against admitting women. At the same time, six departments showed a bias against men, which was more than enough to cancel out the deficit of women caused by the other four departments.

How could Berkeley as a whole appear biased against women when individual departments were generally not? It turns out that men and women did not apply to all departments in equal proportion. For example, nearly two-thirds of the applicants to the English department were women, while only 2% of mechanical engineering applicants were. Furthermore, some graduate departments were more selective than others.

These two factors accounted for the perceived bias. Women tended to apply to departments with many qualified applicants and little funding, while men applied to departments with fewer applicants and surpluses of research grants. The bias was not at Berkeley, where individual departments were generally fair, but further back in the educational process, where women were being shunted into fields of study with fewer graduate opportunities.8
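The arithmetic behind the reversal can be written as a weighted average. The notation is introduced here for illustration and is not taken from the Berkeley analysis: for applicants of gender g, let n_{g,d} be the number applying to department d and p_{g,d} the fraction of them admitted.

```latex
\text{overall rate}_g \;=\; \sum_{d} w_{g,d}\, p_{g,d},
\qquad
w_{g,d} \;=\; \frac{n_{g,d}}{\sum_{d'} n_{g,d'}}
```

Each gender’s overall admission rate averages the departmental rates with its own weights. Even if women are admitted at the same or a higher rate than men in every department, their overall rate can still be lower when their weights are concentrated on the departments with the lowest admission rates.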

Simpson’s paradox came up again in a 1986 study on surgical techniques to remove kidney stones. An analysis of hundreds of medical records appeared to show that percutaneous nephrolithotomy, a minimally invasive new procedure for removing kidney stones, had a higher success rate than traditional open surgery: 83% instead of 78%.

On closer inspection, the trend reversed. When the data was split into small and large kidney-stone groups, percutaneous nephrolithotomy performed worse in both groups, as shown in Table 8-1. How was this possible?

Table 8-1. Success Rates for Kidney Stone Removal Surgeries

Treatment                       Diameter < 2 cm   Diameter ≥ 2 cm
Open surgery                    93%               73%
Percutaneous nephrolithotomy    87%               69%

The problem was that the study did not use randomized assignment. It was merely a review of medical records, and it turned out that doctors were systematically biased in how they treated each patient. Patients with large, difficult-to-remove kidney stones underwent open surgery, while those with small, easy-to-remove stones had the nephrolithotomy.9 Presumably, doctors were more comfortable using the new, unfamiliar procedure on patients with small stones and reverted to open surgery for tough cases.
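The reversal is easy to check by hand. The sketch below uses the patient counts commonly quoted from the 1986 study. They are not given in the text above, so treat them as an assumption; they do reproduce the 78% and 83% overall success rates.

```python
# (successes, patients) by treatment and stone size.
# Assumption: these are the counts commonly quoted from the 1986 study.
counts = {
    "open surgery": {"small": (81, 87), "large": (192, 263)},
    "percutaneous nephrolithotomy": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, patients):
    return successes / patients

def overall(groups):
    """Success rate after pooling the small-stone and large-stone patients."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return successes / patients

for treatment, groups in counts.items():
    small = rate(*groups["small"])
    large = rate(*groups["large"])
    print(f"{treatment:30s} small {small:.0%}  large {large:.0%}  overall {overall(groups):.0%}")
```

Open surgery wins within each stone size, but most of its patients had large stones, where every treatment does worse. Most percutaneous nephrolithotomy patients had small stones. The pooled rates mostly reflect who received which surgery.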

The new surgery wasn’t necessarily better but was tested on the easiest patients. Had the surgical method been chosen by random assignment instead of at the surgeon’s discretion, there’d have been no such bias. In general, random assignment eliminates confounding variables and prevents Simpson’s paradox from giving us backward results. Purely observational studies are particularly susceptible to the paradox.

This problem is common in medicine, as illustrated by another example. Bacterial meningitis is an infection of tissues surrounding the brain and spinal cord and is known to progress quickly and cause permanent damage if not immediately treated, particularly in children. In the United Kingdom, general practitioners typically administer penicillin to children they believe have meningitis before sending them to the hospital for further tests and treatment. The goal is to start treatment as soon as possible, without waiting for the child to travel to the hospital.

To see whether this early treatment was truly beneficial, an observational study examined records of 448 children diagnosed with meningitis and admitted to the hospital. Simple analysis showed that children given penicillin by general practitioners were less likely to die in treatment.

A more careful look at the data reversed this trend. Many children had been admitted directly to the hospital and never saw a general practitioner, meaning they didn’t receive the initial penicillin shot. They were also the children with the most severe illnesses—the children whose parents rushed them directly to the hospital. What if they are excluded from the data and you ask only, “Among children who saw their general practitioner first, did those administered penicillin have better outcomes?” Then the answer is an emphatic no. The children administered penicillin were much more likely to die.10

But this was an observational study, so you can’t be sure the penicillin caused their deaths. It’s hypothesized that toxins released during the destruction of the bacteria could cause shock, but this has not been experimentally proven. Or perhaps general practitioners gave penicillin only to children who had the most severe cases. You can’t be sure without a randomized trial.

Unfortunately, randomized controlled experiments are difficult and sometimes impossible to run. For example, it may be considered unethical to deliberately withhold penicillin from children with meningitis. For a nonmedical example, if you compare flight delays between United Airlines and Continental Airlines, you’ll find United has more flights delayed on average. But at each individual airport in the comparison, Continental’s flights are more likely to be delayed. It turns out United operates more flights out of cities with poor weather. Its average is dragged down by the airports with the most delays.7

But you can’t randomly assign airline flights to United or Continental. You can’t always eliminate every confounding factor. You can only measure them and hope you’ve measured them all.

Tips


§ Remember that a statistically insignificant variable does not necessarily have zero effect; you may not have the power needed to detect its effect.

§ Avoid stepwise regression when possible. Sometimes it’s useful, but the final model is biased and difficult to interpret. Other selection techniques, such as the lasso, may be more appropriate. Or there may be no need to do variable selection at all.

§ To test how well your model fits the data, use a separate dataset or a procedure such as cross-validation.

§ Watch out for confounding variables that could cause misleading or reversed results, as in Simpson’s paradox, and use random assignment to eliminate them whenever possible.

[15] Simpson’s paradox was discovered by Karl Pearson and Udny Yule and is thus an example of Stigler’s law of eponymy, discovered by Robert Merton, which states that no scientific discovery is named after the original discoverer.

[16] The standard version of this story claims that the university was sued for discrimination, but nobody ever says who filed the suit or what became of it. A Wall Street Journal interview with a statistician involved in the original investigation reveals that the lawsuit never happened.7 The mere fear of a lawsuit was sufficient to trigger an investigation. But the lawsuit story has been around so long that it’s commonly regarded as fact.