Pseudoreplication: Choose Your Data Wisely - Statistics Done Wrong: The Woefully Complete Guide (2015)

Chapter 3. Pseudoreplication: Choose Your Data Wisely

In a randomized controlled trial, test subjects are assigned to either experimental or control groups randomly, rather than for any systematic reason. Though the word random makes such studies sound slightly unscientific, a medical trial is not usually considered definitive unless it is a randomized controlled trial. Why? What’s so important about randomization?

Randomization prevents researchers from introducing systematic biases between test groups. Otherwise, they might assign frail patients to a less risky or less demanding treatment or assign wealthier patients to the new treatment because their insurance companies will pay for it. Randomization has no hidden biases, and with a large enough sample it makes the groups roughly similar in demographics, so confounding factors (even ones you don’t know about) are unlikely to differ systematically between groups. When you obtain a statistically significant result, the most plausible cause, apart from chance, is your medication or intervention.

Pseudoreplication in Action

Let me return to a medical example. I want to compare two blood pressure medications, so I recruit 2,000 patients and randomly split them into two groups. Then I administer the medications. After waiting a month for the medication to take effect, I measure each patient’s blood pressure and compare the groups to find which has the lower average blood pressure. I can do an ordinary hypothesis test and get an ordinary p value; with my sample size of 1,000 patients per group, I will have good statistical power to detect differences between the medications.

Now imagine an alternative experimental design. Instead of 1,000 patients per group, I recruit only 10, but I measure each patient’s blood pressure 100 times over the course of a few months. This way I can get a more accurate fix on their individual blood pressures, which may vary from day to day. Or perhaps I’m worried that my sphygmomanometers are not perfectly calibrated, so I measure with a different one each day.[8] I still have 1,000 data points per group but only 10 unique patients. I can perform the same hypothesis tests with the same statistical power since I seem to have the same sample size.

But do I really? A large sample size is supposed to ensure that any differences between groups are a result of my treatment, not genetics or preexisting conditions. But in this new design, I’m not recruiting new patients. I’m just counting the genetics of each existing patient 100 times.
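A small simulation can make the problem concrete. The sketch below uses hypothetical numbers that are not part of the example above: baseline blood pressure varies between patients with a standard deviation of 10 mmHg, and day-to-day readings vary around each patient's baseline with a standard deviation of 5 mmHg. Neither medication has any effect, so a well-behaved test should reject the null hypothesis in about 5% of trials. A test that treats all 1,000 readings per group as independent should reject far more often than that.

```python
import math
import random


def simulate_group(rng, n_patients, n_measurements, between_sd, within_sd):
    """Return every reading for one group; no treatment effect is simulated."""
    readings = []
    for _ in range(n_patients):
        # Each patient has their own baseline (assumed mean: 120 mmHg).
        baseline = rng.gauss(120.0, between_sd)
        readings.extend(rng.gauss(baseline, within_sd)
                        for _ in range(n_measurements))
    return readings


def t_statistic(a, b):
    """Two-sample t statistic that treats every value as independent."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def naive_false_positive_rate(n_trials=200, seed=0):
    """Fraction of null trials in which the naive test declares significance."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        group_a = simulate_group(rng, 10, 100, between_sd=10.0, within_sd=5.0)
        group_b = simulate_group(rng, 10, 100, between_sd=10.0, within_sd=5.0)
        # With 1,000 readings per group the t distribution is close to normal,
        # so |t| > 1.96 approximates a two-sided test at the 5% level.
        if abs(t_statistic(group_a, group_b)) > 1.96:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    print(f"naive false positive rate: {naive_false_positive_rate():.2f}")
```

The exact rate depends on the assumed standard deviations, but the pattern holds whenever patients differ from one another by more than the readings within a patient do.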

This problem is known as pseudoreplication, and it is quite common.1 For instance, after testing cells from a culture, a biologist might “replicate” his results by testing more cells from the same culture. Or a neuroscientist might test multiple neurons from the same animal, claiming to have a large sample size of hundreds of neurons from just two rats. A marine biologist might experiment on fish kept in aquariums, forgetting that fish sharing a single aquarium are not independent: their conditions may be affected by one another, as well as the tested treatment.2 If these experiments are meant to reveal trends in rats or fish in general, their results will be misleading.

You can think of pseudoreplication as collecting data that answers the wrong question. Animal behaviorists frequently try to understand bird calls, for example, by playing different calls to birds and evaluating their reactions. Bird calls can vary between geographical regions, just like human accents, and these dialects can be compared. Prior to the 1990s, a common procedure for these experiments was to record one representative bird song from each dialect and then play these songs to 10 or 20 birds and record their reactions.3 The more birds that were observed, the larger the sample size.

But the research question was about the different song dialects, not individual songs. No matter how “representative” any given song may have been, playing it to more birds couldn’t provide evidence that Dialect A was more attractive to male yellow-bellied sapsuckers than Dialect B was; it was only evidence for that specific song or recording. A proper answer to the research question would have required many samples of songs from both dialects.

Pseudoreplication can also be caused by taking separate measurements of the same subject over time (autocorrelation), like in my blood pressure experiment. Blood pressure measurements of the same patient from day to day are autocorrelated, as are revenue figures for a corporation from year to year. The mathematical structure of these autocorrelations can be complicated and vary from patient to patient or from business to business. The unwitting scientist who treats this data as though each measurement is independent of the others will obtain pseudoreplicated—and hence misleading—results.

Accounting for Pseudoreplication

Careful experimental design can break the dependence between measurements. An agricultural field experiment might compare growth rates of different strains of a crop by planting each strain in its own field. But if soil or irrigation quality varies from field to field, you won’t be able to separate variations due to crop variety from variations in soil conditions, no matter how many plants you measure in each field. A better design would be to divide each field into small blocks and randomly assign a crop variety to each block. With a large enough selection of blocks, soil variations can’t systematically benefit one crop more than the others.
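The blocked design can be sketched in a few lines. The function name `assign_varieties`, the field names, and the variety labels below are all hypothetical. As one reasonable choice, the sketch also keeps the varieties balanced within each field, so every variety gets the same number of blocks in every field.

```python
import random


def assign_varieties(fields, blocks_per_field, varieties, seed=None):
    """Randomly assign crop varieties to blocks, balanced within each field."""
    if blocks_per_field % len(varieties) != 0:
        raise ValueError("blocks_per_field must be a multiple of the number of varieties")
    rng = random.Random(seed)
    repeats = blocks_per_field // len(varieties)
    layout = {}
    for field in fields:
        plan = list(varieties) * repeats  # equal share of blocks per variety
        rng.shuffle(plan)                 # random position within the field
        layout[field] = plan
    return layout


if __name__ == "__main__":
    for field, plan in assign_varieties(["north", "south"], 6, ["A", "B", "C"], seed=1).items():
        print(field, plan)
```

Because every field contains every variety, a difference in soil quality between fields affects all varieties equally instead of favoring one.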

Alternatively, if you can’t alter your experimental design, statistical analysis can help account for pseudoreplication. Statistical techniques do not magically eliminate dependence between measurements or allow you to obtain good results with poor experimental design. They merely provide ways to quantify dependence so you can correctly interpret your data. (This means they usually give wider confidence intervals and larger p values than the naive analysis.) Here are some options:4

§ Average the dependent data points. For example, average all the blood pressure measurements taken from a single person and treat the average as a single data point. This isn’t perfect: if you measured some patients more frequently than others, this fact won’t be reflected in the averaged number. To make your results reflect the level of certainty in your measurements, which increases as you take more, you’d perform a weighted analysis, weighting the better-measured patients more strongly.

§ Analyze each dependent data point separately. Instead of combining all the patient’s blood pressure measurements, analyze every patient’s blood pressure from, say, just day five, ignoring all other data points. But be careful: if you repeat this for each day of measurements, you’ll have problems with multiple comparisons, which I will discuss in the next chapter.

§ Correct for the dependence by adjusting your p values and confidence intervals. Many procedures exist to estimate the size of the dependence between data points and account for it, including clustered standard errors, repeated measures tests, and hierarchical models.5,6,7
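As a minimal sketch of the first option, the code below averages each patient's readings and then runs an ordinary two-sample t test on the 10 patient averages per group. The simulated standard deviations (10 mmHg between patients, 5 mmHg within a patient) and the function names are assumptions made for illustration. Every patient is measured the same number of times, so no weighting is needed here.

```python
import math
import random
from collections import defaultdict

T_CRIT_DF18 = 2.101  # two-sided 5% critical value of Student's t with 18 df


def patient_means(readings):
    """Collapse (patient_id, value) pairs to one mean per patient."""
    by_patient = defaultdict(list)
    for patient_id, value in readings:
        by_patient[patient_id].append(value)
    return [sum(values) / len(values) for values in by_patient.values()]


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def simulate_group(rng, prefix, n_patients=10, n_measurements=100):
    """Simulate readings with no treatment effect (assumed SDs: 10 and 5 mmHg)."""
    readings = []
    for i in range(n_patients):
        baseline = rng.gauss(120.0, 10.0)
        for _ in range(n_measurements):
            readings.append((f"{prefix}{i}", rng.gauss(baseline, 5.0)))
    return readings


def averaged_false_positive_rate(n_trials=300, seed=0):
    """Fraction of null trials rejected when patients, not readings, are the unit."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        a = patient_means(simulate_group(rng, "A"))
        b = patient_means(simulate_group(rng, "B"))
        if abs(pooled_t_statistic(a, b)) > T_CRIT_DF18:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    print(f"false positive rate after averaging: {averaged_false_positive_rate():.2f}")
```

With the patient as the unit of analysis, the false positive rate should land near the nominal 5%. The cost is that the test now has the power of a 10-patient study, which is the honest sample size.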

Batch Biology

New technology has led to an explosion of data in biology. Inexpensive labs-on-a-chip called microarrays allow biologists to track the activities of thousands of proteins or genes simultaneously. Microarrays contain thousands of probes, which chemically bind to different proteins or genes; fluorescent dyes allow a scanner to detect the quantity of material bound to each probe. Cancer research in particular has benefited from these new technologies: researchers can track the expression of thousands of genes in both cancerous and healthy cells, which might lead to new targeted cancer treatments that leave healthy tissue unharmed.

Microarrays are usually processed in batches on machines that detect the fluorescent dyes. In a large study, different microarrays may be processed by different laboratories using different equipment. A naive experimental setup might be to collect a dozen cancerous samples and a dozen healthy samples, inject them into microarrays, and then run all the cancerous samples through the processing machine on Tuesday and the healthy samples on Wednesday.

You can probably see where this is going. Microarray results vary strongly between processing batches: machine calibrations might change, differences in laboratory temperature can affect chemical reactions, and different bottles of chemical reagents might be used while processing the microarrays. Sometimes the largest source of variation in an experiment’s data is simply what day the microarrays were processed. Worse, these problems do not affect the entire microarray in the same way—in fact, correlations between the activity of pairs of genes can entirely reverse when processed in a different batch.8 As a result, additional samples don’t necessarily add data points to a biological experiment. If the new samples are processed in the same batch as the old, they just measure systematic error introduced by the equipment—not anything about cancerous cells in general.

Again, careful experimental design can mitigate this problem. If two different biological groups are being tested, you can split each group evenly between batches so systematic differences do not affect the groups in different ways. Also, be sure to record how each batch was processed, how each sample was stored, and what chemical reagents were used during processing; make this information available to the statisticians analyzing the data so they can use it to detect problems.
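One way to split groups evenly between batches is to shuffle each group and deal its samples out to the batches in turn. This is only a sketch: the function name `assign_batches` and the sample labels are hypothetical, and a real study would also need to respect batch capacity and other constraints.

```python
import random


def assign_batches(samples_by_group, n_batches, seed=None):
    """Spread each biological group as evenly as possible across batches."""
    rng = random.Random(seed)
    batches = [[] for _ in range(n_batches)]
    for group, samples in samples_by_group.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # random order, so batch membership is random
        for i, sample in enumerate(shuffled):
            # Deal samples out in turn: each batch gets an equal share of the group.
            batches[i % n_batches].append((group, sample))
    return batches


if __name__ == "__main__":
    samples = {
        "cancer": [f"c{i}" for i in range(12)],
        "healthy": [f"h{i}" for i in range(12)],
    }
    for day, batch in zip(["Tuesday", "Wednesday"], assign_batches(samples, 2, seed=1)):
        print(day, sorted(batch))
```

With a dozen cancerous and a dozen healthy samples and two processing days, each day receives six samples of each kind, so a day-to-day shift affects both groups equally.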

For example, a statistician could perform principal components analysis on the data to determine whether different batches gave wildly different results. Principal components analysis determines which combinations of variables in the data account for the most variation in the results. If it indicates that the batch number is highly influential, the data can be analyzed taking batch number into account as a confounding variable.
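The check might look like the sketch below, which uses only the Python standard library and finds the first principal component by power iteration on the covariance matrix. In practice a statistician would more likely use a dedicated PCA routine. The simulated expression values, the per-gene batch offsets, and the function names are all invented for illustration.

```python
import math
import random


def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the first principal component of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


def simulate_expression(seed=0):
    """Simulate 12 samples x 5 genes with a hypothetical per-gene batch offset."""
    rng = random.Random(seed)
    batch_shift = [4.0, 3.0, 5.0, 2.0, 4.0]  # invented batch effect
    rows, batches = [], []
    for sample in range(12):
        batch = sample % 2
        rows.append([rng.gauss(10.0, 0.5) + batch * batch_shift[g]
                     for g in range(5)])
        batches.append(batch)
    return rows, batches


if __name__ == "__main__":
    rows, batches = simulate_expression()
    _, scores = first_principal_component(rows)
    for batch, score in sorted(zip(batches, scores)):
        print(f"batch {batch}: PC1 score {score:+.2f}")
```

If the scores on the first component separate cleanly by batch, as they should in this simulation, the batch is driving most of the variation and needs to be accounted for in the analysis.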

Synchronized Pseudoreplication

Pseudoreplication can occur through less obvious routes. Consider one example in an article reviewing the prevalence of pseudoreplication in the ecological literature.9 Suppose you want to see whether chemicals in the growing shoots of grasses are responsible for the start of the reproductive season in cute furry rodents: your hypothesis is that when the grasses sprout in springtime, the rodents eat them and begin their mating season. To test this, you put some animals in a lab, feed half of them ordinary food and the other half food mixed with the grasses, and wait to see when their reproductive cycles start.

But wait: you vaguely recall having read a paper suggesting that the reproductive cycles of mammals living in groups can synchronize—something about their pheromones. So maybe the animals in each group aren’t actually independent of each other. After all, they’re all in the same lab, exposed to the same pheromones. As soon as one goes into estrus, its pheromones could cause others to follow, no matter what they’ve been eating. Your sample size will be effectively one.

The research you’re thinking of is a famous paper from the early 1970s, published in Nature by Martha McClintock, which suggested that women’s menstrual cycles can synchronize if they live in close contact.10 Other studies found similar results in golden hamsters, Norway rats, and chimpanzees. These results seem to suggest that synchronization could cause pseudoreplication in your study. Great. So does this mean you’ll have to build pheromone-proof cages to keep your rodents isolated from each other?

Not quite. You might wonder how you prove that menstrual or estrous cycles synchronize. Well, as it turns out, you can’t. The studies “proving” synchronization in various animals were themselves pseudoreplicated in an insidious way.

McClintock’s study of human menstrual cycles went something like this:

1. Find groups of women who live together in close contact—for instance, college students in dormitories.

2. Every month or so, ask each woman when her last menstrual period began and to list the other women with whom she spent the most time.

3. Use these lists to split the women into groups that tend to spend time together.

4. For each group of women, see how far the average woman’s period start date deviates from the group average.

Small deviations would mean the women’s cycles were aligned, all starting at around the same time. Then the researchers tested whether the deviations decreased over time, which would indicate that the women were synchronizing. To do this, they checked the mean deviation at five different points throughout the study, testing whether the deviation decreased more than could be expected by chance.

Unfortunately, the statistical test they used assumed that if there was no synchronization, the deviations would randomly increase and decrease from one period to another. But imagine two women in the study who start with aligned cycles. One has an average gap of 28 days between periods and the other a gap of roughly 30 days. Their cycles will diverge consistently over the course of the study, starting two days apart, then four days, and so on, with only a bit of random variation because periods are not perfectly timed. Similarly, two women can start the study not aligned but gradually align.

For comparison, if you’ve ever been stuck in traffic, you’ve probably seen how two turn signals blinking at different rates will gradually synchronize and then go out of phase again. If you’re stuck at the intersection long enough, you’ll see this happen multiple times. But to the best of my knowledge, there are no turn signal pheromones.

So we would actually expect two unaligned menstrual cycles to fall into alignment, at least temporarily. The researchers failed to account for this effect in their statistical tests.
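The drift is plain arithmetic, and a short sketch shows it. The assumptions here are hypothetical: woman A's periods start on days 0, 28, 56, and so on; woman B's first period starts on day 18 and recurs every 30 days; and there is no random variation at all.

```python
def onset_gaps(period_a=28, period_b=30, initial_offset=18, n_cycles=8):
    """Days between each of B's period onsets and the nearest onset of A.

    A's onsets fall on multiples of period_a; B's first onset is on day
    initial_offset and recurs every period_b days.
    """
    gaps = []
    for k in range(n_cycles):
        onset_b = initial_offset + k * period_b
        phase = onset_b % period_a  # days since A's most recent onset
        gaps.append(min(phase, period_a - phase))
    return gaps


if __name__ == "__main__":
    print(onset_gaps())  # prints [10, 8, 6, 4, 2, 0, 2, 4]
```

The gap shrinks by two days per cycle, reaches zero, and then grows again, with no interaction between the two women. A test that treats any sustained decrease as evidence of synchronization would be fooled by the first six cycles.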

They also made an error calculating synchronization at the beginning of the study: if one woman’s period started four days before the study began and another’s started four days after, the difference is only eight days. But periods before the beginning of the study were not counted, so the recorded difference was between the fourth day and the first woman’s next period, as much as three weeks later.

These two errors combined meant that the scientists were able to obtain statistically significant results even when there was no synchronization beyond what would occur without pheromones.11,12

The additional data points the researchers took as they followed subjects through more menstrual cycles did not provide evidence of synchronization at all. They merely provided more statistical evidence of the synchronization that would have happened by chance, regardless of pheromones. The statistical test addressed a different question than the scientists intended to ask.

Similar problems exist with studies claiming that small furry mammals or chimpanzees synchronize their estrous cycles. Subsequent research using corrected statistical methods has failed to find any evidence of estrous or menstrual synchronization (though this is controversial).13 We only thought our rodent experiment could have pseudoreplication because we believed a pseudoreplicated study.

Don’t scoff at your friends if they complain about synchronized periods, though. If the average cycle lasts 28 days, then two average women can have periods which start at most 14 days apart. (If your period starts 20 days after your friend’s, it’s only eight days before her next period.) That’s the maximum, so the average will be seven days, and since periods can last for five to seven days, they will frequently overlap even as cycles converge and diverge over time.
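The arithmetic in this paragraph can be checked directly. The sketch assumes both women have exactly 28-day cycles and that the offset between their period start dates is equally likely to be any whole number of days from 0 to 27; the function name `onset_gap` is hypothetical.

```python
def onset_gap(offset_days, cycle_length=28):
    """Days between two period onsets, measured to the nearer neighbor."""
    phase = offset_days % cycle_length
    return min(phase, cycle_length - phase)


if __name__ == "__main__":
    gaps = [onset_gap(offset) for offset in range(28)]
    print(onset_gap(20))          # prints 8
    print(max(gaps))              # prints 14
    print(sum(gaps) / len(gaps))  # prints 7.0
```

A period that starts 20 days after a friend's is eight days before her next one, the largest possible gap is 14 days, and the average gap over all offsets is seven days.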

TIPS

§ Ensure that your statistical analysis really answers your research question. Additional measurements that are highly dependent on previous data do not prove that your results generalize to a wider population—they merely increase your certainty about the specific sample you studied.

§ Use statistical methods such as hierarchical models and clustered standard errors to account for a strong dependence between your measurements.

§ Design experiments to eliminate hidden sources of correlation between variables. If that’s not possible, record confounding factors so they can be adjusted for statistically. But if you don’t consider the dependence from the beginning, you may find there is no way to save your data.


[8] I just wanted an excuse to use the word sphygmomanometer.