An ambitious new project is attempting to replicate every single study published in 2008 in three leading academic psychology journals. It’s called the Reproducibility Project.
As the name suggests, the aim is to explore just how reproducible the results of psychological experiments are, and the current issue of Perspectives on Psychological Science is dedicated to it. It’s a laudable goal, but why is it necessary? Surely statistical analysis of experimental data should tell us whether we’re likely to see the same result again.
What statistics don’t tell us
There’s a widespread misconception that the statistical analyses typically reported in scientific journals address replication. In particular, many people, including researchers themselves, believe the “statistical significance” of a result speaks directly to the reproducibility of the experimental effect. It does not.
Readers of scientific papers may be familiar with the term “statistical significance”. It’s often expressed as p<.05 after a result, or as an asterisk in a table or figure referring to “significance at the 5% level”. In psychology, statistical significance tests are used to support primary outcomes in 97% of empirical articles.
The most common misinterpretation is that p<.05 means there’s a less than 5% probability that the experimental effect is due to chance. (Read this bit carefully — the preceding sentence described a misconception.) From this “due to chance” misconception, we quickly arrive at the false conclusion that the effect is very probably real and that it will replicate.
In fact, a p value is a conditional probability: the probability of observing a particular experimental result, or one more extreme, given that the effect doesn’t actually exist in the world (that is, given the null hypothesis is true). The reason that statement doesn’t equate to “due to chance”, “reproducible” or “real” is that it only describes one type of error – the error of finding something that isn’t really there. It doesn’t say anything about the chance of missing what is there, or even indicate how hard we looked!
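To make that concrete, here is a minimal sketch, not from the article: the group sizes, the true difference, the seed and the repetition count are all arbitrary assumptions. It shows that a p value is just a tail probability computed in an imagined world where the effect does not exist, and nothing more.

```python
# A minimal sketch. All numbers (groups of 30, a true difference of 0.4,
# the seed, 20,000 repetitions) are assumptions for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One "observed" experiment.
treatment = rng.normal(loc=0.4, scale=1.0, size=30)
control = rng.normal(loc=0.0, scale=1.0, size=30)
observed = stats.ttest_ind(treatment, control)

# Approximate that p value by brute force: how often does a world with
# NO effect produce a t statistic at least as extreme as the one observed?
null_ts = np.array([
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).statistic
    for _ in range(20_000)
])
p_simulated = np.mean(np.abs(null_ts) >= abs(observed.statistic))

print(f"p from t-test: {observed.pvalue:.3f}")
print(f"p from null-world simulation: {p_simulated:.3f}")
# Neither number says how probable the effect is, nor how likely a
# replication would be to come out significant again.
```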
Statistical significance depends on the size of the effect (that is, how much difference the drug or therapy made), the variability in the sample (that is, how much people vary in their reactions to the drug or therapy), as well as other design features of the experiment, including sample size.
In psychology, small sample sizes, modest treatment effects and considerable individual differences (high amounts of variability) work together to create low statistical power in many experiments. Statistical power tells us how likely it is that a given experiment will detect an effect of a certain magnitude as “statistically significant”, if the effect really exists in the world.
Estimates of the average statistical power of published psychology experiments hover at around 50%. This means that conducting an average psychology experiment is roughly equivalent to flipping a coin, in terms of whether you get a statistically significant result or not.
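A hedged illustration of why that figure is plausible: assuming a “medium” true effect (Cohen’s d of 0.5) and 30 participants per group – numbers chosen here as typical of psychology experiments, not taken from the article – a long run of simulated experiments in which the effect genuinely exists comes out statistically significant only about half the time.

```python
# Sketch under assumed conditions: true effect d = 0.5, n = 30 per group,
# 10,000 simulated experiments. The effect is real in every single one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, true_effect, n_experiments = 30, 0.5, 10_000

significant = sum(
    stats.ttest_ind(
        rng.normal(true_effect, 1.0, n_per_group),  # treatment group
        rng.normal(0.0, 1.0, n_per_group),          # control group
    ).pvalue < 0.05
    for _ in range(n_experiments)
)

print(f"Estimated power: {significant / n_experiments:.2f}")  # roughly 0.5
# Half the time a genuinely real effect fails to reach p < .05: the coin flip.
```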
Many statistically non-significant results are therefore not good evidence of “no effect”, and many statistically significant results that get published are false positives, as we explain below.
Publication bias and false positives
An average statistical power of 50%, combined with journals’ bias towards publishing only statistically significant results, produces a skewed literature, one that potentially tells only half the story – the statistically significant half. We simply don’t hear about the studies that failed to reach the significance threshold, even though there may be more of them. Those studies stay locked in file drawers.
Publication bias pushes the proportion of false positives in the published literature far beyond the 5% rate we would expect from tests reported at p<.05. A false positive is a test that detects a statistically significant effect when there is no real effect there, like a pregnancy test that reports you are pregnant when you are not.
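A back-of-envelope calculation shows how that happens. The inputs below are illustrative assumptions, not figures from the article: suppose only 10% of the effects researchers test are real, that average power is 50% (as above), and that the significance threshold is the usual 5%.

```python
# Illustrative assumptions only: prior_real, power and alpha are made-up
# but plausible values, not numbers reported in the article.
prior_real = 0.10   # share of tested hypotheses that are actually true
power = 0.50        # chance a real effect comes out statistically significant
alpha = 0.05        # chance a non-existent effect comes out significant anyway

true_positives = prior_real * power             # 0.050 of all studies
false_positives = (1 - prior_real) * alpha      # 0.045 of all studies

# If only significant results get published, the literature is built
# from these two slices alone.
share_false = false_positives / (true_positives + false_positives)
print(f"{share_false:.0%} of significant results are false positives")  # ~47%
```

Under those assumed numbers, nearly half of the significant, publishable results are false positives, even though each individual test keeps its nominal 5% error rate.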
Add to the mix external pressure to publish, whether from funding agencies or drug companies, and flexible research designs that allow researchers to stop collecting data as soon as they cross the statistical significance threshold (rather than predetermining a sample size and sticking to it), and the false positive rate grows even higher.
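The cost of that kind of flexible stopping is easy to simulate. In the sketch below, every detail is an assumption made for illustration: there is no real effect at all, the researcher checks the data after every 10 new participants per group, and stops as soon as p drops below .05 (or gives up at 100 per group).

```python
# Assumed setup: NO real effect, peek after every 10 participants per group,
# stop and declare success the moment p < .05, abandon the study at n = 100.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments, step, max_n = 5_000, 10, 100

false_alarms = 0
for _ in range(n_experiments):
    treatment = np.empty(0)
    control = np.empty(0)
    for _ in range(max_n // step):
        treatment = np.append(treatment, rng.normal(0.0, 1.0, step))
        control = np.append(control, rng.normal(0.0, 1.0, step))
        if stats.ttest_ind(treatment, control).pvalue < 0.05:
            false_alarms += 1   # "significant", despite no effect existing
            break

print(f"False positive rate with optional stopping: {false_alarms / n_experiments:.2f}")
# Well above the nominal 0.05: repeated peeking gives chance many shots at the 5% bar.
```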
It might be inconvenient that typical statistics don’t provide us with direct information about the reproducibility of results, but it needn’t be the downfall of the scientific enterprise. The problem lies in falsely believing, as many psychology researchers do, that a p value below .05 means it’s very likely you’ll see another low p value next time.
If you think p values tell you what you need to know about replication, you’re much less likely to actually replicate a study. And that means that false positive results can loiter in the literature for a very long time, distorting our understanding of issues, misleading scientific research, and underpinning inappropriate management or policy decisions.
In 2005, John P. Ioannidis made headlines when he claimed that up to 90% of published medical findings may be false. Ioannidis described conditions of small sample sizes, small effect sizes, publication bias, pressure to publish and flexible stopping rules — all the problems we identify above. His quantitative conclusions about error rates and false positives were based on simulations, not “real” data.
Unfortunately, looking at real data is just as disheartening. Over the past decade, a group of researchers attempted to replicate 53 “landmark” cancer studies. They were interested in how many would again produce results deemed strong enough to drive a drug-development program (their definition of reproducibility). Of those 53 studies, the results of only six could be robustly reproduced.
It seems reproducible results are pretty hard to come by. Quantifying exactly how hard is what the Reproducibility Project is all about. From it, we’ll learn a lot about which psychological phenomena are real, and which aren’t. We may also learn a lot about how poor statistical practice can delay progress and mislead science.