Putting psychological research to the test with the Reproducibility Project

Statistical significance doesn’t speak directly to the reproducibility of an experimental effect. Daniel Leininger

An ambitious new project is attempting to replicate every single study published in 2008 in three leading academic psychology journals. It’s called the Reproducibility Project.

As the name suggests, the aim is to explore just how reproducible the results of psychological experiments are, and the current issue of Perspectives on Psychological Science is dedicated to it. It’s a laudable goal, but why is it necessary? Surely statistical analysis of experimental data should tell us whether we’re likely to see the same result again.

What statistics don’t tell us

There’s a widespread misconception that the statistical analyses typically reported in scientific journals address replication. In particular, many people, including researchers themselves, believe the “statistical significance” of a result speaks directly to the reproducibility of the experimental effect. It does not.

Readers of scientific papers may be familiar with the term “statistical significance”. It’s often expressed as p<.05 after a result, or as an asterisk in a table or figure referring to “significance at the 5% level”. In psychology, statistical significance tests are used to support primary outcomes in 97% of empirical articles.

The most common misinterpretation is that p<.05 means there’s a less than 5% probability that the experimental effect is due to chance. (Read this bit carefully — the preceding sentence described a misconception.) From this “due to chance” misconception, we quickly arrive at the false conclusion that the effect is very probably real and that it will replicate.

In fact, a p value is a conditional probability: the probability of observing a particular experimental result, or one more extreme, given that the effect doesn’t actually exist in the world. The reason that statement doesn’t equate to “due to chance”, “reproducible” or “real” is that it only describes one type of error – the error of finding something that isn’t really there. It doesn’t say anything about the chance of missing what is there, or even indicate how hard we looked!
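To make the "conditional probability" reading concrete, here is a minimal sketch using a hypothetical coin-flip experiment (the coin, the 60 heads and the 100 flips are illustrative assumptions, not taken from any study discussed here). It computes the probability of a result at least this extreme, given that the coin is fair.

```python
from math import comb


def one_sided_p_value(heads, flips):
    """P(at least `heads` heads in `flips` flips | the coin is fair)."""
    # Count the equally likely head/tail sequences with `heads` or more heads.
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips


# Hypothetical data: 60 heads in 100 flips.
p = one_sided_p_value(60, 100)
print(f"p = {p:.4f}")
```

The number that comes out is a statement about the data assuming there is no effect. It is not the probability that the coin is fair, and it says nothing about how often a biased coin would be caught.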

Statistical significance depends on the size of the effect (that is, how much difference the drug or therapy made), the variability in the sample (that is, how much people vary in their reactions to the drug or therapy), as well as other design features of the experiment, including sample size.

In psychology, small sample sizes, modest treatment effects and considerable individual differences (high amounts of variability) work together to create low statistical power in many experiments. Statistical power tells us how likely it is that a given experiment will detect an effect of a certain magnitude as “statistically significant”, if the effect really exists in the world.

Calculations of the average statistical power of published psychology experiments hover at around 50%. This means that, even when the effect being studied is real, conducting an average psychology experiment is roughly equivalent to flipping a coin, in terms of whether you get a statistically significant result or not.
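As a rough illustration of how a design ends up with coin-flip power, the sketch below computes power for a simple two-group comparison. It assumes a two-sided z-test with a known standard deviation, and the effect size (0.5 standard deviations) and group size (32 per group) are hypothetical values chosen only because they land near 50%.

```python
import math
from statistics import NormalDist


def power_two_group_z(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, two-group z-test (known SD, equal group sizes).

    `effect_size` is the true difference between group means in SD units.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic is centred on this value.
    shift = effect_size * math.sqrt(n_per_group / 2)
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


print(f"power = {power_two_group_z(0.5, 32):.2f}")
```

Under these assumptions the same function also gives the chance that an exact repeat of the experiment reaches p<.05, which is why a single significant result from a design like this is weak evidence about replication.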

Many statistically non-significant results are therefore not good evidence of “no effect”, and many statistically significant results that get published are false positives, as we explain below.

Publication bias and false positives

An average statistical power of 50%, combined with journals’ biases towards only publishing statistically significant results, produces a skewed literature, one that potentially only tells half the story – the statistically significant half. We simply don’t hear about the studies that failed to reach the significance threshold, even though there may be more of them. Those studies stay locked in file drawers.

Publication bias pushes the number of false positives in the literature far beyond the 5% rate we expect from tests that report p<.05. False positive results detect statistically significant effects when there are no real effects there, like a pregnancy test that reports you are pregnant when you are not.
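The arithmetic behind this can be sketched in a few lines. The example assumes that only statistically significant results are published, and it needs one number the article does not supply: the share of tested effects that are actually real. The values tried for that share below are purely hypothetical.

```python
def false_positive_share(prior_true, power=0.5, alpha=0.05):
    """Fraction of statistically significant results that are false positives.

    `prior_true` is the assumed share of tested effects that really exist.
    """
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# Hypothetical shares of real effects among everything that gets tested.
for prior in (0.5, 0.25, 0.1):
    share = false_positive_share(prior)
    print(f"{prior:.0%} of tested effects real -> "
          f"{share:.0%} of significant results are false positives")
```

With 50% power, the share of false positives among significant results exceeds 5% for each of these assumed values, and it climbs quickly as real effects become rarer among the hypotheses being tested.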

Add to the mix external pressure to publish (from funding agencies or drug companies) and flexible research designs that allow researchers to stop collecting data as soon as they cross the statistical significance threshold (rather than predetermining a sample size and sticking to it), and the false positive rate grows even higher.
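A small simulation can show the effect of flexible stopping. This is a sketch under stated assumptions, not a model of any particular study: there is no real effect, observations are standard normal, the test is a two-sided z-test with known standard deviation, and the hypothetical researcher checks the result after every 10 observations up to 100.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(sample_sum, n):
    """Two-sided p value for a z-test of mean 0 with known SD of 1."""
    z = sample_sum / math.sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek, n_max=100, batch=10, alpha=0.05,
                        n_sims=2000, seed=1):
    """Share of null experiments that end with p < alpha.

    With peek=True the data are tested after every `batch` observations and
    collection stops at the first significant result; with peek=False the
    data are tested once, at the predetermined sample size `n_max`.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)  # no real effect: true mean is 0
            at_look = (n % batch == 0) if peek else (n == n_max)
            if at_look and two_sided_p(total, n) < alpha:
                hits += 1
                break
    return hits / n_sims


print(f"fixed sample size: {false_positive_rate(peek=False):.3f}")
print(f"stop when p<.05:   {false_positive_rate(peek=True):.3f}")
```

The fixed-sample run should stay close to the nominal 5%, while the run that stops at the first significant look should come out well above it, even though every individual test uses the same p<.05 threshold.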

It might be inconvenient that typical statistics don’t provide us with direct information about the reproducibility of results, but it needn’t be the downfall of the scientific enterprise. The problem lies in falsely believing, as many psychology researchers do, that a p<.05 means that it’s really likely you’ll see another low p value next time.

If you think p values tell you what you need to know about replication, you’re much less likely to actually replicate a study. And that means that false positive results can loiter in the literature for a very long time, distorting our understanding of issues, misleading scientific research, and underpinning inappropriate management or policy decisions.

In 2005, John P. Ioannidis made headlines when he claimed that up to 90% of published medical findings may be false. Ioannidis described conditions of small sample sizes, small effect sizes, publication bias, pressure to publish and flexible stopping rules — all the problems we identify above. His quantitative conclusions about error rates and false positives were based on simulations, not “real” data.

Unfortunately, looking at real data is just as disheartening. Over the past decade, a group of researchers attempted to replicate 53 “landmark” cancer studies. They were interested in how many would again produce results deemed strong enough to drive a drug-development program (their definition of reproducibility). Of those 53 studies, the results of only six could be robustly reproduced.

It seems reproducible results are pretty hard to come by. Quantifying exactly how hard is what the Reproducibility Project is all about. From it, we’ll learn a lot about which psychological phenomena are real, and which aren’t. We may also learn a lot about how poor statistical practice can delay progress and mislead science.

3 Comments

  1. Leslie Newsome

    Senior Lecturer in Psychology (retired)

    Good stuff. Many years back down the drain I used to take a class in practical statistics. Once I had just over 100 students in the room and I gave each a sheet of paper of data, randomly generated for each student. The exercise was that each student had to organize it into a 3-way Analysis of Variance and interpret the results. When they were finished I asked those who had found main effects to put their hands up. A number of hands went up. I then pointed out that if they were researchers around the country working on the same idea, at least several would cry "Eureka" and rush off to publish. The rest would file their results in a waste-paper basket and move on elsewhere. Two separate publications normally are taken as a significant replication and the "so-called" effect moves on to be taught as a scientific fact to be accommodated by any subsequent theory.

    1. Brad Adams

      In reply to Leslie Newsome

      Paul Meehl's writings on philosophy of science and statistical significance testing should be required reading in any psychology course. Cohen's article "The earth is round (p < .05)" should be as well. Throw in a few of Gigerenzer's articles on statistical thinking. They are not. In 4 years of studying psychology the only time I came across any of this was one reference in one textbook to Cohen's article.
      How many psych lecturers do you think take your sceptical approach Leslie?

  2. Alex A. Sanchez

    Post-Doc in Clinical Psychology

    As enlightening as this article is, it also makes me a bit sad. I now question the validity of some of my favorite studies.
    On the other hand, bodies of work, such as that belonging to Dr. James Pennebaker, are so thorough, that it must be safe to agree on some degree of credibility to his results.
