Why hypothesis and significance tests ask the wrong questions

Consumers of research should not be satisfied with statements that “X is effective”, or “Y has an effect”. Gwenae l Piaser

Empirical science needs data. But all data are subject to random variation, and random variation obscures patterns in data. So statistical methods are used to make inferences about the true patterns or effects that underlie noisy data.

Most scientists use two closely related statistical approaches to make inferences from their data: significance testing and hypothesis testing. Significance testers and hypothesis testers seek to determine if apparently interesting patterns (“effects”) in their data are real or illusory. They are concerned with whether the effects they observe could just have emanated from randomness in the data.

The first step in this process is to nominate a “null hypothesis” which posits that there is no effect. Mathematical procedures are then used to estimate the probability that an effect at least as big as that which was observed would have arisen if the null hypothesis was true. That probability is called “p”.

Significance testing

If p is small (conventionally less than 0.05, or 5%) then the significance tester will claim that it is unlikely an effect of the observed magnitude would have arisen by chance alone. Such effects are said to be “statistically significant”. Sir Ronald Fisher who, in the 1920s, developed contemporary methods for generating p values, interpreted small p values as being indicative of “real” (not chance) effects. This is the central idea in significance testing.

Hypothesis testing

Significance testing has been under attack since it was first developed. Two brilliant mathematicians, Jerzy Neyman and Egon Pearson, argued that Fisher’s interpretation of p was dodgy. They developed an approach called hypothesis testing in which the p value serves only to help the researcher make an optimised choice between the null hypothesis and an alternative hypothesis: If p is greater than or equal to some threshold (such as 0.05) the researcher chooses to believe the null hypothesis. If p is less than the threshold the researcher chooses to believe the alternative hypothesis. In the long run (over many experiments) adoption of the hypothesis testing approach minimises the rate of making incorrect choices.

Critics have pointed out that there is limited value in knowing only that errors have been minimised in the long run – scientists don’t just want to know they have been wrong as infrequently as possible, they want to know if they can believe their last experiment!

The most vociferous critic of hypothesis testing was Fisher, who hounded Neyman in print for decades (Leonard Jimmie Savage said Fisher “published insults that only a saint could entirely forgive”). Perhaps largely as a result of Fisher’s intransigence, the issues that divided significance testing and hypothesis testing were never resolved.

Hypothesis and significance tests are widely misinterpreted. Image from shutterstock.com

Statistical inference

Today’s scientists typically use a messy concoction of significance testing and hypothesis testing. Neither Fisher nor Neyman would be satisfied with much of current statistical practice.

Scientists have enthusiastically adopted significance testing and hypothesis testing because these methods appear to solve a fundamental problem: how to distinguish “real” effects from randomness or chance. Unfortunately significance testing and hypothesis testing are of limited scientific value – they often ask the wrong question and almost always give the wrong answer. And they are widely misinterpreted.

Consider a clinical trial designed to investigate the effectiveness of new treatment for some disease. After the trial has been conducted the researchers might ask “is the observed effect of treatment real, or could it have arisen merely by chance?” If the calculated p value is less than 0.05 the researchers might claim the trial has demonstrated the treatment was effective. But even before the trial was conducted we could reasonably have expected the treatment was “effective” – almost all drugs have some biochemical action and all surgical interventions have some effects on health. Almost all health interventions have some effect, it’s just that some treatments have effects that are large enough to be useful and others have effects that are trivial and unimportant.

So what’s the point in showing empirically that the null hypothesis is not true? Researchers who conduct clinical trials need to determine if the effect of treatment is big enough to make the intervention worthwhile, not whether the treatment has any effect at all.

A more technical issue is that p tells us the probability of observing the data given that the null hypothesis is true. But most scientists think p tells them the probability the null hypothesis is true given their data. The difference might sound subtle but it’s not. It is like the difference between the probability that a prime minister is male and the probability a male is prime minister!

A better approach to statistical inference

There are alternatives to significance testing and hypothesis testing. A simple alternative is “estimation”. Estimation helps scientists ask the right question, and provides better (more statistically defensible, if not more mathematically rigorous) answers.

Another very different approach is “Bayesian” analysis. Bayesian statisticians try to quantify uncertainty and use data to modify their certainty about particular beliefs. In many ways Bayesian methods are superior to classic methods but scientists have been slow to adopt Bayesian approaches.

Significance testing and hypothesis testing are so widely misinterpreted that they impede progress in many areas of science. What can be done to hasten their demise? Senior scientists should ensure that a critical exploration of the methods of statistical inference is part of the training of all research students. Consumers of research should not be satisfied with statements that “X is effective”, or “Y has an effect”, especially when support for such claims is based on the evil p.