Empirical science needs data. But all data are subject to random variation, and random variation obscures patterns in data. So statistical methods are used to make inferences about the true patterns or effects that underlie noisy data.
Most scientists use two closely related statistical approaches to make inferences from their data: significance testing and hypothesis testing. Significance testers and hypothesis testers seek to determine if apparently interesting patterns (“effects”) in their data are real or illusory. They are concerned with whether the effects they observe could just have emanated from randomness in the data.
The first step in this process is to nominate a “null hypothesis” which posits that there is no effect. Mathematical procedures are then used to estimate the probability that an effect at least as big as that which was observed would have arisen if the null hypothesis was true. That probability is called “p”.
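One way to make the definition of p concrete is a permutation test, sketched below. This is an illustration rather than the procedure any particular study uses: the measurements are invented, and the function name permutation_p_value is hypothetical. The idea is that if the null hypothesis is true the group labels are arbitrary, so we can shuffle them many times and count how often chance alone produces a difference at least as big as the one observed.

```python
import random

def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Estimate a one-sided p value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    n_control = len(pooled) - n_treated
    at_least_as_big = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels carry no information,
        # so any reassignment of values to groups is equally likely.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_treated]) / n_treated - sum(pooled[n_treated:]) / n_control
        # Small tolerance so floating-point rounding does not break ties.
        if diff >= observed - 1e-12:
            at_least_as_big += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_big + 1) / (n_permutations + 1)

# Hypothetical measurements, invented purely for illustration.
treated = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7]
control = [4.8, 5.2, 5.0, 5.5, 4.9, 5.3]

p = permutation_p_value(treated, control)
print(f"estimated p = {p:.4f}")
```

The number printed is the estimated probability of seeing a difference this large or larger if the null hypothesis were true. It says nothing, by itself, about how big or how useful the effect is.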
Significance testing
If p is small (conventionally less than 0.05, or 5%) then the significance tester will claim that it is unlikely an effect of the observed magnitude would have arisen by chance alone. Such effects are said to be “statistically significant”. Sir Ronald Fisher who, in the 1920s, developed contemporary methods for generating p values, interpreted small p values as being indicative of “real” (not chance) effects. This is the central idea in significance testing.
Hypothesis testing
Significance testing has been under attack since it was first developed. Two brilliant mathematicians, Jerzy Neyman and Egon Pearson, argued that Fisher’s interpretation of p was dodgy. They developed an approach called hypothesis testing in which the p value serves only to help the researcher make an optimised choice between the null hypothesis and an alternative hypothesis: If p is greater than or equal to some threshold (such as 0.05) the researcher chooses to believe the null hypothesis. If p is less than the threshold the researcher chooses to believe the alternative hypothesis. In the long run (over many experiments) adoption of the hypothesis testing approach minimises the rate of making incorrect choices.
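The long-run property can be illustrated with a small simulation. The sketch below makes several simplifying assumptions that are not in the discussion above: the data are normally distributed with a known standard deviation of 1, a two-sided z-test is used, and the sample size and the alternative effect of 0.5 are arbitrary. It shows only the fixed false-positive rate of the decision rule, not the full Neyman and Pearson machinery.

```python
import random
from statistics import NormalDist, fmean

STANDARD_NORMAL = NormalDist()

def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p value for the null hypothesis that the mean is 0.

    Assumes the standard deviation sigma is known, which is rarely true in practice.
    """
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def rejection_rate(true_mean, n=20, alpha=0.05, n_experiments=20_000, seed=1):
    """Fraction of simulated experiments in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if two_sided_p(sample) < alpha:
            rejections += 1
    return rejections / n_experiments

# When the null hypothesis is true, every rejection is a wrong choice.
false_positive_rate = rejection_rate(true_mean=0.0)
# When a (hypothetical) effect of 0.5 exists, every rejection is a right choice.
power = rejection_rate(true_mean=0.5)

print(f"rejections when the null is true:  {false_positive_rate:.3f}")
print(f"rejections when the effect is 0.5: {power:.3f}")
```

Over many simulated experiments the rate of wrong rejections under the null settles close to the chosen threshold. No single experiment in the run tells you whether its own conclusion was one of the wrong ones.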
Critics have pointed out that there is limited value in knowing only that errors have been minimised in the long run – scientists don’t just want to know they have been wrong as infrequently as possible, they want to know if they can believe their last experiment!
The most vociferous critic of hypothesis testing was Fisher, who hounded Neyman in print for decades (Leonard Jimmie Savage said Fisher “published insults that only a saint could entirely forgive”). Perhaps largely as a result of Fisher’s intransigence, the issues that divided significance testing and hypothesis testing were never resolved.

Statistical inference
Today’s scientists typically use a messy concoction of significance testing and hypothesis testing. Neither Fisher nor Neyman would be satisfied with much of current statistical practice.
Scientists have enthusiastically adopted significance testing and hypothesis testing because these methods appear to solve a fundamental problem: how to distinguish “real” effects from randomness or chance. Unfortunately significance testing and hypothesis testing are of limited scientific value – they often ask the wrong question and almost always give the wrong answer. And they are widely misinterpreted.
Consider a clinical trial designed to investigate the effectiveness of a new treatment for some disease. After the trial has been conducted the researchers might ask “is the observed effect of treatment real, or could it have arisen merely by chance?” If the calculated p value is less than 0.05 the researchers might claim the trial has demonstrated the treatment was effective. But even before the trial was conducted we could reasonably have expected the treatment was “effective” – almost all drugs have some biochemical action and all surgical interventions have some effects on health. Almost all health interventions have some effect; it’s just that some treatments have effects that are large enough to be useful and others have effects that are trivial and unimportant.
So what’s the point in showing empirically that the null hypothesis is not true? Researchers who conduct clinical trials need to determine if the effect of treatment is big enough to make the intervention worthwhile, not whether the treatment has any effect at all.
A more technical issue is that p tells us the probability of observing data at least as extreme as ours given that the null hypothesis is true. But most scientists think p tells them the probability the null hypothesis is true given their data. The difference might sound subtle but it’s not. It is like the difference between the probability that a prime minister is male and the probability a male is prime minister!
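Bayes' theorem shows how far apart the two probabilities can be. The sketch below is illustrative only: the function name posterior_null is hypothetical, and the prior probability of the null (0.9 or 0.5) and the probability of a significant result when the alternative is true (0.8) are invented inputs, not estimates from any real study.

```python
def posterior_null(prior_null, p_result_given_null, p_result_given_alt):
    """Probability that the null hypothesis is true given the result (Bayes' theorem)."""
    numerator = prior_null * p_result_given_null
    denominator = numerator + (1 - prior_null) * p_result_given_alt
    return numerator / denominator

# Hypothetical inputs: a "significant" result occurs 5% of the time when the
# null is true and 80% of the time when the alternative is true.
sceptical = posterior_null(prior_null=0.9, p_result_given_null=0.05, p_result_given_alt=0.8)
even_odds = posterior_null(prior_null=0.5, p_result_given_null=0.05, p_result_given_alt=0.8)

print(f"P(null | significant result), prior 0.9: {sceptical:.2f}")
print(f"P(null | significant result), prior 0.5: {even_odds:.2f}")
```

With these made-up numbers the probability of a significant result given the null is 0.05 in both cases, yet the probability of the null given a significant result is about 0.36 with the sceptical prior and about 0.06 with even odds. The data alone do not fix that second probability; it also depends on the prior and on how likely the result is under the alternative.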
A better approach to statistical inference
There are alternatives to significance testing and hypothesis testing. A simple alternative is “estimation”. Estimation helps scientists ask the right question, and provides better (more statistically defensible, if not more mathematically rigorous) answers.
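As a rough illustration of estimation, the sketch below reports the size of an effect with a 95% confidence interval instead of a p value. It rests on assumptions that are not in the article: the data are invented, the function name mean_difference_ci is hypothetical, and the interval uses a normal approximation, which is crude for samples this small (a t-based interval would be wider).

```python
from statistics import NormalDist, fmean, variance

def mean_difference_ci(treated, control, confidence=0.95):
    """Difference in means with an approximate confidence interval.

    Uses a normal approximation; for small samples a t distribution is more appropriate.
    """
    diff = fmean(treated) - fmean(control)
    # Standard error of the difference between two independent sample means.
    se = (variance(treated) / len(treated) + variance(control) / len(control)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se

# Hypothetical measurements, invented purely for illustration.
treated = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7]
control = [4.8, 5.2, 5.0, 5.5, 4.9, 5.3]

diff, low, high = mean_difference_ci(treated, control)
print(f"estimated effect = {diff:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The interval can then be compared with the smallest effect that would make the treatment worthwhile, a threshold the researcher has to choose on clinical grounds. That comparison addresses whether the effect is big enough to matter, which a significance test does not.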
Another very different approach is “Bayesian” analysis. Bayesian statisticians try to quantify uncertainty and use data to modify their certainty about particular beliefs. In many ways Bayesian methods are superior to classic methods but scientists have been slow to adopt Bayesian approaches.
Significance testing and hypothesis testing are so widely misinterpreted that they impede progress in many areas of science. What can be done to hasten their demise? Senior scientists should ensure that a critical exploration of the methods of statistical inference is part of the training of all research students. Consumers of research should not be satisfied with statements that “X is effective”, or “Y has an effect”, especially when support for such claims is based on the evil p.
Sue Ieraci
Public hospital clinician
Thanks for the article.
One of the important aspects here is terminology: a significance test giving a p-value of <0.05 doesn't mean that something is "true" - it means that the difference between the two things being compared was highly unlikely to have occurred by chance. The way that evidence becomes stronger is when the same thing is replicated in further studies. Clinical practice, for example, rarely changes on the basis of one study.
The other important thing in clinical studies is the difference between test-based results and patient-based results. For example, a certain medication may lower blood pressure by a result that is statistically significant, but makes no difference to the patients' clinical outcome (risk of stroke or heart disease, for example).
Steve Kass
Professor of Mathematics and Computer Science
I'm not sure what your point is. There is nothing evil about p-values or hypothesis testing, although there is plenty of confusion about what they mean.
You take just a glancing blow at one very important question, that of the effect size of a statistically significant experimental result. You begin:
"So what’s the point in showing empirically that the null hypothesis is not true?"
First of all, one never shows "empirically that the null hypothesis is not true." Instead, what researchers…
Rob Herbert
Senior Principal Research Fellow at Neuroscience Research Australia
Thanks Sue and Steve for your comments.
I was, of course, having a bit of fun calling p values "evil". p values aren't anything other than a number which may or may not be interpreted correctly. But my opinion is (still) that if p values are interpreted correctly they are of little or no use to most scientists. Mostly it is not plausible that the null hypothesis could be exactly true. So most significance tests and hypothesis tests evaluate the truth of a nonsensical hypothesis. The particular…
Steve Kass
Professor of Mathematics and Computer Science
Rob,
p values don't "evaluate the truth of a nonsensical hypothesis." Anyone who thinks they really evaluate that is interpreting them badly. The (1-tailed) p-value associated with the result of experiment X (based on a random sample from a population it's intended to represent) is this:
It's the probability that the same or a more extreme result might arise from a different experiment: the experiment where the same size sample is randomly chosen from a totally hypothetical population where…
Roger Jones
Professorial Research Fellow at Victoria University
Steve,
what you seem to be describing is a simple system but not a complex system and certainly not one that is reflexive.
I really appreciate Rob's article and was heartened to see it because I'm about to blow the whistle on a whole scientific community who have constructed methodologies around p values for largely sociological reasons.
And to overlook the reality that p(hypothesis) is not equal to p(effect) in a world where science takes p(experiment) = p(expert advice) is negligent.
Roger Jones
Professorial Research Fellow at Victoria University
Oh yeah, was too quick with the post comment. p(experiment) is determined by methods, not theory.
Sue Ieraci
Public hospital clinician
Thanks, Rob,
I'm not sure I agree with your assertion that "Mostly it is not plausible that the null hypothesis could be exactly true."
(I guess it depends what you mean by "exactly true").
There are many clinical studies where a (proposed) therapeutic substance is compared with placebo or against an existing therapy. The null hypothesis would be that the substance being tested resulted in no benefit over placebo (or over existing therapy). Why couldn't that proposal be "exactly true"?
John Kruschke
Professor
I'm glad to see another call for researchers to think carefully about their statistical methods, and especially to consider Bayesian methods as possibly more appropriate for the answers they seek. Linked below is a very recent article for those who want a simple but thorough example of what Bayesian methods can do. The article is titled, "Bayesian estimation supersedes the t test." You can find it, and associated software and videos, at http://www.indiana.edu/~kruschke/BEST/
Derek Bolton
Retired s/w engineer
The article makes good points about the risk of misuse of hypothesis testing. (I recall a magazine article some years ago by a maths prof explaining H.T to the layperson. In his worked example he managed to invert the test completely, declaring 80% for a chance result when his test was for 20%.)
But I've always felt the major obfuscation is that the choice of p value is relied upon to address two unknowns: the prior probability (made explicit in Bayesian analysis) and the costs of error. The choice of criterion for a decision must not be divorced from the costs of a wrong decision. (The article partly addressed this topic in pointing out that the pharmaceuticals tester wants to know the strength of an effect, not merely whether there is one.) In practice, I suspect, most engineers and researchers pick a standard .05 or .01 out of the air, with no real understanding of the consequences.
Rob Herbert
Senior Principal Research Fellow at Neuroscience Research Australia
Sue, significance testing and null hypothesis testing are concerned with whether the null hypothesis is true (which is why they are sometimes collectively referred to as “null hypothesis testing”). That means they test whether the effect is exactly zero. My point was that, while it might be reasonable to assert in a clinical trial that an intervention has no discernible or practical or useful benefit over placebo, it will usually not be reasonable to assert that the intervention has exactly no effect…
Steve Kass
Professor of Mathematics and Computer Science
Rob,
I like the fact that this site is titled "The Conversation." I'm not trying to reach a resolution to anything in this discussion, which I think is enlightening and informative.
Also, I think I need to be clear that I have nothing at all against Bayesian statistics. In fact, I like it a lot, but just as with non-Bayesian statistics, it's important to know what the conclusions mean and what the assumptions are. I give an example below where a Bayesian analysis gets the right answer and a…
Stephen S Holden
Associate Professor, Marketing at Bond University
Great article. A topic well worth visiting - and re-visiting! And fun to hear something of the history too. Would have liked a little more about what you meant by 'estimation'. Were you referring to the practice of calculating effect sizes or something else?
Rex Gibbs
Engineer/Director
Thanks
I am an engineer and as such have to deal with rare events and compound multivariate problems. I am regularly required to design systems with 95% confidence of outcome based on a data set of 1 or 2 points. Large data sets only become available in the event of dispute. I currently work in wastewater treatment. We are asked to design million-dollar treatment plants based on a single test. Often because it is impractical to test ore where there is nothing to test. And one must design in a competitive…
Rob Buttrose
University of Melbourne
There is a vast literature on this topic going back many years. Most writers agree completely with the claims made in the article (except for some psychologists and a guy called Chow). The American journal Epidemiology banned p-values outright in the 90s.
Unfortunately, the faulty logic of statistical significance testing continues to bamboozle researchers. Contrary to what almost everyone believes, if a result is unlikely given that the null-hypothesis is true, that does not mean that if the result…
Steve Kass
Professor of Mathematics and Computer Science
Here is an excerpt from what seems to be the most recent (2001) editors' statement in Epidemiology about p-values [http://journals.lww.com/epidem/Fulltext/2001/05000/The_Value_of_P.2.aspx]. The journal never banned them.
"We will not ban P-values. But neither did Rothman [The editor in 1998. See http://journals.lww.com/epidem/Citation/1998/01000/That_Confounded_P_Value.4.aspx]. He called for caution, and we do the same. The question is not whether the P-value is intrinsically bad, but whether…
Rob Buttrose
University of Melbourne
Steve, you've researched this more thoroughly than me. I thought Rothman did manage to ban p-values for a time at least. He certainly wanted to! I have met Rothman and I do know he had very strong views about p-values and significance tests. Quoted here http://intl-pss.sagepub.com/content/15/2/119.full are his revise-and-submit letters when he was assistant editor on another journal prior to his founding Epidemiology:
"All references to statistical hypothesis testing and statistical significance should be removed from the paper. I ask that you delete p values as well as comments about statistical significance. If you do not agree with my standards (concerning the inappropriateness of significance tests), you should feel free to argue the point, or simply ignore what you may consider to be my misguided view, by publishing elsewhere."
Rob Buttrose
University of Melbourne
The probability of the null hypothesis given a result (i.e. the probability that the result was due to chance) not only depends on the probability of observing the result given the null hypothesis, but also on a number of other probabilities including the prior probability of the null and the probability of the result given the alternative (the "not null") - Bayes' theorem.
This is good value: http://www.indiana.edu/~stigtsts/quotsagn.html
Stephen Prowse
CEO at Wound CRC
I am an ex-scientist who regularly used significance testing in analysing data, mostly from controlled trials where there was one intervention and on an individual experiment basis, the process seems to serve its purpose. If used appropriately, recognising the limitations, I do not really see a problem. I well understand the issues of a null hypothesis and the magnitude of an effect etc.
It would seem that a Bayesian approach is required when one is dealing with more complex, multi-factor whole population analysis where it is inappropriate to try to dissect out individual variables.