Yesterday’s article by Geoff Cumming, based on a very recent Proceedings of the National Academy of Sciences paper, argued that “null hypothesis significance tests” (NHST) are flawed – and he is correct. But he points the finger of blame in the wrong direction.
The problem with NHST has several layers – the first of which is that there should be nothing going by the name of NHST.
British biologist and statistician Ronald Fisher devised an approach called “significance testing” whereby data are distilled into a statistic that represents the degree to which the data are discordant with a null hypothesis (the null hypothesis usually being no effect).
That statistic is the p value. Fisher didn’t invent the p value, but his 1925 book Statistical Methods for Research Workers served to popularise p values as indices of evidence against the null hypothesis.
Initially it was assumed by many (including Fisher himself for a short time, according to Lehmann) that the Neyman-Pearson hypothesis testing framework was a development of Fisher’s significance testing. That mistake persists even today, perhaps because the two approaches share several jargon terms that, confusingly, take subtly or overtly different meanings.
The aims of the approaches differ:
- Fisher wanted to make inferences based on the evidence in the data
- Neyman and Pearson explicitly assumed that no experiment could provide evidence about any particular hypothesis and instead, they focused on minimising erroneous decisions.
The product of Fisher’s significance test is a p value, and the product of a Neyman-Pearson hypothesis test is a decision to accept or reject a hypothesis. In other words, Fisher made inference from evidence while Neyman avoided errors. (Pearson eventually withdrew his support for the all-or-none version of the Neyman and Pearson framework.)
It has been argued that the most widely used variant of those approaches is a hybrid of the two. “Null hypothesis significance test” is the perfect name for such a hybrid.
Despite its wide usage, the hybrid is dysfunctional. It preserves neither the evidential aspect of Fisher’s p values nor the control of error rates promised by Neyman’s approach. German psychologist Gerd Gigerenzer called it a “mishmash”.
Basically, NHST should not exist.
P values and evidence
The next level of problem relates to perceptions of the usefulness and desirability of p values themselves.
Many papers dispute the utility of p values and their validity as indices of evidence, and some have amusing and pointed titles such as The irreconcilability of p values and evidence, or Hail the impossible: p-values, evidence, and likelihood.
There is a kernel of truth in their arguments, but that truth is conditional on the arguments being evaluated within the realm of either a Neyman-Pearson hypothesis test or of the mishmash hybrid.
P values may not be numerical measures of evidence, but they certainly do relate to it. The smaller the p value from a significance test, the stronger the evidence against the null hypothesis and the stronger the evidence in favour of parameter values close to the observed value. The larger the sample size in the experiment, the more closely the observed parameter value will reflect the true value of the parameter for any given p value.
Neither of those relationships will surprise anyone who has experience in converting their experimental data into p values using a significance test.
However, there is more to it, because each p value from a significance test points to a likelihood function. A likelihood function displays the strength of evidence and how that evidence favours different possible values of the parameter of interest.
The relationship between p values (along with their sample size) and likelihood functions is one-to-one and so a p value serves as an index to, or a numerical summary of, the likelihood function. Just as an abstract summarises a research paper without including all of its detail and richness, a p value represents the evidence despite being only a summary.
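As a rough sketch of that indexing, consider the simplest case: a two-sided z-test of a mean against a null value of zero, with a known standard deviation. That setup, the function names and the default values are assumptions made for illustration; they do not come from the papers discussed here. Given a p value and a sample size, the code recovers the size of the observed effect and, from it, the likelihood function.

```python
import math
from statistics import NormalDist


def observed_mean_from_p(p_two_sided, n, sigma=1.0):
    """Size of the observed mean implied by a two-sided p value and sample size.

    Assumes a z-test of H0: mu = 0 with known sigma (an illustrative setup).
    A two-sided p value cannot tell us the sign of the effect, so the
    positive value is returned.
    """
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return z * sigma / math.sqrt(n)


def relative_likelihood(mu, xbar, n, sigma=1.0):
    """Likelihood of mu given the observed mean, scaled so its maximum is 1."""
    return math.exp(-n * (xbar - mu) ** 2 / (2 * sigma ** 2))


def likelihood_curve(p_two_sided, n, mus, sigma=1.0):
    """Likelihood function indexed by (p, n), evaluated at each value in mus."""
    xbar = observed_mean_from_p(p_two_sided, n, sigma)
    return [relative_likelihood(mu, xbar, n, sigma) for mu in mus]


if __name__ == "__main__":
    grid = [i / 100 for i in range(-20, 61, 10)]
    for n in (10, 100):
        curve = likelihood_curve(0.05, n, grid)
        print(n, [round(v, 3) for v in curve])
```

In this sketch a p value of 0.05 always leaves the null value with a likelihood of about 0.15 relative to the best-supported value, whatever the sample size. The sample size determines how narrow the curve is around the observed effect.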
The connection between p values and likelihood functions is not widely appreciated, an unsurprising state of affairs given that likelihood itself is usually missing from statistics texts.
Valen Johnson, author of the paper Cumming’s article is based upon, derides p values in favour of “Bayes factors”, which are indices of how much the evidence in a dataset should alter the opinion of a rational observer as to the truth of competing hypotheses.
He shows that the relationship between p values and the Bayes factor that would be obtained from the same data is quite non-linear and concludes that p values misrepresent the evidence by substantially overstating it, as others have before.
Because the conventional p value cutoff for claiming “statistical significance” in the hybrid NHST procedure represents a lower level of evidence than he assumes an ordinary user of the method would take it to be, Johnson concludes that a more stringent cutoff should be required.
However, the Bayes factor Johnson uses is nothing more than the ratio of two points on a likelihood function, the same likelihood function as that indexed by the p value! The whole likelihood function is more informative than a single-point Bayes factor, just as it is more informative than the p value summary and, importantly, it supports interval estimation of the effect size in a manner that Geoff Cumming would approve of.
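To illustrate, here is a minimal sketch of a Bayes factor that compares two point hypotheses about a normal mean, assuming a known standard deviation. It is not Johnson’s procedure, and the function names and the 1/8 cutoff are hypothetical choices made for the example. It shows only that such a Bayes factor is the ratio of two heights on the likelihood function, and that the same function also yields an interval.

```python
import math


def normal_likelihood(mu, xbar, n, sigma=1.0):
    """Height of the likelihood function at mu, given an observed mean xbar."""
    se = sigma / math.sqrt(n)
    return math.exp(-0.5 * ((xbar - mu) / se) ** 2) / (se * math.sqrt(2 * math.pi))


def point_bayes_factor(mu_alt, mu_null, xbar, n, sigma=1.0):
    """Bayes factor for two point hypotheses: a ratio of two likelihood heights."""
    return normal_likelihood(mu_alt, xbar, n, sigma) / normal_likelihood(
        mu_null, xbar, n, sigma
    )


def likelihood_interval(xbar, n, sigma=1.0, cutoff=1 / 8):
    """Values of mu whose likelihood is at least `cutoff` times the maximum.

    The 1/8 default is an illustrative convention, not a rule.
    """
    se = sigma / math.sqrt(n)
    half_width = se * math.sqrt(2 * math.log(1 / cutoff))
    return xbar - half_width, xbar + half_width


if __name__ == "__main__":
    xbar, n = 0.196, 100  # made-up numbers giving z of about 1.96
    print(point_bayes_factor(xbar, 0.0, xbar, n))
    print(likelihood_interval(xbar, n))
```

The Bayes factor uses two values of the parameter and ignores the rest of the curve; the interval uses the curve’s whole shape.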
So what can we do?
More stringent criteria for claiming “significance” in a mishmash hybrid framework are not a sensible response to the need for better statistical support of scientific inference.
We do not need to discard p values to start using the evidential meaning of our data – we just need to understand their properties. The reforms needed are:
- to eschew the dysfunctional hybrid NHST in favour of significance tests
- to understand the relationship between p values and evidence
- to make more use of the likelihood functions that are indexed by the p values.
Most importantly, we need measured reasoning to go along with any p value or likelihood function: statistics can only provide part of the principled argument about evidence.
Further reading: The problem with p values: how significant are they, really?