Fisher did not take Neyman and Pearson’s criticisms well. In response, he called their methods “childish” and “absurdly academic.” In particular, Fisher disagreed with the idea of deciding between two hypotheses, rather than calculating the “significance” of the available evidence, as he’d proposed. Whereas a decision is final, his significance tests gave only a provisional opinion, which could later be revised. Even so, Fisher’s appeal for an open scientific mind was somewhat undermined by his insistence that researchers should use a 5 percent cutoff for a “significant” p-value, and his claim that he would “ignore entirely all results which fail to reach this level.”
Acrimony would give way to decades of ambiguity, as textbooks gradually muddled together Fisher’s null hypothesis testing with Neyman and Pearson’s decision-based approach. A nuanced debate over how to interpret evidence, with discussion of statistical reasoning and design of experiments, instead became a set of fixed rules for students to follow.
Mainstream scientific research would come to rely on simplistic p-value thresholds and true-or-false decisions about hypotheses. In this rote-learned world, experimental effects were either present or they were not. Medicines either worked or they didn’t. It wouldn’t be until the 1980s that major medical journals finally started breaking free of these habits.
Ironically, much of the shift can be traced back to an idea Neyman introduced in the early 1930s. With economies struggling in the Great Depression, he’d noticed a growing demand for statistical insights into the lives of populations. Unfortunately, governments had limited resources to study these problems. Politicians wanted results in months, or even weeks, and there wasn’t enough time or money for a comprehensive study. As a result, statisticians had to rely on sampling a small subset of the population. This was an opportunity to develop some new statistical ideas. Suppose we want to estimate a particular value, like the proportion of the population who have children. If we sample 100 adults at random and none of them are parents, what does this suggest about the country as a whole? We can’t say definitively that nobody has a child, because if we sampled a different group of 100 adults, we might find some parents. We therefore need a way of measuring how confident we should be in our estimate. This is where Neyman’s innovation came in. He showed that we can calculate a “confidence interval” for a sample, which tells us how often we should expect the true population value to lie within a certain range.
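To see how this plays out in practice, here is a minimal Python sketch of the zero-parents example above. It uses one standard construction, the Clopper-Pearson exact interval (a later development, not necessarily the method Neyman himself used), and a hypothetical true proportion of 30 percent to check the interpretation by simulation: across many repeated samples, the 95 percent interval should contain the true value roughly 95 percent of the time.

```python
import numpy as np
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial
    proportion, given k successes out of n trials."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# The example from the text: 100 adults sampled, none of them parents.
# The interval runs from 0 to roughly 3.6 percent, so we can't rule out
# that a few percent of the population are parents.
print(clopper_pearson(0, 100))

# Neyman's interpretation: repeat the sampling many times, and the
# interval should contain the true proportion about 95 percent of the time.
rng = np.random.default_rng(42)
true_p = 0.30  # hypothetical true proportion of parents
covered = 0
for _ in range(10_000):
    lo, hi = clopper_pearson(rng.binomial(100, true_p), 100)
    covered += lo <= true_p <= hi
print(covered / 10_000)  # close to (slightly above) 0.95; exact intervals are conservative
```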
Confidence intervals can be a slippery concept, given they require us to interpret tangible real-life data by imagining many other hypothetical samples being collected. Like those type I and type II errors, Neyman’s confidence intervals address an important question, just in a way that often perplexes students and researchers. Despite these conceptual hurdles, there is value in having a measurement that captures the uncertainty in a study. It’s often tempting, particularly in media and politics, to focus on a single average value. A single value might feel more definite and precise, but that sense of precision is an illusion. In some of our public-facing epidemiological analysis, my colleagues and I have therefore chosen to report only the confidence intervals, to avoid misplaced attention falling on specific values.
Since the 1980s, medical journals have put more focus on confidence intervals rather than standalone true-or-false claims. However, habits can be hard to break. The close relationship between confidence intervals and p-values hasn’t helped. Suppose our null hypothesis is that a treatment has zero effect. If our estimated 95 percent confidence interval for the effect doesn’t contain zero, then the p-value will be less than 5 percent, and based on Fisher’s approach, we will reject the null hypothesis. As a result, medical papers often pay less attention to the uncertainty interval itself than to the values it does or doesn’t contain. Medicine might be trying to move beyond Fisher, but the influence of his arbitrary 5 percent cutoff remains.
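That duality is easy to demonstrate. Below is a short, hypothetical sketch (simulated per-patient treatment effects, tested against zero with a one-sample t-test) showing that the 95 percent confidence interval excludes zero exactly when the two-sided p-value falls below 5 percent.

```python
import numpy as np
from scipy import stats

# Hypothetical data: per-patient treatment effects from a simulated trial
rng = np.random.default_rng(1)
effects = rng.normal(loc=0.4, scale=1.0, size=50)

n = len(effects)
mean = effects.mean()
se = effects.std(ddof=1) / np.sqrt(n)

# Two-sided p-value for the null hypothesis of zero effect (one-sample t-test)
t_stat = mean / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Matching 95 percent confidence interval for the mean effect
half_width = stats.t.ppf(0.975, df=n - 1) * se
ci = (mean - half_width, mean + half_width)

# The two criteria always agree: the interval excludes zero
# precisely when p < 0.05.
print(f"p = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(p_value < 0.05, not (ci[0] <= 0.0 <= ci[1]))
```

However the simulated data fall, the two answers agree, which is why a paper can report a confidence interval yet still, in effect, read it as a significance test.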
Excerpt adapted from Proof: The Uncertain Science of Certainty, by Adam Kucharski. Published by Profile Books on March 20, 2025, in the UK.