4 The frequentist statistical paradigm

Should probability enter [statistics] to capture degrees of belief about claims? To measure variability? Or to ensure we won’t reach mistaken interpretations of data too often in the long run of experience? Modern statistical methods grew out of attempts to systematize doing all of these.

— Deborah Mayo, Statistical Inference as Severe Testing

Let me ask you something. If the rule you followed brought you to this, of what use was the rule?

— Anton Chigurh, No Country for Old Men

In the previous chapter, we saw that there are many different ways to interpret probability. Some of those ways take probability to be describing some objective feature of the world, for example, relative frequencies that arise in repeated measurements. Others take probability to be describing something epistemic, for example, partial belief or incomplete information. These different interpretations of probability have given rise to different ways of producing statistical inferences. Frequentist statisticians take the so-called objective approach, assigning probabilities only to relative frequencies that arise over repeated measurements and repeated sampling. They use those probabilities to produce inferences to hypotheses or theories that may have given rise to the data. In this chapter, we’ll study the philosophical foundations and practical methods of the frequentist statistical inference paradigm. First, we’ll consider the formal construct—a statistical model—that provides the basis from which probability statements are made. Then, we’ll study the foundational tools of frequentist inference—including hypothesis testing, point estimation, and interval estimation—with special attention to philosophical justifications (or lack thereof).

4.1 Statistical models

Consider the following hypothesis about reading trends in America:

$H$: At least ten percent of Americans have read The Brothers Karamazov (BK).

As with all statistical hypotheses, $H$ refers to a population level parameter—in this case, the proportion of Americans who have read BK: \[ p = \frac{\# \text{ individuals in the population who have read {\it BK}}}{\# \text{ individuals in the population}}. \]
$H$ is difficult to confirm or falsify. Part of the difficulty lies in the fact that the population in question—all Americans—is very large.¹ It is not practical to observe all of the relevant information at the population level. Instead, we can gather data relevant to $H$ in a sample—a subset of the population. There are many different ways that a sample can be gathered from a given population. A data generating process (DGP) is a description of the sampling process that gives rise to the data. Consider the following DGP related to hypothesis $H$:

¹ And, more broadly, in many statistical problems, populations are, at least in theory, infinite. When testing the theory “All ravens are black”, we may be interested in learning about all existing ravens, a finite set. But we also may be interested in whether this property is true of any raven, as a general law, and not just of the finite number that actually exist.

DGP$_H$: Randomly sample $n$ Americans and record whether they have ($1$) or have not ($0$) read The Brothers Karamazov (BK).

A key feature of this DGP is the notion of a random sample. Randomness is notoriously difficult to define with precision (Eagle, 2010). For our purposes, we define a random sample as a sample such that each individual—in this case, each American—has the same chance of being included.² DGPs give rise to actual or observed data. For example, imagine a random sample of size $n = 15$ arising from DGP$_H$: \[ \mathbf{x} = (0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0). \] Intuitively, there is information in $\mathbf{x}$ relevant to $p$, and thus, $H$. Generally, sample information relevant to a parameter or hypothesis can be summarized through functions of the data, called statistics. In this case, the relevant statistic is the proportion of individuals in the sample that have read BK: \[ \widehat p = \frac{\# \text{ of individuals in the sample who have read {\it BK}} }{n}. \] By convention, in statistics, when we have a sample quantity that corresponds to a population level quantity, we use the same notation, but add a “hat” on top.

Eagle, A. (2010). Chance versus randomness. Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/chance-randomness/

² For now, we’ll put aside the question of how (if at all) one might actually achieve a truly random sample!

For the data $\mathbf x$ above, $\widehat p = 2/15$. $\widehat p$ does not precisely answer questions related to $p$ and $H$; it contains information from the sample only. It’s easy to see how $p$ and $\widehat p$ might be different. Imagine that researchers had gathered a different random sample of size $n$, say, $\mathbf{x'} = (x'_1,...,x'_n)$. Using $\mathbf{x'}$, the sample proportion, $\widehat p'$ would likely be different, based on the fact that $\mathbf{x'}$ would (very likely) include different individuals with different reading habits. It is likely that both $\widehat p$ and $\widehat{p'}$ would be different from $p$. How different? A statistical model can help us answer this and related questions.
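To make this sample-to-sample variability concrete, here is a minimal simulation sketch in Python. It computes $\widehat p$ for the data above and then draws a few hypothetical repeated samples. The value `p_true = 0.12` is an assumption chosen purely for illustration (the real $p$ is unknown), as are the fixed seed and the choice to draw each response independently.

```python
import random

# Observed data from the text (n = 15).
x = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def sample_proportion(data):
    """Proportion of ones in a 0/1 sample."""
    return sum(data) / len(data)


p_hat = sample_proportion(x)  # 2/15 for the data above

# Hypothetical population proportion, assumed only for illustration.
p_true = 0.12
rng = random.Random(0)  # fixed seed so the run is reproducible


def draw_sample(n, p, rng):
    """Draw n independent 0/1 responses, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


# Sample proportions from five hypothetical repeated samples of size 15.
repeated_p_hats = [
    sample_proportion(draw_sample(15, p_true, rng)) for _ in range(5)
]

print(p_hat)
print(repeated_p_hats)
```

Each run of `draw_sample` plays the role of a different $\mathbf{x'}$; the resulting proportions will generally differ from one another and from `p_true`.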

A statistical model is a set of formal assumptions about the DGP that gives rise to observed data. Specifically, a statistical model makes the assumption that actual data $\mathbf{x} = (x_1,...,x_n)$ are realizations of a stochastic (probabilistic) process $\mathbf{X} = (X_1,...,X_n)$. $\mathbf{X}$ is chosen to encode information from the DGP (e.g., DGP$_H$), with a connection to the theory or hypothesis of interest (e.g., $p$, $H$). The assumption that actual data, $\mathbf{x}$, come from a stochastic process, $\mathbf{X}$, is often justified in terms of repeated sampling or measurement error. If we had observed the same phenomena (e.g., sampling process, experiment, physical process) again under sufficiently similar conditions, we would have observed different values, say, $\mathbf{x'} = (x'_1,...,x'_n)$. The statistical model quantifies how likely all of the possible datasets are under various hypotheses. For example, possible data \[ \mathbf{x'} = (1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1), \] with $\widehat{p'} = 9/15$, would be relatively likely under the hypothesis that $p = 0.7$; on the other hand, the actual data $\mathbf x$ (with $\widehat{p} = 2/15$) would be relatively unlikely under that hypothesis. For frequentist statisticians, these probabilities—probabilities assigned to various actual and hypothetical datasets—are relevant to what we ought to infer about parameters and hypotheses. Crucially, these probabilities are the only relevant ones for inferences about parameters and hypotheses. Probability theory enters the frequentist statistical model to quantify and subsequently test expectations we have about which data sets would and would not arise under various hypotheses.³

³ Howson & Urbach (2005) provide a justification for this assumption in the context of measurement error:

The assumption of...[stochasticity] is more or less realistic in many cases, for instance, where an instrument is used to measure some physical quantity. The instrument would, as a rule, deliver a spread of results if used repeatedly under similar conditions, and experience shows that this variability, or error distribution, often approximates a normal curve.

Howson, C., & Urbach, P. (2005). Scientific reasoning: The bayesian approach. Open Court.

At this point, we note that statistical models, as frequentists define them, are not capable of the kind of probabilistic inference demanded under probabilism. Recall from Chapter 2 that probabilism is the philosophical view that all inferences from data to hypotheses must be encoded as probability values assigned to hypotheses, given the available data. Under the frequentist formulation of a statistical model, such inferences are not possible, because the model is by design incapable of assigning probabilities to hypotheses. Instead, hypotheses are encoded as statistical model parameters, which are assumed to be non-random features that govern probability assignments.

In the simplest cases, the DGP produces independent and identically distributed (iid) data. Informally, for $i=1,...,n$, and $j = 1,...,n$:

$X_i$ is independent from $X_j$ if the occurrence of $X_i$ does not influence the probability of occurrence of $X_j$, for all $i,j = 1,...,n$, $i \ne j$.

$X_1,...,X_n$ are identically distributed if they have the same probability distribution, i.e., the same “shape” (e.g., normal, binomial), center, scale, etc.

With respect to DGP$_H$, if we assume that (i) no individual influences any other with respect to reading BK, and that (id) the probability, $p$, that any randomly selected person in the population has read BK is the same for each person, then DGP$_H$ can be modeled by $\mathbf{X} = (X_1,...,X_n)$, where each $X_i$ has a Bernoulli distribution.⁴ Thus, by definition, $X_i$ is equal to either $1$ with probability $p$, or $0$ with probability $1-p$, and has the following probability distribution function (pdf): \[ \begin{aligned} f(x_i; p) = p^{x_i}(1-p)^{1-x_i}, \,\,\, x_i \in \{0,1\}. \end{aligned} \] The pdf $f(x_i; p)$ provides an answer to the question: under the assumption of a specific value of $p$, what is the probability that the random variable $X_i$ is equal to some specific value $x_i$ (where $x_i = 0$ or $x_i = 1$)? Under the iid assumptions, the joint pdf associated with the process is the product of the pdfs for each $X_i$:⁵ \[ f(\mathbf{x}; p) = \prod_{i=1}^n f(x_i; p) = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}, \,\,\, \mathbf{x} \in \{0,1\}^n. \] The joint distribution $f(\mathbf{x}; p)$ answers questions like: under a specific value of $p$, what is the probability of observing $\mathbf{x}$? A statistical model consists of a stochastic process, $\mathbf X$, along with the family of pdfs, $f(\mathbf x; p)$: \[ \begin{aligned} \mathcal{M}_{p}(\mathbf{x}) = \{\, \left(\, \mathbf{X}, \, f(\mathbf{x}; p) \, \right) : p \in (0,1) , \, \, x_i \in \{0,1\} \}. \end{aligned} \] By a family of probability distribution functions, we mean the collection of $f(\mathbf{x}; p)$ for all possible values of $p \in (0,1)$. Since we do not know which value of $p$ generated the observed data $\mathbf x$, the statistical model encodes the family of distributions as the set of all possible explanations of the data. Intuitively, some distributions in this family are much more plausible than others as explanations for the actual data; for DGP$_H$, the “sub-family” of distributions singled out by $p \in (0.9,1)$ is implausible as an explanation for the actual data $\mathbf{x}$.
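The joint pdf is simple enough to evaluate directly. The sketch below (the function name `joint_pdf` and the particular values of $p$ are illustrative choices, not part of the model) evaluates $f(\mathbf{x}; p)$ for the actual data $\mathbf{x}$ and the possible data $\mathbf{x'}$ from above at a few members of the family.

```python
def joint_pdf(x, p):
    """Joint Bernoulli pdf f(x; p) = p^s * (1 - p)^(n - s), with s = sum(x)."""
    s = sum(x)
    n = len(x)
    return p**s * (1 - p) ** (n - s)


x = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]        # actual data, 2 ones
x_prime = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # possible data, 9 ones

# Evaluate both datasets under three illustrative parameter values.
for p in (0.1, 0.7, 0.95):
    print(p, joint_pdf(x, p), joint_pdf(x_prime, p))
```

Under $p = 0.7$, the value for $\mathbf{x'}$ is far larger than the value for $\mathbf{x}$; and under $p = 0.95$, a member of the sub-family $p \in (0.9, 1)$, the actual data $\mathbf{x}$ receive a far smaller probability than they do under $p = 0.1$.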

⁴ Strictly speaking, these assumptions are not realistic. Friends influence each other on what they read, rendering independence unrealistic. And some individuals are much more likely to read Russian literature than others, rendering the identically distributed assumption unrealistic. The implausibility of these assumptions means that an iid statistical model would produce less-than-optimal inferences. An inference is only as good as the model used to produce it! There are some more advanced modeling assumptions that could relax these assumptions. For example, we could control for friend groups, book clubs, and types of readers.

⁵ The joint distribution simplifies to the product of the individual (marginal) distributions because the random variables $X_i$ are independent of one another.

4.1.1 Frequentist probability and frequentist statistics

It is worth emphasizing again that the statistical model described above uses probability theory in one (and only one) way: to model the process that gives rise to observed data. Possible samples (i.e., data) that arise from the DGP are modeled as events in a probabilistic process, and are thus assigned probabilities under an assumed parameter value or hypothesis. Statistical models in the frequentist statistical paradigm do not assign probabilities to parameters (e.g., $p$ in the example above), functions of parameters (e.g., $\text{Var}(X_i) = p(1-p)$), or hypotheses (e.g., $H$). In summary, $\mathbf X = (X_1,...,X_n)$ is probabilistic but $p$ and $H$ are not; instead, they are fixed constants—in the case of $p$—or descriptions of fixed features of the world—in the case of $H$.

Frequentists adopt this use of probability largely on philosophical grounds, both endorsing the objective approach of tethering probability to relative frequencies that are, in theory at least, observable; and rejecting subjective probability as philosophically unsuitable. Frequentist statistician Ronald Fisher argued that “probability is a ratio of frequencies, and about the frequencies of such [hypotheses] we can know nothing whatever” (Fisher, 1922). Similarly, economist, philosopher, and frequentist statistician Aris Spanos explicitly connects statistical models with the frequentist interpretation of probability, and rejects the subjective interpretation of probability as unscientific:

a statistical model is built upon the systematic information contained in the observed data in an attempt to provide an appropriate description of the stochastic mechanism that gave rise to the data. The stance that observed data $\mathbf x=(x_1, ..., x_n)$ contain systematic statistical information in the form of chance regularity patterns, “stochasticity” (randomness), is a feature of real-world phenomena and exists independently of one’s beliefs; its appropriateness can be tested against the data. Moreover, the frequentist interpretation of probability provides a way to relate these regularities to abstract statistical models in a manner that renders the probabilistic assumptions of the model testable vis-à-vis data $\mathbf x$. In addition, learning from data about observable stochastic phenomena cannot be solely in the mind of a particular individual, having to do with revising an individual’s degrees of belief represented by a prior and a posterior distribution. Scientific knowledge needs to be testable and independent of one’s beliefs (Spanos, 2019).

We see in Spanos a commitment to objectivity and testability: statistics is about using empirical information to test hypotheses. We now turn our attention to the ways in which frequentists achieve these goals in practice, by studying some of the core tools used in frequentist statistical inference: hypothesis testing, maximum likelihood estimation, and confidence interval estimation.

4.2 Hypothesis testing: logic and applications

Hypothesis testing—also known as significance testing or null hypothesis significance testing (NHST)—is a primary frequentist tool for learning about hypotheses from data. As we saw in Chapter 2, the logic and philosophy of hypothesis testing emerged out of Karl Popper’s falsificationist framework as a “solution” to the problem of induction.⁶ Popper’s solution was to circumvent induction using falsification. According to Popper, we cannot justify the inference from: P: “All observed ravens are black” to T: “Therefore, all ravens are black”. But we can make scientific progress, according to Popper, through falsification and corroboration. That is, scientific progress is made by probing the world in the right ways in an attempt to show that scientific theories or hypotheses are false. In this view, T is only valuable as a scientific theory if it is possible that T be proven false through empirical observation. If we find at least one non-black raven, we have falsified our theory, and thus, can update it accordingly. Contrast T with the theory A: “repressed childhood trauma influences adult behavior.” What kind of observable evidence could render A false? Ostensibly any adult behavior can be viewed through the lens of repressed childhood trauma (or lack thereof). Thus A is unfalsifiable and, at least in Popper’s view, not suitable as a scientific theory.⁷

⁶ Recall that the problem of induction refers to the lack of logical justification for inductive inference.

⁷ There is much to be said about the general philosophical value of the criterion of falsifiability. My view is that falsifiability is a valuable criterion for inference, but its value can be limited, for at least two reasons. First, and most relevant to the philosophy of statistics, many scientific theories, even if theoretically falsifiable, are not practically falsifiable. The claim $H$: a high carbohydrate diet causes an increase in body weight, discussed in Chapter 2 is one such example. In theory, we can imagine a world in which any time one eats a high carbohydrate diet, they gain weight. But the real world is messy. Any sample that suggested that, on average, a high carbohydrate diet reduced weight would not count as an unequivocal falsification; there are various alternative explanations one might produce to “save the theory” $H$.

Second, and more broadly, many of the most essential questions about being a human being—questions related to loss, death, love, morality, God, and meaning—have answers worth engaging with, but that are not falsifiable. Further, some of these questions may lack any answer that is falsifiable. If we take Popper seriously, as did many logical positivist philosophers of the 20$^{th}$ century, we should not entertain any of these essential questions; they are outside of the scope of what can be known scientifically. My view is that such suggestions diminish the richness and complexity of human experience.

Popper’s views were philosophical and couched in the language of science, broadly construed. He was not a statistician. Statisticians, such as Fisher, Jerzy Neyman (1894 – 1981), and Egon Pearson (1895 – 1980)—and later, contemporary philosophers of statistics Deborah Mayo and Aris Spanos—developed the statistical formulations of Popper’s falsificationist ideas. Fisher, for example, wrote that “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis”, strongly echoing Popper’s falsificationist ideas (Fisher, 1935). However, in statistical situations, where data are modeled as stochastic processes, experiments cannot strictly falsify—or disprove—hypotheses. Statistical hypotheses are almost always falsifiable, in the sense that they can, in theory, be proven false. But they are rarely like T above, where we know exactly what information would falsify them (i.e., one non-black raven). Instead, statistical hypotheses are often logically consistent with many different datasets and pieces of evidence. This fact makes statistical inference much harder to conduct. What does it look like for data to “sort of” but not strictly falsify a hypothesis? What rules (if any) might govern rejecting or corroborating hypotheses when, strictly speaking, the observed data are logically consistent with all the hypotheses in question?

Falsification, like the frequentist statistical modeling paradigm described in the previous section, is an implicit rejection of probabilism. Popper was starkly opposed to so-called confirmation theories that used probability as a measure of theory or hypothesis confirmation. “It is often assumed that the degree of confirmation of $x$ by $y$ must be the same as the (relative) probability of $x$ given $y$...My first task is to show the inadequacy of this view” (Mayo, 2018; Popper, [1959] 2005).⁸ More broadly, Popper also rejected epistemic notions of probability—subjective, logical, or otherwise. Inference about scientific hypotheses would need to proceed with different philosophical and technical tools. Rather than probabilism, Popper and his statistician successors were interested in other philosophical values, such as long-run performance, error, and the notion of a “severely tested” or “well-probed” hypothesis. In the context of statistical inference, the statistical model described in 4.1 provides the means for operationalizing and quantifying these philosophical values. Under a properly specified statistical model and hypothesis test, if observed data $\mathbf x$—along with data more “extreme”—are improbable under an assumed hypothesis $H$, then $H$ may be considered suspect. If enough such tests (with different data) are conducted to probe $H$, and those tests point in the same direction, we may consider $H$ statistically falsified; it has been well-probed and thus severely tested. Similarly, if a statistical model and series of tests attempt to find discrepancies between data and hypothesis, and fail to do so, we may consider $H$ corroborated based on it being well-probed and thus severely tested.

Popper, K. ([1959] 2005). The logic of scientific discovery. Routledge.

⁸ Popper uses general symbols, $x$ and $y$ here. Clearly, $x$ is meant to stand in for a scientific hypothesis, and $y$ evidence or data.

⁹ Mayo & Cox (2011) describe this version as providing the “core elements of significance testing in a version very strongly related to but in some respects different from both Fisherian and Neyman-Pearson approaches, at least as usually formulated.” However, the careful attention to error rates and the deemphasis on p-values suggest that it is much closer to Neyman and Pearson’s testing.

Mayo, D. G., & Cox, D. (2011). Frequentist statistics as a theory of inductive inference. In D. G. Mayo & A. Spanos (Eds.), Error and inference: Recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science. Cambridge University Press.

Historically, there are two different approaches to statistical hypothesis tests: Neyman and Pearson’s inductive behavior method and Fisher’s inductive inference method. These two approaches have been adapted and blended in various ways. As we will see in 1.3.5, some of the ways of blending these different historical approaches lead to undesirable inferences and consequences (Gigerenzer, 2004). But some blended methods, such as those outlined in Mayo & Cox (2011) and Mayo (2018), provide a stronger foundation to the edifice of frequentist inference.⁹ After some analysis and examples, we will discuss philosophical and interpretative differences between the Fisher and Neyman-Pearson frameworks.

First, let’s generalize our notation for a statistical model, so that we can describe statistical hypothesis testing at a high level of generality, i.e., not just in the context of $H$ and DGP$_H$ above. Denote the statistical model associated with data $\mathbf{x} = (x_1,...,x_n)$, as \[ \begin{aligned} \mathcal{M}_{\boldsymbol\theta}(\mathbf{x}) = \{\, \big(\, \mathbf{X}, \, f(\mathbf{x}; \boldsymbol\theta) \, \big) : \boldsymbol\theta \in \boldsymbol\Theta, \, \, \mathbf{x} \in \mathcal{X}\}, \end{aligned} \] where

$\boldsymbol\theta$ is a parameter (or vector of parameters) that encodes research hypotheses about empirical processes. $\boldsymbol\Theta$ is the parameter space, and includes all of the possible values of the parameter $\boldsymbol\theta$.
$f(\mathbf{x}; \boldsymbol\theta)$ represents the family of joint probability distribution functions that describe the stochastic process given by $\mathbf X$. $\mathcal{X}$ is the set within which the data live (e.g., $\mathcal{X}$ might be $\mathbb{R}^n$ or $(0,1)^n$).

The statistical model $\mathcal{M}_{\boldsymbol\theta}(\mathbf{x})$ is used to conduct a hypothesis test according to the following steps:

Specify two hypotheses, \[ \begin{aligned} H_0&: \boldsymbol\theta \in \boldsymbol\Theta_0 \\ H_1&: \boldsymbol\theta \in \boldsymbol\Theta \setminus \boldsymbol\Theta_0 \end{aligned} \] where $\boldsymbol\Theta_0 \subset \boldsymbol\Theta$. $H_0$ is referred to as the null hypothesis. The word “null” here is short for “nullified”; that is, the null hypothesis is the one that the researcher seeks to falsify.¹⁰ We typically think of null hypotheses as some standard, status quo, or default theory, to be overturned in the face of strong evidence to the contrary. In the simplest case, where $\boldsymbol\Theta_0 = \{\boldsymbol\theta_0 \}$, the statistical model under $H_0$ is reduced to a single distribution over $\mathbf{X}$, because the null hypothesis contains only a single point. $H_1$ is referred to as the alternative hypothesis. $H_1$ contains all values in the parameter space that are not contained in $H_0$.

¹⁰ As noted in Gigerenzer (2004), “null” does not necessarily refer to a parameter value that reflects “no effect”, “no correlation” or “no relationship” among variables.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.

Decide on a distance measure, $d(\mathbf{X})$, for the sample, under $H_0$. This measure is called the test statistic, and is a mapping from the $n$-dimensional DGP space $\mathcal{X}$ to a subset of the real numbers. As a function of the random variables that comprise the DGP, $d(\mathbf{X})$ is a random variable; in many cases, its distribution can be formally derived or estimated from the probability distribution, $f(\mathbf{x}; \boldsymbol\theta)$, under the statistical model, $\mathcal{M}_{\boldsymbol\theta}(\mathbf{x})$. An intuitive way to understand the distribution of $d(\mathbf{X})$ is to consider how $d(\mathbf{X})$ would vary across repeated samples from the population under $H_0$.

Specify a rejection region or critical region. The former is a region of the output of $d(\mathbf{X})$ that corresponds to a “rare” dataset, under $H_0$. The latter is a region of the $n$-dimensional input space of $\mathbf{X}$ that would correspond to a “rare” dataset under $H_0$.

Collect the relevant data $\mathbf{x}$, according to the DGP, and calculate $d(\mathbf{x})$ under $H_0$. If $d(\mathbf{x})$ falls within the pre-specified rejection region, then we may infer that the data indicate a genuine deviation from $H_0$. If $d(\mathbf{x})$ falls outside of the rejection region, then we do not have an indication of a genuine deviation from $H_0$ (Mayo, 2018). Importantly, for this formulation of hypothesis testing, note that the decisions involved in steps 1–3 ought to be made prior to observing the actual data $\mathbf{x}$. As we will see below, making these decisions “pre-data”—rather than “post-data”—is critical to the strength of the logic of hypothesis testing.

These steps involve consequential choices. How does one choose a null hypothesis, test statistic, or rejection region? And based on these choices, what inferences might one make? Let’s answer these questions by working through steps one through four using the research hypothesis $H$, DGP$_H$, and actual data, $\mathbf x$, from 4.1:

$H$: At least ten percent of Americans have read BK.

DGP$_H$: Randomly sample $n$ Americans and record whether they have ($1$) or have not ($0$) read BK.

Data: $\mathbf{x} = (0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0).$

When constructing steps one through three, we will put ourselves in the position of statisticians designing a test, prior to observing the actual data. Then, we imagine collecting the actual data and computing the results in step four.

Parameter space, null and alternative hypotheses. In this case, since our parameter is a proportion, the parameter space is the interval $\mathcal{P} = (0,1)$. Broadly speaking, the null hypothesis ought to be some subset of $\mathcal{P}$. Given that our research hypothesis is related to how often Americans might read a notoriously long and difficult book known for being left unread,¹¹ a reasonable null hypothesis that represents the status quo might be that “the proportion of Americans who have read BK is less than or equal to ten percent”. The alternative hypothesis would then be the complement of the null hypothesis. Symbolically: \[ \begin{aligned} H_0&: p \le 0.1 \\ H_1&: p > 0.1. \end{aligned} \] By convention, we always include the boundary of the null-alternative division—i.e., the “equal to” portion—within the null hypothesis.

¹¹ https://bit.ly/3DK9tIM

Test statistic as a distance measure. The intuitive distance measure is $d(\mathbf{X}) = \widehat p = \sum X_i/n$. Under $H_0$, $\widehat p$ will differ from sample to sample and track how “rare” the sample is. For example, samples yielding $\widehat p = 2/15$ or less (i.e., a sample with at most two ones) would not be rare under $H_0$. Samples yielding $\widehat p = 9/15$ or greater (i.e., a sample with at least nine ones) would be very rare under $H_0$.
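We can put rough numbers on “not rare” and “very rare”. The sketch below relies on a standard fact not derived in the text: under the iid Bernoulli model, the count $\sum X_i$ has a binomial distribution. It evaluates the two probabilities at the boundary value $p = 0.1$, which is one illustrative member of $H_0$; the helper names are our own.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(sum of n iid Bernoulli(p) variables equals k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k, n, p):
    """P(sum <= k)."""
    return sum(binom_pmf(j, n, p) for j in range(0, k + 1))


def prob_at_least(k, n, p):
    """P(sum >= k)."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))


n, p0 = 15, 0.1
p_at_most_two = prob_at_most(2, n, p0)     # P(p_hat <= 2/15; p = 0.1)
p_at_least_nine = prob_at_least(9, n, p0)  # P(p_hat >= 9/15; p = 0.1)

print(p_at_most_two, p_at_least_nine)
```

At $p = 0.1$, samples with at most two ones occur with probability of roughly $0.82$, while samples with at least nine ones have probability on the order of $10^{-6}$.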

The rejection region. There are intuitively better and worse choices for a rejection region. Regions that include values of $\widehat p$ that are far from the values of $p$ specified in $H_0$ are good choices. For example, $\mathcal R = [\frac{4}{15}, 1]$ is a reasonable choice; values of $\widehat p \ge \frac{4}{15}$ would be rare in a population where at most ten percent of people have read BK. We’ll use $\mathcal R$ as our rejection region for this test. Note that $\mathcal R' = (0, \frac{3}{15})$ is not a reasonable choice for $H_0$, because $\mathcal R'$ would reject on samples that yield $\widehat p$ close to $0.1$ (e.g., $\widehat p = 1/15$ or $\widehat p = 2/15$). But such values ought not constitute evidence against $H_0$.

Inference or decision. The actual data $\mathbf{x}$ yield $d(\mathbf x) = \widehat p = 2/15$. In this case, $d(\mathbf{x})$ falls outside of the pre-specified rejection region. As such, we do not have an indication of a genuine deviation from $H_0$—we fail to reject it. Results like this one that fall outside the rejection region are sometimes said to fail to reach statistical significance (or are said to be statistically insignificant). Conversely, if the actual data yielded a test statistic that fell inside the rejection region, the result would sometimes be called statistically significant.¹²

¹² As we will discuss below in 1.3, the term statistically significant can be misleading, and is not necessarily suggestive of a large effect size or practically significant deviation from the null hypothesis.

What have we actually inferred or concluded with this test? Under the falsificationist logic of Popper, we have failed to falsify—or failed to reject—the null hypothesis. That does not necessarily mean that we ought to accept the null hypothesis. With one test and a relatively small sample size, our result is suggestive of a lack of evidence against $H_0$, but not clear evidence for it; it is possible that, in accepting $H_0$, we might be in error. In 1.4, we will discuss tools for corroborating hypotheses, including null hypotheses.

Note that, if the data were \[ \mathbf{x'} = (1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1), \] then $\widehat{p'} = 9/15$ would fall in the rejection region, and thus we would have some evidence against $H_0$. If a dichotomous decision needed to be made, we would reject $H_0$. Again, the strength of the evidence depends on the sample size and whether other similar tests have been conducted to further probe the relevant hypotheses.
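The mechanics of the last step can be sketched in a few lines of Python. The function name `run_test` is hypothetical, and the rejection region $\mathcal R = [\frac{4}{15}, 1]$ is hard-coded as the default threshold. Exact fractions are used so that the comparison at the boundary of the region is not affected by floating-point rounding.

```python
from fractions import Fraction


def sample_proportion(data):
    """Exact sample proportion of ones in a 0/1 sample."""
    return Fraction(sum(data), len(data))


def run_test(data, threshold=Fraction(4, 15)):
    """Return (p_hat, reject), where reject is True if p_hat lies in [threshold, 1]."""
    p_hat = sample_proportion(data)
    return p_hat, p_hat >= threshold


x = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]        # actual data
x_prime = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # possible data

print(run_test(x))        # p_hat = 2/15: outside the rejection region
print(run_test(x_prime))  # p_hat = 9/15 = 3/5: inside the rejection region
```

Note that the threshold is fixed before either dataset is passed in, mirroring the requirement that steps one through three be settled pre-data.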

We can think of a hypothesis test as a procedure or recipe, applied to DGPs, for the purpose of producing an inference. The procedure involves: specifying a statistical model based on the DGP; using the model to choose a null hypothesis, alternative hypothesis, test statistic (with induced sampling distribution), and rejection region; and deriving a conclusion about actual data based on whether the computed test statistic falls inside or outside of the rejection region. As a procedure, the hypothesis test may work well, or not so well, at achieving the goals of inference. What criteria might we use to evaluate how well a hypothesis test achieves its goals of inference?

4.2.1 Error control

One useful measure in evaluating the hypothesis test procedure is long run error: if we imagine using the procedure over and over again, on similar data—i.e., possible data arising from the DGP, $\mathbf{X}$—how often would we produce an erroneous inference? That is, how often would we either reject $H_0$—because $d(\mathbf{x})$ falls in the rejection region—when $H_0$ is, in fact, true; or fail to reject $H_0$—because $d(\mathbf{x})$ fell outside of the rejection region—when $H_0$ is false? The first of these errors—incorrectly rejecting the null hypothesis—is called a type I error (or colloquially, a false positive). The second—incorrectly failing to reject the null hypothesis—is called a type II error (or colloquially, a false negative). Ideally, we would never make either of these errors. But with uncertain inference, they are bound to happen. Our goal, then, according to frequentist statisticians like Neyman and Pearson, should be to construct our hypothesis test procedure in a way that minimizes how often each of these errors occurs. Since “often” here refers to the relative frequency of occurrence over the hypothetical repeated use of the hypothesis testing tool on data arising from the DGP—i.e., over “the long run”—frequentist statisticians are perfectly content with considering the probability of each of these errors. This application of probability is perfectly consistent with the relative frequency interpretation!
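The “long run” can be imitated by simulation. The following sketch repeatedly generates data from the iid Bernoulli model and records how often the $BK$ test, with rejection region $\widehat p \ge 4/15$ (i.e., at least four ones when $n = 15$), rejects $H_0$. The true values $p = 0.1$ and $p = 0.3$, the number of repetitions, and the seed are all assumptions made for illustration.

```python
import random


def simulate_rejection_rate(p, n=15, k_reject=4, reps=20000, seed=1):
    """Estimate how often the test rejects H0 when the true proportion is p."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # One hypothetical dataset: count the ones among n iid Bernoulli(p) draws.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        if successes >= k_reject:
            rejections += 1
    return rejections / reps


# H0 is true (boundary case p = 0.1): every rejection is a type I error.
type_1_rate = simulate_rejection_rate(p=0.1)

# H0 is false (hypothetical p = 0.3): every failure to reject is a type II error.
type_2_rate = 1 - simulate_rejection_rate(p=0.3)

print(type_1_rate, type_2_rate)
```

The two relative frequencies approximate the type I and type II error probabilities for these particular parameter values; a different assumed alternative would give a different type II rate.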

Error rates are a function of the sample size, $n$; the choice of hypotheses; the choice of test statistic; and the choice of a rejection region. Generally, for a fixed test statistic and rejection region, the larger the sample size, the lower the type I and type II error rates; more information (a larger $n$) allows us to develop a procedure that is more sensitive to uncovering the truth. The choice of $n$ might be constrained by time, money, or other resources. We may not have enough time or money to survey a large number of individuals about their reading habits, or to closely monitor individuals’ diets and body weights. In the age of “big data”, obtaining a large sample may not seem like an issue. But there are many contexts in which good, careful data collection is difficult, and sample sizes may end up being smaller than one might hope. Further, even in contexts where data are plentiful, we often want to make very targeted inferences. “Targeted” often means that we want to condition on many variables; for example, what percentage of American women with a college degree and some knowledge of the Russian language have read $BK$? Conditioning on all of these variables effectively reduces the sample size.
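To make the role of $n$ concrete, here is a minimal sketch that keeps the rule "reject when $\widehat p \ge 4/15$" fixed and recomputes both error rates for a few sample sizes. The sample sizes $30$ and $60$, the alternative value $p = 0.3$, and the helper names are illustrative assumptions, not part of the running example's design.

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y) for Y ~ binom(n, p)."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def upper_tail(k, n, p):
    """P(Y >= k) for Y ~ binom(n, p)."""
    return sum(binom_pmf(y, n, p) for y in range(k, n + 1))

def error_rates(n, p_null=0.1, p_alt=0.3):
    """Error rates for the rule 'reject H0 when p_hat >= 4/15'.

    p_null is the boundary of H0; p_alt is an assumed alternative value.
    """
    k = -(-4 * n // 15)  # smallest count y with y / n >= 4/15 (integer ceiling)
    alpha = upper_tail(k, n, p_null)    # type I error rate at the boundary
    beta = 1 - upper_tail(k, n, p_alt)  # type II error rate at p_alt
    return k, alpha, beta

rates = {n: error_rates(n) for n in (15, 30, 60)}
for n, (k, alpha, beta) in rates.items():
    print(f"n = {n:2d}: reject when Y >= {k:2d}, alpha = {alpha:.4f}, beta = {beta:.4f}")
```

For these particular values, both error rates shrink as $n$ grows, which is the pattern described above; other rejection rules or alternatives would need to be checked separately.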

There are various methods for choosing test statistics, including pivot quantities, likelihood ratio tests, and sufficient statistics.¹³ As in the $BK$ example, test statistics are often intuitive. For now, we’ll instead focus on the choice of rejection region and its role in error rates and discuss test statistic choices below, in the section on “best tests”.

¹³ A rigorous course in mathematical statistics, e.g., one using Casella & Berger (2024) or Corcoran (2022), would give a strong foundation in choosing test statistics.

4.2.1.1 Type I error

In our example above, we chose our rejection region to be $\mathcal R = [\frac{4}{15}, 1]$. The probability of a type I error—also called the significance level or size of the test and denoted $\alpha$—is the probability, under the null hypothesis, that the test statistic (as a random variable, prior to observing the actual data) falls in the rejection region. That is, \[ \begin{aligned} \alpha = P(\text{Type I error}) &= P(d(\mathbf X) \in \mathcal R \, ; \, H_0), \end{aligned} \] where, in our example $d(\mathbf X) \in \mathcal R$ is equivalent to $\widehat p \ge \frac{4}{15}$, and $H_0: p \le 0.1$. How do we calculate a probability under $H_0$, which, in this example, is specified by a range of values, rather than a single value—sometimes called a compound hypothesis? The typical convention has been to use the value of the parameter under $H_0$ that maximizes the probability: \[ \begin{aligned} \alpha = P(\text{Type I error}) &= \max_{p \in H_0}P(d(\mathbf X) \in \mathcal R \, ; \, H_0). \end{aligned} \] The value on the border between the null and alternative hypotheses will maximize these probabilities. Thus, \[ \begin{aligned} \alpha = P(\text{Type I error}) &= \max_{p \in H_0}P(d(\mathbf X) \in \mathcal R \, ; \, H_0) \\ & = P(\widehat p \ge \frac{4}{15} \, ; \, p = 0.1) \end{aligned} \]

Given DGP$_H$, this probability is equivalent to the probability that the probabilistic process $\mathbf X$ produced data that had four or more ones. We can compute this probability by defining a new random variable, $Y$, that counts up the number of ones in $\mathbf X$. It can be shown that $Y$ has a binomial distribution, with parameters $n = 15$ and $p$: $Y \sim \text{binom}(n,p)$.¹⁴ Thus, \[ \begin{aligned} \alpha = P(\text{Type I error}) &= \max_{p\in H_0}P(d(\mathbf X) \in \mathcal R \, ; \, H_0) \\ &= P(\widehat p \ge \frac{4}{15} \, ; \, p = 0.1) \\ &= P(Y \ge 4 \, ; \, n = 15, p = 0.1) \\ &\approx 0.056. \end{aligned} \]

¹⁴ For reference, the binomial distribution with parameters $n$ and $p$ is defined as the number of successes that occur within a string of $n$ independent Bernoulli trials, each with probability $p$.

The probability mass function for Y ∼ binom(n = 15, p) under H₀ : p = 0.1. The gold region represents the rejection region ℛ. Under H₀, the probability of falling in ℛ is α ≈ 0.056.

That is, if the null hypothesis is true, and the testing procedure were applied to data produced by DGP$_H$, it would produce an erroneous inference just above $5\%$ of the time. That’s not too bad! The figure above shows a visualization of the probability distribution for $Y$, along with rejection region $\mathcal R$, colored in gold. The probability of the gold region (the sum of the heights of the gold bars) is $\alpha \approx 0.056$.
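The calculation of $\alpha$ can be checked numerically. The sketch below, using only Python's standard library, sums the binomial probabilities for $Y \ge 4$ and also evaluates the rejection probability at a few smaller values of $p$ inside $H_0$ (these extra values are chosen only for illustration) to show that the boundary value $p = 0.1$ gives the largest one.

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y) for Y ~ binom(n, p)."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def rejection_prob(p, n=15, k=4):
    """P(Y >= k; n, p): probability that p_hat lands in R = [k/n, 1]."""
    return sum(binom_pmf(y, n, p) for y in range(k, n + 1))

# Rejection probability at several values of p inside H0: p <= 0.1.
for p in (0.02, 0.05, 0.08, 0.10):
    print(f"p = {p:.2f}: P(reject H0) = {rejection_prob(p):.4f}")

alpha = rejection_prob(0.10)  # size of the test, attained at the boundary
print(round(alpha, 3))  # 0.056
```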

Can we do better, by changing the rejection region? Surely. Consider $\mathcal R^* = [5/15, 1]$. The probability of type I error for this rejection region is \[ \begin{aligned} P\left(\text{Type I error} \right)&= \max_{p \in H_0}P(d(\mathbf X) \in \mathcal R^* \, ; \, H_0) \\ & = P(\widehat p \ge 5/15 \, ; \, p = 0.1) \\ &= P(Y \ge 5 \, ; \, n = 15, p = 0.1) \\ &\approx 0.013. \end{aligned} \] But this lower rate of type I error comes at a cost. The type I error rate is inversely related to the type II error rate. So, there is a tradeoff: for a given sample size, if we move the rejection region to reduce the rate of type I error, we will increase the rate of type II error.

4.2.1.2 Type II error

The probability of a type II error, often denoted $\beta$, is the probability, under an alternative hypothesis, that the test statistic (as a random variable, prior to observing the actual data) falls outside of the rejection region. When $H_1$ contains only a single value of the parameter, the type II error can be unambiguously defined as: \[ \begin{aligned} \beta = P\left(\text{Type II error} \right)&= P(d(\mathbf X) \not\in \mathcal R \, ; \, H_1). \end{aligned} \] When $H_1$ is compound (i.e., when $H_1$ contains more than one value of the parameter), the definition of type II error must be modified, by considering the maximum probability over all values of the parameter $\theta$ under the alternative hypothesis: \[ \begin{aligned} \beta = P\left(\text{Type II error} \right)&= \max_{\theta \in H_1}P(d(\mathbf X) \not\in \mathcal R \, ; \, H_1). \end{aligned} \] We may also think of type II error as a function of the parameter: \[ \begin{aligned} \beta(\theta) = P\left(\text{Type II error} \, ; \, \theta \right)&= P(d(\mathbf X) \not\in \mathcal R \, ; \, \theta \in H_1). \end{aligned} \] Here, the distance between the boundary of the null hypothesis and the value of $\theta$ used in the type II error calculation is defined as the effect size, denoted $\gamma$. In our example, for the specific alternative hypothesis $H_1^*: p = 0.3$ (i.e., for an effect size $\gamma = 0.2$) and for our original rejection region $\mathcal R = [\frac{4}{15},1]$, \[ \begin{aligned} \beta(0.3)&= P(d(\mathbf X) \not\in \mathcal R \, ; \, H_1^*) \\ &= P(\widehat p < \frac{4}{15} \, ; \, p = 0.3) \\ &= P( Y < 4 \, ; \, n = 15, p = 0.3) \\ &\approx 0.3. \end{aligned} \] Roughly $30\%$ of the time, with this effect size, we would incorrectly “fail to reject” $H_0$. The figure below shows a visualization of the probability distribution for $Y$ under $H_1^*$. The gray bars correspond to values of $Y$ that result in a type II error, i.e., a failure to reject $H_0$ when $H_1^*$ is true. The probability of the gray region (the sum of the heights of the gray bars) is $\beta \approx 0.3$, representing the probability of type II error.

The probability mass function for Y ∼ binom(n = 15, p) under H₁^* : p = 0.3. The gray region represents a type II error. Under H₁*, the probability of a type II error is β ≈ 0.3. The gold region represents a correct rejection of H₀. Under H₁*, the probability of a correct rejection is the power: 𝒫 = 1 − β ≈ 0.7.

Suppose we chose an effect size of $\gamma = 0.3$, i.e., $H_1^{\dagger}: p = 0.4$. In this case, $\beta \approx 0.09$. As a general rule, for a fixed sample size and rejection region, the larger the effect size—in this case, the further $p$ is from $0.1$ in the positive direction—the smaller the type II error.
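The dependence of $\beta$ on the effect size can be tabulated directly. The following sketch treats $\beta$ as a function of $p$ for the fixed rejection region $Y \ge 4$ with $n = 15$; the values $p = 0.2$ and $p = 0.5$ are added only for illustration, and the function names are our own.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(Y <= k) for Y ~ binom(n, p)."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(k + 1))

def beta(p, n=15, k=4):
    """Type II error at p: probability that Y < k, so H0 is not rejected."""
    return binom_cdf(k - 1, n, p)

for p in (0.2, 0.3, 0.4, 0.5):
    gamma = p - 0.1  # effect size: distance from the boundary of H0
    print(f"p = {p:.1f}, gamma = {gamma:.1f}, "
          f"beta = {beta(p):.3f}, power = {1 - beta(p):.3f}")
```

The rows for $p = 0.3$ and $p = 0.4$ reproduce the values $\beta \approx 0.3$ and $\beta \approx 0.09$ given above.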

The relationship between type I error, type II error, and effect size can be illustrated by analogy.¹⁵ Imagine a metal detector used in a field to detect metal objects buried underground. The detector is prone to error: some errors will be “false positives”—suggesting that metal is present, when in fact, no metal is present. This kind of error is analogous to a type I error. Other errors will be “false negatives”—failing to alert the user when metal is present, analogous to a type II error. The rate of these errors will be a function of how sensitive the detector is and how big of a piece of metal we are trying to find. Intuitively, if the detector is calibrated to minimize false positives, by being conservative in its alerts, it is more likely to miss metal that is actually there. That is, when we try to shrink the type I error rate, we increase the type II error rate. And, conversely, if the detector is calibrated to minimize false negatives, by being liberal in its beeping, it is likely to beep for objects that aren’t metal, or for very small pieces of metal; reducing the type II error rate increases the type I error rate.
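In the BK example, "calibrating the detector" amounts to moving the cutoff $k$ in the rule "reject when $Y \ge k$". As a rough numerical companion to the analogy, the sketch below lists both error rates for several cutoffs, with the type II error evaluated at the assumed alternative $p = 0.3$; the range of cutoffs shown is an arbitrary choice.

```python
from math import comb

N = 15  # sample size in the BK example

def upper_tail(k, p, n=N):
    """P(Y >= k) for Y ~ binom(n, p)."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(k, n + 1))

cutoffs = list(range(2, 8))
alphas = [upper_tail(k, 0.1) for k in cutoffs]     # false positive rates
betas = [1 - upper_tail(k, 0.3) for k in cutoffs]  # false negative rates at p = 0.3

for k, a, b in zip(cutoffs, alphas, betas):
    print(f"reject when Y >= {k}: alpha = {a:.3f}, beta = {b:.3f}")
```

Raising the cutoff makes the test more conservative: $\alpha$ falls with every step while $\beta$ rises.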

¹⁵ I am indebted to Deborah Mayo and Aris Spanos for this analogy, which I remember from their 2019 Summer Seminar in Philosophy of Statistics, Mayo & Spanos (2019).

Mayo, D., & Spanos, A. (2019). Summer seminar on philosophy of statistics. https://summerseminarphilstat.com

How does effect size fit into this analogy? If the metal we are looking for is small (we have a small effect size), then we need a detector that is calibrated to find tiny metal objects. In the analogous hypothesis testing scenario, we make our tests more sensitive to small effects by increasing the sample size. Alternatively, if the metal we are looking for is large (we have a large effect size), a detector that is calibrated to find tiny objects might not be all that efficient. It will beep all the time for tiny objects, cause us to dig, and perhaps leave us disappointed. Instead, we may want a less sensitive detector. The same is true with hypothesis testing: if we are searching for a large effect size, surprisingly, a test with a large sample size might be inefficient. For this reason, it is important to also estimate the size of the effect, to know whether what we have found is “practically significant” or relevant (see Section 4.5 for a discussion of frequentist estimation techniques). That is, when the detector beeps, we should dig to see what set it off; sometimes, it won’t be something we’re interested in!

4.2.1.3 Statistical power

The power of a test, $\mathcal P$, is defined as the maximum probability of rejecting the null hypothesis when an alternative hypothesis is true: \[ \begin{aligned} \mathcal P = \text{power} &= \max_{\theta \in H_1}P(d(\mathbf X) \in \mathcal R \, ; \, H_1). \end{aligned} \] We note that, for a specific value of the parameter under $H_1$, the power is the probability of the complement of a type II error: $\mathcal P = 1 - \beta$. The power for $H_1^*$, $\mathcal P \approx 0.7$, is represented by the probability of the gold bars in the figure above showing the distribution of $Y$ under $H_1^*$. As with type II error, we can think of power as a single value, computed for a specific value of $\theta$ under $H_1$, or as a function of $\theta$, $\mathcal P(\theta) = 1 - \beta(\theta)$. Power is an important concept that is sometimes neglected in areas of research that use statistical testing. Low-powered tests lead to studies that fail to find a true effect, to an increase in type I errors, and to overestimates of effect sizes (Button et al., 2013; Maxwell, 2004). The first of these impacts follows directly from the definition of power. The second and third impacts may seem counterintuitive. Often, the connection between low power and high type I error is the result of using the same dataset to test multiple hypotheses. Along these lines, Maxwell (2004) writes, “when power is low for any specific hypothesis but high for the collection of tests, researchers will usually be able to obtain statistically significant results, but which specific effects are statistically significant will tend to vary greatly from one sample to another, producing a pattern of apparent contradictions in the published literature.” If we were to test several hypotheses about reading habits, say:

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376.

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147.

At least ten percent of Americans have read BK.
At least ten percent of Americans have read The Road, by Cormac McCarthy.
At least ten percent of Americans have read Jayber Crow, by Wendell Berry.
At least ten percent of Americans have read Gilead, by Marilynne Robinson.
...

then it is likely that overall we will identify some effect, but also very likely that we will be wrong about the effect we find. For example, a hypothesis test may incorrectly suggest the third hypothesis above is true, when in fact, only the second hypothesis is true.
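One simple way to see how testing many hypotheses inflates the chance of a false rejection is the following back-of-the-envelope sketch. It assumes, purely for illustration, that all of the null hypotheses are true and that the tests are independent (which need not hold when the same dataset is reused); under those assumptions the probability of at least one type I error among $m$ tests of size $\alpha$ is $1 - (1-\alpha)^m$.

```python
ALPHA = 0.056  # per-test type I error rate, as in the BK example

def prob_any_false_rejection(m, alpha=ALPHA):
    """P(at least one rejection among m independent tests of true nulls)."""
    return 1 - (1 - alpha)**m

for m in (1, 4, 10, 20):
    print(f"m = {m:2d} tests: P(at least one false rejection) = "
          f"{prob_any_false_rejection(m):.3f}")
```

This does not capture the full argument in Maxwell (2004), which also involves low power for each individual hypothesis, but it shows why "some significant result" becomes likely as the number of tests grows.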

4.2.1.4 So-called “best” tests

So, how do we typically manage the balance of errors? Are some hypothesis tests “better” than others, and if so, according to what metric? In our BK example, recall that the rejection region $\mathcal R = [\frac{4}{15},1]$ had a type I error rate of approximately $0.056$. There are many realizations of $\mathbf X$ that correspond to this rejection region, including:

\[ \mathbf{x}_1 = (0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0), \] \[ \mathbf{x}_2= (0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0), \] \[ \mathbf{x}_3 = (1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), \] \[ \mathbf{x}_4 = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), \] and many others. It turns out that the rejection region \[ \mathcal R^\dagger~=~\{\frac{4}{15},\frac{5}{15},\frac{6}{15}\} \] also has size $\alpha \approx 0.056$.¹⁶ And yet, $\mathcal R$ and $\mathcal R^\dagger$ have important differences. For example, dataset $\mathbf{x}_4$ falls in $\mathcal R$ but not in $\mathcal R^\dagger$. As it turns out, for $H_1^*: p = 0.3$, the type II error rate associated with $\mathcal R$ is approximately $0.30$, and the type II error rate associated with $\mathcal R^\dagger$ is approximately $0.43$. Thus, the former is a “better” test of size $\alpha$, if by “better”, we mean a lower type II error rate (or higher power) for a fixed type I error rate.
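The comparison between $\mathcal R$ and $\mathcal R^\dagger$ can be reproduced with a short sketch. Each region is represented by the set of counts $Y$ it contains, and the same function gives the size (at $p = 0.1$) and the type II error rate (at $p = 0.3$); the variable names are ours.

```python
from math import comb

def prob_in(counts, p, n=15):
    """P(Y in counts) for Y ~ binom(n, p)."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in counts)

R = range(4, 16)      # p_hat in [4/15, 1]
R_dagger = (4, 5, 6)  # p_hat in {4/15, 5/15, 6/15}

size_R, size_Rd = prob_in(R, 0.1), prob_in(R_dagger, 0.1)
beta_R, beta_Rd = 1 - prob_in(R, 0.3), 1 - prob_in(R_dagger, 0.3)

print(f"R:        size = {size_R:.4f}, type II error at p = 0.3: {beta_R:.3f}")
print(f"R_dagger: size = {size_Rd:.4f}, type II error at p = 0.3: {beta_Rd:.3f}")
```

The two sizes agree only approximately (they differ in the fourth decimal place), which is the point made in the footnote about discrete DGPs.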

¹⁶ In discrete DGPs (such as a Bernoulli DGP modeling our BK example), it may be difficult or impossible to find two rejection regions of exactly the same size. So here, for simplicity, we look at two rejection regions of approximately the same $\alpha$. For continuous DGPs, there will be infinitely many rejection regions of size $\alpha$.

Is there ever a best test of size $\alpha$ according to this metric? That is, for a given size, $\alpha$, is there one test statistic and rejection region that is better, in terms of type II error, than all others? In some (but not all) cases, the answer is “yes” (Casella & Berger, 2024; Corcoran, 2022). One way of arriving at a best test is given by the Neyman-Pearson Lemma.

For simple null hypothesis $H_0: \theta = \theta_0$ and simple alternative hypothesis $H_1: \theta = \theta_1$, the best test of size $\alpha$ is one that rejects $H_0$ when the likelihood ratio is less than or equal to some value $c$, where $c$ is chosen such that the size of the test is equal to $\alpha$. That is, we want a rejection region such that \[ \begin{aligned} l = \frac{f(\mathbf{x}; \theta_0)}{f(\mathbf{x}; \theta_1)} \le c, \end{aligned} \] where $P(l \le c \, ; \, H_0) \le \alpha$.

The intuition here is that if a potential dataset from the DGP, $\mathbf x = (x_1,...,x_n)$, is more likely under $H_1$ than under $H_0$, then $f(\mathbf{x}; \theta_1)$ will be larger, and thus, $l$ will be smaller (Corcoran, 2022).

In the case of DGP$_H$, for $Y = \sum^{15}_{i=1}X_i$, the likelihood ratio is: \[ \begin{aligned} l &= \frac{f(\mathbf{x}; \theta_0)}{f(\mathbf{x}; \theta_1)} = \frac{(0.1)^{Y}(1-0.1)^{15 -Y}}{(0.3)^{Y}(1-0.3)^{15 - Y}} \\ &= \frac{(0.1)^{Y}(0.9)^{15 - Y}}{(0.3)^{Y}(0.7)^{15 - Y}} \\ &= \left(\frac{1}{3}\right)^{Y} \left(\frac{9}{7}\right)^{15 - Y} \\ &= \left(\frac{1}{3}\right)^{Y} \left(\frac{9}{7}\right)^{15}\left(\frac{9}{7}\right)^{-Y} \\ &= \left(\frac{1}{3}\right)^{Y} \left(\frac{9}{7}\right)^{15}\left(\frac{7}{9}\right)^{Y} \\ &= \left(\frac{7}{27}\right)^{Y} \left(\frac{9}{7}\right)^{15}. \end{aligned} \] Since $l$ is a decreasing function of $Y$, the rule “reject when $l \le c$” is equivalent to the rule “reject when $Y \ge k$” for some constant $k$. The last step is to find a value $k$ such that the size of the test is $\alpha$. From above, for $\alpha \approx 0.056$, we know $k = 4$. Thus, the best test of size $\alpha \approx 0.056$ is given by the rejection region $Y \ge 4$, or equivalently, written in terms of $\widehat p = Y/n$, $\widehat p \in \mathcal R = [\frac{4}{15}, 1]$!
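The steps of this argument can be traced numerically. The sketch below computes the likelihood ratio for every possible count $Y$, confirms that it is decreasing in $Y$ and matches the closed form $(7/27)^{Y}(9/7)^{15}$, and then searches for the smallest cutoff $k$ whose size does not exceed a target. The target of $0.06$ is a hypothetical cap chosen for illustration, since in a discrete DGP an arbitrary $\alpha$ usually cannot be hit exactly.

```python
from math import comb, isclose

n, p0, p1 = 15, 0.1, 0.3  # sample size, null value, alternative value

def likelihood(y, p):
    """Likelihood of one specific 0/1 data vector containing y ones."""
    return p**y * (1 - p)**(n - y)

# Likelihood ratio l as a function of the count Y.
ratios = [likelihood(y, p0) / likelihood(y, p1) for y in range(n + 1)]
decreasing = all(a > b for a, b in zip(ratios, ratios[1:]))
closed_form_ok = all(
    isclose(ratios[y], (7 / 27)**y * (9 / 7)**n) for y in range(n + 1)
)

def size(k):
    """P(Y >= k; n, p0): size of the rule 'reject when Y >= k'."""
    return sum(comb(n, y) * p0**y * (1 - p0)**(n - y) for y in range(k, n + 1))

target = 0.06  # hypothetical cap on the type I error rate
k = min(c for c in range(n + 2) if size(c) <= target)
print(decreasing, closed_form_ok, k, round(size(k), 3))
```

With this target the search returns $k = 4$, in agreement with the rejection region found above.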

It’s worth stepping back for a moment and seeing what we’ve accomplished. Above, there was an intuition that $\mathcal R$ was a better rejection region than $\mathcal R^\dagger$. Under this hypothesis test, it doesn’t make much sense to fail to reject $H_0$ for datasets like $\mathbf x_4$ above (i.e., a sample where everyone has read BK). But that’s what $\mathcal R^\dagger$ does. The Neyman-Pearson Lemma matches this intuition, correctly selecting between these two rejection regions. And, in fact, it states that $\mathcal R$ is at least as good as any other rejection region of this size. This result is very useful, especially in cases where there are infinitely many rejection regions of size $\alpha$ and there isn’t necessarily a clear intuition about which is best.

4.2.2 Behaviors vs belief: the p-value approach

Thus far, we’ve described a version of hypothesis testing closely aligned with Neyman’s and Pearson’s methods. Practitioners select a parameter space partitioned by null and alternative hypotheses; a test statistic with a distribution under the null hypothesis; and a rejection region, ideally selected based on desirable error probabilities. They then make a decision based on whether the computed test statistic falls in the rejection region. Neyman and Pearson testing is decision- and action-oriented: we decide that the null hypothesis ought to be rejected or not, and act in accordance with that decision. If we use Neyman-Pearson tests as a guide for action, we will act in accordance with the truth more often than not. Hypothesis tests as described thus far are not meant to provide inductive support for belief in a hypothesis or to quantify the strength of evidence for or against a hypothesis. Neyman (1950, pp. 258–259) writes that¹⁷

¹⁷ I’m indebted to Mayo (1992) for excavating this passage in Neyman (1950).

Mayo, D. G. (1992). Did Pearson reject the Neyman-Pearson philosophy of statistics? Synthese, 90, 233–262.

Neyman, J. (1950). First course in probability and statistics.

the problem of testing a statistical hypothesis occurs when circumstances force us to make a choice between two courses of action: either take step A or take step B...To accept a hypothesis H means only to decide to take action A rather than action B. This does not mean that we necessarily believe that the hypothesis H is true.

The focus, for Neyman (and Pearson), was not inductive support for a belief but behavior based on the output of a statistical test.

This behavioral focus conceals another possible motivating goal of hypothesis testing: to provide inductive evidential support for or against a hypothesis. Fisher was a proponent of this interpretation of hypothesis testing. He developed a hypothesis testing metric that could be used as an indication of inductive support. This metric, called a p-value, purportedly offers practitioners a number that represents discordance (or lack thereof) between data and the null hypothesis, with small p-values representing large discordance, and vice versa. Some view p-values as an evidence metric, with small p-values providing evidence against the null, and larger values providing no such evidence. For any individual hypothesis test, practitioners can then decide whether the given p-value is small enough to constitute evidence against the null hypothesis or not. As a continuous metric, and in accordance with Fisher’s intentions, p-values by themselves do not produce dichotomous decisions. With that said, a practice has developed that blends Fisher’s p-value approach with the Neyman-Pearson decision approach: it compares the p-value to the significance level, $\alpha$, rejecting $H_0$ just in case the p-value is less than $\alpha$. We will explore the virtues and vices of this hybrid approach, and its contribution to the replication crisis, in Section 4.3.5.

Before we formally define and analyze p-values, let’s describe the slightly different testing context in which they arise. Fisher’s inductive tests follow roughly the following steps (Perezgonzalez, 2015):

1. choose a test statistic for the DGP;

2. pose a null hypothesis and a direction for the effect;

3. compute a metric that represents discordance between the null hypothesis and the data (the p-value); and

4. assess and interpret the statistical and domain-relevant significance of the results.

This procedure shares some commonalities with the approach given above. Decision-oriented tests and inductive tests both account for the DGP, formulate a null hypothesis, and summarize data with a test statistic. At the same time, there are important differences between these two approaches to testing. Inductive tests do not formulate an alternative hypothesis; and, since they do not make dichotomous decisions, inductive tests do not place emphasis on error rates. After all, errors are committed when strict, dichotomous decisions are made, and p-values, at least as Fisher envisioned them, do not produce dichotomous decisions.

Formally, a p-value is defined as the probability, under the null hypothesis, that one observes a test statistic at least as “extreme” as the test statistic actually observed. The concept of “extreme” is formalized in step 2 above, by specifying the direction of the effect. In the BK example, under $H_0: p \le 0.1$, the direction of the effect would be large values of $\widehat p$, i.e., ones further from $H_0$ in the positive direction. This example suggests that, at least in some cases, what counts as extreme is unidirectional (Perezgonzalez, 2015). In other cases, what counts as extreme is not unidirectional. If we hope to learn whether a coin is unfair, or whether two populations differ with respect to the mean of some variable, “extreme” is not unidirectional; a large number of heads or tails would be extreme, as would a large difference across populations, regardless of which population has the higher mean.

Perezgonzalez, J. D. (2015). Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology, 6, 223.

For the BK example, with data $\mathbf x$, the p-value is the probability, under $H_0$, that $\widehat p$ would be at least as large as the observed value, $\widehat p = 2/15$. Again, defining $Y$ as the number of ones in the data vector, the p-value for this hypothesis test is: \[ \begin{aligned} P(Y \ge 2; H_0) = P(Y \ge 2; n = 15, p = 0.1) \approx 0.45. \end{aligned} \] As probabilities, p-values are always between zero and one. This p-value is, by any standard, large, and thus does not suggest any discordance between the data and the null hypothesis. In general, when a p-value is reported rather than a dichotomous decision, each practitioner can assess whether the given p-value constitutes evidence against the null, and then make an inference about $H_0$ accordingly. In some contexts, Fisher advocated for significance levels (analogous to Neyman and Pearson’s $\alpha$) as a heuristic in assessing the strength of evidence against the null hypothesis. Regarding the level $\alpha = 0.05$, Fisher wrote that it

“is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. [...] If [the p-value] is between $0.1$ and $0.9$ there is certainly no reason to suspect the hypothesis tested. If it is below $0.02$ it is strongly indicated that the hypothesis fails to account for the whole of the facts” Fisher ([1925] 2017).

Fisher, R. A. ([1925] 2017). Statistical methods for research workers. Kalpaz.
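The p-value computed above for the BK example can be checked with a short sketch. It evaluates the upper-tail binomial probability at the boundary value $p = 0.1$ for the observed count $Y = 2$, and, for comparison, for a few other hypothetical observed counts that do not come from the actual data.

```python
from math import comb

def p_value(y_obs, n=15, p0=0.1):
    """P(Y >= y_obs; n, p0): probability of a count at least as extreme
    (in the positive direction) as the one observed, computed under H0."""
    return sum(comb(n, y) * p0**y * (1 - p0)**(n - y)
               for y in range(y_obs, n + 1))

print(round(p_value(2), 2))  # 0.45, the observed data in the BK example

# Hypothetical observed counts, for comparison only.
for y_obs in (3, 4, 5):
    print(y_obs, round(p_value(y_obs), 3))
```

Note that the p-values for observed counts of $4$ and $5$ coincide with the sizes of the rejection regions $Y \ge 4$ and $Y \ge 5$ computed earlier, since both quantities are the same tail probability.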

In summary, broadly speaking, there are two approaches to hypothesis testing: the Neyman-Pearson approach that favors tracking and minimizing hypothetical long-run error rates based on binary decisions; and the Fisher approach that favors a measure of the degree to which evidence is in tension with a null hypothesis. At a conceptual and philosophical level, the “classical” hypothesis testing of Fisher, Neyman, and Pearson ends here. The logic of hypothesis testing was established and unchanged for most of the second half of the $20^{\text{th}}$ century. Since then, the main work of frequentist statisticians has been to apply this logic to the myriad DGPs and parameter types that arise in science and business. For readers interested in further applications of hypothesis testing, see the optional next section, Section 4.2.3, that implements two additional tests. Readers less interested in application might skip ahead to Section 4.3, which describes fallacies and misuses of hypothesis testing.

4.2.3 Further hypothesis testing examples

Our running hypothesis test example has been an elementary test concerning a population proportion. To give a sense of the applicability of hypothesis testing, in this section, we consider two additional tests. The first will test whether there is evidence for a difference in the mean of a variable across two different populations. The second will test whether there is evidence that suggests that a physical parameter is less than previously thought. Readers familiar with hypothesis test applications, or readers who would rather skip more mathematically demanding examples, can skip ahead to Section 4.3.

4.2.3.1 A difference across means example

The following example is modified from an exercise in Ugarte et al. (2016) and is also used in Gardiner & Zaharatos (2022). Marilynne Ames is the senior statistician at Ames’ Appliances. Marilynne would like to test whether a refrigerator’s energy consumption—measured in kilowatts over a 24-hour period—would be impacted by a modification to its motor. Marilynne is not yet sure whether the modification will result in better or worse energy efficiency. She considers the following research hypotheses:

Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2016). Probability and statistics with R. CRC Press, Taylor & Francis Group.

Gardiner, G., & Zaharatos, B. (2022). The safe, the sensitive, and the severely tested: A unified account. Synthese, 200(5), 369.

¹⁸ This “constant variance” assumption is made for simplicity and may not be reasonable in practice. There are other tests designed for comparing population means with unequal population variances.

\[ \begin{aligned} R_0&: \text{ The motor modification will not impact energy consumption} \\ R_1&: \text{ The motor modification will impact energy consumption} \end{aligned} \] Marilynne then chooses a statistical model that allows her to translate her research hypotheses to statistical hypotheses. Specifically, based on knowledge of the energy consumption measurement process, she assumes that measurements of energy consumption are independent and normally distributed. Under these assumptions, Marilynne randomly samples $n = 60$ refrigerators from her inventory, and randomly assigns a label unmodified or modified to each one. She does nothing to the $n_x = 30$ refrigerators that receive the unmodified label, and performs the motor modification on the remaining $n_y = 30$ refrigerators. Denote $X_1,...,X_{n_x} \overset{iid}{\sim} N(\mu_x, \sigma^2)$ as the DGP for the energy consumption of the unmodified group and $Y_1,...,Y_{n_y} \overset{iid}{\sim} N(\mu_y, \sigma^2)$ as the DGP for the energy consumption of the modified group. $\mu_x$ represents the mean consumption of the unmodified group, and $\mu_y$ the mean consumption of the modified group. Importantly, in using these models, Marilynne is assuming that the variance, $\sigma^2$, in kilowatt measurements is the same across both groups.¹⁸

These models give Marilynne a way to translate the research hypotheses to statistical hypotheses. If we interpret the research hypotheses as claiming something about differences on average, the statistical hypotheses can be written as: \[ \begin{aligned} S_0&: \mu_x=\mu_y \\ S_1&: \mu_x\ne \mu_y, \end{aligned} \] or equivalently, \[ \begin{aligned} S_0&: \mu_x-\mu_y = 0 \\ S_1&: \mu_x-\mu_y \ne 0. \end{aligned} \] To write this statistical model in the general form introduced in Section 4.1, define $\mathbf{Z} = ( \mathbf{X}, \mathbf{Y}) = ( X_1,...,X_{n_x}, Y_1,...,Y_{n_y})$, i.e., a random vector of length $n = n_x + n_y$ including all energy consumption measurements. Then, \[ \begin{aligned} \mathcal{M}_{\boldsymbol\theta}(\mathbf{z}) = \left\{\, \left(\,\mathbf{Z}, \, f(\mathbf{z}; \boldsymbol{\theta}) \, \right) : \boldsymbol\theta = (\mu_x, \mu_y, \sigma^2) \in \mathbb{R}^2 \times (0,\infty), \, \, \mathbf{z} \in \mathbb{R}^n\right\}, \end{aligned} \] where $f$ is the multivariate distribution describing the random process $\mathbf{Z}$: \[ \begin{aligned} f(\mathbf{z}; \boldsymbol{\theta}) &= (2\pi\sigma^2)^{-n/2} \exp\Biggl\{ -\frac{1}{2\sigma^2}\Biggl[ \sum_{i=1}^{n_x}(z_i-\mu_x)^2 + \sum_{i=n_x+1}^{n}(z_i-\mu_y)^2 \Biggr] \Biggr\}. \end{aligned} \] Using this model, along with some theorems of probability theory, Marilynne knows that: \[ \begin{aligned} \bar{X} = \frac{1}{n_x}\sum_{i=1}^{n_x} X_i &\sim N\left(\mu_x, \, \, \frac{\sigma^2}{n_x}\right), \\ \bar{Y} = \frac{1}{n_y}\sum_{i=1}^{n_y} Y_i&\sim N\left(\mu_y, \,\, \frac{\sigma^2}{n_y}\right), \text{ and } \\ \bar{X} - \bar{Y} &\sim N\left(\mu_x - \mu_y, \,\, \sigma^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right) \right). \end{aligned} \] These results follow from the fact that linear combinations of independent normal random variables are normal, and from simple calculations of expectations and variances under independence.
For example, \[ \begin{aligned} E(\bar{X}) &= E\left(\frac{1}{n_x}\sum_{i=1}^{n_x} X_i \right) = \frac{1}{n_x}E\left(\sum_{i=1}^{n_x} X_i\right) \\ &= \frac{1}{n_x}\sum_{i=1}^{n_x} E\left(X_i\right) = \frac{1}{n_x}\sum_{i=1}^{n_x}\mu_x \\ &= \frac{1}{n_x}n_x\mu_x = \mu_x, \end{aligned} \] and \[ \begin{aligned} \text{Var}(\bar{X}) &= \text{Var}\left(\frac{1}{n_x}\sum_{i=1}^{n_x} X_i \right) = \frac{1}{n_x^2}\text{Var}\left(\sum_{i=1}^{n_x} X_i\right) \\ &\overset{i}{=} \frac{1}{n_x^2}\sum_{i=1}^{n_x} \text{Var}\left(X_i\right) =\frac{1}{n_x^2}\sum_{i=1}^{n_x}\sigma^2 \\ &= \frac{1}{n_x^2}n_x\sigma^2 = \frac{\sigma^2 }{n_x}. \end{aligned} \] The “$i$” in the third equality of the variance calculation denotes that this equality relies on independence across measurements. From these distributional assumptions, Marilynne constructs a test statistic relevant to the parameters of interest, i.e., $\mu_x - \mu_y$. Intuitively, $\bar{X} - \bar{Y}$ is a reasonable start. A standardized version of this statistic has a standard normal distribution under $S_0$: \[ \begin{aligned} Z_0 = \frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{\sigma\sqrt{\frac{1}{n_x} + \frac{1}{n_y}} } \overset{\text{under } S_0}{\sim} N(0,1). \end{aligned} \] This test statistic, which can be justified mathematically through the likelihood ratio test, would allow Marilynne to construct a rejection region and make comparisons on the standard normal scale. For example, an intuitive rejection region of size $\alpha = 0.05$ is¹⁹ \[ \mathcal{R}_z = (-\infty, -1.96) \cup (1.96, \infty). \] That is, one rejects the null hypothesis if the test statistic computed for the actual data $\mathbf{z} = (x_1,...,x_{n_x}, y_1,...,y_{n_y})$ is below $-1.96$ or above $1.96$. Such rejections directly correspond to the difference in sample means $\bar{X} - \bar{Y}$ being far from zero in either direction.
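The size of the two-sided rejection region on the standard normal scale can be verified with Python's standard library. The sketch below uses `statistics.NormalDist` to compute the probability of landing in either tail beyond $\pm 1.96$, and to recover the cutoff itself from the desired size.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability, under S0, that Z0 falls below -1.96 or above 1.96.
size = Z.cdf(-1.96) + (1 - Z.cdf(1.96))

# Going the other way: the cutoff that leaves 2.5% in each tail.
cutoff = Z.inv_cdf(1 - 0.05 / 2)

print(f"size = {size:.4f}, cutoff = {cutoff:.4f}")
```

The cutoff is $1.96$ only after rounding, which is why the size is approximately, rather than exactly, $0.05$.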

¹⁹ It is easy to check that, for $Z_0 \sim N(0,1)$, $P(Z_0 \le -1.96) + P(Z_0 \ge 1.96) \approx 0.05$ (approximation due to rounding).
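The check described in this footnote can be sketched in a few lines of Python. This is only an illustration: the helper name `std_normal_cdf` is ours, and it builds the standard normal CDF from the standard library's `math.erf` rather than from any statistics package.

```python
import math

def std_normal_cdf(z: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Size of the two-sided rejection region (-inf, -1.96) U (1.96, inf):
# P(Z0 <= -1.96) + P(Z0 >= 1.96) for Z0 ~ N(0, 1).
lower_tail = std_normal_cdf(-1.96)
upper_tail = 1.0 - std_normal_cdf(1.96)
region_size = lower_tail + upper_tail

print(round(region_size, 3))  # prints 0.05
```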

There’s one problem: to compute $Z_0 = z_0$, Marilynne must know all of the quantities that define $Z_0$. With actual data $\mathbf{z}$, she clearly will know $\bar{x}$, $\bar{y}$, $n_x$, and $n_y$. Under the null hypothesis—which is how we compute test statistics—she will also know that $\mu_x - \mu_y = 0$. But she will not necessarily know $\sigma^2$. After all, $\sigma^2$ is another unknown population parameter, like $\mu_x$ and $\mu_y$. But unlike $\mu_x$ and $\mu_y$, fixing the null hypothesis does not render $\sigma^2$ known. Thankfully, Marilynne knows a solution to this problem. Define

\[ \begin{aligned} S^2_X &= \frac{1}{n_x - 1}\sum^{n_x}_{i=1}(X_i - \bar{X})^2 \,\,\,\,\,\, \text{(estimator of }\sigma^2), \\ S^2_Y &= \frac{1}{n_y - 1}\sum^{n_y}_{i=1}(Y_i - \bar{Y})^2 \,\,\,\,\,\, \text{(another estimator of }\sigma^2), \\ \text{ and } S^2_p &= \frac{\left(n_x - 1\right)S_X^2 + \left(n_y - 1\right)S_Y^2 }{n_x + n_y - 2} \,\,\,\,\,\, \text{(``pooled'' estimator of }\sigma^2). \end{aligned} \] $S^2_p$ is called the pooled variance estimator, and is appropriate for cases like this one, where $\sigma^2$ is shared across groups. It turns out that if one substitutes $S_p$ in for $\sigma$ in $Z_0$, the resulting test statistic, $T_0$, has a t-distribution with $d$ degrees of freedom: \[ \begin{aligned} T_0 = \frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{S_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}} \,\, \overset{\text{under } S_0}{\sim} \,\, t_{d}, \end{aligned} \] where $d = n_x + n_y -2$. The $t$-distribution is similar to the normal distribution in that it is symmetric and bell-shaped. It has “heavier tails”, in the sense that values $T_0 = t_0$ farther from the mean have relatively higher probability than they would under a normal distribution. This higher variance in $T_0$, when compared to $Z_0$, makes sense: In $T_0$, we have introduced an additional source of variation ($S_p$). As the degrees of freedom increase, the $t$-distribution converges (in distribution) to the standard normal distribution. For $d = n_x + n_y -2 = 30 + 30 -2 = 58$, the $t$ and standard normal distributions are quite similar. Specifically, for $d = 58$, \[ \mathcal{R}_t = (-\infty, -2) \cup (2, \infty) \] is a rejection region of size $\alpha = 0.05$. For the data collected, $t_0\approx 2.51>2$. According to Neyman-Pearson hypothesis testing logic, if the modeling assumptions are correct, Marilynne can reject $S_0$: $\mu_x = \mu_y$, and act as if $R_1$ is true: on average, the motor modification has an impact on energy consumption. 
The sign of $t_0$ suggests which group consumes less energy. Since $t_0$ is positive, it must be that $\bar{x}> \bar{y}$ (the denominator is always positive), which implies that the modified group used less energy, and is more energy efficient.
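A minimal sketch of the pooled two-sample $t$ statistic follows. It assumes the standard pooled estimator, i.e., sample variances with $n - 1$ divisors combined with divisor $n_x + n_y - 2$, and it evaluates the statistic under $S_0$ (so $\mu_x - \mu_y = 0$). The function name `pooled_t_statistic` and the tiny data set are hypothetical; they are not Marilynne's measurements.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(x, y):
    """Two-sample t statistic with pooled variance, computed under S0."""
    n_x, n_y = len(x), len(y)
    # statistics.variance uses the (n - 1) divisor, i.e., the sample variance.
    s2_x, s2_y = variance(x), variance(y)
    s2_p = ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)
    standard_error = math.sqrt(s2_p) * math.sqrt(1.0 / n_x + 1.0 / n_y)
    t0 = (mean(x) - mean(y)) / standard_error
    return t0, n_x + n_y - 2

# Hypothetical measurements, for illustration only.
x = [2.0, 4.0, 6.0]
y = [1.0, 2.0, 3.0]
t0, d = pooled_t_statistic(x, y)
print(f"t0 = {t0:.3f}, degrees of freedom = {d}")
```

With real data, the returned statistic would be compared against the rejection region for a $t$-distribution with the returned degrees of freedom.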

If Marilynne were to compute a p-value, the result would be statistically significant: the p-value would be smaller than the significance level $\alpha = 0.05$. But statistical significance alone does not imply that the result is practically significant, i.e., that the difference in energy consumption is worth caring about. Is the efficiency gain $1\%$? Or $15\%$? The test alone, as presented, does not say. We will discuss statistical and practical significance more in 1.4.

4.2.3.2 A physical sciences example

Wendell is interested in learning about the acceleration due to gravity, $g$, from measurements of the distance that a falling object travels over time. The default hypothesis about $g$ is that

$R:$ neglecting air resistance, the acceleration due to gravity, $g$, is greater than $9.7$ meters per second squared.

Suppose Wendell has experimental measurements that record the distance traveled by a falling object over time. Denote the $n$ distance measurements as $\mathbf{z} = (z_1,...,z_n)$ and the $n$ time measurements as $\mathbf{t} = (t_1,...,t_n)$. Newtonian physics provides a “substantive” empirical theory that connects these measurements to $g$: $$ \begin{aligned} z_i = \frac{1}{2}gt_i^2. \end{aligned} $$ {#eq-noerror} Thus far, there is nothing “statistical” about this theory. It is a mathematical model that describes a physical process. Observable quantities—$z$ and $t$—are considered data. Unobservable quantities to be inferred—$g$—are considered parameters to be estimated from data. It is almost always the case that data measurements are observed with some error; that is, there is a deviation between the true value of the variable and our recording of it. For example, Wendell might collect data $z$, as in Aguilar et al. (2015), by measuring the distance traveled by a free-falling object using frames of a video recording. There is a fact of the matter about where the object is in space, and thus, how far it has traveled. But measurements contain error due to motion blur in the video and object rotation, which make it difficult to identify the exact point of the object used in the definition of the distance measurement. In this context, a statistical model adds an additional layer of probabilistic assumptions to the empirical theory, meant to model measurement error and account for it in inference.²⁰

Aguilar, O., Allmaras, M., Bangerth, W., & Tenorio, L. (2015). Statistics of parameter estimates: A concrete example. SIAM Review, 57(1), 131–149.

Wendell’s statistical model will include both the substantive Newtonian theory (?eq-noerror) and the statistical component modeling measurement error: $$ \begin{aligned} Z_i = \frac{1}{2}gt_i^2 + \varepsilon_i. \end{aligned} $$ {#eq-error} We can read ?eq-error as follows: a measurement process, $Z$, produces measurements, $z$, which are related to $g$ and $t$ according to Newton’s laws, up to some random component represented by $\varepsilon_i$. The probability distribution given to $\varepsilon_i$ should depend on the context of the measurement error. In this measurement process, how do errors arise? Are they symmetric or asymmetric? Is each measurement error independent of the others, or does one error potentially impact another? Are they systematic—errors that arise when a measurement instrument is off by some fixed amount (say, $\delta$)—or random—errors that arise based on unexplainable fluctuation? Systematic errors, if known, can often be corrected prior to statistical work, for example, by repairing the measurement instrument or introducing a bias correction into the mathematical equation (say, shifting the equation by $\pm \delta$). Random measurement errors arise when the measurement instrument is off, but not by the same amount every time; the amount by which it is off is not law-like, but random, hence its treatment as a probabilistic process.

A common assumption that reflects random, symmetric measurement error is that $\varepsilon_i$ is normally distributed: $$ \begin{aligned} \varepsilon_i \overset{iid}{\sim} N(0,\sigma^2). \end{aligned} $$ {#eq-errordist} The zero mean assumption in ?eq-errordist reflects that there is no systematic error in our measurement. The variance, $\sigma^2$, describes how far deviations will be from the true value. Larger values for $\sigma^2$ would mean that our measurement instrument tends to produce more measurements far from the true value (on either side of it). The “iid” assumption reflects that no one measurement impacts another, and that each has the same tendency to vary from the true value.

From this model, it follows that \[ \begin{aligned} Z_i \overset{ind}{\sim} N\left(\frac{1}{2}gt_i^2, \,\, \sigma^2 \right). \end{aligned} \] That is, the $Z_i$ are independent, and on average, each $Z_i$ will be centered according to Newton’s laws, but will vary according to $\sigma^2$. (The $Z_i$ are not identically distributed, since their means differ with $t_i$.) The joint distribution of $\mathbf{Z} = (Z_1,...,Z_n)$ is a joint normal distribution with mean parameters $\boldsymbol \mu = \frac{g}{2}(t_1^2, t_2^2,...,t_n^2)^T$ and variance-covariance matrix \[ \Sigma = \left( \begin{matrix} \sigma^2 & 0 & ... & 0\\ 0 & \sigma^2 & ... & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & ... & \sigma^2 \end{matrix}\right). \] Putting this all together, Wendell’s statistical model is: $$ \begin{aligned} \mathcal{M}_{g}(\mathbf{z}) = \{\, \big(\, \mathbf{Z}, \, f(\mathbf{z}; g, \sigma^2) \, \big) : g \in [9.7,\infty), \, \, \mathbf{z} \in \mathbb{R}^n\}, \end{aligned} $$ {#eq-model} where $f\left(\mathbf{z} \, ; g, \, \sigma^2 \right)$ is the joint probability density function associated with $\mathbf{Z} = (Z_1,...,Z_n)$. The formulation of the hypothesis $R$ above dictates the parameter space $g \in [9.7,\infty)$. The implicit assumption is that $g$ is known to be no lower than $9.7\text{ m/s}^2$. We can translate $R$ into the following statistical hypotheses:

\[ \begin{aligned} H_0&: g = 9.7 \text{ m/s}^2 \\ H_1&: g > 9.7 \text{ m/s}^2. \end{aligned} \] We will use a Wald test statistic to test these hypotheses.²¹ For a general parameter $\theta$, a Wald test statistic is defined as \[ \begin{aligned} W = \frac{\widehat \theta - \theta_0}{se(\widehat \theta)}, \end{aligned} \] where

²¹ Corcoran (2022) and Casella & Berger (2024) both present the theory of using this quantity as a test statistic, though they do not refer to this test statistic with the moniker “Wald”.

Corcoran, J. N. (2022). The simple and infinite joy of mathematical statistics.

Casella, G., & Berger, R. (2024). Statistical inference. CRC press.

$\widehat \theta$ is an estimator of $\theta$, i.e., a function of the DGP meant to produce a single number as a stand-in for $\theta$;
$\theta_0$ is the value of $\theta$ under $H_0$; and
$se(\widehat \theta)$ is the standard error of $\widehat \theta$, defined as the standard deviation of $\widehat \theta$, i.e., $se(\widehat \theta) = \sqrt{\text{Var}(\widehat \theta)}$.

It can be shown that in most cases, $W$ is approximately $t$-distributed, with $n-1$ degrees of freedom (where $n$ is the sample size used to calculate $\widehat \theta$). In the context of our gravity example, the Wald test statistic is \[ \begin{aligned} T = \frac{\widehat g - g_0}{se(\widehat g)} \sim t_{n-1}, \end{aligned} \] where $\widehat g$ is the maximum likelihood estimator (MLE) of $g$ (MLEs are described in 1.5.1), $g_0 = 9.7$, and $se(\widehat g)$ is the standard error of the MLE (also derived in 1.5.1). The fact that $T$ has a $t$-distribution can be derived from the fact that $T$ can be written as a ratio of a standard normal random variable and a function of a $\chi^2$ random variable with $n-1$ degrees of freedom, and a theorem that states that such statistics have a $t$-distribution. That derivation is left as an exercise for Wendell!

A table of time ($t_i$, in seconds) and distance ($z_i$, in meters) measurements from Wendell’s experiment, used to test the hypothesis about the acceleration due to gravity, $g$.
$t_i$	$z_i$
0.67	2.17
1.43	10.80
2.60	33.00
3.21	51.51
3.28	52.90
3.53	61.61
3.68	67.68
4.15	83.87
4.57	102.48
4.69	107.61

The best rejection region of size $\alpha = 0.05$ is $\mathcal{R} = (1.83 , \infty)$.²² For the data that Wendell collected, given in table 1.1, the actual test statistic is, \[ \begin{aligned} T = t &= \frac{\widehat g - g_0}{se(\widehat g)} \approx \frac{9.83 - 9.70}{0.03} \approx 4.37. \end{aligned} \] Since the test statistic falls in the rejection region $\mathcal{R}$, we reject the null hypothesis and conclude that $g$ is in excess of $9.7 \text{ m/s}^2$.²³ Intuitively, since $T = t = 4.37 \gg 1.83$ is well into $\mathcal{R}$, we might conclude that the true value of $g$ is well into the alternative hypothesis. However, such inferences should be made carefully to avoid fallacious reasoning. We discuss how to avoid such fallacious reasoning in 1.3!
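The arithmetic behind this test statistic can be sketched with the tabulated data. Two assumptions are ours, since the chapter derives the estimator and its standard error elsewhere: (1) under normal errors, the MLE of $g$ coincides with the least-squares solution $\widehat g = 2\sum z_i t_i^2 / \sum t_i^4$, and (2) $\sigma^2$ is estimated from the residuals with divisor $n - 1$, to match the $n-1$ degrees of freedom used above. Small differences from the rounded figures in the text are to be expected.

```python
import math

# Time (s) and distance (m) measurements from the table above.
t = [0.67, 1.43, 2.60, 3.21, 3.28, 3.53, 3.68, 4.15, 4.57, 4.69]
z = [2.17, 10.80, 33.00, 51.51, 52.90, 61.61, 67.68, 83.87, 102.48, 107.61]
n = len(t)

# Assumption (1): least-squares / MLE form for the model z_i = (g / 2) t_i^2 + error.
sum_t4 = sum(ti ** 4 for ti in t)
g_hat = 2.0 * sum(zi * ti ** 2 for zi, ti in zip(z, t)) / sum_t4

# Assumption (2): residual variance estimate with divisor n - 1.
residuals = [zi - 0.5 * g_hat * ti ** 2 for zi, ti in zip(z, t)]
sigma2_hat = sum(r ** 2 for r in residuals) / (n - 1)
se_g_hat = 2.0 * math.sqrt(sigma2_hat / sum_t4)  # sqrt(Var(g_hat)) = 2 * sigma / sqrt(sum t^4)

g_0 = 9.7
wald_t = (g_hat - g_0) / se_g_hat
critical_value = 1.83  # from the text: P(T > 1.83) is about 0.05 with 9 degrees of freedom

print(f"g_hat = {g_hat:.2f}, se = {se_g_hat:.3f}, T = {wald_t:.2f}")
print("reject H0" if wald_t > critical_value else "fail to reject H0")
```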

²² In this case, under a $t$-distribution with $n-1 = 9$ degrees of freedom, $P(T > 1.83) \approx 0.05$.

²³ If we are considering the p-value approach to testing, $p = P(T \ge t) \approx 0.0009$, which is well below our $\alpha = 0.05$ threshold.

4.3 Hypothesis testing: objections, fallacies, and misuses

There are many criticisms of frequentist hypothesis tests. In general, we might classify them into two categories. The first category is philosophical, and suggests that there is something flawed in the conceptual or logical foundations of hypothesis tests that in turn leads to bad inferences. The second category we might call sociological or cultural, and consists of criticisms about the practical misuse of hypothesis tests. If warranted, sociological critiques are at least a call for better practices and education about the proper use of hypothesis tests. However, they may also provide some support for abandoning frequentist hypothesis tests: if the methods are that prone to widespread misuse, then it may be useful to consider other methods that are easier to apply correctly. In this section, we will explore these criticisms and weaknesses of hypothesis testing. We also consider a further development of the philosophical foundations of hypothesis testing to surmount some core philosophical weaknesses.

4.3.1 Acceptance, rejection, and large $n$

The fallacies of acceptance, rejection, and the large $n$ problem are a set of philosophical objections to frequentist tests.

4.3.1.1 The fallacy of acceptance

As a consequence of Neyman’s, Pearson’s, and Fisher’s commitment to falsificationism, there is an asymmetry in the kinds of inferences that hypothesis tests can produce. If, according to Fisher (1935), “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis”, what can we infer or decide in cases where we haven’t “disproved”—i.e., haven’t gathered evidence against—the null hypothesis? In classical hypothesis testing—i.e., hypothesis testing as described in this chapter thus far, save for more recent innovations in testing logic developed in Mayo & Spanos (2011c) and Mayo (2018) and described in 1.4—the answer is, “not much”. The logic of hypothesis testing does not allow us to infer inductive support for the null hypothesis. A test statistic is calculated under the assumption that the null hypothesis is true, and a test statistic that does not fall in the rejection region does not constitute confirmation, corroboration, or evidence for the null hypothesis; in the logic of hypothesis testing, the null is an assumption, and strictly speaking, not something classical testing can confirm, corroborate, or build evidence for.

Fisher, R. A. (1935). The design of experiments. Oliver & Boyd.

Similarly, the p-value is computed under the assumption that the null hypothesis is true, and thus does not produce a consistent metric for measuring the level of confirmation, corroboration, or strength of evidence for the null hypothesis. In our BK example, under $H_0$, the p-value was large ($\approx 0.45$). But we would also obtain a large p-value under many other null hypotheses. For example, if our null hypothesis were $H_0': p = 0.3$, the p-value would be $\approx 0.96$. So then, is there evidence for $H_0'$? Is there more evidence for $H_0'$ than $H_0$? Unfortunately not. The p-value measures discordance between the null hypothesis and evidence. A large p-value simply suggests that there is no such discordance; it does not suggest that there is strong agreement or more agreement when comparing the null hypothesis with other hypotheses. Claiming that there is evidence for the null hypothesis based on a large p-value (or equivalently, based on a test statistic failing to fall in the rejection region) is called the fallacy of acceptance (Spanos, 2011).

4.3.1.2 The fallacy of rejection

The fallacy of acceptance is relatively well-known. Most accounts of hypothesis testing are careful to address and avoid it—e.g., see Corcoran (2022) and Casella & Berger (2024). There is another hypothesis testing fallacy that receives less coverage in the literature but causes similar confusion: the fallacy of rejection. The fallacy of rejection occurs when researchers infer a particular alternative hypothesis based on rejecting the null hypothesis. A test statistic that falls inside the rejection region, or equivalently, a small p-value, does not constitute evidence for specific alternative hypotheses (Spanos, 2011). In the BK example, dataset \[ \mathbf{x}_3 = (1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0) \] yields a test statistic in the rejection region $\mathcal R$. However, this rejection of $H_0: p \le 0.1$ alone is not suggestive of specific alternatives, e.g., $H_1^*: p = 0.3$ or $H_1^{**}: p \ge 0.8$. Inferring specific alternatives would require additional inferential tools.

Spanos, A. (2011). On a new philosophy of frequentist inference: Exchanges with David Cox and Deborah G. Mayo. Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, 315.

²⁴ Similarly, Marilynne might only care about efficiency gains of greater than $10 \%$ and Wendell might only care about estimates of $g$ greater than $g = 9.8$.

The fallacy of rejection is closely related to the distinction between statistical significance—defined in 1.2—and practical (also called clinical or substantive) significance. Dataset $\mathbf{x}_3$ yields a statistically significant result; however, statistical significance does not imply that the finding is practically significant. Practical significance is related to the effect size and its relevance—or lack thereof—in the application area in question. An effect size might be much too small for domain experts to pay attention to. For example, literacy researchers may not care that the proportion of Americans who have read BK is inferred to be $p = 0.11$, i.e., $\gamma = 0.01$ higher than the default assumption in $H_0$.²⁴ A test that rejects the null hypothesis—even if that test had a large test statistic and small p-value—does not alone guarantee any specific effect size. Thus, one commits the fallacy of rejection if one claims that a very small p-value (e.g., $\mathit{p} < 10^{-5}$) implies a large effect size, because an inference to a large effect size implies that one is inferring a specific subset of the alternative hypothesis.

4.3.1.3 The large $n$ problem

The large $n$ problem is also closely related to the fallacy of rejection. Consider a DGP that yields normally distributed data according to $X_1,...,X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, with $\mu$ unknown. Researchers are interested in learning whether $\mu$ is larger than zero. They know that the data are approximately normal, with $\sigma^2 = 1$.

They set up a $z$-test with the following hypotheses: \[ \begin{aligned} H_0&: \mu = 0, \\ H_1&: \mu > 0. \end{aligned} \] This one-sample $z$-test has the following test statistic, $Z$, and p-value ${p}$: \[ \begin{aligned} Z &= \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{\sqrt{n}}{\sigma}\left(\bar{X} - \mu_0\right), \\ \\ p &= P(Z > z; \mu = 0), \end{aligned} \] where $z$ is the value of $Z$ for the actual data. There are two important properties of $Z$, and consequently of $p$, worth noting. First, with all else fixed, as $n$ increases, the absolute value of $Z$ will trend upward, and thus, $p$ will trend downward. Specifically, for a fixed, small deviation from zero, say $\mu_{\text{true}} = 0.03$, there will always be a large enough $n$ such that we can reject the null hypothesis. The simulation in 1.5 shows how, as we increase $n$ by adding data points to a sample, the p-value trends toward zero. Thus, in many cases, we can achieve statistical significance with a large sample size, even though the effect size is practically unimportant. This phenomenon is one reason why frequentist logic requires researchers to fix a sample size in advance. In 1.3.5, we will discuss the pernicious practice known as p-hacking, which can arise from retesting with a larger $n$ after observing a statistically insignificant initial result.
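The first property can be illustrated with a short deterministic calculation, separate from any simulation. The sketch assumes a hypothetical scenario in which the observed sample mean equals $0.03$ at every sample size, with $\sigma = 1$ and $\mu_0 = 0$; the function name `upper_tail_p_value` is ours.

```python
import math

def upper_tail_p_value(x_bar: float, n: int, mu_0: float = 0.0, sigma: float = 1.0) -> float:
    """p = P(Z > z; mu = mu_0) for the one-sample, upper-tailed z-test."""
    z = math.sqrt(n) * (x_bar - mu_0) / sigma
    # Upper tail of the standard normal, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical scenario: the sample mean sits at 0.03 for every sample size.
for n in (100, 1_000, 10_000, 100_000):
    p = upper_tail_p_value(x_bar=0.03, n=n)
    print(f"n = {n:>7}: p = {p:.4f}")
```

Under these assumptions the p-value is far above $0.05$ at $n = 100$ and far below it at $n = 10{,}000$, even though the deviation from $H_0$ is the same in both cases.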

Second, for a fixed $n$, a given p-value will have different evidential weight for different effect sizes (Spanos, 2019). For example, for $n = 10$, a p-value of just below $\alpha = 0.05$ is stronger evidence for a larger effect size, say $\gamma = 1$ (corresponding to $\mu_{\text{true}} = 1$), than it is for a small one, say $\gamma = 0.03$ (corresponding to $\mu_{\text{true}} = 0.03$). So, under the classical hypothesis testing of Neyman, Pearson, and Fisher, if researchers have a large sample, it is difficult to assess practical significance: statistical significance may be picking up on a practically insignificant deviation from $H_0$. Both of these properties suggest that the p-value, in the absence of information about sample size and effect size, is of limited value.

The fallacies of acceptance and rejection, and the large $n$ problem, put hypothesis testers in a tricky position. On the one hand, hypothesis testing provides a tool for rejecting null hypotheses. In cases where a test fails to reject the null hypothesis, classical tests do not provide justification for null hypothesis corroboration. On the other hand, in the case where the null hypothesis is rejected, classical tests, in the absence of information about sample size and effect size, have no means of inferring practical significance. Philosophers of statistics Deborah Mayo and Aris Spanos attempt to solve these problems by extending classical hypothesis testing and situating it within a philosophical framework (Mayo, 2018; Mayo & Spanos, 2011c). We will study this framework in 1.4.

4.3.2 Do hypothesis tests commit the base-rate fallacy?

Suppose that Jonah, a 45-year-old man, decides to take a genetic test to learn whether he has at least one copy of the $\epsilon 4$ allele, a genetic marker that is associated with increased risk of Alzheimer’s disease (Smith, 2000). The test says that Jonah is positive for one copy of $\epsilon 4$. What is the probability that someone like Jonah (a 45-year-old man with a copy of $\epsilon 4$) will develop Alzheimer’s disease? That is, if $A$ represents the event that someone like Jonah develops Alzheimer’s, what is $P(A \, | \, \epsilon4)$? Suppose there are good reasons to set $P(\epsilon 4 \, | \, A) = 2/3$.²⁵ Does that imply that there is a $2/3$ chance that Jonah will develop Alzheimer’s disease?

Smith, J. D. (2000). Apolipoprotein E4: An allele associated with many diseases. Annals of Medicine, 32(2), 118–127.

²⁵ For example, we might set this equality based on parameter estimation; see 1.5.

Fortunately, it does not. It follows directly and deductively from the axioms of probability that, in general, $P(A \, | \, \epsilon4) \ne P(\epsilon 4 \, | \, A)$; it is not acceptable to “reverse the conditional”. To be sure, these probabilities are related, but some information needs to be added to the right hand side for equality to hold. Specifically, one must also know (1) the base-rate of Alzheimer’s and (2) the base-rate of the $\epsilon 4$ allele, in the population of which Jonah is a member. For this reason, conflating $P(A \, | \, \epsilon4)$ and $P(\epsilon 4 \, | \, A)$ is called the base-rate fallacy or base-rate neglect.

Bayes’ theorem properly accounts for the base-rate of Alzheimer’s and $\epsilon 4$ when finding $P(A \, | \, \epsilon4)$. By Bayes’ theorem, \[ \begin{aligned} P(A \, | \, \epsilon4) = \frac{P(\epsilon 4 \, | \, A)P(A)}{P(\epsilon 4)}. \end{aligned} \] Alzheimer’s Association (2025) and The Tech Interactive (2023) estimate that about one tenth of males develop Alzheimer’s disease, and about one quarter of people carry $\epsilon 4$. Again, for the sake of argument, let $P(A) = 1/10$, and $P(\epsilon 4) = 1/4$. Thus, \[ \begin{aligned} P(A \, | \, \epsilon4) = \frac{P(\epsilon 4 \, | \, A)P(A)}{P(\epsilon 4)} = \frac{(2/3)(1/10)}{1/4} \approx 0.27. \end{aligned} \] With this new knowledge that Jonah carries the $\epsilon 4$ allele, his risk of Alzheimer’s disease—quantified as a probability derived from a population of sufficiently similar individuals—moved from $0.10$ to $0.27$. This is a large risk, to be sure (a $2.7$x increase in risk!); but note that, thankfully, his risk is not as high as $P(\epsilon 4 \, | \, A) \approx 0.67$.
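The Bayes' theorem calculation above can be reproduced exactly with rational arithmetic. This is a minimal sketch using the values assumed in the text for the sake of argument; the variable names are ours.

```python
from fractions import Fraction

# Values assumed in the text for the sake of argument.
p_e4_given_a = Fraction(2, 3)  # P(e4 | A)
p_a = Fraction(1, 10)          # P(A), base rate of Alzheimer's disease
p_e4 = Fraction(1, 4)          # P(e4), base rate of the allele

# Bayes' theorem: P(A | e4) = P(e4 | A) P(A) / P(e4).
p_a_given_e4 = p_e4_given_a * p_a / p_e4

print(p_a_given_e4, round(float(p_a_given_e4), 2))  # prints 4/15 0.27
```

Changing either base rate changes the answer, which is why $P(\epsilon 4 \, | \, A)$ alone cannot stand in for $P(A \, | \, \epsilon 4)$.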

Alzheimer’s Association. (2025). 2025 Alzheimer’s disease facts and figures. https://www.alz.org/alzheimers-dementia/facts-figures.

The Tech Interactive. (2023). Alzheimer’s genetics and the APOE e4 allele. https://www.thetech.org/ask-a-geneticist/articles/2023/alzheimer-genetics-and-apoe-e4/.

Gelman, A. (2013a). Commentary: P values and statistical practice. Epidemiology, 24(1), 69–72.

Howson, C. (1997). Error probabilities in error. Philosophy of Science, 64(S4), S185–S194.

Some accuse frequentist hypothesis testing of committing the base-rate fallacy (Gelman, 2013a; Howson, 1997). They argue that the logic of hypothesis testing is essentially like the genetic testing case above; frequentist tests equate the probability of the evidence, given the hypothesis, with the probability of the hypothesis, given the evidence. That is, frequentists “reverse the conditional” without accounting for base-rate information. Where in frequentist inference does this reversal happen? According to this criticism, the frequentist p-value is analogous to the probability of evidence, given the hypothesis; that is, analogous to $P(\epsilon 4 \, | \, A)$ in the example above. On its face, this analogy has some plausibility. The p-value is akin to $P(\epsilon 4 \, | \, A)$ in the sense that it is a probability about data. Specifically, p-values are probabilities assigned to test statistics—which are built from data—under the assumption of the null hypothesis.

However, this analogy fails to demonstrate that hypothesis testing commits the base-rate fallacy, on two related fronts. First, the p-value—unlike $P(\epsilon 4 \, | \, A)$—is not a conditional probability. For example, the p-value for an “upper-tailed test” (e.g., the test in the BK example) is \[ P\left(d(\mathbf{X}) \ge d(\mathbf{x}); H_0\right), \] where $d(\mathbf{X})$ is the test statistic as a random variable, $d(\mathbf{x})$ is the value of the test statistic for the data in question, and $H_0$ is a hypothesis that fixes a subset of the statistical model that describes the DGP. The use of “$\, ; \,$” rather than “$\, | \,$” is not a distinction without a difference. The former is used intentionally to emphasize the fact that a p-value is not a conditional probability, but rather a probability of data computed under an assumed statistical model, i.e., the model specified by $H_0$. In fact, it would be incoherent for a frequentist to write \[ P\left(d(\mathbf{X}) \ge d(\mathbf{x}) \, | \, H_0\right), \] for the simple fact that conditional probabilities are defined in terms of conditioning on a random event or variable, and for a frequentist, $H_0$ is not a random event or variable. Even if $H_0$ were treated as a random variable, in many cases, $H_0$ has zero probability, e.g., $H_0: \mu_1 - \mu_2 = 0 \in \mathbb{R}$, and thus the conditional probability would be undefined.²⁶ ²⁷

²⁶ This claim follows directly from the definition of conditional probability. $P(A \, | \, B) = \frac{P(A \cap B)}{P(B)}$ is undefined when $P(B) = 0$.

²⁷ For this argument, I am indebted to Aris Spanos’ and Deborah Mayo’s insightful, contentious, and frankly hilarious comments on the blogpost Gelman (2013b). Just to cite one comment, from Spanos: “I’m amazed that competent statisticians are even debating [whether p-values are conditional probabilities]...The statement ‘conditional on $H_0$ being true’ is the source of confusion in this context. In frequentist statistics it means ‘evaluating the probability of a certain legitimate event, say $\{d(X) > d(x))\}$, under the scenario that $H_0$ is true’; any other interpretation is illegitimate. In frequentist statistics legitimate events are only those that belong to the sigma-field generated by the sample space, i.e. any Borel function of the sample $X$. As pointed out by Wasserman conditioning on assertions pertaining to the unknown parameters like $\theta = \theta_0$, makes NO sense in frequentist inference; period!”

Gelman, A. (2013b). Misunderstanding the p-value. https://statmodeling.stat.columbia.edu/2013/03/12/misunderstanding-the-p-value/ (original), archived at https://archive.ph/J9Aqf.

Second, the base-rate fallacy involves incorrectly equating $P(A \, | \, \epsilon4)$—a conditional probability of a hypothesis—with $P(\epsilon 4 \, | \, A)$—a conditional probability of a piece of evidence. There is no such equivalent misconception in a properly conducted frequentist hypothesis test. A frequentist p-value is not meant to stand in as a probability of a hypothesis, null or otherwise. Instead, it is a measure of discordance. When a practitioner uses a p-value to stand in for the probability of a hypothesis, they are doing so in violation of the underlying logic of hypothesis testing. Such misuses and misinterpretations are common, and when they occur, they are fallacious.

Many textbooks, papers, and professional guidelines have explicitly warned about the misuse of p-values, including the specific misinterpretation of the p-value as a probability of the hypothesis (Goodman, 2008; Wasserstein et al., 2019; Wasserstein & Lazar, 2016). In just one example, Wasserstein & Lazar (2016) write:

Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. Seminars in Hematology, 45, 135–140.

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p< 0.05.” The American Statistician, 73(sup1), 1–19.

Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.

The fact that some misinterpret p-values is not a criticism of the logic of hypothesis testing. In this context, the base-rate fallacy may be a legitimate sociological critique rather than a philosophical critique: p-values are easily misinterpreted; in fact, they are so easily misinterpreted that we ought to use other inference methods. However, it may also be the case that the foundations of frequentist inference are philosophically sound for producing reliable inferences, and rather than abandoning those foundations, we need better education and research practices to minimize misuses.

Thus far, we have established that p-values are not miscalculated probabilities of hypotheses, and thus, frequentist inference is not guilty of the base-rate fallacy. But how might a frequentist approach the Jonah case? We will argue that the Jonah case is not a hypothesis testing problem, but instead, a probability prediction problem, consistent with frequentist principles.

As outlined in Spanos (2010), there are two different ways we might think of randomness in the Jonah case. The first way is to define $A$ as the event that someone like Jonah develops Alzheimer’s disease, and $P(A \, | \, \epsilon 4)$ as the probability that someone like Jonah has Alzheimer’s disease, given that they have $\epsilon 4$ (this is how we framed the case above). This approach suggests that Jonah is a randomly selected person from a population.²⁸ The selection process is implicitly governed by a statistical model, which assigns probabilities to the relevant events. The statistical model is comprised of two possibly correlated Bernoulli random variables, say $Z_1 \in \{1, 0\}$ and $Z_2 \in \{1, 0\}$, which model whether an individual has or doesn’t have Alzheimer’s, and a copy of $\epsilon 4$, respectively. Formally, the statistical model can be defined as \[ \begin{aligned} \mathcal{M}_{\boldsymbol\theta}(\mathbf{z}) =\{ \mathbf{Z} = (Z_1, Z_2) \sim \text{BBernoulli}(\boldsymbol\theta)\}, \end{aligned} \] where:

²⁸ This phrasing is intentional, but also vague, and should remind us of the reference class problem from Chapter 3!

$\text{BBernoulli}(\boldsymbol\theta)$ denotes the bivariate Bernoulli distribution;
$Z_1 \in \{1, 0 \}$, where $1$ denotes that an individual has Alzheimer’s disease and $0$ denotes that an individual does not have Alzheimer’s disease;
$Z_2 \in \{1, 0\}$, where $1$ denotes that an individual has a copy of $\epsilon 4$ and $0$ denotes that an individual does not have a copy of $\epsilon 4$.
$\boldsymbol\theta = (\theta_{11}, \theta_{10}, \theta_{01}, \theta_{00}) \in (0,1)^4$ represents the probabilities of various outcomes, e.g., $P(Z_1 = 1, Z_2 = 1\, ; \, \boldsymbol\theta) = \theta_{11}$;²⁹
$\theta_{11} + \theta_{10} + \theta_{01} + \theta_{00} = 1$; and
The probability distribution function of $\text{BBernoulli}(\boldsymbol\theta)$ is written as \[ f(Z_1 = z_1, Z_2 = z_2\, ; \, \boldsymbol\theta) = \theta_{11}^{z_1z_2}\theta_{10}^{z_1(1-z_2)}\theta_{01}^{(1-z_1)z_2}\theta_{00}^{(1-z_1)(1-z_2)}. \]

²⁹ Note that the comma should be read as “and”, i.e., that a sampled individual has Alzheimer’s and a copy of $\epsilon 4$.

The notation above includes “$\, ; \, \boldsymbol\theta$” in probability statements, to be explicit about the parameters and the assumed statistical model. Written this way, events—everything before “;”—are clearly differentiated from parameters—everything after “;”.

With this framing, we can see that the probabilities used in the setup of the Jonah case are probabilities of events—i.e., probabilities involving $\mathbf{Z}$—under this statistical model, and thus, just functions of the model parameters, $\boldsymbol\theta$: \[ \begin{aligned} P(A) &= P(Z_1 = 1\, ; \, \boldsymbol\theta) \\ &= f(Z_1 = 1, Z_2 = 0\, ; \, \boldsymbol\theta) + f(Z_1 = 1, Z_2 = 1\, ; \, \boldsymbol\theta) \\ &= \theta_{10} + \theta_{11}. \\ \\ P(\epsilon 4 \, | \, A) &= P(Z_2 = 1 \, | \, Z_1 = 1\, ; \, \boldsymbol\theta) \\ &= \frac{f(Z_2 = 1, Z_1= 1\, ; \, \boldsymbol\theta)}{f(Z_1= 1\, ; \, \boldsymbol\theta)} \\ &= \frac{\theta_{11}}{\theta_{10} + \theta_{11}}. \\ \\ P(\epsilon 4) &= P(Z_2 = 1\, ; \, \boldsymbol\theta) \\ &= f(Z_1 = 0, Z_2 = 1\, ; \, \boldsymbol\theta) + f(Z_1 = 1, Z_2 = 1\, ; \, \boldsymbol\theta) \\ &= \theta_{01} + \theta_{11}. \end{aligned} \] Thus, when setting up the Jonah problem, we implicitly assumed values for $\boldsymbol\theta = \boldsymbol\theta^*$: \[ \begin{aligned} P(A) &= 1/10 \\ P(\epsilon 4 \, | \, A) &= 2/3 \\ P(\epsilon 4) &= 1/4 \end{aligned} \quad \iff \quad \begin{aligned} \theta^*_{11} &= 1/15\\ \theta^*_{10} &= 1/30\\ \theta^*_{01} &= 11/60 \\ \theta^*_{00} &= 43/60 \end{aligned} \] Under this interpretation of the problem, the goal of finding \[ P(A \, | \, \epsilon 4) = P(Z_1 = 1 \, | \, Z_2 = 1\, ; \, \boldsymbol\theta^*) \] is not a frequentist hypothesis test—i.e., not a problem of inferring the value of unknown parameters given data—but a prediction problem under assumed values of the model parameters, $\boldsymbol\theta = \boldsymbol\theta^*$. The interpretation of $A$ as arising from random sampling under $\mathcal{M}_{\boldsymbol\theta^*}(\mathbf{z})$ renders it permissible to assign a (relative frequency) probability to $P(A \, | \, \epsilon 4)$. This probability can be computed in (at least) two ways:

From the definition of conditional probability. \[ \begin{aligned} P(Z_1 = 1 \, | \, Z_2 = 1\, ; \, \boldsymbol\theta) &= \frac{f(Z_1 = 1, Z_2 = 1\, ; \, \boldsymbol\theta) }{f(Z_2 = 1\, ; \, \boldsymbol\theta) } \\ &= \frac{\theta_{11}}{\theta_{01} + \theta_{11}} \\ &= \frac{1/15}{11/60 + 1/15} \approx 0.27. \end{aligned} \]
From Bayes’ theorem (as we did above). \[ \begin{aligned} P(Z_1 = 1 \, | \, Z_2 = 1\, ; \, \boldsymbol\theta) &= \frac{f(Z_2 = 1\, | \, Z_1 = 1\, ; \, \boldsymbol\theta)P(Z_1 = 1 \, ; \, \boldsymbol\theta)}{f(Z_2 = 1\, ; \, \boldsymbol\theta) } \\ &= \frac{P(\epsilon 4 \, | \, A)P(A)}{P(\epsilon 4)} \\ &= \frac{(2/3)(1/10)}{1/4} \approx 0.27. \end{aligned} \]
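The arithmetic in both routes can be checked with exact fractions. The following is a minimal sketch; the function names `theta_from_setup` and `p_a_given_e4` are illustrative choices and do not come from Spanos (2010).

```python
from fractions import Fraction


def theta_from_setup(p_a, p_e4_given_a, p_e4):
    """Recover (theta11, theta10, theta01, theta00) from the setup probabilities."""
    theta11 = p_e4_given_a * p_a   # P(Z1 = 1, Z2 = 1)
    theta10 = p_a - theta11        # P(Z1 = 1, Z2 = 0)
    theta01 = p_e4 - theta11       # P(Z1 = 0, Z2 = 1)
    theta00 = 1 - theta11 - theta10 - theta01
    return theta11, theta10, theta01, theta00


def p_a_given_e4(theta):
    """P(Z1 = 1 | Z2 = 1; theta), from the definition of conditional probability."""
    theta11, _, theta01, _ = theta
    return theta11 / (theta01 + theta11)


# Values assumed in the setup of the Jonah case.
p_a, p_e4_given_a, p_e4 = Fraction(1, 10), Fraction(2, 3), Fraction(1, 4)
theta_star = theta_from_setup(p_a, p_e4_given_a, p_e4)

via_definition = p_a_given_e4(theta_star)   # route 1
via_bayes = p_e4_given_a * p_a / p_e4       # route 2

print(theta_star)
print(via_definition, float(via_definition))
print(via_bayes == via_definition)
```

Both routes give $4/15 \approx 0.27$, which is expected: Bayes’ theorem is just the definition of conditional probability rewritten, and neither route involves inferring unknown parameters.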

To summarize this approach, under the frequentist lens, the Jonah case is not a hypothesis testing or inference problem, but rather a prediction problem, where the prediction is made under assumed or inferred values of parameters. Frequentist hypothesis testing or estimation theory can be used to infer the values of $\boldsymbol\theta$, which, in turn, are used to make the prediction.

The second way to think about randomness in this case might be to define $A$ as the event that a specific individual Jonah—and not just that someone like Jonah randomly sampled from some population—develops Alzheimer’s disease. This framing reflects cases where Jonah is not a randomly selected individual from some population, but a specific individual presented to us, unrelated to random selection (e.g., perhaps Jonah is a relative). The issue here is that, according to frequentist statisticians and those that adhere to the frequentist interpretation of probability, $A$ is no longer an event in $\mathcal{M}_{\boldsymbol\theta}(\mathbf{z})$, because it does not arise from the probabilistic process in $\mathcal{M}_{\boldsymbol\theta}(\mathbf{z})$. Instead, it is a fixed feature of the world—either Jonah will develop Alzheimer’s disease or not. To quote Spanos (2010) on this interpretation,

Spanos, A. (2010). Is frequentist testing vulnerable to the base-rate fallacy? Philosophy of Science, 77(4), 565–583.

the event[$A$] lies outside the intended scope of the statistical model...which aims to provide an idealized description of the disease’s incidence in the target population, and not the affliction of a particular individual; for the latter, one needs to specify a different statistical model.

Under $\mathcal{M}_{\boldsymbol\theta}(\mathbf z)$ and this interpretation of $A$, frequentist inference does not have anything to say about probabilities related to $A$, because $A$ is not an event. Spanos (2009) provides an alternative pathway—a logistic regression statistical model—for assigning a probability to $A$ under this interpretation. In that case, $A$ can be contextualized as a legitimate event arising from a probabilistic process, and thus, can be assigned a probability.

To summarize, the question what is $P(A \, | \, \epsilon 4)$? is not a question for frequentist hypothesis testing. Tests involve inference to unknown parameters arising in a statistical model. Once the statistical model in the Jonah example was made explicit, it became clear that all model parameters were assumed to be known. Instead, the question what is $P(A \, | \, \epsilon 4)$? is a simple prediction made based on the assumed statistical model with (allegedly) known parameters. A true frequentist inference as part of a hypothesis test would involve a set of unknown parameters about which we want to learn.

Spanos, A. (2009). Model-based induction and the frequentist interpretation of probability. Virginia Tech working paper, 2009. Philosophy of Econometrics 393.

4.3.3 Do hypothesis tests commit Bernoulli’s fallacy?

Bernoulli’s fallacy is a supposed mistake in reasoning that occurs when an inference to a hypothesis is made based solely on sampling probabilities—probabilities assigned to data under a specified hypothesis—without accounting for inferential probabilities—probabilities assigned to hypotheses themselves (Clayton, 2021). According to Clayton (2021), the founders of frequentist methods—including Jacob Bernoulli, Francis Galton, Karl Pearson, and Ronald Fisher —built their methods on this fallacy: “Bernoulli’s Fallacy @sec-context for a discussion of probabilism). Is it philosophically coherent to assign probabilities to hypotheses? Do Dutch Book arguments, or arguments that generalize deductive logic, provide the proper foundation and normative structure for inferential probabilities? To be sure, Clayton (2021, Chapter 1) engages with these questions and the larger debates; he endorses a logical interpretation of probability, out of which, his Bayesian probabilism flows. But, as discussed in [sec:logicial_objections], there are serious objections to this view, some of which, Clayton (2021) does not engage with directly. A rational person might take these objections seriously enough to reject the logical interpretation of probability and the Bayesian probabilism that follows from it.

Now that we have established the connection between Bernoulli’s fallacy and probabilism, there are two paths forward for arguing that frequentist inference commits Bernoulli’s fallacy. The first is to claim that frequentists intend to produce an inferential probability, but do so incorrectly by reversing the conditional. Thus, what they end up with is a probability that is meant to represent an inferential probability, but is not calculated correctly. Given our discussion of the base-rate fallacy, and what we know about the correct usage of frequentist inference—i.e., that properly understood, it is not a kind of probabilism —this path is a dead end. Frequentist methods produce inferences by means other than inferential probabilities.

The second path is to claim that frequentists ought to produce inferential probabilities, but do not. Instead, what frequentists actually do—construct test procedures that control error rates in an attempt to not often be wrong—is fallacious. This path strikes at the core of what inference actually is. Is the goal of inference to assign a number to a hypothesis that represents the strength of the evidence at hand? Or is the goal of inference to put hypotheses through stringent tests, and infer those hypotheses that emerge from those tests? Unfortunately, outside of claims that frequentist inference commits the base-rate fallacy or similar—claims that we refuted above—Clayton (2021) does not establish that frequentist inference commits Bernoulli’s fallacy by engaging directly with the proper philosophical foundations and logic of frequentist inference. Why is it that probabilism is preferable to the most charitable interpretation of hypothesis testing, as a method for probing the world and severely testing claims? Are there reliable ways of learning about the world, if one rejects probabilism on philosophical or logical grounds? As we will see in 1.4, Mayo and Spanos provide a compelling philosophical foundation for the logic of frequentist inference—called the error statistical philosophy—based on long-run performance, probativeness, and severity (Mayo, 2018; Mayo & Spanos, 2011c).

³⁰ “Base rate neglect and the prosecutor’s fallacy are the same thing, and both are examples of Bernoulli’s Fallacy” (Clayton, 2021)

Interestingly, Clayton (2021) argues that Bernoulli’s fallacy is a generalization of the base-rate fallacy.³⁰ But as we saw in the last section, the base-rate fallacy occurs when a statistical model is used incorrectly in the “forward direction”—i.e., it is used to compute probabilities of events under assumed model parameter values, but the computations are wrong because they neglect base rates. In light of this analysis, it should be clear that when the base-rate fallacy is committed in the frequentist context, it does not involve inferential probabilities, and thus, cannot be a specific example of Bernoulli’s fallacy. Under a frequentist statistical model, inferential probabilities do not exist—probabilities cannot be assigned to hypotheses. Thus, despite Clayton’s (2021) claim that Bernoulli’s fallacy is a generalization of the base-rate fallacy, it is not.

4.3.4 Probabilistic modus tollens?

Recall that Popper’s proposed solution to the problem of induction—falsificationism— provided some motivation for the logic of hypothesis testing. Additional observations of black ravens could not deductively establish the theory that all ravens are black. But, a single non-black raven could falsify the theory, through the following argument:

If all ravens are black then no one will ever observe a white raven.
But researchers just observed a white raven.
Therefore, it’s not the case that all ravens are black.

This argument is deductively valid: if the premises are true, then the conclusion must be true. The general form of this valid argument is called modus tollens (“the mode of denying”):³¹

³¹ Clayton (2021, Chapter 2) calls this argument form reductio ad absurdum (“reduced to the absurd”). Indeed, modus tollens can be interpreted as a specific example of a reductio ad absurdum. The latter is a more general argument strategy that involves asserting a claim, $C$, deriving a contradiction from it, and thereby establishing the negation of $C$.

Clayton, A. (2021). Bernoulli’s fallacy: Statistical illogic and the crisis of modern science. Columbia University Press.

If $H$ then $e$ is impossible.
But, $e$.
Therefore, not $H$.

Modus tollens is the logical foundation for hypothesis falsification. However, as mentioned above, when evidence is probabilistic, rather than deterministic, falsification becomes much more challenging. The probabilistic modus tollens (PMT) argument:

If $H$ then $e$ is improbable.
But, $e$.
Therefore, not $H$. (Or: C: Therefore, $e$ is evidence against $H$)

is not a valid form. Worse than being invalid, it is often an inductively weak argument (Dickson & Baird, 2011):

If H: Andrii is a hockey player then it is improbable that e: Andrii is a hockey goalie.³²
But, e: Andrii is a hockey goalie.
Therefore, $\neg H$: Andrii is not a hockey player.

³² P1 is empirically true. Most NHL teams roster 23 players, and only two are goalies.

³³ Not every zero probability event is impossible, but this one is!

Clearly, the fact that Andrii is a hockey goalie is demonstrative evidence that he is a hockey player. The primary issue is that PMT does not consider alternative hypotheses. Under the alternative hypothesis $\neg H$, that Andrii is not a hockey player, the evidence $e$ is even less probable: its probability is zero! That additional fact should count as evidence in favor of $H$, since even though $e$ is improbable under $H$, it’s impossible under $\neg H$.³³

Some, e.g., Dickson & Baird (2011) and Clayton (2021), accuse frequentist hypothesis testing of committing the fallacy of PMT. At a high level of generality, the argument underlying hypothesis testing is similar to PMT:

Dickson, M., & Baird, D. (2011). Significance testing. Philosophy of Statistics, 199–229.

If $H_0$, then the probability (prior to collecting data) that the test statistic $d(\mathbf{X})$ falls in the rejection region is low, i.e., $P(d(\mathbf{X}) \in \mathcal{R}; H_0) < 0.05$.
But, for the given data $\mathbf x$, the test statistic did fall in the rejection region, i.e., $d(\mathbf{x}) \in \mathcal{R}$.
Therefore, $H_0$ is false. (Or: C: Therefore, $d(\mathbf{x}) \in \mathcal{R}$ is evidence against $H_0$.)

As stated, it does seem that frequentist testing is guilty of the fallacy of PMT. In the absence of a clearly defined statistical model that encodes various alternative hypotheses, the inference from a test statistic in a rejection region (or low p-value) to the negation of the null hypothesis is not warranted.

However, at least with respect to Neyman-Pearson testing, and the testing procedure endorsed by Mayo and Spanos described in 1.2, this objection ultimately fails. While it is true that, at a high level of generality, PMT is a weak argument form, there are specific features of hypothesis testing that are added to the basic structure of PMT that strengthen the hypothesis testing argument. First, a proper statistical model explicitly articulates a parameter space, which encodes the set of possible hypotheses—both null and alternative. Thus, probabilities of events can be computed under any of those hypotheses. These probabilities are computed, as a best practice, when finding type II error rates, the power for various alternatives, and, importantly, in the construction of rejection regions (see the discussion of best tests in 1.2.1). The statistical model blocks the Andrii example (and other similar cases).

Neyman-Pearson testing, with its proper error control and optimal rejection regions, is not guilty of the fallacy of PMT. But given that Fisher’s testing does not formulate specific alternative hypotheses, it is plausible that his testing commits the fallacy of PMT. If we reject $H_0$ based on a small p-value, without accounting for the probability of more extreme evidence under viable alternative hypotheses, we are in danger of this fallacy. In his most careful moments, Fisher considered the proper conclusion of hypothesis testing to be: either the null hypothesis is false or something rare happened (Fisher, 1956). Fisher’s testing was more exploratory, and a conclusion like this one is logically consistent with his falsification ends. It is also somewhat less helpful if the goal is a procedure with the capability to actually falsify hypotheses.

Fisher, R. A. (1956). Statistical methods and scientific inference. Oliver & Boyd.

This criticism of Fisherian tests is not an argument against p-values per se. In what has become common practice in hypothesis testing, the p-value is compared to the probability of type I error, $\alpha$, for the given rejection region, and the null hypothesis is rejected (or rendered suspect) if the p-value is smaller than $\alpha$. This practice is equivalent to the Neyman-Pearson practice of rejecting the null hypothesis when the test statistic falls in the rejection region. Thus, p-values used correctly (e.g., not interpreted as a probability of $H_0$) find their justification in Neyman-Pearson testing.
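The equivalence between the two decision rules can be checked numerically. The sketch below assumes, purely for illustration, a one-sided test whose test statistic $d(\mathbf{X})$ is standard normal under $H_0$, with rejection region $d(\mathbf{x}) > c_\alpha$; the function names and the example values of the test statistic are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed null distribution of the test statistic


def rejects_by_region(d_obs, alpha):
    """Neyman-Pearson rule: reject when d(x) exceeds the critical value."""
    critical_value = STD_NORMAL.inv_cdf(1 - alpha)
    return d_obs > critical_value


def p_value(d_obs):
    """One-sided p-value: P(d(X) >= d_obs; H0)."""
    return 1 - STD_NORMAL.cdf(d_obs)


def rejects_by_p_value(d_obs, alpha):
    """P-value rule: reject when the p-value is smaller than alpha."""
    return p_value(d_obs) < alpha


alpha = 0.05
observed_statistics = [-1.0, 0.5, 1.6, 1.7, 2.5, 3.2]  # hypothetical values
decisions = [
    (d, rejects_by_region(d, alpha), rejects_by_p_value(d, alpha))
    for d in observed_statistics
]
for d, by_region, by_p in decisions:
    print(d, round(p_value(d), 4), by_region, by_p)
```

For every value of the test statistic, the two rules return the same decision, because the p-value falls below $\alpha$ exactly when the statistic exceeds the critical value that defines the rejection region.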

4.3.5 Statistical causes of the replication crisis

Hypothesis testing has seen widespread uptake in scientific literature, and in data science and business applications. But much of that use has not been consistent with the logic of either Fisher’s p-value testing or Neyman-Pearson’s error control testing. Instead, many applications of hypothesis testing have been an inconsistent hybrid of these two approaches, causing negative consequences in science (Colling & Szűcs, 2021; Gigerenzer, 1993, 2004). This inconsistent hybrid, called the “null ritual” by Gigerenzer (2004), involves (1) setting up a null hypothesis of no relationship (e.g., no mean difference, zero correlation), without a pre-specified alternative or research hypothesis; (2) rejecting the null hypothesis and claiming a statistically significant result if the p-value is less than $\alpha$ (typically $\alpha = 0.05$), and failing to reject it otherwise; and (3) always performing steps (1) and (2) for inference. The null ritual is thought to be a cause of poor hypothesis test performance, and thus, a cause of a large number of false research findings.

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Lawrence Erlbaum Associates.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

This “null ritual objection” is not about the logical or philosophical foundations of either testing paradigm. Instead, the objection points to a pernicious way of blending the two testing paradigms. This practice overemphasizes the notion of statistical significance, and deemphasizes the very aspects of tests that make them powerful: critical thinking about statistical modeling of the data generating processes, test performance, and a commitment to “finding things out” (Mayo & Spanos, 2011c). It incentivizes users of tests to fit their data analyses into pre-existing procedures and tables and make overly rigid, often poorly justified inferences and decisions. In this section, we will consider the negative consequences of the null ritual, including a set of practices that have given rise to the “replication crisis” in science: p-hacking, an abundance of researcher degrees of freedom, and the file drawer effect (Open Science Collaboration, 2015; Ioannidis, 2005; Simmons et al., 2011).

4.3.5.1 P-hacking

In 2015, a published research study found that eating a bar of dark chocolate per day can help individuals on a low-carbohydrate diet lose weight significantly faster than those who did not eat dark chocolate (Bohannon, 2015). The study, led by scientist Johannes Bohannon, randomly assigned subjects into three diet groups: one on a low-carbohydrate diet; a second on the same low-carbohydrate diet plus a $1.5$ ounce bar of dark chocolate; and a third control group who made no changes to their regular diet. The study found that the “chocolate group” lost weight 10$\%$ faster than the low-carb group that did not eat chocolate. The result was statistically significant at the $\alpha = 0.05$ level—i.e., the hypothesis test for the difference in mean weight loss across the two groups had a p-value less than $0.05$. This study was reported in the press as a major breakthrough, including headlines:

“Slim by Chocolate!”

“Why you must eat chocolate every day.”

“Has the world gone coco? Eating chocolate can help you LOSE weight.”

While the study really did find the statistically significant result, it is almost certainly a type I error—an incorrect rejection of the null hypothesis, $H_0$: dark chocolate does not help lower weight. Bohannon, who holds a PhD in molecular biology of bacteria but is a journalist by trade, knew that the result was fallacious. In fact, he set the entire study up to expose deep flaws in the way that hypothesis tests are conducted, and in the ways that scientific results are reported by popular media.

So, how did Bohannon construct this study to obtain such a striking error? It turns out that the “null ritual” can easily be abused in ways that are very likely to produce errors. Here’s one reliable way, which Bohannon capitalized on: conduct several tests simultaneously on a small number of participants. In the chocolate study, Bohannon tested the relationships between $18$ different variables—things like weight, cholesterol, blood proteins, sleep quality—on $15$ individuals across the three diet groups. Testing that many relationships across such a small number of people practically guarantees that at least one of them will turn out to be statistically significant. So, while Bohannon didn’t know that chocolate would be associated with weight loss, it was practically guaranteed that chocolate consumption would be associated with at least one of the measured variables. In fact, Bohannon also found that dark chocolate was associated with better cholesterol levels and an increase in overall well-being (Bohannon, 2015).

Andrade, C. (2021). HARKing, cherry-picking, p-hacking, fishing expeditions, and data dredging and mining as questionable research practices. The Journal of Clinical Psychiatry, 82(1), 25941.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Bohannon’s techniques are a clear example of p-hacking (also known as data dredging). P-hacking is any data analysis practice that (knowingly or unknowingly) violates the logic of hypothesis testing to falsely claim statistically significant results (Andrade, 2021; Simmons et al., 2011). In the chocolate study, the flawed logic is related to the number of tests conducted. Just because any individual hypothesis test has a type I error rate of $\alpha = 0.05$, it does not follow that the family-wise error rate (FWER)—the probability that at least one type I error occurs over $m$ tests—will be small. For $m$ independent tests of size $\alpha$, the family-wise error rate is $\alpha_m = 1 - (1-\alpha)^m$. In the chocolate study, $\alpha_m = 1 - (1-0.05)^{18} \approx 0.60$. There was a $60\%$ chance that at least one relationship would be reported as significant, when in fact, no relationships exist!
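The family-wise error rate formula can be checked with a short simulation. This is a minimal sketch under two stated assumptions: the $m$ tests are independent (as in the formula above; the 18 outcomes in the chocolate study were likely correlated), and every null hypothesis is true, so that each p-value is uniformly distributed on $[0, 1]$. The function names are illustrative.

```python
import random


def fwer_independent(alpha, m):
    """Family-wise error rate for m independent tests of size alpha."""
    return 1 - (1 - alpha) ** m


def simulate_fwer(alpha, m, n_studies, seed=0):
    """Monte Carlo estimate of the FWER when every null hypothesis is true.

    Each test is simulated by a uniform draw, since a p-value from a
    continuous test statistic is uniform on [0, 1] under a true null.
    """
    rng = random.Random(seed)
    studies_with_false_positive = 0
    for _ in range(n_studies):
        p_values = [rng.random() for _ in range(m)]
        if min(p_values) < alpha:  # at least one type I error in this study
            studies_with_false_positive += 1
    return studies_with_false_positive / n_studies


exact = fwer_independent(alpha=0.05, m=18)
estimate = simulate_fwer(alpha=0.05, m=18, n_studies=20_000)
print(round(exact, 3))  # 0.603
print(estimate)         # should be close to the exact value
```

Roughly six in ten simulated "studies" report at least one significant relationship, even though no relationships exist in the simulated data.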

Frequentist statisticians are well aware of this problem of multiple comparisons and the relationship between individual significance levels and the FWER. As such, the erroneous handling of multiple tests is not a philosophical critique of hypothesis testing, but a sociological one. As with any sociological critique, an appropriate answer might be: educate practitioners to conduct tests properly. To properly conduct multiple tests to ensure a lower FWER, one should adjust the significance level, $\alpha$, for each individual test. There is no universally agreed upon way to do this adjustment, but there are many options that control error rates. One option, called the Bonferroni correction, is to set the significance level for each individual test to $\alpha_m/m$, where $\alpha_m$ is the desired overall type I error rate. In the chocolate study, for an overall type I error rate of $\alpha_m = 0.05$, individual tests would need a p-value below $\alpha = 0.05/18 \approx 0.003$ to claim statistical significance. That is a much lower threshold for statistical significance, which, in turn, leads to a much lower prevalence of type I errors for any individual test. The Bonferroni correction is one of the more conservative multiple comparison adjustments. For more on methods for working with multiple tests, see Cui et al. (2021).
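As a small illustration of the Bonferroni correction, the sketch below computes the per-test threshold for the chocolate study and confirms that, for independent tests, the resulting family-wise error rate stays below the target. The function names and the list of p-values are hypothetical and are not taken from Bohannon's data.

```python
def bonferroni_threshold(alpha_m, m):
    """Per-test significance level that targets an overall rate of alpha_m."""
    return alpha_m / m


def fwer_independent(alpha, m):
    """Family-wise error rate for m independent tests of size alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha_m):
    """Reject each null whose p-value is below the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha_m, len(p_values))
    return [p < threshold for p in p_values]


threshold = bonferroni_threshold(alpha_m=0.05, m=18)
corrected_fwer = fwer_independent(threshold, m=18)
print(round(threshold, 4))  # 0.0028
print(corrected_fwer)       # just under 0.05

# Hypothetical p-values from four tests; the per-test threshold is 0.05 / 4.
hypothetical_p_values = [0.04, 0.001, 0.2, 0.03]
decisions = bonferroni_reject(hypothetical_p_values, alpha_m=0.05)
print(decisions)            # [False, True, False, False]
```

Note that two of the hypothetical p-values (0.04 and 0.03) would have been declared significant at the uncorrected $0.05$ level but are not after the correction.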

Cui, X., Dickhaus, T., Ding, Y., & Hsu, J. C. (2021). Handbook of multiple comparisons. CRC Press.

³⁴ This example is based on an interactive tool originally created by the blog FiveThirtyEight, and reconstructed at Heiss (2024).

Heiss, A. (2024). Hack your way to scientific glory: Reproducible research with r, quarto, GitHub, and the tidyverse. https://stats.andrewheiss.com/hack-your-way/.

³⁵ The combined metrics referenced in this example might be (weighted) averages computed on a standardized scale. Note that, in linear regression, the hypothesis test for the slope of the regression line, $\beta_1$, has a null hypothesis, $H_0: \beta_1 = 0$. Thus, assuming the model is correct, a rejection of the null, based on a small p-value, suggests that $\beta_1 \ne 0$, i.e., there is a linear relationship between $x$ and $y$.

Incorrectly accounting for multiple comparisons is not the only flavor of p-hacking. Let’s briefly consider two other variants. Imagine that we are a team of social scientists interested in the research question: is the U.S. economy impacted by the number of Democrats or Republicans in office?³⁴ We have access to plenty of real data going back to 1948, including information on the political party in power—which party holds the presidency, the Senate, the House, governors’ offices, etc.; and various indicators of economic health—unemployment, inflation, gross domestic product (GDP), and stock price measurements. First, let’s imagine that, prior to looking at the data, we know exactly how we want to measure our variables: we will measure the number of Democrats in office with reference to Democrat Senators and Representatives ($x$); and we will measure the strength of the economy through a combined metric involving unemployment, inflation, GDP, and stock prices ($y$). We conduct a regression of $y$ on $x$ and find a regression line with a negative slope and a p-value of $0.091$.³⁵ This p-value is above the standard threshold of $\alpha = 0.05$. That is disappointing, given how often academic journals require statistically significant results to publish. But we remember that some studies in the social sciences are publishable at the $\alpha = 0.1$ level, and our p-value is just below that. So, we use $\alpha = 0.1$, claim a statistically significant result, and send our study off for publication.

This process is an instance of p-hacking. The error rates for hypothesis tests covered in 1.2.1 are only valid when the significance level, $\alpha$, is specified prior to observing and analyzing data. If $\alpha$ is changed after analysis, the stated error rates no longer hold; it is more likely that the claimed results are wrong.
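A short simulation makes the inflation concrete. The sketch below assumes a true null hypothesis (so p-values are uniform on $[0, 1]$) and compares a researcher who fixes $\alpha = 0.05$ in advance with one who nominally uses $0.05$ but falls back to $0.1$ whenever the first threshold is missed. The function name and the fallback rule are illustrative assumptions.

```python
import random


def simulate_type_i_rates(n_studies, seed=1):
    """Type I error rates for a fixed and a post hoc 'flexible' threshold."""
    rng = random.Random(seed)
    fixed_rejections = 0
    flexible_rejections = 0
    for _ in range(n_studies):
        p = rng.random()  # p-value of one study under a true null
        # Pre-specified rule: alpha = 0.05, decided before seeing the data.
        if p < 0.05:
            fixed_rejections += 1
        # Flexible rule: try 0.05 first, then switch to 0.10 if that fails.
        if p < 0.05 or p < 0.10:
            flexible_rejections += 1
    return fixed_rejections / n_studies, flexible_rejections / n_studies


fixed_rate, flexible_rate = simulate_type_i_rates(n_studies=50_000)
print(fixed_rate)     # close to 0.05
print(flexible_rate)  # close to 0.10
```

The flexible researcher reports results as if the error rate were controlled at $0.05$, but the procedure actually followed rejects a true null about twice as often.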

Now, imagine a second version of this study, where we are not exactly sure how we want to measure what it means for Democrats or Republicans to hold power, nor are we sure how we want to measure the strength of the U.S. economy. So, we explore several possibilities for each, using simple linear regression to quantify relationships. We note that, if we measure the number of Democrats in office with reference to Democrat Senators and Representatives, and measure the strength of the economy by GDP, there is a statistically significant result at the $\alpha = 0.05$ level. The slope of the regression line is positive, with a p-value of $\approx 0.01$.

By failing to pre-specify how to measure our variables and instead exploring many possibilities before settling on one that was significant, we are (again) guilty of p-hacking. In a sense, this example is similar to the multiple comparison problem in the chocolate study. But here, the problem arises because of a lack of clarity about how we ought to operationalize the concepts in question. The concepts “political party in power” and “strong economy” have colloquial meaning and importance, but there is no clear and unambiguous way of quantifying them. An essential component of statistical work is that of translating the research concepts and hypotheses—given in scientific or colloquial language—into quantifiable variables and statistical hypotheses. As part of that process, researchers must operationalize their pre-statistical concepts in a way that is conducive to statistical analysis. If operationalization occurs after researchers explore or analyze the data—say by comparing many different possible ways of measuring the concepts in question—there is a high risk that researchers will select measurement processes that produce results that perpetuate biases.

4.3.5.2 Researcher degrees of freedom

A core problem in these cases, and with many real-world studies, is that there are too many researcher degrees of freedom. Researcher degrees of freedom are the various decisions that researchers can make—how to control for multiple tests, how to specify significance thresholds, how to operationalize variables—that impact the results of a study. Having too many researcher degrees of freedom is a serious problem. It is a sociological, rather than a philosophical, objection to hypothesis testing, and other statistical methods and paradigms have their own versions. Limiting researcher degrees of freedom will strengthen research findings by properly controlling for error rates. The following steps can help limit researcher degrees of freedom:

Commit to a pre-data methodology for operationalizing important variables and handling outliers. There are often various ways of measuring a concept—e.g., as we saw above, whether Democrats or Republicans hold power can be measured in many different ways. Clarity on variable operationalization procedures and justifications can reduce the number of researcher degrees of freedom.
Pre-specify significance levels. The logic of hypothesis testing requires that the significance level is specified in advance of actually conducting the hypothesis test. Pre-specifying significance levels (along with a properly specified statistical model) is a necessary condition of meeting stated error rates. A fluid definition of statistical significance inflates researcher degrees of freedom.
Control for multiple comparisons. As we saw above, conducting multiple tests without controlling the FWER leads to an inflated rate of type I errors across the family of tests. Planning the number of tests and control methods in advance reduces the number of researcher degrees of freedom by controlling the FWER.
Account and adjust for confounding factors. If a statistical model does not properly account for confounding factors, the results of the test will be biased. Further, estimated effect sizes will also be biased. For example, a regression of drowning deaths ($y$) against ice cream sales ($x$) might suggest that the latter impacts the former. However, if we control for outdoor temperature ($z$) (by including it in the regression model), the correlation will disappear. The reason for this change is that $z$ is a common cause of both, and once we adjust for it, we see that $x$ and $y$ are no longer related. Whether to include or exclude a particular variable in a statistical model is a researcher degree of freedom. Having a theoretical understanding of the relationships between variables, and what to include in a statistical model, will cut down on researcher degrees of freedom.³⁶
Report effect sizes alongside statistical significance. A result may reach statistical significance without being practically significant or important. Reporting an effect size can give the reader a sense of how important the result is in the given field.
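The confounding step above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are synthetic, the variables are standardized, temperature is the sole common cause, and the helper names (`pearson`, `partial_corr`) are introduced here for illustration. It uses a partial correlation, which is equivalent in spirit to adding $z$ to a linear regression of $y$ on $x$.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def partial_corr(x, y, z):
    """Correlation between x and y after linearly adjusting both for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]   # outdoor temperature (synthetic)
x = [zi + rng.gauss(0, 1) for zi in z]    # ice cream sales, driven only by z
y = [zi + rng.gauss(0, 1) for zi in z]    # drowning deaths, driven only by z

raw = pearson(x, y)
adjusted = partial_corr(x, y, z)
print(f"corr(x, y)     = {raw:.2f}")       # clearly positive
print(f"corr(x, y | z) = {adjusted:.2f}")  # close to zero
```

In this synthetic setup, $x$ and $y$ are correlated only through $z$, so the adjusted correlation should sit near zero while the raw correlation does not.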

³⁶ For more on causality and statistical inference, see Chapter 6!

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8). https://doi.org/10.1371/journal.pmed.0020124

³⁷ The most notable, but certainly not the only, example of a researcher degree of freedom in Bayesian inference is the choice of prior distribution. For more, see Chapter 5.

There are good reasons to believe that p-hacking is prevalent in science, and contributes to a large number of false research findings (Ioannidis, 2005). P-hacking can be done unintentionally; education on how to avoid it is essential. But of course, some may intentionally p-hack their data, knowing well that this practice leads to false results (Gigerenzer, 2004). We note that researcher degrees of freedom are not just a problem for frequentist hypothesis testing. Bayesian methods also allow for researchers to make decisions that impact their results.³⁷ Therefore, education around the reduction of researcher degrees of freedom is essential to statistical practice in general, and not a specific problem for frequentist inference.

4.3.5.3 The file drawer effect

The null ritual, in conjunction with some common academic publishing practices, is also responsible for the file drawer effect (also discussed in the context of probability and objectivity in Chapter 3). Recall that the file drawer effect occurs when multiple studies of the same research question are conducted, but only some of them (the statistically significant ones) are published. The rest are thrown in the proverbial “file drawer”. The logic of the problem is as follows: suppose that $m$ different studies are conducted on (roughly) the same null and alternative hypotheses. And, suppose that the null hypothesis is true: there is no effect. We would expect about $m\times\alpha$ of those studies to be type I errors: they would produce statistically significant results, and be presented as positive findings, even though the null hypothesis is true. If all of the studies were published, it would be relatively clear that the majority that yield no effect (fail to reject the null hypothesis) are the true results, and the few that yield an effect (reject the null) are false positives. However, publishing null results is not common or incentivized by the scientific publishing community. If only the positive, type I error studies were published (which is often the case, because journals or researchers themselves are biased toward only publishing positive findings), the scientific community and public at large will be led to think that the result is true. The true negative results, on the other hand, will not be published, destined for the file drawer.
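The $m \times \alpha$ logic can be checked with a short simulation. The sketch below is illustrative only: it assumes each study is a one-sided $z$-test with known $\sigma = 1$, that every study's data are generated under the null, and that only studies with $p < \alpha$ are "published". The function name and the values of $m$ and $n$ are hypothetical.

```python
import math
import random
from statistics import NormalDist

def simulate_studies(m, n, alpha, seed=0):
    """Run m one-sided z-tests of H0: mu = 0 on data generated under H0."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(m):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = math.sqrt(n) * (sum(sample) / n)  # sigma = 1 and mu_0 = 0
        p_values.append(1.0 - std_normal.cdf(z))
    # Only the significant studies are published; the rest go in the file drawer.
    published = [p for p in p_values if p < alpha]
    return p_values, published

m, n, alpha = 1000, 50, 0.05
p_values, published = simulate_studies(m, n, alpha)
print(f"expected false positives:  {m * alpha:.0f}")
print(f"simulated false positives: {len(published)}")
```

Because the null is true in every simulated study, every published result in this sketch is a type I error, even though each individual test holds its stated error rate.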

The file drawer effect is caused by the null ritual in conjunction with the lack of incentive to publish null results or replication studies (i.e., studies that attempt to replicate other published studies). Thus, guarding against the file drawer effect would require researchers and publishers to change their hypothesis testing approach, and their views on negative findings and replication studies. Thankfully, in recent years, change has begun. The practice of preregistration has become more common to help combat the file drawer effect. In preregistering a study, researchers are asking that the study design itself, and not just the outcome, be peer reviewed and published. That helps combat the file drawer effect because it implies a commitment to publish the results, positive, negative, or inconclusive (Center for Open Science, 2025).

Center for Open Science. (2025). Preregistration. https://www.cos.io/initiatives/prereg.

The file drawer effect is not a philosophical objection to frequentist hypothesis testing. The fact that replication transparency plays such an important role in solving the file drawer effect actually provides an argument in favor of core frequentist testing concepts. At the heart of hypothesis testing is the notion that, over many replications, properly constructed tests will produce the correct inference with high probability. The logic of frequentist testing provides the solution: replicate and verify!

4.4 Overcoming objections: Recent hypothesis testing innovations

In the previous section, we studied a series of philosophical and sociological objections that have been raised against hypothesis tests. Frequentists have strong responses to many of these objections. For example, many claim that frequentist hypothesis tests are guilty of the base-rate fallacy; but such an accusation either confuses parameter inference and prediction, or misinterprets the p-value. The most pressing unanswered objections to hypothesis tests are:

the fallacies of acceptance and rejection (and the related large $n$ problem);
the null ritual objection—i.e., the awkward interpretive issues that arise from blending two distinct testing paradigms; and
sociological consequences of the null ritual: p-hacking, an abundance of researcher degrees of freedom, and the file drawer effect.

The fallacies of acceptance and rejection involve stretching inference beyond what is justified by the logic of hypothesis tests: a statistically significant result does not lend evidence for any specific alternative hypothesis; and a statistically insignificant result does not lend evidence that the null hypothesis is true. These fallacies are related to difficulties in interpreting tests. On the one hand, Neyman-Pearson tests (1.2.1) are well-equipped to guide decisions—in the sense that the decision-maker would rarely be wrong in the long run—but fall short of providing a theory of inductive inference or evidential support. What should we infer or believe, based on the evidence and a set of severe tests? Neyman-Pearson tests are silent. On the other hand, Fisher’s tests (1.2.2) purport to be evidential and inferential, but their primary measure, the p-value, is not robust enough for specific inferences about the null or alternative hypotheses. The tension between these two testing paradigms, and the need for a satisfactory evidential interpretation of frequentist tests is, according to Mayo & Spanos (2011c), “the locus of the most philosophically interesting controversies and remains the major lacuna in using these methods for philosophy of science.”

The pernicious “null ritual” arose out of an inconsistent blending of these two testing paradigms. In turn, the null ritual perpetuated a series of misconceived practices, including p-hacking, researcher degrees of freedom, and the file drawer effect. These spurious practices are not permissible under a consistent use of either testing paradigm, but nevertheless have had a profound effect on the proliferation of false scientific claims.

As a response to these (and other) objections, Mayo and Spanos developed a philosophical framework—the error statistical philosophy—that purports to solve each of these problems (Mayo, 1996; Mayo, 2018; Mayo & Cox, 2011; Mayo & Spanos, 2011c, 2011b).

Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.

Mayo, D. G., & Spanos, A. (2011c). The error-statistical philosophy. In Error and inference: Recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science. Cambridge University Press.

4.4.1 The error statistical philosophy

Consider the following two cases:³⁸

³⁸ Both examples are a slight modification of one found in (Mayo, 2018, p. 14).

Case #1. Helen is concerned about gaining weight on vacation. Before leaving, she records her weight on her home digital scale: 145 pounds. The scale is 15 years old, and the battery is almost depleted. In the weeks leading up to this recording, Helen saw her weight fluctuate widely on this scale, anywhere from 138 pounds to 150 pounds. Further, a recent weight recording on her doctor’s office scale showed a weight of 148 pounds. Two weeks later, upon returning from vacation, the same home digital scale showed no weight gain: 145 pounds. Helen concludes that she has not gained significant weight on vacation.

Case #2. Helen’s husband Morty is also concerned about weight gain on vacation. Morty plans to record his weight on a friend’s newly purchased digital home scale, and on the big medical scale at the Mandelbaum Community Fitness Center. To ensure that the scales were well-calibrated, Morty weighs himself with and without a five pound dumbbell. Both scales accurately register the five pound weight. After this calibration, Morty records his weight as 192 pounds (digital scale) and 193 pounds (medical scale). Two weeks later, upon returning from vacation, Morty tests both scales again for proper calibration (with the five pound weight), and then records his weight: 191 (digital scale) and 193 (medical scale). Therefore, Morty concludes that he has not gained significant weight on vacation.

Clearly, Morty has done more than Helen to test whether he has gained weight on vacation. His concern about weight gain oriented him toward obtaining an accurate measurement of his weight, both before and after his vacation. He planned ahead, used multiple, well-calibrated measurement tools, and drew conclusions that were aligned with the output of those tools. Helen, on the other hand, used tools available to her without upfront planning or calibration.

These cases are analogous to two different ways of using classical frequentist hypothesis tests. In the same way that one can test for weight gain poorly or well, hypothesis tests can be used poorly—e.g., by not specifying a well-fitting statistical model; by choosing a convenient but inappropriate test for the DGP at hand; by p-hacking—or well—e.g., by specifying an appropriate statistical model; by choosing optimal tests (when available); by setting significance thresholds prior to data analysis. There is nothing about the mathematics or logic of hypothesis tests themselves that ensures their proper use. Additional, non-mathematical commitments are needed. The error statistical philosophy is an interpretative framework of such commitments. It is meant to guide the use of tests, and strengthen and clarify the kinds of inferences that arise from hypothesis tests. It strives to make statisticians and data scientists more like Morty than Helen.

At the foundation of the error statistical philosophy is a commitment to the value of objectivity. In the same way that Morty was committed to an accurate representation of his body weight, the “error statistician”—the statistician employing the error statistical philosophy—is committed to the value of objectivity. Mayo’s and Spanos’ conceptualization of objectivity is similar to the one sketched in [sec:objectivity_in_sci]. The error statistician is committed to constructing statistical models, collecting data, and running tests that uncover the facts of the matter. On objectivity, Mayo & Spanos (2011b) write:

Although knowledge gaps leave plenty of room for biases, arbitrariness, and wishful thinking, in fact we regularly come up against experiences that thwart our expectations, disagree with the predictions and theories we try to foist upon the world, and this affords objective constraints on which our critical capacity is built.

Morty might wish that the medical scale would show a weight five pounds lighter, but his experience with this well-calibrated scale provides an objective constraint on the way in which that wish influences his beliefs and actions. Similarly, the error statistician will make choices when using frequentist tests that place objective constraints on what we can reasonably infer about the world. There is nothing inherent in “classical” hypothesis testing—testing as derived by Fisher, Neyman, and Pearson—that guarantees a commitment to objectivity. Such tests can be employed in a way that fails to meet basic standards of objectivity. To produce inferences that “get it right” with high probability, practitioners must make an explicit philosophical commitment to objectivity. Such a commitment would entail, among other things, using tests in a way that preserves their error properties.

Uncovering the facts of the matter means not just falsification, as Popper and Fisher outlined. Morty and Helen both failed to falsify the theory that they did not gain weight. Put more formally, they failed to reject the null hypothesis $H_0$: there was no difference in body weight before and after vacation. Helen’s test of this hypothesis is just that: a failure to falsify $H_0$. But Morty’s test seems to provide something more: some corroboration for $H_0$. There appears to be good evidence that Morty has not gained weight.³⁹ By itself, classical hypothesis testing does not provide a means for corroboration. The error statistical philosophy provides a conceptual framework for corroboration, and other positive claims about evidence, e.g., about specific alternative hypotheses. The key question here is:

³⁹ At this point, we should note that, by specifying $H_0$, we’ve only partially formalized the Helen and Morty examples. A full formalization, which is required for corroboration and positive evidence claims, would specify the error properties of the scales in question, which would, in turn, contribute to a specification of the DGP.

When do data provide, or fail to provide, strong evidence for a specific hypothesis?

Classical hypothesis testing does not provide a satisfactory answer to this question. To move beyond simple falsification, and to formalize the intuition that Morty’s testing practices are stronger than Helen’s, Mayo and Spanos propose two principles for evidential interpretations of hypothesis tests. The first, called weak severity, provides the answer to the “negative” question, i.e., when data fail to provide evidence for a hypothesis.

Weak Severity Principle. Consider a DGP that gives rise to a test procedure $T$, actual data $x_0$, and a hypothesis $H$. $x_0$ and $T$ do not constitute evidence for $H$ if $T$ does not rule out ways in which $H$ may be false (Mayo, 2018; Mayo & Spanos, 2011b).

In Helen’s testing of $H_0$, she chose a test procedure $T$ to be a single scale of questionable accuracy, yielding measurements with high variability. Her data, $x_0$, includes one measurement before and one after vacation. Her selection of $x_0$ and $T$ do not constitute evidence for $H_0$ because her two measurements—and thus her conclusion—could easily have been wrong. If she weighed herself several times before and after vacation, it is likely that the weight measurements would show high variability, and it would not be clear which measurements—the actual ones she took, or the hypothetical ones likely to be quite different—were close to her true weight. Mayo (2018) calls cases like Helen’s BENT: bad evidence, no test.

A second principle, called strong severity, allows practitioners to avoid BENT tests and produce positive inferences.

Strong Severity Principle. Consider a DGP that gives rise to a test $T$, actual data $x_0$, and a hypothesis $H$. $x_0$ and $T$ count as evidence for $H$ just to the extent that $T$ was highly capable of finding discrepancies from $H$, but none (or few) were found (Mayo, 2018; Mayo & Spanos, 2011b).

Helen’s test does not meet the strong severity requirement, but Morty’s test does. Morty’s test procedure $T$—using two, well-calibrated scales with additional checks for accuracy and precision—along with his data $x_0$—four weight measurements, using two scales, both before and after vacation—provide good evidence for $H_0$. If Morty’s weight changed after vacation, his test would likely have found the discrepancy.
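Footnote 39 notes that a full formalization would specify the error properties of the scales. As a rough sketch of what that might look like, the snippet below assumes normally distributed, independent measurement errors and asks how capable each procedure is of detecting a real weight gain. Every number here (a 3 pound true gain, a 1.5 pound detection threshold, error standard deviations of 4 and 0.5 pounds) is a hypothetical choice for illustration, not a value given in the examples.

```python
import math
from statistics import NormalDist

def detection_probability(true_gain, scale_sd, n_scales, threshold):
    """P(measured change >= threshold) when the true change is true_gain.

    The measured change is the after-minus-before difference, averaged over
    n_scales independent scales, each with normal error of sd scale_sd.
    """
    sd_change = math.sqrt(2 * scale_sd**2 / n_scales)
    return 1.0 - NormalDist(true_gain, sd_change).cdf(threshold)

# Hypothetical error properties: one erratic scale vs. two calibrated scales.
helen = detection_probability(true_gain=3.0, scale_sd=4.0, n_scales=1, threshold=1.5)
morty = detection_probability(true_gain=3.0, scale_sd=0.5, n_scales=2, threshold=1.5)
print(f"Helen's procedure detects the gain with probability {helen:.2f}")
print(f"Morty's procedure detects the gain with probability {morty:.3f}")
```

Under these assumptions, Morty's procedure would almost certainly have flagged a real gain, so finding none is informative; Helen's procedure would have missed it a large share of the time, so her "no gain" reading carries little evidential weight.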

The weak severity principle blocks certain poor hypothesis testing practices, like the fallacy of acceptance and p-hacking. The strong severity principle is an expansion of “classical” frequentist hypothesis testing, allowing the error statistician to make positive, evidential claims about whether certain hypotheses are warranted, given a test procedure and data.

Consider applying these principles to the “chocolate study”, reported in Bohannon (2015) and described in 1.3.5. Let $H$ be the specific alternative hypothesis that the group who added chocolate to their diet lost weight 10$\%$ faster, on average, than the low-carb group that did not eat chocolate. The test procedure—in this case, the study design described above, including multiple tests—was designed to produce type I errors with high probability.⁴⁰ That is, it was not designed to specifically rule out ways in which $H$ (or any other alternative hypotheses) may be false, and thus, was not a severe test of $H$. Therefore, the chocolate study is BENT!

⁴⁰ The chocolate study was explicitly designed to produce type I errors. But the same argument holds about any study that was designed in this way, regardless of the intent of the designers.

How might we set up a severe test to detect whether chocolate has health benefits? Let’s consider two cases. First, imagine that, based on some biological theory, we have reason to believe that chocolate hastens weight loss. Further, we are not interested in exploring other biomarkers that chocolate might improve. In this case, we should use a well-calibrated test, preferably a “best test” if available (see 1.2.1), for the relationship between chocolate consumption and weight loss. Testing a single relationship, rather than multiple relationships without controlling for the FWER, reduces the probability of errors. The test should account for how the data were generated. The DGP may involve randomization (as in the original study, which randomly assigned dark chocolate to a subset of participants), statistical controls to model the effect of other variables, or both. The more physically representative and statistically descriptive the DGP, the more severe the test. A test derived from a DGP that does not accurately model the data will not be capable of revealing discrepancies from particular hypotheses, when they exist. Finally, the test should include a sample size (number of participants) that is tuned to discover practically significant effect sizes. Very large sample sizes are helpful in detecting small deviations from the null hypothesis, but, as we will see below, they are less severe tests of larger deviations.
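One common way to tune a sample size to a practically significant effect is a power calculation. The sketch below uses the standard textbook formula for a one-sided $z$-test with known $\sigma$; it is not a method described in this chapter, and the function name, the 80% power target, and the effect sizes are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def sample_size_one_sided_z(gamma, sigma=1.0, alpha=0.05, power=0.80):
    """Smallest n with the requested power against mu = mu_0 + gamma.

    Uses n = ((z_{1-alpha} + z_{power}) * sigma / gamma)^2, rounded up.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sigma / gamma) ** 2)

# Smaller practically significant effects call for larger samples.
for gamma in (0.5, 0.1, 0.03):
    print(f"gamma = {gamma}: n = {sample_size_one_sided_z(gamma)}")
```

The point of the calculation is to pick $n$ from the effect size that matters, rather than collecting as much data as possible and treating any significant result as important.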

In the second case, similar to the study in Bohannon (2015), we might imagine that we do not have a specific theory that suggests that chocolate is healthy. Instead, we want to test how it might impact various health markers. In this case, after specifying the DGP, sample size, etc., we may conduct multiple tests, but we should adjust the significance level for each test to avoid a high FWER.
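To see why the per-test significance level needs adjusting, the following sketch computes the FWER for $m$ independent tests when every null hypothesis is true, with and without a Bonferroni adjustment. Bonferroni is only one of several control methods and is used here as an assumption; the value $m = 20$ is hypothetical, as are the function names.

```python
def fwer_independent(m, alpha_per_test):
    """P(at least one type I error) across m independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha_per_test) ** m

def bonferroni_level(m, alpha_family):
    """Per-test significance level that keeps the FWER at or below alpha_family."""
    return alpha_family / m

m, alpha = 20, 0.05
unadjusted = fwer_independent(m, alpha)
adjusted = fwer_independent(m, bonferroni_level(m, alpha))
print(f"FWER with each test at alpha = {alpha}: {unadjusted:.2f}")
print(f"FWER with Bonferroni level {bonferroni_level(m, alpha):.4f}: {adjusted:.3f}")
```

With twenty unadjusted tests, at least one false positive is more likely than not; the adjusted level brings the family-wise rate back under the nominal $\alpha$.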

Bohannon, J. (2015). I fooled millions into thinking chocolate helps weight loss. Here’s how. https://gizmodo.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800

Both of these cases set conditions for reasonably severe tests of various hypotheses and allow for an evidential interpretation of testing. Even better would be follow-up studies, to ensure that the results obtained are replicable and not in error. A series of replications all pointing to a similar result would constitute a severely tested, highly corroborated hypothesis. Let’s now turn to an example of how we might operationalize the notion of severity to quantify how severely a hypothesis passes a given test.

4.4.2 Severity in practice

We operationalize severity on a simple example, for pedagogical purposes. Consider the $z$-test example from our illustration of the large $n$ problem:

DGP: $X_1,...,X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, with $\mu$ unknown, $\sigma^2 = 1$.
Hypotheses: \[ \begin{aligned} H_0&: \mu = \mu_0, \\ H_1&: \mu > \mu_0. \end{aligned} \]
Test statistic and p-value: \[ \begin{aligned} Z &= \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{\sqrt{n}}{\sigma}\left(\bar{X} - \mu_0\right), \\ \\ p &= P(Z > z; \mu = \mu_0), \end{aligned} \] where $z$ is the value of $Z$ for the actual data.

Let the significance level be $\alpha = 0.05$. Denote this test as “test $T$”, let $C$ denote a hypothesis or claim of interest, and to get concrete, let $\mu_0 = 0$. Mayo and Spanos define the severity of test $T$ with respect to claim $C$ and with data $x_0$ as the probability that the data were a worse fit with $C$, if $C$ were false (Mayo, 2018; Mayo & Spanos, 2011b; Spanos, 2019). Severity is a function of three items:

a test $T$, which includes a specified statistical model, null and alternative hypotheses, a test statistic, and known error properties;

actual data, $x$, that arises from the DGP and from which we produce actual values of relevant statistics—e.g., $\bar{x}$, $z$; and

a hypothesis or claim $C$, which encodes a practically significant effect size, $\gamma >0$, e.g., $C: \mu > \mu_0 + \gamma$.

In the definition of severity, the “probability of a worse fit” is understood as a hypothetical: that the DGP could have produced data even further from the claim $C$ than the actual data. “If $C$ were false” means that we evaluate the probability of this hypothetical under the negation of $C$. We choose the value of $\mu$ that is the smallest perturbation to make $C$ false. If $C: \mu > \mu_0 + \gamma$, then the severity of $C$ is defined as \[ SEV(T, x, C) = P\left(\bar{X} < \bar{x}\, ;\, \mu = \mu_0 + \gamma\right). \] Alternatively, if $C': \mu < \mu_0 + \gamma'$, then \[ SEV(T, x, C') = P\left(\bar{X} > \bar{x}\, ;\, \mu = \mu_0 + \gamma'\right). \]
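For the $z$-test with known $\sigma$, both severity definitions reduce to normal tail probabilities, since $\bar{X} \sim N(\mu, \sigma^2/n)$. A minimal sketch follows, using only Python's standard library; the function names are introduced here for illustration, and the inputs in the usage lines are arbitrary example values.

```python
import math
from statistics import NormalDist

def severity_greater(xbar, n, mu_1, sigma=1.0):
    """SEV for the claim C: mu > mu_1, i.e. P(Xbar < xbar; mu = mu_1)."""
    return NormalDist(mu_1, sigma / math.sqrt(n)).cdf(xbar)

def severity_less(xbar, n, mu_1, sigma=1.0):
    """SEV for the claim C': mu < mu_1, i.e. P(Xbar > xbar; mu = mu_1)."""
    return 1.0 - severity_greater(xbar, n, mu_1, sigma)

# Example inputs: observed mean 0.25 from n = 50, with cutoff mu_1 = 0.1.
sev_c = severity_greater(xbar=0.25, n=50, mu_1=0.1)
sev_c_prime = severity_less(xbar=0.25, n=50, mu_1=0.1)
print(f"SEV(mu > 0.1) = {sev_c:.2f}")
print(f"SEV(mu < 0.1) = {sev_c_prime:.2f}")
```

Note that, for a shared cutoff, the two severities sum to one: strong evidence for "$\mu$ exceeds the cutoff" is weak evidence for "$\mu$ falls below it".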

High severity values suggest that the claim $C$ has been more severely tested, and thus is better evidenced, than claims with lower severity values. Let’s consider a few different data scenarios, to see how well severity accords with our intuitions about strength of evidence.

4.4.2.1 Severity of claims in $H_1$

Suppose that two different research groups have different thresholds for a practically significant effect size. Research group A believes that a small effect, $\gamma > 0.03$, is practically significant; research group B believes that only effects $\gamma > 0.1$ are practically significant. How severely tested might each of these effect sizes be with actual data? Since $H_0: \mu = 0$, these effect sizes can be written, respectively, as: \[ \begin{aligned} C_1&: \mu > 0.03 \\ C_1^*&: \mu > 0.1. \end{aligned} \] Imagine actual data, $x_1,...,x_{n=50}$, that, unbeknownst to researchers, were generated under the specific alternative hypothesis \[ H_1': ~\mu ~= ~0.15. \] Further, imagine that the sample mean of the data is $\bar{x} \approx 0.25$. With $n = 50$, this test yields a p-value of $p \approx 0.04$, which results in a rejection of $H_0$ at the $\alpha = 0.05$ level (this is a correct rejection, based on the way the data were generated). The fallacy of rejection warns us against claiming evidence for any specific claim in $H_1$. In this context, severity can be understood as providing an evidence metric for specific claims in $H_1$, such as, $C_1$ or $C_1^*$. Intuitively, with a sample mean of $\bar{x} = 0.25$, we have more evidence for claims with smaller effect sizes and less evidence as the effect size grows. Consequently, we should have more evidence, or a higher severity, for $C_1$ than $C_1^*$. Does the operationalization of severity reflect these facts?

The severity of $C_1$ is defined as the probability that the data would have been a worse fit if $C_1$ were false. A “worse fit” means obtaining an $\bar{X}$ further away from $C_1$; in this case, $\bar{X}< \bar{x}$. “If $C_1$ were false” means that we take the negation of $C_1$; in this case, $\neg C_1: \mu \le 0.03$. For the purposes of severity, we evaluate $\mu = \mu_{C_1}$, the value that would make $C_1$ just false, i.e., false in the most conservative way: $\mu_{C_1} = 0.03.$ Putting this all together, the severity of $C_1$, with respect to our test (now denoted $T_n$ to emphasize the sample size) and data, is: \[ \begin{aligned} SEV(T_{n = 50},x,C_1) = P\left(\bar{X} < \bar{x} \, ; \, \mu_{C_1} = 0.03 \right). \end{aligned} \] To evaluate this probability, recall that, as a random variable, $\bar{X}$ is normally distributed, centered at the population mean (in this case, $\mu_{C_1} = 0.03$), with variance $\sigma^2_{\bar{X}} = \sigma^2/n$: \[ \begin{aligned} \bar{X} \sim N\left(\mu_{C_1} = 0.03, \,\,\sigma^2/n = 1/50\right). \end{aligned} \] Thus, \[ \begin{aligned} SEV(T_{n = 50},x,C_1) &= P\left(\bar{X} < \bar{x} \, ; \, \mu_{\bar{X}} = \mu_{C_1}, \sigma^2_{\bar{X}} =\sigma^2/n \right) \\ &= P\left(\bar{X} < 0.25 \, ; \, \mu_{\bar{X}} = 0.03, \sigma^2_{\bar{X}} = 1/50 \right) \\ &\approx 0.94. \end{aligned} \] On the $(0,1)$ scale, this is a relatively high severity value, suggesting relatively strong evidence for $C_1$. That aligns with our knowledge of the DGP and intuition about evidence: since the data were generated under $\mu = 0.15$, and since $\bar{x}$ is above that, a reasonable evidence metric should assign a high “score” to the claim that the effect size is at least $\gamma = 0.03$.
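The value $0.94$ can be checked by standardizing $\bar{X}$. The following is a worked restatement of the calculation above, where $\Phi$ denotes the standard normal cumulative distribution function (a symbol introduced here for convenience): \[ \begin{aligned} SEV(T_{n = 50},x,C_1) &= \Phi\left(\frac{\bar{x} - \mu_{C_1}}{\sigma/\sqrt{n}}\right) \\ &= \Phi\left(\sqrt{50}\,(0.25 - 0.03)\right) \\ &\approx \Phi(1.56) \\ &\approx 0.94. \end{aligned} \]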

Now, compare the severity of $C_1$ with the severity of $C_1^*$ (for the same test and data): \[ \begin{aligned} SEV(T_{n = 50},x,C_1^*) &= P\left(\bar{X} < \bar{x} \, ; \,\mu_{\bar{X}} = \mu_{C_1^*}, \sigma^2_{\bar{X}} =\sigma^2/n \right) \\ &= P\left(\bar{X} < 0.25\, ; \, \mu_{\bar{X}} = 0.1, \sigma^2_{\bar{X}} = 1/50 \right) \\ & \approx 0.86. \end{aligned} \] This severity score is reasonably high, but our move to the right caused the severity to decrease. This shift makes sense: $C_1$ included all the values in $C_1^*$ and then some more, so the severity score for $C_1$ should be higher. For claims $C: \mu > \mu_0 + \gamma$, as $\gamma > 0$ increases, the severity decreases. See the gray curve in 1.6 for a plot of this relationship, for $n = 50$.⁴¹

⁴¹ We will analyze the two other curves, for higher sample sizes, in the next section.

In summary: a small p-value alerts us to some effect; but without a severity assessment, what effect we have evidence for is unclear. Severity provides a metric for how severely a specific claim $C$ has been tested, given a test and data. In this case, both research groups A and B have severely tested effect sizes. Supposing that both groups agree that a $0.8$ severity threshold is desirable, an inference to an effect size of at least $\gamma = 0.1$ (claim $C_1^*$) is warranted by the data.

The severity as a function of deviations from $\mu_0 = 0$ (denoted $\gamma > 0$), for sample sizes $n_1 = 10$, $n_2 = 50$, and $n_3 = 1,000$. $\bar{x} = 0.25$. The data were generated under $H_1': \mu = 0.15$.

4.4.2.2 Severity and sample size

Recall that, all else equal, the p-value trends downward as a function of the sample size. For the $z$ test above, with the same sample mean, the p-value decays from $p \approx 0.2$ for a sample size of $n = 10$, to $p \approx 0.04$ for $n = 50$, to $p \approx 0$ for $n = 1,000$. In the absence of severity assessments, this “large $n$ problem” creates challenges in assessing practical significance. A small p-value may be alerting us to a practically insignificant deviation from $H_0$. For example, data generated under $\mu = 0.01$ (what we might deem a practically insignificant deviation from $H_0: \mu = 0$) could produce a sample mean of $\bar{x} \approx 0.01$, which, for a sample size of $n = 50,000$, would be statistically significant, and warrant a rejection of $H_0$. But an inference to a practically significant effect, such as in $C_1$ or $C_1^*$, would not be warranted. Severity captures this desideratum; for this scenario, $SEV(T_{50,000},x,C_1) \approx 10^{-5}$, which would not count as severe under any reasonable severity threshold.
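The figures in this paragraph can be reproduced with a few lines of Python. This is a sketch for the $z$-test described earlier ($\sigma = 1$, $\mu_0 = 0$); the helper name `p_value` is introduced here for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value(xbar, n, mu_0=0.0, sigma=1.0):
    """One-sided p-value P(Z > z; mu = mu_0) for the z-test."""
    z = math.sqrt(n) * (xbar - mu_0) / sigma
    return 1.0 - STD_NORMAL.cdf(z)

# Same sample mean, growing sample size: the p-value shrinks.
for n in (10, 50, 1000):
    print(f"n = {n:>5}: p = {p_value(0.25, n):.3g}")

# Large-n scenario from the text: xbar = 0.01 with n = 50,000.
n_big, xbar_big = 50_000, 0.01
p_big = p_value(xbar_big, n_big)
# Severity of C_1: mu > 0.03 is P(Xbar < xbar; mu = 0.03).
sev_c1 = NormalDist(0.03, 1.0 / math.sqrt(n_big)).cdf(xbar_big)
print(f"n = {n_big}: p = {p_big:.3f}, SEV(C_1) = {sev_c1:.1e}")
```

The last line shows the two quantities pulling apart: the result is statistically significant, yet the claim of a practically significant effect has essentially zero severity.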

For a more comprehensive understanding of how sample size impacts severity, consider again 1.6, which shows the severity of claims of the form $C: \mu > \mu_0 + \gamma$, with test $T_n$ and $\bar{x} = 0.25$. For a fixed small effect size (i.e., an effect size less than $\bar{x}$), say $\gamma = 0.125$, the severity increases from $\approx 0.65$ to $\approx 0.81$ to $\approx 1$, as the sample size increases from $n = 10$ to $n = 50$ to $n = 1,000$. This increase occurs because larger sample sizes have a higher capacity to uncover smaller effect sizes. Conversely, for a larger effect size, say $\gamma = 0.5$, we see severity drop from $\approx 0.21$ to $\approx 0.04$ to $\approx 0$ as the sample size increases from $n = 10$ to $n = 50$ to $n = 1,000$. When the claimed effect size exceeds the observed sample mean, larger samples provide less evidence for it.
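The pattern described here can be tabulated directly. The sketch below evaluates the severity of $C: \mu > \mu_0 + \gamma$ over a small grid of effect sizes and sample sizes, for the same $z$-test and $\bar{x} = 0.25$; the function name and the particular grid of $\gamma$ values are choices made for illustration.

```python
import math
from statistics import NormalDist

def severity_of_effect(xbar, n, gamma, mu_0=0.0, sigma=1.0):
    """SEV of C: mu > mu_0 + gamma, i.e. P(Xbar < xbar; mu = mu_0 + gamma)."""
    return NormalDist(mu_0 + gamma, sigma / math.sqrt(n)).cdf(xbar)

xbar = 0.25
sample_sizes = (10, 50, 1000)
gammas = (0.03, 0.1, 0.125, 0.25, 0.5)

print("gamma   " + "  ".join(f"n={n:<6}" for n in sample_sizes))
for gamma in gammas:
    row = "  ".join(f"{severity_of_effect(xbar, n, gamma):8.2f}" for n in sample_sizes)
    print(f"{gamma:<7} {row}")
```

At $\gamma = \bar{x}$ the severity is $0.5$ for every sample size; below that point larger samples raise severity, and above it they lower severity.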

4.4.2.3 Severity of $H_0$

Severity addresses the fallacy of rejection by allowing researchers to assign evidence scores for specific alternative hypotheses. Does severity also address the fallacy of acceptance and the problem of corroboration? Can we ever consider $H_0$ as having been corroborated, and to what degree? In testing the severity of $H_0$ in this context, a “worse fit” means that we imagine obtaining data with values above the $\bar{x} \approx 0.25$ actually obtained, evaluated under $\neg H_0: \mu > 0$. We choose a conservative value of $\mu$ at which to evaluate the severity probability, say, $\mu_{H_0} = 0.001$, which makes $H_0$ just false. With the same data and test as above, the severity of $H_0$ is:⁴² \[ \begin{aligned} SEV(T_{n=50},x,H_0) &= P\left(\bar{X} > \bar{x} \, ; \, \mu_{\bar{X}} = \mu_{H_0}, \sigma^2_{\bar{X}} = \sigma^2/n \right) \\ &= P\left(\bar{X} > 0.25 \, ; \, \mu_{\bar{X}} = 0.001, \sigma^2_{\bar{X}} = 1/50 \right) \\ &\approx 0.04. \end{aligned} \] $H_0$ has not been severely tested (nor corroborated) because our data suggest a mean far from $H_0$. And that makes sense: we would not want to corroborate $H_0$, given that the data were generated under $H'_1: \mu = 0.15$.

⁴² Note that evaluating this severity under $\mu = 0.001$ is effectively the same as evaluating it under $\mu = 0$, and thus the severity is approximately equal to the p-value.

How severely tested are claims of the form $C_0: \mu < \mu_0 + \gamma_{\text{min}}$, where $\gamma_{\text{min}}$ is the threshold below which effect sizes are deemed practically insignificant? An answer to this question is useful in practice. A high severity for $C_0: \mu < \mu_0 + \gamma_{\text{min}}$ would suggest that we have severely tested a claim that includes the null hypothesis plus all practically insignificant effect sizes. For $\gamma_{\text{min}} = 0.03$ (the smaller of the effect sizes from earlier), we evaluate at $\mu_{C_0} = \mu_0 + \gamma_{\text{min}} = 0.03$, and the severity of $C_0$ is: \[ \begin{aligned} SEV(T_{n=50},x,C_0) &= P\left(\bar{X} > \bar{x} \, ; \, \mu_{\bar{X}} = \mu_{C_0}, \sigma^2_{\bar{X}} = \sigma^2/n \right) \\ &= P\left(\bar{X} > 0.25 \, ; \, \mu_{\bar{X}} = 0.03, \sigma^2_{\bar{X}} = 1/50 \right) \\ &\approx 0.06. \end{aligned} \] As shown in 1.7, as $\gamma_{\text{min}}$ grows, the severity also grows, because the claim is inclusive of more values of $\mu$.
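The growth of severity with $\gamma_{\text{min}}$ can be sketched numerically. The snippet below evaluates $P(\bar{X} > 0.25;\ \mu = \mu_0 + \gamma_{\text{min}})$ for the same test with $n = 50$; the function name and the grid of $\gamma_{\text{min}}$ values are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def severity_below(xbar, n, gamma_min, mu_0=0.0, sigma=1.0):
    """SEV of C_0: mu < mu_0 + gamma_min, i.e. P(Xbar > xbar; mu = mu_0 + gamma_min)."""
    return 1.0 - NormalDist(mu_0 + gamma_min, sigma / math.sqrt(n)).cdf(xbar)

xbar, n = 0.25, 50
grid = (0.001, 0.03, 0.1, 0.25, 0.5)
severities = [severity_below(xbar, n, g) for g in grid]
for g, sev in zip(grid, severities):
    print(f"gamma_min = {g:<5}: SEV(C_0) = {sev:.2f}")
```

For $\gamma_{\text{min}} = 0.03$ this gives roughly $0.06$, the complement of the $0.94$ severity found for $C_1: \mu > 0.03$ with the same data.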

The severity as a function of deviations from $C_0: \mu < \mu_0 + \gamma$, for $\gamma > 0$, sample sizes $n_1 = 10$, $n_2 = 50$, and $n_3 = 1,000$. $\bar{x} = 0.25$. The data were generated under $H_1': \mu = 0.15$.

Finally, what about a case in which the data are generated under $H_0$? Intuitively, if the sample mean is closer to $H_0: \mu = 0$, we should see higher severity values for $H_0$. With $n = 50$ and $\bar{x} = 0.17$, we find a p-value, and a severity for $H_0$, of approximately $0.12$. In this case, we would fail to reject $H_0$, and have some, albeit not strong, corroboration. Larger sample sizes would corroborate it less, smaller sample sizes more. A more robust approach would be to consider, again, $C_0: \mu < \mu_0 + \gamma_{\text{min}}$, as in 1.8. Here, we see that as $\gamma_{\text{min}}$ increases, severity increases, suggesting that we have more evidence for claims that, in addition to $H_0$, include more values from $H_1$.

The severity as a function of deviations from $C_0: \mu < \mu_0 + \gamma$, for $\gamma > 0$, sample sizes $n_1 = 10$, $n_2 = 50$, and $n_3 = 1,000$. $\bar{x} = 0.17$. The data were generated under $H_0$.

4.4.2.4 Analysis of severity and the error statistical philosophy

The error statistical philosophy is relatively obscure among practicing statisticians. Severity testing is rarely taught in statistics programs, and rarely used by applied statisticians. Haig (2017) and Colling & Szűcs (2021) advocate for the wider adoption of the error statistical philosophy as a way to rectify inferential issues that arise out of the replication crisis. The former claims that the error statistical philosophy has a number of strengths that may be useful in the field of psychology, including that it provides a unified philosophical framework for using frequentist statistical methods; clarifies common misunderstandings and misconceptions about frequentist tools; provides a general, philosophical approach to scientific reasoning; and provides a specific alternative to the Bayesian approach, which is dominant in epistemology and the philosophy of science.

Haig, B. D. (2017). Tests of statistical significance made sound. Educational and Psychological Measurement, 77(3), 489–506.

Colling, L. J., & Szűcs, D. (2021). Statistical inference and the replication crisis. Review of Philosophy and Psychology, 12(1), 121–147.

Clayton, A. (2020). How eugenics shaped statistics. In Nautilus. Nautilus. https://nautil.us/how-eugenics-shaped-statistics-238014/

Yet, others object to the philosophical principles and practical applications of severity testing. Let’s consider some of those objections. First, as discussed in [sec:objectivity_in_sci], some take the error statistical notion of objectivity to be misleading or overly naïve. Researchers come to their work with personal biases, misaligned incentives, and cultural and sociological horizons that limit their ability to be objective. For example, Clayton (2020) claims that statisticians and researchers—including prominent figures like Galton, Pearson, and Fisher—hide behind so-called objective methods that are not, in fact, objective. Feminist philosophers have critiqued the notion of objectivity in science, claiming that the exclusion of women and marginalized groups in scientific inquiry has led science away from a comprehensive and accurate understanding of reality. How objectivity ought to fit into scientific practice may be less straightforward than Mayo and Spanos let on.

There is much to say about the relationship between objectivity and science—much more than we can reasonably say in this context. My view is that these objections are not direct charges against objectivity as conceptualized in the error statistical philosophy. The facts that individuals have biases, that cultures perpetuate biases, and that objectivity is difficult to achieve in practice, do not imply that we should not aim for objectivity to the greatest extent possible. Objectivity is an ideal—not necessarily something perfectly achievable in practice, but something worth aiming for. After all, what would it look like to not aim for objectivity in science? In any scientific context, why would we not aim to (at least attempt to) uncover the facts of the matter? In such cases, we would probably end up looking a lot like Helen, attempting to confirm a hypothesis that aligns with biases and desires, and not with the facts of the matter.

The feminist critique—that exclusion has led to an inaccurate understanding of reality—is not a critique of objectivity per se, but an incisive acknowledgement of situations in which we have fallen short of achieving it. The exclusion of women and marginalized groups in science is often wrong because it distorts our understanding of the world and the people in it; as such, exclusion is an indication that we have failed to meet the ideal of objectivity, not that we ought to stop aiming for it. If the exclusion of women in heart attack studies has concealed female cardiac arrest symptoms (and appropriate responses to them), then the right response is to include women in order to get a more accurate picture. That is, the response is a call for more objectivity, not less.

Beyond objectivity, the error statistical philosophy has seen objections against the philosophical notion and mathematical operationalization of severity. Philosophically, adherents of Bayesian inference rightly point out that severe testing is not a form of probabilism, and thus, is not a coherent approach to quantifying uncertainty, ostensibly established through the Dutch Book Argument.⁴³ It is true that the error statistical philosophy’s approach to severe testing is not coherent in this way, because it does not assign probabilities to hypotheses. There are two responses to this objection. First, as discussed earlier in this chapter, in Chapter 2, and in Chapter 3, there are philosophical reasons to reject probabilism; such discussions are, in a sense, prior to severe testing. Second, and relatedly, in Gelman et al. (2020), psychologist Brian Haig writes that the error statistical philosophy seeks to pay special attention to the specific context of an inference, rather than the development of a universal logic of inference (which, as we saw in [sec:logic], may not be possible or desirable). According to Haig, severe testers seek to develop the evidence-hypothesis relation

⁴³ In Chapter 3, we studied subjective or Bayesian interpretations of probability; we will study the Bayesian approach to statistical inference in Chapter 5, which takes probabilism as the starting point for all inferences.

in terms of the substantive and specific nature of the hypothesis and the evidence with regards to their origin, modeling, and analysis. This is a consequence of a strong commitment to a contextual approach to testing using the most appropriate frequentist methods available for the task at hand.

High-level, theoretical consistency—what probabilism aims for—is not the only desideratum, and may obscure or relegate a more laudable goal: objectivity (potentially) achieved through thoroughly probing specific features of the world in order to learn the facts of the matter.

Finally, many, including statistician Art Owen in Gelman et al. (2020), argue that severity analyses can be replaced by an analysis of confidence intervals. Owen does not “see any advantage to severity...over the confidence interval, if one is looking for evidence that $\mu < \mu_1$”, where $\mu_1$ is some value of $\mu$ in $H_1:\mu > \mu_0$. To better understand this argument, and to understand another major pillar of frequentist inference, let’s now turn to an analysis of confidence intervals—and estimation theory more broadly.

Gelman, A., Haig, B., Hennig, C., Owen, A. B., Cousins, R., Young, S., Robert, C., Yanofsky, C., Wagenmakers, E. J., Kenett, R., & Lakeland, D. (2020). Many perspectives on Deborah Mayo’s “Statistical inference as severe testing: How to get beyond the statistics wars”. https://sites.stat.columbia.edu/gelman/research/unpublished/mayo_reviews_2.pdf.

4.5 Frequentist estimation

Estimation techniques produce a single number—in the case of point estimation—or a range of numbers—in the case of interval estimation—that in some way targets a true parameter in question. Frequentist estimation techniques—e.g., maximum likelihood estimation, confidence intervals—are constructed using principles consistent with the frequentist interpretation of probability and in the context of a frequentist statistical model (defined in 1.1).

4.5.1 Point estimation

A point estimator, $\widehat\theta(\mathbf{X})$, is a statistic—that is, a function of a random sample $\mathbf{X}$—meant to target the true parameter, $\theta$. When the data context is clear, it is common to drop the function notation and denote the estimator as $\widehat\theta$. As functions of random variables, estimators are themselves random variables. That is, random changes in $\mathbf{X}$ produce random changes in $\widehat\theta(\mathbf{X})$. This fact allows us to analyze various properties of estimators. For example, over many samples from the DGP, what is the average value of $\widehat \theta$? How much variability do we see in $\widehat \theta$? Given actual data from the DGP, our estimator $\widehat\theta(\mathbf{X})$ produces an estimate, $\widehat{\theta}(\mathbf{x})$. The estimate is a real number, meant to target $\theta$.

There are many ways to derive estimators. Some are clearly bad, e.g., $\widehat\theta = 0$, independent of the estimation context; and some are, at least intuitively, much more reasonable, e.g., a sample mean or sample proportion as estimators of population means and population proportions, respectively. The most prominent frequentist estimation technique is the maximum likelihood estimator (MLE). Informally, the MLE can be understood as the value of the parameter that makes the observed data most likely. The MLE is easy to understand when $n=1$. Consider an attempt to estimate $\mu$ from the DGP $X_1 \sim N(\mu, 1)$, where the actual datum observed is $x_1 = 3$. What value of $\mu$ makes this datum most likely? Visually, we might imagine many normal curves, each with a different value of $\mu$. Which would maximize the probability density over $x_1 = 3$? $\widehat\mu = 3$, of course! When $n>1$, the idea is the same: move the joint density of the data to the location where the data have the highest joint probability density (or probability mass, in the case of discrete data).
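The $n = 1$ thought experiment can be mimicked on a computer: slide many normal curves along the axis and record which one is tallest at the observed datum. The following is a minimal sketch; the grid of candidate means (0 to 6 in steps of 0.01) is an arbitrary choice of ours.

```python
from statistics import NormalDist

x1 = 3.0  # the single observed datum from the example

# Candidate values of mu: 0.00, 0.01, ..., 6.00 (a hypothetical grid).
candidate_mus = [i / 100 for i in range(0, 601)]

# Height of each N(mu, 1) density at the observed datum.
density_at_x1 = [NormalDist(mu, 1.0).pdf(x1) for mu in candidate_mus]

# The candidate whose curve is tallest at x1 is the (grid) maximum likelihood estimate.
mu_hat = candidate_mus[density_at_x1.index(max(density_at_x1))]
print(mu_hat)
```

The curve centered at the datum itself gives the highest density, matching the intuition in the text.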

Formally, the MLE is defined as the value of the parameter that maximizes the likelihood function. The likelihood function is the joint probability density function (pdf) of the data, interpreted as a function of the parameters, where all data values are fixed.⁴⁴ By considering the pdf in this way, we can imagine different hypothetical scenarios giving rise to the actual data—one scenario for each parameter value. The parameter value corresponding to the hypothetical scenario that renders the data most likely is the MLE.

⁴⁴ Technically, a likelihood function is defined as any function proportional to the joint pdf, interpreted as a function of the parameters. That means that we can drop factors that are constant with respect to the parameter, and still have a valid likelihood function.

Consider again the BK example from 1.1: a random sample of size $n = 15$ arising from DGP$_H$ yields: \[ \mathbf{x} = (0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0). \] DGP$_H$ is an independent and identically distributed Bernoulli process with probability of success $p$. The marginal pdf of any given observation $X_i$ is given by: \[ \begin{aligned} f({x}_i\, ; \, p) = P(X_i = x_i\, ; \, p) = p^{x_i}(1-p)^{1-x_i}, \,\,\, \end{aligned} \] where $x_i \in \{0,1\}, \,\, p \in (0,1)$. Consequently, by independence, the joint pdf is the product of the marginals: \[ \begin{aligned} f(\mathbf{x}\, ; \, p) &= p^{\sum^n_{i=1} x_i}(1-p)^{n-\sum^n_{i=1} x_i}. \end{aligned} \] The joint pdf is an $n$-dimensional function of $\mathbf{x}$. As with any pdf, it sums (or, for continuous data, integrates) to one over $\mathbf{x}$. The likelihood function $L(p\, ; \, \mathbf{x}) = f(\mathbf{x}\, ; \, p)$ is a one-dimensional function of $p$. It does not (necessarily) integrate to one over $p$.
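To make the distinction between the joint pdf and the likelihood concrete, here is a small sketch using the BK data. It checks that the joint probabilities sum to one over all possible data sets (for a fixed $p$), while the likelihood, viewed as a function of $p$ with the data fixed, does not integrate to one. The midpoint-rule integration and the choice $p = 0.1$ for the first check are our own illustrative choices.

```python
from math import comb

# Observed BK sample from the text.
x = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
n, s = len(x), sum(x)


def likelihood(p):
    """Joint Bernoulli probability of the observed data, viewed as a function of p."""
    return p**s * (1 - p) ** (n - s)


# (1) For fixed p, the joint pdf sums to one over all 2^n data sets.
#     There are comb(n, y) sequences with y ones, each with probability p^y (1-p)^(n-y).
p_fixed = 0.1
total_over_data = sum(comb(n, y) * p_fixed**y * (1 - p_fixed) ** (n - y) for y in range(n + 1))

# (2) For fixed data, integrate the likelihood over p in (0, 1) with a midpoint rule.
m = 100_000
area_over_p = sum(likelihood((k + 0.5) / m) for k in range(m)) / m

print(total_over_data)                    # one, up to rounding error
print(area_over_p)                        # far below one
print(likelihood(0.1), likelihood(0.9))   # p = 0.1 makes the data far more likely
```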

Clearly, some values of $p$ render $\mathbf{x}$ more likely than others. $p = 0.9$ renders the data much less likely than $p = 0.1$. The goal of maximum likelihood estimation is to find the value of $p$ that makes the data most likely. Intuitively, \[ \begin{aligned} \widehat p = \frac{\# \text{ of ones}}{n} = \frac{\sum^n_{i=1}x_i}{n} = \bar{x} = 2/15, \end{aligned} \] should make $\mathbf x$ most likely. Does the MLE achieve this intuitive answer? Let’s maximize the likelihood function. As a maximization problem, the MLE has been reduced to mathematical optimization—simply maximize the likelihood function over the parameters of interest: \[ \begin{aligned} \widehat{p}_{ML} = \underset{p \in P}{\arg\max}\, L(p\, ; \, \mathbf{x}), \end{aligned} \] where $P = (0,1)$. In practice, we almost always maximize the (natural) log of the likelihood function, denoted $\ell(p \,; \, \mathbf{x})$, or $\ell(p)$ when the data context is clear. Because the logarithm is strictly increasing, the maximizers of the likelihood and log-likelihood functions—i.e., the values that maximize each—are identical. Further, the log-likelihood function is typically easier to maximize. So, let’s find the log-likelihood for our example, and then maximize! In this case, maximization is a calculus exercise: first, take the derivative of the log-likelihood function with respect to the parameter; second, set that expression equal to zero; and third, solve for the parameter.⁴⁵ The result is the MLE.

⁴⁵ We should also check that the result is a maximizer and not a minimizer. The log-likelihood function in this case is concave, and thus, we know we have obtained a maximizer.

Log-likelihood function. By properties of logarithms: \[ \begin{aligned} \ell(p) &= \log\left(p^{\sum^n_{i=1} x_i}(1-p)^{n-\sum^n_{i=1} x_i}\right) \\ &= \log\left(p^{\sum^n_{i=1} x_i} \right) + \log\left((1-p)^{n-\sum^n_{i=1} x_i}\right)\\ &= \sum^n_{i=1} x_i\log\left(p \right) + \left(n-\sum^n_{i=1} x_i\right)\log\left(1-p\right). \end{aligned} \]

Derivative of $\ell(p)$. \[ \begin{aligned} \frac{d \ell(p)}{dp} = \frac{\sum^n_{i=1} x_i}{p} - \frac{n-\sum^n_{i=1} x_i}{1-p} \end{aligned} \]

Set to zero and solve. \[ \begin{aligned} \frac{d \ell(p)}{dp} &= \frac{\sum^n_{i=1} x_i}{p} - \frac{n-\sum^n_{i=1} x_i}{1-p} \overset{\text{ \bf set}}{=} 0 \\ &\implies (1-p)\sum^n_{i=1} x_i - p\left(n-\sum^n_{i=1} x_i\right)\overset{\text{ \bf set}}{=} 0 \\ &\implies \sum^n_{i=1} x_i -p\sum^n_{i=1} x_i - np+p\sum^n_{i=1} x_i\overset{\text{ \bf set}}{=} 0 \\ &\implies \sum^n_{i=1} x_i - np\overset{\text{ \bf set}}{=} 0 \\ &\implies p = \frac{1}{n}\sum^n_{i=1} x_i \\ \end{aligned} \] Thus, the MLE for $p$ is the intuitive result, the sample mean: $\widehat p_{ML}(\mathbf{x}) = \bar{x}$, as an estimate, or $\widehat p_{ML}(\mathbf{X}) = \bar{X}$, as an estimator.
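The calculus result can be double-checked by brute force. The sketch below evaluates the Bernoulli log-likelihood for the BK data on a fine grid of $p$ values and compares the grid maximizer with the closed-form answer $\bar{x} = 2/15$. The grid resolution (steps of 0.0001) is an arbitrary choice.

```python
from math import log

x = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
n, s = len(x), sum(x)


def log_likelihood(p):
    """Bernoulli log-likelihood: s*log(p) + (n - s)*log(1 - p)."""
    return s * log(p) + (n - s) * log(1 - p)


def score(p):
    """Derivative of the log-likelihood with respect to p."""
    return s / p - (n - s) / (1 - p)


# Grid search over the open interval (0, 1).
grid = [k / 10_000 for k in range(1, 10_000)]
p_hat_grid = max(grid, key=log_likelihood)

# Closed-form MLE from the derivation: the sample mean.
p_hat_closed = s / n

print(p_hat_grid, p_hat_closed)
print(score(p_hat_closed))  # essentially zero at the maximizer
```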

Let’s look at another simple example. Consider again $X_1,...,X_n \overset{iid}{\sim} N(\mu, \sigma^2)$. The marginal pdf of any individual $X_i$ is given by: \[ \begin{aligned} f_{X_i}(x_i\, ; \, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}, \,\,\,\, -\infty < \mu < \infty, \,\,\, \sigma > 0. \end{aligned} \] Since the data are independent and identically distributed, the joint pdf of $\mathbf{X} = (X_1,...,X_n)^T$ is simply the product of the marginal pdfs: \[ \begin{aligned} f_{\mathbf{X}}(\mathbf{x}\, ; \, \mu, \sigma^2) = \left(2\pi\sigma^2\right)^{-n/2}e^{-\frac{\sum^n_{i=1}(x_i-\mu)^2}{2\sigma^2}}, \,\,\,\, -\infty < \mu < \infty, \,\,\, \sigma > 0. \end{aligned} \] This joint pdf, $f_{\mathbf{X}}(\mathbf{x}\, ; \, \mu, \sigma^2)$, is an $n$-dimensional positive real function—i.e., it takes in $n$ values, and produces a positive real number. Parameters $\mu$ and $\sigma^2$ are constants. Since this function is a pdf, the $n$-dimensional integral over $\mathbf{x}$ is one. The likelihood function is \[ L(\mu, \sigma^2 \, ; \, \mathbf{x}) = f_{\mathbf{X}}(\mathbf{x}\, ; \, \mu, \sigma^2), \] the joint pdf interpreted as a two-dimensional function of $\mu$ and $\sigma^2$, with $\mathbf{x}$ fixed as numbers (actual data). $L(\mu, \sigma^2 \, ; \, \mathbf{x})$ does not necessarily integrate to one over $\mu, \sigma^2$.⁴⁶

⁴⁶ In the case where $\sigma^2$ is known, $\mu$ is unknown, and $n = 1$, the likelihood as a function of $\mu$ alone does integrate to one; but this is an accident related to the form of the normal pdf (which is symmetric in $x$ and $\mu$), and it does not hold in general for $n > 1$.

As with the previous example, we must now maximize the likelihood function over the parameters of interest. Let $\boldsymbol\theta = (\mu, \sigma^2)^T$ and let $\widehat{\boldsymbol\theta}_{ML} = (\widehat{\mu}_{ML}, \widehat{\sigma}_{ML}^2)^T$. Then: \[ \begin{aligned} \widehat{\boldsymbol\theta}_{ML} = \underset{\boldsymbol\theta \in \Theta}{\arg\max}\, L(\boldsymbol\theta\, ; \, \mathbf{x}), \end{aligned} \] where $\Theta$ is the parameter space (in this case, all real numbers for $\mu$ and positive real numbers for $\sigma^2$). Let’s maximize the log-likelihood.

Log-likelihood function. By properties of logarithms: \[ \begin{aligned} \ell(\boldsymbol\theta\, ; \, \mathbf{x}) &= \log\left(\left(2\pi\sigma^2\right)^{-n/2}e^{-\frac{\sum^n_{i=1}(x_i-\mu)^2}{2\sigma^2}}\right) \\ &= \log\left(\left(2\pi\sigma^2\right)^{-n/2}\right) + \log\left(e^{-\frac{\sum^n_{i=1}(x_i-\mu)^2}{2\sigma^2}}\right) \\ &= -\frac{n}{2}\log\left(2\pi\sigma^2\right) -\frac{\sum^n_{i=1}(x_i-\mu)^2}{2\sigma^2} \end{aligned} \]

Derivatives. \[ \begin{aligned} \frac{\partial \ell(\boldsymbol\theta\, ; \, \mathbf{x})}{\partial\mu} = \frac{\sum^n_{i=1}(x_i-\mu)}{\sigma^2} \end{aligned} \] \[ \begin{aligned} \frac{\partial \ell(\boldsymbol\theta\, ; \, \mathbf{x})}{\partial\sigma^2} = -\frac{n}{2\sigma^2} +\frac{\sum^n_{i=1}(x_i-\mu)^2}{2\sigma^4} \end{aligned} \]

Set to zero and solve. \[ \begin{aligned} \frac{\partial \ell(\boldsymbol\theta\, ; \, \mathbf{x})}{\partial\mu} &=\frac{\sum^n_{i=1}(x_i-\mu)}{\sigma^2} \overset{\text{{\bf set}}}{=} 0 \\ &\implies \sum^n_{i=1}(x_i-\mu)= 0 \\ &\implies \sum^n_{i=1}x_i- n\mu =0 \\ &\implies \mu = \frac{1}{n}\sum^n_{i=1}x_i \end{aligned} \] Thus, the MLE for $\mu$ is the sample mean: $\widehat\mu_{ML}(\mathbf{x}) = \bar{x}$, as an estimate, or $\widehat\mu_{ML}(\mathbf{X}) = \bar{X}$, as an estimator. All that work justifies a pretty intuitive result: the sample mean is a good estimator of the population mean. To find the MLE for $\sigma^2$, we set the partial derivative of the log-likelihood with respect to $\sigma^2$ equal to zero and solve for $\sigma^2$, using $\widehat\mu_{ML}$ in place of $\mu$: \[ \begin{aligned} \frac{\partial \ell(\boldsymbol\theta\, ; \, \mathbf{x})}{\partial\sigma^2} &= -\frac{n}{2\sigma^2} +\frac{\sum^n_{i=1}(x_i-\mu)^2}{2\sigma^4} \overset{\text{{\bf set}}}{=} 0 \\ &\implies -n\sigma^2 +\sum^n_{i=1}(x_i-\mu)^2 = 0\\ &\implies \sigma^2 =\frac{\sum^n_{i=1}(x_i-\mu)^2}{n} \\ &\overset{\widehat\mu_{ML} = \bar{x}}{\implies} \sigma^2 =\frac{\sum^n_{i=1}(x_i-\bar{x})^2}{n}. \end{aligned} \] Thus, the MLE for $\sigma^2$ is the (non-bias corrected) sample variance: $\widehat\sigma^2_{ML}(\mathbf{x}) =\frac{\sum^n_{i=1}(x_i-\bar{x})^2}{n}$ (estimate) or $\widehat\sigma^2_{ML}(\mathbf{X}) =\frac{\sum^n_{i=1}(X_i-\bar{X})^2}{n}$ (estimator). Again, a very intuitive answer!
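The two closed-form results can be sanity-checked on simulated data. In the sketch below, the true values ($\mu = 2$, $\sigma = 1.5$), the sample size, and the random seed are all hypothetical choices made for illustration. We compute the closed-form MLEs and confirm that nudging either parameter away from them lowers the log-likelihood.

```python
import random
from math import log, pi
from statistics import fmean, pvariance

random.seed(0)
# Hypothetical DGP for illustration: N(mu = 2, sigma^2 = 1.5^2), n = 200.
data = [random.gauss(2.0, 1.5) for _ in range(200)]
n = len(data)

# Closed-form MLEs from the derivation above.
mu_hat = fmean(data)
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in data) / n  # divisor n, not n - 1


def log_likelihood(mu, sigma2):
    """Normal log-likelihood of the data at (mu, sigma^2)."""
    return -0.5 * n * log(2 * pi * sigma2) - sum((xi - mu) ** 2 for xi in data) / (2 * sigma2)


best = log_likelihood(mu_hat, sigma2_hat)

# Evaluate the log-likelihood at nearby parameter values; none should beat the MLE.
nearby = [
    log_likelihood(mu_hat + dm, sigma2_hat + ds)
    for dm in (-0.05, 0.0, 0.05)
    for ds in (-0.05, 0.0, 0.05)
]
print(mu_hat, sigma2_hat)
print(all(best >= value for value in nearby))
```

Note that `statistics.pvariance` uses the divisor $n$, so it returns the same number as $\widehat\sigma^2_{ML}$, whereas `statistics.variance` uses $n - 1$.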

4.5.1.1 Properties of point estimators

Nothing described thus far makes the MLE an essentially frequentist statistical tool. As we will see in Chapter 5, the likelihood function also plays a central role in Bayesian statistical methods (albeit, a different role than in frequentist methods). But there are ways of analyzing estimators—the MLE included—that make use of frequentist concepts. How would this estimator perform on repeated samples from this population? Or, equivalently, how would this estimator perform on all of the possible data sets that could arise from the given statistical model/DGP? Imagine that we were to observe many realizations of $\mathbf{X}$, and compute $\bar{X}$ for each one. Frequentists ask: What properties would the random variable $\bar{X}$ have? Here’s one nice property: on average $\bar{X}$ will “get it right”, in the sense that, on average it will pinpoint the true mean of the normal distribution, $\mu$. That’s true because the expected value—that is, mean—of $\bar{X}$ is $\mu$: \[ \begin{aligned} E\left(\bar{X}\right) &= E\left(\frac{1}{n}\sum^n_{i=1}X_i\right) \\ &= \frac{1}{n}E\left(\sum^n_{i=1}X_i\right) = \frac{1}{n}\sum^n_{i=1}E\left(X_i\right) \\ &= \frac{1}{n}\sum^n_{i=1}\mu = \frac{1}{n}n\mu = \mu. \\ \end{aligned} \] The equalities in the second line hold by the linear properties of expected value. The third line holds because, by our statistical modeling assumptions, the expected value of $X_i$ is $\mu$ (and by algebraic simplification). For our purposes, the algebra is less important than the concept: when an estimator’s expected value is equal to the parameter it is attempting to estimate, we call that estimator unbiased. In general, we can measure the bias of an estimator with the equation \[ \begin{aligned} \text{Bias}(\widehat\theta) = E\left(\widehat{\theta}\right) - \theta. \end{aligned} \] The MLE is not always unbiased (we will say more about this below). But it happens to be unbiased in this case, where $\bar{X}$ is estimating $\mu$.
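The "repeated samples" idea is easy to act out in simulation. The following sketch draws many samples from a normal DGP, computes $\bar{X}$ for each, and compares the average of those sample means with the true $\mu$. The particular values ($\mu = 0.5$, $\sigma = 1$, $n = 25$, 20,000 replications) and the seed are assumptions made for illustration only.

```python
import random
from statistics import fmean

random.seed(1)
# Hypothetical settings for the simulation.
mu, sigma, n, reps = 0.5, 1.0, 25, 20_000

# One sample mean per simulated data set.
sample_means = [
    fmean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
]

# Monte Carlo estimate of Bias(Xbar) = E(Xbar) - mu.
estimated_bias = fmean(sample_means) - mu
print(f"average of sample means: {fmean(sample_means):.4f}")
print(f"estimated bias: {estimated_bias:.4f}")
```

Any individual sample mean misses $\mu$, but the misses average out, which is what unbiasedness asserts.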

Intuitively, a small or non-existent bias is a good thing: on average, the estimator is getting close to or hitting the mark. But what happens on average is not the whole story. The estimator $\widehat\mu = \frac{1}{2}\left(X_1 + X_2\right)$ is unbiased.⁴⁷ But for any sample size $n > 2$, this estimator ignores data, and thus, information. So, in addition to bias, what other properties might be desirable?

⁴⁷ $E(\widehat\mu) = E(\frac{1}{2}\left(X_1 + X_2\right)) = \frac{1}{2}(E\left(X_1\right) + E\left(X_2\right)) = \frac{1}{2}2\mu = \mu$.

⁴⁸ Note that the standard deviation of the MLE, and of any estimator, is often called the standard error of the estimator. We used this language in 1.2.3.2.

The variance of an estimator is also important. Is the variance relatively small? Does the variance shrink as the sample size increases? The variance of the sample mean is:⁴⁸ \[ \begin{aligned} \text{Var}\left(\bar{X}\right) &= \text{Var}\left(\frac{1}{n}\sum^n_{i=1}X_i\right) \\ &= \frac{1}{n^2}\text{Var}\left(\sum^n_{i=1}X_i\right) \overset{i}{=} \frac{1}{n^2}\sum^n_{i=1}\text{Var}\left(X_i\right) \\ &= \frac{1}{n^2}\sum^n_{i=1}\sigma^2 = \frac{1}{n^2}n\sigma^2= \frac{\sigma^2}{n}. \\ \end{aligned} \] The second line follows from properties of variance and independent random variables. The third line follows from the fact that, by our modeling assumptions, $\text{Var}(X_i) = \sigma^2$. In this case, as the sample size grows, the amount of variability in our estimator shrinks. A desirable property! Note that the variance of $\widehat\mu = \frac{1}{2}\left(X_1 + X_2\right)$ is $\frac{\sigma^2}{2}$, and does not shrink as $n$ grows. That is one reason why $\widehat\mu$ is not a “good” estimator.
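A short simulation makes the contrast between the two unbiased estimators visible. This is a sketch under assumed settings ($\mu = 0$, $\sigma = 1$, 5,000 replications per sample size, fixed seed); the function name `simulate_variances` is ours.

```python
import random
from statistics import fmean, pvariance

random.seed(2)
sigma, reps = 1.0, 5_000  # hypothetical simulation settings


def simulate_variances(n):
    """Return the simulated variances of Xbar and of (X1 + X2)/2 at sample size n."""
    full_means, two_obs_means = [], []
    for _ in range(reps):
        sample = [random.gauss(0.0, sigma) for _ in range(n)]
        full_means.append(fmean(sample))
        two_obs_means.append(0.5 * (sample[0] + sample[1]))
    return pvariance(full_means), pvariance(two_obs_means)


results = {n: simulate_variances(n) for n in (2, 10, 50)}
for n, (var_full, var_two) in results.items():
    # Theory: Var(Xbar) = sigma^2 / n, while Var((X1 + X2)/2) = sigma^2 / 2 for every n.
    print(n, round(var_full, 3), round(var_two, 3))
```

The first column of variances shrinks roughly like $1/n$; the second stays near $\sigma^2/2$ no matter how much data we collect.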

Ideally, estimators will strike a strong balance between a low bias and low variance. Unfortunately, at a fixed sample size, there is a tradeoff between the two. We can see this tradeoff most clearly by analyzing the mean squared error of the estimator $\widehat\theta$, denoted $MSE(\widehat\theta)$. The MSE is the average squared error between the estimator and the parameter of interest. It can be shown that the MSE is a function of bias and variance:⁴⁹ \[ \begin{aligned} MSE(\widehat\theta) &\overset{\text{def}}{=} E\left[ \left(\widehat\theta - \theta \right)^2 \right] = \text{Var}(\widehat\theta) + \text{Bias}(\widehat\theta)^2. \end{aligned} \] Conceptually, we see that both bias and variance contribute to the MSE: for a fixed MSE, if an estimator were altered so as to increase the bias, the variance would necessarily decrease (and vice versa). Thus, it does not always pay to have an unbiased (or low-bias) estimator; an estimator with a slight bias can (and often does) have a better MSE than an unbiased estimator.
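To see the decomposition and the tradeoff at work, we can compare two estimators of a normal variance by simulation: the bias-corrected sample variance (divisor $n - 1$) and the divisor-$n$ estimator derived earlier as the MLE. This is a sketch under assumed settings ($\sigma^2 = 1$, $n = 10$, 20,000 replications, fixed seed); the helper `summarize` is our own.

```python
import random
from statistics import fmean, pvariance

random.seed(3)
sigma2, n, reps = 1.0, 10, 20_000  # hypothetical simulation settings


def summarize(estimates, truth):
    """Return (bias, variance, MSE) of a list of simulated estimates."""
    bias = fmean(estimates) - truth
    variance = pvariance(estimates)
    mse = fmean([(e - truth) ** 2 for e in estimates])
    return bias, variance, mse


# Sum of squared deviations from the sample mean, one per simulated data set.
sums_of_squares = []
for _ in range(reps):
    sample = [random.gauss(0.0, sigma2**0.5) for _ in range(n)]
    xbar = fmean(sample)
    sums_of_squares.append(sum((xi - xbar) ** 2 for xi in sample))

# Divisor n - 1 gives the unbiased estimator; divisor n gives the MLE.
results = {d: summarize([ss / d for ss in sums_of_squares], sigma2) for d in (n - 1, n)}
for divisor, (bias, variance, mse) in results.items():
    print(divisor, round(bias, 3), round(variance, 3), round(mse, 3), round(variance + bias**2, 3))
```

In runs like this one, the divisor-$n$ estimator shows a negative bias but a smaller variance and a smaller MSE than the unbiased version, and in both rows the last two columns agree, as the decomposition requires.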

Earlier, we saw that $\bar{X}$ is a better estimator than $\widehat\mu = \frac{1}{2}\left(X_1 + X_2\right)$, in part, because the latter ignores information. It seems desirable to only consider estimators that summarize data efficiently, by making use of all the available information. Along those lines, statisticians say that a statistic $\widehat\theta(\mathbf X)$ is a sufficient statistic for a parameter $\theta$ if the distribution of the data $\mathbf{X}$ conditional on $\widehat\theta(\mathbf{X})$—i.e., $f(\mathbf{X} \, | \, \widehat\theta(\mathbf{X}))$—does not depend on the parameter $\theta$. Colloquially, this means that “no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter” (Fisher, 1922). Essentially, a sufficient statistic is a data reduction tool; for the purposes of estimating a parameter $\theta$, the information in a data vector of length $n$ can be reduced to a summary, $\widehat\theta(\mathbf{X})$, of length less than $n$ (and often of length one). In the BK example, where $\mathbf{X} = (X_1,...,X_n)^T$ are iid Bernoulli distributed with parameter $p$, the distribution of the data conditional on $Y = \sum^n_{i = 1}X_i$ is \[ \begin{aligned} P(X_1 = x_1,...,X_n = x_n \, | \, Y = y) \overset{\text{def}}{=} \frac{P(X_1 = x_1,...,X_n = x_n, Y = y)}{P(Y = y)}. \end{aligned} \] This equality holds by definition: in general, the distribution of $A$ conditional on $B$ is equal to the joint distribution of $A$ and $B$ divided by the distribution of $B$. To show that $Y$ is a sufficient statistic, we need to show that this distribution can be simplified to an expression that does not depend on $p$. To simplify the numerator, consider two cases. First, imagine the case where the values $x_1,...,x_n$ actually do sum to $y$. In that case, the last term, $Y = y$, does not add any new information, and we can reduce the numerator to $P(X_1 = x_1,...,X_n = x_n)$.
In the second case, if $x_1,...,x_n$ do not sum to $y$, then the numerator is zero; that is, it’s not possible to have $X_1 = x_1$ and $X_2 = x_2$, etc., and also $Y = y$ when $\sum x_i \ne y$. Thus, \[ \begin{aligned} P(X_1 = x_1, \ldots, X_n = x_n \mid Y = y) &\overset{\text{def}}{=} \frac{P(X_1 = x_1, \ldots, X_n = x_n, Y = y)}{P(Y = y)} \\ &= \begin{cases} \frac{P(X_1 = x_1, \ldots, X_n = x_n)}{P(Y = y)} & \text{if } y = \sum^n_{i=1}x_i \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} \frac{p^y(1-p)^{n-y}}{{n \choose y} p^y(1-p)^{n-y}} & \text{if } y = \sum^n_{i=1}x_i \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} \frac{1}{{n \choose y} } & \text{if } y = \sum^n_{i=1}x_i \\ 0 & \text{otherwise} \\ \end{cases} \end{aligned} \] which does not depend on $p$! (The third line uses the fact that $Y$, a sum of $n$ iid Bernoulli random variables, has a binomial distribution.) As such, $Y = \sum^n_{i=1}X_i$ is a sufficient statistic. It turns out that any one-to-one function of a sufficient statistic is also sufficient, so, perhaps unsurprisingly, $\widehat p = \bar{X}$ is a sufficient statistic for $p$. That means that $Y$ (or $\bar{X}$) contains all of the information necessary to estimate $p$, and no other information is needed. If we were sure we didn’t need the data for any other reason, we could just save $\bar{X}$, and delete the data!
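The sufficiency argument can be verified exactly for a small sample by brute-force enumeration. The sketch below uses $n = 4$ and exact rational arithmetic; rather than plugging in the binomial formula for $P(Y = y)$, it computes that probability by summing the joint probabilities of every sequence with the same total. The specific sequence and the values of $p$ tried are arbitrary illustrations.

```python
from fractions import Fraction
from itertools import product
from math import comb


def joint(x, p):
    """P(X_1 = x_1, ..., X_n = x_n) for iid Bernoulli(p) data."""
    y = sum(x)
    return p**y * (1 - p) ** (len(x) - y)


def conditional_given_sum(x, p):
    """P(X = x | Y = sum(x)), with P(Y = y) computed by enumerating all sequences."""
    y = sum(x)
    prob_y = sum(joint(z, p) for z in product((0, 1), repeat=len(x)) if sum(z) == y)
    return joint(x, p) / prob_y


x = (1, 0, 0, 1)  # an arbitrary sequence with n = 4 and y = 2
conditionals = [
    conditional_given_sum(x, p)
    for p in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10))
]
print(conditionals)                 # the same value for every p
print(Fraction(1, comb(4, 2)))      # 1 / (n choose y)
```

Whatever value of $p$ we try, the conditional probability comes out the same, so once $Y$ is known the individual sequence carries no further information about $p$.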

In addition to bias, variance, MSE, and sufficiency for a finite sample size, it is also desirable to know what happens to an estimator as we increase the sample size. Consider the MSE of $\bar{X}$: \[ \begin{aligned} MSE(\bar{X}) &= \text{Var}(\bar{X}) + \underbrace{\text{Bias}(\bar{X})^2}_{=0} = \frac{\sigma^2}{n}. \end{aligned} \] For the sample mean, the mean squared error tends toward zero as the sample size grows. That is a desirable property: more data means more information and less error. Interestingly, $MSE(\bar{X})$ can be made arbitrarily close to zero by taking $n$ large enough. That is not true of all estimators. For example, $MSE(\widehat\mu) = \frac{\sigma^2}{2}$ for any sample size. $MSE(\widehat\mu)$ does not get better as we obtain more information. In general, an estimator with an MSE that does not shrink to zero as the sample size grows is a less-than-ideal estimator.

⁴⁹ Here’s the proof: \[ \begin{aligned} MSE(\widehat\theta) &\overset{\text{def}}{=} E\left[ \left(\widehat\theta - \theta \right)^2 \right] \\ &= E\left[ \left(\widehat\theta - E(\widehat\theta) + E(\widehat\theta ) - \theta \right)^2 \right] \\ &= E\left[ \left(\widehat\theta - E(\widehat\theta) \right)^2\right] + \left[ E(\widehat\theta) - \theta \right]^2 + \underbrace{E\left[ 2 \left(\widehat\theta - E(\widehat\theta)\right)\left( E(\widehat\theta ) - \theta \right)\right]}_{=0} \\ &= \text{Var}(\widehat\theta) + \text{Bias}(\widehat\theta)^2. \end{aligned} \] The last term in line three is zero because $\left( E(\widehat\theta ) - \theta \right)$ is a constant with respect to the outer expected value, which implies it can be pulled outside, and the resulting term is zero: \[ \begin{aligned} 2\left( E(\widehat\theta ) - \theta \right)E\left[ \left(\widehat\theta - E(\widehat\theta)\right)\right] &= 2\left( E(\widehat\theta ) - \theta \right) \left(E(\widehat\theta) - E(E(\widehat\theta))\right) \\ &= 2\left( E(\widehat\theta ) - \theta \right)\underbrace{ \left(E(\widehat\theta) - E(\widehat\theta)\right)}_{=0} =0 \end{aligned} \]

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society, 309–368. https://doi.org/10.1007/978-1-4612-0919-5_2

There are other “large sample” performance properties of estimators. We use the notation $\widehat\theta_n$ when we want to make explicit the role of sample size.

Consistency. An estimator $\widehat\theta_n$ is consistent with respect to $\theta$ if, as the sample size increases, the probability that $\widehat\theta_n$ is arbitrarily close to $\theta$ approaches one. That is, an estimator $\widehat\theta_n$ is consistent if, for every $\epsilon >0$, \[ \begin{aligned} \lim_{n\to\infty}P\left(|\widehat\theta_n - \theta | < \epsilon \right) = 1. \end{aligned} \] As a shorthand notation, we say $\widehat\theta_n \overset{P}{\to}\theta$. If an estimator were not consistent, even with theoretically unlimited information, we would not be able to pinpoint the value we are trying to estimate. Therefore, consistency is a necessary condition for an estimator to be worth considering. It is not a sufficient condition. The estimator \[ \widehat\mu^* = \begin{cases} 0 & \text{if } n < 10^{24} \\ \bar{X} & \text{if } n \ge 10^{24} \end{cases} \] is consistent (Rao, 1973; Spanos, 2019). But it is a very poor estimator for any reasonable sample size. $\bar{X}$, on the other hand, is consistent and performs well at realistic sample sizes; the proof of its consistency, which is a bit more mathematically involved than necessary here, can be found in Casella & Berger (2024) or Corcoran (2022).
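While the proof is omitted, the definition of consistency can be illustrated by simulation: fix a tolerance $\epsilon$ and estimate $P(|\bar{X} - \mu| < \epsilon)$ at several sample sizes. This is only a sketch, with hypothetical settings ($\mu = 0$, $\sigma = 1$, $\epsilon = 0.1$, 1,000 replications, fixed seed); a simulation at finite $n$ illustrates the limit but does not prove it.

```python
import random
from statistics import fmean

random.seed(4)
mu, sigma, eps, reps = 0.0, 1.0, 0.1, 1_000  # hypothetical simulation settings


def prob_within_eps(n):
    """Monte Carlo estimate of P(|Xbar_n - mu| < eps)."""
    hits = 0
    for _ in range(reps):
        xbar = fmean([random.gauss(mu, sigma) for _ in range(n)])
        if abs(xbar - mu) < eps:
            hits += 1
    return hits / reps


coverage = {n: prob_within_eps(n) for n in (10, 100, 1000)}
for n, prob in coverage.items():
    print(n, prob)
```

The estimated probabilities climb toward one as $n$ grows, which is the behavior the limit statement describes.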

Asymptotically unbiased. Some estimators are biased, but as the sample size goes to infinity, they become unbiased. An asymptotically unbiased estimator is an estimator $\widehat\theta_n$ for which: \[ \begin{aligned} \lim_{n\to\infty}\text{Bias}(\widehat\theta_n) = \lim_{n\to\infty} \left[E\left(\widehat{\theta}_n\right) - \theta\right] = 0. \end{aligned} \] Any estimator that is unbiased is asymptotically unbiased. Under standard regularity conditions, MLEs are asymptotically unbiased.

Efficiency. An estimator, $\widehat\theta_n$, of a parameter, $\theta$, is efficient if it is unbiased and reaches the minimum possible variance for all unbiased estimators of $\theta$. The minimum possible variance, when it exists, is called the Cramér–Rao lower bound ($CRLB_\theta$), and can be found from mathematical properties of the DGP in question.⁵⁰ There is no guarantee that there will be an efficient estimator in any given context. In the normal model, $\bar{X}$ is efficient. It can be shown that, with $\theta = \mu$, $CRLB_\theta = \sigma^2/n$ is the minimum possible variance for an unbiased estimator of $\mu$ (Casella & Berger, 2024; Corcoran, 2022), and this is exactly $\text{Var}(\bar{X})$.

⁵⁰ $CRLB_\theta = \mathcal{I}(\theta)^{-1}$, where $\mathcal{I}(\theta)$ is the Fisher information of the parameter $\theta$. The Fisher information is formally defined as the variance of the derivative (with respect to $\theta$) of the log-likelihood function. It can be understood as a measure of how much information the data contain about the unknown parameter.
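The definition of Fisher information as the variance of the derivative of the log-likelihood can be checked by simulation for the normal mean. From the derivative computed earlier, the derivative with respect to $\mu$ is $\sum_i (x_i - \mu)/\sigma^2$. The sketch below simulates it at the true $\mu$ and inverts its variance to approximate the bound. All numerical settings ($\mu = 1$, $\sigma = 2$, $n = 20$, 20,000 replications, the seed) are hypothetical.

```python
import random
from statistics import pvariance

random.seed(5)
mu, sigma, n, reps = 1.0, 2.0, 20, 20_000  # hypothetical simulation settings


def score(sample, mu, sigma):
    """Derivative of the normal log-likelihood with respect to mu."""
    return sum(xi - mu for xi in sample) / sigma**2


# Evaluate the score at the true mu for many simulated data sets.
scores = [
    score([random.gauss(mu, sigma) for _ in range(n)], mu, sigma)
    for _ in range(reps)
]

fisher_information = pvariance(scores)   # theory: n / sigma^2
crlb = 1.0 / fisher_information          # theory: sigma^2 / n

print(round(fisher_information, 3), n / sigma**2)
print(round(crlb, 4), sigma**2 / n)
```

The simulated bound lands close to $\sigma^2/n$, which is also the variance of $\bar{X}$ derived above.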

Asymptotically efficient. Some estimators are not efficient for a finite sample size, but become efficient as $n \to \infty$. When an estimator reaches efficiency in the limit, we call it asymptotically efficient.

Asymptotic normality. If the distribution of $\widehat\theta_n$ converges to a normal distribution as $n\to\infty$, then the estimator is said to be asymptotically normal: \[ \widehat\theta_n \overset{a}\sim N(\mu_{\widehat\theta_n}, \sigma^2_{\widehat\theta_n}). \] Asymptotic normality can be desirable when using estimators to derive test statistics and to find approximate confidence intervals (discussed below in 1.5.2). An estimator that meets many of these properties is thought of as a “good” estimator. $\bar{X}$ is one such estimator. In fact, $\bar{X}$ is a “uniformly best unbiased estimator”, in the sense that, among all unbiased estimators of $\mu$, for any true value of $\mu$, $\bar{X}$ has the lowest variance. No other estimator performs better in this context!
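Asymptotic normality can be illustrated with data that are far from normal. In the sketch below, the sample proportion from a Bernoulli DGP is standardized and we count how often it lands within $\pm 1.96$, the range that holds about 95% of a standard normal distribution. The settings ($p = 0.1$, $n = 400$, 2,000 replications, the seed) are hypothetical, and because the underlying data are discrete the agreement is only approximate.

```python
import random
from math import sqrt

random.seed(6)
p, n, reps = 0.1, 400, 2_000  # hypothetical simulation settings

# Standard deviation of the sample proportion under the true p.
standard_error = sqrt(p * (1 - p) / n)

standardized = []
for _ in range(reps):
    successes = sum(1 for _ in range(n) if random.random() < p)
    p_hat = successes / n
    standardized.append((p_hat - p) / standard_error)

# Share of standardized estimates inside the central 95% range of N(0, 1).
share_within = sum(1 for z in standardized if abs(z) < 1.96) / reps
print(share_within)
```

A share close to 0.95 is what we would expect if the standardized estimator were approximately standard normal.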

In general, there is not always a best unbiased estimator. In choosing an estimator, we often have to work with something that is “suboptimal”. The MLE is often a good choice. Consider the MLE of the normal distribution variance, $\widehat\sigma^2_{ML} =\frac{\sum^n_{i=1}(X_i-\bar{X})^2}{n}$. This estimator is consistent, but it is also biased, and so it is clearly not a “best unbiased estimator”. How biased is it? It can be shown that: \[ \begin{aligned} E\left(\widehat\sigma^2_{ML}\right) &=E\left(\frac{\sum^n_{i=1}(X_i-\bar{X})^2}{n}\right) = \frac{n-1}{n}\sigma^2. \end{aligned} \] Note that as $n\to\infty$, $\frac{n-1}{n} \to 1$, and thus $E\left(\widehat\sigma^2_{ML}\right) \to \sigma^2$. Thus, while $\widehat\sigma^2_{ML}$ is biased for a finite $n$, it is asymptotically unbiased. It can also be shown that $\widehat\sigma^2_{ML}$ is not efficient (for a finite $n$, it’s not unbiased, so it cannot be efficient as we have defined it); but it is asymptotically efficient and asymptotically normal. This example is illustrative of some nice general properties of the MLE. Under standard regularity conditions, the MLE is:

Consistent. $\widehat\theta_{ML} \overset{P}{\to} \theta$.

Asymptotically unbiased. $E\left(\widehat\theta_{ML}\right) \to \theta$ as $n\to\infty$.

Asymptotically efficient. $\text{Var}\left(\widehat\theta_{ML}\right) \to CRLB_\theta$ as $n\to\infty$.

Asymptotically normally distributed. The distribution of the MLE converges to a normal distribution as $n\to\infty$: \[ \widehat\theta_{ML} \overset{a}\sim N(\theta, CRLB_\theta). \]

Invariant to transformation. Let $\gamma = g(\theta)$ be some function. Then the MLE of $\gamma$ is the function of the MLE of $\theta$: \[ \widehat\gamma_{ML} = g(\widehat\theta_{ML}). \]
The last property is convenient. It tells us that if we know the MLE of $\theta$, we can easily find the MLE of functions of $\theta$, by just taking the function of the MLE. So, in the BK example, since $\widehat p_{ML} = \bar{X}$, the MLE of $\sigma^2 = \text{Var}(X_i) \overset{\text{def}}{=} p(1-p)$ is $\widehat\sigma^2_{ML} = \bar{X}(1-\bar{X})$. No need to work with the likelihood function again!
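The finite-sample bias of the variance MLE can be checked numerically. The sketch below is an illustration under stated assumptions (normal data with $\mu = 0$ and $\sigma^2 = 4$, and hypothetical helper names `var_mle` and `mean_var_mle`); it compares a Monte Carlo estimate of $E\left(\widehat\sigma^2_{ML}\right)$ with the value $\frac{n-1}{n}\sigma^2$ for a few sample sizes.

```python
import random
from statistics import fmean

def var_mle(xs):
    """MLE of a normal variance: sum of squared deviations divided by n (not n - 1)."""
    xbar = fmean(xs)
    return sum((x - xbar) ** 2 for x in xs) / len(xs)

def mean_var_mle(n, sigma=2.0, reps=5_000, seed=0):
    """Monte Carlo estimate of E(var_mle) for samples of size n from N(0, sigma^2)."""
    rng = random.Random(seed)
    return fmean(
        var_mle([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(reps)
    )

sigma2 = 4.0  # true variance used in the simulation (sigma = 2)
for n in (5, 20, 100):
    theory = (n - 1) / n * sigma2
    print(f"n={n:3d}  simulated={mean_var_mle(n):.3f}  (n-1)/n * sigma^2={theory:.3f}")
```

As $n$ grows, both columns should approach $\sigma^2 = 4$, which is the asymptotic unbiasedness described above.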

4.5.1.2 Objections to frequentist point estimation

Some object to the utility of these properties. Do they really provide the proper grounding for what constitutes a “good” estimator? These properties are founded on the idea that a good estimator is one that performs well over all possible samples from the DGP. But why should we care about resampling theory? Why do we care what an estimator can do for data that we did not collect? Instead of using probability to measure variability over hypothetical datasets, some statisticians argue (as we will see in Chapter 5) that we should use it to quantify our epistemic uncertainty in our parameters based on data that was actually collected. Data that we did not collect is irrelevant.

This objection may be shortsighted. Data that we did not collect, but could have collected—what philosophers call counterfactuals—is often quite relevant. When researchers study the impact of an intervention on some outcome of interest—say, the impact of a vaccine on disease protection—they are implicitly asking the question: if this individual did not receive the intervention, would their outcome have been better or worse? For any single individual, this is unequivocally unavailable, counterfactual information. We cannot know how a specific individual who was vaccinated at time $t$ would have responded in terms of disease protection had they not been vaccinated at time $t$.⁵¹ Similarly, for any actual dataset, it is relevant to think about how typical it might be under hypothesized parameter values and DGPs. That is exactly what resampling allows us to do.

⁵¹ Of course, we can consider proxies to this counterfactual, like how the same individual would respond at a different time—e.g., the following flu season—or how a sufficiently similar individual would respond. We can consider what happens at the population level, and use that to predict what might happen for any individual. But using these proxies to estimate the counterfactual simply underlines the importance of counterfactuals themselves.

Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). Wiley.

There are further objections to asymptotic properties of estimators. These properties are not simply about resampling from a population, but about what would happen if we increased the sample size without bound. Why do we care about how estimators behave for sample sizes we can never achieve? Consider, again, the estimator (Rao, 1973; Spanos, 2019): \[ \widehat\mu^* = \begin{cases} 0 & \text{if } n < 10^{24} \\ \bar{X} & \text{if } n \ge 10^{24}. \end{cases} \] As mentioned above, $\widehat\mu^*$ is consistent. But it only achieves closeness (in probability) to $\mu$ after an unreasonably large sample size. Similarly, $\widehat\mu^*$ is biased, and does not trend toward unbiasedness as $n$ grows. Instead, $\widehat\mu^*$ is biased until $n$ is unachievably large, and then, all of a sudden, is unbiased. These facts suggest that, at least in general, asymptotic properties of estimators may be of limited use.

These objections do present some problems with asymptotic performance criteria. But they are not insurmountable objections. As discussed above, consistency is a necessary condition for an estimator to be “good”. But it is not a sufficient condition, for exactly this reason: consistent estimators may be quite bad in practice. As with any performance metric, consistency is not strong enough by itself to classify an estimator as “good”. More is needed. Asymptotic properties are meant to be used together, along with finite sample properties, as part of a more comprehensive picture of estimator performance.

Similarly, in the case of asymptotic unbiasedness, it is certainly possible that an estimator is “bad” on this metric for any reasonable sample size. But in most cases, when estimators are derived from some principled procedure—like the MLE, and unlike the arbitrarily derived $\widehat\mu^*$—estimators trend toward unbiasedness in a gradual manner. For example, for $\widehat\sigma^2_{ML}$, the expression $E\left(\widehat\sigma^2_{ML}\right) = \frac{n-1}{n}\sigma^2$ derived above makes it clear that the bias steadily tends toward zero as $n$ increases.

In summary, an estimator that is consistent, unbiased, and efficient constitutes a best case (Spanos, 2019). Beyond that, tradeoffs are inevitable. Specifically, the tradeoff between bias and variance is important to consider on a case-by-case basis.

4.5.2 Interval estimation

Point estimation is helpful in cases where stakeholders need a single number to explain or predict some phenomenon. But point estimation alone is not sufficient for making inferential claims about parameters, i.e., claims about how close the point estimate is to the true value of the parameter (Spanos, 2019, Chapter 11). For the purposes of inference, a point estimate should be accompanied by a range of values—called an interval estimate—that helps quantify uncertainty in the estimate. A narrow interval generally corresponds to less uncertainty. In the frequentist context, this is often achieved with a confidence interval.

Spanos, A. (2019). Probability theory and statistical inference: Empirical modeling with observational data. Cambridge University Press.

A confidence interval is a range of values accompanied by a measure of confidence. The measure of confidence is meant to represent how “confident” we are that the true parameter is inside of the interval. We must be careful in how we interpret “confident” in this context. To clarify, consider again the DGP $X_1,...,X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, with $\mu$ unknown, $\sigma^2$ known. The interval \[ \begin{aligned} \mathcal{I}_{\mu}(\mathbf{x}) = \left(\bar x - \frac{\sigma}{\sqrt{n}}, \bar x + \frac{\sigma}{\sqrt n} \right), \end{aligned} \] is a reasonable starting point for a confidence interval. $\mathcal{I}_{\mu}(\mathbf{x})$ utilizes the best estimate of $\mu$, and produces an interval by “expanding” that best estimate to include values on either side of it, to a length of two times the standard error—i.e., the standard deviation of $\bar{X}$: \[ \begin{aligned} \text{s.e.}(\bar{X}) \overset{\text{def}}{=} \text{s.d.}(\bar{X}) = \sqrt{\text{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}. \end{aligned} \] To be a confidence interval, this interval must be accompanied by a confidence level. How confident can we be that $\mu$ is inside of $\mathcal{I}_{\mu}(\mathbf{x})$? For a frequentist statistician, this is a difficult question to answer. For $\mathcal{I}_{\mu}(\mathbf{x})$, which is formulated at the level of actual data, “confidence” cannot mean “probability”, at least not in any meaningful sense. The probability that $\mu$ is inside $\mathcal{I}_{\mu}(\mathbf{x})$ is either zero or one. That’s because $\mu$ is a fixed constant, and $\mathcal{I}_{\mu}(\mathbf{x})$ is an interval with fixed constant endpoints. Thus, either $\mu$ is in $\mathcal{I}_{\mu}(\mathbf{x})$ or it is not—i.e., $P(\mu \in \mathcal{I}_{\mu}(\mathbf{x}))$ is either zero or one. For any given fixed interval, we do not know which of these options is the case. If we did, we would have knowledge of $\mu$, and thus, we wouldn’t need to do estimation in the first place!
“Zero or one” is not a helpful confidence level assignment. The confidence level associated with $\mathcal{I}_{\mu}(\mathbf{x})$ must come about through different means.

Thankfully, frequentists have a philosophically consistent method for grounding confidence measures. Instead of assessing confidence at the level of actual data, as in $\mathcal{I}_{\mu}(\mathbf{x})$, frequentists assess confidence at the level of the DGP, i.e., \[ \begin{aligned} \mathcal{I}_{\mu}(\mathbf{X}) = \left(\bar X - \frac{\sigma}{\sqrt{n}}, \bar X + \frac{\sigma}{\sqrt n} \right). \end{aligned} \] As a function of the random sample $\mathbf{X}$, $\mathcal{I}_{\mu}(\mathbf{X})$ is a random interval—i.e., an interval that is defined by endpoints that are random variables. By virtue of the random endpoints, \[ \begin{aligned} P\left(\mu \in \mathcal{I}_{\mu}(\mathbf{X})\right) = P\left(\mu \in \left(\bar X - \frac{\sigma}{\sqrt{n}}, \bar X + \frac{\sigma}{\sqrt n} \right)\right), \end{aligned} \] is a meaningful and non-trivial probability statement. Some mathematical unpacking shows that, \[ \begin{aligned} P\left(\mu \in \mathcal{I}_{\mu}(\mathbf{X})\right) = P\left(Z \le 1\right) - P\left(Z \le -1\right) \approx 0.68, \,\,\,\, Z \sim N(0,1). \end{aligned} \] Thus, at the level of the random sample, the probability that the interval covers $\mu$ is (approximately) $0.68$. It is in that sense that we can be (roughly) $68\%$ confident that $\mathcal{I}_{\mu}(\mathbf{X})$ will cover the true parameter $\mu$. Since, for frequentists, probability means long run relative frequency, we can say that, if we were to take many samples of size $n$ from this population, and compute $\mathcal{I}_{\mu}(\mathbf{X})$ for each one, about $68\%$ of the resulting intervals would contain the true parameter. With the move from $\mathcal{I}_{\mu}(\mathbf{x})$ to $\mathcal{I}_{\mu}(\mathbf{X})$, it is clear that the confidence level is attached to the confidence interval procedure—at the level of the DGP—rather than to any specific interval computed for actual data.

Figure 4.9 provides a visualization of the coverage properties of $\mathcal{I}_{\mu}(\mathbf{X})$ in a sequence of $m = 100$ intervals, computed with simulated data from $X_1,...,X_n ~\overset{iid}{\sim}~N(\mu ~= ~0,\,\, 1)$. Each of the $m = 100$ vertical lines represents an application of the procedure $\mathcal{I}_{\mu}(\mathbf{X})$ to a sample from the population. The $70$ gray lines are confidence intervals that cover $\mu = 0$. For example, the first interval from the left is $\mathcal{I}_{\mu}(\mathbf{x}_1) \approx (-0.20, 0.08)$, which contains zero. The $30$ black lines are those that do not cover $\mu = 0$. The first black interval from the left—the third interval overall—is $\mathcal{I}_{\mu}(\mathbf{x}_3) \approx (0.02, 0.30)$, which does not cover zero.

Of course, the analysis in Figure 4.9 is meant to illustrate the performance of $\mathcal{I}_{\mu}(\mathbf{X})$ as a procedure, when the parameter $\mu = 0$ is known. In real-world estimation contexts, practitioners will not know $\mu$, and will only compute one confidence interval. Thus, they will not know whether their interval estimate, $\mathcal{I}_{\mu}(\mathbf{x})$, covers $\mu$. The best that can be said is that as a general procedure applied to random samples, $\mathcal{I}_{\mu}(\mathbf{X})$ is somewhat reliable, in the sense that, $68\%$ of the time it will cover the true parameter.

An illustration of the coverage properties of a 68% confidence interval attempting to estimate μ = 0. Each of the m = 100 vertical lines represents a 68% confidence interval computed from a random sample of size n from the DGP. The 70 gray lines are confidence intervals that cover μ = 0. The 30 black lines are those that do not cover μ = 0.
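A coverage experiment of this kind can be reproduced with a short simulation. The following is a sketch only: the sample size $n = 50$, the seed, and the helper names `covers` and `coverage_rate` are assumptions made for illustration and are not taken from the text. With only $m = 100$ intervals the observed coverage will fluctuate around $68\%$; with many more intervals it should settle near $P(-1 \le Z \le 1) \approx 0.68$.

```python
import random
from math import sqrt
from statistics import fmean

def covers(mu, sigma, n, k, rng):
    """Draw one sample of size n; report whether xbar +/- k*sigma/sqrt(n) covers mu."""
    xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
    half_width = k * sigma / sqrt(n)
    return xbar - half_width < mu < xbar + half_width

def coverage_rate(m, n=50, mu=0.0, sigma=1.0, k=1.0, seed=1):
    """Fraction of m simulated intervals that cover the true mean mu."""
    rng = random.Random(seed)
    return sum(covers(mu, sigma, n, k, rng) for _ in range(m)) / m

print("m = 100   :", coverage_rate(100))
print("m = 20000 :", coverage_rate(20_000))
```

The long-run fraction is a property of the procedure; any single interval in the simulation either covers $\mu = 0$ or does not.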

4.5.2.1 Changing the confidence level

The confidence level of $68\%$ was arbitrary. Often, confidence intervals are reported at a higher level of confidence. We can find a confidence interval of confidence level $c\times 100\%$ by solving the following equation for $a, b > 0$: \[ \begin{aligned} c &= P\left(\mu \in \left(\bar X - a\frac{\sigma}{\sqrt{n}}, \bar X + b\frac{\sigma}{\sqrt n} \right)\right) \\ &= P\left(\bar X - a\frac{\sigma}{\sqrt{n}} \le \mu \le \bar X + b\frac{\sigma}{\sqrt n} \right) \\ &= P\left(- b\le \frac{\bar{X} -\mu}{\sigma/\sqrt n} \le a \right) \\ &= P\left(- b\le Z \le a\right), \,\,\, \text{where } Z \sim N(0,1). \end{aligned} \] The last equality says that, for this DGP, a confidence interval for $\mu$ with confidence level $c\times 100\%$ can be found by finding an $a$ and $b$ such that $c\times 100\%$ of the area under the standard normal curve is trapped between $-b$ and $a$. There are infinitely many ways to do this; one obvious (and optimal!) way is to use the “symmetric around zero” property of the standard normal distribution, and set $a = b$: \[ \begin{aligned} c &= P\left(- b\le Z \le b\right). \end{aligned} \] In the case that $c = 0.95$—a high level of confidence—$b \approx 1.96$. When $c = 0.99$, $b \approx 2.58$. As $c$ increases, $b$ also increases, and thus, the width of the interval increases. That highlights an important tradeoff: all other quantities equal, as the confidence level increases, the width of the confidence interval also increases. If we want more confidence, what we become confident of will include more values.⁵² In order to decrease the width of a confidence interval at a constant confidence level, one must increase the sample size (or, if possible, reduce the variability, $\sigma^2$, in the DGP). More information means less uncertainty for a given confidence level.

⁵² In the limit, we can be $100\%$ confident that the parameter is in the interval $(-\infty, \infty)$. Helpful!
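The symmetric multiplier $b$ for a given confidence level $c$ solves $P(-b \le Z \le b) = c$, which gives $b = \Phi^{-1}\left(\frac{1+c}{2}\right)$. A minimal sketch using Python's standard library follows; the function names `critical_value` and `interval_width`, and the values $\sigma = 1$, $n = 100$, are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def critical_value(c):
    """Return b such that P(-b <= Z <= b) = c for Z ~ N(0, 1)."""
    return NormalDist().inv_cdf((1 + c) / 2)

def interval_width(c, sigma, n):
    """Width of the symmetric interval xbar +/- b * sigma / sqrt(n)."""
    return 2 * critical_value(c) * sigma / sqrt(n)

for c in (0.68, 0.95, 0.99):
    b = critical_value(c)
    print(f"c={c:.2f}  b={b:.3f}  width(sigma=1, n=100)={interval_width(c, 1.0, 100):.3f}")
```

Holding $\sigma$ and $n$ fixed, the width grows with $c$; holding $c$ fixed, quadrupling $n$ halves the width.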

4.5.2.2 Methods for finding confidence intervals

We arrived at $\mathcal{I}_{\mu}(\mathbf{X})$ by intuition. But as with tests and point estimators, there are systematic methods for finding interval estimates. Some of those methods produce intervals that have desirable properties. One method is to invert a hypothesis test. In the case of $X_1,...,X_n ~\overset{iid}{\sim}~N(\mu, \sigma^2)$ ($\mu$ unknown, $\sigma^2$ known), the best test of size $\alpha = 0.05$—the test of size $\alpha$ with the highest power—of the hypotheses \[ \begin{aligned} H_0: \mu = \mu_0, \, H_1: \mu \ne \mu_0 \end{aligned} \] is given by the test statistic \[ \begin{aligned} Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1), \end{aligned} \] with rejection region $\mathcal{R} = (-\infty, -1.96) \cup (1.96, \infty)$. This rejection region is depicted in the gold shaded regions in Figure 4.10. Each tail region contains probability $\alpha/2 = 0.025$, for a total of $\alpha = 0.05$. Consequently, we know that the area under the curve between $-1.96$ and $1.96$ is $0.95$. We can use this fact to “invert” the test of size $\alpha = 0.05$ to arrive at a $(1-\alpha) \times 100\% = 95\%$ confidence interval: \[ \begin{aligned} 0.95 &= P\left(-1.96 < Z < 1.96\right) \\ &=P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) \\ &=P\left(-1.96\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < 1.96\frac{\sigma}{\sqrt{n}} \right) \\ &=P\left(-\bar{X}-1.96\frac{\sigma}{\sqrt{n}} < - \mu < -\bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right) \\ &=P\left(\bar{X}-1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right). \end{aligned} \] Since $\mu \in \left(\bar{X}-1.96\frac{\sigma}{\sqrt{n}} , \, \, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)$ with probability 0.95, \[ \mathcal{I}_{\mu,95}(\mathbf{X}) = \left(\bar{X}-1.96\frac{\sigma}{\sqrt{n}}, \,\, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right) \tag{4.1}\] is a $95\%$ confidence interval for $\mu$.
To derive this confidence interval at arbitrary confidence level $(1-\alpha)\times 100\%$, replace $1.96$ in equation (4.1) with $z_{\alpha/2}$, where $z_{\alpha/2}$ defines the upper boundary of the two-tailed rejection region of size $\alpha$ (i.e., the value along the $z$ axis that traps $\alpha/2$ to the right and under the standard normal curve).

Inverting hypothesis tests to derive confidence intervals suggests a deep connection between the two. The confidence interval $\mathcal{I}_{\mu,95}(\mathbf{X})$ contains the parameter values that would not be rejected by the corresponding two-tailed “best test” of size $\alpha = 0.05$ (Mayo & Spanos, 2011b). Suppose that for actual data $\mathbf{x}$, $\mathcal{I}_{\mu,95}(\mathbf{x}) = (-0.1,0.3)$. Then the test \[ \begin{aligned} H_0: \mu &= \mu_0 \\ H_1: \mu &\ne \mu_0 \end{aligned} \] of size $\alpha = 0.05$ would fail to reject $H_0$ for all $\mu_0 \in (-0.1,0.3)$, and would reject $H_0$ for all $\mu_0 \notin (-0.1,0.3)$. This relationship holds for one-sided tests too. The test \[ \begin{aligned} H_0: \mu &= \mu_0 \\ H_1: \mu &> \mu_0 \end{aligned} \] at level $\alpha = 0.05$ would fail to reject $H_0$ for any $\mu_0$ in the $95\%$ confidence interval $\mathcal{I}_{\mu,95,>}(\mathbf{X}) = \left( \bar{X} - 1.645\frac{\sigma}{\sqrt{n}}, \infty \right)$, where $1.645 \approx z_{0.05}$ is the value that traps probability $0.05$ in the upper tail of the standard normal distribution.
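The duality between the two-sided $z$ test and the $95\%$ interval can be checked directly. In the sketch below, the numbers $\bar{x} = 0.1$, $\sigma = 1$, and $n = 96$ are hypothetical, chosen only so that the interval comes out close to $(-0.1, 0.3)$; the function names are likewise illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # upper 2.5% point of N(0, 1), about 1.96

def z_test_rejects(xbar, mu0, sigma, n):
    """Two-sided z test of H0: mu = mu0 at size 0.05 (sigma known)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return abs(z) > Z_CRIT

def ci95(xbar, sigma, n):
    """Interval xbar +/- Z_CRIT * sigma / sqrt(n)."""
    half = Z_CRIT * sigma / sqrt(n)
    return xbar - half, xbar + half

xbar, sigma, n = 0.1, 1.0, 96          # hypothetical observed mean and design
lo, hi = ci95(xbar, sigma, n)
print(f"95% interval: ({lo:.3f}, {hi:.3f})")
for mu0 in (-0.2, -0.05, 0.1, 0.25, 0.35):
    inside = lo < mu0 < hi
    print(f"mu0={mu0:+.2f}  inside interval={inside}  test rejects={z_test_rejects(xbar, mu0, sigma, n)}")
```

For every candidate $\mu_0$, "inside the interval" and "not rejected" should agree, which is what inverting the test means.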

Maximum likelihood estimation can also be used to derive confidence intervals. In cases where we know the exact finite sample distribution of the MLE, we can use that information to derive an exact confidence interval. When $X_1,...,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$, with $\sigma^2$ known, $\bar{X}$ is the MLE for $\mu$. In this case, $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$, and thus, the $95\%$ confidence interval for $\mu$ associated with the MLE is $\mathcal{I}_{\mu,95}(\mathbf{X})$, the same interval obtained by inverting the $z$ test.

In cases where we do not know the exact distribution of the MLE, but have a large sample size, we can rely on the asymptotic distribution of the MLE: \[ \begin{aligned} \widehat\theta_{ML} \overset{a}{\sim}N(\theta,CRLB_\theta). \end{aligned} \] For the BK example, the DGP $X_1,...,X_n$ is iid Bernoulli with unknown parameter $p$. We saw that the MLE of $p$ is $\bar{X}$. In this context, the exact distribution of $\bar{X}$ is not normal, but as $n \to \infty$, \[ \begin{aligned} \bar{X} \overset{a}{\sim}N\left(p, \,\, CRLB_p = \frac{p(1-p)}{n}\right). \end{aligned} \] By this asymptotic distribution, we know that, for a large $n$, \[ \begin{aligned} 1-\alpha &\approx P\left(-z_{\alpha/2} < \frac{\bar{X} - p}{\sqrt{\frac{p(1-p)}{n}}} < z_{\alpha/2} \right) \\ & \approx P\left(-z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < \bar{X} - p < z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right) \\ & \approx P\left(\bar{X} -z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < p < \bar{X} +z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right). \end{aligned} \] The last line is suggestive of an approximate $(1-\alpha)\times 100\%$ confidence interval candidate: $\left( \bar{X} - z_{\alpha/2} \sqrt{ \frac{p(1-p)}{n}}, \bar{X} + z_{\alpha/2} \sqrt{ \frac{p(1-p)}{n}} \right)$. However, we cannot work with this interval in practice because $p$—the unknown parameter of interest that we are attempting to estimate—is part of the interval formula. Thankfully, there’s a way to modify this formula that preserves the approximate coverage properties. Recall that $\sigma^2 = p(1-p)$ is the variance of the Bernoulli distribution. So, by the invariance property of the MLE (described earlier in this section), $\widehat\sigma^2_{ML} = \bar{X}(1-\bar{X})$. To obtain an approximate $(1-\alpha)\times 100\%$ confidence interval, we can substitute $\widehat\sigma^2_{ML}$ for $\sigma^2$.
That is, the interval \[ \begin{aligned} \mathcal I_{p,1-\alpha}(\mathbf{X}) = \left( \bar{X} - z_{\alpha/2} \sqrt{ \frac{\bar{X}(1-\bar{X})}{n}}, \bar{X} + z_{\alpha/2} \sqrt{ \frac{\bar{X}(1-\bar{X})}{n}} \right) \end{aligned} \] is an approximate $(1-\alpha)\times 100\%$ confidence interval for $p$. The coverage is only approximately $(1-\alpha)\times 100\%$ because of (1) the asymptotic approximation of the distribution of $\bar{X}$, and (2) the fact that we substitute $\widehat\sigma^2_{ML}$ for $\sigma^2$. Both (1) and (2) add uncertainty to the interval end points, but that uncertainty decreases as $n$ increases.
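The plug-in interval for a Bernoulli proportion, $\bar{X} \pm z_{\alpha/2}\sqrt{\bar{X}(1-\bar{X})/n}$, is easy to compute and to check by simulation. The sketch below is illustrative only: the true value $p = 0.1$, the sample sizes, and the function names `wald_interval` and `wald_coverage` are assumptions, not values from the BK example.

```python
import random
from math import sqrt
from statistics import NormalDist

def wald_interval(xs, alpha=0.05):
    """Approximate (1 - alpha) interval for p from 0/1 data, plugging xbar in for p."""
    n = len(xs)
    p_hat = sum(xs) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wald_coverage(p, n, reps=4_000, seed=2):
    """Fraction of simulated intervals that cover the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        lo, hi = wald_interval(xs)
        hits += lo < p < hi
    return hits / reps

for n in (20, 100, 500):
    print(f"n={n:4d}  simulated coverage of nominal 95% interval: {wald_coverage(0.1, n):.3f}")
```

In runs like this one, coverage tends to fall short of the nominal $95\%$ for small $n$ and to move toward it as $n$ grows, which reflects the two sources of approximation in the interval.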

For more on methods for deriving confidence intervals for various DGPs, see Casella & Berger (2024).

The rejection region for a two-tailed Z test. This test can be inverted to find the (1 − α) × 100% confidence interval for μ.

4.5.2.3 Evaluating confidence intervals

As with hypothesis testing and point estimators, confidence intervals provide frequentist statisticians with a tool that has known error properties. And as with those other frequentist tools, we can compare and evaluate the quality of confidence intervals. The interval $\mathcal{I}_{\mu,95}(\mathbf{X})$ above takes advantage of the symmetry of the normal distribution. But other, non-symmetric intervals exist. The interval \[ \begin{aligned} \mathcal{I}'_{\mu,95}(\mathbf{X}) = \left(\bar{X}-2.2\frac{\sigma}{\sqrt{n}}, \,\, \bar{X} + 1.8\frac{\sigma}{\sqrt{n}} \right) \end{aligned} \] is also, to a close approximation, a $95\%$ confidence interval, since $P(-1.8 < Z < 2.2) \approx 0.95$. Is it possible to say which of these intervals is “better”? Consider the length, $L$, of each. Interval $\mathcal{I}_{\mu,95}(\mathbf{X})$ has length: \[ \begin{aligned} L\left(\mathcal{I}_{\mu,95}(\mathbf{X})\right) &= \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} - \left(\bar{X}-1.96\frac{\sigma}{\sqrt{n}} \right)\\ &= 1.96\frac{\sigma}{\sqrt{n}} +1.96\frac{\sigma}{\sqrt{n}} \\ &= 3.92\frac{\sigma}{\sqrt{n}}. \end{aligned} \] Compare that to the length of $\mathcal{I}'_{\mu,95}(\mathbf{X})$: \[ \begin{aligned} L\left(\mathcal{I}'_{\mu,95}(\mathbf{X})\right) &= \bar{X} + 1.8\frac{\sigma}{\sqrt{n}} - \left(\bar{X}-2.2\frac{\sigma}{\sqrt{n}} \right)\\ &= 1.8\frac{\sigma}{\sqrt{n}} +2.2\frac{\sigma}{\sqrt{n}} \\ &= 4\frac{\sigma}{\sqrt{n}}. \end{aligned} \] For the same data, both of the intervals have the same confidence level. But $\mathcal{I}'_{\mu,95}(\mathbf{X})$ is a longer interval. Intuitively, longer is bad; it includes more values at the same confidence level. As such, we should prefer shorter intervals, in this case $\mathcal{I}_{\mu,95}(\mathbf{X})$. More generally, in this estimation context, $\mathcal{I}_{\mu,95}(\mathbf{X})$ is the shortest, and thus best, $95\%$ interval of the form $\left(\bar{X} - a\frac{\sigma}{\sqrt{n}}, \,\, \bar{X} + b\frac{\sigma}{\sqrt{n}} \right)$.
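The comparison of the two intervals can be verified numerically, and extended to a search over other asymmetric choices. The sketch below measures lengths in units of the standard error $\sigma/\sqrt{n}$; the helper names and the search grid are illustrative assumptions.

```python
from statistics import NormalDist

Z = NormalDist()

def coverage(a, b):
    """Coverage of (xbar - a*se, xbar + b*se), i.e. P(-b <= Z <= a)."""
    return Z.cdf(a) - Z.cdf(-b)

def b_for(a, c=0.95):
    """Upper multiplier b giving coverage c for a given lower multiplier a."""
    return -Z.inv_cdf(Z.cdf(a) - c)

for a, b in ((1.96, 1.96), (2.2, 1.8)):
    print(f"a={a}, b={b}: coverage={coverage(a, b):.4f}, length={a + b:.2f} standard errors")

# Search lower multipliers a on a grid; each is paired with the b that keeps coverage at 0.95.
grid = [k / 100 for k in range(170, 301)]
best_a = min(grid, key=lambda a: a + b_for(a))
print(f"shortest on grid: a={best_a:.2f}, b={b_for(best_a):.2f}, length={best_a + b_for(best_a):.3f}")
```

On this grid the shortest interval should be the symmetric one with $a \approx b \approx 1.96$, in line with the argument above.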

There is much more to be said about properties of confidence intervals. For more, see Casella & Berger (2024). We now turn to the philosophical aspects of confidence intervals.

4.5.2.4 Confidence intervals: interpretation and context

Some have proposed that confidence intervals be used as an alternative to hypothesis testing (Cumming & Calin-Jageman, 2024; Halsey, 2019). As a response to the replication crisis and the difficulties in interpreting and using hypothesis tests, this proposal is especially prevalent in areas of science where the “null ritual” has taken hold. Cumming & Calin-Jageman (2024) write that, while it is important to understand hypothesis testing and p-values to read scientific literature, confidence interval estimation “almost always provides the most complete and best basis for interpretation and drawing conclusions.” Along these lines, in a statement by the American Statistical Association, Wasserstein & Lazar (2016) write,

Cumming, G., & Calin-Jageman, R. (2024). Introduction to the new statistics: Estimation, open science, and beyond. Routledge.

Halsey, L. G. (2019). The reign of the p-value is over: What alternative analyses could we use to fill the power vacuum? Biology Letters, 15(5), 20190174. https://doi.org/10.1098/rsbl.2019.0174

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. In The American Statistician (No. 2; Vol. 70, pp. 129–133). Taylor & Francis.

In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. These include methods that emphasize estimation over testing, such as confidence [intervals], credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates. All these measures and approaches rely on further assumptions, but they may more directly address the size of an effect (and its associated uncertainty) or whether the hypothesis is correct.

Confidence intervals may provide some protection against the naïve “null ritual” approach to hypothesis testing. Rather than encouraging dichotomous thinking—as a naïve approach to testing might—confidence intervals provide a range of values that may stand in as reasonable effect sizes. However, confidence intervals are not without criticism. The primary criticism of confidence intervals is that it is unclear how we ought to interpret them once we have computed an interval with actual data. At the level of the DGP, we know that a $(1-\alpha)\times 100\%$ confidence interval is a procedure that covers the true parameter with probability $(1-\alpha)\times 100\%$; that is, computing the confidence interval over many repeated samples from the DGP will result in $(1-\alpha)\times 100\%$ of those intervals covering the true parameter. Once we compute a confidence interval with that procedure, we have no way of knowing whether the interval covers the true parameter, and no way of assigning non-degenerate probabilities. The lack of a straightforward interpretation at the actual data level leaves an interpretative vacuum that gets filled with various incorrect interpretations. Hoekstra et al. (2014) conducted an analysis of how often actual confidence intervals are misinterpreted. They asked hundreds of researchers and students to assess the truth value of the following six statements, all concerning a fictitious experimental result that produced a $95\%$ confidence interval $(0.1, 0.4)$ for a mean (no other details, e.g., about the DGP, were specified):

1. The probability that the true mean is greater than $0$ is at least $95\%$.

2. The probability that the true mean equals 0 is smaller than 5%.

3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect.

4. There is a 95% probability that the true mean lies between 0.1 and 0.4.

5. We can be 95% confident that the true mean lies between 0.1 and 0.4.

6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.

Although many students and researchers endorsed these claims (for specific details, see Hoekstra et al. (2014)), it turns out that all six of these statements are false! Statements 1–4 are false because the mean parameter is a fixed (unknown) quantity, and the interval $(0.1, 0.4)$ is fixed. Statement 5 is false for the same reason, if we follow Hoekstra et al. (2014) in interpreting “confidence” to mean “probability”.⁵³ Even though statement 6 refers to repeated sampling, it does not apply repeated sampling in the right way. If we were to repeatedly sample from the experimental population, the end points of the interval would change, which would provide justification for (non-degenerate) probabilistic statements. But statement 6 holds the interval fixed at $(0.1, 0.4)$, which means that there is, again, nothing random.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157–1164.

⁵³ However, if we let “confidence” refer not to probability, but to the reliability of the procedure that produced this interval, then this interpretation seems more reasonable.

The findings in Hoekstra et al. (2014) suggest that confidence intervals are easily misinterpreted. How are we supposed to interpret a confidence interval computed with actual data? There is no probabilistic coverage property that applies to the interval computed with actual data; nor is there a probability distribution to be placed over the values in the computed interval. “Each of the values of the parameter in the confidence interval [are] on a par” (Mayo & Spanos, 2011b). The best that we can do is to give confidence intervals a behavioristic interpretation, similar to Neyman-Pearson tests. “Different sample realizations $x$ lead to different estimates, but one can ensure that $(1-\alpha)\times 100\%$ of the time the true parameter value $\mu$, whatever it may be, will be included in the interval formed” (Mayo & Spanos, 2011b). In the BK example, imagine a one-sided $95\%$ confidence interval $\mathcal{I}_{p,95}(\mathbf{x}) = (0,0.28)$. Under the behavioristic interpretation, we behave as if this interval contains a set of reasonable values to estimate $p$ because it was produced by a process that is not often in error. But there’s nothing probabilistic to be said about the interval.

The behavioristic interpretation of confidence intervals is quite limited. The error statistical philosophy and the severity approach to testing, detailed in Section 4.4.1, help overcome these limitations by giving an evidential interpretation of confidence intervals.⁵⁴ Specifically, a severity account allows researchers to “break open” the interval, in a sense, and claim that there is strong evidence for some claims within the interval, and weaker evidence for others. Consider the case of the DGP $X_1,...,X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, with $\mu$ unknown, $\sigma^2=1$, and hypotheses \[ \begin{aligned} H_0&: \mu = \mu_0 \\ H_1&: \mu > \mu_0. \end{aligned} \] Let the significance level be $\alpha = 0.05$. Inverting this test gives a confidence interval of $\mathcal{I}_{\mu,95,>}(\mathbf{X})= \left(\bar{X} - 1.645\frac{\sigma}{\sqrt{n}}, \,\, \infty \right)$. Let $C: \mu > \bar{x} - 1.645\frac{\sigma}{\sqrt{n}}$, i.e., the claim that the true mean exceeds the lower boundary of the confidence interval. The severity of $C$ is the confidence level: \[ \begin{aligned} SEV(T, \mathbf{x}, C) &= P\left(\bar{X} < \bar{x}; \, \mu = \bar{x} - 1.645\frac{\sigma}{\sqrt{n}} \right) \\ &=P\left(Z < 1.645 \right) \approx 0.95. \end{aligned} \] If the endpoint were the true mean, i.e., if $\mu = \mu_1 = \bar{x} - 1.645\frac{\sigma}{\sqrt{n}}$ in the DGP, then with high probability, a result less discordant from $\mu_1$ than the observed $\bar{x}$ would have occurred. Now, move to the right, further into the confidence interval: for $\epsilon > 0$, consider $C': \mu > \bar{x} - 1.645\frac{\sigma}{\sqrt{n}} + \epsilon$. $C'$ is tested with less severity than $C$. On the severity construal, not all values in the confidence interval are considered equal! “For each value of $\mu$ in the confidence interval, there [is] a different answer to the question: How severely does $\mu > \mu_1$ pass with [actual data] $x$?” (Mayo & Spanos, 2011a).

⁵⁴ “Just as we replace the behavioristic rationale of tests with the inferential one based on severity, we do the same with confidence intervals” (Mayo & Spanos, 2011b).

Mayo, D. G., & Spanos, A. (2011b). Error statistics. In P. S. Bandyopadhyay & M. R. Forster (Eds.), Philosophy of statistics (Vol. 7, pp. 153–198). Elsevier.

4.6 Broad objections to frequentist statistical methods

In previous sections, we’ve discussed objections to frequentist inference methods, including hypothesis testing, point estimation, and confidence interval estimation. In this section, we explore some broader objections to the frequentist statistical inference paradigm.

4.6.0.1 The likelihood principle and stopping rules

The frequentist statistical inference paradigm is committed to controlling long-run error rates. Adherence to this commitment is seen in hypothesis testing, for example, when one sets a rate of type I error (the significance level, $\alpha$), and when one finds the sampling distribution of a test statistic and computes a p-value for the data. If one were to repeatedly collect data from the same DGP, perform the same test, and reject $H_0$ whenever the p-value was less than the pre-specified $\alpha$, one would wrongly reject a true $H_0$ at most $\alpha\times 100\%$ of the time.
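For a discrete test statistic, this guarantee can be verified exactly rather than by simulation. The sketch below assumes the BK setting with a binomial DGP, $n = 15$, and the boundary value $p = 0.1$; the choice $\alpha = 0.05$ is for illustration only.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ binomial(n, p); this is the one-sided p-value for X = k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def type_i_error_rate(n, p0, alpha):
    """Probability, when p = p0, of observing data whose p-value is below alpha."""
    return sum(
        comb(n, k) * p0**k * (1 - p0)**(n - k)
        for k in range(n + 1)
        if binom_upper_tail(k, n, p0) < alpha
    )

size = type_i_error_rate(n=15, p0=0.1, alpha=0.05)
print(f"actual type I error rate: {size:.4f}")
```

Because the binomial distribution is discrete, the actual rate of false rejections falls strictly below the nominal $\alpha$ here, which is consistent with the "at most" in the guarantee.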

However, the commitment to controlling long-run error rates is in tension with other potentially important principles of inference. One such principle is called the likelihood principle. The likelihood principle states that

In an inference about $\theta$, after the data $\mathbf x$ [are] observed, all relevant information from the data is contained in the likelihood function for the observed $\mathbf x$. Furthermore, two likelihood functions contain the same information about $\theta$ if they are proportional to each other (Vidakovic, n.d.).

The standard justification for the likelihood principle states that, if one accepts two other plausible principles (the sufficiency principle and the conditionality principle), then, on pain of contradiction, one ought to accept the likelihood principle.⁵⁵ That is, accepting the likelihood principle can be reduced to accepting two other, perhaps more intuitive, principles. The hope is that the acceptance of these other principles is less controversial. We will briefly consider these two principles, assess their plausibility, and examine their connection to the likelihood principle.

⁵⁵ The original proof that the likelihood principle is entailed by the sufficiency principle and the conditionality principle is given in Birnbaum (1962). More recently, some have argued that the proof given in Birnbaum (1962) is not sound. For example, see Mayo & Spanos (2011a) and Mayo (2014). For a discussion of the criticisms in Mayo (2014), see Evans (2014).

Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57(298), 269–306. https://doi.org/10.1080/01621459.1962.10480660

Mayo, D. G., & Spanos, A. (2011a). An error in the argument from conditionality and sufficiency to the likelihood principle. In Error and inference: Recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science. Cambridge University Press.

Mayo, D. G. (2014). On the Birnbaum argument for the strong likelihood principle. Statistical Science, 29(2). https://doi.org/10.1214/13-sts457

Evans, M. (2014). Discussion of “On the Birnbaum argument for the strong likelihood principle.” Statistical Science, 29(2). https://doi.org/10.1214/14-sts471

⁵⁶ A test statistic, as defined in 1.2, is a specific example of a statistic.

⁵⁷ Typically, an estimator is a function of the random variables $\mathbf{X}$, while an estimate is a function of the realizations of those random variables, $\mathbf{x}$.

First, let’s review some definitions. A statistic is any function of the data arising from a statistical model.⁵⁶ A statistic can either be a function of the random variables that generate the data—and therefore a random quantity—or a function of the realizations of the random variables that generate the data—and therefore a fixed quantity. An estimator is a statistic meant to target or pinpoint a parameter $\theta$.⁵⁷ A statistic $T$ is sufficient as an estimator of a parameter $\theta$ if the distribution of the data $\mathbf{X}$ conditional on $T(\mathbf{X})$ does not depend on the parameter $\theta$.

The sufficiency principle states that if $T(\mathbf{X})$ is a sufficient statistic for $\theta$, then any inference about $\theta$ should depend on the sample $\mathbf{X}$ only through $T(\mathbf{X})$. Note that a sufficient statistic that meaningfully reduces the data does not always exist in a given context (the full sample is always trivially sufficient); but when one does, the sufficiency principle says that it should be the pathway through which data impact the statistical inference. Both frequentists and Bayesians accept the sufficiency principle. In fact, many common frequentist estimators are sufficient statistics.
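The definition of sufficiency can be checked directly in a simple case. As a sketch, assume iid Bernoulli($p$) observations, as in the BK example, and take $T(\mathbf{X}) = \sum_i X_i$. The conditional probability of the observed sequence given its sum works out to $1/{n \choose t}$, whatever the value of $p$; the values of $p$ tried below are arbitrary.

```python
from math import comb

def conditional_prob(x, p):
    """P(X = x | sum(X) = t) for an iid Bernoulli(p) sample x."""
    n, t = len(x), sum(x)
    joint = p**t * (1 - p)**(n - t)                   # P(X = x)
    marginal = comb(n, t) * p**t * (1 - p)**(n - t)   # P(T = t)
    return joint / marginal

x2 = (0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0)
for p in (0.1, 0.3, 0.7):
    print(p, conditional_prob(x2, p))
```

The printed conditional probabilities agree across values of $p$ (up to floating-point rounding), which is what it means for the sum to be sufficient for $p$: once the sum is known, the particular ordering of the data carries no further information about the parameter.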

Informally, the conditionality principle states that “If an experiment concerning the inference about $\theta$ is chosen from a collection of possible experiments, independently of $\theta$, then any experiment not chosen is irrelevant to the inference” (Vidakovic, n.d.). The formal conditionality principle is somewhat more technical, and the interpretation of the technical details matters for the proof in Birnbaum (1962) and the criticism of the proof given by Mayo (2014). We choose to circumvent these issues here, but encourage the interested reader to consult the aforementioned resources.

Vidakovic, B. (n.d.). The likelihood principle. H. Milton Stewart School of Industrial and Systems Engineering. Retrieved from https://www2.isye.gatech.edu/isyebayes/bank/handout2.pdf

Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. Springer.

Cox, D. R. (1958). Some problems connected with statistical inference. The Annals of Mathematical Statistics, 29(2), 357–372. https://doi.org/10.1214/aoms/1177706618

To gain a better understanding of the conditionality principle, consider the following example, from Vidakovic (n.d.), via Berger (1985) and Cox (1958):

Suppose that a substance to be analyzed is to be sent to either one of two labs, one in California or one in New York. Two labs seem equally equipped and qualified and a coin is flipped to decide which one will be chosen. The coin comes up tails, denoting that California lab is to be chosen. After the results are returned back and report is to be written, should the report take into account the fact that coin did not land up heads and that New York laboratory could have been chosen?

Intuition suggests that the report should not take into account the coin flip and the counterfactual information. Likewise, we might extend this reasoning further: only the experiment and data at hand should matter to the statistical inference; experiments and data that were not chosen are irrelevant. However, because the frequentist approach requires an averaging over all possible data, and not just the data observed, some claim that frequentist inference violates the conditionality principle. Further, since the sufficiency principle and the conditionality principle together imply the likelihood principle, frequentist inference is said to violate the likelihood principle.

The BK example can help illustrate how frequentist hypothesis testing does not adhere to the likelihood principle. Recall dataset $\mathbf{x}_2$ from 1.2.1: \[ \mathbf{x}_2= (0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0). \] Suppose the group that collected these data could not remember which of the following two stopping rules they used:

Stopping rule #1: Stop when the sample size reaches $n = 15$.
Stopping rule #2: Stop when the sample includes $r=10$ individuals who have not read BK.

Both stopping rules are consistent with the data, and so, we cannot know which was used just by observing $\mathbf{x}_2$. As discussed in previous analyses of this example, stopping rule #1 describes a binomial DGP. Recall that, for an $X \sim \text{binomial}(n,p)$, the mass function is \[ f(x\, ; \, p) = {n \choose x}p^x(1-p)^{n-x}. \]

Stopping rule #2 describes a negative binomial DGP. Because sampling stops at the $r$th individual who has not read BK, let $X$ count the number of individuals who have read BK sampled before that point, and let $p$ be the probability that a sampled individual has read BK, as in the binomial case. Then the probability mass function of $X$ is \[ \begin{aligned} f(x\, ;\, r, p) = {x + r -1 \choose x}p^x(1-p)^r. \end{aligned} \]

For $\mathbf{x}_2$, $x = 5$ individuals have read BK and $r = n-x = 10$ have not, so both mass functions contain the factor $p^x(1-p)^{n-x}$. Thus, these functions are proportional to each other with respect to $p$; they differ only in the binomial coefficients, which are free of the parameter $p$.
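The proportionality of the two likelihoods can be confirmed numerically. In this sketch, $x = 5$ counts the individuals in $\mathbf{x}_2$ who have read BK and $r = 10$ counts those who have not, which is the point at which sampling stops under stopping rule #2; this parameterization is an assumption chosen to match the stopping rules as stated, and the values of $p$ tried are arbitrary.

```python
from math import comb

def binomial_lik(p, x=5, n=15):
    """Likelihood of p under stopping rule #1: x readers in a fixed sample of n."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def neg_binomial_lik(p, x=5, r=10):
    """Likelihood of p under stopping rule #2: x readers before the r-th non-reader."""
    return comb(x + r - 1, x) * p**x * (1 - p)**r

for p in (0.05, 0.1, 0.3, 0.6):
    print(p, binomial_lik(p) / neg_binomial_lik(p))
```

The ratio is the same at every value of $p$: it is just the ratio of the two binomial coefficients. By the likelihood principle, then, the two models carry the same information about $p$.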

According to the likelihood principle, it should not matter whether we use Stopping rule #1 or Stopping rule #2 in our inference about $p$. But, when testing \[ \begin{aligned} H_0&: p \le 0.1 \\ H_1&: p > 0.1, \end{aligned} \] at the $\alpha = 0.01$ level, the stopping rule does matter! The p-value for stopping rule #1 (binomial DGP) is $p_1 \approx 0.013 > \alpha$. The p-value for stopping rule #2 (negative binomial DGP) is $p_{2} \approx 0.009 < \alpha$.
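The two p-values can be reproduced with a few lines of code. This is a minimal sketch, assuming the test statistic is the number of individuals who have read BK ($x = 5$ in $\mathbf{x}_2$) and that the p-value is the probability, at the boundary value $p = 0.1$, of a count at least that large under each stopping rule.

```python
from math import comb

def binomial_p_value(x, n, p0):
    """P(X >= x) when X ~ binomial(n, p0): at least x readers among n sampled."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x, n + 1))

def neg_binomial_p_value(x, r, p0):
    """P(X >= x) when X counts readers sampled before the r-th non-reader."""
    below = sum(comb(k + r - 1, k) * p0**k * (1 - p0)**r for k in range(x))
    return 1 - below

p1 = binomial_p_value(x=5, n=15, p0=0.1)
p2 = neg_binomial_p_value(x=5, r=10, p0=0.1)
print(f"stopping rule #1 (binomial):          {p1:.4f}")
print(f"stopping rule #2 (negative binomial): {p2:.4f}")
```

The data and the hypotheses are identical in both calls. Only the sample space implied by the stopping rule differs, and that difference is enough to move the p-value across the $\alpha = 0.01$ threshold.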

Together, these analyses violate the likelihood principle. In an inference about $p$, after $\mathbf{x}_2$ was observed, the likelihood function did not contain all of the relevant information. Some information relevant to the inference is contained in the stopping rule. Specifically, the first stopping rule—“stop when $n = 15$”—results in failing to reject $H_0$, and the second stopping rule—“stop when $r = 10$”—results in a rejection of $H_0$.

Of course, stopping rules do not always influence the inference. There are many possible datasets where a frequentist hypothesis test will yield the same inference, independent of stopping rules. But the fact that stopping rules can sometimes influence a frequentist inference is thought by some to be an argument against the frequentist inference paradigm. Why should the way an experiment is stopped change an inference? Often, an experiment or study includes a number of statistical units that was not fixed and pre-specified, but instead, chosen by convenience. For example, suppose the National Science Foundation decided to end a study earlier than originally anticipated, for reasons related to their fiscal year budget. Should the (in part, stochastic) political and social phenomena that led to the fiscal year budget influence the inference made in the study? Howson & Urbach (2005) write about an even more extreme hypothetical example. Imagine two scientists collaborating in a study; each has a different stopping rule in mind, but neither communicates their rule to the other. By chance, the results of the study are consistent with both stopping rules, and so no conflict arises. Howson & Urbach (2005) ask:

What then are the outcome space and sampling distributions for the trial? To know these you would need to discover how each of the scientists would have reacted in the event of a disagreement. Would they have conceded or insisted, and if they had put up a fight, which of them would have prevailed? We suggest that such information about experimenters’ subjective intentions, their physical strengths and their personal qualities has no inductive relevance whatever in this context, and that in practice it is never sought or even contemplated. The fact that significance tests and, indeed, all [frequentist] inference models require it is a decisive objection to the whole approach.

How might a frequentist defend their methods against this charge? First, it is worth noting that this implementation of hypothesis testing is naïve. It is more aligned with the “null ritual” criticized above than with the robust version of hypothesis testing found in the error statistical philosophy. Specifically, the example uses dichotomous decision-making without attention to effect sizes and severity testing. It has not been implemented in accordance with the error statistical philosophy, and thus, is not a criticism of the strongest version of hypothesis testing.

With that said, even the strongest versions of hypothesis testing still violate the likelihood principle. Frequentists do not deny that fact, but instead seek to show that the likelihood principle (and the conditionality principle, in particular) is not essential for sound inferences. As we saw in 1.5.1, data that were not collected but could have been collected (i.e., counterfactuals) are often quite relevant to inference. Further, to adhere to the likelihood principle, as a Bayesian statistician might, means, at least in some contexts, giving up a commitment to methods with acceptable error rates. For example, consider a financial portfolio manager, Gary Stearns, who advertises a very high success rate for picking profitable stocks.⁵⁸ In particular, suppose Stearns reports that all of the stocks that were included in his portfolio increased in value. Does information about Stearns’ subjective intentions and selection method matter to the inference about $H$: Stearns is skillful in building stock portfolios? Yes! Imagine that Stearns randomly invested in many, many more stocks than were ultimately included in his portfolio, and subsequently only included the original investments for which the stock increased in value. That certainly is relevant information!

⁵⁸ This example is a modified version of the “Pickrite” example in Mayo (2018).

Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press.

4.6.1 Incoherence

Many statisticians and philosophers (e.g., Clayton, 2021; Howson & Urbach, 2005; Lindley, 2000) have argued that frequentist inference suffers from incoherence. Dennis Lindley, in Lindley (2000), provides a strong version of this argument. Lindley’s argument starts with the view that statistics is the science of uncertainty, and as such, must be concerned with reasoning through and combining uncertainties. “The philosophical position adopted here is that statistics is essentially the study of uncertainty” (Lindley, 2000). As a science, the statistical approach to uncertainty must include quantification of uncertainty. “A scientific approach would mean the measurement of uncertainty; for, to follow Kelvin, it is only by associating numbers with any scientific concept that the concept can be properly understood” (Lindley, 2000). Quantification of uncertainty, according to Lindley, provides a means for combining uncertainties. For example, if we know the uncertainty in two events $A$ and $B$, quantification ought to allow us to derive the uncertainty that both $A$ and $B$ will happen. Lindley then describes several approaches to quantifying our pre-scientific intuitions about uncertainty (some of which we discussed in [sec:subjective] and [sec:logic]), all of which, he claims, necessarily lead to probability theory. That is, any attempt to reason about uncertainty ought to adhere to the laws of probability theory; otherwise, the reasoner is incoherent, and will “ultimately violate some of the basic assumptions [about uncertainty] that were intended to be self-evident and [would] cause embarrassment if violated” (Lindley, 2000). Finally, Lindley argues that the most common probabilities used in frequentist inference, p-values and confidence levels, do not follow the laws of probability theory. Therefore, frequentist inference is incoherent.

Lindley, D. V. (2000). The philosophy of statistics. Journal of the Royal Statistical Society Series D: The Statistician, 49(3), 293–337.

The view that statistics is primarily about uncertainty, and that uncertainty is formalized solely through probability theory, should sound familiar; this view is a kind of probabilism, introduced in Chapter 2 and further developed in Chapter 3, and sections 1.1 and 1.2. Broadly, we saw that probabilism includes the view that probability theory provides a comprehensive framework for reasoning about uncertainty, and can, at least in theory, be used for any statistical problem. Probabilism achieves this goal by assigning probabilities to all uncertain quantities, including data and parameters. Bayesian statistical inference, discussed in Chapter 5, provides the foundational statistical toolkit for this philosophical view.

As discussed above, frequentists reject probabilism. They do so on philosophical grounds: probability only applies to (in theory) repeatable events. Parameters are not repeatable events, but instead, fixed, descriptive features of a theory. Instead, the frequentists and error statisticians adopt a falsificationist and probative view of inference. The goal of inference, in this view, is to put forward and stringently test hypotheses and theories. Successful inference is conducted through repeated, contextualized well-posed tests that probe the truth of a hypothesis; the more abstract notion of overarching theoretical coherence is not a primary concern. Mayo (2018) writes,

In the [error statistician’s] view, statistics is collecting, modeling, and using data to make inferences about aspects of what produced them. Inferences, being error prone, are qualified by reports of the error probing capacities of the inferring method...It splits problems off piecemeal; there’s no need for an exhaustive list of hypotheses that could explain data. Being able to directly pick up on gambits like cherry picking and optional stopping is essential for an account to be up to the epistemological task of determining if claims are poorly tested. While for Lindley this leads to incoherence ([and] violations of the likelihood principle), for us it is the key to assessing if your tool is capable of deceptions.

Thus, the view of the error statistician is: Lindley is right that frequentist inference is incoherent, in the sense that its p-values and confidence levels do not adhere to the laws of probability. But this kind of coherence is not the goal of statistical inference. Statistics is not about high-level, theoretical coherence, in part, because one can be coherent and very wrong. Remember, error statistics is committed to a notion of objectivity, to obtaining theories that accurately describe the world. A coherent view can easily get it wrong.⁵⁹

⁵⁹ A classic historical example of a coherent but wrong theory is the Ptolemaic model of planetary motion. This model gives a mathematically consistent description of the motion of planets, positing that planets travel along “epicycles”. This model could predict the positions of planets. But it was ultimately wrong, in the sense that planets do not actually travel along epicycles.

Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of the Royal Statistical Society, Series A (Statistics in Society), 180(4), 967–1033. https://doi.org/10.1111/rssa.12276

Frequentists are not the only ones that deemphasize theoretical coherence. Even some Bayesian statisticians see coherence as a desirable virtue of the Bayesian paradigm that is of limited importance in practice, because of inevitable tradeoffs with other values. For example, Andrew Gelman sees coherence as just one among many values to be considered when doing applied statistical inference. In some contexts, values such as calibration and context dependence may trump coherence. “Badly calibrated Bayesians could do well to adjust their future priors if this is needed to improve calibration, even at the cost of violating coherence” (Gelman & Hennig, 2017). We will discuss prior distributions in detail in Chapter 5. If theoretical coherence is of relegated importance in practice—if alternative paradigms like Bayesian inference often fail to adhere to it in practice—it seems to be a weak criticism of frequentist inference.

4.7 Ethics and frequentist statistics

Frequentist inference—as developed by Neyman, Pearson, Mayo, and Spanos—is founded on the notion that the quality of an inference is tied to its performance. Performance is understood in terms of the error rates in the long run of experience. That is, methods and tools that have low error rates are to be preferred to methods that have higher (or understudied) error rates. Unfortunately, there are tradeoffs when considering error rates. As we saw in 1.2.1, for a given test, type I (“false positive”) and type II (“false negative”) error rates are inversely related: as one decreases, the other increases. These statistical errors are often tied to various real world errors that have social and ethical consequences.

One example of the ethical tradeoffs involved with error properties comes from prenatal blood tests that screen for rare but serious fetal abnormalities. Screening tests, and separately, diagnostic tests, play an important role in medicine; if we can screen for disorders early, further diagnostic testing may confirm the presence of a disorder. As an imperfect analogy, screening tests are likened to metal detectors at an airport that may screen for weapons, and diagnostic tests are likened to further searches that can more accurately confirm their presence or absence (Matloff, 2022). One might argue that, with early screening, more treatment options will be available for the mother and child.

Matloff, E. (2022). What the NYTimes got wrong on prenatal screening. https://www.forbes.com/sites/ellenmatloff/2022/01/06/what-the-nytimes-got-wrong-on-prenatal-screening/?sh=4e6e97e837a7

But screening tests are imperfect tools, prone to both false positive and false negative errors. These errors are tied to real consequences for women taking these tests. A positive screening test result alerts a pregnant woman and her healthcare provider to a possible fetal abnormality, some of which can be devastating. A false positive screening test result incorrectly alerts them to this outcome. Further, that incorrect result would not be known to them until some later, more invasive diagnostic test. A high rate of false positive tests would mean that many women are working through the worry and further invasive testing for no good reason. For example, after a normal ultrasound but a positive screening test, Yael Geller was told that “her fetus might be missing part of a chromosome, which could lead to serious ailments and mental illness” (Kliff & Bhatia, 2022). A follow-up diagnostic test “involving a long, painful needle to retrieve a small part of her placenta” overturned the positive screening test, and, as of January 2022, her then six-month-old baby boy showed no signs of the condition that he screened positive for.

Of course, false negatives can have negative consequences too. Believing that your child has no abnormalities, when, in fact, they do, may also cause hardship and trauma. Screening tests themselves were developed in part to minimize this kind of hardship: knowing early about abnormalities can empower women to make decisions that are best for them, their family, and future children.

So, what should researchers, statisticians, and other stakeholders do with this information, that frequentist inference methods rely on error rates, and that there are ethical consequences of errors? First, they should be honest, transparent, and modest about the limitations of their methods, and how those limitations might lead to various positive and negative outcomes. Unfortunately, according to Kliff & Bhatia (2022), the companies offering screening tests were less than honest and transparent about the error rates of their products. These companies advertised their tests as “reliable” and “highly accurate”, and claimed that patients can have “total confidence” in the results. These advertisements do not cohere with the fact that, for many of the rare conditions being tested for, positive results are wrong more often than they are right (upwards of $85\%$ of the time). For example, one of the industry leaders, Natera, reports that the false positive rate for their test of DiGeorge syndrome (22q11.2 microdeletion) is 0.05%. That is, given that DiGeorge syndrome is not present, the test says that it is present in 1 in every 2000 tests (Natera, Inc., 2022).⁶⁰

Kliff, S., & Bhatia, A. (2022). When they warn of rare disorders, these prenatal tests are usually wrong. https://www.nytimes.com/2022/01/01/upshot/pregnancy-birth-genetic-testing.html

⁶⁰ False positive rates are found via validation studies that run screening tests on pregnant women known to be free of the abnormality in question. As such, this probability is a sample statistic, used to estimate the population level relative frequency.

⁶¹ The PPV can be, and often is, interpreted as a Bayesian posterior quantity. After all, it is computed from Bayes’ theorem. Recall though, that frequentist inference is entirely consistent with many uses of Bayes’ theorem; the latter is simply a theorem, i.e., a necessarily true result following from the axioms of probability theory. Whether a quantity or method is “Bayesian” and not frequentist depends on what we are computing the posterior quantity of. In this case, we can understand the PPV as perfectly consistent with frequentist inference, in the same way that in the Jonah case, $P(A \, | \, \epsilon4)$ was perfectly consistent with frequentist inference. The PPV can be understood as the probability that a fetus like this one has the abnormality tested for, given a positive test, where “like this one” is cashed out in terms of resampling from a relevant population.

An enticing but wrong interpretation of this false positive rate would suggest that there is a $100\% -0.05\% = 99.95\%$ chance that the test is correctly identifying DiGeorge syndrome. However, as astute statisticians, we know that this is incorrect. One plausible way to assess risk would be to combine the false positive rate with information about how rare DiGeorge syndrome is in the population at large. Together, using Bayes’ theorem, that information can yield the positive predictive value (PPV): the probability that a fetus has the abnormality tested for, given a positive test.⁶¹ Thankfully, Natera documentation is honest about this fact. They write that

This rate of false positive tests means that a positive result for 22q11.2 deletion syndrome has a positive predictive value (PPV) of 53%. In other words, if a pregnancy receives a positive or “high risk” result, there is a 53% chance that the baby actually has the syndrome. Additional diagnostic testing is necessary to confirm if the condition is present (Natera, Inc., 2022).
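The gap between a very low false positive rate and a modest PPV follows directly from Bayes’ theorem. The sketch below uses the $0.05\%$ false positive rate reported above; the sensitivity and the prevalence values are hypothetical placeholders, not figures taken from Natera’s documentation, so the output is illustrative and is not meant to reproduce the $53\%$ figure.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """P(condition present | positive test), computed via Bayes' theorem."""
    true_pos = sensitivity * prevalence                  # P(positive and condition)
    false_pos = false_positive_rate * (1 - prevalence)   # P(positive and no condition)
    return true_pos / (true_pos + false_pos)

fpr = 0.0005          # 0.05% false positive rate, as reported in the text
sensitivity = 0.90    # hypothetical value, not from the source

for prevalence in (1 / 1000, 1 / 2000, 1 / 4000):  # hypothetical prevalences
    ppv = positive_predictive_value(prevalence, sensitivity, fpr)
    print(f"prevalence 1 in {round(1 / prevalence)}: PPV = {ppv:.2f}")
```

Under these assumed inputs, the PPV falls as the condition becomes rarer, even though the false positive rate never changes. This is why reporting the false positive rate alone can mislead.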

Natera, Inc. (2022). 22q11.2 deletion syndrome: The most common microdeletion syndrome. https://www.natera.com/resource-library/panorama/22q11-2-deletion-syndrome-the-most-common-microdeletion-syndrome/.

Simply reporting the false (or true) positive rate, and leading the less-than-statistically-astute reader to believe in a very high chance of abnormality, would not be transparent, honest, or modest. Thus, in my view, it would be unethical.

A second laudable goal for researchers, statisticians, and other stakeholders related to error rates is to construct tests and methods that are as good as they can be, i.e., that have the lowest error rates possible. This can be difficult. Recall that a reliable way to reduce both type I and type II errors is to increase the sample size. Consequently, lower error rates often mean more time, resources, money, etc.
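The cost of lower error rates can be made concrete with a small calculation. As a sketch, assume a one-sided z-test of $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$ with known $\sigma$; the true shift `effect` and the sample sizes are hypothetical. The type II error rate at that shift is $\Phi\left(z_{1-\alpha} - \text{effect}\cdot\sqrt{n}/\sigma\right)$.

```python
from statistics import NormalDist

def type_ii_error(alpha, n, effect=0.5, sigma=1.0):
    """Type II error rate of a one-sided z-test when the true mean exceeds mu_0 by `effect`."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)   # rejection threshold on the z scale
    return z.cdf(critical - effect * n ** 0.5 / sigma)

print(" n   beta(alpha=0.05)   beta(alpha=0.01)")
for n in (10, 20, 40, 80):
    b05 = type_ii_error(alpha=0.05, n=n)
    b01 = type_ii_error(alpha=0.01, n=n)
    print(f"{n:>2}   {b05:.3f}              {b01:.3f}")
```

Reading across a row shows the tradeoff at a fixed sample size: tightening $\alpha$ raises the type II error rate. Reading down a column shows that a larger sample lowers the type II error rate at the same $\alpha$, which is where the additional time and resources go.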