3 A source of the divide: interpretations of probability theory
It is unanimously agreed that statistics depends somehow on probability. But as to what probability is and how it is connected with statistics, there has seldom been such a complete disagreement and breakdown of communication since the Tower of Babel.
— Leonard J. Savage, The Foundations of Statistics
In the last chapter, we analyzed two attempts at solving the problem of induction. The first was a hypothesis test. Such hypothesis tests—along with confidence intervals, maximum likelihood estimation, and many other statistical modeling methods—are often labeled as “frequentist methods”, for reasons that will become clear later in this book. The second attempt at solving the problem of induction that we analyzed was a “Bayesian method”, named for its use of Bayes’ theorem. Frequentist and Bayesian methods are not just proposed solutions to the esoteric problem of induction; almost all inference and modeling methods used in contemporary statistics and data science fall into one of these paradigms.1
1 One might object that many data science tools fall within neither the frequentist nor the Bayesian paradigm; for example, gradient descent, a common tool in data science, does not have a statistical interpretation within either paradigm. However, gradient descent (along with other optimization tools) is not, in and of itself, an inference procedure. Instead, it is an optimization tool used to find a single value (a point estimate) of an unknown parameter (or vector of parameters). An inference procedure would go further, not simply estimating an unknown parameter but also providing a means of evaluating the estimator and quantifying uncertainty. Of course, when gradient descent (or ascent) is used to find minimizers (or maximizers) of statistical functions like likelihoods or posterior distributions, it becomes infused with a statistical interpretation. Estimator evaluation and uncertainty quantification are almost always done in either a frequentist or a Bayesian context. Admittedly, other inference paradigms exist, including the likelihoodist paradigm and fiducial inference, but the tools and interpretations found in these paradigms are rarely used.
2 Different values of the parameter \(\theta\) correspond to different research hypotheses.
Frequentist and Bayesian methods have some similarities. Common to both is the assumption that observed data, \(\mathbf{x} = (x_1,...,x_n)\), are realizations of a probabilistic process \(\mathbf{X} = (X_1,...,X_n)\). Recall that \(f(\mathbf{x}\, ; \, \boldsymbol\theta)\) describes the probability distribution of \(\mathbf{X}\), for a fixed value of a parameter, \(\boldsymbol\theta \in \boldsymbol\Theta\).2 As mentioned in the previous chapter, this data model assumption is often justified on the basis of repeated sampling, i.e., if we observed the same phenomenon (e.g., experiment, physical process) again under sufficiently similar conditions, we would have observed different values \(\mathbf{x} = (x_1,...,x_n)\).
From there, frequentist and Bayesian methods diverge in their use of probability theory. Frequentist methods use probability only to describe the data generating mechanism, through the data model \(f(\mathbf{x}\, ; \, \boldsymbol\theta)\). Any probabilistic statements used in a frequentist method are downstream of that data model. In frequentist hypothesis testing, we use \(f(\mathbf{x}\, ; \, \boldsymbol\theta)\) to derive the distribution of a summary of the data, called a test statistic. That distribution is then used to make a decision about the hypotheses in question. Importantly, probabilities about data are not used to assign probabilities to the hypotheses in question.
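As a minimal sketch of this workflow, suppose (purely for illustration, not as the example from the previous chapter) that the data are \(n\) independent coin flips with unknown heads probability \(\theta\), and that the test statistic is the number of heads. The data model then implies a binomial distribution for the test statistic under the null value \(\theta_0\). The function names, the hypothetical data, and the significance level below are assumptions made for this sketch.

```python
from math import comb


def binomial_pmf(k: int, n: int, theta: float) -> float:
    """Distribution of T = number of heads in n independent flips,
    implied by the (assumed) Bernoulli data model with parameter theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def upper_tail_p_value(t_obs: int, n: int, theta0: float) -> float:
    """P(T >= t_obs), computed under the null value theta0.
    This is a probability about data, not about the hypothesis."""
    return sum(binomial_pmf(k, n, theta0) for k in range(t_obs, n + 1))


# Hypothetical data: 15 heads in 20 flips; null hypothesis theta = 0.5.
n, t_obs, theta0, alpha = 20, 15, 0.5, 0.05
p_value = upper_tail_p_value(t_obs, n, theta0)

# The decision rule compares the p-value with a pre-chosen level alpha.
reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0: {reject_null}")
```

Note that nothing in the sketch assigns a probability to the hypothesis \(\theta = 0.5\) itself; the only probabilities computed are probabilities of data, evaluated at the null value.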
The Bayesian paradigm takes a similar approach to modeling data as realizations of a probabilistic process; but Bayesians also use probability in another, very different way: to quantify uncertainty in hypotheses about statistical parameters. This quantification of uncertainty in hypotheses happens through the prior and posterior distributions. Bayesian methods assign probabilities directly to the hypotheses in question. In our example of Bayesian inference in Chapter 2, the prior distribution quantified our degree of belief in a parameter before observing data. Similarly, the posterior distribution quantified our degree of belief in a parameter after observing data.
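To make the contrast concrete, here is a rough sketch of a Bayesian update for a hypothetical coin-flip problem (this is an illustration, not the Chapter 2 example). It approximates the prior and posterior on a grid of candidate values of \(\theta\), which is a simplification chosen so that the code needs only the standard library. The uniform prior, the grid resolution, the data, and the hypothesis \(\theta > 0.5\) are all assumptions of the sketch.

```python
from math import comb


def grid_posterior(k, n, grid, prior):
    """Bayes' theorem on a grid: posterior is proportional to
    likelihood times prior, normalized to sum to one."""
    likelihood = [comb(n, k) * t**k * (1 - t) ** (n - k) for t in grid]
    unnormalized = [lik * p for lik, p in zip(likelihood, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Candidate parameter values 0.00, 0.01, ..., 1.00 (an assumed grid).
grid = [i / 100 for i in range(101)]

# Assumed prior: equal degree of belief in every candidate value.
prior = [1 / len(grid)] * len(grid)

# Hypothetical data: 15 heads in 20 flips.
posterior = grid_posterior(15, 20, grid, prior)

# Probability assigned directly to the hypothesis theta > 0.5,
# before and after observing the data.
prior_prob = sum(p for t, p in zip(grid, prior) if t > 0.5)
posterior_prob = sum(p for t, p in zip(grid, posterior) if t > 0.5)
print(f"P(theta > 0.5) before data: {prior_prob:.3f}")
print(f"P(theta > 0.5) after data:  {posterior_prob:.3f}")
```

Unlike the frequentist calculation, the quantities printed here are probabilities of a hypothesis about the parameter, which is exactly the use of probability that needs philosophical justification.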
What justifies this Bayesian use of probability? Can probability really model an individual’s subjective belief? Isn’t probability theory a mathematical theory, and thus, in some sense, objective? Many popular textbooks define probabilities, not in terms of subjective degrees of belief, but in terms of relative frequencies (for example, see Ugarte et al. (2016)). Is that the best way to interpret probabilities?
The interpretation of probability sits at the core of debates about frequentist and Bayesian inference. But how we interpret probabilities matters in other ways, too. Statisticians and non-statisticians alike use probability statements all the time. Consider the following claims:
The probability that a fair coin will land on heads is \(0.5\).
The likelihood of being dealt a flush in 5-card poker is approximately \(0.2\%\).
The odds are \(1.1:1\) that the Colorado Avalanche will beat the New York Rangers.
For a particular woman about to undergo IVF treatment, the probability that she will give birth to a healthy child is \(1/3\).
There’s a \(50\%\) chance of rain in Boulder, CO tomorrow.
How are we to interpret the words ‘probability’, ‘likelihood’, ‘odds’, and ‘chance’ in the claims above? What does it even mean to interpret probability? In this chapter, we will consider some answers to these questions; our goal will be to gain a deeper understanding of probability theory, both in its use in statistical inference and in its use in our everyday lives.
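Of the claims listed above, the poker claim is one whose numerical value can be checked by counting alone, whatever interpretation of probability one favors. The sketch below assumes a standard 52-card deck, treats every 5-card hand as equally likely, and follows the common convention of excluding straight flushes from the count of flushes; the variable names are illustrative.

```python
from math import comb

# Total number of distinct 5-card hands from a 52-card deck.
total_hands = comb(52, 5)

# Hands with all five cards in one suit: choose the suit, then 5 of its 13 ranks.
same_suit_hands = 4 * comb(13, 5)

# Straight flushes (including royal flushes): 10 rank sequences in each of 4 suits.
straight_flushes = 4 * 10

# Flushes under the convention that straight flushes are counted separately.
flush_hands = same_suit_hands - straight_flushes
flush_probability = flush_hands / total_hands

print(f"{flush_hands} of {total_hands} hands, about {100 * flush_probability:.2f}%")
```

The result is close to \(0.2\%\) with or without the straight flushes, so the claim holds under either counting convention. What the number means, though, is precisely the question this chapter takes up.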