A healthy dose of foundational crisis

Posted in General by Rémi Bardenet on 26 March 2018

I finally took the time to read about axiomatic foundations of Bayesian statistics. I like axioms, I like Bayesian stats, so this was definitely going to be a pleasant opportunity to read some books, comfortably seated on the sofa I just added to my office. Moreover, my team in Lille includes supporters of Dempster-Shafer belief functions, another framework for uncertainty modelling and decision-making, so being precise on my own axioms was the best way to discuss more constructively with my colleagues.

Long story short, I took the red pill: there is a significant gap between axiomatic constructions of the Bayesian paradigm and current Bayesian practice. None of this is new, but I had never been told. It’s not keeping me awake at night, but it’s bothering my office mates, who cannot stop me from blabbering about it over coffee. The good side is that smart people have thought about this in the past. Also, reading about this helped me understand some of the philosophical nuances between the thought processes of different Bayesians, say de Finettians vs. Jeffreysians. I will not attempt a survey in a blog post, nor do I feel knowledgeable enough for this, but I thought I could spare my office mates today and annoy Statisfaction’s readers for once.

Take Savage’s axioms, for instance. I’ve always heard that they were the current justification behind the saying “being Bayesian is being a coherent decision-maker”. To be precise, let \Theta be the set of states of the world, that is, everything useful to make your decision. To fix ideas, in a statistical experiment, your decision might be a “credible” interval on some real parameter, so \Theta should at least be the product of \mathbb{R} times whatever space your data live in. Now an action a is defined to be a map from \Theta to some set of outcomes \mathcal{Z}. For the interval problem, an action a_I corresponds to the choice of a particular interval I and the outcomes \mathcal{Z} should contain whatever you need to assess the performance of your action, say, the indicator of the parameter actually belonging to your interval I, and the length of I. Outcomes are judged by utility, that is, we consider functions u:\mathcal{Z}\rightarrow\mathbb{R}_+ that map outcomes to nonnegative rewards. In our example, this could be a weighted sum of the indicator and the interval length. The weights translate your preference for an interval that actually captures the value of the parameter of interest over a short interval. Now, the axioms give the equivalence between the two following bullets:

  • (being Bayesian) There is a unique pair (u,\pi), made of a utility function and a finitely additive probability measure \pi defined on all subsets of the set \Theta of states of the world, such that you choose your actions by maximizing an expected utility criterion:

\displaystyle a^\star \in \arg\max_a \mathbb{E}_\pi u(a(\theta)),

  • (being coherent) You rank actions according to a preference relation that satisfies a few abstract properties that make intuitive sense for most applications, such as transitivity: if you prefer a_I to a_J and a_J to a_G, then you prefer a_I to a_G. Add to this a few structural axioms that impose constraints on \Theta and \mathcal{Z}.
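To make the first bullet concrete, here is a minimal Python sketch of expected utility maximisation for the interval problem. Everything in it is an illustrative assumption rather than part of Savage's construction: the finite grid `theta_grid` stands in for the parameter part of \Theta (Savage's axioms actually rule out finite state spaces, as discussed further down), `pi` is a discretised standard normal, and the utility rewards coverage plus a weighted bonus for short intervals, shifted by a hypothetical `max_length` so that it stays nonnegative.

```python
import itertools
import numpy as np

# Hypothetical finite grid standing in for the parameter part of Theta.
theta_grid = np.linspace(-3.0, 3.0, 61)

# Hypothetical belief pi: a discretised standard normal, normalised to sum to 1.
pi = np.exp(-0.5 * theta_grid**2)
pi /= pi.sum()


def utility(covered, length, weight=0.1, max_length=6.0):
    """Weighted sum of the coverage indicator and a bonus for short intervals.

    The bonus is weight * (max_length - length), so the utility is
    nonnegative as long as length <= max_length.
    """
    return covered + weight * (max_length - length)


def expected_utility(lo, hi):
    """E_pi[u(a_I(theta))] for the action 'report the interval I = [lo, hi]'."""
    covered = (theta_grid >= lo) & (theta_grid <= hi)
    return float(np.sum(pi * utility(covered, hi - lo)))


def best_interval():
    """Brute-force argmax over all intervals with endpoints on the grid."""
    candidates = itertools.combinations(theta_grid, 2)  # pairs with lo < hi
    return max(candidates, key=lambda interval: expected_utility(*interval))


if __name__ == "__main__":
    lo, hi = best_interval()
    print(f"chosen interval: [{lo:.2f}, {hi:.2f}]")
    print(f"expected utility: {expected_utility(lo, hi):.3f}")
```

Raising `weight` expresses a stronger preference for short intervals, lowering it a stronger preference for coverage, which is the trade-off the weights are meant to encode.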

Furthermore, there is a natural notion of conditional preference among actions that follows from Savage’s axioms. Taken together, these axioms give an operational definition of our “beliefs” that seems to match Bayesian practice. In particular, 1) our beliefs take the form of a probability measure (which depends on our utility), 2) we should update these beliefs by conditioning probabilities, and 3) make decisions using expected utility with respect to our belief. This is undeniably beautiful. Not only does Savage avoid shaky arguments or interpretations by using your propensity to act to define your beliefs, but he also avoids using “extraneous probabilities”. By the latter I mean any axiom that artificially brings mathematical probability structures into the picture, such as “there exists an ideal Bernoulli coin”.
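Points 2) and 3) can be sketched on a toy example where a state of the world is a pair (parameter, data), so that conditioning on a data event yields posterior expected utility. This is only an illustration under assumptions of my own choosing: a finite state space, three hypothetical coin biases in `thetas`, five tosses, a uniform prior, and a 0/1 utility for guessing the bias.

```python
from math import comb

import numpy as np

thetas = np.array([0.2, 0.5, 0.8])   # hypothetical coin biases
xs = np.arange(0, 6)                 # number of heads in 5 tosses
prior = np.full(len(thetas), 1.0 / len(thetas))

# Binomial likelihood p(x | theta), one row per theta.
lik = np.array([[comb(5, x) * t**x * (1 - t) ** (5 - x) for x in xs]
                for t in thetas])

# Belief pi on the product state space: joint[i, j] = pi(theta_i, x_j).
joint = prior[:, None] * lik


def condition(joint, data_event):
    """Restrict pi to the event {x in data_event} and renormalise."""
    mask = np.isin(xs, list(data_event))
    restricted = joint * mask[None, :]
    return restricted / restricted.sum()


def conditional_expected_utility(joint, data_event, u):
    """Expected utility of an action, given as an array u over states."""
    return float(np.sum(condition(joint, data_event) * u))


def guess_utility(guess):
    """0/1 utility over states for the action 'announce that theta = guess'."""
    hit = (thetas == guess).astype(float)
    return hit[:, None] * np.ones(len(xs))[None, :]


def best_guess(data_event):
    """Action maximising expected utility conditionally on the data event."""
    scores = [conditional_expected_utility(joint, data_event, guess_utility(t))
              for t in thetas]
    return float(thetas[int(np.argmax(scores))])


if __name__ == "__main__":
    print(best_guess({4}))     # data: exactly 4 heads
    print(best_guess({0, 1}))  # data: at most 1 head
```

With a 0/1 utility, the conditional expected utility of a guess is just its posterior probability, so `best_guess` returns the posterior mode.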

But the devil is in the details. For instance, some of the less intuitive of Savage’s axioms require the set of states of the world to be uncountable and the utility bounded. Also, the measure \pi is actually only required to be finitely additive, and it has to be defined on all subsets of the set of states of the world. Now-traditional notions like Lebesgue integration, \sigma-additivity, or \sigma-algebras do not appear. In particular, if you want to put a prior on the mean of a Gaussian that lives in \Theta=\mathbb{R}, Savage says your prior should weight all subsets of the real line, so forget about using any probability measure that has a density with respect to the Lebesgue measure! Or, to paraphrase de Finetti, \sigma-additive probability does not exist. Man, before reading about axioms I thought “Haha, let’s see whether someone has actually worked out the technical details to justify Bayesian nonparametrics with expected utility, this must be technically tricky”; now I don’t even know how to fit the mean of a Gaussian anymore. Thank you, Morpheus-Savage.
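For readers who have not met finitely additive measures before, a standard textbook illustration (not taken from Savage, and on \mathbb{N} rather than on an uncountable set of states) is the natural density d. It is only defined on sets for which the limit below exists; extending it to all subsets of \mathbb{N} is possible but non-constructive.

```latex
% Natural density of a set A of positive integers, when the limit exists:
d(A) = \lim_{n\to\infty} \frac{|A \cap \{1,\dots,n\}|}{n}.

% Finite additivity: for disjoint A and B that both have a density,
d(A \cup B) = d(A) + d(B).

% Countable additivity fails: every singleton has density zero, yet
\sum_{k \geq 1} d(\{k\}) = 0 \neq 1 = d(\mathbb{N}).
```

So a finitely additive “uniform” belief over the integers can give zero weight to every single integer while giving weight one to their union, which no \sigma-additive probability can do.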

There are axiomatic ways around these shortcomings. From what I’ve read they all either include extraneous probabilities or rather artificial mathematical constructions. Extraneous probabilities lead to philosophically beautiful axioms and interpretations, see e.g. Chapter 2 of Bernardo and Smith (2000), and they can get you finite and countably infinite sets of states of the world, for instance, whereas Savage’s axioms do not. Stronger versions also give you \sigma-additivity, see below. Loosely speaking, I understand extraneous probabilities as measuring uncertainty with respect to an ideal coin, similarly to measuring heat in degrees Celsius by comparing a physical system to freezing or boiling water. However, I find extraneous probability axioms harder to swallow than (most of) Savage’s axioms, and they involve accepting a more general notion of probability than personal propensity to act.

If you want to bypass extraneous probability and still recover \sigma-additivity, you could follow Villegas (1964), and try to complete the state space \Theta so that well-behaved measures \pi extend uniquely to \sigma-additive measures on a \sigma-algebra on this bigger set of states \hat\Theta. Defining the extended \hat\Theta involves sophisticated functional analysis, and requires adding potentially hard-to-interpret states of the world, thus losing some of the interpretability of Savage’s construction. Authors of reference books seem reluctant to go in that direction: De Groot (1970), for instance, solves the issue by using a strong extraneous probability axiom that allows working in the original set \Theta with \sigma-additive beliefs. Bernardo & Smith use extraneous probabilities, but keep their measures finitely additive until the end of Chapter 2. Then they admit departing from axioms for practical purposes and define “generalized beliefs” in Chapter 3, defined on a \sigma-algebra of the original \Theta. Others seem to more readily accept the gap between axioms and practice, and look for a more pragmatic justification of the combined use of expected utility and countably additive probabilities. For instance, Robert (2007) introduces posterior expected utility, and then argues that it has desirable properties among decision-making frameworks, such as respecting the likelihood principle. This is unlike Savage’s approach, in which the (or rather, a finitely additive version of the) likelihood principle is a consequence of the axioms. I think this is an interesting subtlety.

To conclude, I just wanted to share my excitement for having read some fascinating works on decision-theoretic axioms for Bayesian statistics. There is still some unresolved tension between having an applicable Bayesian theory of belief and having an axiomatized one. I would love this post to generate discussions, and help me understand the different thought processes behind each Bayesian being Bayesian (and each non-Bayesian being non-Bayesian). For instance, I had not realised how conceptually different the points of view in the reference books of Robert and Bernardo & Smith were. This definitely helped me understand (Xi’an) Robert’s three short answers to this post.

If this has raised your interest, I will mention here a few complementary sources that I have found useful; ping me if you want more. Chapters 2 and 3 of Bernardo and Smith (2000) contain a detailed description of their set of axioms with extraneous probability, and they give great lists of pointers on thorny issues at the end of each chapter. A lighter read is Parmigiani and Inoue (2009), which I think is a great starting point, with emphasis on the main ideas of de Finetti, Ramsey, Savage, and Anscombe and Aumann, how they apply, and how they relate to each other, rather than the technical details. Technical details and exhaustive reviews of sets of axioms for subjective probability can be found in their references to Fishburn’s work, which I have found to be beautifully clear, rigorous and complete, although like many papers involving low-level set manipulations, the proofs sometimes feel like they are written for robots. But after all, a normative theory of rationality is maybe only meant for robots.



9 Responses

  1. Pierre Jacob said, on 27 March 2018 at 02:51

    Just to be clear, when you refer to Fishburn’s work, you mean Lawrence Fishburn?

    • Rémi Bardenet said, on 27 March 2018 at 11:06

      Another hint we’re all living in the np.matrix!

  2. Aki Vehtari said, on 27 March 2018 at 14:56

    You might enjoy this one, too,
    Terenin & Draper, Cox’s Theorem and the Jaynesian Interpretation of Probability

    • Rémi Bardenet said, on 27 March 2018 at 15:38

      Thanks, this looks very relevant, and I haven’t read much on the Jaynesian approach yet. On the matter of \sigma-additivity, a quick skim through the paper leads me to think that Terenin and Draper consider by default a \sigma-algebra (bottom of p6). I haven’t come to terms with that yet, although it is probably the only thing we can do for now. The Stone extension theorem they mention can actually be used to extend a Boolean algebra to a \sigma-algebra, as done by Villegas, but this has to be in an extended state space (the so-called Stone space of the original state space), and it looks like a rather complicated object.

  3. xi'an said, on 30 March 2018 at 20:47

    Could anyone provide the background to understand the joke behind the picture? This guy does not look like any Bayesian I know…

  4. xi'an said, on 30 March 2018 at 21:09

    Very entertaining and not in the least annoying!!! As you noticed in these short outbursts on X validated, I never was particularly excited about the finite versus σ-finite debate. Or even about the axiomatic uniqueness of the pair (u,π), as I find it completely disconnected from an even highly formalised practice… As a possible symptom of neuronal degeneracy, I am getting even less excited since I realised the extent of the dependence of a Bayesian (decision) analysis on the chosen dominating measure, from the construction of conjugate priors to the definition of MAP estimators and HPD regions and MaxEnt priors, to the Savage-Dickey representation of Bayes factors (ah, Savage again!), etc. I may have thus become a very relative Bayesian, accepting the dependence on that choice, rather than trying to promote one special version. (Incidentally, I wonder if all sets and notions introduced in the third paragraph are really what you wanted them to be.)

    • Rémi Bardenet said, on 6 April 2018 at 17:54

      Thanks for the reply. It’s true that dependence on the dominating measure and/or particular versions of the densities involved is also disturbing, as discussed in your post and its comments. I wonder whether there could be a more general yet still practical mathematical framework than probability spaces that would avoid specifying a dominating measure or a \sigma-algebra. A bit like Le Cam using functional analysis to get rid of sample spaces and \sigma-algebras in classical statistics, but placing implementability high on the list of requirements (I don’t know much about Le Cam theory, but this note of David Pollard makes me want to learn more).

      About the third paragraph, which set or notion would you change? I have tried to follow Chapter 5 of Parmigiani and Inoue (2009), although I’ve made the set \Theta a product of the set where the parameter lives and the space where the data live. This way, I can condition on data living in a subset using the sure-thing principle (Section 5.2.2), and obtain posterior expected utility. I think that Parmigiani and Inoue do this implicitly in Section 7.2 when they say that posterior expected utility is directly backed up by Savage’s axioms, but by then they justify the statement using a Lebesgue formalism, which means they have tacitly switched to the “generalized beliefs” of Bernardo and Smith. So I tried to introduce data earlier.

  5. John Klein said, on 31 March 2018 at 12:54

    Nice and enlightening post! So what are we left with to convince someone to be Bayesian? As you pointed out to me, De Finetti’s representation theorem for exchangeable random variables is not a decisive argument either (although dismissing it is debatable too). My feeling is that the “proof of the pudding” is perhaps quite enough. After all, a theory or practice remains valid unless proven otherwise or subsumed by a finer model of the world.

    • Rémi Bardenet said, on 4 April 2018 at 00:06

      Thanks. Also thanks because our discussions on Bayes and DS are at the origin of this post. I think there are still many reasons to be Bayesian. I’m just saying that _in its current state_, the axiomatic justification falls short of fully justifying what is done in practice. But there are many reasonable thought processes that still lead you to Bayes. For instance, you can follow Bernardo and Smith and argue that the gap between Savage’s axioms and subjective expected utility with Lebesgue integrals is not that big, or you can follow Xi’an’s book in reversing the argument: first define posterior expected utility, then check that it has many desirable properties (good frequentist properties, satisfies the likelihood principle, unifying framework for all statistical questions, etc.). Plus, as you say, the fact that it can be implemented and has led to successful applications.

      Overall, I see Bayesian statistics as the best trade-off so far in terms of conceptual simplicity / coverage of statistical questions / axiomatic justification / good theoretical properties / implementability. However, I agree that depending on how you weight these five items, current Bayesian practice may not be the optimal answer.
