# Statisfaction

## Bayesian model comparison with vague or improper priors

Posted in Statistics by Pierre Jacob on 6 November 2017

Synthetic data from a Lévy-driven stochastic volatility model (top), log-Bayes factor between two such models (middle) and “Hyvärinen factor” (proposed approach, bottom). Each line represents a different Monte Carlo estimate, obtained sequentially over time.

Hi,

With Stephane Shao, Jie Ding and Vahid Tarokh we have just arXived a tech report entitled “Bayesian model comparison with the Hyvärinen score: computation and consistency”. Here I’ll explain the context, that is, scoring rules and Hyvärinen scores (originating in Hyvärinen’s score matching approach to inference), and then what we actually do in the paper.

Let’s start with scoring rules. These are loss functions for the task of predicting a variable $Y$ with a probability distribution $p(dy)$. If $p(dy)$ is used to predict $Y$ and $y$ occurs, then the score is a real value, e.g. denoted by $S(y, p(dy))$; the smaller the score the better, and overall we want to find $p(dy)$ that minimizes $\mathbb{E}[S(Y,p(dy))]$, where the expectation is with respect to the distribution of $Y$. A scoring rule is proper if the above expectation is minimized when $p(dy)$ is precisely the distribution of $Y$. An example of a proper scoring rule is $S(y,p(dy)) = - \log p(y)$, the logarithmic scoring rule.
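To make properness concrete, here is a minimal sketch in Python (not taken from the paper), in a setting chosen purely for illustration: $Y \sim \mathcal{N}(0,1)$ and Gaussian candidate predictions $\mathcal{N}(m, s^2)$. The function name `expected_log_score` and the grid of candidates are assumptions of this example. In this setting the expected logarithmic score has a closed form, and a grid search should recover the true distribution as its minimizer.

```python
import math

def expected_log_score(m, s, true_mean=0.0, true_sd=1.0):
    """E[-log p(Y)] when Y ~ N(true_mean, true_sd^2) and p = N(m, s^2)."""
    # E[(Y - m)^2] = Var(Y) + (E[Y] - m)^2
    second_moment = true_sd ** 2 + (true_mean - m) ** 2
    return 0.5 * math.log(2 * math.pi * s ** 2) + second_moment / (2 * s ** 2)

# Hypothetical grid of candidate predictions (m, s), for illustration only.
candidates = [(m / 10, s / 10) for m in range(-20, 21) for s in range(5, 31)]
best = min(candidates, key=lambda ms: expected_log_score(*ms))
print("minimizer (m, s):", best)
```

The minimizer found on the grid is the pair matching the distribution of $Y$, which is what properness of the logarithmic scoring rule predicts.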

We can interpret Bayes factors in terms of logarithmic scoring rules (as in Chapter 6 of Bernardo & Smith). Indeed, the logarithm of the Bayes factor between models $M_1$ and $M_2$ is the difference of log-evidences, which we can write as a difference of logarithmic scores:

$\log p(Y | M_1) - \log p(Y | M_2) = -\log p(Y | M_2) - (-\log p(Y | M_1))$.

In this sense, the Bayes factor compares the predictive performance of models. Decomposing these marginal likelihoods into conditionals and assuming that $Y = (Y_1,\ldots, Y_T)$, we have for model $M$:

$\log p(Y|M) = \sum_{t=1}^T\log p(Y_t| Y_1, \ldots, Y_{t-1}, M)$,

(with a convention for $t = 1$), which can be interpreted as a measure of performance of out-of-sample predictive distributions $p(dy_t| Y_1, \ldots, Y_{t-1}, M)$, summed up over time. Importantly, this interpretation of Bayes factors holds also when models are misspecified.
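The decomposition can be checked numerically in a model where everything is tractable. The following is a sketch under assumptions: it uses a Normal location model ($Y_t \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$ and prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$), made-up data, and hypothetical function names. It sums the one-step-ahead log predictive densities and compares the result with the log-density of the joint Gaussian distribution of $(Y_1,\ldots,Y_T)$.

```python
import math

def log_normal_pdf(y, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def log_evidence_sequential(ys, sigma2, mu0, sigma02):
    """Sum over t of log p(y_t | y_1, ..., y_{t-1}) with conjugate updates."""
    m, v = mu0, sigma02           # current posterior of mu is N(m, v)
    total = 0.0
    for y in ys:
        pred_var = v + sigma2     # predictive distribution is N(m, v + sigma2)
        total += log_normal_pdf(y, m, pred_var)
        m = m + (v / pred_var) * (y - m)
        v = v * sigma2 / pred_var
    return total

def log_evidence_joint(ys, sigma2, mu0, sigma02):
    """log N(y; mu0 * 1, sigma2 * I + sigma02 * 1 1^T), in closed form."""
    T = len(ys)
    r = [y - mu0 for y in ys]
    s1 = sum(r)
    s2 = sum(x * x for x in r)
    logdet = (T - 1) * math.log(sigma2) + math.log(sigma2 + T * sigma02)
    quad = (s2 - sigma02 * s1 ** 2 / (sigma2 + T * sigma02)) / sigma2
    return -0.5 * (T * math.log(2 * math.pi) + logdet + quad)

ys = [0.8, -0.3, 1.5, 0.2, 1.1]   # made-up observations
seq = log_evidence_sequential(ys, sigma2=1.0, mu0=0.0, sigma02=4.0)
joint = log_evidence_joint(ys, sigma2=1.0, mu0=0.0, sigma02=4.0)
print(seq, joint)
```

The two numbers should agree up to floating-point error, since both compute $\log p(Y|M)$.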

So what’s not to like? Prior specification affects the evidence, which is completely fine per se. What’s concerning is the extent of the impact of the prior. Seemingly innocent changes of prior distributions can have drastic effects on the evidence and thus on Bayes factors. This is the case in the simplest example of a Normal location model: $Y_t \sim \mathcal{N}(\mu, \sigma^2)$, with fixed $\sigma^2$ and prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$. Then the log-evidence behaves like $-\frac{1}{2}\log( \sigma_0^2)$ when $\sigma_0^2 \to \infty$. This means that the log-evidence can take crazy values (it diverges to $-\infty$), and is not even well-defined in that limit. However, that limit corresponds to a flat prior which is not crazy in this model, at least in terms of parameter inference. This is a reason for people to avoid vague priors when relying on Bayes factors for model comparison.
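Here is a small numerical illustration of that sensitivity, as a sketch with made-up data and a hypothetical helper `log_evidence` implementing the closed-form log-evidence of the Normal location model. It evaluates the log-evidence for increasingly vague priors, and also prints the log-evidence plus $\frac{1}{2}\log \sigma_0^2$ to show what remains once the diverging term is removed.

```python
import math

def log_evidence(ys, sigma2, mu0, sigma02):
    """Closed-form log-evidence of the Normal location model."""
    T = len(ys)
    r = [y - mu0 for y in ys]
    s1 = sum(r)
    s2 = sum(x * x for x in r)
    logdet = (T - 1) * math.log(sigma2) + math.log(sigma2 + T * sigma02)
    quad = (s2 - sigma02 * s1 ** 2 / (sigma2 + T * sigma02)) / sigma2
    return -0.5 * (T * math.log(2 * math.pi) + logdet + quad)

ys = [0.8, -0.3, 1.5, 0.2, 1.1]            # made-up observations
prior_variances = [1e2, 1e4, 1e6, 1e8]     # increasingly vague priors
log_evidences = [log_evidence(ys, 1.0, 0.0, s) for s in prior_variances]
# Adding back 0.5 * log(sigma02) removes the diverging part.
shifted = [le + 0.5 * math.log(s) for le, s in zip(log_evidences, prior_variances)]

for s, le, sh in zip(prior_variances, log_evidences, shifted):
    print(f"sigma0^2 = {s:.0e}   log-evidence = {le:.4f}   shifted = {sh:.4f}")
```

The log-evidence keeps decreasing as the prior variance grows, while the shifted values settle down. A Bayes factor against any model with a fixed prior can therefore be pushed as far as we like just by inflating $\sigma_0^2$.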

Conversely, this is a reason to seek alternatives to the evidence as a model comparison criterion, see for instance intrinsic Bayes factors, fractional Bayes factors, or the mixture approach of Kamary et al. Our work follows Dawid & Musio (2015) who propose to change the scoring rule. Instead of the logarithmic scoring rule, they advocate the Hyvärinen scoring rule, which leads to replacing $- \log p(Y_1,\ldots,Y_T|M)$ by

$\sum_{t=1}^T \left\{ 2 \frac{d^2}{dy_t^2} \log p(Y_t|Y_1,\ldots,Y_{t-1},M) + \left(\frac{d}{dy_t} \log p(Y_t|Y_1,\ldots,Y_{t-1},M)\right)^2 \right\}$.

This barbaric expression involves derivatives of the log-densities of the predictive distributions, instead of the log-densities themselves. It can be checked in the Normal location model that the score is well-defined even in the limit $\sigma_0^2 \to \infty$. Thankfully it can also be checked that the Hyvärinen score is proper. Note that variants of the score have been proposed for discrete observations, but there are cases where the Hyvärinen score is inapplicable, namely when predictive densities are not smooth enough, e.g. Laplace distributions, whose densities are not differentiable at the location parameter.
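The claim about the Normal location model can be checked with a few lines of code. For a Gaussian predictive distribution $\mathcal{N}(m, s^2)$, the first derivative of the log-density at $y$ is $-(y-m)/s^2$ and the second derivative is $-1/s^2$, so each term of the sum equals $-2/s^2 + (y-m)^2/s^4$. The sketch below uses made-up data and hypothetical function names. It also makes one explicit assumption for the flat-prior case: the first term is replaced by its limiting value $0$, and the recursion starts from the posterior $\mathcal{N}(y_1, \sigma^2)$ obtained after one observation.

```python
def _accumulate(ys, m, v, sigma2):
    """Sum the Hyvarinen score terms, starting from the belief mu ~ N(m, v)."""
    total = 0.0
    for y in ys:
        s2 = v + sigma2                       # predictive variance
        total += -2.0 / s2 + (y - m) ** 2 / s2 ** 2
        m = m + (v / s2) * (y - m)            # conjugate posterior update
        v = v * sigma2 / s2
    return total

def hyvarinen_score(ys, sigma2, mu0, sigma02):
    """Prequential Hyvarinen score under the proper prior N(mu0, sigma02)."""
    return _accumulate(ys, mu0, sigma02, sigma2)

def hyvarinen_score_flat_prior(ys, sigma2):
    """Limit sigma02 -> infinity: first term is 0, then mu | y_1 ~ N(y_1, sigma2)."""
    return _accumulate(ys[1:], ys[0], sigma2, sigma2)

ys = [0.8, -0.3, 1.5, 0.2, 1.1]               # made-up observations
for sigma02 in [1e0, 1e2, 1e4, 1e8]:
    print(f"sigma0^2 = {sigma02:.0e}   H-score = {hyvarinen_score(ys, 1.0, 0.0, sigma02):.6f}")
flat = hyvarinen_score_flat_prior(ys, 1.0)
print(f"flat prior limit      H-score = {flat:.6f}")
```

Unlike the log-evidence, the score stabilizes as $\sigma_0^2$ grows and approaches the value computed directly under the flat prior.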

In the paper, we show how sequential Monte Carlo samplers can approximate this scoring rule, for a wide range of models including nonlinear state space models. We also show the consistency of this scoring rule for model selection, as the number of observations goes to infinity; our proof relies on strong regularity assumptions, but the numerical experiments indicate that the results hold under weaker conditions. Finally we investigate a population growth model applied to kangaroos, and a Lévy-driven stochastic volatility model which we use to illustrate the consistency result. Both of these cases feature intractable likelihoods approximated by particle filters within an SMC$^2$ algorithm.
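To give a flavour of how sampling can be used here, the sketch below is a toy stand-in and not the algorithm of the paper. Writing $\ell(\theta) = \log p(y_t|\theta)$ and differentiating the predictive density under the integral sign (assuming this is permitted), each term of the score can be written as $2\,\mathbb{E}[\partial^2_{y_t} \ell + (\partial_{y_t} \ell)^2] - (\mathbb{E}[\partial_{y_t} \ell])^2$, with expectations under the posterior given $y_1,\ldots,y_t$. The assumptions of this example: the Normal location model with made-up data, exact draws from the conjugate posterior in place of SMC particles, and hypothetical function names.

```python
import math
import random

def posterior_params(ys_so_far, sigma2, mu0, sigma02):
    """Conjugate posterior N(m, v) of mu given the observations so far."""
    precision = 1.0 / sigma02 + len(ys_so_far) / sigma2
    m = (mu0 / sigma02 + sum(ys_so_far) / sigma2) / precision
    return m, 1.0 / precision

def hscore_exact(ys, sigma2, mu0, sigma02):
    """Closed form: each Gaussian predictive term is -2/s2 + (y - m)^2 / s2^2."""
    total = 0.0
    for t, y in enumerate(ys):
        m, v = posterior_params(ys[:t], sigma2, mu0, sigma02)
        s2 = v + sigma2
        total += -2.0 / s2 + (y - m) ** 2 / s2 ** 2
    return total

def hscore_monte_carlo(ys, sigma2, mu0, sigma02, n_samples, rng):
    """Estimate each term from posterior draws given y_1, ..., y_t."""
    total = 0.0
    for t, y in enumerate(ys):
        m, v = posterior_params(ys[:t + 1], sigma2, mu0, sigma02)
        sd = math.sqrt(v)
        sum_d1, sum_inner = 0.0, 0.0
        for _ in range(n_samples):
            mu = rng.gauss(m, sd)
            d1 = -(y - mu) / sigma2          # d/dy log N(y; mu, sigma2)
            d2 = -1.0 / sigma2               # d^2/dy^2 log N(y; mu, sigma2)
            sum_d1 += d1
            sum_inner += d2 + d1 * d1
        total += 2.0 * sum_inner / n_samples - (sum_d1 / n_samples) ** 2
    return total

ys = [0.8, -0.3, 1.5, 0.2, 1.1]              # made-up observations
exact = hscore_exact(ys, 1.0, 0.0, 10.0)
estimate = hscore_monte_carlo(ys, 1.0, 0.0, 10.0, 20000, random.Random(1))
print(exact, estimate)
```

In models without conjugacy, the posterior draws would instead come from a sampler such as SMC, and with intractable likelihoods the derivatives themselves need to be estimated, which is where the machinery described in the paper comes in.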

The code producing the figures of the paper is available on GitHub: https://github.com/pierrejacob/bayeshscore