# Another take on the Hyvärinen score for model comparison

Exact log-Bayes factors (log-BF) and H-factors (HF) of M1 against M2, computed for 100 independent samples (thin solid lines) of 1000 observations generated as i.i.d. N(1,1), under three increasingly vague priors for θ1.

In a former post, Pierre wrote about Bayesian model comparison and the limitations of Bayes factors in the presence of vague priors. Here we are, one year later, and I am happy to announce that our joint work with Jie Ding and Vahid Tarokh has been recently accepted for publication. As way of celebrating, allow me to give you another take on the matter.

Given some observations $y_{1:T} = (y_1,...,y_T)$ generated as independent draws from $\mathcal{N}(1,1)$, consider the seemingly simple task of choosing between the following two Normal models

$M_1$ :    $Y_1,...,Y_T\,|\,\theta_1 \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\theta_1,1)$ ,   $\theta_1 \sim \mathcal{N}(0,\sigma_0^2)$ ,

$M_2$ :   $Y_1,...,Y_T\,|\,\theta_2 \,\stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,\theta_2)$ ,    $\theta_2 \sim \text{Inv-}\chi^2(0.1,1)$ ,

where $\sigma_0>0$ is a known hyperparameter that controls how vague the prior on $\theta_1$ is. In such a setting, only model $M_1$ is well-specified (in the sense of containing the true data generating process), and we can reasonably demand from any sensible model selection criterion to be able to correctly select $M_1$ over $M_2$. The log-Bayes factor, defined as $\log \text{BF} = \log p(y_{1:T}|M_1) - \log p(y_{1:T}|M_2)$, fits the bill asymptotically, as it is almost surely positive (thus choosing $M_1$) for all $T$ large enough. However, for any fixed sample size $T$ (no matter how large), making the prior on $\theta_1$ more vague by increasing $\sigma_0$ drives $p(y_{1:T}|M_1)$ to 0 and $\log \text{BF}$ to $-\infty$, which has the undesirable effect of tricking the log-Bayes factor into preferring the misspecified model $M_2$. This is illustrated with $T=1000$ in the above picture, where $\log(\sigma_0)$ is increased from 0 to 150, then 350. The alarming point is not so much that the value of $\log p(y_{1:T}|M_1)$ varies when the prior changes (as it should, mathematically speaking), but rather that these variations can be unbounded and arbitrary, even for changes of prior that have no impact on the inference stage. This phenomenon is concerning when using Bayes factors to perform model selection, since the choice of one prior is able to dictate the conclusion of the procedure even though the “fit” of the model is virtually unchanged beyond a certain vagueness: indeed, as $\sigma_0$ increases, the posterior $p(\theta_1|y_{1:T},M_1)$ stabilizes to $\mathcal{N}(1,1/T)$, leading the posterior predictive distributions $p(y_{T+1}|y_{1:T},M_1)$ given by $\mathcal{N}(1,1+1/T)\approx \mathcal{N}(1,1)$ to essentially coincide with the data generating distribution, for any large fixed $T$ and all priors with $\sigma_0$ large enough; yet, as soon as $\sigma_0$ becomes too large, the Bayes factor mechanically (and wrongly) rules out $M_1$ against other misspecified models.

One possible solution to address this sensitivity to priors’ vagueness is to take a decision theoretic approach, by regarding the Bayes factor as a decision rule consisting in choosing the model $M$ that minimizes the prequential score $\sum_{t=1}^T S(y_t, p(dy_t|y_{1:t-1},M))$, for the particular choice of score function $S(y, p) = -\log p(y)$, known as the log-score. A score function has an associated divergence function $D_S(p,q) = \mathbb{E}_{p}[S(Y, q)-S(Y, p)]$, which quantifies the expected loss incurred when trying to use a distribution $q$ to predict a random outcome $Y$, whose true distribution is $p$. In particular, the divergence associated to the log-score is the well-known Kullback-Leibler divergence $\text{KL}(p,q) = \mathbb{E}_p[\log p(Y) - \log q(Y)]$. Finding an alternative to the Bayes factor thus becomes a matter of using a different divergence, and Dawid & Musio (2015) suggest using

$D_\mathcal{H}(p,q) = \mathbb{E}_p[\,\|\nabla\log p(Y) - \nabla\log q(Y)\|^2\,]$ ,

sometimes referred to as the relative Fisher information divergence. Its key feature is that, contrary to the Kullback-Leibler divergence, its value remains unchanged when multiplying $q(Y)$ by an arbitrary constant (which is effectively what happens when making a prior more vague, as its normalizing constant becomes arbitrary). The associated score is the Hyvärinen score, defined as

$\mathcal{H}(y,p) = 2\Delta\log p(y) + \|\nabla\log p(Y)\|^2$ .

One would then select the model $M$ minimizing the H-score, defined as $\sum_{t=1}^T \mathcal{H}(y_t, p(dy_t|y_{1:t-1},M))$, which is now robust to any arbitrary vagueness in the specification of priors, as shown in the picture (where H-factors simply refer to differences of H-scores). In practice, H-scores need to be estimated. This can be achieved consistently using sequential Monte Carlo methods (e.g. IBIS when the likelihood is available, or SMC2 for state-space models with intractable likelihoods). We can also prove (under strong assumptions) that the H-score leads to consistent model selection. If interested, more can be learned here; the code for the figure is provided here.

## 3 thoughts on “Another take on the Hyvärinen score for model comparison”

1. Rémi Bardenet says:

Interesting post and paper! Can you actually justify minimizing the H-score as a Bayes action in an expected utility problem? I see how Bayes factors would come out of this (e.g. as in Bernardo and Smith, Chapter 6.1, p391), but I’m not sure I see how to introduce another type of “score”. When you said “decision-theoretic justifuctation”, I thought you had such a point, but now I’m not sure that is what you meant.

1. Stephane Shao says:

Thank you for your comment, Rémi ! By “decision-theoretic justification”, we were alluding to something along the lines of p.403-405 from Bernardo and Smith (1994, Section 6.1.6), i.e. comparing models in an M-open setting based on their predictive performances, albeit with the Hyvärinen score (instead of the log score) to assess predictions, and a prequential framework à la Dawid (1984, 1992) to approximate the expected loss (instead of more general cross-validation schemes). Although there are a few remarks on prequential analysis in Bernardo and Smith (1994, e.g. see p.485 and 488) suggesting that the prequential framework does not exactly fall within Bayesian decision theory, Dawid (1992) still uses decision theoretic ideas to define actions and prequential losses when regarding models as probability forecasting systems.

1. Rémi Bardenet says: