Statisfaction

Another take on the Hyvärinen score for model comparison

Posted in Statistics by Stephane Shao on 22 September 2018

[Figure: Exact log-Bayes factors (log-BF) and H-factors (HF) of M1 against M2, computed for 100 independent samples (thin solid lines) of 1000 observations generated as i.i.d. N(1,1), under three increasingly vague priors for θ1.]

In a previous post, Pierre wrote about Bayesian model comparison and the limitations of Bayes factors in the presence of vague priors. Here we are, one year later, and I am happy to announce that our joint work with Jie Ding and Vahid Tarokh has recently been accepted for publication. By way of celebration, allow me to give you another take on the matter.

Given some observations y_{1:T} = (y_1,...,y_T) generated as independent draws from \mathcal{N}(1,1), consider the seemingly simple task of choosing between the following two Normal models

M_1 : \quad Y_1,...,Y_T \,|\, \theta_1 \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\theta_1, 1) , \quad \theta_1 \sim \mathcal{N}(0,\sigma_0^2) ,

M_2 : \quad Y_1,...,Y_T \,|\, \theta_2 \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \theta_2) , \quad \theta_2 \sim \text{Inv-}\chi^2(0.1, 1) ,

where \sigma_0>0 is a known hyperparameter that controls how vague the prior on \theta_1 is. In this setting, only model M_1 is well-specified (in the sense of containing the true data-generating process), so we can reasonably demand that any sensible model selection criterion correctly select M_1 over M_2. The log-Bayes factor, defined as \log \text{BF} = \log p(y_{1:T}|M_1) - \log p(y_{1:T}|M_2), fits the bill asymptotically: it is almost surely positive (thus choosing M_1) for all T large enough.

However, for any fixed sample size T (no matter how large), making the prior on \theta_1 more vague by increasing \sigma_0 drives p(y_{1:T}|M_1) to 0 and \log \text{BF} to -\infty, which tricks the log-Bayes factor into preferring the misspecified model M_2. This is illustrated with T=1000 in the figure above, where \log(\sigma_0) is increased from 0 to 150, then 350.

The alarming point is not so much that the value of \log p(y_{1:T}|M_1) varies when the prior changes (as it should, mathematically speaking), but rather that these variations can be unbounded and arbitrary, even for changes of prior that have no impact on the inference stage. This is concerning when using Bayes factors to perform model selection, since the choice of prior can dictate the conclusion of the procedure even though the “fit” of the model is virtually unchanged beyond a certain vagueness. Indeed, as \sigma_0 increases, the posterior p(\theta_1|y_{1:T},M_1) stabilizes around \mathcal{N}(1,1/T), so that the posterior predictive distribution p(y_{T+1}|y_{1:T},M_1), given by \mathcal{N}(1,1+1/T)\approx \mathcal{N}(1,1), essentially coincides with the data-generating distribution, for any large fixed T and all priors with \sigma_0 large enough; yet, as soon as \sigma_0 becomes too large, the Bayes factor mechanically (and wrongly) rules out M_1 in favor of misspecified alternatives.
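Both marginal likelihoods are available in closed form in this conjugate setting, so the phenomenon is easy to reproduce. Here is a minimal sketch in Python (not the code behind the figure, which is linked at the end of the post); it assumes that \text{Inv-}\chi^2(0.1,1) denotes the scaled inverse chi-squared distribution with \nu_0 = 0.1 degrees of freedom and scale \tau_0^2 = 1. Under M_1, y_{1:T} is marginally \mathcal{N}(0, I_T + \sigma_0^2 \mathbf{1}\mathbf{1}^\top), whose determinant and inverse are explicit thanks to the rank-one structure; under M_2, the evidence follows from the usual conjugate update.

```python
import numpy as np
from scipy.special import gammaln

def log_evidence_M1(y, sigma0_sq):
    # Exact log p(y_{1:T} | M1): marginally y ~ N(0, I + sigma0^2 * 1 1'),
    # with determinant 1 + T * sigma0^2 and an explicit rank-one inverse.
    T = len(y)
    s1, s2 = y.sum(), (y ** 2).sum()
    ratio = sigma0_sq / (1.0 + T * sigma0_sq)  # stays finite for huge sigma0_sq
    quad = s2 - ratio * s1 ** 2                # y' (I + sigma0^2 1 1')^{-1} y
    return -0.5 * (T * np.log(2.0 * np.pi) + np.log1p(T * sigma0_sq) + quad)

def log_evidence_M2(y, nu0=0.1, tau0_sq=1.0):
    # Exact log p(y_{1:T} | M2) under the conjugate prior
    # theta2 ~ scaled-inverse-chi-squared(nu0, tau0_sq).
    T = len(y)
    s2 = (y ** 2).sum()
    nu_T = nu0 + T
    return (-0.5 * T * np.log(2.0 * np.pi)
            + 0.5 * nu0 * np.log(0.5 * nu0 * tau0_sq) - gammaln(0.5 * nu0)
            + gammaln(0.5 * nu_T)
            - 0.5 * nu_T * np.log(0.5 * (nu0 * tau0_sq + s2)))

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=1000)  # one sample of i.i.d. N(1,1) data
for log_sigma0 in (0.0, 150.0, 350.0):
    log_bf = log_evidence_M1(y, np.exp(2.0 * log_sigma0)) - log_evidence_M2(y)
    print(f"log(sigma0) = {log_sigma0:5.0f}   log-BF = {log_bf:10.1f}")
```

For large \sigma_0, the term -\frac{1}{2}\log(1+T\sigma_0^2) \approx -\log\sigma_0 - \frac{1}{2}\log T makes the log-evidence of M_1, and hence the log-BF, decrease one-for-one in \log\sigma_0, which is exactly the linear drift visible in the figure.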

One possible way to address this sensitivity to the priors’ vagueness is to take a decision-theoretic approach, regarding the Bayes factor as a decision rule that chooses the model M minimizing the prequential score \sum_{t=1}^T S(y_t, p(dy_t|y_{1:t-1},M)), for the particular choice of score function S(y, p) = -\log p(y), known as the log-score. (This rephrasing is exact, since the chain rule gives \log p(y_{1:T}|M) = \sum_{t=1}^T \log p(y_t|y_{1:t-1},M): maximizing the evidence is the same as minimizing the prequential log-score.) A score function has an associated divergence D_S(p,q) = \mathbb{E}_{p}[S(Y, q)-S(Y, p)], which quantifies the expected excess loss incurred when using a distribution q to predict a random outcome Y whose true distribution is p. In particular, the divergence associated with the log-score is the well-known Kullback-Leibler divergence \text{KL}(p,q) = \mathbb{E}_p[\log p(Y) - \log q(Y)]. Finding an alternative to the Bayes factor thus becomes a matter of choosing a different divergence, and Dawid & Musio (2015) suggest using

D_\mathcal{H}(p,q) = \mathbb{E}_p[\,\|\nabla\log p(Y) - \nabla\log q(Y)\|^2\,] ,

sometimes referred to as the relative Fisher information divergence. Its key feature is that, contrary to the Kullback-Leibler divergence, its value remains unchanged when q is multiplied by an arbitrary positive constant (which is effectively what happens to the marginal likelihood when a prior is made more vague, as the normalizing constant becomes arbitrary). The associated score is the Hyvärinen score, defined as

\mathcal{H}(y,p) = 2\Delta\log p(y) + \|\nabla\log p(y)\|^2 .
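The invariance is immediate from the definitions: for any constant c>0,

\nabla\log\,(c\,q(y)) = \nabla\log q(y)  and  \Delta\log\,(c\,q(y)) = \Delta\log q(y) ,

so that D_\mathcal{H}(p,cq) = D_\mathcal{H}(p,q) and \mathcal{H}(y,cq) = \mathcal{H}(y,q). As a concrete example, for a univariate Gaussian predictive p = \mathcal{N}(m,v), a direct computation gives \mathcal{H}(y,p) = -2/v + (y-m)^2/v^2, with no trace of the normalizing constant.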

One would then select the model M minimizing the H-score, defined as \sum_{t=1}^T \mathcal{H}(y_t, p(dy_t|y_{1:t-1},M)), which is now robust to any arbitrary vagueness in the specification of priors, as shown in the picture (where H-factors simply refer to differences of H-scores). In practice, H-scores need to be estimated. This can be achieved consistently using sequential Monte Carlo methods (e.g. IBIS when the likelihood is available, or SMC2 for state-space models with intractable likelihoods). We can also prove (under strong assumptions) that the H-score leads to consistent model selection. If interested, more can be learned here; the code for the figure is provided here.
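In this conjugate example, however, the one-step-ahead predictives are explicit: Gaussian under M_1, and scaled Student-t under M_2 (assuming, as above, the scaled inverse chi-squared parameterization of the prior). The following sketch continues the snippet above and computes the H-factor of M_1 against M_2, taken here as the H-score of M_2 minus that of M_1 so that, as with the log-BF, positive values favour M_1:

```python
def h_gaussian(y, m, v):
    # Hyvarinen score of a N(m, v) predictive at y: -2/v + (y - m)^2 / v^2.
    r = (y - m) / v
    return r * r - 2.0 / v

def h_student(y, nu, s_sq):
    # Hyvarinen score of a scaled Student-t predictive t_nu(0, s_sq):
    # log p(y) = -(nu + 1)/2 * log(1 + y^2 / (nu * s_sq)) + const.
    a = nu * s_sq + y * y
    d1 = -(nu + 1.0) * y / a                          # (log p)'(y)
    d2 = -(nu + 1.0) * (nu * s_sq - y * y) / a ** 2   # (log p)''(y)
    return 2.0 * d2 + d1 * d1

def h_score_M1(y, sigma0_sq):
    # Prequential H-score of M1: y_t | y_{1:t-1} ~ N(m, 1 + 1/prec), with
    # standard conjugate Normal updates of the posterior mean and precision.
    total, prec, m = 0.0, 1.0 / sigma0_sq, 0.0
    for yt in y:
        total += h_gaussian(yt, m, 1.0 + 1.0 / prec)
        m = (prec * m + yt) / (prec + 1.0)
        prec += 1.0
    return total

def h_score_M2(y, nu0=0.1, tau0_sq=1.0):
    # Prequential H-score of M2: y_t | y_{1:t-1} ~ t_nu(0, tau^2), where
    # nu = nu0 + (t-1) and nu * tau^2 = nu0 * tau0_sq + sum of past y_i^2.
    total, nu, nt2 = 0.0, nu0, nu0 * tau0_sq
    for yt in y:
        total += h_student(yt, nu, nt2 / nu)
        nu, nt2 = nu + 1.0, nt2 + yt * yt
    return total

for log_sigma0 in (0.0, 150.0, 350.0):
    hf = h_score_M2(y) - h_score_M1(y, np.exp(2.0 * log_sigma0))
    print(f"log(sigma0) = {log_sigma0:5.0f}   H-factor = {hf:8.1f}")
```

The H-factor barely moves as \sigma_0 grows: the first few Gaussian predictives flatten out, but \mathcal{H}(y, \mathcal{N}(m,v)) \to 0 as v \to \infty, instead of diverging the way the log-score does.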

Thanks for reading!


3 Responses


  1. Rémi Bardenet said, on 8 October 2018 at 10:56

    Interesting post and paper! Can you actually justify minimizing the H-score as a Bayes action in an expected utility problem? I see how Bayes factors would come out of this (e.g. as in Bernardo and Smith, Chapter 6.1, p. 391), but I’m not sure I see how to introduce another type of “score”. When you said “decision-theoretic justification”, I thought you had such a point, but now I’m not sure that is what you meant.

    • Stephane Shao said, on 11 October 2018 at 21:58

      Thank you for your comment, Rémi! By “decision-theoretic justification”, we were alluding to something along the lines of p. 403-405 from Bernardo and Smith (1994, Section 6.1.6), i.e. comparing models in an M-open setting based on their predictive performances, albeit with the Hyvärinen score (instead of the log score) to assess predictions, and a prequential framework à la Dawid (1984, 1992) to approximate the expected loss (instead of more general cross-validation schemes). Although there are a few remarks on prequential analysis in Bernardo and Smith (1994, e.g. see p. 485 and 488) suggesting that the prequential framework does not exactly fall within Bayesian decision theory, Dawid (1992) still uses decision theoretic ideas to define actions and prequential losses when regarding models as probability forecasting systems.

      • Rémi Bardenet said, on 12 October 2018 at 17:07

        Thanks for the documented answer!

