Statisfaction

Pseudo-Bayes: a quick and awfully incomplete review

Posted in General, Statistics by nicolaschopin on 18 September 2013
[Image: "You called me a pseudo-Bayesian. Prepare to die."]

A recently arXived paper by Pier Bissiri, Chris Holmes and Steve Walker piqued my curiosity about “pseudo-Bayesian” approaches, that is, statistical approaches based on a pseudo-posterior:

\pi(\theta) \propto p(\theta) \hat{L}(\theta)

where \hat{L}(\theta) is some pseudo-likelihood. Pier, Chris and Steve use in particular

\hat{L}(\theta) = \exp\{ - \lambda R_n(\theta,x) \}

where R_n(\theta,x) is some empirical risk function. A good example is classification; there R_n(\theta,x) could be the proportion of misclassified points:

R_n(\theta,x) = \frac{1}{n}\sum_{i=1}^n \mathbf{I}(y_i f_{\theta}(x_i) \leq 0)

where f_{\theta} is some score function parametrised by \theta, and y_i\in\{-1,1\}. (Side note: I find the -1/1 ML convention for the y_i more convenient than the 0/1 stats convention.)
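
To fix ideas, here is a minimal sketch (my own illustration, not code from the paper) of this pseudo-posterior for binary classification. It assumes a linear score function f_{\theta}(x) = \theta^T x, a standard Gaussian prior, and a user-chosen \lambda, none of which are specified above.

```python
# Minimal sketch of the pseudo-posterior pi(theta) \propto p(theta) exp(-lambda R_n)
# for binary classification. Assumptions (not from the post): linear score
# f_theta(x) = x @ theta, standard Gaussian prior, user-chosen lambda.
import numpy as np

def misclassification_risk(theta, X, y):
    """Empirical risk R_n: proportion of points with y_i * f_theta(x_i) <= 0."""
    scores = X @ theta                  # f_theta(x_i) for each i
    return np.mean(y * scores <= 0)     # y_i in {-1, +1}

def log_pseudo_posterior(theta, X, y, lam=5.0):
    """log pi(theta) up to a constant: log p(theta) - lambda * R_n(theta, x)."""
    log_prior = -0.5 * np.sum(theta ** 2)   # standard Gaussian prior on theta
    return log_prior - lam * misclassification_risk(theta, X, y)
```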

It turns out that this particular kind of pseudo-posterior has been encountered before, with different motivations:

  • Chernozhukov and Hong (JoE, 2003) used it to define new Frequentist estimators based on moment estimation ideas (i.e. take R_n above to be some empirical moment constraint; a toy sketch of this choice follows right after this list). The focus is on establishing Frequentist properties of, say, the expectation of the pseudo-posterior. (It seems to me that few people in Stats have heard about this paper.)
  • the PAC-Bayesian approach, which originates from Machine Learning, also relies on this kind of pseudo-posterior. To be more precise, PAC-Bayes usually starts by minimising the upper bound of an oracle inequality within a class of randomised estimators. Then, as a result, you obtain as a possible solution, say, a single draw from the pseudo-posterior defined above. A good introduction is this book by Olivier Catoni.
  • Finally, Pier, Chris and Steve’s approach is by far the most Bayesian of these three pseudo-Bayesian approaches, in the sense that they try to maintain an interpretation of the pseudo-posterior as a representation of the uncertainty about \theta. Crudely speaking, they don’t look only at the expectation, like the two approaches above, but also at the spread of the pseudo-posterior.
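
As announced, here is a toy sketch of a Chernozhukov–Hong-type choice of R_n: a quadratic form in empirical moment conditions, GMM-style. The particular moment function (a linear-model condition E[x(y - x^T\theta)] = 0) is just an illustrative assumption, not taken from their paper.

```python
# Illustrative sketch (not from Chernozhukov & Hong's paper): take R_n to be a
# quadratic form in empirical moment conditions, GMM-style.
import numpy as np

def moment_conditions(theta, X, y):
    """Empirical moments g_bar(theta) = (1/n) sum_i x_i (y_i - x_i' theta),
    encoding the illustrative condition E[x (y - x'theta)] = 0."""
    residuals = y - X @ theta
    return X.T @ residuals / len(y)

def moment_risk(theta, X, y):
    """R_n(theta) = n * g_bar(theta)' g_bar(theta); small when moments are matched."""
    g = moment_conditions(theta, X, y)
    return len(y) * g @ g
```

Their “Laplace-type” estimator is then, e.g., the expectation of the resulting pseudo-posterior, computed by simulation rather than by direct minimisation of R_n.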

Let me mention briefly that quite a few papers have considered using other types of pseudo-likelihood in a pseudo-posterior, such as empirical likelihood, composite likelihood, and so on, but I will shamefully skip them for now.

To what extent should this growing interest in “pseudo-Bayes” have an impact on Bayesian computation? For one thing, more problems to throw at our favourite algorithms should be good news. In particular, Chernozhukov and Hong mention the possibility of using MCMC as a big advantage of their approach, because the objective functions they consider are typically difficult to minimise directly with optimisation algorithms. PAC-Bayesians also seem to recommend MCMC, but I could not find many PAC-Bayesian papers that go beyond the theory and actually implement it; an exception is this.
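
To make the MCMC option concrete, here is a minimal random-walk Metropolis sketch (mine, not from any of the papers above) targeting the pseudo-posterior; all it needs is point-wise evaluation of the log pseudo-posterior, which is exactly why MCMC remains usable when R_n is awkward to optimise.

```python
# Random-walk Metropolis targeting a log-density known up to a constant,
# e.g. the log_pseudo_posterior sketched earlier.
import numpy as np

def rw_metropolis(log_target, theta0, n_iter=10_000, step=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    log_p = log_target(theta)
    chain = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)   # Gaussian proposal
        log_p_prop = log_target(prop)
        if np.log(rng.uniform()) < log_p_prop - log_p:           # accept/reject
            theta, log_p = prop, log_p_prop
        chain[t] = theta
    return chain
```

For instance, chain = rw_metropolis(lambda th: log_pseudo_posterior(th, X, y), np.zeros(X.shape[1])) gives draws whose average (after burn-in) approximates the pseudo-posterior expectation discussed in the first two bullets above.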

On the other hand, these pseudo-posteriors might be quite nasty. First, given the way they are defined, they should not have the kind of structure that makes Gibbs sampling possible. Second, many interesting choices of R_n seem to be irregular or multimodal; in the classification example again, the 0-1 loss function is not even continuous. Hopefully the coming years will witness some interesting research on which computational approaches are best suited to pseudo-Bayes computation, but readers will not be surprised if I put my Euros on (some form of) SMC!
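
For the record, here is the kind of SMC sampler I have in mind: a hedged sketch of tempered SMC where particles start from the prior and \lambda is raised from 0 to its final value, which mitigates (some of) the multimodality and irregularity of R_n. The fixed linear schedule, Gaussian prior and single Metropolis move per step are illustrative choices, not prescriptions.

```python
# Sketch of tempered SMC for the pseudo-posterior p(theta) exp(-lambda R_n(theta)).
# Illustrative choices: N(0, I) prior, fixed linear lambda schedule, one RW-Metropolis
# move per particle per tempering step.
import numpy as np

def tempered_smc(risk, dim, lam=5.0, n_particles=1000, n_steps=20, step=0.2, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.standard_normal((n_particles, dim))         # particles drawn from the prior
    schedule = np.linspace(0.0, lam, n_steps + 1)
    for lam_prev, lam_next in zip(schedule[:-1], schedule[1:]):
        r = np.array([risk(th) for th in theta])
        logw = -(lam_next - lam_prev) * r                    # incremental importance weights
        w = np.exp(logw - logw.max()); w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w) # multinomial resampling
        theta = theta[idx]
        def log_target(th):                                  # current tempered target
            return -0.5 * np.sum(th ** 2) - lam_next * risk(th)
        for i in range(n_particles):                         # one Metropolis move per particle
            prop = theta[i] + step * rng.standard_normal(dim)
            if np.log(rng.uniform()) < log_target(prop) - log_target(theta[i]):
                theta[i] = prop
    return theta
```

Calling tempered_smc(lambda th: misclassification_risk(th, X, y), dim=X.shape[1]) would return a weighted-then-resampled particle approximation of the pseudo-posterior in the classification example.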

3 Responses

  1. Pierre Alquier said, on 18 September 2013 at 19:12

    Thanks for this post Nicolas! Very informative.

    According to Sébastien Gerchinovitz’ PhD thesis, the idea of the pseudo-posterior $\pi(\theta) \exp(-\lambda R_n(\theta))$ can already be found in “Aggregating strategies” (Vovk, COLT 1990) and “The Weighted Majority algorithm” (Littlestone and Warmuth, Information and Computation 1994) for sequential prediction. More details in Sébastien’s thesis (everybody should read it)!!

  2. Simon Barthelme said, on 19 September 2013 at 09:25

There are two other places I can think of where the idea crops up. One is the “let’s stick an exponential around an energy function and pretend it’s a likelihood” school of computer vision. A lot of Gibbs models are justified like that. Another is Bayesian quantile regression, where you also take a loss function with an asymptotic connection to sample quantiles and treat it as a likelihood. It works fine, but I didn’t think the resulting posteriors had any kind of Bayesian meaning. The Holmes & Walker paper shows that it does; that’s pretty cool, I think.

  3. Dan Simpson said, on 27 September 2013 at 13:10

    There’s also some stuff on “quasi-posteriors” (because names should be randomly assigned to classes of algorithms) that come from using approximate likelihoods. This particularly pops up when people talk about Bayesian composite likelihood. Most of the work there tries to asymptotically correct the variance estimates in the general spirit of “sandwich estimators”.

    There’s also the general “meh – we don’t care” school of thought that, for example, replaces the log-likelihood with an unbiased estimator and doesn’t care about asymptopia.

