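A recently arXived paper by Pier Bissiri, Chris Holmes and Steve Walker piqued my curiosity about “pseudo-Bayesian” approaches, that is, statistical approaches based on a pseudo-posterior:
$$\pi(\theta \mid x) \propto p(\theta)\, \hat{L}(\theta; x),$$
where $\hat{L}(\theta; x)$ is some pseudo-likelihood. Pier, Chris and Steve use in particular
$$\hat{L}(\theta; x) = \exp\{-\lambda R_n(\theta, x)\},$$
where $R_n$ is some empirical risk function. A good example is classification; then $R_n$ could be the proportion of misclassified points:
$$R_n(\theta, x) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{y_i f(x_i; \theta) \le 0\},$$
where $f$ is some score function parametrised by $\theta$, and $y_i \in \{-1, 1\}$. (Side note: I find the ML convention for the $y_i$ more convenient than the stats convention, $y_i \in \{0, 1\}$.)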
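To make the classification example concrete, here is a minimal sketch in R. It assumes the pseudo-likelihood is the exponential of minus a scaling constant times the empirical risk, and that the risk is the misclassification rate, so that a lower risk gives a higher pseudo-posterior. The toy data, the linear score function, the Gaussian prior and the names `score`, `empirical_risk`, `log_pseudo_posterior` and `lambda` are all illustrative choices of mine, not taken from the paper.

```r
# Hypothetical toy data: one covariate, labels in {-1, +1} (ML convention)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- ifelse(x + 0.5 * rnorm(n) > 0, 1, -1)

# Illustrative linear score function f(x; theta) = theta[1] + theta[2] * x
score <- function(theta, x) theta[1] + theta[2] * x

# Empirical 0-1 risk: proportion of points whose label disagrees with the sign of the score
empirical_risk <- function(theta, x, y) mean(y * score(theta, x) <= 0)

# Log pseudo-posterior up to an additive constant: log prior minus lambda times the risk
# (independent standard Gaussian prior on theta, an arbitrary choice for illustration)
log_pseudo_posterior <- function(theta, x, y, lambda) {
  sum(dnorm(theta, mean = 0, sd = 1, log = TRUE)) - lambda * empirical_risk(theta, x, y)
}

# A sensible classifier versus its mirror image
log_pseudo_posterior(c(0, 1), x, y, lambda = 100)
log_pseudo_posterior(c(0, -1), x, y, lambda = 100)
```

The two parameter values have the same prior density, so the difference between the two outputs comes entirely from the risk term. Note that the risk, and hence the log pseudo-posterior, is piecewise constant in `theta`.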
It turns out that this particular kind of pseudo-posterior has been encountered before, but with different motivations:
- Chernozhukov and Hong (JoE, 2003) used it to define new Frequentist estimators based on moment estimation ideas (i.e. take the empirical risk above to be some empirical moment constraint). The focus is on establishing Frequentist properties of, say, the expectation of the pseudo-posterior. (It seems to me that few people in Stats have heard about this paper.)
- The PAC-Bayesian approach, which originates from Machine Learning, also relies on this kind of pseudo-posterior. To be more precise, PAC-Bayes usually starts by minimising the upper bound of an oracle inequality within a class of randomised estimators. Then, as a result, you obtain as a possible solution, say, a single draw from the pseudo-posterior defined above. A good introduction is this book by Olivier Catoni.
- Finally, Pier, Chris and Steve’s approach is by far the most Bayesian of these three pseudo-Bayesian approaches, in the sense that they try to maintain an interpretation of the pseudo-posterior as a representation of the uncertainty on the parameter. Crudely speaking, they don’t look only at the expectation, like the two approaches above, but also at the spread of the pseudo-posterior.
Let me mention briefly that quite a few papers have considered using other types of pseudo-likelihood in a pseudo-posterior, such as empirical likelihood, composite likelihood, and so on, but I will shamefully skip them for now.
To what extent should this growing interest in “Pseudo-Bayes” have an impact on Bayesian computation? For one thing, more problems to throw at our favourite algorithms should be good news. In particular, Chernozhukov and Hong mention the possibility of using MCMC as a big advantage of their approach, because the function they consider can typically be difficult to minimise directly with optimisation algorithms. PAC-Bayesians also seem to recommend MCMC, but I could not find many PAC-Bayesian papers that go beyond the theory and actually implement it; an exception is this.
On the other hand, these pseudo-posteriors might be quite nasty. First, given the way they are defined, they should not have the kind of structure that makes it possible to use Gibbs sampling. Second, many interesting choices for the empirical risk seem to be irregular or multimodal. Again, in the classification example, the 0-1 loss function is typically not continuous. Hopefully the coming years will witness some interesting research on which computational approaches are best suited to pseudo-Bayes computation, but readers will not be surprised if I put my Euros on (some form of) SMC!
Hi Statisfied readers,
I am Nicolas Chopin, a Professor of Statistics at the ENSAE, and my colleagues and good friends that manage Statisfaction kindly agreed that I would join their blog. I work mostly on “Bayesian Computation”, i.e. Monte Carlo and non-Monte Carlo methods to compute Bayesian quantities; a strong focus of my research is on Sequential Monte Carlo (aka particle filters).
I don’t plan to blog very regularly, and only on stuff related to my research, at least in some way. Well, that’s the idea for now. Stay tuned!
To illustrate generally complex probability density functions on continuous spaces, researchers always use the same examples, for instance mixtures of Gaussian distributions or a banana-shaped distribution defined on $\mathbb{R}^2$ with density function:
If we draw a sample from this distribution using MCMC we obtain a [scatter]plot like this one:
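The figure is not reproduced here, but a comparable plot can be obtained with a sketch like the one below. It assumes the commonly used “twisted Gaussian” form of the banana density, which may differ from the exact formula used in this post, and a plain random-walk Metropolis sampler. The function names, the curvature parameter `b` and the step sizes are illustrative choices.

```r
# Assumed "twisted Gaussian" banana log density on R^2, up to an additive constant;
# b controls the curvature (not necessarily the exact density used in the post)
banana_logd <- function(z, b = 0.1) {
  -z[1]^2 / 200 - 0.5 * (z[2] + b * z[1]^2 - 100 * b)^2
}

# Plain random-walk Metropolis with independent Gaussian increments
rw_metropolis <- function(logd, init, n_iter, step) {
  chain <- matrix(NA_real_, nrow = n_iter, ncol = length(init))
  current <- init
  current_logd <- logd(current)
  for (i in seq_len(n_iter)) {
    proposal <- current + step * rnorm(length(init))
    proposal_logd <- logd(proposal)
    # Accept with probability min(1, target(proposal) / target(current))
    if (log(runif(1)) < proposal_logd - current_logd) {
      current <- proposal
      current_logd <- proposal_logd
    }
    chain[i, ] <- current
  }
  chain
}

set.seed(42)
chain <- rw_metropolis(banana_logd, init = c(0, 10), n_iter = 20000, step = c(2, 2))
plot(chain, pch = ".", col = "gold", xlab = "x", ylab = "y")
```

The step sizes were picked by hand; with this target, a random walk explores the two tails slowly, so longer runs or a better proposal would be needed for serious use.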
Clearly it doesn’t really look like a banana, even if you use yellow to colour the dots like here. Actually it looks more like a boomerang, if anything. I was worried about this for a while, until I came up with a more realistic banana shaped distribution:
See how the shape is well defined compared to the first figure? And there’s even the little tail, that proves so convenient when we want to peel off the fruit. More generally we might want to create target density functions based on general shapes. For this you can now try RShapeTarget, which you can install directly from R using devtools:
    # install RShapeTarget from GitHub (recent devtools versions take "username/repo")
    library(devtools)
    install_github("pierrejacob/RShapeTarget")
The package parses SVG files representing shapes, and creates target densities from them. More precisely, an SVG file contains “paths”, which are sequences of points (for instance the above banana is a single closed path). The associated log density at any point is a function of the distance between that point and the closest path of the shape. The parameter lambda specifies the rate at which the density decays when the point goes away from the shape. With this you can define the maple leaf distribution, as a tribute to JSM 2013:
In the package you can get a distribution from a SVG file using the following code:
    library(RShapeTarget)
    # create target from file (my_svg_file_name is the path to your SVG file)
    my_shape_target <- create_target_from_shape(my_svg_file_name, lambda = 1)
    # test the log density function on 25 randomly generated points
    my_shape_target$logd(matrix(rnorm(50), ncol = 2), my_shape_target$algo_parameters)
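To illustrate the idea of a density driven by the distance to the closest path, here is a small stand-alone sketch in base R. It does not use or reproduce the package internals: the helper names are hypothetical, paths are taken to be closed polygons stored as matrices of vertices, and the log density is assumed to decay linearly with the distance at rate `lambda`, which may not be the exact form used by RShapeTarget.

```r
# Distance from point p to the segment [a, b] (numeric vectors of length 2)
dist_to_segment <- function(p, a, b) {
  ab <- b - a
  denom <- sum(ab^2)
  # Position of the orthogonal projection along the segment, clamped to [0, 1]
  s <- if (denom > 0) sum((p - a) * ab) / denom else 0
  s <- min(max(s, 0), 1)
  sqrt(sum((p - (a + s * ab))^2))
}

# Distance from p to a closed path given as a matrix of vertices (one per row)
dist_to_path <- function(p, path) {
  n <- nrow(path)
  nxt <- c(2:n, 1)  # index of the next vertex, wrapping around to close the path
  min(sapply(seq_len(n), function(i) dist_to_segment(p, path[i, ], path[nxt[i], ])))
}

# Assumed log density: minus lambda times the distance to the closest path
shape_logd <- function(p, paths, lambda = 1) {
  -lambda * min(sapply(paths, function(path) dist_to_path(p, path)))
}

# Unit square as a single closed path
square <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1))
shape_logd(c(0.5, 0.5), list(square))  # centre: 0.5 away from every edge, so -0.5
shape_logd(c(2, 0.5), list(square))    # 1 away from the right edge, so -1
```

With this definition the density is highest on the outline itself and decays on both sides of it, inside and outside the shape.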
Since characters are just a bunch of paths, you can also define distributions based on words, for instance:
which is done as follows (warning: you’re only allowed a-z and A-Z; no numbers, spaces or punctuation for now):

    library(RShapeTarget)
    word_target <- create_target_from_word("Hodor")
For the words, I defined the target density function as before, except that it’s constant on the letters: so if a point is outside a letter its density is computed based on the distance to the nearest path; if it’s inside a letter it’s just constant, so that the letters are “filled” with some constant density. I thought it’d look better.
Now I’m no longer worried about the banana-shaped distribution, but rather about the fact that the only word I could think of was “Hodor” (with whom you can chat over there).
I’ll talk in a session organized by Scott Schmidler, entitled Adaptive Monte Carlo Methods for Bayesian Computation; you can find the session programme here [online program]. I’ll talk about score and observed information matrix estimation in state-space models.
According to the rumour and Christian’s reflections on the past years (2009, 2010, 2011), I should prepare my schedule in advance to really enjoy this giant meeting. So if you want to meet there, please send me an e-mail!
See you in Montréal!
We’re in the Big Data era, blah blah blah, but advanced computational methods usually don’t scale well enough to match the increasing sizes of datasets. For instance, even in the simple case of i.i.d. data $x_1, \ldots, x_n$ and an associated likelihood function $L(\theta) = \prod_{i=1}^n f(x_i; \theta)$, the cost of evaluating the likelihood function at any parameter $\theta$ typically grows at least linearly with $n$. If you then plug that likelihood into an optimization technique to find the Maximum Likelihood Estimate, or into a sampling technique such as Metropolis-Hastings to sample from the posterior distribution, the computational cost grows accordingly for a fixed number of iterations. However, you can get unbiased estimates of the log-likelihood by drawing $m$ points $\sigma_1, \ldots, \sigma_m$ uniformly in the index set $\{1, \ldots, n\}$ and computing $\frac{n}{m} \sum_{i=1}^m \log f(x_{\sigma_i}; \theta)$. This way you sub-sample from the whole dataset, and you can choose $m$ according to your computational budget. But is it possible to perform inference with these estimates instead of the complete log-likelihood?
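The sub-sampling idea can be checked numerically with a short sketch. The model below (i.i.d. Gaussian observations with unknown mean and unit variance) is a hypothetical choice for illustration, as are the function names; the indices are drawn uniformly with replacement, and the sum over the sub-sample is rescaled by the ratio of the full sample size to the sub-sample size.

```r
# Hypothetical model for illustration: i.i.d. N(theta, 1) observations
set.seed(2013)
n <- 1e5
x <- rnorm(n, mean = 1)

# Full-data log-likelihood: cost grows linearly with n
full_loglik <- function(theta, x) sum(dnorm(x, mean = theta, sd = 1, log = TRUE))

# Sub-sampled estimate: draw m indices uniformly (with replacement), rescale by n / m
subsampled_loglik <- function(theta, x, m) {
  idx <- sample.int(length(x), size = m, replace = TRUE)
  (length(x) / m) * sum(dnorm(x[idx], mean = theta, sd = 1, log = TRUE))
}

theta <- 0.8
exact <- full_loglik(theta, x)
estimates <- replicate(2000, subsampled_loglik(theta, x, m = 100))

# The average of the estimates should be close to the exact value,
# while each individual estimate is noisy
c(exact = exact, mean_estimate = mean(estimates), sd_estimate = sd(estimates))
```

The estimate is unbiased for the log-likelihood, but its exponential is not an unbiased estimate of the likelihood itself, which is part of what makes the question of inference with such estimates non-trivial.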
Arnaud Doucet, Sylvain Rubenthaler and I have just put a technical report on arXiv about estimating the first- and second-order derivatives of the log-likelihood (also called the score and the observed information matrix respectively) in general (intractable) statistical models, and in particular in (non-linear non-Gaussian) state-space models. We call them “derivative-free” estimates because they can be computed even if the user cannot compute any kind of derivatives related to the model (as opposed to e.g. this paper and this paper). Actually in some cases of interest we cannot even evaluate the log-likelihood point-wise (we do not have a formula for it), so forget about explicit derivatives. Would you like to know more?
and of course Happy New Year (2013 is the international year of statistics!).
Last week the ISBA Regional Meeting was held in Banaras / Varanasi, in the North of India. The conference was well attended, with leading figures such as Jayanta K. Ghosh, José Bernardo, James Berger, Peter Green, Christian Robert who blogged about it, and an overall ~350 participants.
With Robin Ryder we wrote a paper titled The Wang-Landau Algorithm Reaches the Flat Histogram in Finite Time and it has been accepted in Annals of Applied Probability (arXiv preprint here). I’m especially happy about it since it was the last remaining unpublished chapter of my PhD thesis. In this post I’ll try to explain what we proved here on a simple example.
What do you do when you see the word “condom” in the title of a new arXiv entry?! You click with wild excitement of course! And you end up reading