## Particle methods in Statistics

Hi there,

In this post, just in time for the summer, I propose a reading list for people interested in discovering the fascinating world of particle methods, aka sequential Monte Carlo methods, and their use in statistics. I also take the opportunity to advertise the SMC workshop in Uppsala (30 Aug – 1 Sept), which features an amazing list of speakers, including my postdoctoral collaborator Jeremy Heng:

## Likelihood calculation for the g-and-k distribution

Hello,

An example often used in the ABC literature is the g-and-k distribution (e.g. reference [1] below), which is defined through the inverse of its cumulative distribution function (cdf). It is easy to simulate from such distributions by drawing uniform variables and applying the inverse cdf to them. However, since there is no closed-form formula for the probability density function (pdf) of the g-and-k distribution, the likelihood is often considered intractable. It has been noted in [2] that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cdf, and 2) numerically differentiating the cdf, using finite differences, for instance. As it happens, this is very easy to implement, and I coded up an R tutorial at:

github.com/pierrejacob/winference/blob/master/inst/tutorials/tutorial_gandk.pdf

for anyone interested. This is part of the winference package that goes with our tech report on ABC with the Wasserstein distance (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, to be updated very soon!). This enables standard MCMC algorithms for the g-and-k example. It is also very easy to compute the likelihood for the multivariate extension of [3], since it only involves a fixed number of one-dimensional numerical inversions and differentiations (as opposed to a multivariate inversion).

Surprisingly, most of the papers that present the g-and-k example do not compare their ABC approximations to the posterior; instead, they typically compare the proposed ABC approach to existing ones. Similarly, the so-called Ricker model is commonly used in the ABC literature, and its posterior can be tackled efficiently using particle MCMC methods; as well as the M/G/1 model, which can be tackled either with particle MCMC methods or with tailor-made MCMC approaches such as [4].

These examples can still have great pedagogical value in ABC papers, but it would perhaps be nice to see more comparisons to the ground truth when it’s available; ground truth here being the actual posterior distribution.

- Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B, 74, 419–474.
- Rayner, G. D. and MacGillivray, H. L. (2002) Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Statistics and Computing, 12, 57–75.
- Drovandi, C. C. and Pettitt, A. N. (2011) Likelihood-free Bayesian estimation of multivari- ate quantile distributions. Computational Statistics & Data Analysis, 55, 2541–2556.
- Shestopaloff, A. Y. and Neal, R. M. (2014) On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548.

## ABC in Banff

Hi all,

Last week I attended a wonderful meeting on Approximate Bayesian Computation in Banff, which gathered a nice crowd of ABC users and enthusiasts, including lots of people outside of computational stats, whom I wouldn’t have met otherwise. Christian blogged about it there. My talk on Inference with Wasserstein distances is available as a video here (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, the paper is here). In this post, I’ll summarize a few (personal) points and questions on ABC methods, after recalling the basics of ABC (ahem).

## Statistical inference with the Wasserstein distance

Hi! It’s been too long!

In a recent arXiv entry, Espen Bernton, Mathieu Gerber and Christian P. Robert and I explore the use of the Wasserstein distance to perform parameter inference in generative models. A by-product is an ABC-type approach that bypasses the choice of summary statistics. Instead, one chooses a metric on the observation space. Our work fits in the minimum distance estimation framework and is particularly related to “On minimum Kantorovich distance estimators”, by Bassetti, Bodini and Regazzini. A recent and very related paper is “Wasserstein training of restricted Boltzmann machines“, by Montavon, Müller and Cuturi, who have similar objectives but are not considering purely generative models. Similarly to that paper, we make heavy use of recent breakthroughs in numerical methods to approximate Wasserstein distances, breakthroughs which were not available to Bassetti, Bodini and Regazzini in 2006.

Here I’ll describe the main ideas in a simple setting. If you’re excited about ABC, asymptotic properties of minimum Wasserstein estimators, Hilbert space-filling curves, delay reconstructions and Takens’ theorem, or SMC samplers with r-hit kernels, check our paper!

## Coupling of particle filters: smoothing

Hi again!

In this post, I’ll explain the new smoother introduced in our paper Coupling of Particle Filters with Fredrik Lindsten and Thomas B. Schön from Uppsala University. Smoothing refers to the task of estimating a latent process of length , given noisy measurements of it, ; the smoothing distribution refers to . The setting is state-space models (what else?!), with a fixed parameter assumed to have been previously estimated.

## Coupling of particle filters: likelihood curves

Hi!

In this post, I’ll write about coupling particle filters, as proposed in our recent paper with Fredrik Lindsten and Thomas B. Schön from Uppsala University, available on arXiv; and also in this paper by colleagues at NUS. The paper is about a methodology with multiple direct consequences. In this first post, I’ll focus on correlated likelihood estimators; in a later post, I’ll describe a new smoothing algorithm. Both are described in detail in the article. We’ve been blessed to have been advertised by xi’an’s og, so glory is just around the corner.

## Why shrinking priors shrink ?

Hello there !

While I was in Amsterdam, I took the opportunity to go and work with the Leiden crowd, an more particularly with Stéphanie van der Pas and Johannes Schmidt-Heiber. Since Stéphanie had already obtained neat results for the Horseshoe prior and Johannes had obtained some super cool results for the spike and slab prior, they were the fist choice to team up with to work on sparse models. And guess what ? we have just ArXived a paper in which we study the sparse Gaussian sequence

where only a small number of are non zero.

There is a rapidly growing literature on shrinking priors for such models, just look at Polson and Scott (2012), Caron and Doucet (2008), Carvalho, Polson, and Scott (2010) among many, many others, or simply have a look at the program of the last BNP conference. There is also an on growing literature on theoretical properties of some of these priors. The Horseshoe prior was studied in Pas, Kleijn, and Vaart (2014), an extention of the Horseshoe was then study in Ghosh and Chakrabarti (2015), and recently, the spike and slab Lasso was studied in Rocková (2015) (see also Xian ’Og)

All these results are super nice, but still we want to know **why do some shinking priors shrink so well and others do not?!** As we are *all* mathematicians here, I will reformulate this last question: **What would be the conditions on the prior under which the posterior contracts at the minimax rate ^{1}** ?

We considered a Gaussian scale mixture prior on the sequence

since this family of priors encomparse all the ones studied in the papers mentioned above (and more), so it seemed to be general enough.

Our main contribution is to give conditions on such that the posterior converge at the good rate. We showed that in order to recover the parameter that are non-zeros, the prior should have tails that decays at most exponentially fast, which is similar to the condition impose for the Spike and Slab prior. Another expected condition is that the prior should put enough mass around 0, since our assumption is that the vector of parameter is nearly black i.e. most of its components are 0.

More surprisingly, in order to recover 0 parameters correctly, one also need some conditions on the tail of the prior. More specifically, the prior’s tails cannot be too big, and if they are, we can then construct a prior that puts enough mass near 0 but which does not concentrate at the minimax rate.

We showed that these conditions are satisfied for many priors including the Horseshoe, the Horseshoe+, the Normal-Gamma and the Spike and Slab Lasso.

The Gaussian scale mixture are also quite simple to use in practice. As explained in Caron and Doucet (2008) a *simple* Gibbs sampler can be implemented to sample from the posterior. We conducted simulation study to evaluate the *sharpness* of our conditions. We computed the loss for the Laplace prior, the global-local scale mixture of gaussian (called hereafter *bad* prior for simplicity), the Horseshoe and the Normal-Gamma prior. The first two do not satisfy our condition, and the last two do. The results are reported in the following picture.

As we can see, priors that do and do not satisfy our condition show different behaviour (it seems that the priors that do not fit our conditions have a risk larger than the minimax rate of a factor of ). This seems to indicate that our conditions are sharp.

At the end of the day, our results expands the class of shrinkage priors with theoretical guarantees for the posterior contraction rate. Not only can it be used to obtain the optimal posterior contraction rate for the horseshoe+, the inverse-Gaussian and normal-gamma priors, but the conditions provide some characterization of properties of sparsity priors that lead to desirable behaviour. Essentially, the tails of the prior on the local variance should be at least as heavy as Laplace, but not too heavy, and there needs to be a sizable amount of mass around zero compared to the amount of mass in the tails, in particular when the underlying mean vector grows to be more sparse.

## Reference

Caron, François, and Arnaud Doucet. 2008. “Sparse Bayesian Nonparametric Regression.” In *Proceedings of the 25th International Conference on Machine Learning*, 88–95. ICML ’08. New York, NY, USA: ACM.

Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. 2010. “The Horseshoe Estimator for Sparse Signals.” *Biometrika* 97 (2): 465–80.

Ghosh, Prasenjit, and Arijit Chakrabarti. 2015. “Posterior Concentration Properties of a General Class of Shrinkage Estimators Around Nearly Black Vectors.”

Pas, S.L. van der, B.J.K. Kleijn, and A.W. van der Vaart. 2014. “The Horseshoe Estimator: Posterior Concentration Around Nearly Black Vectors.” *Electron. J. Stat.* 8: 2585–2618.

Polson, Nicholas G., and James G. Scott. 2012. “Good, Great or Lucky? Screening for Firms with Sustained Superior Performance Using Heavy-Tailed Priors.” *Ann. Appl. Stat.* 6 (1): 161–85.

Rocková, Veronika. 2015. “Bayesian Estimation of Sparse Signals with a Continuous Spike-and-Slab Prior.”

- For those wondering why the heck with minimax rate here, just remember that a posterior that contracts at the minimax rate induces an estimator which converge at the same rate. It also gives us that confidence region will not be too large.↩

## Sequential Bayesian inference for time series

Hello hello,

I have just arXived a review article, written for ESAIM: Proceedings and Surveys, called Sequential Bayesian inference for implicit hidden Markov models and current limitations. The topic is sequential Bayesian estimation: you want to perform inference (say, parameter inference, or prediction of future observations), taking into account parameter and model uncertainties, using hidden Markov models. I hope that the article can be useful for some people: I have tried to stay at a general level, but there are more than 90 references if you’re interested in learning more (sorry in advance for not having cited your article on the topic!). Below I’ll comment on a few points.

## Statistics journals network

Xian blogged recently on the incoming RSS read paper: Statistical Modelling of Citation Exchange Between Statistics Journals, by Cristiano Varin, Manuela Cattelan and David Firth. Following the last JRSS B read paper by one of us! The data that are used in the paper (and can be downloaded here) are quite *fascinating* for us, *academics fascinated by academic rankings, for better or for worse *(ironic here). They consist in cross citations counts for 47 statistics journals (see list and abbreviations page 5): is the number of citations from articles published in journal in 2010 to papers published in journal in the 2001-2010 decade. The choice of the list of journals is discussed in the paper. Major journals missing include *Bayesian Analysis* (published from 2006), *The Annals of Applied Statistics* (published from 2007).

I looked at the ratio of Total Citations Received by Total Citations made. This is a super simple descriptive statistic which happen to look rather similar to Figure 4 which plots Export Scores from Stigler model (can’t say more about it, I haven’t read in detail). The top five is the same modulo the swap between *Annals of Statistics* and *Biometrika*. Of course a big difference is that the Cited/Citation ratio isn’t endowed with a measure of uncertainty (below, left is my making, right is Fig. 4 in the paper).

I was surprised not to see a graph / network representation of the data in the paper. As it happens I wanted to try the gephi software for drawing graphs, used for instance by François Caron and Emily Fox in their sparse graphs paper. I got the above graph, where:

- for the data, I used the citations matrix renormalized by the total number of citations made, which I denote by . This is a way to account for the size (number of papers published) of the journal. This is just a proxy though since the actual number of papers published by the journal is not available in the data. Without that correction,
*CSDA*is way ahead of all the others. - the node size represents the Cited/Citing ratio
- the edge width represents the renormalized . I’m unsure of what gephi does here, since it converts my directed graph into an undirected graph. I suppose that it displays only the largest of the two edges and .
- for a better visibility I kept only the first decile of heaviest edges.
- the clusters identified by four colors are modularity classes obtained by the Louvain method.

**Some remarks**

The two software journals included in the dataset are quite outliers:

- the
*Journal of Statistical Software (JSS)*is disconnected from the others, meaning it has no normalized citations in the first decile. Except from its self citations which are quite big and make it the 4th Impact Factor from the total list in 2010 (and apparently the first in 2015). - the largest is the self citations of the
*STATA Journal (StataJ).*

Centrality:

*CSDA*is the most central journal in the sense of the highest (unweighted) degree.

**Some further thoughts**

All that is just for the fun of it. As mentioned by the authors, citation counts are heavy-tailed, meaning that just a few papers account for much of the citations of a journal while most of the papers account for few citations. As a matter of fact, the total of citations received is mostly driven by a few super-cited papers, and also is the Cited/Citations matrix that I use throughout for building the graph. A reason one could put forward about why JRSS B makes it so well is the read papers: for instance, Spiegelhalter et al. (2002), DIC, received alone 11.9% of all JRSS B citations in 2010. Who’d bet the number of citation this new read paper (JRSS A though) will receive?

## Bayesian classics

This week I’ll start my Bayesian Statistics master’s course at the Collegio Carlo Alberto. I realized that some of last year students got PhD positions in prestigious US universities. So I thought that letting this year’s students have a first grasp of some great Bayesian papers wouldn’t do harm. The idea is that in addition to the course, the students will pick a paper from a list and present it (or rather part of it) to the others and to me. Which will let them earn some extra points for the final exam mark. It’s in the spirit of Xian’s Reading Classics Seminar (his list here).

I’ve made up the list below, inspired by two textbooks references lists and biased by personal tastes: Xian’s Bayesian Choice and Peter Hoff’s First Course in Bayesian Statistical Methods. See the pdf list and zipped folder for papers. Comments on the list are much welcome!

Julyan

PS: reference n°1 isn’t a joke!

4comments