Why do shrinking priors shrink?

Posted in General, Statistics by JB Salomond on 19 October 2015


Hello there!

While I was in Amsterdam, I took the opportunity to go and work with the Leiden crowd, and more particularly with Stéphanie van der Pas and Johannes Schmidt-Hieber. Since Stéphanie had already obtained neat results for the Horseshoe prior and Johannes some super cool results for the spike and slab prior, they were the first choice to team up with to work on sparse models. And guess what? We have just arXived a paper in which we study the sparse Gaussian sequence model

X_i = \theta_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0,1), \quad i=1,...,n,

where only a small number p_n \ll n of the \theta_i are non-zero.
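
To make the setting concrete, here is a minimal sketch of how such a sparse sequence can be simulated (the values of n, p_n and the signal size are arbitrary illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p_n = 1000, 20        # p_n << n: only a few non-zero means (illustrative values)
A = 5.0                  # size of the non-zero means (illustrative)

theta = np.zeros(n)
theta[:p_n] = A          # a "nearly black" vector: most entries are exactly 0

X = theta + rng.standard_normal(n)   # X_i = theta_i + eps_i,  eps_i ~ N(0, 1)
```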

There is a rapidly growing literature on shrinking priors for such models; just look at Polson and Scott (2012), Caron and Doucet (2008), or Carvalho, Polson, and Scott (2010) among many, many others, or simply have a look at the program of the last BNP conference. There is also a growing literature on the theoretical properties of some of these priors. The Horseshoe prior was studied in van der Pas, Kleijn, and van der Vaart (2014), an extension of the Horseshoe was then studied in Ghosh and Chakrabarti (2015), and recently the spike and slab Lasso was studied in Rocková (2015) (see also Xi'an's Og).

All these results are super nice, but we still want to know: why do some shrinking priors shrink so well while others do not?! As we are all mathematicians here, I will reformulate this last question: what are the conditions on the prior under which the posterior contracts at the minimax rate¹?

We considered a Gaussian scale mixture prior on the sequence (\theta_i)

\theta_i \sim p(\theta_i) = \int \frac{e^{-\theta_i^2/(2\sigma^2)}}{\sqrt{2\pi \sigma^2}} \pi(\sigma^2) d\sigma^2

since this family of priors encompasses all the ones studied in the papers mentioned above (and more), it seemed general enough.
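
For instance (my own illustration, not notation from the paper), taking \pi to be the law of the square of a standard half-Cauchy variable gives the Horseshoe prior (with the global scale set to 1), while an exponential mixing density gives the Laplace prior. Sampling from the mixture representation is then immediate:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Horseshoe: sigma_i ~ half-Cauchy(0, 1), then theta_i | sigma_i ~ N(0, sigma_i^2)
sigma_hs = np.abs(rng.standard_cauchy(n))
theta_hs = rng.normal(0.0, sigma_hs)

# Laplace: sigma_i^2 ~ Exponential(rate 1/2), which mixes to a standard Laplace
sigma2_lap = rng.exponential(scale=2.0, size=n)
theta_lap = rng.normal(0.0, np.sqrt(sigma2_lap))
```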

Our main contribution is to give conditions on \pi under which the posterior contracts at the minimax rate. We showed that in order to recover the parameters \theta_i that are non-zero, the prior should have tails that decay at most exponentially fast, which is similar to the condition imposed for the Spike and Slab prior. Another expected condition is that the prior should put enough mass around 0, since our assumption is that the parameter vector \theta is nearly black, i.e. most of its components are 0.

More surprisingly, in order to recover the zero parameters correctly, one also needs conditions on the tails of the prior. More specifically, the prior's tails cannot be too heavy: if they are, we can construct a prior that puts enough mass near 0 but whose posterior does not contract at the minimax rate.

We showed that these conditions are satisfied for many priors including the Horseshoe, the Horseshoe+, the Normal-Gamma and the Spike and Slab Lasso.

Gaussian scale mixtures are also quite simple to use in practice. As explained in Caron and Doucet (2008), a simple Gibbs sampler can be implemented to sample from the posterior. We conducted a simulation study to evaluate the sharpness of our conditions. We computed the \ell_2 loss for the Laplace prior, the global-local scale mixture of Gaussians (hereafter called the bad prior for simplicity), the Horseshoe and the Normal-Gamma prior. The first two do not satisfy our conditions, and the last two do. The results are reported in the picture below.


[Figure: \ell_2 loss for the Laplace prior, the bad prior, the Horseshoe and the Normal-Gamma prior.]

As we can see, priors that do and do not satisfy our conditions show different behaviour (it seems that the priors that do not fit our conditions have an \ell_2 risk larger than the minimax rate by a factor of n). This seems to indicate that our conditions are sharp.
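
To give an idea of how simple these priors are to use, here is a rough sketch (in Python) of a Gibbs sampler for the Horseshoe in the sequence model above. It is not the exact sampler of Caron and Doucet (2008): I use the standard inverse-gamma auxiliary-variable representation of the half-Cauchy instead, and the initial values and number of iterations are arbitrary choices of mine. Every full conditional is a standard distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

def inv_gamma(shape, scale):
    """Elementwise draw from an inverse-gamma distribution with given shape/scale."""
    scale = np.asarray(scale, dtype=float)
    return scale / rng.gamma(shape, 1.0, size=scale.shape)

def horseshoe_gibbs(X, n_iter=2000):
    """Gibbs sampler for the horseshoe prior in the model X_i = theta_i + N(0, 1)."""
    n = X.size
    theta = X.copy()                # arbitrary initial values
    lam2 = np.ones(n)               # local variances lambda_i^2
    nu = np.ones(n)                 # auxiliary variables for the lambda_i^2
    tau2, xi = 1.0, 1.0             # global variance and its auxiliary variable
    draws = np.empty((n_iter, n))

    for t in range(n_iter):
        # theta_i | rest ~ N(k_i X_i, k_i) with k_i = lam2_i tau2 / (1 + lam2_i tau2)
        k = lam2 * tau2 / (1.0 + lam2 * tau2)
        theta = rng.normal(k * X, np.sqrt(k))

        # lambda_i^2 | rest ~ InvGamma(1, 1/nu_i + theta_i^2 / (2 tau2))
        lam2 = inv_gamma(1.0, 1.0 / nu + theta**2 / (2.0 * tau2))

        # nu_i | rest ~ InvGamma(1, 1 + 1/lambda_i^2)
        nu = inv_gamma(1.0, 1.0 + 1.0 / lam2)

        # tau2 | rest ~ InvGamma((n+1)/2, 1/xi + sum_i theta_i^2 / (2 lambda_i^2))
        tau2 = inv_gamma(0.5 * (n + 1), 1.0 / xi + 0.5 * np.sum(theta**2 / lam2))

        # xi | rest ~ InvGamma(1, 1 + 1/tau2)
        xi = inv_gamma(1.0, 1.0 + 1.0 / tau2)

        draws[t] = theta
    return draws

# e.g. posterior mean after a short burn-in, on data X from the sequence model:
# theta_hat = horseshoe_gibbs(X)[500:].mean(axis=0)
```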

At the end of the day, our results expand the class of shrinkage priors with theoretical guarantees on the posterior contraction rate. Not only can they be used to obtain the optimal posterior contraction rate for the Horseshoe+, the inverse-Gaussian and the Normal-Gamma priors, but the conditions also provide some characterization of the properties of sparsity priors that lead to desirable behaviour. Essentially, the tails of the prior on the local variance should be at least as heavy as Laplace, but not too heavy, and there needs to be a sizable amount of mass around zero compared to the amount of mass in the tails, in particular when the underlying mean vector becomes more sparse.


Caron, François, and Arnaud Doucet. 2008. “Sparse Bayesian Nonparametric Regression.” In Proceedings of the 25th International Conference on Machine Learning, 88–95. ICML ’08. New York, NY, USA: ACM.

Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. 2010. “The Horseshoe Estimator for Sparse Signals.” Biometrika 97 (2): 465–80.

Ghosh, Prasenjit, and Arijit Chakrabarti. 2015. “Posterior Concentration Properties of a General Class of Shrinkage Estimators Around Nearly Black Vectors.”

Pas, S.L. van der, B.J.K. Kleijn, and A.W. van der Vaart. 2014. “The Horseshoe Estimator: Posterior Concentration Around Nearly Black Vectors.” Electron. J. Stat. 8: 2585–2618.

Polson, Nicholas G., and James G. Scott. 2012. “Good, Great or Lucky? Screening for Firms with Sustained Superior Performance Using Heavy-Tailed Priors.” Ann. Appl. Stat. 6 (1): 161–85.

Rocková, Veronika. 2015. “Bayesian Estimation of Sparse Signals with a Continuous Spike-and-Slab Prior.”

  1. For those wondering why the heck we care about the minimax rate here, just remember that a posterior that contracts at the minimax rate induces an estimator which converges at the same rate. It also tells us that confidence regions will not be too large.

Who is Julia?

Posted in General by JB Salomond on 4 June 2015

Hi there!

Unfortunately this post is indeed about statistics…

If you are randomly walking around the statistics blogs, you have certainly heard of this new language called Julia. It is said by the developers to be as easy to write as R and as fast as C (!), which is quite a catchy way of selling their work. After talking with an enthusiastic Julia user in Amsterdam, I decided to give it a try. Here I am, sharing my first impressions.

First things first: the installation is as easy as for any other language, plus there is a neat package manager that allows you to get started quite easily. In this respect it is very similar to R.
On the minus side, I have become a big fan of RStudio, which Julian (… oupsy, Julyan) told you about a long time ago. These kinds of programs really make your life easier. I thus tried Juno, which turned out to be cumbersome and terribly slow. I would have loved an IDE for Julia that is up to the RStudio standard. Never mind.

Now let's talk a little about what is really interesting: “Is their catchphrase false advertising or not?!”

There are a bunch of relatively good tutorials online which are really helpful for learning the basic vocabulary, but if, like me, you are used to coding in R and/or Python, you should get it pretty fast: you can almost copy-paste your favourite code into Julia and, with a few adjustments, it will work. So, as easy to write as R: quite so.

I then tried to compare computational times for some of my latest code, and there came the good surprise! Code that would take a handful of minutes to run in R, mainly due to unavoidable loops, took a couple of seconds in Julia, without any other sort of optimization. The handling of big objects is smooth, and I did not run into the memory problems that R was suffering from.

So far so good! But of course there have to be some drawbacks. The first one is the poor package repository compared to CRAN, or even to what you can get for Python. This might of course improve in the next few years, as the language is still quite new. However, it is bothersome to have to re-code something when you are used to simply loading a package in R. Another, probably less important, problem is the lack of data visualization methods, and especially the absence of ggplot2, which we have grown quite fond of around here. There is of course Gadfly, which is quite close, but once again it is so far very limited compared to what I was used to…

All in all, I am happy to have tried Julia, and I am quite sure that I will be using it a lot from now on. However, even if it is great from an efficiency point of view, and way easier to learn than C (which I should have done a while ago), R and its tremendous package repository are far from beaten.

Oh, and by the way, it has PyPlot, based on Matplotlib, which allows you to make xkcd-like plots; these can make your presentations a lot more fun.
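
For the curious, the xkcd mode lives in Matplotlib itself, so PyPlot in Julia simply exposes it; in Python terms it boils down to something like this (the data and labels below are just placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)

with plt.xkcd():                    # hand-drawn, xkcd-style rendering
    plt.plot(x, np.sin(x), label="posterior mood")
    plt.xlabel("time spent debugging")
    plt.ylabel("enthusiasm")
    plt.legend()
    plt.savefig("xkcd_plot.png")
```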

A good tool for researchers?

Posted in Geek by JB Salomond on 17 January 2013


Hi there!

Like Pierre a while ago, I got fed up with printing articles, annotating them, losing them, re-printing them, and so on. Moreover, I also wanted to be able to carry more than one or two books in my bag without ruining my back. E-ink readers seemed good, but at some point I changed my mind.

After the ISBA conference in Kyoto, where I saw bazillions of iPads, I thought that tablets were really worth a shot. I am fine with reading on an LCD screen, I probably won't read scientific articles or books outside in the sun, and I like the idea of a light device that can replace my laptop at conferences. Furthermore, there is now a large choice of apps for annotating PDFs, which is crucial for me.

The device I chose runs on Android (mainly because there is no memory extension on Apple devices), combined with a good capacitive pen and an annotation app such as eZreader that gets your PDFs directly from Dropbox (which is simply awesome). You can even use LaTeX (without fancy packages…), which may come in handy.

I hope that I will not experience the same disappointment as Pierre did with his reader, but for the moment a tablet seems to be just what I needed!

A glimpse of Inverse Problems

Posted in General, Seminar/Conference, Statistics by JB Salomond on 15 November 2012


Hi folks!

Last Tuesday a seminar on Bayesian procedures for inverse problems took place at CREST. We had time for two presentations by young researchers, Bartek Knapik and Kolyan Ray. Both presentations dealt with the problem of observing a noisy version of a linear transform of the parameter of interest

Y = K\mu + \frac{1}{\sqrt{n}} Z,

where K is a linear operator and Z is Gaussian white noise. Both presentations considered asymptotic properties of the posterior distribution (their papers can be found on arXiv, here for Bartek's and here for Kolyan's). There is a wide literature on asymptotic properties of the posterior distribution in direct models. When looking at the concentration of the posterior for f toward a true f_0, given the data and with respect to some distance d(\cdot,\cdot), a well-known problem is to derive the concentration rate, that is, the rate \epsilon_n such that

\pi(d(f,f_0) > \epsilon_n | X^n) \to 0.

For inverse problems, the usual methods, as introduced by Ghosal, Ghosh and van der Vaart (2000), tend to fail, and thus results in this setting are in general difficult to obtain.
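
To fix ideas, here is a toy simulation of the diagonalized (sequence-space) version of the model above, with polynomially decaying singular values; this mildly ill-posed setup is only my own illustration, not the specific setting of either paper.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 10_000                    # noise level is 1 / sqrt(n)
N = 500                       # number of coefficients kept (illustrative)
p = 1.5                       # degree of ill-posedness (illustrative)

i = np.arange(1, N + 1)
kappa = i ** (-p)             # polynomially decaying singular values of K
mu = np.sin(i) / i**1.5       # some square-summable "true" coefficient sequence

# observed coefficients: Y_i = kappa_i mu_i + Z_i / sqrt(n)
Y = kappa * mu + rng.standard_normal(N) / np.sqrt(n)

# the naive inversion Y_i / kappa_i amplifies the noise at high frequencies,
# which is why some form of regularization (here, the prior) is essential
mu_naive = Y / kappa
```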

Bartek presented some very refined results in the conjugate case. He manages to get results on the concentration rate of the posterior distribution, on Bayesian credible sets, and Bernstein-von Mises theorems (which state that the posterior is asymptotically Gaussian) when estimating a linear functional of the parameter of interest. Kolyan gave some general conditions on the prior to achieve a concentration rate, and proved that these techniques lead to optimal concentration rates for classical models.

I knew only a little about inverse problems, but both talks were very accessible and I will surely get more involved in this field!

