Unbiased MCMC with couplings

[Animation: two chains meeting at time 10, and staying faithful forever. ❤]

Hi,

With John O’Leary and Yves Atchadé, we have just arXived our work on removing the bias of MCMC estimators. Here I’ll explain what this bias is about and the benefits of removing it.

What bias?

An MCMC algorithm defines a Markov chain (X_t), with stationary distribution \pi, so that time averages of the chain converge to averages with respect to \pi, for instance

\frac{1}{m} \sum_{t=1}^m X_t \to \mathbb{E}_\pi[X],

as m\to \infty. The MCMC estimator \frac{1}{m} \sum_{t=1}^m X_t is in general biased, for any fixed m, because the chain is not started from \pi, but rather from some initial distribution \pi_0.

It is common to discard some initial iterations as “burn-in” in order to mitigate that bias. The bias is particularly problematic for parallel computations: you might be able to run many chains in parallel for the price of one, but each chain still has to converge to \pi for the bias to disappear. In other words, MCMC estimators are intrinsically justified in the asymptotic regime where the number of iterations m \to \infty, whereas parallelization operates over the number of chains, as explained e.g. by Jeff Rosenthal.
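To make this concrete, here is a minimal sketch (a toy illustration of mine, not from the paper): many short random-walk Metropolis–Hastings chains targeting a standard normal, all started at x = 10. Averaging over more and more chains shrinks the variance but not the bias of the short-horizon averages.

```python
import numpy as np

rng = np.random.default_rng(1)

def rwmh_chain(x0, n_steps, step=1.0):
    """Random-walk MH targeting a standard normal; returns the trajectory."""
    x, traj = x0, []
    for _ in range(n_steps):
        prop = x + step * rng.normal()
        if np.log(rng.uniform()) < 0.5 * (x**2 - prop**2):  # log accept ratio
            x = prop
        traj.append(x)
    return np.array(traj)

m = 20  # short chains: far from the m -> infinity regime
for n_chains in [10, 100, 1000]:
    avgs = [rwmh_chain(10.0, m).mean() for _ in range(n_chains)]
    print(n_chains, np.mean(avgs))  # stays far from the true mean 0
```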

How to remove the bias?

Instead of running one chain, let’s run two chains, (X_t) and (Y_t). Each of them, taken individually, is distributed as if it were generated by the same MCMC algorithm; however, we construct the pair so that the two chains “meet”, as in the animation above: there is a meeting time \tau such that X_t = Y_{t-1} for all t \geq \tau. Note the time shift! This construction allows us to consider

X_{0} + \sum_{t=1}^{\tau-1} (X_{t} - Y_{t-1}) = X_{0} + \sum_{t=1}^{\infty} (X_{t} - Y_{t-1}),

just adding infinitely many zeros. By taking the expectation, swapping limit and expectation, and using a telescoping sum argument (each Y_{t-1} has the same marginal distribution as X_{t-1}, so consecutive terms cancel; the proper justification is in the paper, Section 3), we get that the expectation of the above sum is

\mathbb{E}[X_{0}] + \sum_{t=1}^\infty (\mathbb{E}[X_{t}] -\mathbb{E}[Y_{t-1}]) = \lim_{t\to\infty}\mathbb{E}[X_{t}] =\mathbb{E}_\pi[X].

This is hugely inspired by Glynn & Rhee (2014), and I had described similar ideas in the setting of smoothing in an earlier post. The contribution of the new arXiv report is to bring this construction to generic MCMC algorithms.
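Here is a self-contained toy sketch of the construction (my own illustration, not one of the paper's examples): a three-state Markov chain whose two copies are advanced jointly with a maximal coupling of the transition rows, so that they meet in finite time and stay together afterwards. The function `unbiased_estimator` accumulates X_0 + \sum_{t=1}^{\tau-1} (X_t - Y_{t-1}), with X run one step ahead of Y as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy transition matrix on {0, 1, 2}; its stationary mean plays the role of E_pi[X].
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

def maximal_coupling(p, q):
    """Draw (a, b) with a ~ p and b ~ q, maximizing the probability that a == b."""
    overlap = np.minimum(p, q)
    w = overlap.sum()
    if rng.uniform() < w:
        a = rng.choice(len(p), p=overlap / w)
        return a, a
    a = rng.choice(len(p), p=(p - overlap) / (1.0 - w))
    b = rng.choice(len(q), p=(q - overlap) / (1.0 - w))
    return a, b

def unbiased_estimator(max_iter=10_000):
    """X_0 + sum_{t=1}^{tau-1} (X_t - Y_{t-1}), with X run one step ahead of Y."""
    x0 = rng.choice(3)             # X_0 from the initial distribution
    y = rng.choice(3)              # Y_0 from the same initial distribution
    x = rng.choice(3, p=P[x0])     # X_1 ~ P(X_0, .)
    est, t = float(x0), 1
    # at the top of the loop, (x, y) holds (X_t, Y_{t-1}); tau is the first t with x == y
    while x != y and t < max_iter:
        est += x - y                         # add the correction (X_t - Y_{t-1})
        x, y = maximal_coupling(P[x], P[y])  # draw (X_{t+1}, Y_t) jointly
        t += 1
    return est

# the average of many independent copies approximates the stationary mean of P
print(np.mean([unbiased_estimator() for _ in range(5000)]))
```

Because the coupling is maximal, the two copies have a positive probability of taking the same value at every step, and once they are equal the coupled kernel keeps them equal: they stay faithful forever, as in the animation.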

In the animation above, the two chains meet at time \tau = 10. This means that an unbiased estimator of the mean of \pi is given by X_0 + \sum_{t=1}^{9} (X_{t} - Y_{t-1}). In the article, we propose a series of variance reduction techniques, leading to estimators that are more similar to the original MCMC averages, with an extra correction term that removes the bias. Namely, for any given integers k<m, we propose the estimator

\frac{1}{m-k+1} \sum_{t=k}^m X_t + \frac{1}{m-k+1} \sum_{t=k+1}^{\tau-1} \min(t-k, m-k+1) \{X_t - Y_{t-1}\},

and we give heuristics to choose k and m so as to maximize the estimators’ efficiency.
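As a sketch of how this estimator can be computed from stored coupled trajectories (the arrays x and y and the meeting time tau are hypothetical inputs, e.g. recorded from a run of the kind sketched above):

```python
import numpy as np

def time_averaged_estimator(x, y, tau, k, m):
    """Standard MCMC average of X_k..X_m plus the bias-correction term.

    x holds X_0, ..., X_n and y holds Y_0, ..., Y_{n-1}, with n >= max(m, tau - 1);
    tau is the meeting time, i.e. X_t == Y_{t-1} for all t >= tau; requires k < m.
    """
    mcmc_average = np.mean(x[k:m + 1])
    correction = 0.0
    for t in range(k + 1, tau):
        correction += min(t - k, m - k + 1) * (x[t] - y[t - 1])
    return mcmc_average + correction / (m - k + 1)
```

Note that if the chains meet at or before time k+1, the correction vanishes and the estimator reduces to the usual MCMC average over iterations k to m.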

How to construct coupled chains?

We can make Gibbs and Metropolis-Hastings chains meet, as required by the above construction and as described in the paper (Section 4). This means that we can apply the method to a wide variety of settings. In the paper (Sections 5 and 6), we provide applications to hierarchical models, logistic regressions, and the Bayesian Lasso. We also use the method to approximate the “cut” distribution, which arises in Bayesian inference for misspecified models and on which I’ll blog in detail soon.

If you use fancier MCMC methods, you can either design your own custom couplings, or interleave your kernel with MH steps in order to create an appropriate coupling without altering the marginal mixing of the chains much.
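To give an idea of the Metropolis–Hastings case, here is a rough sketch of one standard construction (close in spirit to the paper's Section 4, but simplified to a one-dimensional normal target and with function names of my own choosing): the two proposals are drawn from a maximal coupling of the two Gaussian proposal distributions, and a single uniform is shared by both accept/reject steps, so that once the chains occupy the same state they propose and accept identically.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(x):
    return -0.5 * x**2  # standard normal target, up to an additive constant

def normal_logpdf(z, mean, sigma):
    return -0.5 * ((z - mean) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def max_coupled_proposals(x, y, sigma):
    """(px, py) with px ~ N(x, sigma^2), py ~ N(y, sigma^2), maximizing P(px == py)."""
    px = x + sigma * rng.normal()
    if np.log(rng.uniform()) + normal_logpdf(px, x, sigma) <= normal_logpdf(px, y, sigma):
        return px, px
    while True:  # rejection sampler for the second proposal
        py = y + sigma * rng.normal()
        if np.log(rng.uniform()) + normal_logpdf(py, y, sigma) > normal_logpdf(py, x, sigma):
            return px, py

def coupled_mh_step(x, y, sigma=1.0):
    """One coupled Metropolis-Hastings step; if x == y, the two outputs are equal."""
    px, py = max_coupled_proposals(x, y, sigma)
    log_u = np.log(rng.uniform())  # one uniform shared by both accept/reject decisions
    new_x = px if log_u < log_target(px) - log_target(x) else x
    new_y = py if log_u < log_target(py) - log_target(y) else y
    return new_x, new_y
```

Iterating this kernel on the pair (X_t, Y_{t-1}), with X advanced one extra step at the start, produces the meeting time \tau used in the estimators above.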

What’s the point of unbiased MCMC estimators?

Thanks to the lack of bias, we can compute R estimators in parallel and take their average. The resulting estimator is 1) justified in the asymptotic regime R \to \infty, and 2) parallelizable across the R terms. Each term takes a random but finite time to complete.

Another advantage of the proposed framework is that confidence intervals can easily be constructed for averages of i.i.d. estimators, using the simple Central Limit Theorem. This is again justified as R \to \infty, in contrast with the usual MCMC confidence intervals, which are justified as the number of iterations goes to infinity.
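For instance, given R independent unbiased estimates (the values below are placeholders), the average and a CLT-based confidence interval are simply:

```python
import numpy as np

# R independent unbiased estimates, e.g. computed in parallel; placeholder values.
estimates = np.array([0.93, 1.12, 0.87, 1.05, 0.99, 1.10, 0.95, 1.02])
R = len(estimates)

mean = estimates.mean()
half_width = 1.96 * estimates.std(ddof=1) / np.sqrt(R)  # i.i.d. CLT, 95% level
print(f"{mean:.3f} +/- {half_width:.3f}")
```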

Published by Pierre Jacob

Professor of statistics, ESSEC Business School