## Unbiased MCMC with couplings

Hi,

With John O’Leary and Yves Atchadé, we have just arXived our work on removing the bias of MCMC estimators. Here I’ll explain what this bias is about, and the benefits of removing it.

### What bias?

An MCMC algorithm defines a Markov chain $(X_t)_{t \geq 0}$, with stationary distribution $\pi$, so that time averages of the chain converge to averages with respect to $\pi$, for instance

$$\frac{1}{T} \sum_{t=1}^{T} h(X_t) \;\longrightarrow\; \int h(x)\,\pi(dx),$$

as $T \to \infty$. The MCMC estimator is in general biased, for any fixed $T$, because the chain is not started from $\pi$, but rather from some initial distribution $\pi_0$.

It is common to discard some initial iterations as “burn-in” in order to mitigate that bias. The bias is particularly problematic for parallel computations: you might be able to run many chains in parallel for the price of one, but each chain has to converge to $\pi$ for the bias to disappear. In other words, MCMC estimators are intrinsically justified in the asymptotic regime of the number of iterations, $T \to \infty$, whereas parallelization operates on the number of chains, as explained e.g. by Jeff Rosenthal.
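To make the initialization bias concrete, here is a small toy illustration of my own (not from the paper): a Gaussian autoregressive chain whose stationary distribution is $N(0,1)$, started far from stationarity. Averaging over many independent short chains does not remove the bias.

```python
import numpy as np

def ar1_chain(x0, rho, T, rng):
    """Gaussian AR(1) chain; its stationary distribution is N(0, 1)."""
    x = np.empty(T + 1)
    x[0] = x0
    for t in range(T):
        x[t + 1] = rho * x[t] + np.sqrt(1.0 - rho**2) * rng.standard_normal()
    return x

rng = np.random.default_rng(1)
# Many short chains, all started at x0 = 10, far in the tail of N(0, 1).
estimates = [ar1_chain(10.0, rho=0.9, T=50, rng=rng)[1:].mean()
             for _ in range(2000)]
print(np.mean(estimates))  # the true value is 0; averaging chains keeps the bias
```

Each short chain's time average over-estimates the target mean because of the initialization, and averaging thousands of such chains only averages the bias, not removes it.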

### How to remove the bias?

Instead of running one chain, let’s run two chains, $(X_t)$ and $(Y_t)$. Each of them is, individually, “as if” it was generated from the same MCMC algorithm; however, we construct the pair of chains such that they will “meet”, as in the above animation. There is a meeting time $\tau$ such that $X_t = Y_{t-1}$ for all $t \geq \tau$. Note the time shift! This construction allows us to consider

$$h(X_k) + \sum_{t=k+1}^{\infty} \left( h(X_t) - h(Y_{t-1}) \right),$$

where, beyond the meeting time, we are just adding infinitely many zeros. By taking the expectation, swapping limit and expectation, and using a telescoping sum argument (the proper justification being in the paper, Section 3), we get that the expectation of the above sum is $\mathbb{E}_\pi[h(X)] = \int h(x)\,\pi(dx)$.

This is hugely inspired by Glynn & Rhee (2014), and I had described similar ideas in the setting of smoothing in an earlier post. The contribution of the new arXiv report is to bring this construction to generic MCMC algorithms.

In the diagram above, the two chains meet at time $\tau$. This means that an unbiased estimator of $\int h(x)\,\pi(dx)$ is given by

$$H_k(X, Y) = h(X_k) + \sum_{t=k+1}^{\tau - 1} \left( h(X_t) - h(Y_{t-1}) \right).$$

In the article, we propose a series of variance reduction techniques, leading to estimators that are more similar to the original MCMC averages, with an extra correction term that removes the bias. Namely, for any given integers $k \leq m$, we propose the estimator

$$H_{k:m}(X, Y) = \frac{1}{m - k + 1} \sum_{t=k}^{m} h(X_t) + \sum_{t=k+1}^{\tau - 1} \min\left(1, \frac{t - k}{m - k + 1}\right) \left( h(X_t) - h(Y_{t-1}) \right),$$

and we give heuristics to choose $k$ and $m$ so as to maximize the estimators’ efficiency.
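As a concrete sketch, here is one way to compute such a time-averaged estimator in Python, given coupled chains that satisfy $X_t = Y_{t-1}$ after the meeting time (the function name and the toy check below are mine, not from the paper):

```python
import numpy as np

def unbiased_estimator(h, X, Y, k, m, tau):
    """Time-averaged estimator: the usual MCMC average over iterations
    k..m, plus a bias-correction term built from the coupled chains.

    X and Y are arrays of states satisfying X[t] == Y[t - 1] for t >= tau.
    """
    # Standard MCMC ergodic average over iterations k..m.
    mcmc_average = np.mean([h(X[t]) for t in range(k, m + 1)])
    # Correction term: weighted differences, all zero from the meeting time on.
    correction = sum(
        min(1.0, (t - k) / (m - k + 1.0)) * (h(X[t]) - h(Y[t - 1]))
        for t in range(k + 1, tau)
    )
    return mcmc_average + correction
```

If the chains have already met by iteration $k + 1$, the correction is an empty sum and the estimator reduces to the usual MCMC average.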

### How to construct coupled chains?

We can make Gibbs and Metropolis-Hastings chains meet, as required by the above construction and as described in the paper (Section 4). This means that we can apply the method to a wide variety of settings. In the paper (Sections 5 and 6), we provide applications to hierarchical models, logistic regressions, and the Bayesian Lasso. We also use the method to approximate the “cut” distribution, which arises in Bayesian inference for misspecified models and on which I’ll blog in detail soon.

If you use fancier MCMC methods, you can either design your own custom couplings, or you can interweave your kernel with MH steps in order to create an appropriate coupling without much altering the marginal mixing of the chains.

### What’s the point of unbiased MCMC estimators?

Thanks to the lack of bias, we can compute $R$ estimators in parallel and take their average. The resulting estimator is 1) justified in the asymptotic regime $R \to \infty$, and 2) parallelizable across the $R$ terms. Each term takes a random but finite time to complete.

Another advantage of the proposed framework is that confidence intervals can easily be constructed for averages of i.i.d. estimators, using the simple Central Limit Theorem. This is again justified as $R \to \infty$, instead of the usual MCMC confidence intervals, which are justified in the asymptotic regime of the number of iterations $T$.
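Concretely, once $R$ independent unbiased estimates are available (each from its own coupled pair of chains, possibly computed on different machines), the final estimate and a CLT-based interval are a one-liner (a minimal sketch; the function name is mine):

```python
import numpy as np

def average_with_ci(estimates, z=1.96):
    """Average R i.i.d. unbiased estimators and return an approximate
    95% confidence interval based on the Central Limit Theorem."""
    estimates = np.asarray(estimates, dtype=float)
    r = estimates.size
    mean = estimates.mean()
    half_width = z * estimates.std(ddof=1) / np.sqrt(r)
    return mean, (mean - half_width, mean + half_width)
```

The interval shrinks at the usual $1/\sqrt{R}$ rate as more parallel replicates are added, with no burn-in diagnostics required.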
