Statisfaction

Unbiased MCMC with couplings

Posted in Statistics by Pierre Jacob on 14 August 2017

 


Two chains meeting at time 10, and staying faithful forever. ❤

 

Hi,

With John O’Leary and Yves Atchadé, we have just arXived our work on removing the bias of MCMC estimators. Here I’ll explain what this bias is about, and the benefits of removing it.

What bias?

An MCMC algorithm defines a Markov chain (X_t), with stationary distribution \pi, so that time averages of the chain converge to averages with respect to \pi, for instance

\frac{1}{m} \sum_{t=1}^m X_t \to \mathbb{E}_\pi[X],

as m\to \infty. The MCMC estimator \frac{1}{m} \sum_{t=1}^m X_t is in general biased, for any fixed m, because the chain is not started from \pi, but rather from some initial distribution \pi_0.

It is common to discard some initial iterations as “burn-in” in order to mitigate that bias. The bias is particularly problematic for parallel computations: you might be able to run many chains in parallel for the price of one, but each chain has to converge to \pi for its bias to disappear. In other words, MCMC estimators are intrinsically justified in the asymptotic regime of the number of iterations, m \to \infty, whereas parallelization occurs across the number of chains, as explained e.g. by Jeff Rosenthal.
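To see this bias concretely, here is a minimal sketch (my own toy example, not from the paper): a random-walk Metropolis chain targeting a standard normal, started far from stationarity at x_0 = 10. Averages over short chains are visibly pulled towards the starting point.

```python
import numpy as np

def rwmh_chain(m, x0, step=1.0, rng=None):
    """Random-walk Metropolis chain of length m targeting N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(m + 1)
    x[0] = x0
    for t in range(m):
        prop = x[t] + step * rng.normal()
        # accept with probability min(1, pi(prop) / pi(x_t)), with pi = N(0, 1)
        x[t + 1] = prop if np.log(rng.uniform()) < 0.5 * (x[t]**2 - prop**2) else x[t]
    return x

rng = np.random.default_rng(1)
# E_pi[X] = 0, but averaging 1,000 replications of a length-100 chain started
# at x0 = 10 gives a noticeably positive value: that is the initialization bias.
estimates = [rwmh_chain(100, x0=10.0, rng=rng)[1:].mean() for _ in range(1000)]
print(np.mean(estimates))
```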

How to remove the bias?

Instead of running one chain, let’s run two chains, (X_t) and (Y_t). Each of them, taken individually, behaves as if it were generated by the same MCMC algorithm; however, we construct the pair so that the two chains “meet”, as in the animation above. There is a meeting time \tau such that X_t = Y_{t-1} for all t \geq \tau. Note the time shift! This construction allows us to consider

X_{0} + \sum_{t=1}^{\tau-1} (X_{t} - Y_{t-1}) = X_{0} + \sum_{t=1}^{\infty} (X_{t} - Y_{t-1}),

which amounts to adding infinitely many zeros. By taking expectations, swapping limit and expectation, and using a telescoping sum argument (the proper justification is in Section 3 of the paper), we find that the expectation of the above sum is

\mathbb{E}[X_{0}] + \sum_{t=1}^\infty (\mathbb{E}[X_{t}] - \mathbb{E}[Y_{t-1}]) = \lim_{t\to\infty} \mathbb{E}[X_{t}] = \mathbb{E}_\pi[X].

This is hugely inspired by Glynn & Rhee (2014), and I had described similar ideas in the setting of smoothing in an earlier post. The contribution of the new arXiv report is to bring this construction to generic MCMC algorithms.
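In code, and assuming the two coupled trajectories and their meeting time are already stored as arrays (the layout and names below are my own), the basic estimator is just a finite sum:

```python
import numpy as np

def unbiased_estimate(X, Y, tau):
    """X[0], ..., X[T] and Y[0], ..., Y[T-1] are coupled so that X[t] == Y[t-1]
    for all t >= tau.  Returns X_0 + sum_{t=1}^{tau-1} (X_t - Y_{t-1}),
    an unbiased estimator of E_pi[X]."""
    return X[0] + sum(X[t] - Y[t - 1] for t in range(1, tau))
```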

In the diagram above, the two chains meet at time \tau = 10. This means that an unbiased estimator of the mean of \pi is given by X_0 + \sum_{t=1}^{9} (X_{t} - Y_{t-1}). In the article, we propose a series of variance reduction techniques, leading to estimators that are more similar to the original MCMC averages, with an extra correction term that removes the bias. Namely, for any given integers k<m, we propose the estimator

\frac{1}{m-k+1} \sum_{t=k}^m X_t + \sum_{t=k+1}^{\tau-1} \min\left(1, \frac{t-k}{m-k+1}\right) (X_t - Y_{t-1}),

and we give heuristics for choosing k and m so as to maximize the efficiency of these estimators.
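Here is a sketch of that estimator, with the correction weights written as \min(1, (t-k)/(m-k+1)); the function name and array conventions are mine, and the paper remains the reference for the exact formula.

```python
import numpy as np

def h_km(X, Y, tau, k, m):
    """Time-averaged unbiased estimator: the usual MCMC average over iterations
    k..m of the X chain, plus a weighted bias-correction term built from the
    coupled chain Y.  Requires X up to index max(m, tau - 1) and Y up to tau - 2."""
    mcmc_average = np.mean(X[k:m + 1])
    correction = 0.0
    for t in range(k + 1, tau):  # terms with t >= tau vanish since X_t == Y_{t-1}
        correction += min(1.0, (t - k) / (m - k + 1)) * (X[t] - Y[t - 1])
    return mcmc_average + correction
```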

How to construct coupled chains?

We can make Gibbs and Metropolis-Hastings chains meet, as required by the above construction and as described in the paper (Section 4). This means that the method applies in a wide variety of settings. In the paper (Sections 5 and 6), we provide applications to hierarchical models, logistic regression, and the Bayesian Lasso. We also use the method to approximate the “cut” distribution, which arises in Bayesian inference for misspecified models and on which I’ll blog in detail soon.
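A key building block for such couplings is a maximal coupling of two proposal distributions: a joint draw (x, y) with the prescribed marginals that makes x = y as often as possible. Below is a sketch of the classic rejection-sampling construction, specialized to two Gaussian proposals with a common step size; the function names and the Gaussian special case are my own choices for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def maximal_coupling(mu1, mu2, sigma, rng):
    """Draw (x, y) with x ~ N(mu1, sigma^2) and y ~ N(mu2, sigma^2),
    maximizing P(x == y), via the classic rejection construction."""
    x = rng.normal(mu1, sigma)
    if rng.uniform() * normal_pdf(x, mu1, sigma) <= normal_pdf(x, mu2, sigma):
        return x, x                  # the two proposals coincide
    while True:                      # otherwise, draw y from the residual of the second marginal
        y = rng.normal(mu2, sigma)
        if rng.uniform() * normal_pdf(y, mu2, sigma) > normal_pdf(y, mu1, sigma):
            return x, y
```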

If you use fancier MCMC methods, you can either design your own custom couplings, or interleave your kernel with MH steps in order to create an appropriate coupling without much altering the marginal mixing of the chains.
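As a concrete illustration of a coupled Metropolis-Hastings kernel (again my own toy example, targeting N(0,1)): propose (x', y') from a maximal coupling of the two random-walk proposals, then use a single common uniform for both accept/reject decisions; once the two states are equal they stay equal forever. The driver below runs the pair with the one-step lag described above, reusing maximal_coupling from the previous sketch and unbiased_estimate from the earlier one.

```python
def coupled_mh_step(x, y, step, rng):
    """One coupled random-walk MH transition for a N(0, 1) target:
    maximally coupled proposals, common uniform for both acceptances."""
    xp, yp = maximal_coupling(x, y, step, rng)
    log_u = np.log(rng.uniform())
    x_new = xp if log_u < 0.5 * (x**2 - xp**2) else x
    y_new = yp if log_u < 0.5 * (y**2 - yp**2) else y
    return x_new, y_new

rng = np.random.default_rng(2)
x_prev = y = 10.0                            # X_0 = Y_0 (toy initialization)
prop = x_prev + rng.normal()                 # one marginal MH step, so X runs one step ahead of Y
x = prop if np.log(rng.uniform()) < 0.5 * (x_prev**2 - prop**2) else x_prev
X, Y, tau = [x_prev, x], [y], 1
while x != y:                                # tau is the first t with X_t == Y_{t-1}
    x, y = coupled_mh_step(x, y, step=1.0, rng=rng)
    X.append(x); Y.append(y); tau += 1
print(tau, unbiased_estimate(X, Y, tau))
```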

What’s the point of unbiased MCMC estimators?

Thanks to the lack of bias, we can compute R estimators in parallel and take their average. The resulting estimator is 1) justified in the asymptotic regime R \to \infty, and 2) parallelizable across the R terms. Each term takes a random but finite time to complete.

Another advantage of the proposed framework is that confidence intervals can easily be constructed for averages of i.i.d. estimators, using the ordinary Central Limit Theorem. This is again justified as R \to \infty, instead of the usual MCMC confidence intervals, which are justified in the asymptotic regime of the number of iterations.
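For instance, assuming a function one_estimate(seed) that runs a fresh pair of coupled chains and returns one unbiased estimate (a hypothetical wrapper around the sketches above), the parallel average and an approximate 95% confidence interval take a few lines:

```python
import numpy as np

def average_with_ci(estimates, z=1.96):
    """Average R i.i.d. unbiased estimates and return an approximate 95%
    confidence interval based on the ordinary CLT (valid as R grows)."""
    estimates = np.asarray(estimates)
    mean = estimates.mean()
    half_width = z * estimates.std(ddof=1) / np.sqrt(len(estimates))
    return mean, (mean - half_width, mean + half_width)

# Hypothetical usage: one_estimate(seed) is not defined here; the R calls are
# embarrassingly parallel, e.g.
#   with concurrent.futures.ProcessPoolExecutor() as pool:
#       estimates = list(pool.map(one_estimate, range(R)))
print(average_with_ci(np.random.default_rng(3).normal(size=100)))  # dummy i.i.d. inputs
```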



