Estimating convergence of Markov chains

Upper bounds as never seen on TV.

Hi all,

Niloy Biswas (PhD student at Harvard) and I have recently arXived a manuscript on the assessment of MCMC convergence (using couplings!). Here I’ll describe the main result, and some experiments (that are not in the current version of the paper) revisiting a 1996 paper by Mary Kathryn Cowles and Jeff Rosenthal entitled “A simulation approach to convergence rates for Markov chain Monte Carlo algorithms”. Code in R producing the figures of this post is available here.

Let (X_t)_{t\geq 0} be a Markov chain generated by your favorite MCMC algorithm. Introduce an integer L referred to as the lag. Introduce a second chain (Y_t)_{t\geq 0}, such that the two chains satisfy the following properties:

  • at all times t, the distributions of X_t and Y_t are identical,
  • there is a random variable \tau, taking values in the positive integers and termed the meeting time, such that X_t = Y_{t-L} for times t\geq \tau.
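To make this construction concrete, here is a minimal Python sketch (a toy example of my own, not one of the samplers of the paper): two lag-L coupled AR(1) chains whose Gaussian transitions are drawn from Thorisson's maximal coupling, so that at every joint step the chains have a positive probability of meeting exactly.

```python
import numpy as np

def maximal_coupling_normals(mu1, mu2, sigma, rng):
    """Sample (X, Y) with X ~ N(mu1, sigma^2), Y ~ N(mu2, sigma^2),
    maximizing P(X = Y) (Thorisson's rejection algorithm)."""
    logpdf = lambda z, mu: -0.5 * ((z - mu) / sigma) ** 2  # up to a shared constant
    x = rng.normal(mu1, sigma)
    # W ~ Uniform(0, p(x)); if W <= q(x), output the common value (x, x)
    if np.log(rng.uniform()) + logpdf(x, mu1) <= logpdf(x, mu2):
        return x, x
    # otherwise sample Y from the residual part of q
    while True:
        y = rng.normal(mu2, sigma)
        if np.log(rng.uniform()) + logpdf(y, mu2) > logpdf(y, mu1):
            return x, y

def sample_meeting_time(a, sigma, L, rng, max_iter=10_000):
    """Meeting time tau of lag-L coupled chains X_{t+1} = a X_t + N(0, sigma^2)."""
    x = rng.normal()          # X_0 from an arbitrary initial distribution
    for _ in range(L):        # advance X alone for L steps
        x = a * x + rng.normal(0.0, sigma)
    y = rng.normal()          # Y_0 from the same initial distribution
    t = L
    while t < max_iter:
        x, y = maximal_coupling_normals(a * x, a * y, sigma, rng)
        t += 1
        if x == y:
            return t          # X_t = Y_{t-L} for all subsequent times
    return max_iter
```

Once the chains have met, sampling them through the maximal coupling keeps them equal forever, so the coupling is faithful.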

The chains are coupled, “faithfully” in the sense that they meet at some point and stay together after that, and with a time lag. The main result of the paper is as follows; see the manuscript for a more general result and the assumptions. Denote by d_{\text{TV}}(t) the total variation distance (TV) between the marginal distribution of X_t and its limiting distribution. The TV measures the maximum difference, over all measurable sets, between probabilities assigned by two distributions. Then, at all times t,

d_{\text{TV}}(t) \leq \mathbb{E}\left[\max\left(0, \left\lceil \frac{\tau - L - t}{L}\right\rceil \right)\right].

The right-hand side above is an expectation of a simple function of (\tau,L,t). There, L is the user-chosen lag, \tau is a random variable that we can sample (many times in parallel), and t is the time index for which we want to bound d_{\text{TV}}(t). The notation \lceil \cdot \rceil stands for the ceiling function, i.e. the smallest integer greater than or equal to the input. Therefore, we can approximate the expectation by empirical averages over independent copies of the meeting time. This extends an idea mentioned in the discussion section of the unbiased MCMC paper, previously discussed here. Using a lag L greater than one helps obtain sharper bounds, by reducing the number of triangle inequalities employed in the derivation.
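Given a sample of independent meeting times, the upper bound at each t is just an empirical average. A minimal sketch in Python (the paper's code is in R; the meeting times below are made up for illustration):

```python
import numpy as np

def tv_upper_bound(meeting_times, L, t):
    """Empirical estimate of E[max(0, ceil((tau - L - t) / L))],
    an upper bound on d_TV(t)."""
    tau = np.asarray(meeting_times, dtype=float)
    return np.mean(np.maximum(0.0, np.ceil((tau - L - t) / L)))

# hypothetical meeting times, as would be obtained from independent coupled runs
taus = [12, 15, 9, 20, 11]
# the bound is non-increasing in t and eventually hits zero
curve = [tv_upper_bound(taus, L=5, t=t) for t in range(31)]
```

Plotting `curve` against t gives the kind of figure shown in this post.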

How useful are the resulting bounds? For one thing, the TV is always at most one, so hopefully the proposed bounds are also sometimes less than one…! We can check that the bounds go to zero as t increases. Phew! Second, the upper bounds depend on the choice of coupling, whereas the TV of interest does not; if we choose our couplings poorly, we get poor bounds. Designing couplings of Markov chains such that they meet quickly is generally considered very fun. Third, since we’re eventually approximating expectations by empirical averages, we might obtain dramatically over-confident bounds if the variable inside the expectation has a large variance, which is very sad.

The manuscript presents numerical results indicating that the bounds can, in fact, be practical. To strengthen this point, the figures of this blog entry are based on the three examples in “A simulation approach to convergence rates for Markov chain Monte Carlo algorithms”. In that paper, three Gibbs samplers are considered, targeting posterior distributions in two Bayesian hierarchical models and a Bayesian probit regression. The authors numerically estimate some constants appearing in certain drift and minorization conditions, which leads to upper bounds on the TV. In Example 1, the authors can compare with analytical bounds previously obtained by Jeff Rosenthal. In Examples 2 and 3, such analytical results would be tedious to obtain, while the authors can still obtain practical upper bounds. Concretely, these bounds suggest that \sim 10^4-10^5 iterations are enough to guarantee a small TV for Examples 2 and 3.

From the figures, above and below, our proposed upper bounds suggest that the Gibbs chains converge in less than 500 iterations in all cases (less than 4 steps in Example 1!). You can also check from the R code that the implementation is simple, provided that you are aware of maximal couplings. The script also provides rudimentary checks that the bounds are sensible, by visually comparing the marginals of the target, approximated either with 1) many “short” chains run for t steps, with t chosen according to the figures, or with 2) a “long” chain and a conservative burn-in.
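A bare-bones version of that check, again on a toy AR(1) chain rather than the Gibbs samplers of the paper: compare a summary of many short chains stopped at time t with the same summary from one long chain after a conservative burn-in.

```python
import numpy as np

rng = np.random.default_rng(1)
a, sigma, t = 0.5, 1.0, 20   # toy AR(1): X_{k+1} = a X_k + N(0, sigma^2)

def run_chain(steps, rng):
    x = rng.normal()
    for _ in range(steps):
        x = a * x + rng.normal(0.0, sigma)
    return x

# 1) many "short" chains, keeping only the state at time t
short = np.array([run_chain(t, rng) for _ in range(2000)])

# 2) one "long" chain with a conservative burn-in
x, long_chain = rng.normal(), []
for i in range(12_000):
    x = a * x + rng.normal(0.0, sigma)
    if i >= 2_000:
        long_chain.append(x)
long_chain = np.array(long_chain)

# both should estimate the stationary sd, sigma / sqrt(1 - a^2) ~ 1.155
print(short.std(), long_chain.std())
```

In the R script the analogous check compares full marginal histograms rather than a single summary statistic.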


No Pima Indians were harmed in the making of these plots.

The proposed method is very closely related to that of Valen Johnson, “Studying Convergence of Markov Chain Monte Carlo Algorithms Using Coupled Sample Paths”, 1996, where couplings of chains are also used to assess convergence of MCMC algorithms.

Note that this has pretty much nothing to do with the autocorrelations of the chains at stationarity, or with their estimation; see e.g. the recent work of Dootika Vats et al. on that part of the MCMC story. In a nutshell, those works deal with the asymptotic variance of MCMC estimators, whereas ours is concerned with non-asymptotic bias.

The R script producing the above figures can be used as a tutorial if you want to know more! Any bug reports, and comments on the manuscript, would be most welcome.

Published by Pierre Jacob

Associate professor, Harvard University
