# JRSS: Series B read paper and comparison with other unbiased estimators

Hi all,

The paper “unbiased Markov chain Monte Carlo with couplings” co-written with John O’Leary and Yves Atchadé has been accepted as a read paper in JRSS: Series B, to be presented on December 11 at the Royal Statistical Society. Comments can be submitted (400 words max) until two weeks after, that is December 28; see the instructions here. The main contribution is on the removal of the burn-in bias of Markov chain Monte Carlo (MCMC) estimators, using coupled chains. We argue that 1) the burn-bias can be removed for a controllable loss of efficiency, and 2) this can be done in many settings, for low or high-dimensional, discrete or continuous target distributions, provided that the underlying MCMC algorithm works well enough. This might have consequences on the use of parallel processors for MCMC, but also on the estimation of Monte Carlo errors. We believe that there are a lot of open questions related to our work, on the theoretical, methodological and applied sides, and  hope that the discussion will be interesting.

All the ingredients of our approach were already in the literature, from coupling of Markov chains to debiasing techniques, to the analysis of budget-constrained simulations, but we also propose multiple contributions. Below, I summarize the main methodological contributions of the paper, the way I see it at least, and I  try to contrast our approach with those of Glynn & Rhee (2014) and Agapiou, Roberts & Vollmer (2018).

The problem is to obtain unbiased estimators of $\mathbb{E}_\pi[h(X)]$, some expectation of interest with respect to a target distribution $\pi$. That distribution is assumed to be the limiting distribution of a Markov chain $(X_t)$, with Markov kernel $P$, and initial distribution $\pi_0$. Under some assumptions we have a law of large numbers,

$\frac{1}{T-b+1}\sum_{t=b}^{T} h(X_t) \to \mathbb{E}_\pi[h(X)]$

for any choice of “burn-in” $b$, as $T$ goes to infinity, in e.g. probability. This justifies the use of ergodic average as an estimator of $\mathbb{E}_\pi[h(X)]$. However for fixed $T$ the estimator is biased: its expectation is not equal to the quantity of interest. Thus averages of independent replicates of that estimator would not be consistently estimating $\mathbb{E}_\pi[h(X)]$. This is particularly unfortunate in the context of parallel computing as explained here and has motivated a lot of research, e.g. this and that. Parallel computation was also our main motivation for trying to remove the bias, even though we also realized along the way that there are other benefits to unbiased estimators.

The groundbreaking paper by Glynn & Rhee (2014), hereafter GR, shows how to obtain unbiased estimators, using pairs of Markov chains. The construction involves the construction of two Markov chains $(X_t)$ and $(Y_t)$, both with limiting distribution $\pi$, of an integer-valued “truncation” random variable $N$, in order to compute

$Z = h(X_0) + \sum_{t=1}^N \frac{h(X_t) - h(Y_{t-1})}{\mathbb{P}(N\geq t)}$.

Under various conditions on the pair of chains, and on the truncation variable, the above random variable can indeed be an unbiased estimator of  $\mathbb{E}_\pi[h(X)]$, with a finite variance and a finite expected cost.  A poor choice of truncation variable might make the variance or the expected cost of the estimator infinite. A wise choice of truncation variable might or might not require hard work, depending on how the chains are coupled; for instance in their Section 3 on Harris recurrent chains, the coupling construction is sophisticated but $N$ could be set e.g. to infinity almost surely.

The GR paper is excellent in all ways, creative, clear, insightful; it was hugely inspirational for me and I believe many others. I had used the technique previously for smoothing in state space models, with Fredrik Lindsten and Thomas Schön. What GR achieve was simply considered impossible to many people, even to people familiar with some of the “debiasing” tricks, e.g. Russian roulette. The GR paper does not claim itself that the proposed technique is generally applicable in MCMC contexts, nor that you can generally control the loss of efficiency compared to standard ergodic averages. Yet there is no doubt that the authors understood the importance of their paper (the abstract starts with “We introduce a new class of Monte Carlo methods, which we call exact estimation algorithms.“).  The numerical experiments concern two examples: an (intriguing) example of non-irreducible Markov chain, and an M/M/1 queue. Their experiments illustrate clearly that standard $\sqrt{N}$ Monte Carlo rates are achieved in the asymptotic limit of the number of independent replicates, in other words: they can estimate quantities of interest by running many pairs of coupled chains, each for a random but finite amount of time. A new world!

Another paper is that of Agapiou, Roberts & Vollmer (2018). The proposed estimators are of a similar form as those of GR.  In particular there is a truncation variable $N$ to be chosen by the user. If $N$ is chosen poorly, the resulting estimators might have an infinite variance. Unfortunately, to choose $N$ appropriately, one needs to know the “contraction rate” of the coupled chains. That is, something to do with the expected decrease in the distance between two chains, after a certain number of steps, when chains are simulated from a coupled transition kernel, and wherever the two chains start. In Section 6 of the paper, the authors show how to approximate this contraction rate in a three-dimensional logistic regression example. Even with an appropriate choice of truncation variable, the proposed estimator (just as the one in GR) has no reason to achieve an efficiency similar to that of an ergodic average. So in some cases it might be competitive, and in some cases it might not be. Section 6 discusses this in details with two examples: an AR(1) process, and the logistic regression example mentioned above. The other sections of the paper concern target distributions defined on infinite-dimensional spaces with applications to Bayesian inverse problems, and are thus less related to our paper.

Now what do we contribute in our paper? For the sake of adding some novelty here I’ll blend the estimator of the unbiased MCMC paper with the L-lag construction of a recent follow-up paper.

• Sample $X_0\sim \pi_0$ and then $X_t | X_{t-1} \sim P(X_{t-1},\cdot)$ for $t = 1,\ldots,L$. Sample $Y_0 \sim \pi_0$.
• Sample $(X_t,Y_{t-L})$ from a Markov kernel $\bar{P}$ given $(X_{t-1},Y_{t-L-1})$, for $t = L, \ldots, \max(m,\tau)$, where $m$ is a user-chosen integer, and $\tau$ is the meeting time, defined as $\tau = \inf\{t\geq L: X_t = Y_{t-L}\}.$

The above construction pre-supposes that $\tau$ is almost surely finite, i.e. that is it possible, in realistic cases, to make two Markov chains meet exactly. It was known since at least this 1998 paper that this could be done in non-trivial MCMC settings, and our paper gathers various techniques from the literature to design such couplings (I personally consider our efforts to collect these coupling recipes a contribution also).  The benefit of this construction is, firstly, the absence of truncation variables. Thus, no problems associated with choosing the truncation variable. Secondly, from the constructed chains we can define the following estimator:

$\frac{1}{m-k-1}\sum_{t=k}^m h(X_t) + \frac{1}{m-k-1} \sum_{t=k+L}^{\tau-1} \min(m-k+1, \lceil \frac{t - k}{L} \rceil) \{h(X_t) - h(Y_{t-L})\}$.

Here $k,m,L$ are tuning parameters; this estimator is essentially the same as that proposed in the context of coupled conditional particle filters by  Fredrik LindstenThomas Schön and myself here (Section 4.5), although personally I did not see the generality of it at the time. The first term is a standard MCMC average, as if we discarded $k-1$ iterations as burn-in and ran the chain for $m$ iterations in total. The second term is a bias correction, which vanishes as $k$ increases; the sum is defined to be zero in the event $\{\tau - 1 < k + 1\}$. Appropriate choices of $k$, $L$ and $m$ can make the efficiency of the above estimator as close as desired to the efficiency of the underlying ergodic average; a theoretical result in in that direction can be found in Section 3 of our paper.

To end with a concrete note, and in the hope of making our methodological contribution as clear as possible, I revisited the two examples of Section 6 in Agapiou, Roberts & Vollmer (2018), and applied our proposed estimators. Doing so, I did not need any knowledge on the contraction rate of the coupled Markov kernels, and I could achieve an efficiency close to that of ergodic averages. R scripts are available here: