Hi everyone,

and Happy New Year! This post is about some statistical inferences that one can do using as “data” the output of MCMC algorithms. Consider the trace plot above. It has been generated by Metropolis–Hastings using a Normal random walk proposal, with a standard deviation “sigma”, on a certain target. Suppose that you are given a function that evaluates the pdf of that target. Can you retrieve the value of sigma used to generate that chain?

As a statistical problem this is a well-defined question. We view the chain as a time series, and, for once, the model is well-specified! But the difficulty comes from the likelihood function being intractable; see the classic paper by Tierney, equation (1), for an expression of the transition kernel of MH. Specifically, the issue occurs whenever two consecutive states in the chain are identical, which indicates that a proposal was rejected during the course of the algorithm. This results in a term in the likelihood equal to the “rejection probability” from that state $x$, namely

$r(x) = 1 - \int q(x, x') \, \alpha(x, x') \, dx',$

where $q(x, \cdot)$ is the Normal proposal density and $\alpha(x, x')$ is the acceptance probability of state $x'$ from state $x$. That term is intractable because of the integral. But we can estimate $r(x)$!

A naive estimator of $r(x)$ is obtained by drawing $x'$ from the Normal proposal distribution appearing in the integral, and evaluating $1 - \alpha(x, x')$. The issue with that estimator is that it can be exactly equal to zero, with non-negligible probability. If many such estimators are multiplied together to estimate the full likelihood, then there is a large chance that at least one of them will be zero, resulting in an overall likelihood estimator equal to zero. This is problematic since we want to compare the likelihoods associated with different values of sigma!

There’s a nice trick in “The Alive Particle Filter” by Jasra, Lee, Yau, Zhang which exploits a property of Negative Binomial variables established by Neuts and Zacks in 1967. The estimator is provided by the algorithm below.
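
As a hedged sketch of that algorithm (in Python rather than R, with a generic log-target function), the idea is: simulate accept/reject events from state $x$ until a fixed number $n \geq 2$ of rejections has occurred; if $N$ trials were needed, then $(n-1)/(N-1)$ is an unbiased and strictly positive estimator of $r(x)$:

```python
import numpy as np

def alive_rejection_estimator(x, sigma, log_target, n=2, rng=None):
    """Unbiased, strictly positive estimator of the rejection probability r(x)
    of a Normal random-walk Metropolis-Hastings move from state x.

    Simulate accept/reject events until n rejections occur; if N trials were
    needed, (n - 1) / (N - 1) is unbiased for r(x) (the Negative Binomial
    property of Neuts & Zacks), and positive as soon as n >= 2."""
    rng = np.random.default_rng() if rng is None else rng
    rejections, trials = 0, 0
    while rejections < n:
        trials += 1
        proposal = x + sigma * rng.normal()
        log_alpha = min(0.0, log_target(proposal) - log_target(x))
        if np.log(rng.uniform()) > log_alpha:  # this proposal is rejected
            rejections += 1
    return (n - 1) / (trials - 1)
```

Each rejection term in the likelihood can then be replaced by one such draw; since the draws are independent, unbiasedness transfers to the product over transitions.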

The output of that algorithm has expectation $r(x)$ and is guaranteed to never be equal to zero. Equipped with this, we can obtain unbiased, positive estimators of the full likelihood of sigma. In combination with some prior information, we can run a pseudo-marginal Metropolis–Hastings algorithm on the sigma space, the output of which is in the figure below.
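
The pseudo-marginal step itself is generic. As a hedged illustration (a toy conjugate model with artificial mean-one noise standing in for the alive estimator, not the sigma-inference problem above), here is a sketch in Python showing that an MH chain driven by a non-negative unbiased likelihood estimator still targets the exact posterior:

```python
import numpy as np

def pseudo_marginal_mh(log_lik_est, log_prior, theta0, n_iters, step, rng):
    """Pseudo-marginal Metropolis-Hastings: the intractable likelihood is
    replaced by a non-negative unbiased estimator; the estimate attached to
    the current state is recycled across iterations."""
    theta, log_z = theta0, log_lik_est(theta0, rng)
    chain = np.empty(n_iters)
    for i in range(n_iters):
        prop = theta + step * rng.normal()
        log_z_prop = log_lik_est(prop, rng)
        log_ratio = (log_z_prop + log_prior(prop)) - (log_z + log_prior(theta))
        if np.log(rng.uniform()) < log_ratio:
            theta, log_z = prop, log_z_prop
        chain[i] = theta
    return chain

# Toy check: y_i ~ N(theta, 1) with a N(0, 1) prior, so the posterior mean is
# n * ybar / (n + 1). The "estimator" is the exact likelihood perturbed by
# mean-one Gamma noise, which keeps it unbiased and non-negative.
y = np.array([1.2, 0.8, 1.0, 1.1, 0.9])
def log_lik_est(theta, rng):
    exact = -0.5 * np.sum((y - theta) ** 2)
    return exact + np.log(rng.gamma(5.0, 0.2))  # log of mean-one noise
log_prior = lambda theta: -0.5 * theta ** 2
rng = np.random.default_rng(0)
chain = pseudo_marginal_mh(log_lik_est, log_prior, 0.0, 20000, 0.8, rng)
```

The chain's long-run average should match the exact posterior mean of the toy model, despite the noisy likelihood evaluations.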

At this point, a new “meta” problem would be the inference of the standard deviation used in the pseudo-marginal algorithm defined on the sigma space!…

The problem is related to work on the modeling of animal movements, for instance “Inference in MCMC step selection models” by Michelot, Blackwell, Chamaillé-Jammes and Matthiopoulos, where MCMC-type algorithms are used as statistical models for animal movements. Their appeal is that they provide simple mechanisms to describe local moves, while being guaranteed to admit a specified global stationary distribution, which might describe where animals roam “on average”.

The code producing the above figures is here: https://github.com/pierrejacob/statisfaction-code/blob/master/2020-01-inferenceMCMC.R

Rémi Bardenet and I are starting a new course on Bayesian machine learning at the MVA master (Mathématiques, Vision et Apprentissage) at ENS Paris-Saclay. Details on the syllabus can be found on the MVA webpage and on this Github repository. In this post, I briefly describe what motivated us to propose this course, and provide results of topic modeling of recent machine learning (conference) papers mentioning *Bayesian* in their abstract.

The Bayesian paradigm and its associated toolbox for inference and prediction have become common in machine learning. Figures (a) and (b) above show the evolution of the number of Bayesian papers in the JMLR journal and at the NeurIPS conference, respectively. More precisely, the red histograms show the papers whose abstracts contain one of the keywords *Bayesian, Monte Carlo, MCMC*, or *variational Bayes*. Note that *Bayesian networks* can be a misnomer, as their treatment can be non-Bayesian, but they represent less than 5% of the selected papers. On the other hand, some influential researchers use *probabilistic* as a synonym for *Bayesian* (see e.g. page xxvii of the preface of [Mur12]), but we have refrained from counting such occurrences to avoid an unfair advantage. Blue histograms show the total number of accepted papers in each issue. In both venues, Bayesian papers represent around 15% of all accepted papers each year. For comparison, the number of abstracts mentioning *deep* or *neural net* at NeurIPS is shown in red in Figure (c). The recent increase in accepted papers is seen to be largely explained by the neural network literature, after a decade of absence known in the community as the *winter of neural networks*. In contrast, the Bayesian trend is stable over the years. Without appealing to statistical ideology, such a stable trend has a good claim to be represented in any ML curriculum, which is what motivated our course proposal.

To get an overview of what Bayesian machine learning papers tackle, Rémi used latent Dirichlet allocation [BlNgJo03] to extract 10 topics out of all 1102 NeurIPS and JMLR *Bayesian* abstracts. The python code is available on Github. Here are the results:

1- (BNP) model models data process latent Bayesian Dirichlet hierarchical nonparametric inference
2- features learn problem different knowledge learning image object example examples
3- method neural Bayesian using linear state based kernel approach model
4- belief propagation nodes local tree posterior node given algorithm
5- learning data Bayesian model training classification performance selection prediction sets
6- (Comp) inference Monte Carlo Markov sampling variational time algorithm MCMC approximate
7- function optimization algorithm optimal learning problem gradient methods bounds state
8- learning networks variables structure network Bayesian EM paper distribution algorithm
9- (Regr) Bayesian gaussian prior regression non estimation likelihood sparse parameters matrix
10- model information Bayesian human visual task probability sensory prior concept

Recognizable topics stand out, such as topic 1- on *Bayesian nonparametric models*, topic 6- on *computational methods for Bayesian inference*, and topic 9- on *Bayesian regression*. Some topics are too young to appear in this analysis: deep neural networks are booming, see Figure (c), and they are used in ever more risk-sensitive applications (self-driving cars, medical diagnosis, financial predictions), so that there is a clear need for auditable decisions based on quantified uncertainty. This is precisely what the Bayesian paradigm claims to offer, at a computational cost that is usually prohibitive for deep nets (so far).
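
For readers curious about the mechanics behind such topic extraction, here is a minimal collapsed Gibbs sampler for LDA in plain numpy (a didactic sketch; the analysis above used a different Python implementation, available on Github as mentioned):

```python
import numpy as np

def lda_collapsed_gibbs(docs, n_topics, n_vocab, n_iters=100,
                        alpha=0.1, beta=0.01, rng=None):
    """Collapsed Gibbs sampler for latent Dirichlet allocation [BlNgJo03].
    docs: list of documents, each a list of word indices in 0..n_vocab-1.
    Returns the topic-word count matrix; the top words of topic k can be
    read off with np.argsort(-nkw[k])."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    ndk = np.zeros((len(docs), n_topics))  # document-topic counts
    nkw = np.zeros((n_topics, n_vocab))    # topic-word counts
    nk = np.zeros(n_topics)                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove current assignment, then resample it
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw
```

On real abstracts one would first map words to indices with a vocabulary; here the documents are assumed to be pre-tokenized lists of integers.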

- Bayesian deep learning workshops at NeurIPS
- Symposia on Advances in Approximate Bayesian Inference
- The Case for Bayesian Deep Learning, by Andrew Gordon Wilson

[Mur12] Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.

[BlNgJo03] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

At the dawn of 2020, in case anyone in the stat/ML community is not yet aware of Francis Bach’s blog, started last year: it is a great place to learn about general tricks in machine learning, explained in simple words. This month’s post, *The sum of a geometric series is all you need!*, shows how ubiquitous geometric series are in stochastic gradient descent, among other places. In this post, I describe yet another situation where the sum of a geometric series can be useful in statistics.

I also found the sum of a geometric series useful for turning a sum of expectations into a single expectation, by linearity of expectation. More specifically, for a random variable $X$ compactly supported on $[0,1]$,

$\sum_{k=0}^{\infty} \mathbb{E}[X^k] = \mathbb{E}\left[\frac{1}{1-X}\right],$

where the sum can be infinite. Let us specify this with Beta variables: let $X \sim \mathrm{Beta}(\alpha, \beta)$ with positive parameters $\alpha, \beta$. Then the *k*-th moment of $X$ (for a positive integer $k$) is

$\mathbb{E}[X^k] = \frac{(\alpha)_k}{(\alpha + \beta)_k},$

where $(x)_k$ denotes the Pochhammer symbol, or rising factorial, $(x)_k = x(x+1)\cdots(x+k-1)$.

On the one hand, the sum doesn’t look like an easy-going guy; on the other hand, turning it into a single expectation reveals a more amenable expression, since we can use the result

$\mathbb{E}\left[\frac{1}{1-X}\right] = \frac{1}{B(\alpha, \beta)} \int_0^1 x^{\alpha - 1} (1-x)^{\beta - 2} \, dx = \frac{B(\alpha, \beta - 1)}{B(\alpha, \beta)}$

for any $\alpha > 0$ and $\beta > 1$, where $B$ denotes the beta function. We conclude that the required series has a finite limit for $\beta > 1$, in which case

$\sum_{k=0}^{\infty} \frac{(\alpha)_k}{(\alpha + \beta)_k} = \frac{\alpha + \beta - 1}{\beta - 1},$

which wasn’t trivial to me a priori.
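
A quick numerical sanity check of this identity, for the (arbitrary) values $\alpha = 2$, $\beta = 3$:

```python
from math import isclose

def pochhammer(x, k):
    """Rising factorial (x)_k = x (x+1) ... (x+k-1), with (x)_0 = 1."""
    out = 1.0
    for j in range(k):
        out *= x + j
    return out

alpha, beta = 2.0, 3.0
# Accumulate sum_{k>=0} (alpha)_k / (alpha+beta)_k via the ratio of successive
# terms, to avoid overflowing the rising factorials for large k.
term, partial = 1.0, 0.0
for k in range(400):
    partial += term
    term *= (alpha + k) / (alpha + beta + k)
closed_form = (alpha + beta - 1.0) / (beta - 1.0)   # equals 2 here
```

The terms decay like $k^{-\beta}$, so a few hundred of them already match the closed form to a couple of decimal places.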

In joint work with Caroline Lawless, who was an intern at Inria Grenoble Rhône-Alpes in 2018 (and is now a Ph.D. candidate between Oxford and Paris-Dauphine), we proposed a simple proof of the Pitman–Yor Chinese restaurant process from its stick-breaking representation, thus generalizing a recent proof for the Dirichlet process by Jeff Miller. One of the technical lemmas there, the one displayed in the top picture, we proved by (backward) induction using the identity obtained in the section above.

By the way, the stick-breaking construction is a distribution over the infinite simplex, i.e. a specific way to randomly break a stick of unit length into infinitely many pieces $(p_1, p_2, \ldots)$, with positive $p_i$s, defined by $p_1 = V_1$ and $p_i = V_i \prod_{j=1}^{i-1} (1 - V_j)$ for $i \geq 2$, where the $V_i$s are iid beta random variables. Which sounds like a stochastic version of Zeno’s paradox mentioned by Francis.
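
In code, the stick-breaking construction is essentially a cumulative product; here is a sketch truncated at a finite number of sticks, with generic Beta(a, b) weights:

```python
import numpy as np

def stick_breaking(a, b, n_sticks, rng):
    """Truncated stick-breaking: p_1 = V_1 and p_i = V_i * prod_{j<i} (1 - V_j),
    with V_i iid Beta(a, b). Returns the first n_sticks weights."""
    v = rng.beta(a, b, size=n_sticks)
    # length of stick remaining before each break: 1, (1-V_1), (1-V_1)(1-V_2), ...
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * leftover

rng = np.random.default_rng(0)
weights = stick_breaking(1.0, 1.0, 200, rng)
```

Each weight is a Beta fraction of whatever is left of the stick; Zeno-style, the leftover piece shrinks geometrically but never quite vanishes, so the truncated weights sum to slightly less than one.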

Thanks for reading, and happy 2020 to all!

Julyan

Long time no see, Statisfaction!

I’m glad to write about my habilitation entitled *Bayesian statistical learning and applications* I defended yesterday at Inria Grenoble. This *Habilitation à Diriger des Recherches* (HDR) is the highest degree issued through a university examination in France. If I am to believe official texts, the HDR recognizes a candidate’s “high scientific level, the originality of approach in a field of science, ability to master a research strategy in a sufficiently broad scientific or technological field and ability to supervise young researchers”. No less! In French academia, HDR is a necessary condition for supervising a Ph.D., being a reviewer for a Ph.D., and applying to prof positions. Part of the process consists of writing a manuscript summarizing the research done since the Ph.D. Mine is organized in three parts:

**Chapter 1. Bayesian Nonparametric Mixture Modeling**. This chapter is devoted to mixture models that are central to Bayesian nonparametrics. It focuses on BNP mixture models for (i) survival analysis, (ii) image segmentation and (iii) ecotoxicological applications.

**Chapter 2. Approximate Bayesian Inference**. This chapter is concerned in large parts with computational aspects of Bayesian inference. It describes: (i) conditional approaches in the form of truncation-based approximations for the Pitman–Yor process and for completely random measures, (ii) a marginal approach based on approximations of the predictive distribution of Gibbs-type processes, (iii) an approximate Bayesian computation (ABC) algorithm using the energy distance as data discrepancy.

**Chapter 3. Distributional Properties of Statistical and Machine Learning Models**. This chapter is concerned with general distributional properties of statistical and machine learning models, including (i) sub-Gaussian and sub-Weibull properties, (ii) a prior analysis of Bayesian neural networks and (iii) theoretical properties of asymmetric copulas.

I defended with Alessandra Guglielmi (Politecnico di Milano), Éric Moulines (École polytechnique & Académie des Sciences), and Yee Whye Teh (University of Oxford & DeepMind) as reviewers, and Stéphane Girard (Inria), Anatoli Juditsky and Adeline Samson (both Université Grenoble Alpes) as examiners. A word cloud made of the authors cited in my manuscript bibliography shows how much my research directions have been shaped during my postdoc in Italy.

I reproduce here the (lengthy) acknowledgments from my manuscript that owes a lot to many people. I am very grateful to Alessandra Guglielmi, Éric Moulines and Yee Whye Teh for agreeing to be rapporteurs. It is an honor and a pleasure to know that one’s work is read and appreciated by researchers we highly think of. Thank you for devoting time to this manuscript, and for traveling to Grenoble in a period full of teachings (and what’s more, the day after a half-marathon for one of you). Alessandra, thank you for agreeing to open the ball with a presentation. Thank you, Adeline Samson, Anatoli Juditsky and Stéphane Girard for agreeing to be part of the jury.

Ghislaine Gayraud and Judith Rousseau, as well as Kerrie Mengersen: you showed me by example during my Ph.D. thesis how exciting research is; also, and above all, thank you for supporting me in difficult times.

Igor Prünster, thank you for your confidence, and for the opportunity to join the Collegio Carlo Alberto, first in Turin and then in Milan, for a long postdoc made of new collaborations, (species) discoveries, and a new culture. The supervision, leaving a lot of room for autonomy, has allowed me to develop deep links with my colleagues at the Collegio, Antonio Lijoi, Bernardo Nipoti, Guillaume Kon Kam King, Stefano Favaro, Pierpaolo De Blasi, Matteo Ruggiero, Antonio Canale, Bertrand Lods, and Giovanni Pistone.

I thank the Fellowship Selection Committee of the Alan Turing Institute for not ranking me. This allowed me a posteriori to know the Brexit only “from the outside”.

Thank you, Florence Forbes, Stéphane Girard and Jean-Baptiste Durand for your welcome to Inria and for supporting my application, as well as Judith Rousseau, Kerrie Mengersen, Christian Robert, Peter Müller and Igor Prünster. My position at Inria and my collaborations and discussions with you and others in Grenoble, including Emmanuel Barbier, Michel Dojat, Alexis Arnaud, Pablo Mesejo, Jakob Verbeek, Julien Mairal, Maria Laura Delle Monache, Eugenio Cinquemani, Wilfried Thuiller, Sophie Achard, Simon Barthelmé, Michael Blum, Adeline Samson, have allowed me to broaden my interests in statistics: extreme values, copulas, graphical models including Markov fields, deep learning, expectation-propagation (yes Simon, I should program it one day to understand EP) as well as applications in neuroimaging, road traffic, ecology such as joint species distribution models, etc. Jakob, Wilfried, and Eugenio: thank you for having kindly, but repeatedly, encouraged me to take the HDR.

I have had the opportunity to exchange and collaborate with many students over the past two years at Inria, Marta Crispino, Hongliang Lü, Riccardo Corradin, Łukasz Rajkowski, Mariia Vladimirova, Verónica Muñoz Ramírez, Fabien Boux, Caroline Lawless, Aleksandra Malkova, Michał Lewandowski, Daria Bystrova, Giovanni Poggiato, Sharan Yalburgi. I appreciate your dynamism, and I know I’m lucky to have (had) you here at Inria. I also learned a lot in our discussions during the visits of more senior colleagues, Bernardo Nipoti (present at Mistis “every other week” thanks to Ulysses, among others), Guillaume Kon Kam King, Olivier Marchal, Rémi Bardenet, Jean-Bernard Salomond, Botond Szabó, Eric Marchand, Robin Ryder, Hien Nguyen, Nicolas Lartillot, Alisa Kirichenko, and Matteo Sesia.

Thank you Stephen Walker and Peter Müller for your welcome to UT Austin in the spring of 2017, these three months have been extremely productive; thank you Matti Vihola, Éric Parent, Didier Fraix-Burnet for your invitations to present Bayesian statistics courses in summer schools, and Richard Nickl, Hanne Kekkonen, Fabrizio Ruggeri, Bernardo Nipoti, Roberto Cassarin, Raffaele Argiento, Pierre Chainais, Alice Cleynen, François Sillion, Kerrie Mengersen, Antonio Lijoi, Matteo Ruggiero, Mame Diarra Fall, Bruno Gaujal, Jim Griffin, Silvia Montagna, Fabrizio Leisen, Jean-François Cœurjolly, Eric Marchand, Rebecca Steorts, Anne-Laure Fougères, Adeline Samson, Bas Kleijn, Sara Wade, Aurore Lavigne, Jean-Bernard Salomond, Célestin Kokonendji, Florence Forbes, Christophe Biernacki, Igor Prünster, Nicolas Chopin, François Caron, Michele Guindani, for your invitations to present at conferences or seminars. Thank you Michele Guindani and Hien Nguyen for inviting me to join the editorial boards of great journals. Thank you Pierre Jacob, Jérôme Le and Robin Ryder for putting Statisfaction into orbit in 2010 (what, it’s already been nine years?!?). This blog has been an excellent means of expression during my Ph.D. thesis and still is today.

I would particularly like to thank my co-authors, all of whom are already mentioned above, whose texts are included in the HDR manuscript, sometimes without them even knowing. I have learned a lot about your subjects, but above all these joint projects have been unforgettable moments spent together.

Hi all,

The paper “Unbiased Markov chain Monte Carlo with couplings”, co-written with John O’Leary and Yves Atchadé, has been accepted as a read paper in JRSS Series B, to be presented on December 11 at the Royal Statistical Society. Comments (400 words max) can be submitted until two weeks after, that is December 28; see the instructions here. The main contribution is the removal of the burn-in bias of Markov chain Monte Carlo (MCMC) estimators, using coupled chains. We argue that 1) the burn-in bias can be removed for a controllable loss of efficiency, and 2) this can be done in many settings, for low- or high-dimensional, discrete or continuous target distributions, provided that the underlying MCMC algorithm works well enough. This might have consequences for the use of parallel processors for MCMC, but also for the estimation of Monte Carlo errors. We believe that there are a lot of open questions related to our work, on the theoretical, methodological and applied sides, and hope that the discussion will be interesting.

All the ingredients of our approach were already in the literature, from coupling of Markov chains to debiasing techniques, to the analysis of budget-constrained simulations, but we also propose multiple contributions. Below, I summarize the main methodological contributions of the paper, the way I see it at least, and I try to contrast our approach with those of Glynn & Rhee (2014) and Agapiou, Roberts & Vollmer (2018).

The problem is to obtain unbiased estimators of $\pi(h) = \int h(x) \, \pi(dx)$, some expectation of interest with respect to a target distribution $\pi$. That distribution is assumed to be the limiting distribution of a Markov chain $(X_t)_{t \geq 0}$, with Markov kernel $P$ and initial distribution $\pi_0$. Under some assumptions we have a law of large numbers,

$\frac{1}{t - b} \sum_{s = b + 1}^{t} h(X_s) \longrightarrow \pi(h)$

for any choice of “burn-in” $b$, as $t$ goes to infinity, e.g. in probability. This justifies the use of ergodic averages as estimators of $\pi(h)$. However, for fixed $b$ and $t$ the estimator is biased: its expectation is not equal to the quantity of interest. Thus averages of independent replicates of that estimator would not consistently estimate $\pi(h)$. This is particularly unfortunate in the context of parallel computing, as explained here, and has motivated a lot of research, e.g. this and that. Parallel computation was also our main motivation for trying to remove the bias, even though we realized along the way that there are other benefits to unbiased estimators.

The groundbreaking paper by Glynn & Rhee (2014), hereafter GR, shows how to obtain unbiased estimators using pairs of Markov chains. The construction involves two Markov chains $(X_n)$ and $(Y_n)$, both with limiting distribution $\pi$, and an integer-valued “*truncation*” random variable $N$, in order to compute

$h(X_0) + \sum_{n=1}^{N} \frac{h(X_n) - h(Y_{n-1})}{\mathbb{P}(N \geq n)}.$

Under various conditions on the pair of chains and on the truncation variable, the above random variable can indeed be an unbiased estimator of $\pi(h)$, with finite variance and finite expected cost. A poor choice of truncation variable might make the variance or the expected cost of the estimator infinite. A wise choice might or might not require hard work, depending on how the chains are coupled; for instance, in their Section 3 on Harris recurrent chains, the coupling construction is sophisticated, but the truncation variable can then be set e.g. to infinity almost surely.

The GR paper is excellent in all ways: creative, clear, insightful; it was hugely inspirational for me and, I believe, for many others. I had used the technique previously for smoothing in state space models, with Fredrik Lindsten and Thomas Schön. What GR achieve was simply considered impossible by many people, even by people familiar with some of the “debiasing” tricks, e.g. Russian roulette. The GR paper does not itself claim that the proposed technique is generally applicable in MCMC contexts, nor that you can generally control the loss of efficiency compared to standard ergodic averages. Yet there is no doubt that the authors understood the importance of their paper (the abstract starts with “*We introduce a new class of Monte Carlo methods, which we call exact estimation algorithms.*”). The numerical experiments concern two examples: an (intriguing) example of a non-irreducible Markov chain, and an M/M/1 queue. Their experiments illustrate clearly that standard Monte Carlo rates are achieved in the asymptotic limit of the number of independent replicates; in other words, they can estimate quantities of interest by running many pairs of coupled chains, each for a random but finite amount of time. A new world!

Another paper is that of Agapiou, Roberts & Vollmer (2018). The proposed estimators are of a similar form to those of GR. In particular there is a truncation variable $N$ to be chosen by the user. If $N$ is chosen poorly, the resulting estimators might have an infinite variance. Unfortunately, to choose $N$ appropriately, one needs to know the “contraction rate” of the coupled chains: roughly, the expected decrease in the distance between two chains after a certain number of steps, when the chains are simulated from a coupled transition kernel, wherever the two chains start. In Section 6 of the paper, the authors show how to approximate this contraction rate in a three-dimensional logistic regression example. Even with an appropriate choice of truncation variable, the proposed estimator (just as the one in GR) has no reason to achieve an efficiency similar to that of an ergodic average. So in some cases it might be competitive, and in some cases it might not be. Section 6 discusses this in detail with two examples: an AR(1) process, and the logistic regression example mentioned above. The other sections of the paper concern target distributions defined on infinite-dimensional spaces, with applications to Bayesian inverse problems, and are thus less related to our paper.

Now what do we contribute in our paper? For the sake of adding some novelty here I’ll blend the estimator of the unbiased MCMC paper with the L-lag construction of a recent follow-up paper.

- Sample $X_0 \sim \pi_0$ and then $X_t \sim P(X_{t-1}, \cdot)$ for $t = 1, \ldots, L$. Sample $Y_0 \sim \pi_0$.
- Sample $(X_t, Y_{t-L})$ from a coupled Markov kernel $\bar{P}$ given $(X_{t-1}, Y_{t-L-1})$, for $t = L+1, \ldots$, where $L$ is a user-chosen integer, and $\tau$ is the
*meeting time*, defined as $\tau = \inf\{t \geq L : X_t = Y_{t-L}\}$.

The above construction presupposes that $\tau$ is almost surely finite, i.e. that it is possible, in realistic cases, to make two Markov chains meet exactly. It has been known since at least this 1998 paper that this can be done in non-trivial MCMC settings, and our paper gathers various techniques from the literature to design such couplings (I personally consider our effort to collect these coupling recipes a contribution in itself). The benefit of this construction is, firstly, the absence of truncation variables, and thus no problems associated with choosing them. Secondly, from the constructed chains we can define the following estimator:

$H_{k:m} = \frac{1}{m-k+1} \sum_{t=k}^{m} h(X_t) + \sum_{t=k+L}^{\tau - 1} \min\left(1, \frac{\lceil (t-k)/L \rceil}{m-k+1}\right) \left(h(X_t) - h(Y_{t-L})\right).$

Here $k \leq m$ and $L$ are tuning parameters; this estimator is essentially the same as the one proposed in the context of coupled conditional particle filters by Fredrik Lindsten, Thomas Schön and myself here (Section 4.5), although I personally did not see the generality of it at the time. The first term is a standard MCMC average, as if we had discarded the first $k$ iterations as burn-in and run the chain for $m$ iterations in total. The second term is a bias correction, which vanishes as $k$ increases; the sum is defined to be zero in the event $\tau - 1 < k + L$. Appropriate choices of $k$, $m$ and $L$ can make the efficiency of the above estimator as close as desired to that of the underlying ergodic average; a theoretical result in that direction can be found in Section 3 of our paper.
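
To make the construction concrete, here is a self-contained Python sketch with lag $L = 1$ on a toy Normal target: a maximal coupling of the Normal proposals, a coupled MH kernel driven by a common uniform, and the resulting estimator with burn-in $k$ and horizon $m$ (a didactic illustration, not the R scripts used for the paper):

```python
import numpy as np

def max_coupling_normals(mu1, mu2, sigma, rng):
    """Sample (X, Y) with X ~ N(mu1, s^2), Y ~ N(mu2, s^2), maximizing P(X = Y)."""
    logpdf = lambda u, mu: -0.5 * ((u - mu) / sigma) ** 2
    x = mu1 + sigma * rng.normal()
    if np.log(rng.uniform()) + logpdf(x, mu1) <= logpdf(x, mu2):
        return x, x
    while True:
        y = mu2 + sigma * rng.normal()
        if np.log(rng.uniform()) + logpdf(y, mu2) > logpdf(y, mu1):
            return x, y

def mh_step(x, log_target, sigma, rng):
    prop = x + sigma * rng.normal()
    return prop if np.log(rng.uniform()) < log_target(prop) - log_target(x) else x

def coupled_mh_step(x, y, log_target, sigma, rng):
    """Advance both chains with maximally coupled proposals and a common uniform,
    so that once the chains are equal they remain equal."""
    xp, yp = max_coupling_normals(x, y, sigma, rng)
    logu = np.log(rng.uniform())
    x = xp if logu < log_target(xp) - log_target(x) else x
    y = yp if logu < log_target(yp) - log_target(y) else y
    return x, y

def unbiased_estimator(h, log_target, sigma, k, m, rng):
    """Lag-1 unbiased estimator: ergodic average over iterations k..m plus a
    bias-correction term involving the coupled chain (k >= 1 assumed)."""
    x, y = rng.normal(), rng.normal()       # X_0, Y_0 ~ pi_0 = N(0, 1)
    x = mh_step(x, log_target, sigma, rng)  # X_1: the X-chain is one step ahead
    hx, hy = {1: h(x)}, {0: h(y)}
    tau = 1 if x == y else None
    t = 1
    while tau is None or t < max(m, tau):
        t += 1
        x, y = coupled_mh_step(x, y, log_target, sigma, rng)
        hx[t], hy[t - 1] = h(x), h(y)
        if tau is None and x == y:
            tau = t                         # meeting time: X_tau = Y_{tau - 1}
    avg = sum(hx[s] for s in range(k, m + 1)) / (m - k + 1)
    correction = sum(min(1.0, (s - k) / (m - k + 1)) * (hx[s] - hy[s - 1])
                     for s in range(k + 1, tau))
    return avg + correction, tau
```

Averaging many independent replicates of `unbiased_estimator` then estimates $\pi(h)$ without burn-in bias, at standard Monte Carlo rates.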

To end on a concrete note, and in the hope of making our methodological contribution as clear as possible, I revisited the two examples of Section 6 in Agapiou, Roberts & Vollmer (2018), and applied our proposed estimators. Doing so, I did not need any knowledge of the contraction rate of the coupled Markov kernels, and I could achieve an efficiency close to that of ergodic averages. R scripts are available here:

- https://github.com/pierrejacob/statisfaction-code/blob/master/2019-10-AR-inefficiency.R
- https://github.com/pierrejacob/statisfaction-code/blob/master/2019-10-logisticregression-pCN.R


Hi all,

This post describes how unbiased MCMC can help in approximating expectations with respect to “BayesBag”, an alternative to standard posterior distributions mentioned in Peter Bühlmann’s discussion of Big Bayes Stories (which was a special issue of Statistical Science). Essentially, BayesBag is the result of “bagging” applied to “Bayesian inference”. In passing, here is an R script implementing this for a model written in the Stan language (as in this previous post), namely a Negative Binomial regression, using a pure R implementation of unbiased HMC (joint work with Jeremy Heng). The script produces the following figure:

which shows, for two parameters of the model, the cumulative distribution function (CDF) under standard Bayes (thin blue line) and under BayesBag (wider red line). BayesBag results in distributions on the parameter space that are more “spread out” than standard Bayes.

So what is BayesBag? Let’s quote from Bühlmann’s discussion:

We can stabilize the posterior distribution by using a bootstrap and aggregation scheme, in the spirit of bagging (Breiman, 1996b). In a nutshell, denote by $D^*$ a bootstrap- or subsample of the data $D$. The posterior of the random parameters $\theta$ given the data $D^*$ has c.d.f. $F_{\theta|D^*}$, and we can stabilize this using

$F^{\mathrm{BB}}_{\theta|D}(\cdot) = \mathbb{E}^*\left[F_{\theta|D^*}(\cdot)\right],$

where $\mathbb{E}^*$ is with respect to the bootstrap- or subsampling scheme. We call it the *BayesBag* estimator. It can be approximated by averaging over B posterior computations for bootstrap- or subsamples, which might be a rather demanding task (although say B=10 would already stabilize to a certain extent).
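
Before involving MCMC at all, the definition can be illustrated in Python on a conjugate toy model, where each bootstrap posterior c.d.f. is available in closed form (hypothetical data and prior; $y_i \sim N(\theta, 1)$ with a $N(0, 100)$ prior on $\theta$):

```python
import numpy as np
from math import erf, sqrt

def posterior_cdf(theta, data, prior_var=100.0):
    """Exact posterior c.d.f. of theta under y_i ~ N(theta, 1), theta ~ N(0, prior_var)."""
    n = len(data)
    post_var = 1.0 / (n + 1.0 / prior_var)
    post_mean = post_var * np.sum(data)
    return 0.5 * (1.0 + erf((theta - post_mean) / sqrt(2.0 * post_var)))

def bayesbag_cdf(theta, data, n_bootstrap, rng):
    """BayesBag c.d.f.: average of posterior c.d.f.s over bootstrap resamples."""
    resampled = (rng.choice(data, size=len(data), replace=True)
                 for _ in range(n_bootstrap))
    return float(np.mean([posterior_cdf(theta, d) for d in resampled]))

rng = np.random.default_rng(3)
data = rng.normal(loc=1.0, size=20)                    # hypothetical data set
post_var = 1.0 / (len(data) + 1.0 / 100.0)
point = post_var * data.sum() + 2.0 * sqrt(post_var)   # posterior mean + 2 sd
std_cdf = posterior_cdf(point, data)                   # Phi(2), about 0.977
bb_cdf = bayesbag_cdf(point, data, 200, rng)
```

The BayesBag c.d.f. evaluated two posterior standard deviations above the mean is closer to 1/2 than the standard Bayes one: the bagged distribution is more spread out, as in the figure above.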

Indeed, in the usual MCMC way, we would have to run an MCMC algorithm given each bootstrapped data set, and repeat B times. Each MCMC run is only asymptotically consistent in its number of iterations. So to approximate BayesBag consistently one would need both B and the number of iterations per data set to go to infinity. This is awkward: for instance, suppose that you chose some B and some number of iterations per chain, and obtained some result. You would next like to obtain a more precise result, to be sure. What do you do? Increase B, or increase the number of iterations per chain? This seems like a difficult choice.

This is where unbiased MCMC might be handy: since BayesBag is defined as an average over posterior distributions, it is very simple to obtain unbiased estimators with respect to BayesBag itself by

- sampling a data set $D^*$ by bootstrapping from the original data $D$,
- obtaining an unbiased approximation of the posterior given $D^*$ with unbiased MCMC.

See the R script for an implementation. By the law of total expectation this produces unbiased approximations of BayesBag; we can then repeat B times and let B go to infinity. So if we want to refine our results, we can just increase B. The same idea works for the “cut distribution” as illustrated in Section 5.5 of the unbiased MCMC paper (see this previous post).

It thus appears that unbiased MCMC for BayesBag costs about the same computational effort as unbiased MCMC for standard posteriors: the only difference is that the data are bootstrapped before each pair of chains is generated. This would of course work with other ways of obtaining unbiased estimators or even perfect samples.

Hi all,

On top of recommending the excellent autobiography of Stanislaw Ulam, this post is about using the software Stan, not directly to perform inference, but instead to obtain R functions that evaluate a target’s probability density function and its gradient. With these, one can implement custom methods while still benefiting from the great work of the Stan team on the “modeling language” side. As a proof of concept, I have implemented a plain Hamiltonian Monte Carlo sampler for a random effect logistic regression model (taken from a course on Multilevel Models by Germán Rodríguez), a coupling of that HMC algorithm (as in “Unbiased Hamiltonian Monte Carlo with couplings”, see also this very recent article on the topic of coupling HMC), and then upper bounds on the total variation distance between the chain and its limiting distribution, as in “Estimating Convergence of Markov chains with L-Lag Couplings”.

The R script is here: https://github.com/pierrejacob/statisfaction-code/blob/master/2019-09-stan-logistic.R and is meant to be as simple as possible, and self-contained; warning, this is all really proof of concept and not thoroughly tested.

Basically the R script starts like a standard script that would use rstan for inference; it runs the default algorithm of Stan for a little while, then extracts some info from the “stanfit” object. With these, a pure R implementation of TV upper bounds for a naive HMC algorithm follows, that relies on functions called “stan_logtarget” and “stan_gradlogtarget” to evaluate the target log-pdf and its gradient.
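
The HMC part only needs those two functions. Here is a minimal sketch, in Python rather than R, where `logtarget` and `gradlogtarget` play the roles of the script’s “stan_logtarget” and “stan_gradlogtarget” (a toy bivariate Normal target stands in for the Stan model):

```python
import numpy as np

def hmc_step(x, logtarget, gradlogtarget, step_size, n_leapfrog, rng):
    """One plain Hamiltonian Monte Carlo transition, given functions evaluating
    the target log-density and its gradient."""
    p0 = rng.normal(size=x.shape)                    # momentum refreshment
    xnew, p = x.copy(), p0.copy()
    p = p + 0.5 * step_size * gradlogtarget(xnew)    # leapfrog integration
    for _ in range(n_leapfrog - 1):
        xnew = xnew + step_size * p
        p = p + step_size * gradlogtarget(xnew)
    xnew = xnew + step_size * p
    p = p + 0.5 * step_size * gradlogtarget(xnew)
    # accept or reject based on the change in total energy
    log_accept = (logtarget(xnew) - 0.5 * p @ p) - (logtarget(x) - 0.5 * p0 @ p0)
    return xnew if np.log(rng.uniform()) < log_accept else x

# toy run on a standard bivariate Normal target
logtarget = lambda v: -0.5 * v @ v
gradlogtarget = lambda v: -v
rng = np.random.default_rng(0)
x = np.zeros(2)
samples = np.empty((2000, 2))
for i in range(2000):
    x = hmc_step(x, logtarget, gradlogtarget, 0.25, 10, rng)
    samples[i] = x
```

Swapping in Stan-provided log-density and gradient functions (as the R script does) changes nothing in the sampler itself.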

The script takes a few minutes to run in total. Some time is first needed to compile the Stan code, and to run Stan for a few steps. Then some time is spent towards the end of the script to generate 250 independent meeting times with a lag of 500 between the chains; the exact run time will of course depend a lot on your number of available processors (on my machine it takes around one minute). The script produces this plot:

This plot suggests that vanilla HMC as implemented in the script converges in less than 1000 iterations to its stationary distribution. This is probably quite conservative, but it’s still usable.

In passing, upon profiling the code of the function that generates each meeting time, it appears that half of the time is spent in Stan’s “grad_log_prob” function (which computes the gradient of the log pdf of the target). This implies that not much efficiency is lost by coding the algorithms in pure R, at least for this model.

Hi all,

Niloy Biswas (PhD student at Harvard) and I have recently arXived a manuscript on the assessment of MCMC convergence (using couplings!). Here I’ll describe the main result, and some experiments (that are not in the current version of the paper) revisiting a 1996 paper by Mary Kathryn Cowles and Jeff Rosenthal entitled “A simulation approach to convergence rates for Markov chain Monte Carlo algorithms“. Code in R producing the figures of this post is available here.

Let $(X_t)_{t \geq 0}$ be a Markov chain generated by your favorite MCMC algorithm. Introduce an integer $L \geq 1$ referred to as the lag. Introduce a second chain $(Y_t)_{t \geq 0}$, such that the two chains satisfy the following properties:

- at all times $t \geq 0$, the distributions of $X_t$ and $Y_t$ are identical,
- there is a random variable $\tau$, taking values in the positive integers and termed the
*meeting time*, such that $X_t = Y_{t-L}$ for all times $t \geq \tau$.

The chains are coupled “faithfully”, in the sense that they meet at some point and stay together after that, up to the time lag. The main result of the paper is as follows; see the manuscript for a more general result and the assumptions. Denote by $d_{\mathrm{TV}}(t)$ the total variation distance (TV) between the marginal distribution of $X_t$ and its limiting distribution. The TV measures the maximum difference, over all measurable sets, between the probabilities assigned by two distributions. Then, at all times $t \geq 0$,

$d_{\mathrm{TV}}(t) \leq \mathbb{E}\left[\max\left(0, \left\lceil \frac{\tau - L - t}{L} \right\rceil\right)\right].$

The right-hand side above is an expectation of a simple function of $\tau$. In there, $L$ is the user-chosen lag, $\tau$ is a random variable which we can sample (many times, in parallel), and $t$ is the time index for which we want to bound $d_{\mathrm{TV}}(t)$. The notation $\lceil \cdot \rceil$ stands for the ceiling function, i.e. the smallest integer greater than or equal to the input. Therefore, we can approximate the expectation by empirical averages, using independent copies of the meeting time. This extends an idea mentioned in the discussion section of the unbiased MCMC paper, previously discussed here. The use of a lag greater than one helps obtain sharper bounds, by reducing the number of triangle inequalities employed to obtain them.
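
The empirical version of the bound is a one-liner. Here is a Python sketch, with a placeholder array standing in for meeting times actually sampled from coupled chains:

```python
import numpy as np

def tv_upper_bound(meeting_times, lag, t):
    """Empirical estimate of E[max(0, ceil((tau - lag - t) / lag))], an upper
    bound on the TV distance between the chain at time t and its limit."""
    taus = np.asarray(meeting_times, dtype=float)
    return float(np.mean(np.maximum(0.0, np.ceil((taus - lag - t) / lag))))

# placeholder meeting times (in practice: sampled i.i.d. by running coupled chains)
taus = np.array([120, 180, 260, 310, 450])
bounds = [tv_upper_bound(taus, lag=100, t=t) for t in range(0, 600, 50)]
```

The resulting curve is non-negative, non-increasing in $t$, and hits zero once $t$ exceeds the largest observed meeting time minus the lag.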

How useful are the resulting bounds? *For one thing*, the TV is always at most one, so hopefully the proposed bounds are also sometimes less than one…! We can check that the bounds go to zero as $t$ increases. Phew! *Second*, the upper bounds depend on the choice of coupling, whereas the TV of interest does not. So if we choose our couplings poorly, we get poor bounds. Designing couplings of Markov chains such that they meet quickly is generally considered very fun. *Third*, since we eventually approximate expectations by empirical averages, we might obtain dramatically over-confident bounds if the meeting time has a large variance, which is very sad.

The manuscript presents numerical results indicating that the bounds can, in fact, be practical. To strengthen this point, the figures of this blog entry are based on the three examples in “A simulation approach to convergence rates for Markov chain Monte Carlo algorithms”. In that paper, three Gibbs samplers are considered, targeting posterior distributions in two Bayesian hierarchical models and a Bayesian probit regression. The authors estimate numerically some constants appearing in certain drift and minorization conditions, which leads to upper bounds on the TV. In Example 1, the authors can compare with analytical bounds previously obtained by Jeff Rosenthal. In Examples 2 and 3, such analytical results would be tedious to obtain, but the authors can still obtain practical upper bounds. Concretely, these bounds suggest how many iterations are enough to guarantee a small TV for Examples 2 and 3.

From the figures, above and below, our proposed upper bounds suggest that the Gibbs chains converge in fewer than 500 iterations in all cases (fewer than 4 steps in Example 1!). You can also check from the R code that the implementation is simple, provided that you are aware of maximal couplings. The script also provides rudimentary checks that the bounds are sensible, by comparing visually the marginals of the target, approximated either with 1) many “short” chains run for $t$ steps, with $t$ chosen according to the figures, or with 2) a “long” chain and a conservative burn-in.
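If you have not seen maximal couplings before, here is a minimal sketch in Python (the post's code is in R; the function names are hypothetical) of the standard rejection sampler for a maximal coupling of two distributions $p$ and $q$, given through their samplers and log-densities. It returns a pair $(X, Y)$ with the prescribed marginals while maximizing the probability of the event $\{X = Y\}$, which is what gives coupled chains a chance to meet:

```python
import math
import random

def maximal_coupling(p_sample, p_logpdf, q_sample, q_logpdf, rng=random):
    """Sample (X, Y) with X ~ p and Y ~ q such that P(X = Y) = 1 - TV(p, q),
    the largest meeting probability achievable by any coupling of p and q."""
    x = p_sample()
    # With probability min(1, q(x) / p(x)), output the common value (x, x)
    if rng.random() <= math.exp(min(0.0, q_logpdf(x) - p_logpdf(x))):
        return x, x
    # Otherwise, sample y from the part of q that does not overlap with p
    while True:
        y = q_sample()
        if rng.random() > math.exp(min(0.0, p_logpdf(y) - q_logpdf(y))):
            return x, y
```

Applying this to the proposal distributions of two Metropolis–Hastings chains lets them propose the same point with positive probability, and hence eventually meet.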

The proposed method is very closely related to that of Valen Johnson, “Studying Convergence of Markov Chain Monte Carlo Algorithms Using Coupled Sample Paths”, 1996, where couplings of chains are also used to assess convergence of MCMC algorithms.

Note that this has pretty much nothing to do with the autocorrelations of the chains at stationarity, and their estimation, see e.g. the recent work of Dootika Vats et al on that part of the MCMC story. In a nutshell, these works deal with the asymptotic variance of MCMC estimators, whereas our work is concerned with non-asymptotic bias.

The R script producing the above figures can be used as a tutorial if you want to know more! Any bug reports, and comments on the manuscript, would be most welcome.

This package has been developed to support our (with Omiros Papaspiliopoulos) forthcoming book, tentatively called An Introduction to Sequential Monte Carlo. It implements all the algorithms discussed in the book; e.g.

- bootstrap, guided and auxiliary particle filters
- all standard resampling schemes
- most particle smoothing algorithms
- sequential quasi-Monte Carlo
- PMCMC (PMMH, Particle Gibbs), SMC^2
- SMC samplers

It also contains all the scripts that were used to perform the numerical experiments discussed in the book.

This package is hopefully useful to people with different expectations and levels of expertise. For instance, if you just want to run a particle filter for a basic state-space model, you may describe that model as follows:

```python
import particles
from particles import distributions as dists
from particles import state_space_models as ssm

class ToySSM(ssm.StateSpaceModel):
    def PX0(self):  # Distribution of X_0
        return dists.Normal()  # X_0 ~ N(0, 1)

    def PX(self, t, xp):  # Distribution of X_t given X_{t-1}
        return dists.Normal(loc=xp)  # X_t ~ N(X_{t-1}, 1)

    def PY(self, t, xp, x):  # Distribution of Y_t given X_t (and X_{t-1})
        return dists.Normal(loc=x, scale=self.sigma)  # Y_t ~ N(X_t, sigma^2)
```

And then simulate data, and run the corresponding bootstrap filter, as follows:

```python
my_model = ToySSM(sigma=0.2)
x, y = my_model.simulate(200)  # sample size is 200
alg = particles.SMC(fk=ssm.Bootstrap(ssm=my_model, data=y), N=200)
alg.run()
```

On the other hand, if you are an SMC expert, you may re-use only the parts you need; e.g. a resampling scheme:

```python
from particles import resampling

A = resampling.systematic(W)
```

Up to now, this package has been tested mostly by my PhD students, and by the students of my M2 course on particle filtering at the ENSAE; many thanks to all of them. Since no computer screen has been smashed in the process, I guess I can publicize it a bit more. Please let me know if you have any questions, comments, or feature requests. (You may report a bug by raising an issue on the GitHub page.)

Based on your feedback, I’m planning to write a few more posts in the coming weeks about particles and more generally numerical computation in Python. Stay tuned!

Hi all,

This post is about some results from “Bias Properties of Budget Constrained Simulations“, by Glynn & Heidelberger and published in Operations Research in 1990. I have found these results extremely useful, and our latest manuscript on unbiased MCMC recalls them in detail. Below I go through some of the results and describe the simulations that lead to the above figure.

Consider the following setting: you have a time budget (say, one day), you have access to a number of machines, and you have Monte Carlo simulations to run. Each simulation produces a random variable $X$ with expectation $\mu$, in a random time denoted by $\tau$. The variable $X$ could, for instance, be the output of a rejection sampler, or of some Markov chain-based algorithm such as coupling from the past, some unbiased MCMC procedure, or some sequential Monte Carlo sampler for normalizing constant estimation. In any case you’re interested in $\mu$.

You could first ask each machine to produce a fixed number of estimators. But you would not be sure how long the calculations would take to finish. It could also be wasteful: suppose that one machine finishes its calculations faster than another: you would prefer the faster machine to keep on producing estimators while waiting for the slower one to finish.

So instead, you prefer to ask each machine to produce as many estimators as possible within the given time budget, starting a new estimator whenever the previous one completes. But what exactly do you do when the budget expires? Do you interrupt the machines? Or do you wait for them to complete their on-going calculations? What else? These are the types of question that Glynn & Heidelberger help to answer.

Let’s introduce some notation (exactly that of Glynn & Heidelberger). Suppose that a machine produces estimators $X_1, X_2, \ldots$ in sequence, in times denoted by $\tau_1, \tau_2, \ldots$. Let us denote by $N(t)$ the number of completed estimators at time $t$. Define $\bar{X}(t) = N(t)^{-1} \sum_{i=1}^{N(t)} X_i$ for $N(t) \geq 1$, and $\bar{X}(t) = 0$ for $N(t) = 0$. Thus $\bar{X}(t)$ denotes the estimator obtained if we interrupt the machine at time $t$ and average over the available estimators; if no estimator is available yet, then $\bar{X}(t) = 0$.

Equation (1) in Glynn & Heidelberger quantifies, under an integrability assumption, the bias of this scheme: the estimator $\bar{X}(t)$ is biased for the object of interest $\mu$, and the bias diminishes with $t$. The proof of that result starts by noting the conditional exchangeability of the pairs $(X_i, \tau_i)$ given $N(t)$. That conditional exchangeability is used to compute, for each $n \geq 1$, the expectation of $\bar{X}(t)$ conditional on the event $\{N(t) = n\}$. Finally, a multiplication by $\mathbb{P}(N(t) = n)$, a sum over $n$, and an application of Fubini’s theorem conclude the proof.

An alternative estimator is $\tilde{X}(t)$, obtained if we wait for the on-going calculation to complete and include it in the average. That estimator is biased too, and Theorem 2 in Glynn & Heidelberger gives its bias explicitly. The bias of that estimator goes to zero as $t \to \infty$, as indeed stated in Corollary 10 of the paper.

It turns out that one can obtain an unbiased estimator of $\mu$ with a mild modification of the above estimators. Define $\hat{X}(t) = \bar{X}(t)$ if $N(t) \geq 1$, and $\hat{X}(t) = X_1$ if $N(t) = 0$. That is, if at least one estimator is already available at time $t$, interrupt the on-going calculation; otherwise, wait for the first estimator to complete. Then $\mathbb{E}[\hat{X}(t)] = \mu$. This is stated in Corollary 7 of Glynn & Heidelberger, and is a consequence of the first equation mentioned above.
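To see the three estimators at work, here is a toy Python simulation (a sketch under assumed distributions, not necessarily the setting used for the figure): each run takes an Exponential(1) compute time $\tau$ and, to create dependence between values and compute times, returns $X = \tau$ itself, so that $\mu = 1$.

```python
import random

def three_estimators(budget, rng):
    """One machine: run simulations until the budget expires; return the
    triplet (bar, tilde, hat) in the notation of Glynn & Heidelberger."""
    elapsed, xs = 0.0, []
    while True:
        tau = rng.expovariate(1.0)   # compute time of the next estimator
        if elapsed + tau > budget:   # budget expires during this run
            x_ongoing = tau          # value of the interrupted estimator
            break
        elapsed += tau
        xs.append(tau)               # here X_i = tau_i, so mu = E[tau] = 1
    n = len(xs)
    bar = sum(xs) / n if n >= 1 else 0.0     # interrupt at time t
    tilde = (sum(xs) + x_ongoing) / (n + 1)  # wait for the on-going run
    hat = bar if n >= 1 else x_ongoing       # Corollary 7: unbiased
    return bar, tilde, hat

rng = random.Random(1)
reps = [three_estimators(2.0, rng) for _ in range(200_000)]
means = [sum(col) / len(reps) for col in zip(*reps)]  # E[bar], E[tilde], E[hat]
```

On this example $\bar{X}(t)$ is biased downwards (runs that both finish within the budget and have small values are over-represented, and the estimator is zero when nothing has completed), whereas the average of $\hat{X}(t)$ over replications is close to $1$, up to Monte Carlo error.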

Now, what about the figure above? It represents the bias and the expected completion time of the three estimators $\bar{X}(t)$, $\tilde{X}(t)$ and $\hat{X}(t)$. The plots show bias (left) and expected completion time (right) as a function of the budget $t$. In this example the pairs $(X_i, \tau_i)$ are drawn so that each estimator is correlated with its own compute time, thus creating dependencies between estimators and compute times. The three estimators are represented by different colours; I’ll let you guess which one is which!
