With Jeremy Heng we have recently arXived a paper describing how to remove the burn-in bias of Hamiltonian Monte Carlo (HMC). This follows a recent work on unbiased MCMC estimators in general on which I blogged here. The case of HMC requires a specific yet very simple coupling. A direct consequence of this work is that Hamiltonian Monte Carlo can be massively parallelized: instead of running one chain for many iterations, one can run short coupled chains independently in parallel. The proposed estimators are consistent in the limit of the number of parallel replicates. This is appealing because, in recent years and for the years to come, the number of available processors has been increasing much faster than clock speeds, for a number of reasons explained e.g. here.

As described in the previous blog post, the proposed construction involves a coupling of Markov chains. So consider two chains, (X_t) and (Y_t). The coupling must be such that each chain is marginally a standard HMC chain (or any variant thereof), but jointly the chains meet exactly, i.e. become identical, after a random number of iterations called the meeting time τ. That is, τ is finite almost surely. Here, for simplicity, I neglect the time shift mentioned in the earlier post and in the papers.

A simple HMC kernel works as follows. Given a current state X_t, an initial velocity V is drawn from a multivariate Normal distribution. Then, the equations of Hamiltonian dynamics, corresponding to the movement of a particle with a potential energy given by minus the log target density, are numerically solved with a leap-frog integrator, using a step size ε and a number of leap-frog steps L. The final position is then accepted or not as the next state X_{t+1}, according to a Metropolis–Hastings (MH) ratio which corrects for the error introduced by the leap-frog integrator. The understanding of HMC has improved considerably in recent months, notably with the contributions of Mangoubi & Smith 2017 and Durmus, Moulines & Saksman 2017.
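For concreteness, here is a minimal sketch of such an HMC transition in Python. This is not the paper's code; the target, step size and number of leap-frog steps below are placeholder choices.

```python
import numpy as np

def leapfrog(x, v, grad_log_target, eps, n_steps):
    # numerically solve Hamiltonian dynamics with the leap-frog integrator
    v = v + 0.5 * eps * grad_log_target(x)
    for i in range(n_steps):
        x = x + eps * v
        if i < n_steps - 1:
            v = v + eps * grad_log_target(x)
    v = v + 0.5 * eps * grad_log_target(x)
    return x, v

def hmc_step(x, log_target, grad_log_target, eps, n_steps, rng):
    v = rng.standard_normal(x.shape)              # initial velocity ~ N(0, I)
    x_prop, v_prop = leapfrog(x, v, grad_log_target, eps, n_steps)
    # MH ratio corrects for the error of the leap-frog integrator
    log_ratio = (log_target(x_prop) - 0.5 * v_prop @ v_prop) \
              - (log_target(x) - 0.5 * v @ v)
    if np.log(rng.uniform()) < log_ratio:
        return x_prop                              # accept
    return x                                       # reject
```

Iterating `hmc_step` produces a chain whose time averages converge to expectations under the target.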

It turns out that using common random numbers for the two HMC chains, that is, using common initial velocities and common uniform variables for the acceptance steps, leads to the chains contracting very quickly. This is illustrated in the above animation, which shows the first iterations of 250-dimensional chains targeting a multivariate Normal distribution; the plot shows the evolution of the first two components. The distance between the chains rapidly goes to zero. Assumptions on the target distribution are necessary for such contraction to occur, such as strong convexity as in Section 2.4 of Mangoubi & Smith 2017. In our work, we only need such assumptions to be satisfied on subsets of the space that the chains visit regularly (see the paper for precise statements). For instance, we might assume that the target density is strictly positive everywhere and that there are compact sets on which strong log-concavity of the target density holds.
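This common-random-numbers coupling can be sketched as follows (illustrative Python, not the paper's code; the standard Normal target and the tuning values are placeholders). The two chains share the velocity draw and the acceptance uniform at every iteration:

```python
import numpy as np

def leapfrog(x, v, grad, eps, L):
    v = v + 0.5 * eps * grad(x)
    for i in range(L):
        x = x + eps * v
        if i < L - 1:
            v = v + eps * grad(x)
    v = v + 0.5 * eps * grad(x)
    return x, v

def coupled_hmc_step(x, y, logp, grad, eps, L, rng):
    # common random numbers: one velocity, one uniform, shared by both chains
    v = rng.standard_normal(x.shape)
    log_u = np.log(rng.uniform())
    new_states = []
    for z in (x, y):
        z_prop, v_prop = leapfrog(z, v, grad, eps, L)
        log_ratio = (logp(z_prop) - 0.5 * v_prop @ v_prop) \
                  - (logp(z) - 0.5 * v @ v)
        new_states.append(z_prop if log_u < log_ratio else z)
    return new_states

# 250-dimensional standard Normal target, as in the animation
d = 250
logp = lambda x: -0.5 * float(x @ x)
grad = lambda x: -x
rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)
dist = [np.linalg.norm(x - y)]
for _ in range(100):
    x, y = coupled_hmc_step(x, y, logp, grad, 0.15, 10, rng)
    dist.append(np.linalg.norm(x - y))
# dist shrinks rapidly towards zero
```

For this Gaussian target the leap-frog map is linear in the state, so when both chains accept, the difference between them is multiplied by a contraction factor, which explains the fast decrease of the distance.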

In some examples, coupled HMC chains contract so fast that the distance between them goes under machine precision after a reasonably small number of iterations. We can then consider that the chains have met. Yet this is not very clean… so instead, we mix HMC steps with coupled MH steps that use maximally coupled random walk proposals. Indeed, if two chains are already close to one another thanks to the coupled HMC steps, then the coupled MH steps trigger exact meetings with large probability.

In the experiments, we try a multivariate Normal target in dimension 250, a bivariate Normal truncated by quadratic inequalities, and a logistic regression with 300 covariates. These examples show that tuning parameters, such as step size and number of leap-frog steps, that would be optimal for a single HMC chain might not be optimal for the proposed unbiased estimators. This leads to a loss of asymptotic efficiency, which is hardly surprising: this is a classic case of bias removal increasing the variance. Yet from the experiments, the loss of efficiency sounds very reasonable in exchange for immediate benefits on parallel processors. In the logistic regression example, the method essentially amounts to running HMC chains of length 1,000 in parallel.

Another way of parallelizing HMC, or any MCMC method, is to plug it in the rejuvenation step of an SMC sampler (see here or here). As has been known for fifteen years, this has benefits such as normalizing constant estimation and some robustness to multimodality. Both approaches provide estimators that are consistent in the limit of a number of operations that can be massively parallelized: particles in SMC samplers, and independent replicates of unbiased MCMC. One advantage of the latter might lie in the simple construction of confidence intervals from the standard Central Limit Theorem, but these can be built from SMC samplers too.

]]>

Nine R user communities already exist in France and there is a much larger number of R communities around the world. It was time for Grenoble to start its own!

The goal of the R user group is to facilitate the identification of local useRs, to initiate contacts, and to organise experience and knowledge sharing sessions. The group is open to any local useR interested in learning and sharing knowledge about R.

The group’s website features a map and a table with members of the R group. Members with specific skills related to the use of R are referenced in the table and can be contacted by other members. A Gitter chat room allows members to discuss R issues, and a calendar presents the upcoming events.

**Working sessions**
Monthly two-hour working sessions start with presentations or tutorials. The second part is dedicated to helping each other, meeting new people, and sharing beers and soft drinks offered by the Grenoble Data Institute. The current program looks nice:

]]>

Hi,

With Lawrence Murray, Chris Holmes and Christian Robert, we have recently arXived a paper entitled “Better together? Statistical learning in models made of modules”. Christian blogged about it already. The context is the following: parameters of a first model appear as inputs in another model. The question is whether to consider a “joint model approach”, where all parameters are estimated simultaneously with all of the data, or whether one should instead follow a “modular approach”, where the first parameters are estimated with the first model only, ignoring the second model. Examples of modular approaches include the “cut distribution”, or “two-step estimators” (e.g. Chapter 6 of Newey & McFadden (1994)). In many fields, modular approaches are preferred, because the second model is suspected of being more misspecified than the first one. Misspecification of the second model can “contaminate” the joint model, with dire consequences for inference, as described e.g. in Bayarri, Berger & Liu (2009). Other reasons include computational constraints and the lack of simultaneous availability of all models and associated data. In the paper, we try to make sense of the defects of the joint model approach and we propose a principled, quantitative way of choosing between joint and modular approaches.

Lawrence and I started wondering about these questions in the context of models of plankton population growth, back in 2012. Plankton growth is affected by ocean temperatures. These temperatures are not measured everywhere at all times, but one can first use a geophysics model to infer temperatures at desired locations and times. Then these estimated temperatures can be used as inputs (or “forcings”) to model plankton growth. We were wondering whether we should instead define a joint model of temperatures + plankton, to take the uncertainty of temperatures into account. Parslow, Cressie, Campbell, Jones & Murray (2013) provide an example of plankton model where temperatures are considered fixed, which is common practice.

An initial difficulty with the joint model approach is of a computational nature: if the two stages (e.g. plankton and geophysics) both involve large models, requiring weeks of computation, the joint model might simply be impossible to deal with. Indeed, the number of parameters adds up, and computational methods typically have super-linear costs in the number of parameters, which is the dimension of the space to explore (i.e. to sample or to optimize). It’s even worse than that since extra difficulties accumulate, such as multimodality, intractable likelihoods, etc.

Interestingly, the difficulties are not only computational but also statistical: the joint model approach is not always preferable in terms of estimation. This has been reported in various contexts, such as pharmacokinetics-pharmacodynamics, where the PD part is usually considered *misspecified relative to* the PK part. A good reference is Lunn, Best, Spiegelhalter, Graham & Neuenschwander (2009). There seems to have been enough demand from practitioners for WinBUGS to include a “cut” function, which is an attempt at estimating some parameters irrespective of other parts of the model (see the WinBUGS manual). Martyn Plummer (of JAGS) wrote a very interesting paper on the cut distribution, on issues associated with existing algorithms to sample from it, and on proposals to fix them. Another super relevant article is Bayarri, Berger & Liu (2009) that considers the issue in multiple cases of Bayesian inference in misspecified settings. Another link between modular approaches and model misspecification has been thoroughly investigated in the context of causal inference with propensity scores by Corwin Zigler and others.

Departing from the joint model approach seems to pose difficulties for some statisticians. Indeed, the cut distribution is strange: it introduces directions in the graph relating the variables of the model, so that A can impact B without B impacting A. This is strange because information is usually modeled as a flow going both ways. In the paper, we argue that the cut distribution can be a reasonable choice in terms of decision-theoretic principles, such as predictive performance assessed with the logarithmic scoring rule. This leads to a quantitative criterion that can be computed to decide whether to “cut” or not. Following this type of decision-theoretic reasoning seems to me very much in the spirit of the Bayesian paradigm, in contrast to always trusting joint models and their associated posterior distributions.

On the other hand, the issue might appear trivial to econometricians, who are used to two-step estimators and misspecification in general; see e.g. White (1982), Robustness by Hansen & Sargent, and two-step estimators in e.g. Pagan (1984). The cut distribution is a probabilistic version of these estimators, so I wonder why it has not been studied in more detail earlier. If you know about early references, please let me know! In passing, in the recent unbiased MCMC paper (blog post here), we describe new ways of approximating the cut distribution, which hopefully resolve some of the issues raised by Plummer.

In the end, our article is an attempt at starting a discussion on modular versus joint approaches. It is likely that more and more situations will require the combination of models (e.g. merging heterogeneous data sources), in ways that take model misspecification into account.

]]>

Didier Fraix-Burnet (IPAG), Stéphane Girard (Inria) and myself are organising a School of Statistics for Astrophysics, Stat4Astro, to be held in October in France. The primary goal of the School is to train astronomers in the use of modern statistical techniques. It also aims at bridging the gap between the two communities by emphasising practice during joint work sessions, giving firm grounds to the theoretical lessons, and initiating work on problems brought by the participants. There have been two previous sessions of this school, one on regression and one on clustering. The speakers of this edition, including Christian Robert, Roberto Trotta and David van Dyk, will focus on Bayesian methodology, with the moral support of the Bayesian society, ISBA. The interest of this statistical approach in astrophysics probably comes from its necessity and its success in determining the cosmological parameters from observations, especially from the cosmic background fluctuations. The cosmological community has thus been very active in this field (see for instance the Cosmostatistics Initiative COIN).

But the Bayesian methodology, complementary to the more classical frequentist one, has many applications in physics in general thanks to its ability to incorporate a priori knowledge into the inference, such as the uncertainties brought by the observational processes.

As with many sophisticated statistical techniques, astronomers are generally not familiar with Bayesian methodology, even though it is becoming more and more widespread and useful in the literature. This school will provide participants with both a strong theoretical background and solid practice of Bayesian inference:

- Introduction to R and Bayesian Statistics (Didier Fraix-Burnet, Institut de Planétologie et d’Astrophysique de Grenoble)
- Foundations of Bayesian Inference (David van Dyk, Imperial College London)
- Markov chain Monte Carlo (David van Dyk, Imperial College London)
- Model Building (David van Dyk, Imperial College London)
- Nested Sampling, Model Selection, and Bayesian Hierarchical Models (Roberto Trotta, Imperial College London)
- Approximate Bayesian Computation (Christian Robert, Univ. Paris-Dauphine, Univ. Warwick and Xi’an (!))
- Bayesian Nonparametric Approaches to Clustering (Julyan Arbel, Université Grenoble Alpes and Inria)

Feel free to register, we are not fully booked yet!

Julyan

]]>

Hi,

In a recent work on parallel computation for MCMC, and also in another one, and in fact also in an earlier one, my co-authors and I use a simple yet very powerful object that is standard in Probability but not so well-known in Statistics: the maximal coupling. Here I’ll describe what this is and an algorithm to sample from such couplings.

Consider two distributions, p and q, on a general (i.e. discrete or continuous) state space. A coupling of p and q refers to a joint distribution for a pair (X, Y) whose first marginal is p and whose second marginal is q, i.e. X ~ p and Y ~ q.

An independent coupling has density c(x, y) = p(x) q(y). A maximal coupling, on the other hand, is such that the pairs (X, Y) have maximal probability of being identical, i.e. P(X = Y) is maximal under the marginal constraints X ~ p and Y ~ q.

At first, it might sound weird that there would be any possibility of X and Y being identical, since their distributions are on a continuous state space. But indeed it is possible. Intuitively, imagine that X is sampled from p: as long as q(X) > 0, then X could also be a sample from q, could it not?

The following algorithm provides samples (X, Y) from a maximal coupling of p and q.

- Sample X from p.
- Sample a uniform variable W on (0, p(X)). If W ≤ q(X), then output (X, X).
- Otherwise, sample Y from q and W′ uniformly on (0, q(Y)), until W′ > p(Y), and output (X, Y).
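The three steps above can be sketched in Python as follows (an illustration, not part of any paper's code; the two Normal distributions are placeholder choices), working with log-densities for numerical stability:

```python
import numpy as np
from scipy.stats import norm

def maximal_coupling(p_sample, p_logpdf, q_sample, q_logpdf, rng):
    # step 1: sample X from p
    x = p_sample(rng)
    # step 2: W uniform on (0, p(X)); if W <= q(X), output (X, X)
    if np.log(rng.uniform()) + p_logpdf(x) <= q_logpdf(x):
        return x, x
    # step 3: otherwise sample Y from q and W' on (0, q(Y)) until W' > p(Y)
    while True:
        y = q_sample(rng)
        if np.log(rng.uniform()) + q_logpdf(y) > p_logpdf(y):
            return x, y

# illustration: maximal coupling of N(0, 1) and N(1, 1)
p_dist, q_dist = norm(0.0, 1.0), norm(1.0, 1.0)
rng = np.random.default_rng(42)
pairs = [maximal_coupling(lambda r: r.normal(0.0, 1.0), p_dist.logpdf,
                          lambda r: r.normal(1.0, 1.0), q_dist.logpdf, rng)
         for _ in range(2000)]
meet_freq = np.mean([x == y for x, y in pairs])
# meet_freq estimates P(X = Y) = 1 - TV(p, q) = 2 * Phi(-1/2), about 0.617 here
```

For these two Normals the total variation distance is 2Φ(1/2) − 1 ≈ 0.383, so the pair is identical roughly 62% of the time.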

It is clear that, if (X, Y) is produced by the above algorithm, then X follows p (from step 1). Checking that Y follows q takes a few lines of calculus, similar to proving the validity of rejection sampling. It can be found e.g. in the appendix of this article. This scheme must have been known for a long time and is definitely in Thorisson’s book on coupling. I’m not quite sure who first came up with it; any tips welcome!

To understand the algorithm, let’s look at the following figure, where the density functions of p and q are plotted along with a shaded area under the curve min(p, q). The algorithm first tries to sample from the distribution represented by the shaded area, and if it does not succeed, it samples from the remainder, which has density q − min(p, q), up to a normalizing constant.

The graph at the beginning of the article shows pairs (X, Y) generated by the algorithm. The red dots represent samples with X = Y. The probability of this event is precisely one minus the total variation distance between p and q, and by the coupling inequality (e.g. here, Lemma 4.9), this is the maximum probability of the event under the marginal constraints.

]]>

Hi again,

As described in an earlier post, Espen Bernton, Mathieu Gerber and Christian P. Robert and I are exploring Wasserstein distances for parameter inference in generative models. Generally, ABC and indirect inference are fun to play with, as they make the user think about useful distances between data sets (i.i.d. or not), which is sort of implicit in classical likelihood-based approaches. Thinking about distances between data sets can be a helpful and healthy exercise, even if not always necessary for inference. Viewing data sets as empirical distributions leads to considering the Wasserstein distance, and we try to demonstrate in the paper that it leads to an appealing inferential toolbox.

In passing, the first author Espen Bernton will be visiting Marco Cuturi, Christian Robert, Nicolas Chopin and others in Paris from September to January; get in touch with him if you’re over there!

We have just updated the arXiv version of the paper, and the main modifications are as follows.

- We propose a new distance between time series termed “curve-matching”, which turns out to be quite similar to dynamic time warping or Skorokhod distances. This distance might be particularly relevant for models generating non-stationary time series, such as Susceptible-Infected-Recovered models.
- Our theoretical results are generally improved. In particular, for the minimum Wasserstein/Kantorovich estimator and variants of it, the proofs are now based on the notion of epi-convergence, commonly used in optimization, and various results from Rockafellar and Wets (2009), *Variational Analysis*.
- On top of the Hilbert distance, based on the Hilbert space-filling curve, we consider the use of the swapping distance of Puccetti (2017). So we have the Hilbert distance, computable in O(n log n), where n is the number of data points, the swapping distance in O(n²), and of course the Wasserstein distance in O(n³ log n). Various other distances are discussed in Section 6 of the paper.

- On the asymptotic behavior of ABC posteriors, our results now cover the use of Hilbert and swapping distances. This is thanks to the convenient property that the Hilbert distance is indeed a distance, and is always larger than the Wasserstein distance; we also rely on some of Mathieu’s recent results. And the swapping distance (if initialized with Hilbert sorting) is always sandwiched between Wasserstein and Hilbert.
- The numerical experiments have been revised: there is now a bivariate g-and-k example with comparisons to the actual posterior; the toggle switch example from systems biology is unchanged; and there is a new queueing model example, with comparisons to the actual posterior obtained with particle MCMC (we could also have used the method of Shestopaloff and Neal). Finally, we have a more detailed study of the Lévy-driven stochastic volatility model, with 10,000 observations. There we show how transport distances can be combined with summaries to estimate all model parameters (we previously recovered only four out of five parameters using transport distances alone).

The supplementary materials for the new version are here (while the supplementary for the previous arXiv version are still online here).

]]>

Hi,

With John O’Leary and Yves Atchadé, we have just arXived our work on removing the bias of MCMC estimators. Here I’ll explain what this bias is about, and the benefits of removing it.

An MCMC algorithm defines a Markov chain (X_t) with stationary distribution π, so that time averages of the chain converge to averages with respect to π; for instance,

(1/t) ∑_{s=1}^{t} h(X_s) → E_π[h(X)]

as t → ∞. The MCMC estimator is in general biased, for any fixed t, because the chain is not started from π, but rather from some initial distribution π_0.

It is common to discard some initial iterations as “burn-in” in order to mitigate that bias, which is particularly problematic for parallel computations. Indeed, you might be able to run many chains in parallel for the price of one, but each chain has to converge to π for the bias to disappear. In other words, MCMC estimators are intrinsically justified in the asymptotics of the number of iterations t, whereas parallelization can be done in the number of chains, as explained e.g. by Jeff Rosenthal.

Instead of running one chain, let’s run two chains, (X_t) and (Y_t). Each of them is, individually, “as if” it was generated from the same MCMC algorithm; however, we construct the pair of chains such that they will “meet”, as in the above animation. There is a meeting time τ such that X_t = Y_{t−1} for all t ≥ τ. Note the time shift! This construction allows us to consider

h(X_k) + ∑_{t=k+1}^{∞} (h(X_t) − h(Y_{t−1})),

which just adds infinitely many zeros to h(X_k), since all terms vanish after the meeting time. By taking the expectation, swapping limit and expectation and using a telescoping sum argument (the proper justification being in the paper, Section 3), we get that the expectation of the above sum is E_π[h(X)].
This is hugely inspired by Glynn & Rhee (2014), and I had described similar ideas in the setting of smoothing in an earlier post. The contribution of the new arXiv report is to bring this construction to generic MCMC algorithms.
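Writing τ for the meeting time, the telescoping argument can be sketched as follows (the formal justification, including the conditions for swapping limits and expectations, is in Section 3 of the paper):

```latex
\mathbb{E}_{\pi}[h(X)]
  = \lim_{t\to\infty} \mathbb{E}[h(X_t)]
  = \mathbb{E}[h(X_k)] + \sum_{t=k+1}^{\infty} \mathbb{E}\left[h(X_t) - h(X_{t-1})\right]
  = \mathbb{E}\left[h(X_k) + \sum_{t=k+1}^{\tau-1} \left(h(X_t) - h(Y_{t-1})\right)\right].
```

The middle expression telescopes; replacing X_{t−1} by Y_{t−1} leaves each expectation unchanged since the two chains have the same marginal law, and all terms with t ≥ τ are exactly zero since X_t = Y_{t−1}.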

In the diagram above, the two chains meet at time τ. This means that an unbiased estimator of the mean of h under π is given by h(X_k) + ∑_{t=k+1}^{τ−1} (h(X_t) − h(Y_{t−1})). In the article, we propose a series of variance reduction techniques, leading to estimators that are more similar to the original MCMC averages, with an extra correction term that removes the bias. Namely, for any given integers k ≤ m, we propose an estimator that averages the above over the time indices between k and m; it takes the form of a standard MCMC average over iterations k to m, plus a bias correction term. We also give heuristics to choose k and m so as to maximize the estimators’ efficiency.
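To make this concrete, here is a minimal sketch of the estimator h(X_k) + ∑_{t=k+1}^{τ−1} (h(X_t) − h(Y_{t−1})) with coupled Metropolis–Hastings chains. This is not the paper's code: the one-dimensional Normal target, the random-walk proposals coupled via a maximal coupling, and all tuning values are illustrative placeholder choices.

```python
import numpy as np

def max_coupled_normals(mu1, mu2, sigma, rng):
    # maximal coupling of N(mu1, sigma^2) and N(mu2, sigma^2)
    logpdf = lambda z, m: -0.5 * ((z - m) / sigma) ** 2
    x = rng.normal(mu1, sigma)
    if np.log(rng.uniform()) + logpdf(x, mu1) <= logpdf(x, mu2):
        return x, x
    while True:
        y = rng.normal(mu2, sigma)
        if np.log(rng.uniform()) + logpdf(y, mu2) > logpdf(y, mu1):
            return x, y

def unbiased_mh_estimate(h, log_target, sigma, k, rng, max_iter=100_000):
    # returns h(X_k) + sum over t in [k+1, tau-1] of (h(X_t) - h(Y_{t-1}))
    x = rng.normal(4.0, 1.0)                 # X_0 ~ pi_0, far from the target
    y = rng.normal(4.0, 1.0)                 # Y_0 ~ pi_0, independently
    # advance X by one step: the chains are time-shifted
    prop = rng.normal(x, sigma)
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop
    est, t = (h(x) if k == 1 else 0.0), 1
    while t < k or x != y:
        # coupled transition: maximally coupled proposals, common uniform
        px, py = max_coupled_normals(x, y, sigma, rng)
        log_u = np.log(rng.uniform())
        if log_u < log_target(px) - log_target(x):
            x = px
        if log_u < log_target(py) - log_target(y):
            y = py
        t += 1
        if t == k:
            est += h(x)
        elif t > k and x != y:               # term h(X_t) - h(Y_{t-1}), t < tau
            est += h(x) - h(y)
        if t > max_iter:
            raise RuntimeError("chains did not meet")
    return est
```

Averaging many independent replicates of `unbiased_mh_estimate` gives a consistent, parallelizable estimate of E_π[h(X)], despite the chains starting far from stationarity.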

We can make Gibbs and Metropolis-Hastings chains meet, as required by the above construction and as described in the paper (Section 4). This means that we can apply the method to a wide variety of settings. In the paper (Sections 5 and 6), we provide applications to hierarchical models, logistic regressions, and the Bayesian Lasso. We also use the method to approximate the “cut” distribution, which arises in Bayesian inference for misspecified models and on which I’ll blog in details soon.

If you use some more fancy MCMC methods, you can either design your custom couplings, or you can interweave your kernel with MH steps in order to create an appropriate coupling without altering much of the marginal mixing of the chains.

Thanks to the lack of bias, we can compute R estimators in parallel and take their average. The resulting estimator is 1) justified in the asymptotic regime R → ∞, and 2) parallelizable across the R terms. Each term takes a random but finite time to complete.

Another advantage of the proposed framework is that confidence intervals can easily be constructed for averages of i.i.d. estimators, using the simple Central Limit Theorem. This is again justified as R → ∞, instead of the usual MCMC confidence intervals, which are justified in the asymptotics of the number of iterations.

]]>

Hi there,

In this post, just in time for the summer, I propose a reading list for people interested in discovering the fascinating world of particle methods, aka sequential Monte Carlo methods, and their use in statistics. I also take the opportunity to advertise the SMC workshop in Uppsala (30 Aug – 1 Sept), which features an amazing list of speakers, including my postdoctoral collaborator Jeremy Heng:

www.it.uu.se/conferences/smc2017

Particle filters initially appeared as computational tools to perform online state estimation (i.e. “filtering”) in non-linear, non-Gaussian state space models, with applications in target tracking, signal processing, etc. They have been generalized to many other settings in statistics, and now constitute serious competitors to Markov chain Monte Carlo methods, under the name of “Sequential Monte Carlo methods”. These algorithms are often advantageous in terms of parallelization on modern hardware, benefit from a solid theoretical validation, allow the estimation of Bayes factors, and are indeed becoming more and more popular. Below is a tentative reading list on the topic.

Much more detailed reading lists can be found on

- Arnaud Doucet’s webpage on SMC algorithms: http://www.stats.ox.ac.uk/~doucet/smc_resources.html
- and a list of talks at the previous SMC workshop can be found here: https://smc2015.sciencesconf.org/resource/page/id/3

So… here we go.

These three articles present the basics of particle filtering. Once you read them and implement the examples, you’re an expert in particle filtering. Congratulations.

- Neil J. Gordon, David J. Salmond and Adrian FM Smith. “Novel approach to nonlinear/non-Gaussian Bayesian state estimation.” IEE Proceedings F (Radar and Signal Processing). Vol. 140. No. 2. IET Digital Library, 1993.
- Jun Liu and Rong Chen. “Sequential Monte Carlo methods for dynamic systems.” Journal of the American statistical association 93(443), 1032-1044, 1998.
- Michael K. Pitt and Neil Shephard. “Filtering via simulation: Auxiliary particle filters”. Journal of the American statistical association, 94(446), 590-599, 1999.

The following two articles are instrumental: they take particle filters from their original setting and turn them into very general algorithms to obtain samples from probability distributions.

- Nicolas Chopin. “A sequential particle filter method for static models.” Biometrika 89(3), 539-552, 2002.
- Pierre Del Moral, Arnaud Doucet and Ajay Jasra. “Sequential Monte Carlo samplers.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(3), 411-436, 2006.

The latter paper represents a significant advance in the theoretical understanding of particle methods compared to the seminal articles. And much more has been discovered since then!

The literature on the theory of particle methods is now too vast to list, so I only list some relatively self-contained articles. The first one introduces operators acting on probability measures and establishes consistency of particle methods when the number of particles goes to infinity. The second one deals with stability properties of particle methods with respect to the time horizon. The third one introduces a set of weak assumptions to establish such stability properties.

- Dan Crisan & Arnaud Doucet. “A survey of convergence results on particle filtering methods for practitioners”. IEEE Transactions on Signal Processing, 50(3), 736-746, 2002.
- Frédéric Cérou, Pierre Del Moral, and Arnaud Guyader. “A nonasymptotic theorem for unnormalized Feynman–Kac particle models.” Annales de l’IHP Probabilités et statistiques. 47(3): 629-649, 2011
- Nick Whiteley. Stability properties of some particle filters. The Annals of Applied Probability, 23(6), 2500-2537, 2013.

The reference book on the theoretical foundations of particle methods is:

- Pierre Del Moral. “Feynman-Kac Formulae”. Springer New York, 2004.

The following articles present original and recent developments, and also give a glimpse of the variety of situations where particle methods have been found useful beyond filtering.

- Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. “Particle Markov chain Monte Carlo methods.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(3), 269-342, 2010.
- Fredrik Lindsten, Michael Jordan and Thomas B. Schön, “Particle Gibbs with Ancestor Sampling”, Journal of Machine Learning Research, 15, 2145-2184, 2014.
- Liangliang Wang, Alexandre Bouchard-Côté & Arnaud Doucet. “Bayesian Phylogenetic Inference Using a Combinatorial Sequential Monte Carlo Method.” Journal of the American statistical association 110(512), 1362-1374, 2015.
- Patrick Rebeschini, and Ramon Van Handel. “Can local particle filters beat the curse of dimensionality?” The Annals of Applied Probability, 25(5), 2809-2866, 2015.

Particle methods propagate a large number of particles over relatively few steps, which makes them easy to parallelize. In fact, particles are a large collection of interacting, non-homogeneous Markov chains. Because of the interactions, efficient implementation of particle methods on parallel hardware has received some attention:

- Anthony Lee, Christopher Yau, Michael B. Giles, Arnaud Doucet, and Christopher C. Holmes. “On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods.” *Journal of Computational and Graphical Statistics* 19(4), 769-789, 2010.
- Lawrence M. Murray, Anthony Lee, and Pierre E. Jacob. “Parallel resampling in the particle filter.” *Journal of Computational and Graphical Statistics* 25(3), 789-805, 2016.
- Nick Whiteley, Anthony Lee, and Kari Heine. “On the role of interaction in sequential Monte Carlo algorithms.” *Bernoulli* 22(1), 494-529, 2016.

Particle methods can be used for evidence estimation, i.e. to estimate Bayes factors. This can be a good reason to use them, instead of standard MCMC methods. Indeed, there are currently no satisfactory ways of retrieving the evidence from the output of an MCMC run. This advantage of SMC was already noted in the articles listed above under “from time series to generic sampling”, i.e. since 2002. More has been said since then, e.g. in:

- Luke Bornn, Arnaud Doucet, and Raphael Gottardo. “An efficient computational approach for prior sensitivity analysis and cross-validation.” *Canadian Journal of Statistics* 38(1): 47-64, 2010.
- Yan Zhou, Adam M. Johansen, and John A. D. Aston. “Toward Automatic Model Comparison: An Adaptive Sequential Monte Carlo Approach.” *Journal of Computational and Graphical Statistics* 25(3): 701-726, 2016.

If you have suggestions for more articles, please use the comments!

]]>

Hello,

An example often used in the ABC literature is the g-and-k distribution (e.g. reference [1] below), which is defined through the inverse of its cumulative distribution function (cdf). It is easy to simulate from such distributions by drawing uniform variables and applying the inverse cdf to them. However, since there is no closed-form formula for the probability density function (pdf) of the g-and-k distribution, the likelihood is often considered intractable. It has been noted in [2] that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cdf, and 2) numerically differentiating the cdf, using finite differences, for instance. As it happens, this is very easy to implement, and I coded up an R tutorial at:

github.com/pierrejacob/winference/blob/master/inst/tutorials/tutorial_gandk.pdf

for anyone interested. This is part of the winference package that goes with our tech report on ABC with the Wasserstein distance (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, to be updated very soon!). This enables standard MCMC algorithms for the g-and-k example. It is also very easy to compute the likelihood for the multivariate extension of [3], since it only involves a fixed number of one-dimensional numerical inversions and differentiations (as opposed to a multivariate inversion).
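In Python (rather than the R of the tutorial), steps 1) and 2) look like this; a sketch assuming the standard g-and-k quantile function with the conventional value c = 0.8, and arbitrary parameter values in the checks:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def gandk_quantile(u, a, b, g, k, c=0.8):
    # inverse cdf of the g-and-k distribution
    z = norm.ppf(u)
    return a + b * (1 + c * np.tanh(g * z / 2)) * (1 + z ** 2) ** k * z

def gandk_cdf(x, a, b, g, k):
    # 1) numerically invert the quantile function to get the cdf
    return brentq(lambda u: gandk_quantile(u, a, b, g, k) - x, 1e-12, 1 - 1e-12)

def gandk_pdf(x, a, b, g, k, h=1e-5):
    # 2) differentiate the cdf by central finite differences
    return (gandk_cdf(x + h, a, b, g, k) - gandk_cdf(x - h, a, b, g, k)) / (2 * h)
```

With `gandk_pdf` one can evaluate a pointwise log-likelihood and plug it into a standard Metropolis–Hastings sampler.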

Surprisingly, most of the papers that present the g-and-k example do not compare their ABC approximations to the posterior; instead, they typically compare the proposed ABC approach to existing ones. Similarly, the so-called Ricker model is commonly used in the ABC literature, and its posterior can be tackled efficiently using particle MCMC methods; the same goes for the M/G/1 model, which can be tackled either with particle MCMC methods or with tailor-made MCMC approaches such as [4].

These examples can still have great pedagogical value in ABC papers, but it would perhaps be nice to see more comparisons to the ground truth when it’s available; ground truth here being the actual posterior distribution.

- Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B, 74, 419–474.
- Rayner, G. D. and MacGillivray, H. L. (2002) Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Statistics and Computing, 12, 57–75.
- Drovandi, C. C. and Pettitt, A. N. (2011) Likelihood-free Bayesian estimation of multivariate quantile distributions. Computational Statistics & Data Analysis, 55, 2541–2556.
- Shestopaloff, A. Y. and Neal, R. M. (2014) On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548.

]]>

With my friend Olivier Marchal (mathematician, not filmmaker, nor the cop), we have just arXived a note on the sub-Gaussianity of the Beta and Dirichlet distributions.

The notion, introduced by Jean-Pierre Kahane, is as follows:

A random variable X with finite mean μ is sub-Gaussian if there is a positive number σ² such that, for all λ ∈ ℝ, E[exp(λ(X − μ))] ≤ exp(λ²σ²/2).

Such a constant σ² is called a proxy variance, and we say that X is σ²-sub-Gaussian. If X is sub-Gaussian, one is usually interested in the optimal proxy variance, that is, the smallest σ² for which the inequality above holds.

Note that the variance always gives a lower bound on the optimal proxy variance: Var[X] ≤ σ²_opt. In particular, when σ²_opt = Var[X], X is said to be strictly sub-Gaussian.

The sub-Gaussian property is closely related to the tails of the distribution. Intuitively, being sub-Gaussian amounts to having tails lighter than those of a Gaussian, and this is actually a characterization of the property: a centered random variable is sub-Gaussian if and only if its tail probabilities are dominated by Gaussian-type tails.

That equivalence clearly implies exponential upper bounds for the tails of the distribution, since a centered Gaussian with variance σ² satisfies P(X > x) ≤ exp(−x²/(2σ²)).

That can also be seen directly: for a σ²-sub-Gaussian variable X, Markov’s inequality applied to exp(λX) yields, for any λ > 0, P(X > x) ≤ exp(λ²σ²/2 − λx). The polynomial function λ ↦ λ²σ²/2 − λx is minimized on (0, ∞) at λ = x/σ², for which we obtain

P(X > x) ≤ exp(−x²/(2σ²)).

In that sense, the sub-Gaussian property of any compactly supported random variable comes for free, since in that case the tails are obviously lighter than those of a Gaussian. A simple general proxy variance is given by Hoeffding’s lemma: let X be supported on [a, b] with E[X] = 0. Then for any λ ∈ ℝ,

E[exp(λX)] ≤ exp(λ²(b − a)²/8),

so X is (b − a)²/4-sub-Gaussian.

Back to the Beta distribution, which is supported on [0, 1], this shows that the Beta is 1/4-sub-Gaussian. The question of finding the optimal proxy variance is a more challenging issue. In addition to characterizing the optimal proxy variance of the Beta distribution in the note, we provide the simple upper bound 1/(4(α + β + 1)). It matches Hoeffding’s bound in the extremal case α = β → 0, where the Beta random variable concentrates on the two-point set {0, 1} (and where Hoeffding’s bound is tight).

In getting the bound 1/(4(α + β + 1)), we prove a recent conjecture made by Sam Elder in the context of Bayesian adaptive data analysis. I’ll say more about obtaining the optimal proxy variance in a later post.
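As a quick numerical sanity check of that bound (illustrative Python; the Beta parameters and the λ values below are arbitrary choices), one can compare the moment generating function of a centered Beta variable with the sub-Gaussian bound exp(λ²σ²/2), using σ² = 1/(4(α + β + 1)):

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

def centered_mgf(a, b, lam):
    # E[exp(lam * (X - E[X]))] for X ~ Beta(a, b), by numerical integration
    mu = a / (a + b)
    val, _ = integrate.quad(
        lambda x: np.exp(lam * (x - mu)) * beta(a, b).pdf(x), 0.0, 1.0)
    return val

def proxy_bound(a, b):
    # the simple upper bound on the optimal proxy variance from the note
    return 1.0 / (4.0 * (a + b + 1.0))
```

A proxy variance must dominate the variance, and the bound should dominate the centered MGF on the log scale for every λ.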

Cheers!

Julyan

]]>