Hi again,
As described in an earlier post, Espen Bernton, Mathieu Gerber, Christian P. Robert and I are exploring Wasserstein distances for parameter inference in generative models. Generally, ABC and indirect inference are fun to play with, as they make the user think about useful distances between data sets (i.i.d. or not), which is sort of implicit in classical likelihood-based approaches. Thinking about distances between data sets can be a helpful and healthy exercise, even if not always necessary for inference. Viewing data sets as empirical distributions leads to considering the Wasserstein distance, and we try to demonstrate in the paper that it leads to an appealing inferential toolbox.
In passing, the first author Espen Bernton will be visiting Marco Cuturi, Christian Robert, Nicolas Chopin and others in Paris from September to January; get in touch with him if you’re over there!
We have just updated the arXiv version of the paper, and the main modifications are as follows.
So we have the Hilbert distance, computable in O(n log n), where n is the number of data points, the swapping distance in O(n²), and of course the exact Wasserstein distance in O(n³ log n). Various other distances are discussed in Section 6 of the paper.
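To fix ideas, for univariate samples of equal size the Wasserstein distance between two empirical distributions has a closed form based on order statistics: sort both samples and match them in order, an O(n log n) computation. Here is an illustrative, stdlib-only Python sketch (not part of the winference package):

```python
def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two empirical distributions on the
    real line with the same number of atoms: sort both samples and match
    order statistics; the sort makes this O(n log n)."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(x - y) ** p for x, y in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)
```

For instance, two identical samples are at distance zero regardless of ordering, and shifting a sample by a constant shifts the 1-Wasserstein distance by that constant.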
The supplementary materials for the new version are here (while the supplementary for the previous arXiv version are still online here).
Hi,
With John O’Leary and Yves Atchadé, we have just arXived our work on removing the bias of MCMC estimators. Here I’ll explain what this bias is about, and the benefits of removing it.
An MCMC algorithm defines a Markov chain (X_t)_{t≥0}, with stationary distribution π, so that time averages of the chain converge to averages with respect to π, for instance

(1/t) ∑_{s=1}^{t} h(X_s) → E_π[h(X)],

as t → ∞. The MCMC estimator on the left-hand side is in general biased, for any fixed t, because the chain is not started from π, but rather from some initial distribution π_0.
It is common to discard some initial iterations as “burn-in”, in order to mitigate that bias, which is particularly problematic for parallel computations. Indeed you might be able to run many chains in parallel for the price of one, but each chain has to converge to π for the bias to disappear. In other words, MCMC estimators are intrinsically justified in the asymptotic regime of the number of iterations, t → ∞, whereas parallelization can be done in the number of chains, as explained e.g. by Jeff Rosenthal.
Instead of running one chain, let’s run two chains, (X_t) and (Y_t). Each of them is, individually, “as if” it was generated from the same MCMC algorithm; however, we construct the pair of chains such that they will “meet”, as in the above animation. There is a meeting time τ such that X_t = Y_{t−1} for all t ≥ τ. Note the time shift! This construction allows us to consider

h(X_k) + ∑_{t=k+1}^{∞} (h(X_t) − h(Y_{t−1})),
where all the terms beyond the meeting time are zero, so the infinite sum is just adding infinitely many zeros. By taking the expectation, swapping limit and expectation, and using a telescoping sum argument (the proper justification being in the paper, Section 3), we get that the expectation of the above sum is E_π[h(X)]: the estimator is unbiased.
This is hugely inspired by Glynn & Rhee (2014), and I had described similar ideas in the setting of smoothing in an earlier post. The contribution of the new arXiv report is to bring this construction to generic MCMC algorithms.
In the diagram above, the two chains meet at time τ. This means that an unbiased estimator of E_π[h(X)] is given by H_k = h(X_k) + ∑_{t=k+1}^{τ−1} (h(X_t) − h(Y_{t−1})). In the article, we propose a series of variance reduction techniques, leading to estimators that are more similar to the original MCMC averages, with an extra correction term that removes the bias. Namely, for any given integers k ≤ m, we propose the estimator
H_{k:m} = (1/(m−k+1)) ∑_{l=k}^{m} h(X_l) + ∑_{l=k+1}^{τ−1} min(1, (l−k)/(m−k+1)) (h(X_l) − h(Y_{l−1})),
and we give heuristics to choose k and m so as to maximize the estimators’ efficiency.
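To make the construction concrete, here is a minimal Python sketch (illustrative only, not the code accompanying the paper): a random-walk Metropolis–Hastings chain targeting a standard normal, coupled through a maximal coupling of the proposals and a common accept/reject uniform. The target, the tuning, and the initial distribution are all choices made up for this example.

```python
import math, random

def log_target(x):
    # illustrative target: pi = N(0, 1)
    return -0.5 * x * x

def mh_step(x, sigma):
    """One random-walk Metropolis-Hastings step."""
    prop = random.gauss(x, sigma)
    if math.log(random.random()) < log_target(prop) - log_target(x):
        return prop
    return x

def coupled_mh_step(x, y, sigma):
    """Coupled MH step: maximal coupling of the two normal proposals
    (equal with the largest possible probability), then a common
    uniform for both accept/reject decisions."""
    logq = lambda z, mu: -0.5 * ((z - mu) / sigma) ** 2
    px = random.gauss(x, sigma)
    if math.log(random.random()) + logq(px, x) <= logq(px, y):
        py = px
    else:
        while True:  # rejection sampler for the residual of the coupling
            py = random.gauss(y, sigma)
            if math.log(random.random()) + logq(py, y) > logq(py, x):
                break
    logu = math.log(random.random())
    nx = px if logu < log_target(px) - log_target(x) else x
    ny = py if logu < log_target(py) - log_target(y) else y
    return nx, ny

def unbiased_estimator(h, k=100, m=500, sigma=2.0):
    """H_{k:m}: the usual MCMC average over iterations k..m, plus the
    bias-correction term computed from the lag-one coupled chain."""
    x = random.gauss(0.0, 4.0)   # X_0 from an overdispersed initial law
    y = random.gauss(0.0, 4.0)   # Y_0 from the same initial law
    x = mh_step(x, sigma)        # X_1: the X-chain leads Y by one step
    est = h(x) / (m - k + 1) if k <= 1 else 0.0
    corr, t, met = 0.0, 1, False
    while t < m or not met:
        t += 1
        if met:                  # after meeting, the chains stay together
            x = mh_step(x, sigma)
        else:
            x, y = coupled_mh_step(x, y, sigma)
            met = (x == y)       # exact equality under maximal coupling
        if k <= t <= m:
            est += h(x) / (m - k + 1)
        if not met and t > k:    # terms l = k+1, ..., tau - 1
            corr += min(1.0, (t - k) / (m - k + 1)) * (h(x) - h(y))
    return est + corr
```

Averaging many independent copies of `unbiased_estimator` then approximates E_π[h(X)] without burn-in bias; each copy runs for a random but almost-surely finite number of steps.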
We can make Gibbs and Metropolis-Hastings chains meet, as required by the above construction and as described in the paper (Section 4). This means that we can apply the method to a wide variety of settings. In the paper (Sections 5 and 6), we provide applications to hierarchical models, logistic regressions, and the Bayesian Lasso. We also use the method to approximate the “cut” distribution, which arises in Bayesian inference for misspecified models and on which I’ll blog in detail soon.
If you use fancier MCMC methods, you can either design your own custom couplings, or you can interweave your kernel with MH steps in order to create an appropriate coupling without altering much of the marginal mixing of the chains.
Thanks to the lack of bias, we can compute R estimators in parallel and take their average. The resulting estimator is 1) justified in the asymptotic regime R → ∞, and 2) parallelizable across the R terms. Each term takes a random but finite time to complete.
Another advantage of the proposed framework is that confidence intervals can easily be constructed for averages of i.i.d. estimators, using the standard Central Limit Theorem. This is again justified as the number of estimators R → ∞, instead of the usual MCMC confidence intervals, which are justified in the asymptotic of the number of iterations.
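That CLT interval is the textbook one for i.i.d. data; here is a short sketch (with placeholder replicates standing in for actual independent runs of the coupled-chain estimator):

```python
import math, random, statistics

def clt_interval(estimates, z=1.96):
    """95% confidence interval for E_pi[h] from R i.i.d. unbiased
    estimator replicates: mean +/- 1.96 * sd / sqrt(R)."""
    r = len(estimates)
    mean = statistics.fmean(estimates)
    half = z * statistics.stdev(estimates) / math.sqrt(r)
    return mean - half, mean + half

# stand-in replicates: in practice these would be R independent runs of
# the unbiased estimator; here, noisy values around a true value of 1.0
random.seed(1)
reps = [1.0 + random.gauss(0.0, 0.2) for _ in range(1000)]
lo, hi = clt_interval(reps)
```

The interval shrinks at rate 1/√R as more replicates are added, with no reference to chain length.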
Hi there,
In this post, just in time for the summer, I propose a reading list for people interested in discovering the fascinating world of particle methods, aka sequential Monte Carlo methods, and their use in statistics. I also take the opportunity to advertise the SMC workshop in Uppsala (30 Aug – 1 Sept), which features an amazing list of speakers, including my postdoctoral collaborator Jeremy Heng:
www.it.uu.se/conferences/smc2017
Particle filters initially appeared as computational tools to perform online state estimation (i.e. “filtering”) in non-linear, non-Gaussian state space models, with applications in target tracking, signal processing, etc. They have been generalized to many other settings in statistics, and now constitute serious competitors to Markov chain Monte Carlo methods, under the name of “Sequential Monte Carlo methods”. These algorithms are often advantageous in terms of parallelization on modern hardware, benefit from a solid theoretical validation, allow the estimation of Bayes factors, and are indeed becoming more and more popular. Below is a tentative reading list on the topic.
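For readers who prefer code to equations, here is a minimal bootstrap particle filter in Python for a toy linear-Gaussian model; the model, tuning, and stdlib-only implementation are my own choices for illustration, not taken from the articles below.

```python
import math, random

def bootstrap_pf(ys, n_particles=500, phi=0.9, sq=1.0, sr=1.0):
    """Bootstrap particle filter for the toy state space model
    x_t = phi * x_{t-1} + N(0, sq^2),   y_t = x_t + N(0, sr^2).
    Returns the filtering means and the log-likelihood estimate."""
    # initialize from the stationary distribution of the AR(1) state
    xs = [random.gauss(0.0, sq / math.sqrt(1 - phi ** 2))
          for _ in range(n_particles)]
    loglik, means = 0.0, []
    for y in ys:
        # propagate particles through the state equation
        xs = [phi * x + random.gauss(0.0, sq) for x in xs]
        # weight by the observation density N(y; x, sr^2)
        ws = [math.exp(-0.5 * ((y - x) / sr) ** 2) for x in xs]
        total = sum(ws)
        loglik += (math.log(total / n_particles)
                   - 0.5 * math.log(2 * math.pi * sr ** 2))
        means.append(sum(w * x for w, x in zip(ws, xs)) / total)
        # multinomial resampling
        xs = random.choices(xs, weights=ws, k=n_particles)
    return means, loglik
```

The unbiasedness of the likelihood estimate exp(loglik), a key property behind particle MCMC, holds for this vanilla scheme; fancier resampling schemes mainly reduce variance.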
Much more detailed reading lists can be found on
So… here we go.
These three articles present the basics of particle filtering. Once you read them and implement the examples, you’re an expert in particle filtering. Congratulations.
The following two articles are instrumental: they take particle filters from their original setting and turn them into very general algorithms to obtain samples from probability distributions.
The latter paper shows a significant advance in the theoretical understanding of particle methods since the seminal articles. And much more has been discovered since then!
The literature on the theory of particle methods is now too vast to survey, so I only list some relatively self-contained articles. The first one introduces operators acting on probability measures and establishes consistency of particle methods when the number of particles goes to infinity. The second one deals with stability properties of particle methods with respect to the time horizon. The third one introduces a set of weak assumptions to establish such stability properties.
The reference book on the theoretical foundations of particle methods is:
The following articles present original and recent developments, and also give a glimpse of the variety of situations where particle methods have been found useful beyond filtering.
Particle methods propagate a large number of particles over relatively few steps, which makes them easy to parallelize. In fact, particles are a large collection of interacting, non-homogeneous Markov chains. Because of the interactions, efficient implementation of particle methods on parallel hardware has received some attention:
Particle methods can be used for evidence estimation, i.e. to estimate Bayes factors. This can be a good reason to use them, instead of standard MCMC methods. Indeed, there are currently no satisfactory ways of retrieving the evidence from the output of an MCMC run. This advantage of SMC was already noted in the articles listed above under “from time series to generic sampling”, i.e. since 2002. More has been said since then, e.g. in:
If you have suggestions for more articles, please use the comments!
Hello,
An example often used in the ABC literature is the g-and-k distribution (e.g. reference [1] below), which is defined through the inverse of its cumulative distribution function (cdf). It is easy to simulate from such distributions by drawing uniform variables and applying the inverse cdf to them. However, since there is no closed-form formula for the probability density function (pdf) of the g-and-k distribution, the likelihood is often considered intractable. It has been noted in [2] that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cdf, and 2) numerically differentiating the cdf, using finite differences, for instance. As it happens, this is very easy to implement, and I coded up an R tutorial at:
github.com/pierrejacob/winference/blob/master/inst/tutorials/tutorial_gandk.pdf
for anyone interested. This is part of the winference package that goes with our tech report on ABC with the Wasserstein distance (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, to be updated very soon!). This enables standard MCMC algorithms for the g-and-k example. It is also very easy to compute the likelihood for the multivariate extension of [3], since it only involves a fixed number of one-dimensional numerical inversions and differentiations (as opposed to a multivariate inversion).
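The tutorial linked above is in R; as an illustration, the same numerical-inversion idea can be sketched in a few lines of Python (the parameterization with c = 0.8 is the conventional one; the bisection and finite-difference tolerances are my own choices):

```python
import math
from statistics import NormalDist

def gandk_quantile(u, a, b, g, k, c=0.8):
    """Quantile function of the g-and-k distribution; note that
    (1 - exp(-g*z)) / (1 + exp(-g*z)) = tanh(g*z/2)."""
    z = NormalDist().inv_cdf(u)
    return a + b * (1 + c * math.tanh(g * z / 2)) * (1 + z * z) ** k * z

def gandk_cdf(x, a, b, g, k, lo=1e-10, hi=1 - 1e-10):
    """Numerically invert the (increasing) quantile function by bisection."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if gandk_quantile(mid, a, b, g, k) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gandk_pdf(x, a, b, g, k, h=1e-5):
    """Central finite difference of the numerically inverted cdf."""
    return (gandk_cdf(x + h, a, b, g, k)
            - gandk_cdf(x - h, a, b, g, k)) / (2 * h)
```

A quick sanity check: with g = 0 and k = 0 the quantile function reduces to a + b·Φ⁻¹(u), i.e. a N(a, b²) distribution, so the pdf at x = a should be 1/(b√(2π)) ≈ 0.3989 for b = 1.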
Surprisingly, most of the papers that present the g-and-k example do not compare their ABC approximations to the posterior; instead, they typically compare the proposed ABC approach to existing ones. Similarly, the so-called Ricker model is commonly used in the ABC literature, and its posterior can be tackled efficiently using particle MCMC methods; the same goes for the M/G/1 model, which can be tackled either with particle MCMC methods or with tailor-made MCMC approaches such as [4].
These examples can still have great pedagogical value in ABC papers, but it would perhaps be nice to see more comparisons to the ground truth when it’s available; ground truth here being the actual posterior distribution.
With my friend Olivier Marchal (mathematician, not filmmaker, nor the cop), we have just arXived a note on the sub-Gaussianity of the Beta and Dirichlet distributions.
The notion, introduced by Jean-Pierre Kahane, is as follows:
A random variable X with finite mean μ = E[X] is sub-Gaussian if there is a positive number σ² such that:

E[exp(s(X − μ))] ≤ exp(s²σ²/2) for all s ∈ ℝ.
Such a constant σ² is called a proxy variance, and we say that X is σ²-sub-Gaussian. If X is sub-Gaussian, one is usually interested in the optimal proxy variance:

σ²_opt(X) = min{σ² ≥ 0 : X is σ²-sub-Gaussian}.
Note that the variance always gives a lower bound on the optimal proxy variance: Var[X] ≤ σ²_opt(X). In particular, when σ²_opt(X) = Var[X], X is said to be strictly sub-Gaussian.
The sub-Gaussian property is closely related to the tails of the distribution. Intuitively, being sub-Gaussian amounts to having tails lighter than those of a Gaussian, and this is actually a characterization of the property: up to constants, X is sub-Gaussian if and only if its tail probabilities are dominated by Gaussian tail probabilities.
That equivalence clearly implies exponential upper bounds for the tails of the distribution, since a centered Gaussian Z with variance σ² satisfies P(Z > x) ≤ exp(−x²/(2σ²)).
That can also be seen directly: for a σ²-sub-Gaussian variable X, the Chernoff bound gives, for any s > 0,

P(X − μ ≥ x) ≤ exp(−sx + s²σ²/2).

The quadratic s ↦ −sx + s²σ²/2 is minimized on ℝ₊ at s = x/σ², for which we obtain

P(X − μ ≥ x) ≤ exp(−x²/(2σ²)).
In that sense, the sub-Gaussian property of any compactly supported random variable comes for free, since in that case the tails are obviously lighter than those of a Gaussian. A simple general proxy variance is given by Hoeffding’s lemma. Let X be supported on [a, b], with mean μ. Then for any s ∈ ℝ,

E[exp(s(X − μ))] ≤ exp(s²(b − a)²/8),

so X is ((b − a)²/4)-sub-Gaussian.
Back to the Beta distribution, which is supported on [0, 1], this shows the Beta is 1/4-sub-Gaussian. The question of finding the optimal proxy variance is a more challenging issue. In addition to characterizing the optimal proxy variance of the Beta distribution in the note, we provide the simple upper bound 1/(4(α + β + 1)). It matches with Hoeffding’s bound in the extremal case α = β → 0, where the Beta random variable concentrates on the two-point set {0, 1} (and where Hoeffding’s bound is tight).
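A quick numerical check of this bound, for the special case Beta(1, 1), i.e. the uniform distribution on [0, 1]: the bound gives 1/(4·(1+1+1)) = 1/12, which also equals Var(U), so the variance lower bound and this upper bound pinch the optimal proxy variance and the uniform is strictly sub-Gaussian. The centered moment generating function of the uniform is available in closed form, so the sub-Gaussian inequality can be verified on a grid:

```python
import math

def uniform_centered_mgf(s):
    """E[exp(s(U - 1/2))] for U ~ Uniform[0, 1], in closed form:
    (e^{s/2} - e^{-s/2}) / s = sinh(s/2) / (s/2)."""
    if s == 0:
        return 1.0
    return (math.exp(s / 2) - math.exp(-s / 2)) / s

# claimed proxy variance for Beta(1,1): 1/(4*(1+1+1)) = 1/12 = Var(U)
sigma2 = 1.0 / 12.0
for s in (x / 10 for x in range(-100, 101)):
    assert uniform_centered_mgf(s) <= math.exp(sigma2 * s * s / 2) + 1e-12
```

The inequality is tight as s → 0 (both sides are 1 + s²/24 + O(s⁴)), which is the usual sign that the proxy variance cannot be improved.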
In getting the bound 1/(4(α + β + 1)), we prove a recent conjecture made by Sam Elder in the context of Bayesian adaptive data analysis. I’ll say more about obtaining the optimal proxy variance in a forthcoming post.
Cheers!
Julyan
Hi all,
Last week I attended a wonderful meeting on Approximate Bayesian Computation in Banff, which gathered a nice crowd of ABC users and enthusiasts, including lots of people outside of computational stats, whom I wouldn’t have met otherwise. Christian blogged about it there. My talk on Inference with Wasserstein distances is available as a video here (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, the paper is here). In this post, I’ll summarize a few (personal) points and questions on ABC methods, after recalling the basics of ABC (ahem).
The goal is to learn the parameters θ of a generative model. We know how to sample “fake” data from the model (also called “simulator”, “generator” or “black-box”), given the parameters θ. We have a prior p(θ), and data y_1, …, y_n, where each y_i is d-dimensional. We cannot evaluate the likelihood function, thus we cannot apply the usual MLE and Bayesian toolbox. So what can we do?
We can sample parameters from the prior, and sample fake data given these parameters. Some of these fake data will resemble the actual data, in which case we might be interested in the corresponding parameters. More formally, we can sample θ ∼ p(θ) and fake data z given θ until D(y, z) ≤ ε, where D is a distance or pseudo-distance between samples (e.g. the Euclidean distance between summary statistics of the samples), and ε is a threshold. This procedure corresponds to an ABC “rejection sampler”, which targets the so-called ABC posterior distribution, which, itself, approximates a certain distribution as ε → 0. Essentially, if the discrepancy measure D is sensible, and ε is small enough, there is hope that the ABC posterior is useful for estimating parameters. Lots of variations of this idea exist: see the bibliography here. Now, some points gathered from the meeting.
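In code, the rejection sampler above looks like this; everything here (the normal location model, the prior, the difference-of-means discrepancy) is a toy choice for illustration:

```python
import math, random, statistics

def abc_rejection(y, n_samples=200, eps=0.1):
    """ABC rejection sampler for the mean theta of a N(theta, 1) model,
    with prior theta ~ N(0, 10) and discrepancy D = difference of
    sample means (a sufficient-statistic distance in this toy case)."""
    ybar, n = statistics.fmean(y), len(y)
    accepted = []
    while len(accepted) < n_samples:
        theta = random.gauss(0.0, math.sqrt(10.0))   # draw from the prior
        z = [random.gauss(theta, 1.0) for _ in range(n)]  # fake data
        if abs(statistics.fmean(z) - ybar) <= eps:   # keep if close enough
            accepted.append(theta)
    return accepted
```

Because the sample mean is sufficient here, the ABC posterior genuinely approaches the exact posterior as ε → 0; with insufficient summaries it would only approach the posterior given those summaries.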
Hi! It’s been too long!
In a recent arXiv entry, Espen Bernton, Mathieu Gerber, Christian P. Robert and I explore the use of the Wasserstein distance to perform parameter inference in generative models. A by-product is an ABC-type approach that bypasses the choice of summary statistics. Instead, one chooses a metric on the observation space. Our work fits in the minimum distance estimation framework and is particularly related to “On minimum Kantorovich distance estimators”, by Bassetti, Bodini and Regazzini. A recent and very related paper is “Wasserstein training of restricted Boltzmann machines”, by Montavon, Müller and Cuturi, who have similar objectives but are not considering purely generative models. Similarly to that paper, we make heavy use of recent breakthroughs in numerical methods to approximate Wasserstein distances, breakthroughs which were not available to Bassetti, Bodini and Regazzini in 2006.
Here I’ll describe the main ideas in a simple setting. If you’re excited about ABC, asymptotic properties of minimum Wasserstein estimators, Hilbert space-filling curves, delay reconstructions and Takens’ theorem, or SMC samplers with r-hit kernels, check our paper!
We have data y_1, …, y_n in a space 𝒴, and a model, i.e. a collection of probability distributions {μ_θ : θ ∈ Θ} on 𝒴, with d-dimensional parameters θ to estimate. Problem: you can only simulate from μ_θ, and not evaluate its probability density function. This is the “ABC” setting of “purely generative” models.
A first step is to view the observations as an empirical distribution, the average of Dirac masses located at the data points, and not as a vector in 𝒴ⁿ. It is a very sensible idea for independent data, for which the order should not matter. The paper discusses extensions to dependent data in detail.
The next step is inspired by minimum distance estimation principles: we can estimate parameters by minimizing a distance between the empirical distribution of the data and μ_θ, over all θ ∈ Θ. Which distance should we use? In the purely generative case, we can approximate μ_θ by drawing samples from it. We are then faced with the problem of computing a distance between two empirical distributions. Many metrics could be envisioned, but the Wasserstein distance is an appealing choice for multiple reasons.
Once the distance is chosen, multiple estimation frameworks are possible. The minimum distance “point” estimator leads to an optimization program, whereas an ABC approach leads to a quasi-posterior distribution to sample, which we tackle with SMC samplers and r-hit kernels.
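Here is a small Python illustration of the point-estimation route in one dimension: a toy normal location model, a grid search, and the sorted-samples formula for the univariate Wasserstein distance. All of these choices are mine for the sake of a runnable example, not the paper's.

```python
import random

def w1(xs, ys):
    """1-Wasserstein distance between two equal-size empirical
    distributions on the real line (match sorted samples)."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def min_wasserstein_location(y, grid):
    """Minimum Wasserstein point estimator for the location of a
    N(theta, 1) model: the synthetic noise is drawn once and reused
    for every theta (common random numbers), so the objective is
    smooth in theta and a grid search is meaningful."""
    noise = [random.gauss(0.0, 1.0) for _ in range(len(y))]
    return min(grid, key=lambda t: w1(y, [t + e for e in noise]))
```

Freezing the synthetic randomness is what makes the optimization tractable here; redrawing fresh fake data at every θ would give a noisy objective, which is one reason the ABC-style quasi-posterior route with SMC samplers is attractive in harder models.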
The theoretical study of the minimum distance estimators associated with Hilbert space-filling curve distances has just started; however we have a variety of results for the standard minimum Wasserstein distance estimator, and for its Bayesian counterpart (see the supplementary materials), including in the misspecified setting. Numerical experiments show very promising performance in a variety of examples, and in particular, we show how careful choices of summary statistics can be completely bypassed. Some examples will be described in future blog entries.
Alan thus decided to re-implement my method and several others (including Christian Robert’s accept-reject algorithm proposed in this paper) in C; see here:
https://github.com/alanrogers/dtnorm
Alan also sent me this interesting plot comparing the different methods. The color of a dot at position (a, b) corresponds to the fastest method for simulating N(0,1) truncated to [a, b].
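For the one-sided case [a, ∞) with a > 0, Robert's accept-reject algorithm with a translated exponential proposal can be sketched as follows in Python (my own stdlib rendering for illustration, not Alan's C code):

```python
import math, random

def rtnorm_tail(a):
    """Sample N(0,1) truncated to [a, infinity), a > 0, by accept-reject
    with proposal a + Exp(alpha), where alpha is the rate that maximizes
    the acceptance probability (Robert, 1995)."""
    alpha = 0.5 * (a + math.sqrt(a * a + 4.0))
    while True:
        x = a + random.expovariate(alpha)
        # acceptance probability exp(-(x - alpha)^2 / 2)
        if random.random() <= math.exp(-0.5 * (x - alpha) ** 2):
            return x
```

The acceptance rate of this sampler tends to one as a grows, which is why the exponential proposal dominates naive rejection from N(0,1) deep in the tail.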
A few personal remarks:
The Italian mathematician Francesco Faà di Bruno was born in Alessandria (Piedmont, Italy) in 1825 and died in Turin in 1888. At the time of his birth, Piedmont was part of the Kingdom of Sardinia, ruled by the House of Savoy. Italy was unified in 1861, and the Kingdom of Sardinia became the Kingdom of Italy, of which Turin was declared the first capital. At that time, the Piedmontese commonly spoke both Italian and French.
Faà di Bruno is probably best known today for the eponymous formula which generalizes the derivative of a composition of two functions, f(g(x)), to any order n:

dⁿ/dxⁿ f(g(x)) = ∑ n!/(m₁! 1!^{m₁} m₂! 2!^{m₂} ⋯ m_n! n!^{m_n}) · f^{(m₁+⋯+m_n)}(g(x)) · ∏_{j=1}^{n} (g^{(j)}(x))^{m_j},

where the sum runs over n-tuples (m₁, …, m_n) of non-negative integers satisfying

m₁ + 2m₂ + ⋯ + n·m_n = n.
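As a sanity check of the formula, here is a small Python implementation (my own illustrative code) that enumerates the tuples and compares the result against a finite-difference derivative:

```python
import math
from itertools import product

def faa_di_bruno(f_derivs, g_derivs, n):
    """n-th derivative of f(g(x)) at a point, given
    f_derivs[j] = f^(j)(g(x)) and g_derivs[j] = g^(j)(x), j = 0..n,
    via Faa di Bruno's formula."""
    total = 0.0
    # enumerate n-tuples (m_1, ..., m_n) with m_1 + 2 m_2 + ... + n m_n = n
    for ms in product(*(range(n // j + 1) for j in range(1, n + 1))):
        if sum(j * m for j, m in zip(range(1, n + 1), ms)) != n:
            continue
        denom, inner = 1, 1.0
        for j, m in zip(range(1, n + 1), ms):
            denom *= math.factorial(m) * math.factorial(j) ** m
            inner *= g_derivs[j] ** m
        total += math.factorial(n) / denom * f_derivs[sum(ms)] * inner
    return total
```

For n = 3 the enumeration recovers the familiar expansion f′′′(g)·g′³ + 3 f′′(g)·g′·g′′ + f′(g)·g′′′.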
Faà di Bruno published his formula in two notes:
Faà Di Bruno, F. (1857). Note sur une nouvelle formule de calcul différentiel. Quarterly Journal of Pure and Applied Mathematics, 1:359–360. Google Books link.
They both date from December 1855, and were signed in Paris. They are similar and essentially state the formula without a proof. I have arXived a note which contains a translation from the French version to English (reproduced below), as well as the two original notes in French and in Italian. I’ve used for this the Erasmus MMXVI font, thanks Xian for sharing!
NOTE ON A NEW FORMULA FOR DIFFERENTIAL CALCULUS.
By M. Faà di Bruno.
Having observed, when dealing with series developments of functions, that there did not exist to date any proper formula dedicated to readily calculating the derivative of any order of a function of a function, without resorting to computing all preceding derivatives, I thought that it would be very useful to look for one. The formula which I found is quite simple; and I hope it shall become of general use henceforth.
Let be any function of the variable , linked to another one by the equation of the form
Denote by the product and by etc the successive derivatives of the function ; the value of will have the following expression:
the sign runs over all non-negative integer values of the exponents, for which
the value of being
The expression can also take the form of a determinant, and one has
It is implicit that the exponents of will be considered as orders of derivation.
Paris, December 17. 1855.
Hi,
Interested in doing a post-doc with me on anything related to Bayesian computation? Please let me know, as there is currently a call for post-doc grants at the ENSAE; see below.
Nicolas Chopin
The Labex ECODEC is a research consortium in Economics and Decision Sciences common to three leading French higher education institutions based in the larger Paris area: École polytechnique, ENSAE and HEC Paris. The Labex Ecodec offers:
Two-year postdoctoral fellowships for 2017-2019
The monthly gross salary of postdoctoral fellowships is 3 000 €.
Candidates are invited to contact as soon as possible members of the research group (see below) with whom they intend to work.
Research groups concerned by the call:
Area 1: Secure Careers in a Global Economy
Area 2: Financial Market Failures and Regulation
Area 3: Product Market Regulation and Consumer Decision-Making
Area 4: Evaluating the Impact of Public Policies and Firms’ Decisions
Area 5: New Challenges for New Data
Details of each area can be found on the website:
Deadlines for application:
31st December 2016
Screening of applications and decisions can be made earlier for strong candidates who need an early decision.
The application should be sent to application@labex-ecodec.fr in PDF format. Please mention in the subject line the number of the area to which you are applying.
The application package includes:
Please note that HEC, Genes, and X PhD students are not eligible to apply for this call.
Selection will be based on excellence and a research project matching the group’s research agenda.
Area 1 “Secure careers in a Global Economy”: Pierre Cahuc (ENSAE), Dominique Rouziès (HEC), Isabelle Méjean (École polytechnique)
Area 2: “Financial Market Failures and Regulation”: François Derrien (HEC), Jean-David Fermanian, (ENSAE) Edouard Challe (École polytechnique)
Area 3: “Decision-Making and Market Regulation”: Nicolas Vieille (HEC), Philippe Choné (ENSAE), Marie-Laure Allain (École polytechnique)
Area 4: “Evaluating the Impact of Public Policies and Firms’ Decisions”: Bruno Crépon (ENSAE), Yukio Koriyama (École polytechnique), Daniel Halbheer (HEC)
Area 5: “New Challenges for New Data”: Anna Simoni (ENSAE), Gilles Stoltz (HEC)