# Statisfaction

## Statistical learning in models made of modules

Posted in General, Statistics by Pierre Jacob on 9 September 2017

Graph of variables in a model made of two modules: the first with parameter theta1 and data Y1, and the second with parameter theta2 and data Y2, defined conditionally upon theta1.

Hi,

With Lawrence Murray, Chris Holmes and Christian Robert, we have recently arXived a paper entitled “Better together? Statistical learning in models made of modules”. Christian blogged about it already. The context is the following: parameters of a first model appear as inputs in another model. The question is whether to consider a “joint model approach”, where all parameters are estimated simultaneously with all of the data. Or if one should instead follow a “modular approach”, where the first parameters are estimated with the first model only, ignoring the second model. Examples of modular approaches include the “cut distribution“, or “two-step estimators” (e.g. Chapter 6 of Newey & McFadden (1994)). In many fields, modular approaches are preferred, because the second model is suspected of being more misspecified than the first one. Misspecification of the second model can “contaminate” the joint model, with dire consequences on inference, as described e.g. in Bayarri, Berger & Liu (2009). Other reasons include computational constraints and the lack of simultaneous availability of all models and associated data. In the paper, we try to make sense of the defects of the joint model approach and we propose a principled, quantitative way of choosing between joint and modular approaches.

## School of Statistics for Astrophysics, Autrans, France, October 9-13

Posted in General by Julyan Arbel on 7 September 2017

Didier Fraix-Burnet (IPAG), Stéphane Girard (Inria) and myself are organising a School of Statistics for Astrophysics, Stat4Astro, to be held in October in France. The primary goal of the School is to train astronomers to the use of modern statistical techniques. It also aims at bridging the gap between the two communities by emphasising on the practice during works in common, to give firm grounds to the theoretical lessons, and to initiate works on problems brought by the participants. There have been two previous sessions of this school, one on regression and one on clustering. The speakers of this edition, including Christian Robert, Roberto Trotta and David van Dyk, will focus on the Bayesian methodology, with the moral support of the Bayesian Society, ISBA. The interest of this statistical approach in astrophysics probably comes from its necessity and its success in determining the cosmological parameters from observations, especially from the cosmic background fluctuations.  The cosmological community has thus been very active in this field (see for instance the Cosmostatistics Initiative COIN).

But the Bayesian methodology, complementary to the more classical frequentist one, has many applications in physics in general due to its faculty to incorporate a priori knowledge into the inference computation, such as the uncertainties brought by the observational processes.

As for sophisticated statistical techniques, astronomers are not familiar with Bayesian methodology in general, while it is becoming more and more widespread and useful in the literature. This school will form the participants to both a strong theoretical background and a solid practice of Bayesian inference:

• Introduction to R and Bayesian Statistics (Didier Fraix-Burnet, Institut de Planétologie et d’Astrophysique de Grenoble)
• Foundations of Bayesian Inference (David van Dyk, Imperial College London)
• Markov chain Monte Carlo (David van Dyk, Imperial College London)
• Model Building (David van Dyk, Imperial College London)
• Nested Sampling, Model Selection, and Bayesian Hierarchical Models (Roberto Trotta, Imperial College London)
• Approximate Bayesian Computation (Christian Robert, Univ. Paris-Dauphine, Univ. Warwick and Xi’an (!))
• Bayesian Nonparametric Approaches to Clustering (Julyan Arbel, Université Grenoble Alpes and Inria)

Feel free to register, we are not fully booked yet!

Julyan

## Sampling from a maximal coupling

Posted in Statistics by Pierre Jacob on 6 September 2017

Sample from a maximal coupling of two Normal distributions, X ~ N(0.5,0.82) and Y ~ N(-0.5,0.22).

Hi,

In a recent work on parallel computation for MCMC, and also in another one, and in fact also in an earlier one, my co-authors and I use a simple yet very powerful object that is standard in Probability but not so well-known in Statistics: the maximal coupling. Here I’ll describe what this is and an algorithm to sample from such couplings.

## Update on inference with Wasserstein distances

Posted in Statistics by Pierre Jacob on 15 August 2017

You have to read the arXiv report to understand this figure. There’s no way around it.

Hi again,

As described in an earlier postEspen BerntonMathieu Gerber and Christian P. Robert and I are exploring Wasserstein distances for parameter inference in generative models. Generally, ABC and indirect inference are fun to play with, as they make the user think about useful distances between data sets (i.i.d. or not), which is sort of implicit in classical likelihood-based approaches. Thinking about distances between data sets can be a helpful and healthy exercise, even if not always necessary for inference. Viewing data sets as empirical distributions leads to considering the Wasserstein distance, and we try to demonstrate in the paper that it leads to an appealing inferential toolbox.

In passing, the first author Espen Bernton will be visiting Marco Cuturi,  Christian Robert, Nicolas Chopin and others in Paris from September to January; get in touch with him if you’re over there!

We have just updated the arXiv version of the paper, and the main modifications are as follows.

## Unbiased MCMC with couplings

Posted in Statistics by Pierre Jacob on 14 August 2017

Two chains meeting at time 10, and staying faithful forever. ❤

Hi,

With John O’Leary and Yves Atchadé , we have just arXived our work on removing the bias of MCMC estimators. Here I’ll explain what this bias is about, and the benefits of removing it.

## Particle methods in Statistics

Posted in General, Statistics by Pierre Jacob on 30 June 2017

A statistician sampling from a posterior distribution with particle methods

Hi there,

In this post, just in time for the summer, I propose a reading list for people interested in discovering the fascinating world of particle methods, aka sequential Monte Carlo methods, and their use in statistics. I also take the opportunity to advertise the SMC workshop in Uppsala (30 Aug – 1 Sept), which features an amazing list of speakers, including my postdoctoral collaborator Jeremy Heng:

www.it.uu.se/conferences/smc2017

## Likelihood calculation for the g-and-k distribution

Posted in R, Statistics by Pierre Jacob on 11 June 2017

Histogram of 1e5 samples from the g-and-k distribution, and overlaid probability density function

Hello,

An example often used in the ABC literature is the g-and-k distribution (e.g. reference [1] below), which is defined through the inverse of its cumulative distribution function (cdf). It is easy to simulate from such distributions by drawing uniform variables and applying the inverse cdf to them. However, since there is no closed-form formula for the probability density function (pdf) of the g-and-k distribution, the likelihood is often considered intractable. It has been noted in [2] that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cdf, and 2)  numerically differentiating the cdf, using finite differences, for instance. As it happens, this is very easy to implement, and I coded up an R tutorial at:

github.com/pierrejacob/winference/blob/master/inst/tutorials/tutorial_gandk.pdf

for anyone interested. This is part of the winference package that goes with our tech report on ABC with the Wasserstein distance  (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, to be updated very soon!). This enables standard MCMC algorithms for the g-and-k example. It is also very easy to compute the likelihood for the multivariate extension of [3], since it only involves a fixed number of one-dimensional numerical inversions and differentiations (as opposed to a multivariate inversion).

Surprisingly, most of the papers that present the g-and-k example do not compare their ABC approximations to the posterior; instead, they typically compare the proposed ABC approach to existing ones. Similarly, the so-called Ricker model is commonly used in the ABC literature, and its posterior can be tackled efficiently using particle MCMC methods; as well as the M/G/1 model, which can be tackled either with particle MCMC methods or with tailor-made MCMC approaches such as [4].

These examples can still have great pedagogical value in ABC papers, but it would perhaps be nice to see more comparisons to the ground truth when it’s available; ground truth here being the actual posterior distribution.

1. Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B, 74, 419–474.
2. Rayner, G. D. and MacGillivray, H. L. (2002) Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Statistics and Computing, 12, 57–75.
3. Drovandi, C. C. and Pettitt, A. N. (2011) Likelihood-free Bayesian estimation of multivari- ate quantile distributions. Computational Statistics & Data Analysis, 55, 2541–2556.
4. Shestopaloff, A. Y. and Neal, R. M. (2014) On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548.

## Sub-Gaussian property for the Beta distribution (part 1)

Posted in General by Julyan Arbel on 2 May 2017

With my friend Olivier Marchal (mathematician, not filmmaker, nor the cop), we have just arXived a note on the sub-Gaussianity of the Beta and Dirichlet distributions.

The notion, introduced by Jean-Pierre Kahane, is as follows:

A random variable $X$ with finite mean $\mu=\mathbb{E}[X]$ is sub-Gaussian if there is a positive number $\sigma$ such that:

$\mathbb{E}[\exp(\lambda (X-\mu))]\le\exp\left(\frac{\lambda^2\sigma^2}{2}\right)\,\,\text{for all } \lambda\in\mathbb{R}.$

Such a constant $\sigma^2$ is called a proxy variance, and we say that $X$ is $\sigma^2$-sub-Gaussian. If $X$ is sub-Gaussian, one is usually interested in the optimal proxy variance:

$\sigma_{\text{opt}}^2(X)=\min\{\sigma^2\geq 0\text{ such that } X \text{ is } \sigma^2\text{-sub-Gaussian}\}.$

Note that the variance always gives a lower bound on the optimal proxy variance: $\text{Var}[X]\leq \sigma_{\text{opt}}^2(X)$. In particular, when $\sigma_{\text{opt}}^2(X)=\text{Var}[X]$, $X$ is said to be strictly sub-Gaussian.

The sub-Gaussian property is closely related to the tails of the distribution. Intuitively, being sub-Gaussian amounts to having tails lighter than a Gaussian. This is actually a characterization of the property. Let $Z\sim\mathcal{N}(0,1)$. Then:

$X \text{ is sub-Gaussian } \iff \exists c, \forall x\geq0:\, \mathsf{P}(|X-\mathbb{E}[X]|\geq x) \leq c\mathsf{P}(|Z|\geq x).$

That equivalence clearly implies exponential upper bounds for the tails of the distribution since a Gaussian $Z\sim\mathcal{N}(0,\sigma^2)$ satisfies

$\mathsf{P}(Z\ge x)\le\exp(-\frac{x^2}{2\sigma^2}).$

That can also be seen directly: for a $\sigma^2$-sub-Gaussian variable $X$,

$\forall\, \lambda>0\,:\,\,\mathsf{P}(X-\mu\geq x) = \mathsf{P}(e^{\lambda(X-\mu)}\geq e^{\lambda x})\leq \frac{\mathbb{E}[e^{\lambda(X-\mu)}]}{e^{\lambda x}}\quad\text{by Markov inequality,}$

$\leq\exp(\frac{\sigma^2\lambda^2}{2}-\lambda x)\quad\text{by sub-Gaussianity.}$

The polynomial function $\lambda\mapsto \frac{\sigma^2\lambda^2}{2}-\lambda x$ is minimized on $\mathbb{R}_+$ at $\lambda = \frac{x}{\sigma^2}$, for which we obtain

$\mathsf{P}(X-\mu\geq x) \leq\exp(-\frac{x^2}{2\sigma^2})$.

In that sense, the sub-Gaussian property of any compactly supported random variable $X$ comes for free since in that case the tails are obviously lighter than those of a Gaussian. A simple general proxy variance is given by Hoeffding’s lemma. Let $X$ be supported on $[a,b]$ with $\mathbb{E}[X]=0$. Then for any $\lambda\in\mathbb{R}$,

$\mathbb{E}[\exp(\lambda X)]\leq\exp\left(\frac{(b-a)^2}{8}\lambda^2\right)$

so $X$ is $\frac{(b-a)^2}{4}$-sub-Gaussian.

Back to the Beta where $[a,b]=[0,1]$, this shows the Beta is $\frac{1}{4}$-sub-Gaussian. The question of finding the optimal proxy variance is a more challenging issue. In addition to characterizing the optimal proxy variance of the Beta distribution in the note, we provide the simple upper bound $\frac{1}{4(\alpha+\beta+1)}$. It matches with Hoeffding’s bound for the extremal case $\alpha\to0$, $\beta\to0$, where the Beta random variable concentrates on the two-point set $\{0,1\}$ (and when Hoeffding’s bound is tight).

In getting the bound $\frac{1}{4(\alpha+\beta+1)}$, we prove a recent conjecture made by Sam Elder in the context of Bayesian adaptive data analysis. I’ll say more about getting the optimal proxy variance in a next post soon.

Cheers!

Julyan

Tagged with:

## ABC in Banff

Posted in General, Seminar/Conference, Statistics by Pierre Jacob on 6 March 2017

Banff, also known as not the worst location for a scientific meeting.

Hi all,

Last week I attended a wonderful meeting on Approximate Bayesian Computation in Banff, which gathered a nice crowd of ABC users and enthusiasts, including lots of people outside of computational stats, whom I wouldn’t have met otherwise. Christian blogged about it there. My talk on Inference with Wasserstein distances is available as a video here (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, the paper is here). In this post, I’ll summarize a few (personal) points and questions on ABC methods, after recalling the basics of ABC (ahem).

## Statistical inference with the Wasserstein distance

Posted in Statistics by Pierre Jacob on 27 January 2017

Nature transporting piles of sand.

Hi! It’s been too long!

In a recent arXiv entryEspen Bernton, Mathieu Gerber and Christian P. Robert and I explore the use of the Wasserstein distance to perform parameter inference in generative models. A by-product is an ABC-type approach that bypasses the choice of summary statistics. Instead, one chooses a metric on the observation space. Our work fits in the minimum distance estimation framework and is particularly related to “On minimum Kantorovich distance estimators”, by Bassetti, Bodini and Regazzini. A recent and very related paper is “Wasserstein training of restricted Boltzmann machines“, by Montavon, Müller and Cuturi, who have similar objectives but are not considering purely generative models. Similarly to that paper, we make heavy use of recent breakthroughs in numerical methods to approximate Wasserstein distances, breakthroughs which were not available to Bassetti, Bodini and Regazzini in 2006.

Here I’ll describe the main ideas in a simple setting.  If you’re excited about ABCasymptotic properties of minimum Wasserstein estimators, Hilbert space-filling curves, delay reconstructions and Takens’ theorem, or SMC samplers with r-hit kernels, check our paper!