# Statisfaction

## Statistical inference with the Wasserstein distance

Posted in Statistics by Pierre Jacob on 27 January 2017

Nature transporting piles of sand.

Hi! It’s been too long!

In a recent arXiv entryEspen Bernton, Mathieu Gerber and Christian P. Robert and I explore the use of the Wasserstein distance to perform parameter inference in generative models. A by-product is an ABC-type approach that bypasses the choice of summary statistics. Instead, one chooses a metric on the observation space. Our work fits in the minimum distance estimation framework and is particularly related to “On minimum Kantorovich distance estimators”, by Bassetti, Bodini and Regazzini. A recent and very related paper is “Wasserstein training of restricted Boltzmann machines“, by Montavon, Müller and Cuturi, who have similar objectives but are not considering purely generative models. Similarly to that paper, we make heavy use of recent breakthroughs in numerical methods to approximate Wasserstein distances, breakthroughs which were not available to Bassetti, Bodini and Regazzini in 2006.

Here I’ll describe the main ideas in a simple setting.  If you’re excited about ABCasymptotic properties of minimum Wasserstein estimators, Hilbert space-filling curves, delay reconstructions and Takens’ theorem, or SMC samplers with r-hit kernels, check our paper!

We have data $y_1,\ldots,y_n$ in $\mathcal{Y}$ and a model. A model is a collection of probability distributions on $\mathcal{Y}$, $\{\mu_\theta, \theta \in \mathbb{R}^d \}$ with d-dimensional parameters $\theta$ to estimate. Problem: you can only simulate from $\mu_\theta$, and not evaluate its probability density function. This is the “ABC” setting of “purely generative” models.

A first step is to view the observations as an empirical distribution $\hat\mu_n = \sum_{i=1}^n \delta_{y_i}$, and not as a vector in $\mathcal{Y}^n$. It is a very sensible idea for independent data, for which the order should not matter. The paper discusses extensions to dependent data in details.

The next step is inspired by minimum distance estimation principles: we can estimate parameters by minimizing a distance between $\mu_\theta$ and $\hat\mu_n$, over all $\theta \in \mathbb{R}^d$. Which distance should we use? In the purely generative case, we can approximate $\mu_\theta$ by drawing from it. We are then faced with the problem of computing a distance between two empirical distributions. Many metrics could be envisioned, but the Wasserstein distance is an appealing choice for multiple reasons.

1. It takes into account a distance defined on the underlying observation space. Warning: it’s also a choice that has to be made. If you can’t make any decent choice of metric on the observation space, then you can’t define the Wasserstein distance between distributions on that space. This is a crucial difference between transport metrics and, say, the Kullback-Leibler divergence.
2. Recent advances in numerical transport enable the computation of a “entropy-regularized” Wasserstein distance, for a cost that is quadratic in $n$ but only linear in the dimension of $\mathcal{Y}$. In the univariate case, the Wasserstein distance can be computed exactly in $n \log n$ via permutations.
3. Extending the permutation idea, we propose new probability metrics based on an ordering of the projected data through a Hilbert space-filling curve. This leads to a distance computable in $n \log n$ and linear in the dimension: it’s fast.

Once the distance is chosen, multiple estimation frameworks are possible. The minimum distance “point” estimator leads to an optimization program, whereas an ABC approach leads to a quasi-posterior distribution to sample, which we tackle with SMC samplers and r-hit kernels.

The theoretical study of the minimum distance estimators associated with Hilbert space-filling curve distances has just started; however we have a variety of results for the standard minimum Wasserstein distance estimator, and for its Bayesian counterpart (see the supplementary materials), including in the misspecified setting. Numerical experiments show very promising performance in a variety of examples, and in particular, we show how careful choices of summary statistics can be completely bypassed. Some examples will be described in future blog entries.