# Statisfaction

## ABC in Banff

Posted in General, Seminar/Conference, Statistics by Pierre Jacob on 6 March 2017

Banff, also known as not the worst location for a scientific meeting.

Hi all,

Last week I attended a wonderful meeting on Approximate Bayesian Computation in Banff, which gathered a nice crowd of ABC users and enthusiasts, including lots of people outside of computational stats, whom I wouldn’t have met otherwise. Christian blogged about it there. My talk on Inference with Wasserstein distances is available as a video here (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, the paper is here). In this post, I’ll summarize a few (personal) points and questions on ABC methods, after recalling the basics of ABC (ahem).

The goal is to learn parameters $\theta$ from a generative model. We know how to sample “fake” data $z_{1:n} = (z_1, \ldots, z_n)$ from the model $p(z_{1:n}|\theta)$ (also called “simulator”, “generator” or “black-box”), given the parameters $\theta$. We have a prior $p(\theta)$, and data $y_{1:n}$, where each $y_i$ is $d$-dimensional. We cannot evaluate the likelihood function $\theta \mapsto p(y_{1:n}|\theta)$, thus we cannot apply the usual MLE and Bayesian toolbox. So what can we do?

We can sample parameters from the prior, and sample fake data given these parameters. Some of these fake data will resemble the actual data, in which case we might be interested in the corresponding parameters. More formally, we can sample $\theta \sim p(\theta)$ and $z_{1:n}\sim p(z_{1:n}|\theta)$ until $d(y_{1:n}, z_{1:n}) \leq \varepsilon$, where $d(y_{1:n}, z_{1:n})$ is a distance or pseudo-distance between samples (e.g. the Euclidean distance between summary statistics of the samples), and $\varepsilon$ is a threshold. This procedure corresponds to an ABC “rejection sampler”, which targets the so-called ABC posterior distribution, which, itself, approximates a certain distribution as $\varepsilon \to 0$. Essentially, if the discrepancy measure $d(y_{1:n}, z_{1:n})$ is sensible, and $\varepsilon$ is a small enough value, there is hope that the ABC posterior is useful for estimating parameters. Lots of variations of this idea exist: see the bibliography here. Now, some points gathered from the meeting.
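
Before getting to them, here is a minimal sketch of the rejection sampler just described, in Python with NumPy. Everything specific in it is an assumption made for illustration and not something presented at the meeting: the function name `abc_rejection`, the toy Normal model with a Normal prior, the choice of the sample mean as summary statistic, and the values of `eps` and `n_accept`. The toy model has a tractable likelihood, so ABC is not needed there; it is only used to keep the example short.

```python
import numpy as np

def abc_rejection(y, sample_prior, simulate, distance, eps, n_accept, rng,
                  max_tries=1_000_000):
    """Keep prior draws whose fake data fall within eps of the observed data."""
    accepted = []
    for _ in range(max_tries):
        theta = sample_prior(rng)           # theta ~ p(theta)
        z = simulate(theta, len(y), rng)    # z_{1:n} ~ p(z_{1:n} | theta)
        if distance(y, z) <= eps:           # accept if d(y, z) <= eps
            accepted.append(theta)
            if len(accepted) == n_accept:
                break
    return np.array(accepted)

# Toy setting (assumed): y_i ~ Normal(theta, 1), prior theta ~ Normal(0, 10^2),
# pseudo-distance = absolute difference between sample means.
rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=100)

samples = abc_rejection(
    y,
    sample_prior=lambda rng: rng.normal(0.0, 10.0),
    simulate=lambda theta, n, rng: rng.normal(theta, 1.0, size=n),
    distance=lambda y, z: abs(y.mean() - z.mean()),
    eps=0.1,
    n_accept=200,
    rng=rng,
)
```

The accepted values in `samples` are draws from the ABC posterior for this particular distance and threshold. Shrinking `eps` brings that distribution closer to its limit, at the price of more rejected simulations per accepted draw.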

• People use ABC in challenging scenarios, to the point of equating ABC to “statistical inference in complex models”. Some people care about statistical guarantees, such as coverage, while some don’t. Perhaps some people come from fields where providing wrong answers is not a big deal, whereas in some other fields it would be a catastrophe.
• Lots of tools, i.e. prediction / classification / density estimation techniques, can be readily plugged into all parts of an ABC approach, yielding as many “ABC methods”. But how do these methods scale with the number of observations, their dimension, the dimension of the parameter space?
• With respect to scaling, some asymptotic rates are now available, which is a huge advance compared to just a few years ago, see e.g. Fearnhead and Li and Frazier, Martin, Robert and Rousseau.
• Expensive models or simulators might take hours to generate fake data $z_{1:n}$. This generally precludes long ABC-MCMC runs, and motivates cheap “emulators” or “surrogate models”, and Bayesian optimization / design techniques.
• In other settings, it can be cheap to generate data sets. This is typically the case for low-dimensional state space models, where parameter inference might be expensive but simulating each time series is cheap.
• It can be difficult to compare sampling algorithms which have different target distributions. It is often the case that variants of ABC are compared to standard MCMC methods, and then all of these have different targets…! I find it helpful to recall that I could also sample from the prior distribution. How do I know that my ABC method does better than that?
• In some settings, the pseudo-distance $d(y_{1:n}, z_{1:n})$ used to compare samples relies on summary statistics, which are themselves coming from years of domain-specific knowledge and experience. There, the issue is not really to bypass summary statistics (i.e. to remove the “A” in ABC), but rather to make the best use of these statistics, e.g. via regression adjustments, or by selecting among them in a clever way.
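
To make the regression adjustment mentioned in the last point a bit more concrete, here is a rough sketch in Python with NumPy. It is an unweighted linear variant, in the spirit of the adjustment of Beaumont, Zhang and Balding (2002) but without their kernel weights, which is a simplification on my part. The function name `regression_adjust`, the toy Normal model, the observed summary `s_obs` and the deliberately loose threshold are all assumptions chosen for illustration.

```python
import numpy as np

def regression_adjust(thetas, summaries, s_obs):
    """Unweighted linear regression adjustment of accepted ABC draws.

    thetas: (N,) accepted parameters; summaries: (N, k) simulated summaries;
    s_obs: (k,) observed summaries.
    """
    centered = summaries - s_obs
    # Regress theta on the centered summaries, with an intercept.
    design = np.column_stack([np.ones(len(centered)), centered])
    coef, *_ = np.linalg.lstsq(design, thetas, rcond=None)
    beta = coef[1:]
    # Shift each draw as if its summaries had matched the observed ones.
    return thetas - centered @ beta

# Toy setting (assumed): z_i ~ Normal(theta, 1), prior theta ~ Normal(0, 10^2),
# summary statistic = sample mean of n = 100 fake observations.
rng = np.random.default_rng(0)
s_obs = np.array([2.0])                     # hypothetical observed summary
theta = rng.normal(0.0, 10.0, size=20_000)
z = rng.normal(theta[:, None], 1.0, size=(20_000, 100))
s = z.mean(axis=1, keepdims=True)

keep = np.abs(s[:, 0] - s_obs[0]) <= 1.0    # deliberately loose threshold
raw = theta[keep]
adjusted = regression_adjust(raw, s[keep], s_obs)
```

On this toy example the adjusted draws should be noticeably more concentrated around the observed summary than the raw accepted ones, which is the point of the adjustment: it lets a loose threshold do some of the work of a much smaller one. Whether a linear correction is adequate in a real model is a separate question.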