Hi,

With Stephane Shao, Jie Ding and Vahid Tarokh we have just arXived a tech report entitled “Bayesian model comparison with the Hyvärinen score: computation and consistency“. Here I’ll explain the context, that is, scoring rules and Hyvärinen scores (originating in Hyvärinen’s score matching approach to inference), and then what we actually do in the paper.

Let’s start with *scoring rules*. These are loss functions for the task of predicting a variable $y$ with a probability distribution $p$. If $p$ is used to predict $y$ and $y$ occurs, then the score is a real value, e.g. denoted by $S(y, p)$; the smaller the score the better, and overall we want to find $p$ that minimizes $\mathbb{E}[S(Y, p)]$, where the expectation is with respect to the distribution of $Y$. A scoring rule is *proper* if the above expectation is minimized when $p$ is precisely the distribution of $Y$. An example of *proper scoring rule* is $S(y, p) = -\log p(y)$, the *logarithmic scoring rule*.
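
To make this concrete, here is a small numerical sketch (my own illustration, not from the paper): the expected logarithmic score of a Normal predictive, estimated by Monte Carlo, is smallest when the predictive matches the data-generating distribution, as propriety requires.

```python
import numpy as np

def log_score(y, mean, sd):
    """Logarithmic scoring rule S(y, p) = -log p(y) for a Normal predictive."""
    return 0.5 * np.log(2 * np.pi * sd**2) + 0.5 * ((y - mean) / sd)**2

rng = np.random.default_rng(1)
ys = rng.normal(loc=0.0, scale=1.0, size=100_000)  # data drawn from N(0, 1)

# expected score under the true predictive vs a mismatched one
true_score = log_score(ys, mean=0.0, sd=1.0).mean()
wrong_score = log_score(ys, mean=0.5, sd=1.0).mean()
print(true_score, wrong_score)
# propriety: the true distribution achieves the smaller expected score
```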

We can interpret Bayes factors in terms of logarithmic scoring rules (as in Chapter 6 of Bernardo & Smith). Indeed, the logarithm of the Bayes factor between models $M_1$ and $M_2$ is the difference of log-evidences:

$\log B_{1,2} = \log p(y_{1:n} \mid M_1) - \log p(y_{1:n} \mid M_2).$

In this sense, the Bayes factor compares the predictive performance of models. Decomposing these marginal likelihoods into conditionals, we have for model $M$:

$\log p(y_{1:n} \mid M) = \sum_{t=1}^{n} \log p(y_t \mid y_{1:t-1}, M),$

(with the convention $p(y_1 \mid y_{1:0}, M) = p(y_1 \mid M)$), which can be interpreted as a measure of performance of the out-of-sample predictive distributions $p(y_t \mid y_{1:t-1}, M)$, summed up over time. Importantly, this interpretation of Bayes factors holds also when models are misspecified.

So what’s not to like? Prior specification affects the evidence, which is completely fine per se. What’s concerning is the extent of the impact of the prior: seemingly innocent changes of prior distributions can have drastic effects on the evidence and thus on Bayes factors. This is the case in the simplest example of a Normal location model: $Y \sim \mathcal{N}(\theta, \sigma^2)$, with $\sigma$ fixed and prior $\theta \sim \mathcal{N}(0, \sigma_0^2)$. Then the log-evidence behaves like $-\log \sigma_0$ when $\sigma_0 \to \infty$. This means that the log-evidence can take crazy values, and is not even well-defined in that limit. However, that limit corresponds to a flat prior which is not crazy in this model, at least in terms of parameter inference. This is a reason for people to avoid vague priors when relying on Bayes factors for model comparison.
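
This behaviour is easy to verify numerically; the following sketch of mine (with arbitrary values $y = 1$, $\sigma = 1$) computes the log-evidence of the Normal location model in closed form and lets the prior standard deviation grow:

```python
import numpy as np

def log_evidence(y, sigma, sigma0):
    """Log marginal likelihood of one observation y ~ N(theta, sigma^2)
    under the prior theta ~ N(0, sigma0^2): marginally y ~ N(0, sigma^2 + sigma0^2)."""
    var = sigma**2 + sigma0**2
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * y**2 / var

y, sigma = 1.0, 1.0
for sigma0 in [1.0, 1e2, 1e4, 1e6]:
    print(sigma0, log_evidence(y, sigma, sigma0))
# the log-evidence drifts to minus infinity like -log(sigma0) as the prior flattens
```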

Conversely, this is a reason to seek alternatives to the evidence as a model comparison criterion, see for instance intrinsic Bayes factors, fractional Bayes factors, or the mixture approach of Kamary et al. Our work follows Dawid & Musio (2015) who propose to change the scoring rule. Instead of the logarithmic scoring rule, they advocate the Hyvärinen scoring rule, which leads to replacing $\log p(y_{1:n} \mid M)$ by

$\sum_{t=1}^{n} \left( 2\, \Delta_{y_t} \log p(y_t \mid y_{1:t-1}, M) + \left\| \nabla_{y_t} \log p(y_t \mid y_{1:t-1}, M) \right\|^2 \right).$

This barbaric expression involves derivatives of the log-densities of predictive distributions, instead of the log-densities themselves. It can then be checked in the Normal location model that the score is well-defined even in the limit $\sigma_0 \to \infty$. Thankfully it can also be checked that the Hyvärinen score is *proper*. Note that variants of the score have been proposed for discrete observations, but there are cases where the Hyvärinen score is inapplicable, namely when predictive densities are not smooth enough, e.g. Laplace distributions.
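
For a univariate Normal predictive, the Hyvärinen score has a simple closed form, which makes the contrast with the log-evidence easy to check numerically (a sketch of mine, not code from the paper):

```python
import numpy as np

def hyvarinen_score_normal(y, mean, var):
    """Hyvarinen score 2 * (d^2/dy^2) log p(y) + ((d/dy) log p(y))^2
    for a univariate Normal predictive p = N(mean, var)."""
    return -2.0 / var + (y - mean)**2 / var**2

# Normal location model: y ~ N(theta, sigma^2) with prior theta ~ N(0, sigma0^2),
# so the prior predictive distribution is N(0, sigma^2 + sigma0^2)
y, sigma = 1.0, 1.0
for sigma0 in [1.0, 1e3, 1e6]:
    print(sigma0, hyvarinen_score_normal(y, 0.0, sigma**2 + sigma0**2))
# unlike the log-evidence, the score converges (to 0) as sigma0 grows
```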

In the paper, we show how sequential Monte Carlo samplers can approximate this scoring rule, for a wide range of models including nonlinear state space models. We also show the consistency of this scoring rule for model selection, as the number of observations goes to infinity; our proof relies on strong regularity assumptions, but the numerical experiments indicate that the results hold under weaker conditions. Finally we investigate an example of population growth model applied to kangaroos, and a Lévy-driven stochastic volatility model which we use to illustrate the consistency result. Both of these cases feature intractable likelihoods approximated by particle filters within an SMC^2 algorithm.

The code producing the figures of the paper is available on Github: https://github.com/pierrejacob/bayeshscore


**1° Using arXiv**

I have always had the feeling that the default presentation of an author’s arXiv listing is a bit crude and unfit for identifying researchers. Actually there is a simple way to improve it, by creating an arXiv public author identifier. The steps required to create your own public author identifier are described here. See below for a before/after comparison of the presentation. From there, it is possible to dynamically include the list of your publications in your own home page using the following JavaScript widget.

**2° Using the French portal HAL**

Apparently documented only in French. The widget is called Haltools and is developed by Inria. It just requires entering a researcher’s name. There are formatting options, such as ranking by year or publication type, showing abstracts, pictures, etc. See e.g. my page as displayed below.

**3° Using a bibtex to html converter**

Apparently, there exist at least two such converters with the same name: bibtex2html by Jean-Christophe Filliâtre and bibtex2html by Grégoire Malandain, both from Inria. I personally use the first one. I use some bash code (below, or at the link) to first run the *bib2bib* command on my mybiblio.bib file, and then run the *bibtex2html* command on the file created. A nifty option is *named-field*: it creates links to e.g. the journal webpage, arXiv, blog posts, DOI, etc.:

bib2bib -oc intermediatefile -ob webpage.bib mybiblio.bib

bibtex2html -nobibsource -citefile intermediatefile --sort-by-date --reverse-sort --revkeys --style --named-field springer "Springer" --named-field blog "blog" --named-field pdf "pdf" --named-field book "book" --named-field journal-link "journal" --named-field hal "HAL" webpage.bib

The output is made of webpage.bib and webpage.html, that I manually copy-paste to my webpage.


The International Society for Bayesian Analysis (ISBA) is running elections until November 15. This year, two contributors to this blog, Nicolas Chopin and myself, are running for ISBA Section offices. The sections of the society, nine in number as of today, gather researchers with common research interests: Computation, Objective Bayes, Nonparametrics, etc.

Here are our candidate statements:

**Nicolas Chopin**

**Position Title/Affiliation**

Prof. of Statistics at the ENSAE, Paris

**Position Being Sought 2018**

Bayesian Computation Chair-Elect

**Candidate Statement 2018**

MCMC, SMC, Variational Bayes, Expectation Propagation, ABC approaches… There are so many ways to compute Bayesian quantities these days, and each seems to have its merits and use cases.

If elected, I would like to put particular attention on making the section as inclusive as possible: that is, to attract all scientists interested in some form of Bayesian computation, whether deterministic or Monte Carlo based, whether generic or specialised to a particular problem.

To know more about me and my research: https://sites.google.com/site/nicolaschopinstatistician/

**Julyan Arbel**

**Position Title/Affiliation**

Researcher/Inria Grenoble Rhône-Alpes

**Position Being Sought 2018**

Objective Bayes Treasurer

**Candidate Statement 2018**

I am a researcher at Inria, Grenoble, capital of the French Alps. Earlier this year, I had the chance to spend three months at the University of Texas at Austin, a hotspot of Objective Bayes! I completed my PhD at Paris-Dauphine with Judith Rousseau and Ghislaine Gayraud, and did a three-year postdoc at the wonderful Collegio Carlo Alberto in Turin, and at Bocconi University in Milan. My research interests cover statistical inference and theoretical understanding of Bayesian stochastic models in a variety of applications ranging from Ecology to Astrophysics.

Objective Bayes was instrumental during my undergraduate studies (some ten years ago) in making me eager for a PhD, when I visited José-Miguel Bernardo in Valencia for a summer internship. It was a vibrant place for learning Objective Bayes (and reference inference), and a perfect time in my life to start to understand what the life of a researcher is all about. A couple of years later, OBayes 2009 in Philadelphia was the very first conference I attended. Now I look forward to giving back to the Section some of what I have gained from it since then.

Please visit my webpage if you want to know more about me and my research: http://www.julyanarbel.com/


Hi,

This post is about computational issues with the cut distribution for Bayesian inference in misspecified models. Some motivation was given in a previous post about a recent paper on modular Bayesian inference. The cut distribution, or variants of it, might play an important role in combining statistical models, especially in settings where one wants to propagate uncertainty while preventing misspecification from damaging estimation. The cut distribution can also be seen as a probabilistic analog of two-step point estimators. So the cut distribution is more than just a trick! And it raises interesting computational issues which I’ll describe here along with a solution via unbiased MCMC.

What is the cut distribution? Suppose that you estimate a parameter $\theta_1$ with a distribution $\pi_1(\theta_1)$. This might be the posterior distribution under some model. Then consider a second parameter $\theta_2$, which you would want to infer via a distribution $\pi_2(\theta_2 \mid \theta_1)$; this might be a posterior distribution in a second model, given $\theta_1$. There are many such situations, e.g. $\theta_1$ might represent missing data or covariates plugged into the second model. You can then consider the joint probability distribution:

$\pi_{\text{cut}}(\theta_1, \theta_2) = \pi_1(\theta_1)\, \pi_2(\theta_2 \mid \theta_1).$

This is different from the posterior distribution in a joint model, where both parameters would be estimated simultaneously. Here the marginal of $\theta_1$ is insensitive to whatever craziness might be encoded in the second model. And this is the point: under the cut, the specification of the second model does not impact the first parameter.

It is difficult to design MCMC algorithms for $\pi_{\text{cut}}$, because its density cannot be evaluated point-wise. This is very well explained in Martyn Plummer’s paper. As an example, if the second posterior density takes the form

$\pi_2(\theta_2 \mid Y_2, \theta_1) = \frac{p(\theta_2 \mid \theta_1)\, p(Y_2 \mid \theta_1, \theta_2)}{p(Y_2 \mid \theta_1)},$

then one can often evaluate the prior and the likelihood appearing in the numerator. However, the denominator $p(Y_2 \mid \theta_1)$ is a function of $\theta_1$ which might not have an analytical form, and thus the density function of the cut distribution cannot be evaluated point-wise. Martyn Plummer proposes a solution which can be convenient but introduces an extra bias. The discussion mentions the similarity with doubly intractable problems, but as far as I can see this does not lead to practical algorithms here.

A naive MCMC solution goes as follows. First approximate $\pi_1(\theta_1)$ with an MCMC sample. Then, for each of these samples, say $\theta_1^{(i)}$, generate an MCMC sample approximating $\pi_2(\theta_2 \mid \theta_1^{(i)})$. This leads to a lot of MCMC runs to do. Each run goes for a number of iterations which needs to be chosen. The resulting approximation will be valid as *all the numbers of iterations* go to infinity: this is cumbersome.

If one could sample i.i.d. from $\pi_1(\theta_1)$ and from $\pi_2(\theta_2 \mid \theta_1)$, then a much simpler solution would be available: sample $\theta_1$ from $\pi_1$ and then $\theta_2$ from $\pi_2(\cdot \mid \theta_1)$. The resulting pair $(\theta_1, \theta_2)$ follows the cut distribution. Then one could sample such pairs independently many times to approximate the cut. Perfect sampling is an active research area, but unfortunately for many distributions perfect samplers are still either unavailable or prohibitively costly.
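
To fix ideas, here is a toy sketch of this plug-in construction (hypothetical Gaussian stages, chosen only for illustration) when both conditionals admit exact samplers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# toy example: both stages admit exact samplers (hypothetical distributions)
theta1 = rng.normal(loc=1.0, scale=0.1, size=n)   # theta1 ~ pi_1
theta2 = rng.normal(loc=theta1, scale=0.2)        # theta2 ~ pi_2(. | theta1)

# the pairs (theta1, theta2) follow the cut distribution; by construction
# the marginal of theta1 is exactly pi_1, untouched by the second model
print(theta1.mean(), theta2.mean(), theta2.var())
```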

This is closely connected to a solution proposed in our unbiased MCMC paper. Instead of providing perfect samples, we aim at the more humble goal of estimating integrals of arbitrary functions, say $h$, with respect to the cut distribution. By the tower property of expectations, we have

$\mathbb{E}_{\text{cut}}\left[ h(\theta_1, \theta_2) \right] = \mathbb{E}_{\pi_1}\left[ \bar{h}(\theta_1) \right],$

where

$\bar{h}(\theta_1) = \int h(\theta_1, \theta_2)\, \pi_2(d\theta_2 \mid \theta_1).$

Using the proposed machinery, if MCMC algorithms are available at both stages, we can estimate $\bar{h}(\theta_1)$ without bias for any $\theta_1$, and use these estimates to estimate $\mathbb{E}_{\pi_1}[\bar{h}(\theta_1)]$, again without bias. The lack of bias makes the resulting procedure consistent in the number of independent replicates of the proposed estimator, which can be computed completely in parallel.


With Jeremy Heng, we have recently arXived a paper describing how to remove the burn-in bias of Hamiltonian Monte Carlo (HMC). This follows recent work on unbiased MCMC estimators in general, on which I blogged here. The case of HMC requires a specific yet very simple coupling. A direct consequence of this work is that Hamiltonian Monte Carlo can be massively parallelized: instead of running one chain for many iterations, one can run short coupled chains independently in parallel. The proposed estimators are consistent in the limit of the number of parallel replicates. This is appealing, as the number of available processors has been increasing much faster than clock speed over recent years, and should keep doing so for the years to come, for a number of reasons explained e.g. here.

As described in the previous blog post, the proposed construction involves a coupling of Markov chains. So consider two chains $(X_t)$ and $(Y_t)$. The coupling must be such that each chain is marginally a standard HMC chain (or any variant thereof), but jointly the chains meet exactly, i.e. become identical, after a random number of iterations $\tau$ called the meeting time. That is, the variable $\tau$ is finite almost surely. Here for simplicity, I neglect the time shift mentioned in the earlier post and in the papers.

A simple HMC kernel works as follows. Given a current state $X_t$, an initial velocity $V$ is drawn from a multivariate Normal distribution. Then, the equations of Hamiltonian dynamics, corresponding to the movement of a particle with a potential energy given by minus the log target density, are numerically solved with a leap-frog integrator, with a step size $\varepsilon$ and a number of steps $L$. The final position is then accepted or not as the next state $X_{t+1}$, according to a Metropolis-Hastings (MH) ratio which corrects for the error introduced by the leap-frog integrator. The understanding of HMC has improved considerably in recent months, notably with the contributions of Mangoubi & Smith 2017 and Durmus, Moulines & Saksman 2017.
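
In code, one such HMC transition can be sketched as follows (a generic textbook implementation, not the code from the paper):

```python
import numpy as np

def hmc_kernel(x, logpdf, grad, step, n_steps, rng):
    """One HMC transition targeting exp(logpdf): leap-frog integration
    followed by a Metropolis-Hastings correction for discretization error."""
    v0 = rng.normal(size=x.shape)            # initial velocity ~ N(0, I)
    xn, vn = x.copy(), v0.copy()
    vn += 0.5 * step * grad(xn)              # half step on velocity
    for i in range(n_steps):
        xn += step * vn                      # full step on position
        if i < n_steps - 1:
            vn += step * grad(xn)
    vn += 0.5 * step * grad(xn)              # final half step on velocity
    log_ratio = (logpdf(xn) - logpdf(x)
                 - 0.5 * np.sum(vn**2) + 0.5 * np.sum(v0**2))
    if np.log(rng.uniform()) < log_ratio:
        return xn                            # accept the proposed endpoint
    return x                                 # reject: stay at the current state

# usage: sample from a standard Normal target in 2 dimensions
rng = np.random.default_rng(0)
logpdf = lambda x: -0.5 * np.sum(x**2)
grad = lambda x: -x
x = np.zeros(2)
chain = []
for _ in range(2000):
    x = hmc_kernel(x, logpdf, grad, step=0.2, n_steps=10, rng=rng)
    chain.append(x)
```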

It turns out that using common random numbers for the two HMC chains, that is, using common initial velocities and common uniform variables for the acceptance steps, leads to the chains contracting very quickly. This is illustrated in the above animation, which shows the first iterations of 250-dimensional chains targeting a multivariate Normal distribution; the plot shows the evolution of the first two components. The distance between the chains goes rapidly to zero. Assumptions on the target distribution are necessary for such contraction to occur, such as strong convexity as in Section 2.4 of Mangoubi & Smith 2017. In our work, we only need such assumptions to be satisfied on subsets of the space that the chains visit regularly (see the paper for more precision). For instance, we might assume that the target density is strictly positive everywhere and that there are compact sets on which strong log-concavity of the target density holds.
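
The contraction under common random numbers can be reproduced in a few lines; here is a one-dimensional sketch of mine (not the paper's code) with a standard Normal target, where the two chains share their velocities and acceptance uniforms:

```python
import numpy as np

def leapfrog(x, v, step, n_steps):
    # leap-frog integration for the potential x^2 / 2, i.e. a standard Normal target
    v = v - 0.5 * step * x
    for i in range(n_steps):
        x = x + step * v
        if i < n_steps - 1:
            v = v - step * x
    v = v - 0.5 * step * x
    return x, v

def hmc_step(x, v0, u, step=0.2, n_steps=5):
    # one HMC transition driven by an externally supplied velocity v0 and uniform u
    xn, vn = leapfrog(x, v0, step, n_steps)
    log_ratio = -0.5 * xn**2 + 0.5 * x**2 - 0.5 * vn**2 + 0.5 * v0**2
    return xn if np.log(u) < log_ratio else x

rng = np.random.default_rng(1)
x, y = 4.0, -4.0                            # two chains started far apart
distances = [abs(x - y)]
for _ in range(50):
    v, u = rng.normal(), rng.uniform()      # common random numbers for both chains
    x = hmc_step(x, v, u)
    y = hmc_step(y, v, u)
    distances.append(abs(x - y))
# the distance between the chains contracts geometrically toward zero
print(distances[0], distances[-1])
```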

In some examples, coupled HMC chains contract so fast that the distance between them goes under machine precision after a reasonably small number of iterations. We can then consider that the chains have met. Yet this is not very clean… so instead, we mix HMC steps with coupled MH steps that use maximally coupled random walk proposals. Indeed, if two chains are already close to one another thanks to the coupled HMC steps, then the coupled MH steps trigger exact meetings with large probability.

In the experiments, we try a multivariate Normal target in dimension 250, a bivariate Normal truncated by quadratic inequalities, and a logistic regression with 300 covariates. These examples show that tuning parameters, such as the step size and the number of leap-frog steps, that would be optimal for a single HMC chain might not be optimal for the proposed unbiased estimators. This leads to a loss of asymptotic efficiency, which is hardly surprising: this is a classic case of bias removal increasing the variance. Yet from the experiments, the loss of efficiency seems very reasonable in exchange for immediate benefits on parallel processors. In the logistic regression example, the method essentially amounts to running HMC chains of length 1,000 in parallel.

Another way of parallelizing HMC, or any MCMC method, is to plug it in the rejuvenation step of an SMC sampler (see here or here). As has been known for fifteen years, this has benefits such as normalizing constant estimation and some robustness to multimodality. Both approaches provide estimators that are consistent in the limit of a number of operations that can be massively parallelized: particles in SMC samplers, and independent replicates of unbiased MCMC. One advantage of the latter might lie in the simple construction of confidence intervals from the standard Central Limit Theorem, but these can be built from SMC samplers too.


Nine R user communities already exist in France and there is a much larger number of R communities around the world. It was time for Grenoble to start its own!

The goal of the R user group is to facilitate the identification of local useRs, to initiate contacts, and to organise experience and knowledge sharing sessions. The group is open to any local useR interested in learning and sharing knowledge about R.

The group’s website features a map and a table with the members of the R group. Members with specific skills related to the use of R are referenced in the table and can be contacted by other members. A Gitter chat room allows members to discuss R issues, and a calendar presents the upcoming events.

**Working sessions**
Monthly working sessions of two hours start with presentations or tutorials. The second part is dedicated to helping each other, meeting new people, and sharing beers and soft drinks offered by the Grenoble Data Institute. The current program looks nice:


Hi,

With Lawrence Murray, Chris Holmes and Christian Robert, we have recently arXived a paper entitled “Better together? Statistical learning in models made of modules”. Christian blogged about it already. The context is the following: parameters of a first model appear as inputs in another model. The question is whether to consider a “joint model approach”, where all parameters are estimated simultaneously with all of the data, or whether one should instead follow a “modular approach”, where the first parameters are estimated with the first model only, ignoring the second model. Examples of modular approaches include the “cut distribution“, or “two-step estimators” (e.g. Chapter 6 of Newey & McFadden (1994)). In many fields, modular approaches are preferred, because the second model is suspected of being more misspecified than the first one. Misspecification of the second model can “contaminate” the joint model, with dire consequences for inference, as described e.g. in Bayarri, Berger & Liu (2009). Other reasons include computational constraints and the lack of simultaneous availability of all models and associated data. In the paper, we try to make sense of the defects of the joint model approach, and we propose a principled, quantitative way of choosing between joint and modular approaches.

Lawrence and I started wondering about these questions in the context of models of plankton population growth, back in 2012. Plankton growth is affected by ocean temperatures. These temperatures are not measured everywhere at all times, but one can first use a geophysics model to infer temperatures at desired locations and times. Then these estimated temperatures can be used as inputs (or “forcings”) to model plankton growth. We were wondering whether we should instead define a joint model of temperatures + plankton, to take the uncertainty of temperatures into account. Parslow, Cressie, Campbell, Jones & Murray (2013) provide an example of plankton model where temperatures are considered fixed, which is common practice.

An initial difficulty with the joint model approach is of a computational nature: if the two stages (e.g. plankton and geophysics) both involve large models, requiring weeks of computation, the joint model might simply be impossible to deal with. Indeed, the number of parameters adds up, and computational methods typically have super-linear costs in the number of parameters, which is the dimension of the space to explore (i.e. to sample or to optimize). It’s even worse than that since extra difficulties accumulate, such as multimodality, intractable likelihoods, etc.

Interestingly, the difficulties are not only computational but also statistical: the joint model approach is not always preferable in terms of estimation. This has been reported in various contexts, such as pharmacokinetics-pharmacodynamics, where the PD part is usually considered *misspecified relative to* the PK part. A good reference is Lunn, Best, Spiegelhalter, Graham & Neuenschwander (2009). There seems to have been enough demand from practitioners for WinBUGS to include a “cut” function, which is an attempt at estimating some parameters irrespective of other parts of the model (see the WinBUGS manual). Martyn Plummer (of JAGS) wrote a very interesting paper on the cut distribution, on issues associated with existing algorithms to sample from it, and on proposals to fix them. Another super relevant article is Bayarri, Berger & Liu (2009) that considers the issue in multiple cases of Bayesian inference in misspecified settings. Another link between modular approaches and model misspecification has been thoroughly investigated in the context of causal inference with propensity scores by Corwin Zigler and others.

Departing from the joint model approach seems to pose difficulties for some statisticians. Indeed the cut distribution is strange: it introduces directions in the graph relating the variables of the model, so that A can impact B without B impacting A. This is strange because information is usually modeled as a flow going both ways. In the paper, we argue that the cut distribution can be a reasonable choice in terms of decision-theoretic principles, such as predictive performance assessed with the logarithmic scoring rule. This leads to a quantitative criterion that can be computed to decide whether to “cut” or not. Following this type of decision-theoretic reasoning seems to me very much in the flavor of the Bayesian paradigm, contrary to always trusting joint models and the associated posterior distributions.

On the other hand, the issue might appear trivial to econometricians, who are used to two-step estimators and to misspecification in general; see e.g. White (1982), Robustness by Hansen & Sargent, and two-step estimators in e.g. Pagan (1984). The cut distribution is a probabilistic version of these estimators, so I wonder why it has not been studied in more detail earlier. If you know about early references, please let me know! In passing, in the recent unbiased MCMC paper (blog post here), we describe new ways of approximating the cut distribution, which hopefully resolve some of the issues raised by Plummer.

In the end, our article is an attempt at starting a discussion on modular versus joint approaches. It is likely that more and more situations will require the combination of models (e.g. merging heterogeneous data sources), in ways that take model misspecification into account.


Didier Fraix-Burnet (IPAG), Stéphane Girard (Inria) and myself are organising a School of Statistics for Astrophysics, Stat4Astro, to be held in October in France. The primary goal of the School is to train astronomers in the use of modern statistical techniques. It also aims at bridging the gap between the two communities by emphasising practice, through work done in common, to give firm grounds to the theoretical lessons, and to initiate work on problems brought by the participants. There have been two previous sessions of this school, one on regression and one on clustering. The speakers of this edition, including Christian Robert, Roberto Trotta and David van Dyk, will focus on the Bayesian methodology, with the moral support of the Bayesian society, ISBA. The interest of this statistical approach in astrophysics probably comes from its necessity and its success in determining the cosmological parameters from observations, especially from the cosmic background fluctuations. The cosmological community has thus been very active in this field (see for instance the Cosmostatistics Initiative COIN).

But the Bayesian methodology, complementary to the more classical frequentist one, has many applications in physics in general, owing to its ability to incorporate a priori knowledge into the inference computation, such as the uncertainties brought by the observational processes.

As with other sophisticated statistical techniques, astronomers are generally not familiar with the Bayesian methodology, even though it is becoming more and more widespread and useful in the literature. This school will provide the participants with both a strong theoretical background and a solid practice of Bayesian inference:

- Introduction to R and Bayesian Statistics (Didier Fraix-Burnet, Institut de Planétologie et d’Astrophysique de Grenoble)
- Foundations of Bayesian Inference (David van Dyk, Imperial College London)
- Markov chain Monte Carlo (David van Dyk, Imperial College London)
- Model Building (David van Dyk, Imperial College London)
- Nested Sampling, Model Selection, and Bayesian Hierarchical Models (Roberto Trotta, Imperial College London)
- Approximate Bayesian Computation (Christian Robert, Univ. Paris-Dauphine, Univ. Warwick and Xi’an (!))
- Bayesian Nonparametric Approaches to Clustering (Julyan Arbel, Université Grenoble Alpes and Inria)

Feel free to register, we are not fully booked yet!

Julyan


Hi,

In a recent work on parallel computation for MCMC, and also in another one, and in fact also in an earlier one, my co-authors and I use a simple yet very powerful object that is standard in Probability but not so well-known in Statistics: the maximal coupling. Here I’ll describe what this is and an algorithm to sample from such couplings.

Consider two distributions $p$ and $q$ on a general (i.e. discrete or continuous) state space $\mathcal{X}$. A coupling refers to a joint distribution, say of a pair $(X, Y)$ on $\mathcal{X} \times \mathcal{X}$, with first marginal $p$ and second marginal $q$, i.e. $X \sim p$ and $Y \sim q$.

An independent coupling has density function $(x, y) \mapsto p(x)\, q(y)$. A maximal coupling, on the other hand, is such that the pair $(X, Y)$ has maximal probability of being identical, i.e. $\mathbb{P}(X = Y)$ is maximal under the marginal constraints $X \sim p$ and $Y \sim q$.

At first, it might sound weird that there would be any possibility of $X$ and $Y$ being identical, since their distributions are on a continuous state space. But indeed it is possible. Intuitively, imagine that $X$ is sampled from $p$: as long as $q(X) > 0$, then $X$ could also be a sample from $q$, could it not?

The following algorithm provides samples from a maximal coupling of $p$ and $q$.

- Sample $X$ from $p$.
- Sample $W \sim \mathcal{U}(0, p(X))$. If $W \leq q(X)$, then output $(X, X)$.
- Otherwise, sample $Y^\star$ from $q$ and $W^\star \sim \mathcal{U}(0, q(Y^\star))$, until $W^\star > p(Y^\star)$, and output $(X, Y^\star)$.
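
The three steps above translate almost line for line into code; here is a sketch of mine for two univariate Normal distributions:

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd)**2) / (sd * np.sqrt(2 * np.pi))

def maximal_coupling(rng, p_sample, p_pdf, q_sample, q_pdf):
    """Sample a pair (X, Y) with X ~ p, Y ~ q and P(X = Y) maximal."""
    x = p_sample(rng)
    if rng.uniform() * p_pdf(x) <= q_pdf(x):  # W ~ U(0, p(X)); accept if W <= q(X)
        return x, x
    while True:                               # otherwise sample from the residual of q
        y = q_sample(rng)
        if rng.uniform() * q_pdf(y) > p_pdf(y):
            return x, y

# maximal coupling of p = N(0, 1) and q = N(1, 1)
rng = np.random.default_rng(0)
pairs = [maximal_coupling(rng,
                          lambda r: r.normal(0.0, 1.0), lambda x: normal_pdf(x, 0.0, 1.0),
                          lambda r: r.normal(1.0, 1.0), lambda x: normal_pdf(x, 1.0, 1.0))
         for _ in range(50_000)]
xs, ys = np.array(pairs).T
print(np.mean(xs == ys))   # close to 1 - TV(p, q), here about 0.617
```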

It is clear that, if $(X, Y)$ is produced by the above algorithm, then $X \sim p$ (from step 1). Checking that $Y \sim q$ takes a few lines of calculus, similar to proving the validity of rejection sampling. It can be found e.g. in the appendix of this article. This scheme must have been known for a long time and is definitely in Thorisson’s book on coupling. I’m not quite sure who first came up with it, any tips welcome!

To understand the algorithm, let’s look at the following figure, where the density functions of $p$ and $q$ are plotted along with a shaded area under the curve $x \mapsto \min(p(x), q(x))$. The algorithm tries to sample from the distribution represented by the shaded area first, and if it does not succeed, it samples from the remainder, which has density $q - \min(p, q)$, up to a normalizing constant.

The graph at the beginning of the article shows pairs $(X, Y)$ generated by the algorithm. The red dots represent samples with $X = Y$. The probability of this event is precisely one minus the total variation distance between $p$ and $q$, and by the coupling inequality (e.g. here, Lemma 4.9), this is the maximum probability of the event under the marginal constraints.


Hi again,

As described in an earlier post, Espen Bernton, Mathieu Gerber, Christian P. Robert and I are exploring Wasserstein distances for parameter inference in generative models. Generally, ABC and indirect inference are fun to play with, as they make the user think about useful distances between data sets (i.i.d. or not), something which remains implicit in classical likelihood-based approaches. Thinking about distances between data sets can be a helpful and healthy exercise, even if not always necessary for inference. Viewing data sets as empirical distributions leads to considering the Wasserstein distance, and we try to demonstrate in the paper that it leads to an appealing inferential toolbox.

In passing, the first author Espen Bernton will be visiting Marco Cuturi, Christian Robert, Nicolas Chopin and others in Paris from September to January; get in touch with him if you’re over there!

We have just updated the arXiv version of the paper, and the main modifications are as follows.

- We propose a new distance between time series termed “curve-matching”, which turns out to be quite similar to dynamic time warping or Skorokhod distances. This distance might be particularly relevant for models generating non-stationary time series, such as Susceptible-Infected-Recovered models.
- Our theoretical results are generally improved. In particular, for the minimum Wasserstein/Kantorovich estimator and variants of it, the proofs are now based on the notion of epi-convergence, commonly used in optimization, and on various results from Rockafellar and Wets (2009), *Variational Analysis*.
- On top of the Hilbert distance, based on the Hilbert space-filling curve, we consider the use of the swapping distance. So we have the Hilbert distance, computable in $O(n \log n)$, where $n$ is the number of data points, the swapping distance in $O(n^2)$, and of course the Wasserstein distance in $O(n^3 \log n)$. Various other distances are discussed in Section 6 of the paper.

- On the asymptotic behavior of ABC posteriors, our results now cover the use of the Hilbert and swapping distances. This is thanks to the convenient property that the Hilbert distance is indeed a distance, and is always larger than the Wasserstein distance; we also rely on some of Mathieu’s recent results. And the swapping distance (if initialized with Hilbert sorting) is always sandwiched between the Wasserstein and Hilbert distances.
- The numerical experiments have been revised: there is now a bivariate g-and-k example with comparisons to the actual posterior; the toggle switch example from systems biology is unchanged; and there is a new queueing model example, with comparisons to the actual posterior obtained with particle MCMC (we could also have used the method of Shestopaloff and Neal). Finally, we have a more detailed study of the Lévy-driven stochastic volatility model, with 10,000 observations. There we show how transport distances can be combined with summaries to estimate all model parameters (we previously got only four out of five parameters, using transport distances alone).
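
As a side note, in one dimension computing the Wasserstein distance between two equal-size samples only requires sorting them and matching order statistics, which coincides with what Hilbert sorting gives in dimension 1; a quick sketch (my own illustration, not code from the paper):

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two equal-size empirical distributions
    in one dimension: match order statistics after sorting (O(n log n))."""
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y)**p)**(1.0 / p)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)
y = rng.normal(1.0, 1.0, size=10_000)
print(wasserstein_1d(x, y))   # close to 1, the mean shift between the samples
```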

The supplementary materials for the new version are here (while the supplementary for the previous arXiv version are still online here).
