In this third and last post about the sub-Gaussian property for the Beta distribution [1] (post 1 and post 2), I would like to show the interplay with the Bernoulli distribution as well as some connections with optimal transport (OT is a hot topic in general, and also on this blog with Pierre’s posts on Wasserstein ABC).

Let us see how sub-Gaussian proxy variances can be derived from transport inequalities. To this end, we first need to introduce the **Wasserstein distance** (of order 1) between two probability measures $P$ and $Q$ on a space $\mathcal{X}$. It is defined with respect to a distance $d$ on $\mathcal{X}$ by

$$W_1(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)\, \pi(dx, dy),$$

where $\Pi(P, Q)$ is the set of probability measures on $\mathcal{X} \times \mathcal{X}$ with fixed marginal distributions $P$ and $Q$ respectively. Then, a probability measure $P$ is said to satisfy a **transport inequality** with positive constant $\sigma$ if, for any probability measure $Q$ dominated by $P$,

$$W_1(Q, P) \leq \sigma \sqrt{2 D(Q \| P)},$$

where $D(Q \| P)$ is the entropy, or Kullback–Leibler divergence, between $Q$ and $P$. The nice result proven by Bobkov and Götze (1999) [2] is that the constant $\sigma^2$ is a sub-Gaussian proxy variance for $P$.

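For a discrete space $\mathcal{X}$ equipped with the Hamming metric, $d(x, y) = \mathbf{1}_{\{x \neq y\}}$, the induced Wasserstein distance reduces to the total variation distance, $W_1(P, Q) = \|P - Q\|_{TV}$. In that setting, Ordentlich and Weinberger (2005) [3] proved the distribution-sensitive transport inequality:

$$\|P - Q\|_{TV} \leq \sqrt{\frac{1}{g(\mu_P)} D(Q \| P)},$$

where the function $g$ is defined by $g(\mu) = \frac{1}{1 - 2\mu} \ln \frac{1 - \mu}{\mu}$ and the coefficient $\mu_P$ is called the balance coefficient of $P$, and is defined by $\mu_P = \max_{A \subset \mathcal{X}} \min\{P(A), 1 - P(A)\}$. In particular, the Bernoulli balance coefficient is easily shown to coincide with its mean $\mu$ (when $\mu \leq 1/2$; the resulting formula is symmetric under $\mu \mapsto 1 - \mu$). Hence, applying the result of Bobkov and Götze (1999) [2] to the above transport inequality yields a distribution-sensitive proxy variance of $\frac{1}{2 g(\mu)} = \frac{1 - 2\mu}{2 \ln((1 - \mu)/\mu)}$ for the Bernoulli with mean $\mu$, as plotted in blue above.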
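As a quick numerical illustration, here is a minimal R sketch of the Bernoulli proxy variance $(1 - 2\mu) / (2 \ln((1 - \mu)/\mu))$, compared with the variance $\mu(1 - \mu)$ and with the distribution-free value $1/4$. The function name and the grid of means are my own choices for illustration, and the value at $\mu = 1/2$ is set by continuity.

```r
# Proxy variance of a Bernoulli with mean mu (illustrative helper).
# At mu = 1/2 the formula reads 0/0, so the limit 1/4 is used there.
bernoulli_proxy_variance <- function(mu) {
  ifelse(abs(mu - 0.5) < 1e-8,
         0.25,
         (1 - 2 * mu) / (2 * log((1 - mu) / mu)))
}

mu_grid <- c(0.01, 0.1, 0.3, 0.5)
proxy <- bernoulli_proxy_variance(mu_grid)
print(cbind(mu = mu_grid,
            variance = mu_grid * (1 - mu_grid),
            proxy_variance = proxy,
            hoeffding = 0.25))
```

On such a grid the proxy variance sits between the variance and $1/4$, and the three values coincide at $\mu = 1/2$.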

In the Beta distribution case, we have not been able to extend this transport inequality methodology since the support is not discrete. However, a nice limiting argument holds. Consider a sequence of Beta$(\alpha, \beta)$ random variables with fixed mean $\mu = \alpha / (\alpha + \beta)$ and with a sum $\alpha + \beta$ going to zero. This converges to a Bernoulli random variable with mean $\mu$, and we have shown that the limiting optimal proxy variance of such a sequence of Beta with decreasing sum $\alpha + \beta$ is the one of the Bernoulli.

[1] Marchal, O. and Arbel, J. (2017). On the sub-Gaussianity of the Beta and Dirichlet distributions. Electronic Communications in Probability, 22:1–14. Code on GitHub.

[2] Bobkov, S. G. and Götze, F. (1999). Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28.

[3] Ordentlich, E. and Weinberger, M. J. (2005). A distribution dependent refinement of Pinsker’s inequality. IEEE Transactions on Information Theory, 51(5):1836–1840.

As a follow-up on my previous post on the sub-Gaussian property for the Beta distribution [1], I’ll give here a visual illustration of the proof.

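A random variable $X$ with finite mean $\mu = E[X]$ is sub-Gaussian if there is a positive number $\sigma$ such that:

$$E[\exp(\lambda (X - \mu))] \leq \exp\left(\frac{\lambda^2 \sigma^2}{2}\right) \quad \text{for all } \lambda \in \mathbb{R}.$$

We focus on $X$ being a Beta$(\alpha, \beta)$ random variable. Its moment generating function is known as the Kummer function, or confluent hypergeometric function ${}_1F_1(\alpha; \alpha + \beta; \lambda)$. So $X$ is $\sigma^2$-sub-Gaussian as soon as the difference function

$$u_\sigma(\lambda) = \exp\left(\frac{\alpha}{\alpha + \beta} \lambda + \frac{\sigma^2}{2} \lambda^2\right) - {}_1F_1(\alpha; \alpha + \beta; \lambda)$$

remains positive on $\mathbb{R}$. This difference function is plotted on the right panel above for a fixed pair of parameters $(\alpha, \beta)$. In the plot, $\sigma^2$ is varying from green for the variance $\mathrm{Var}[X] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$ (which is a lower bound to the optimal proxy variance) to blue for the value $\frac{1}{4(\alpha + \beta + 1)}$, a simple upper bound given by Elder (2016) [2]. The idea of the proof is simple: the optimal proxy variance corresponds to the value of $\sigma^2$ for which $u_\sigma$ admits a double zero, as illustrated with the red curve (black dot). The left panel shows the family of curves interpolating from green for the variance to blue for Elder's upper bound, with only one curve, in red, qualifying as the optimal proxy variance.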
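For readers who want to play with the difference function, here is a minimal R sketch. It assumes the difference is exp(mean * lambda + sigma2 * lambda^2 / 2) minus the moment generating function of the Beta, and it computes the latter by numerical integration rather than through a Kummer function routine. The parameters (2, 3) and the grid are illustrative choices, not the ones used for the figure.

```r
a <- 2; b <- 3   # illustrative Beta parameters (not those of the figure)
beta_mean <- a / (a + b)
beta_var <- a * b / ((a + b)^2 * (a + b + 1))
elder_bound <- 1 / (4 * (a + b + 1))

# Moment generating function E[exp(lambda * X)], by numerical integration
beta_mgf <- function(lambda) {
  integrate(function(x) exp(lambda * x) * dbeta(x, a, b),
            lower = 0, upper = 1, rel.tol = 1e-10)$value
}

# Difference between the Gaussian bound and the Beta MGF, for a proxy sigma2
difference <- function(lambda, sigma2) {
  exp(beta_mean * lambda + sigma2 * lambda^2 / 2) - sapply(lambda, beta_mgf)
}

lambda_grid <- seq(-10, 10, by = 0.1)
u_var <- difference(lambda_grid, beta_var)
u_elder <- difference(lambda_grid, elder_bound)
print(c(min_with_variance = min(u_var), min_with_elder_bound = min(u_elder)))
```

With the variance as candidate proxy, the difference dips slightly below zero for small positive values of lambda in this asymmetric example, whereas it stays non-negative on the grid with Elder's bound, up to integration error.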

[1] Marchal, O. and Arbel, J. (2017). On the sub-Gaussianity of the Beta and Dirichlet distributions. Electronic Communications in Probability, 22:1–14. Code on GitHub.

[2] Elder (2016), Bayesian Adaptive Data Analysis Guarantees from Subgaussianity, https://arxiv.org/abs/1611.00065

Hi all,

Kristian Lum, who was already one of my Statistics superheroes for her many interesting papers and great talks, bravely wrote the following text about her experience as a young statistician going to conferences:

https://medium.com/@kristianlum/statistics-we-have-a-problem-304638dc5de5

I can’t thank Kristian enough for speaking out. Her experience is both shocking and hardly surprising. Many, many academics report similar stories. This simply can’t go on like that.

I happen to have gone to the conferences mentioned by Kristian, and my experience as a young man was completely different. It was all about meeting interesting people, discussing ideas, being challenged, and having good times. Nobody harassed, touched or assaulted me. There was some flirting, as I guess is natural when hundreds of people are put in sunny places far away from home, but I was never the victim of any misconduct or abuse of power. So instead of driving me out of the field, conferences became important, enriching and rewarding moments of my professional life.

Looking back at those conferences I feel sick, and heartbroken, at the thought that some of my peers were having such a difficult time, because of predators who don’t ever face the consequences of their actions. Meanwhile I was part of the silent majority.

The recent series of revelations about sexual harassment and assaults in other professional environments indicates that this is not specific to our field, nor to academia. But this does not make it any more acceptable. I know for a fact that many leaders of our field take this issue extremely seriously (as Kristian mentions too), but clearly much, much more needs to be done. The current situation is just shameful; strong and coordinated actions will be needed to fix it. Thanks again to Kristian for the wake-up call.


Hi all,

This post deals with a strange phenomenon in R that I have noticed while working on unbiased MCMC. Reducing the problem to a simple form, consider the following code, which iteratively samples a vector ‘x’ and stores it in a row of a large matrix called ‘chain’ (I’ve kept the MCMC terminology).

    dimstate = 100
    nmcmc = 1e4
    chain = matrix(0, nrow = nmcmc, ncol = dimstate)
    for (imcmc in 1:nmcmc){
      if (imcmc == nrow(chain)){ #call to nrow
      }
      x = rnorm(dimstate, mean = 0, sd = 1)
      chain[imcmc,] = x #copying of x in chain
    }

If you execute this code, you will see that it is surprisingly slow: it takes close to a minute on my computer. Now, consider the next block, which does exactly the same except that the vector ‘x’ is not copied into the matrix ‘chain’.

    dimstate = 100
    nmcmc = 1e4
    chain = matrix(0, nrow = nmcmc, ncol = dimstate)
    for (imcmc in 1:nmcmc){
      if (imcmc == nrow(chain)){ #call to nrow
      }
      x = rnorm(dimstate, mean = 0, sd = 1)
      # chain[imcmc,] = x #no more copying
    }

This code runs nearly instantaneously. Could it be that just copying a vector in a matrix takes a lot of time? Sounds unlikely. Now consider this third block.

    dimstate = 100
    nmcmc = 1e4
    chain = matrix(0, nrow = nmcmc, ncol = dimstate)
    for (imcmc in 1:nmcmc){
      if (imcmc == nmcmc){ #no call to nrow
      }
      x = rnorm(dimstate, mean = 0, sd = 1)
      chain[imcmc,] = x #copying of x in chain
    }

This code runs nearly instantaneously as well; this time ‘x’ is copied into ‘chain’, but the call to the nrow function is removed….?! What is nrow doing? It is meant to simply return dim(chain)[1], the first dimension of chain. So consider this fourth block.

    dimstate = 100
    nmcmc = 1e4
    chain = matrix(0, nrow = nmcmc, ncol = dimstate)
    for (imcmc in 1:nmcmc){
      if (imcmc == dim(chain)[1]){ #call to dim instead of nrow
      }
      x = rnorm(dimstate, mean = 0, sd = 1)
      chain[imcmc,] = x #copying of x in chain
    }

This one also runs instantaneously! So replacing nrow(chain) by dim(chain)[1] solves the problem. Why?
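If you want to compare the variants on your own machine, here is a small sketch that wraps the loop in a function and times it with `system.time`. The helper `fill_chain` and its arguments are made up for this illustration, the chain is shorter than above to keep the slow case bearable, and the timings depend on the version of R: as far as I know, recent versions have reworked reference counting, so the gap may well disappear.

```r
# Hypothetical helper: fill a matrix row by row, querying the number of rows
# either through nrow() or through dim()[1] at every iteration.
fill_chain <- function(nmcmc = 2000, dimstate = 100, use_nrow = TRUE) {
  chain <- matrix(0, nrow = nmcmc, ncol = dimstate)
  for (imcmc in 1:nmcmc) {
    n <- if (use_nrow) nrow(chain) else dim(chain)[1]
    if (imcmc == n) {
      # last iteration: nothing to do, the test only forces the call
    }
    chain[imcmc, ] <- rnorm(dimstate, mean = 0, sd = 1)
  }
  chain
}

time_nrow <- system.time(chain_a <- fill_chain(use_nrow = TRUE))["elapsed"]
time_dim <- system.time(chain_b <- fill_chain(use_nrow = FALSE))["elapsed"]
print(c(with_nrow = unname(time_nrow), with_dim = unname(time_dim)))
```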

The answer comes from R guru and terrific statistician Louis Aslett. I directly quote from an exchange of emails, since he brilliantly explains the phenomenon.

You probably know R stores everything by reference, so if I do:

    x <- matrix(0, nrow=1e5, ncol=100)
    y <- x

I actually only have one copy of the matrix in memory with two references to it. If I then do:

x[1,1] <- 1

R will first make a copy of the whole matrix, update x to point to that and then change the first element to one. This idea is used when you pass a variable to a standard (i.e. non-core, non-primitive) R function, which nrow is: it creates a reference to the variable you pass so that it doesn’t have to copy and the function call is very fast …. as long as you don’t write to it inside the function, no copy need ever happen. But the “bad design” bit is that R makes a decision whether to copy on write based only on a reference count and crucially that reference count stays increased even after a function returns, irrespective of whether or not the function has touched the variable.

So:

    x <- matrix(0, nrow=1e5, ncol=100) # matrix has ref count 1
    x[1,1] <- 1 # ref count is 1, so write with no copy
    nrow(x) # ref count is 2 even though nothing was touched
    x[1,1] <- 1 # ref count still 2, so R copies before writing first element. Now the ref count drops to 1 again
    x[2,2] <- 1 # this writes without a copy as ref count got reset on last line
    nrow(x) # ref count jumps
    x[3,3] <- 1 # copy invoked again! Aaaargh!

So by calling nrow in the loop for the first example, the chain matrix is being copied in full on every iteration. In the second example, chain is never written to so there is no negative side effect to the ref count having gone up. In the third example, chain only ever has ref count 1 so there are no copies and each row is written in-place. I did a quick bit of profiling and indeed in the slow example, the R garbage collector allocates and tidies up nearly 9GB of RAM when executing the loop!

The crazy thing is that dim(chain)[1] works full speed even though that is all that nrow is doing under the hood, but the reason is that dim is a so-called “primitive” core R function which is special because it doesn’t affect the reference counter of its arguments. If you want to dig into this yourself, there’s a function refs() in the pryr package which tells you the current reference count to any variable.
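Besides pryr, base R offers `tracemem`, which prints a message whenever the traced object is duplicated. Here is a minimal sketch of how one might watch for the copy; it is not part of Louis's explanation, it only works if R was compiled with memory profiling (hence the `capabilities("profmem")` check), and what gets reported depends on the version of R and on how the code is run.

```r
x <- matrix(0, nrow = 1e3, ncol = 100)
can_trace <- capabilities("profmem")   # TRUE if memory profiling is available

if (can_trace) tracemem(x)   # from now on, duplications of x are reported
x[1, 1] <- 1                 # write before any call to nrow
n <- nrow(x)                 # call to a non-primitive function
x[2, 2] <- 1                 # a "tracemem[...]" message here signals a full copy
if (can_trace) untracemem(x)
```

If a line starting with `tracemem[` appears after the second assignment, the matrix has been copied in full, which is the behaviour described above.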

Thanks Louis!

Hi,

With Stephane Shao, Jie Ding and Vahid Tarokh we have just arXived a tech report entitled “Bayesian model comparison with the Hyvärinen score: computation and consistency”. Here I’ll explain the context, that is, scoring rules and Hyvärinen scores (originating in Hyvärinen’s score matching approach to inference), and then what we actually do in the paper.

Let’s start with *scoring rules*. These are loss functions for the task of predicting a variable $Y$ with a probability distribution $p$. If $p$ is used to predict $Y$ and $y$ occurs, then the score is a real value, e.g. denoted by $S(y, p)$; the smaller the score the better, and overall we want to find $p$ that minimizes $E[S(Y, p)]$, where the expectation is with respect to the distribution of $Y$. A scoring rule is *proper* if the above expectation is minimized when $p$ is precisely the distribution of $Y$. An example of *proper scoring rule* is $S(y, p) = -\log p(y)$, the *logarithmic scoring rule*.

We can interpret Bayes factors in terms of logarithmic scoring rules (as in Chapter 6 of Bernardo & Smith). Indeed, the logarithm of the Bayes factor between model $M_1$ and $M_2$ is the difference of log-evidences:

$$\log p(y_{1:T} | M_1) - \log p(y_{1:T} | M_2).$$

In this sense, the Bayes factor compares the predictive performance of models. Decomposing these marginal likelihoods into conditionals and assuming that the data are $y_{1:T} = (y_1, \ldots, y_T)$, we have for model $M$:

$$-\log p(y_{1:T} | M) = \sum_{t=1}^{T} -\log p(y_t | y_{1:t-1}, M),$$

(with a convention for $t = 1$), which can be interpreted as a measure of performance of out-of-sample predictive distributions $p(dy_t | y_{1:t-1}, M)$, summed up over time. Importantly, this interpretation of Bayes factors holds also when models are misspecified.

So what’s not to like? Prior specification affects the evidence, which is completely fine per se. What’s concerning is the extent of the impact of the prior. Seemingly innocent changes of prior distributions can have drastic effects on the evidence and thus on Bayes factors. This is the case in the simplest example of a Normal location model: $y_t \sim N(\theta, \sigma^2)$, with $\sigma^2$ fixed and prior $\theta \sim N(0, \sigma_0^2)$. Then the log-evidence behaves like $-\log \sigma_0$ when $\sigma_0 \to \infty$. This means that the log-evidence can take crazy values, and is not even well-defined in that limit. However, that limit corresponds to a flat prior which is not crazy in this model, at least in terms of parameter inference. This is a reason for people to avoid vague priors when relying on Bayes factors for model comparison.

Conversely, this is a reason to seek alternatives to the evidence as a model comparison criterion, see for instance intrinsic Bayes factors, fractional Bayes factors, or the mixture approach of Kamary et al. Our work follows Dawid & Musio (2015) who propose to change the scoring rule. Instead of the logarithmic scoring rule, they advocate the Hyvärinen scoring rule, which leads to replacing $-\log p(y_t | y_{1:t-1}, M)$ by

$$2 \Delta_{y_t} \log p(y_t | y_{1:t-1}, M) + \| \nabla_{y_t} \log p(y_t | y_{1:t-1}, M) \|^2.$$

This barbaric expression involves derivatives of the log-density of predictive distributions, instead of log-densities. It can then be checked in the Normal location model that the score is well-defined even in the limit $\sigma_0 \to \infty$. Thankfully it can also be checked that the Hyvärinen score is *proper*. Note that variants of the score have been proposed for discrete observations, but there are cases where the Hyvärinen score is inapplicable, namely when predictive densities are not smooth enough, e.g. Laplace distributions.
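To see the contrast between the two scores in the Normal location model, here is a small R sketch. It assumes observations with unit variance, a N(0, sigma0^2) prior on the location, and the form of the Hyvärinen score given by twice the Laplacian of the log predictive density plus the squared norm of its gradient; for a N(m, s^2) predictive this is -2 / s^2 + (y - m)^2 / s^4. The function name and the simulated data are purely illustrative, and this is not the code of the paper.

```r
# Prequential (one-step-ahead) log score and Hyvarinen score in the model
# y_t ~ N(theta, 1), theta ~ N(0, sigma0^2). Illustrative helper.
prequential_scores <- function(y, sigma0) {
  log_score <- 0
  h_score <- 0
  post_prec <- 1 / sigma0^2   # precision of theta before seeing any data
  post_mean <- 0
  for (i in seq_along(y)) {
    pred_var <- 1 / post_prec + 1   # variance of the predictive of y[i]
    resid <- y[i] - post_mean
    log_score <- log_score - dnorm(y[i], post_mean, sqrt(pred_var), log = TRUE)
    # For a N(m, s^2) density: the gradient of log p is -(y - m) / s^2 and
    # the Laplacian is -1 / s^2, hence H = -2 / s^2 + (y - m)^2 / s^4.
    h_score <- h_score - 2 / pred_var + resid^2 / pred_var^2
    # Conjugate update of the posterior on theta
    new_prec <- post_prec + 1
    post_mean <- (post_prec * post_mean + y[i]) / new_prec
    post_prec <- new_prec
  }
  c(log_score = log_score, h_score = h_score)
}

set.seed(1)
y <- rnorm(50, mean = 1, sd = 1)
scores <- sapply(c(1, 1e3, 1e6), function(s0) prequential_scores(y, s0))
colnames(scores) <- c("sigma0=1", "sigma0=1e3", "sigma0=1e6")
print(scores)
```

The negative log-evidence keeps growing by roughly the logarithm of the ratio of prior standard deviations as the prior gets vaguer, while the Hyvärinen score stabilizes.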

In the paper, we show how sequential Monte Carlo samplers can approximate this scoring rule, for a wide range of models including nonlinear state space models. We also show the consistency of this scoring rule for model selection, as the number of observations goes to infinity; our proof relies on strong regularity assumptions, but the numerical experiments indicate that the results hold under weaker conditions. Finally we investigate an example of population growth model applied to kangaroos, and a Lévy-driven stochastic volatility model which we use to illustrate the consistency result. Both of these cases feature intractable likelihoods approximated by particle filters within an SMC^2 algorithm.

The code producing the figures of the paper is available on Github: https://github.com/pierrejacob/bayeshscore


**1° Using arXiv**

I always had the feeling that the default presentation for author’s arXived list was a bit crude and unfit for identifying researchers. Actually there is a simple way to improve it, by creating an arXiv public author identifier. The action required to create your own public author identifier is described here. See below for a before/after comparison of the presentation. From there, it is possible to dynamically include the list of your publications in your own home page using the following JavaScript widget.

**2° Using the French portal HAL**

Apparently only in French. The widget is called Haltools, and is developed by Inria. It only requires entering a researcher name. There are formatting options such as ranking by year or publication type, or showing abstracts, pictures, etc. See e.g. my page as displayed below.

**3° Using bibtex to html converter**

Apparently, there exist at least two such converters with the same name: bibtex2html by Jean-Christophe Filliâtre and bibtex2html by Grégoire Malandain. Both from Inria. I personally use the first. I use some bash code (below or link) to first run the *bib2bib* command on mybiblio.bib file, and second run the *bibtex2html* command on the file created. A nifty option is called *named-field*: it creates links to eg journal webpage, arXiv, blog posts, DOI, etc:

    bib2bib -oc intermediatefile -ob webpage.bib mybiblio.bib

    bibtex2html -nobibsource -citefile intermediatefile --sort-by-date --reverse-sort --revkeys --style --named-field springer "Springer" --named-field blog "blog" --named-field pdf "pdf" --named-field book "book" --named-field journal-link "journal" --named-field hal "HAL" webpage.bib

The output is made of webpage.bib and webpage.html, that I manually copy-paste to my webpage.


The International Society for Bayesian Analysis (ISBA) is running elections until November 15. This year, two contributors on this blog, Nicolas Chopin and myself, are running for an ISBA Section office. The sections of the society, nine in number as of today, gather researchers with common research interests: Computation, Objective Bayes, Nonparametrics, etc.

Here are our candidate statements:

**Nicolas Chopin**

**Position Title/Affiliation**

Prof. of Statistics at the ENSAE, Paris

**Position Being Sought 2018**

Bayesian Computation Chair-Elect

**Candidate Statement 2018**

MCMC, SMC, Variational Bayes, Expectation propagation, ABC approaches… There are so many ways to compute Bayesian quantities these days, and each way seems to have its merits and use cases.

If elected, I would like to put particular attention on making the section as inclusive as possible: that is, to attract all scientists interested in some form of Bayesian computation, whether deterministic or Monte Carlo based, whether generic or specialised to a particular problem.

To know more about me and my research: https://sites.google.com/site/nicolaschopinstatistician/

**Julyan Arbel**

**Position Title/Affiliation**

Researcher/Inria Grenoble Rhône-Alpes

**Position Being Sought 2018**

Objective Bayes Treasurer

**Candidate Statement 2018**

I am a researcher at Inria, Grenoble, capital of the French Alps. Earlier this year, I had the chance to spend three months at the University of Texas at Austin, a hotspot of Objective Bayes! I completed my PhD at Paris-Dauphine with Judith Rousseau and Ghislaine Gayraud, and did a three-year postdoc at the wonderful Collegio Carlo Alberto in Turin, and at Bocconi University in Milan. My research interests cover statistical inference and theoretical understanding of Bayesian stochastic models in a variety of applications ranging from Ecology to Astrophysics.

Objective Bayes was instrumental during my undergraduate studies (some ten years ago) to make me eager for a PhD when I visited José-Miguel Bernardo in Valencia for a summer internship. A vibrant place for learning Objective Bayes—and reference inference—and a perfect time in my life to start and understand what the life of a researcher is all about. A couple of years later, OBayes 2009, Philadelphia, was the very first conference I attended. Now I look forward to repaying the Section for what I have benefited from it since then.

Please visit my webpage if you want to know more about me and my research: http://www.julyanarbel.com/

Hi,

This post is about computational issues with the cut distribution for Bayesian inference in misspecified models. Some motivation was given in a previous post about a recent paper on modular Bayesian inference. The cut distribution, or variants of it, might play an important role in combining statistical models, especially in settings where one wants to propagate uncertainty while preventing misspecification from damaging estimation. The cut distribution can also be seen as a probabilistic analog of two-step point estimators. So the cut distribution is more than just a trick! And it raises interesting computational issues which I’ll describe here along with a solution via unbiased MCMC.

What is the cut distribution? Suppose that you estimate $\theta_1$ with the distribution $\pi_1(\theta_1)$. This might be the posterior distribution under some model. Then consider a second parameter $\theta_2$, which you would want to infer via a distribution $\pi_2(\theta_2 | \theta_1)$; this might be a posterior distribution in a second model. There are many such situations, e.g. $\theta_1$ might represent missing data or covariates plugged into the second model. You can then consider the joint probability distribution:

$$\pi(\theta_1, \theta_2) = \pi_1(\theta_1)\, \pi_2(\theta_2 | \theta_1).$$

This is different from the posterior distribution in a joint model, where both parameters would be estimated simultaneously. Here the marginal of $\theta_1$ is insensitive to whatever craziness might be encoded in the second model. And this is the point: under the cut, the specification of the second model does not impact the first parameters.

It is difficult to design MCMC algorithms for $\pi$, because its density cannot be evaluated point-wise. This is very well explained in Martyn Plummer’s paper. As an example, if the second posterior density takes the form

$$\pi_2(\theta_2 | \theta_1) = \frac{p_2(\theta_2 | \theta_1)\, p_2(y_2 | \theta_1, \theta_2)}{p_2(y_2 | \theta_1)},$$

then one can often evaluate the prior and the likelihood appearing in the numerator. However, the denominator is a function of $\theta_1$ which might not have an analytical form, and thus the density function of the cut distribution cannot be evaluated point-wise. Martyn Plummer proposes a solution which can be convenient but introduces an extra bias. The discussion mentions the similarity with doubly intractable problems, but as far as I can see this does not lead to practical algorithms here.

A naive MCMC solution goes as follows. First approximate $\pi_1$ with an MCMC sample. Then, for each of these samples, say $\theta_1^{(n)}$, generate an MCMC sample approximating $\pi_2(\cdot | \theta_1^{(n)})$. This leads to a lot of MCMC runs to do. Each run goes for a number of iterations which needs to be chosen. The resulting approximation will be valid as *all the numbers of iterations* go to infinity: this is cumbersome.

If one could sample i.i.d. from $\pi_1$ and $\pi_2(\cdot | \theta_1)$, then a much simpler solution is available: sample $\theta_1$ from $\pi_1$ and then $\theta_2$ from $\pi_2(\cdot | \theta_1)$. The resulting pair follows the cut distribution. Then one could sample such pairs independently many times to approximate the cut. Perfect sampling is an active research area, but unfortunately many distributions are still such that perfect samplers are not available or prohibitively costly.
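In a toy setting where both stages can be sampled exactly, the two-step recipe takes a few lines of R. The sketch below uses made-up stand-ins, a N(0, 1) first-stage distribution and a N(theta1, 1) second-stage conditional, together with a test function returning the square of the second parameter, whose expectation is 2 in this example. None of this comes from an actual cut model; it only illustrates the mechanics.

```r
# Toy stand-ins (assumptions): first stage N(0, 1), second stage N(theta1, 1)
sample_stage1 <- function(n) rnorm(n, mean = 0, sd = 1)
sample_stage2 <- function(theta1) rnorm(length(theta1), mean = theta1, sd = 1)

set.seed(1)
nsamples <- 1e5
theta1 <- sample_stage1(nsamples)   # draws from the first-stage distribution
theta2 <- sample_stage2(theta1)     # one second-stage draw per first-stage draw

# Monte Carlo estimate of the expectation of a test function under the cut
h <- function(theta1, theta2) theta2^2
estimate <- mean(h(theta1, theta2))
print(estimate)
```

Each pair is an exact draw from the joint distribution, so the average is a standard Monte Carlo estimator with independent terms.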

This is closely connected to a solution proposed in our unbiased MCMC paper. Instead of providing perfect samples, we aim at the more humble goal of estimating integrals of arbitrary functions, say $h$, with respect to the cut distribution. By the tower property of expectation, we have

$$\int h(\theta_1, \theta_2)\, \pi(d\theta_1, d\theta_2) = \int \bar{h}(\theta_1)\, \pi_1(d\theta_1),$$

where

$$\bar{h}(\theta_1) = \int h(\theta_1, \theta_2)\, \pi_2(d\theta_2 | \theta_1).$$

Using the proposed machinery, if MCMC algorithms are available at both stages, we can estimate $\bar{h}(\theta_1)$ without bias for any $\theta_1$, and use these to estimate without bias $\int \bar{h}(\theta_1)\, \pi_1(d\theta_1)$. The lack of bias makes the resulting procedure consistent in the number of independent replicates of the proposed estimator, which can be computed completely in parallel.

With Jeremy Heng we have recently arXived a paper describing how to remove the burn-in bias of Hamiltonian Monte Carlo (HMC). This follows a recent work on unbiased MCMC estimators in general on which I blogged here. The case of HMC requires a specific yet very simple coupling. A direct consequence of this work is that Hamiltonian Monte Carlo can be massively parallelized: instead of running one chain for many iterations, one can run short coupled chains independently in parallel. The proposed estimators are consistent in the limit of the number of parallel replicates. This is appealing as the number of available processors increases much faster than clock speed, over recent years and for the years to come, for a number of reasons explained e.g. here.

As described in the previous blog post, the proposed construction involves a coupling of Markov chains. So consider two chains $(X_t)$ and $(Y_t)$. The coupling must be such that each chain is a standard HMC chain (or any variant thereof), but jointly the chains meet exactly, i.e. become identical, after a random number of iterations called the meeting time. That is, the variable $\tau = \inf\{t \geq 0 : X_t = Y_t\}$ is finite almost surely. Here for simplicity, I neglect the time shift mentioned in the earlier post and in the papers.

A simple HMC kernel works as follows. Given a current state $X_t$, an initial velocity is drawn from a multivariate Normal distribution. Then, the equations of Hamiltonian dynamics, corresponding to the movement of a particle with a potential energy given by minus the log target density, are numerically solved with a leap-frog integrator, with a step size and a number of steps. The final position is then accepted or not as the next state $X_{t+1}$, according to a Metropolis-Hastings (MH) ratio which corrects for the error introduced by the leap-frog integrator. The understanding of HMC has improved considerably in recent months, notably with the contributions of Mangoubi & Smith 2017 and Durmus, Moulines & Saksman 2017.

It turns out that using common random numbers for the two HMC chains, that is, using common initial velocities and common uniform variables for the acceptance steps, leads to the chains contracting very quickly. This is illustrated in the above animation, which shows the first iterations of 250-dimensional chains targeting a multivariate Normal distribution; the plot shows the evolution of the first two components. The distance between the chains goes rapidly to zero. Assumptions on the target distribution are necessary for such contraction to occur, such as strong convexity as in Section 2.4 of Mangoubi & Smith 2017. In our work, we only need such assumptions to be satisfied on subsets of the space that the chains visit regularly (see the paper for more precision). For instance, we might assume that the target density is strictly positive everywhere and that there are compact sets on which strong log-concavity of the target density holds.
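The contraction is easy to reproduce on a toy example. The R sketch below runs two HMC chains on a standard multivariate Normal target in dimension 250, feeding both chains the same initial velocity and the same uniform at each iteration, and records the distance between them. The starting points, the step size of 0.1 and the 10 leap-frog steps are arbitrary choices for this illustration, not the settings of the paper.

```r
# Toy target: standard multivariate Normal, log density up to a constant
logtarget <- function(x) -0.5 * sum(x^2)
gradlogtarget <- function(x) -x

# Leap-frog integration of the Hamiltonian dynamics started at (x, v)
leapfrog <- function(x, v, stepsize, nsteps) {
  v <- v + 0.5 * stepsize * gradlogtarget(x)
  for (i in seq_len(nsteps)) {
    x <- x + stepsize * v
    if (i < nsteps) v <- v + stepsize * gradlogtarget(x)
  }
  v <- v + 0.5 * stepsize * gradlogtarget(x)
  list(x = x, v = v)
}

# One HMC step, given the velocity v and the uniform u of the acceptance step
hmc_step <- function(x, v, u, stepsize, nsteps) {
  prop <- leapfrog(x, v, stepsize, nsteps)
  logratio <- (logtarget(prop$x) - 0.5 * sum(prop$v^2)) -
    (logtarget(x) - 0.5 * sum(v^2))
  if (log(u) < logratio) prop$x else x
}

set.seed(1)
d <- 250
x <- rnorm(d, mean = 5)    # the two chains start far apart
y <- rnorm(d, mean = -5)
niter <- 100
distances <- numeric(niter)
for (iter in 1:niter) {
  v <- rnorm(d)   # common initial velocity
  u <- runif(1)   # common uniform for the acceptance steps
  x <- hmc_step(x, v, u, stepsize = 0.1, nsteps = 10)
  y <- hmc_step(y, v, u, stepsize = 0.1, nsteps = 10)
  distances[iter] <- sqrt(sum((x - y)^2))
}
print(distances[c(1, 10, 50, 100)])
```

Marginally each chain is a plain HMC chain; only the joint behaviour is affected by sharing the random numbers.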

In some examples, coupled HMC chains contract so fast that the distance between them goes under machine precision after a reasonably small number of iterations. We can then consider that the chains have met. Yet this is not very clean… so instead, we mix HMC steps with coupled MH steps that use maximally coupled random walk proposals. Indeed, if two chains are already close to one another thanks to the coupled HMC steps, then the coupled MH steps trigger exact meetings with large probability.
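For completeness, here is a sketch of the standard rejection-based algorithm for sampling from a maximal coupling, written for two univariate Normal proposals with the same standard deviation. The one-dimensional setting and the function name are simplifications of mine for illustration; the paper works with multivariate random walk proposals.

```r
# Sample (X, Y) with X ~ N(mu1, sigma^2) and Y ~ N(mu2, sigma^2),
# such that the probability of the event X = Y is as large as possible.
max_coupling_normal <- function(mu1, mu2, sigma) {
  x <- rnorm(1, mu1, sigma)
  logp_x <- dnorm(x, mu1, sigma, log = TRUE)
  logq_x <- dnorm(x, mu2, sigma, log = TRUE)
  if (logp_x + log(runif(1)) <= logq_x) {
    return(c(x = x, y = x))   # the two draws are identical
  }
  repeat {
    y <- rnorm(1, mu2, sigma)
    logq_y <- dnorm(y, mu2, sigma, log = TRUE)
    logp_y <- dnorm(y, mu1, sigma, log = TRUE)
    if (logq_y + log(runif(1)) > logp_y) {
      return(c(x = x, y = y))
    }
  }
}

set.seed(1)
draws <- replicate(1e4, max_coupling_normal(0, 0.1, 1))
meet_freq <- mean(draws["x", ] == draws["y", ])
print(meet_freq)
```

When the two proposal means are close, as happens once the coupled HMC steps have brought the chains together, the two draws coincide most of the time: the meeting probability is one minus the total variation distance between the proposals, about 0.96 in this example.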

In the experiments, we try a multivariate Normal target in dimension 250, a bivariate Normal truncated by quadratic inequalities, and a logistic regression with 300 covariates. These examples show that tuning parameters, such as step size and number of leap-frog steps, that would be optimal for a single HMC chain might not be optimal for the proposed unbiased estimators. This leads to a loss of asymptotic efficiency, which is hardly surprising: this is a classic case of bias removal increasing the variance. Yet from the experiments, the loss of efficiency sounds very reasonable in exchange for immediate benefits on parallel processors. In the logistic regression example, the method essentially amounts to running HMC chains of length 1,000 in parallel.

Another way of parallelizing HMC, or any MCMC method, is to plug it in the rejuvenation step of an SMC sampler (see here or here). As has been known for fifteen years, this has benefits such as normalizing constant estimation and some robustness to multimodality. Both approaches provide estimators that are consistent in the limit of a number of operations that can be massively parallelized: particles in SMC samplers, and independent replicates of unbiased MCMC. One advantage of the latter might lie in the simple construction of confidence intervals from the standard Central Limit Theorem, but these can be built from SMC samplers too.

Nine R user communities already exist in France and there is a much larger number of R communities around the world. It was time for Grenoble to start its own!

The goal of the R user group is to facilitate the identification of local useRs, to initiate contacts, and to organise experience and knowledge sharing sessions. The group is open to any local useR interested in learning and sharing knowledge about R.

The group’s website features a map and table with members of the R group. Members with specific skills related to the use of R are referenced in a table and can be contacted by other members. A gitter allows members to discuss R issues and a calendar presents the upcoming events.

**Working sessions**
Monthly working sessions of two hours start with presentations or tutorials. The second part is dedicated to helping each other, meeting new people and sharing beers and soft drinks offered by the Grenoble Data Institute. The current program looks nice: