We are organising a two-day Bayesian workshop in Grenoble on September 6-7, 2018. It will be the second edition of the Italian-French statistics seminar (link to first edition), titled this year: **Bayesian learning theory for complex data modeling**. The workshop will give young statisticians the opportunity to learn from and interact with highly qualified senior researchers in probability and in theoretical and applied statistics, with a particular focus on Bayesian methods.

Anyone interested in this field is welcome. There will be two junior sessions and a poster session, with a call for abstracts open until June 30. A particular focus will be given to researchers at an early stage of their career, or currently studying for a PhD, MSc or BSc. The junior sessions are supported by ISBA through travel awards.

There will be a social dinner on September 6, and a hike organised in the mountains on September 8.

**Confirmed invited speakers**

• Simon Barthelmé, Gipsa-lab, Grenoble, France

• Arnoldo Frigessi, University of Oslo, Norway

• Benjamin Guedj, Inria Lille – Nord Europe, France

• Alessandra Guglielmi, Politecnico di Milano, Italy

• Antonio Lijoi, University Bocconi, Milan, Italy

• Bernardo Nipoti, Trinity College Dublin, Ireland

• Sonia Petrone, University Bocconi, Milan, Italy

**Important Dates:**

• June 30, 2018: Abstract submission closes

• July 20, 2018: Notification on abstract acceptance

• August 25, 2018: Registration closes

More details and how to register: https://sites.google.com/view/bigworkshop

We look forward to seeing you in Grenoble.

Best,

Julyan

Hi all,

In this post, I’ll go through numerical experiments illustrating the scaling of some MCMC algorithms with respect to the dimension. I will focus on a simple setting, to illustrate some theoretical results developed by Gareth Roberts, Andrew Stuart, Alexandros Beskos, Jeff Rosenthal and many of their co-authors over many years, for instance here for random walk Metropolis–Hastings (RWMH, and see here more recently), here for the Metropolis-adjusted Langevin algorithm (MALA), and here for Hamiltonian Monte Carlo (HMC).

The target distribution $\pi$ here will be a standard multivariate Normal distribution in dimension $d$, with zero mean and identity covariance matrix: $\pi$ is $\mathcal{N}(0, I_d)$. All chains will be started at stationarity: the first state of the chain is sampled from the target distribution, so there’s no burn-in or warm-up phase.

An MCMC algorithm generates a Markov chain, so that after $T$ iterations we have samples $X_1, \dots, X_T$, which can be used to estimate integrals with respect to $\pi$. We will focus on the task of estimating the second moment of the first marginal of $\pi$. So we look at the average $T^{-1} \sum_{t=1}^{T} (X_t^{(1)})^2$, which converges to 1 as $T \to \infty$. Here $X_t^{(1)}$ represents the first component of the $t$-th iterate of the Markov chain, which lives in a $d$-dimensional space. Focusing on a fixed-dimensional marginal of the target distribution enables comparisons across dimensions.

After $T$ iterations, the MCMC average won’t be exactly equal to one. Its expectation is exactly one, because the chain starts at stationarity, but there is some variability due to the random nature of these algorithms. Below I’ll plot mean squared errors, defined as the expected squared difference between the estimator and the estimand 1, i.e. $\mathbb{E}[(T^{-1} \sum_{t=1}^{T} (X_t^{(1)})^2 - 1)^2]$. This expectation cannot be computed analytically, but we can generate independent copies of the MCMC average and approximate the expectation with a sample average over the copies (below I’ll generate 500 copies).

This setup can be used to see how algorithms behave in increasing dimensions. A complication is that most MCMC algorithms have tuning parameters. How do we tune the algorithms in such a way that they achieve stable performance as the dimension increases? If we are not careful with the choice of tuning parameters, the algorithms would eventually stop working as the dimension increases. This is where the diffusion limit results are helpful: they provide choices of tuning parameters that make the algorithms work in all dimensions.

Consider first a Metropolis–Hastings algorithm with a Gaussian random walk proposal. At each iteration, given the current state $X_t$, the next state $X_{t+1}$ is generated as follows:

- draw $X^\star = X_t + \sigma Z$, where $Z \sim \mathcal{N}(0, I_d)$,
- draw $U \sim \mathcal{U}(0, 1)$,
- if $U < \pi(X^\star) / \pi(X_t)$, set $X_{t+1} = X^\star$ (this is an acceptance), otherwise set $X_{t+1} = X_t$ (this is a rejection).

This paper shows that if we choose $\sigma = \ell / d^{1/2}$, where $\ell$ is a constant and $d$ is the dimension, then the acceptance rate of the algorithm converges to a constant which is neither 0 nor 1 as the dimension increases. I’ll use a fixed value of $\ell$ below, which is not optimal. With this scaling of $\sigma$, we take a number of iterations $T$ that grows linearly with $d$. Then the mean squared error should be stable in the dimension. Does that work? The plot below shows the acceptance rate and the mean squared error being stable with the dimension.
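To make this concrete, here is a minimal Python sketch of RWMH on the standard Gaussian target with the proposal standard deviation scaled as $\ell / d^{1/2}$. This is my own illustration, not the code behind the plots; the function name and the default choices are mine.

```python
import numpy as np

def rwmh_second_moment(d, l=1.0, n_iters=None, rng=None):
    """Random walk Metropolis-Hastings on N(0, I_d), started at stationarity.

    The proposal standard deviation is l / sqrt(d), following the scaling
    results; returns the MCMC estimate of E[X_1^2] and the acceptance rate.
    """
    rng = np.random.default_rng(rng)
    if n_iters is None:
        n_iters = 100 * d  # number of iterations growing linearly with d
    x = rng.standard_normal(d)       # first state drawn from the target
    logpi = -0.5 * np.sum(x ** 2)    # log-density of N(0, I_d), up to a constant
    sigma = l / np.sqrt(d)
    sum_sq, accepts = 0.0, 0
    for _ in range(n_iters):
        prop = x + sigma * rng.standard_normal(d)
        logpi_prop = -0.5 * np.sum(prop ** 2)
        if np.log(rng.uniform()) < logpi_prop - logpi:  # MH accept/reject
            x, logpi = prop, logpi_prop
            accepts += 1
        sum_sq += x[0] ** 2
    return sum_sq / n_iters, accepts / n_iters
```

Running this for increasing $d$ with the same $\ell$ should display a roughly constant acceptance rate, as predicted by the diffusion limit.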

So overall, each iteration has a cost of order $d$ (evaluating $\pi$ costs of the order of $d$ operations), so the total complexity of the algorithm can be said to be of the order of $d \times d = d^2$.

Now what about MALA? According to this paper, if we take the stepsize to be $\sigma = \ell / d^{1/6}$, then the acceptance rate will be stable; then if we take a number of steps of the order of $d^{1/3}$, the mean squared error should be stable. Below I’ve used a fixed value of $\ell$ and a number of iterations proportional to $d^{1/3}$. The overall complexity is thus of order $d \times d^{1/3} = d^{4/3}$.
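For reference, a hedged Python sketch of one MALA transition on the same Gaussian target (my own illustration; here the gradient of the log-target is simply $-x$, and the stepsize argument would be scaled as $d^{-1/6}$):

```python
import numpy as np

def mala_step(x, step, rng):
    """One MALA transition targeting N(0, I_d), for which grad log pi(x) = -x.

    'step' is the stepsize, to be scaled as d**(-1/6) according to the
    diffusion-limit results (the constant in front is not optimized here).
    """
    h = step ** 2
    mean_fwd = x - 0.5 * h * x                 # Langevin drift at x
    prop = mean_fwd + step * rng.standard_normal(x.shape[0])
    mean_bwd = prop - 0.5 * h * prop           # Langevin drift at the proposal
    # log of pi(prop) q(x | prop) / (pi(x) q(prop | x))
    log_ratio = (0.5 * np.sum(x ** 2) - 0.5 * np.sum(prop ** 2)
                 - np.sum((x - mean_bwd) ** 2) / (2 * h)
                 + np.sum((prop - mean_fwd) ** 2) / (2 * h))
    if np.log(rng.uniform()) < log_ratio:
        return prop, True
    return x, False
```

The proposal-density correction in the acceptance ratio is what distinguishes MALA from the plain random walk.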

For HMC, following this paper, we can fix the integration time of each solution of Hamilton’s equations to be constant. If we pick a stepsize in the leapfrog integrator of the order of $d^{-1/4}$, incurring of the order of $d^{1/4}$ leapfrog steps to reach a constant integration time, then the acceptance rate will be stable. Since the integration time is fixed, we can use a fixed number of MCMC steps in all dimensions. The overall complexity is of order $d^{5/4}$: each iteration involves $d^{1/4}$ leapfrog steps, which each cost $d$ operations. Does it work? Below I took the stepsize proportional to $d^{-1/4}$, and the number of leapfrog steps proportional to $d^{1/4}$.
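A minimal Python sketch of one HMC transition with the leapfrog integrator, again my own illustration on the Gaussian target (potential $U(x) = |x|^2/2$, so $\nabla U(x) = x$):

```python
import numpy as np

def hmc_step(x, step, n_leapfrog, rng):
    """One HMC transition targeting N(0, I_d), with potential U(x) = |x|^2 / 2.

    The integration time step * n_leapfrog is held constant across dimensions;
    per the scaling results, step ~ d**(-1/4) and n_leapfrog ~ d**(1/4).
    """
    v = rng.standard_normal(x.shape[0])  # fresh momentum
    x_new = x.copy()
    v_new = v - 0.5 * step * x_new       # initial half step (grad U(x) = x)
    for i in range(n_leapfrog):
        x_new = x_new + step * v_new     # full position step
        if i < n_leapfrog - 1:
            v_new = v_new - step * x_new # full momentum step
    v_new = v_new - 0.5 * step * x_new   # final half step
    # accept/reject on the change in total energy H(x, v) = U(x) + |v|^2 / 2
    log_ratio = (0.5 * np.sum(x ** 2) + 0.5 * np.sum(v ** 2)
                 - 0.5 * np.sum(x_new ** 2) - 0.5 * np.sum(v_new ** 2))
    if np.log(rng.uniform()) < log_ratio:
        return x_new, True
    return x, False
```

Because the leapfrog integrator nearly conserves the Hamiltonian for small stepsizes, the acceptance rate stays high even in moderately large dimensions.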

The plot for the MSE of HMC is a bit rugged because the number of leapfrog steps varies discontinuously (it needs to increase with the dimension but it also needs to be an integer). Note that the plots of MSE above are not comparable across samplers because they’re not adjusted for computing time, and because the constants are not optimally tuned. The aim is simply to illustrate that the scalings obtained in the literature can be indeed checked with numerical experiments.

So on this i.i.d. Gaussian target, HMC scales better than MALA, which scales better than RWMH. The literature contains various generalizations to non-Gaussian and non-product distributions, i.e. distributions that do not factorize as a product of marginals, see e.g. here. Qualitatively, all these algorithms deteriorate with the dimension, in a polynomial fashion, and the difference is in the exponent.

What about Gibbs samplers? It’s hard to talk about the Gibbs sampler in very generic terms, as it is mostly used in specific forms tailored to the application of interest. And it can indeed be very useful. If we were to perform RWMH steps to update each component of the chain given the others, in a systematic scan fashion, then we could use a fixed step size for each component. Thus we would not need to scale the number of iterations (i.e. full sweeps) with the dimension. However, each iteration would involve a sweep over all $d$ components, and furthermore, each target evaluation would, in general, involve of the order of $d$ operations (if we don’t know how to evaluate the conditional target density, we have to evaluate the joint target density). This would result in a cost of the order of $d^2$ operations, which is the same as RWMH above. However, if we knew how to compute the target conditional densities in the order of one operation, then the scaling of this Gibbs sampler would be linear in $d$, which beats all the samplers mentioned above. Specific, case-by-case information about conditional distributions thus seems very important to a successful implementation of Gibbs samplers. The rewards are high in high-dimensional settings. Beyond the toy example considered here, see *Scalable MCMC for Bayes Shrinkage Priors*, by Johndrow, Orenstein and Bhattacharya, for a recent application to a high-dimensional regression model.
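To illustrate the best case where conditionals cost $O(1)$, here is a Python sketch (my own hypothetical helper) of a systematic-scan single-site RWMH sweep on the product Gaussian target; note the step size does not depend on the dimension:

```python
import numpy as np

def gibbs_sweep_rwmh(x, step, rng):
    """One systematic-scan sweep of single-site RWMH on N(0, I_d).

    Because the target factorizes, the conditional log-density of
    component i depends on x[i] only, so each accept/reject costs O(1)
    and the step size need not shrink with the dimension.
    """
    for i in range(x.shape[0]):
        prop = x[i] + step * rng.standard_normal()
        log_ratio = 0.5 * (x[i] ** 2 - prop ** 2)  # conditional ratio, O(1)
        if np.log(rng.uniform()) < log_ratio:
            x[i] = prop
    return x
```

A full sweep thus costs of the order of $d$ operations, and since the step size is fixed, the number of sweeps does not need to grow with $d$ on this product target.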

Possible future blog posts might cover scaling of methods based on importance sampling, scaling of software package implementations of MCMC algorithms, and scalings of couplings of MCMC algorithms.

I finally took the time to read about axiomatic foundations of Bayesian statistics. I like axioms, I like Bayesian stats, so this was definitely going to be a pleasant opportunity to read some books, comfortably seated on the sofa I just added to my office. Moreover, my team in Lille includes supporters of Dempster-Shafer belief functions, another framework for uncertainty modelling and decision-making, so being precise on my own axioms was the best way to discuss more constructively with my colleagues.

Long story short, I took the red pill: there is a significant gap between axiomatic constructions of the Bayesian paradigm and current Bayesian practice. None of this is new, but I had never been told. It’s not keeping me awake at night, but it’s bothering my office mates, who cannot stop me blabbering about it over coffee. The good side is that smart people have thought about this in the past. Also, reading about this helped me understand some of the philosophical nuances between the thought processes of different Bayesians, say de Finettians vs. Jeffreysians. I will not attempt a survey in a blog post, nor do I feel knowledgeable enough for this, but I thought I could spare my office mates today and annoy Statisfaction’s readers for once.

Take Savage’s axioms, for instance. I’ve always heard that they were the current justification behind the saying “being Bayesian is being a coherent decision-maker”. To be precise, let $\Omega$ be the set of states of the world, that is, everything useful to make your decision. To fix ideas, in a statistical experiment, your decision might be a “credible” interval on some real parameter, so $\Omega$ should at least be the product of $\mathbb{R}$ times whatever space your data live in. Now an action is defined to be a map from $\Omega$ to some set of outcomes $\mathcal{Z}$. For the interval problem, an action corresponds to the choice of a particular interval $I$, and the outcomes should contain whatever you need to assess the performance of your action, say, the indicator of the parameter actually belonging to your interval $I$, and the length of $I$. Outcomes are judged by utility, that is, we consider functions $u$ that map outcomes to nonnegative rewards. In our example, this could be a weighted sum of the indicator and the interval length. The weights translate your preference for an interval that actually captures the value of the parameter of interest over a short interval. Now, the axioms give the equivalence between the two following bullets:

- (being Bayesian) There is a unique pair $(u, P)$, made of a utility function $u$ and a finitely additive probability measure $P$ defined on all subsets of the set of states of the world, such that you choose your actions by maximizing the expected utility criterion $a \mapsto \int_\Omega u(a(\omega)) \, P(\mathrm{d}\omega)$.

- (being coherent) You rank actions according to a preference relation that satisfies a few abstract properties that make intuitive sense for most applications, such as transitivity: if you prefer $a$ to $b$ and $b$ to $c$, then you prefer $a$ to $c$. Add to this a few structural axioms that impose constraints on the preference relation.

Furthermore, there is a natural notion of conditional preference among actions that follows from Savage’s axioms. Taken together, these axioms give an operational definition of our “beliefs” that seems to match Bayesian practice. In particular, 1) our beliefs take the form of a probability measure –which depends on our utility–, 2) we should update these beliefs by conditioning probabilities, and 3) make decisions using expected utility with respect to our belief. This is undeniably beautiful. Not only does Savage avoid shaky arguments or interpretations by using your propensity to act to define your beliefs, but he also avoids using “extraneous probabilities”. By the latter I mean any axiom that artificially brings mathematical probability structures into the picture, such as “there exists an ideal Bernoulli coin”.

But the devil is in the details. For instance, some of the less intuitive of Savage’s axioms require the set of states of the world to be uncountable and the utility bounded. Also, the measure $P$ is actually only required to be finitely additive, and it has to be defined on all subsets of the set of states of the world. Now-traditional notions like Lebesgue integration, $\sigma$-additivity, or $\sigma$-algebras do not appear. In particular, if you want to put a prior on the mean of a Gaussian that lives in $\mathbb{R}$, Savage says your prior should weight all subsets of the real line, so forget about using any probability measure that has a density with respect to the Lebesgue measure! Or, to paraphrase de Finetti, $\sigma$-additive probability does not exist. Man, before reading about axioms I thought “Haha, let’s see whether someone has actually worked out the technical details to justify Bayesian nonparametrics with expected utility, this must be technically tricky”; now I don’t even know how to fit the mean of a Gaussian anymore. Thank you, Morpheus-Savage.

There are axiomatic ways around these shortcomings. From what I’ve read, they all either include extraneous probabilities or rather artificial mathematical constructions. Extraneous probabilities lead to philosophically beautiful axioms and interpretations, see e.g. Chapter 2 of Bernardo and Smith (2000), and they can get you finite and countably infinite sets of states of the world, for instance, whereas Savage’s axioms cannot. Stronger versions also give you $\sigma$-additivity, see below. Loosely speaking, I understand extraneous probabilities as measuring uncertainty with respect to an ideal coin, similarly to measuring heat in degrees Celsius by comparing a physical system to freezing or boiling water. However, I find extraneous probability axioms harder to swallow than (most of) Savage’s axioms, and they involve accepting a more general notion of probability than personal propensity to act.

If you want to bypass extraneous probabilities and still recover $\sigma$-additivity, you could follow Villegas (1964), and try to complete the state space so that well-behaved measures extend uniquely to $\sigma$-additive measures on a $\sigma$-algebra on this bigger set of states. Defining the extended state space involves sophisticated functional analysis, and requires adding potentially hard-to-interpret states of the world, thus losing some of the interpretability of Savage’s construction. Authors of reference books seem reluctant to go in that direction: De Groot (1970), for instance, solves the issue by using a strong extraneous probability axiom that allows working in the original set of states with $\sigma$-additive beliefs. Bernardo & Smith use extraneous probabilities, but keep their measures finitely additive until the end of Chapter 2. Then they admit departing from the axioms for practical purposes, and define “generalized beliefs” in Chapter 3, defined on a $\sigma$-algebra over the original set of states. Others seem to more readily accept the gap between axioms and practice, and look for a more pragmatic justification of the combined use of expected utility and countably additive probabilities. For instance, Robert (2007) introduces posterior expected utility, and then argues that it has desirable properties among decision-making frameworks, such as respecting the likelihood principle. This is unlike Savage’s approach, for whom the (or rather, a finitely additive version of the) likelihood principle is a consequence of the axioms. I think this is an interesting subtlety.

To conclude, I just wanted to share my excitement for having read some fascinating works on decision-theoretic axioms for Bayesian statistics. There still is some unresolved tension between having both an applicable and axiomatized Bayesian theory of belief. I would love this post to generate discussions, and help me understand the different thought processes behind each Bayesian being Bayesian (and each non-Bayesian being non-Bayesian). For instance, I had not realised how conceptually different the points of view in the reference books of Robert and Bernardo & Smith were. This definitely helped me understand (Xi’an) Robert’s short three answers to this post.

If this has raised your interest, I will mention here a few complementary sources that I have found useful, ping me if you want more. Chapters 2 and 3 of Bernardo and Smith (2000) contain a detailed description of their set of axioms with extraneous probability, and they give great lists of pointers on thorny issues at the end of each chapter. A lighter read is Parmigiani and Inoue (2009), which I think is a great starting point, with emphasis on the main ideas of de Finetti, Ramsey, Savage, and Anscombe and Aumann, how they apply, and how they relate to each other, rather than the technical details. Technical details and exhaustive reviews of sets of axioms for subjective probability can be found in their references to Fishburn’s work, which I have found to be beautifully clear, rigorous and complete, although like many papers involving low-level set manipulations, the proofs sometimes feel like they are written for robots. But after all, a normative theory of rationality is maybe only meant for robots.


This is an advertisement for a conference on AI organised at Inria Grenoble by the Thoth team and Naver Labs: https://project.inria.fr/paiss/. This AI summer school comprises lectures and practical sessions conducted by renowned experts in different areas of artificial intelligence.

This event is the revival of a past series of very successful summer schools which took place in Grenoble and Paris. The latest edition of this series was held in 2013. While originally focusing on computer vision, the summer school now targets a broader AI audience, and will also include presentations about machine learning, natural language processing, robotics, and cognitive science.

Note that NAVER LABS is funding a number of students to attend PAISS. Apply before 4th April.

The 2018 edition of the AI summer school will feature lecturers including:

• Lourdes Agapito is a Professor of 3D Vision in the Department of Computer Science at University College London (UCL).

• Kyunghyun Cho is an assistant professor of computer science and data science at New York University.

• Emmanuel Dupoux is Directeur d’Études at École des Hautes Études en Sciences Sociales in Paris, where he heads a team at the intersection of cognitive and computer sciences.

• Martial Hebert is a Professor of Robotics and the Director of the Robotics Institute at Carnegie Mellon University.

• Diane Larlus is a senior research scientist in the Computer Vision group at NAVER LABS Europe.

• Hugo Larochelle is the lead of the Google Brain group in Montreal and an adjunct professor at Université de Sherbrooke.

• Yann LeCun is Facebook’s Chief AI Scientist, and Silver Professor of Data Science, Computer Science, Neural Science, and Electrical Engineering at New York University.

• Julien Mairal is a research scientist in the Thoth research team at Inria Grenoble – Rhône-Alpes.

• Julien Perez is the manager of the Machine Learning and Optimization group at NAVER LABS Europe.

• Jean Ponce is a Professor at École Normale Supérieure in Paris, and is currently a visiting professor at New York University.

• Cordelia Schmid is a research director at Inria Grenoble – Rhône-Alpes, where she heads the Thoth research team.

• Andrew Zisserman leads the Visual Geometry Group at the University of Oxford, and is also affiliated with DeepMind.

There is a controversy these days on social media about academics claiming that “if you do not feel like working 60+ hours per week including weekends and evenings you should probably find another job”. This is utterly frustrating. As an academic myself, I clearly do not feel like working days and nights, even if I am neck deep in a project. Does it make me a poor assistant prof?

It is no secret that academic research is not, in general, a 9-to-5 job (not saying that it cannot be). I myself usually work on weekends, when commuting, or during holidays. I always carry a few papers that I have not had time to read, or relentlessly write equations in my notebook when an idea strikes me during the morning commute. I think about how I could improve my next day's lecture and make changes to my slides late in the evening. That is partly because I am disorganised, and partly because the job sort of requires it. We lack time to do all the things we want to do while at the lab. Conferences, seminars and meetings wreck your schedule, if you had any. So you might end up seeing any time off as a lost opportunity to do more research.

This situation is clearly not good, and many academics, including me, have or have had a hard time dealing with it. In particular when you are still a PhD student/postdoc/tenure-tracker (you name it) and need to stand out from the pack to get a position in an extremely competitive environment. And when senior full professors tell you that you need to work even harder if you want to be considered worthy, that is clearly not helping.

My view is that even if we all want to shoot for the moon (and hit the damn thing), it is totally fine to simply do some good research. Not every one of your papers has to be groundbreaking; as long as it contributes to the field, it should be good enough and acknowledged as such. If your goal is to have that-many JRSS/Biometrika/Annals of Stats papers and a sky-rocketing h-index no matter what, you are probably doing it wrong. Competition between academics can be a good boost from time to time, but it should not be the end of it. More importantly, it should not be what drives an academic career. The negative externalities of this system are depressed junior researchers and limited scientific research squeezed out of our brains to extend our publication records.

So what should we do about this? Well, first, stop bragging about how many hours you work per week; if you feel like working 24/7, good for you, but it does not have to be that way for everyone. Secondly, stop judging academics (especially junior ones) on dumb metrics such as the h-index; if you need to evaluate a candidate, read their research. In short, cut the competition and be supportive. Make academia fun again!

In this third and last post about the sub-Gaussian property of the Beta distribution [1] (post 1 and post 2), I would like to show the interplay with the Bernoulli distribution, as well as some connections with optimal transport (OT is a hot topic in general, and also on this blog with Pierre’s posts on Wasserstein ABC).

Let us see how sub-Gaussian proxy variances can be derived from transport inequalities. To this end, we first need to introduce the **Wasserstein distance** (of order 1) between two probability measures *P* and *Q* on a space $\mathcal{X}$. It is defined with respect to a distance *d* on $\mathcal{X}$ by

$$W_1(P, Q) = \inf_{\pi \in \Gamma(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y) \, \pi(\mathrm{d}x, \mathrm{d}y),$$

where $\Gamma(P, Q)$ is the set of probability measures on $\mathcal{X} \times \mathcal{X}$ with fixed marginal distributions $P$ and $Q$ respectively. Then, a probability measure $P$ is said to satisfy a **transport inequality** with positive constant $C$ if, for any probability measure $Q$ dominated by $P$,

$$W_1(P, Q) \le \sqrt{2 C \, K(Q|P)},$$

where $K(Q|P)$ is the entropy, or Kullback–Leibler divergence, between $Q$ and $P$. The nice result proven by Bobkov and Götze (1999) [2] is that the constant $C$ is a sub-Gaussian proxy variance for *P*.

For a discrete space $\mathcal{X}$ equipped with the Hamming metric, $d(x, y) = \mathbb{1}_{x \neq y}$, the induced Wasserstein distance reduces to the total variation distance, $W_1(P, Q) = \|P - Q\|_{\mathrm{TV}}$. In that setting, Ordentlich and Weinberger (2005) [3] proved the distribution-sensitive transport inequality:

$$\|P - Q\|_{\mathrm{TV}} \le \sqrt{\frac{1}{\varphi(\bar{\pi}_P)} \, K(Q|P)},$$

where the function $\varphi$ is defined by $\varphi(p) = \frac{1}{1 - 2p} \log \frac{1 - p}{p}$ (extended by continuity at $p = 1/2$), and the coefficient $\bar{\pi}_P$ is called the balance coefficient of $P$, defined by $\bar{\pi}_P = \max_{A \subset \mathcal{X}} \min(P(A), 1 - P(A))$. In particular, the balance coefficient of a Bernoulli distribution with mean $\mu \le 1/2$ is easily shown to coincide with its mean. Hence, applying the result of Bobkov and Götze (1999) [2] to the above transport inequality yields a distribution-sensitive proxy variance of $\frac{1 - 2\mu}{2 \log \frac{1 - \mu}{\mu}}$ for the Bernoulli with mean $\mu$, as plotted in blue above.

In the Beta distribution case, we have not been able to extend this transport inequality methodology, since the support is not discrete. However, a nice limiting argument holds. Consider a sequence of Beta random variables with fixed mean $\mu$ and with the sum of their parameters going to zero. This sequence converges to a Bernoulli random variable with mean $\mu$, and we have shown that the limiting optimal proxy variance of such a sequence of Beta distributions with decreasing parameter sum is that of the Bernoulli.
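As a numerical sanity check on this constant (my own illustration, not from the paper), one can verify in Python that the proxy variance $\frac{1 - 2\mu}{2 \log((1 - \mu)/\mu)}$ dominates the centred Bernoulli moment generating function, and dominates the variance $\mu(1 - \mu)$:

```python
import math

def bernoulli_centered_mgf(lam, mu):
    """E[exp(lam * (X - mu))] for X ~ Bernoulli(mu)."""
    return mu * math.exp(lam * (1 - mu)) + (1 - mu) * math.exp(-lam * mu)

def bernoulli_proxy_variance(mu):
    """Distribution-sensitive proxy variance (1 - 2 mu) / (2 log((1 - mu)/mu)),
    extended by continuity to 1/4 at mu = 1/2."""
    if abs(mu - 0.5) < 1e-12:
        return 0.25
    return (1 - 2 * mu) / (2 * math.log((1 - mu) / mu))
```

Evaluating the sub-Gaussian bound $\exp(\lambda^2 \sigma^2 / 2)$ on a grid of $\lambda$ values shows it sitting just above the centred MGF, with near-tangency at some nonzero $\lambda$, as expected from an optimal proxy variance.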

[1] Marchal, O. and Arbel, J. (2017). On the sub-Gaussianity of the Beta and Dirichlet distributions. Electronic Communications in Probability, 22:1–14. Code on GitHub.

[2] Bobkov, S. G. and Götze, F. (1999). Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28.

[3] Ordentlich, E. and Weinberger, M. J. (2005). A distribution dependent refinement of Pinsker’s inequality. IEEE Transactions on Information Theory, 51(5):1836–1840.

As a follow-up on my previous post on the sub-Gaussian property for the Beta distribution [1], I’ll give here a visual illustration of the proof.

A random variable $X$ with finite mean $\mu = \mathbb{E}[X]$ is sub-Gaussian if there is a positive number $\sigma^2$ such that, for all $\lambda \in \mathbb{R}$:

$$\mathbb{E}[\exp(\lambda (X - \mu))] \le \exp\left(\frac{\lambda^2 \sigma^2}{2}\right).$$

We focus on *X* being a Beta$(\alpha, \beta)$ random variable. Its moment generating function is known as the Kummer function, or confluent hypergeometric function: $\mathbb{E}[\exp(\lambda X)] = {}_1F_1(\alpha, \alpha + \beta, \lambda)$. So *X* is $\sigma^2$-sub-Gaussian as soon as the difference function

$$u(\lambda) = \exp\left(\mu \lambda + \frac{\lambda^2 \sigma^2}{2}\right) - {}_1F_1(\alpha, \alpha + \beta, \lambda)$$

remains positive on $\mathbb{R}$, where $\mu = \alpha / (\alpha + \beta)$ denotes the mean. This difference function is plotted on the right panel above for a fixed choice of parameters $(\alpha, \beta)$. In the plot, $\sigma^2$ varies from green for the variance (which is a lower bound to the optimal proxy variance) to blue for the value $\frac{1}{4(\alpha + \beta + 1)}$, a simple upper bound given by Elder (2016), [2]. The idea of the proof is simple: the optimal proxy variance corresponds to the value of $\sigma^2$ for which $u$ admits a double zero, as illustrated with the red curve (black dot). The left panel shows the curves $u$ with $\sigma^2$ varying, interpolating from green for the variance to blue for the upper bound, with only one curve qualifying as the optimal proxy variance in red.
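The difference function is easy to explore numerically. Below is a Python sketch (the function names are mine) that evaluates the Beta moment generating function through its moment series, and the gap $u(\lambda)$; with Elder's bound as proxy variance the gap stays nonnegative, while the variance of an asymmetric Beta fails:

```python
import math

def beta_mgf(lam, a, b, n_terms=200):
    """MGF of Beta(a, b), via E[exp(lam X)] = sum_k lam^k / k! * E[X^k],
    using the moment recursion E[X^k] = E[X^(k-1)] * (a + k - 1) / (a + b + k - 1)."""
    total, term, moment = 1.0, 1.0, 1.0
    for k in range(1, n_terms):
        moment *= (a + k - 1) / (a + b + k - 1)
        term *= lam / k
        total += term * moment
    return total

def sub_gaussian_gap(lam, a, b, proxy):
    """u(lam) = exp(mu lam + proxy lam^2 / 2) - MGF(lam); the Beta(a, b)
    variable is proxy-sub-Gaussian iff u stays nonnegative on the real line."""
    mu = a / (a + b)
    return math.exp(mu * lam + 0.5 * proxy * lam ** 2) - beta_mgf(lam, a, b)
```

For Beta(1, 2), for instance, the gap computed with the variance $1/18$ as proxy dips below zero for small positive $\lambda$, reflecting the positive skewness, whereas Elder's bound $1/16$ keeps it nonnegative.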

[1] Marchal, O. and Arbel, J. (2017). On the sub-Gaussianity of the Beta and Dirichlet distributions. Electronic Communications in Probability, 22:1–14. Code on GitHub.

[2] Elder (2016), Bayesian Adaptive Data Analysis Guarantees from Subgaussianity, https://arxiv.org/abs/1611.00065

Hi all,

Kristian Lum, who was already one of my Statistics superheroes for her many interesting papers and great talks, bravely wrote the following text about her experience as a young statistician going to conferences:

https://medium.com/@kristianlum/statistics-we-have-a-problem-304638dc5de5

I can’t thank Kristian enough for speaking out. Her experience is both shocking and hardly surprising. Many, many academics report similar stories. This simply can’t go on like that.

I happen to have gone to the conferences mentioned by Kristian, and my experience as a young man was completely different. It was all about meeting interesting people, discussing ideas, being challenged, and having good times. Nobody harassed, touched or assaulted me. There was some flirting, as I guess is natural when hundreds of people are put in sunny places far away from home, but I was never the victim of any misconduct or abuse of power. So instead of driving me out of the field, conferences became important, enriching and rewarding moments of my professional life.

Looking back at those conferences I feel sick, and heartbroken, at the thought that some of my peers were having such a difficult time, because of predators who don’t ever face the consequences of their actions. Meanwhile I was part of the silent majority.

The recent series of revelations about sexual harassment and assaults in other professional environments indicate that this is not specific to our field, nor to academia. But this does not make it any more acceptable. I know for a fact that many leaders of our field take this issue extremely seriously (as Kristian mentions too), but clearly much much more needs to be done. The current situation is just shameful; strong and coordinated actions will be needed to fix it. Thanks again to Kristian for the wake-up call.


Hi all,

This post deals with a strange phenomenon in R that I have noticed while working on unbiased MCMC. Reducing the problem to a simple form, consider the following code, which iteratively samples a vector ‘x’ and stores it in a row of a large matrix called ‘chain’ (I’ve kept the MCMC terminology).

```r
dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
  if (imcmc == nrow(chain)){
    # call to nrow
  }
  x = rnorm(dimstate, mean = 0, sd = 1)
  chain[imcmc,] = x # copying of x in chain
}
```

If you execute this code, you will see that it is surprisingly slow: it takes close to a minute on my computer. Now, consider the next block, which does exactly the same except that the vector ‘x’ is not copied into the matrix ‘chain’.

```r
dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
  if (imcmc == nrow(chain)){
    # call to nrow
  }
  x = rnorm(dimstate, mean = 0, sd = 1)
  # chain[imcmc,] = x # no more copying
}
```

This code runs nearly instantaneously. Could it be that just copying a vector in a matrix takes a lot of time? Sounds unlikely. Now consider this third block.

```r
dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
  if (imcmc == nmcmc){
    # no call to nrow
  }
  x = rnorm(dimstate, mean = 0, sd = 1)
  chain[imcmc,] = x # copying of x in chain
}
```

This code runs nearly instantaneously as well; this time ‘x’ is copied into ‘chain’, but the call to the nrow function is removed….?! What is nrow doing? It is meant to simply return dim(chain)[1], the first dimension of chain. So consider this fourth block.

```r
dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
  if (imcmc == dim(chain)[1]){
    # call to dim instead of nrow
  }
  x = rnorm(dimstate, mean = 0, sd = 1)
  chain[imcmc,] = x # copying of x in chain
}
```

This one also runs instantaneously! So replacing nrow(chain) by dim(chain)[1] solves the problem. Why?

The answer comes from R guru and terrific statistician Louis Aslett. I directly quote from an exchange of emails, since he brilliantly explains the phenomenon.

You probably know R stores everything by reference, so if I do:

x <- matrix(0, nrow=1e5, ncol=100)

y <- x

I actually only have one copy of the matrix in memory with two references to it. If I then do:

x[1,1] <- 1

R will first make a copy of the whole matrix, update x to point to that and then change the first element to one. This idea is used when you pass a variable to a standard (i.e. non-core, non-primitive) R function, which nrow is: it creates a reference to the variable you pass so that it doesn’t have to copy and the function call is very fast …. as long as you don’t write to it inside the function, no copy need ever happen. But the “bad design” bit is that R makes a decision whether to copy on write based only on a reference count and crucially that reference count stays increased even after a function returns, irrespective of whether or not the function has touched the variable.

So:

x <- matrix(0, nrow=1e5, ncol=100) # matrix has ref count 1

x[1,1] <- 1 # ref count is 1, so write with no copy

nrow(x) # ref count is 2 even though nothing was touched

x[1,1] <- 1 # ref count still 2, so R copies before writing first element. Now the ref count drops to 1 again

x[2,2] <- 1 # this writes without a copy as ref count got reset on last line

nrow(x) # ref count jumps

x[3,3] <- 1 # copy invoked again! Aaaargh!

So by calling nrow in the loop for the first example, the chain matrix is being copied in full on every iteration. In the second example, chain is never written to so there is no negative side effect to the ref count having gone up. In the third example, chain only ever has ref count 1 so there are no copies and each row is written in-place. I did a quick bit of profiling and indeed in the slow example, the R garbage collector allocates and tidies up nearly 9GB of RAM when executing the loop!

The crazy thing is that dim(chain)[1] works at full speed even though that is all nrow is doing under the hood, but the reason is that dim is a so-called “primitive” core R function, which is special because it doesn’t affect the reference counter of its arguments. If you want to dig into this yourself, there’s a function refs() in the pryr package which tells you the current reference count of any variable.
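To see this in action, here is a minimal sketch using base R’s tracemem(), which prints a message every time its argument gets duplicated. One caveat on this illustration: since R 4.0, R uses true reference counting, so on recent versions the spurious copy triggered by nrow may no longer occur; on the NAMED-based versions current at the time of writing, it does.

```r
# Watch copy-on-write happen (or not) with base R's tracemem().
x <- matrix(0, nrow = 1e5, ncol = 100)
tracemem(x)   # from now on, R prints a line whenever x is duplicated

x[1, 1] <- 1  # ref count is 1: written in place, no tracemem output expected
nrow(x)       # non-primitive call: on pre-4.0 R this bumps the ref count
x[1, 1] <- 2  # on pre-4.0 R, tracemem reports a full copy here
dim(x)[1]     # primitive: never bumps the ref count
x[2, 2] <- 1  # written in place
untracemem(x)
```

If pryr is installed, pryr::refs(x) gives the current reference count directly, and pryr::address(x) lets you check whether the matrix moved in memory after a write.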

Thanks Louis!

Hi,

With Stephane Shao, Jie Ding and Vahid Tarokh, we have just arXived a tech report entitled “Bayesian model comparison with the Hyvärinen score: computation and consistency”. Here I’ll explain the context, that is, scoring rules and Hyvärinen scores (originating in Hyvärinen’s score matching approach to inference), and then what we actually do in the paper.

Let’s start with *scoring rules*. These are loss functions for the task of predicting a variable $y$ with a probability distribution $p$. If $p$ is used to predict $y$ and $y$ occurs, then the score is a real value, e.g. denoted by $S(y, p)$; the smaller the score the better, and overall we want to find $p$ that minimizes $\mathbb{E}[S(Y, p)]$, where the expectation is with respect to the distribution of $Y$. A scoring rule is *proper* if the above expectation is minimized when $p$ is precisely the distribution of $Y$. An example of a *proper scoring rule* is $S(y, p) = -\log p(y)$, the *logarithmic scoring rule*.
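Why the logarithmic score is proper follows from a standard Gibbs inequality argument: if $q$ denotes the true density of $Y$ and $p$ a candidate, then

```latex
\mathbb{E}_q\left[-\log p(Y)\right] - \mathbb{E}_q\left[-\log q(Y)\right]
  = \int q(y)\,\log\frac{q(y)}{p(y)}\,\mathrm{d}y
  = \mathrm{KL}\left(q \,\middle\|\, p\right) \geq 0,
```

with equality if and only if $p = q$, so the expected score is indeed minimized at the true distribution.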

We can interpret Bayes factors in terms of logarithmic scoring rules (as in Chapter 6 of Bernardo & Smith). Indeed, the logarithm of the Bayes factor between models $M_1$ and $M_2$ is the difference of log-evidences:

$\log B_{1,2} = \log p(y_{1:n} \mid M_1) - \log p(y_{1:n} \mid M_2).$

In this sense, the Bayes factor compares the predictive performance of models. Decomposing these marginal likelihoods into conditionals, writing the observations as $y_{1:n} = (y_1, \ldots, y_n)$, we have for model $M$:

$\log p(y_{1:n} \mid M) = \sum_{t=1}^{n} \log p(y_t \mid y_{1:t-1}, M)$

(with the convention $p(y_1 \mid y_{1:0}, M) = p(y_1 \mid M)$), which can be interpreted as a measure of the performance of the out-of-sample predictive distributions $p(dy_t \mid y_{1:t-1}, M)$, summed over time. Importantly, this interpretation of Bayes factors also holds when the models are misspecified.

So what’s not to like? Prior specification affects the evidence, which is completely fine per se. What’s concerning is the extent of the impact of the prior. Seemingly innocent changes of prior distributions can have drastic effects on the evidence and thus on Bayes factors. This is the case in the simplest example of a Normal location model: $y_i \sim \mathcal{N}(\theta, \sigma^2)$, with $\sigma^2$ fixed and prior $\theta \sim \mathcal{N}(0, \sigma_0^2)$. Then the log-evidence behaves like $-\log \sigma_0$ when $\sigma_0 \to \infty$. This means that the log-evidence can take arbitrarily negative values, and is not even well-defined in that limit. However, that limit corresponds to a flat prior, which is not crazy in this model, at least in terms of parameter inference. This is a reason for people to avoid vague priors when relying on Bayes factors for model comparison.
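To make this concrete, consider a single observation in the Normal location model, with (standard but here assumed) notation $y_1 \sim \mathcal{N}(\theta, \sigma^2)$ and prior $\theta \sim \mathcal{N}(0, \sigma_0^2)$; the evidence is then available in closed form:

```latex
p(y_1) = \int \mathcal{N}(y_1;\, \theta, \sigma^2)\, \mathcal{N}(\theta;\, 0, \sigma_0^2)\, \mathrm{d}\theta
       = \mathcal{N}(y_1;\, 0, \sigma^2 + \sigma_0^2),
\qquad
\log p(y_1) = -\log \sigma_0 + O(1) \quad \text{as } \sigma_0 \to \infty,
```

so the evidence vanishes, and the log-evidence diverges, as the prior flattens.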

Conversely, this is a reason to seek alternatives to the evidence as a model comparison criterion, see for instance intrinsic Bayes factors, fractional Bayes factors, or the mixture approach of Kamary et al. Our work follows Dawid & Musio (2015), who propose to change the scoring rule. Instead of the logarithmic scoring rule, they advocate the Hyvärinen scoring rule, which leads to replacing $\log p(y_{1:n} \mid M)$ by

$\mathcal{H}(y_{1:n}, M) = \sum_{t=1}^{n} \left( 2\, \Delta_{y_t} \log p(y_t \mid y_{1:t-1}, M) + \left\lVert \nabla_{y_t} \log p(y_t \mid y_{1:t-1}, M) \right\rVert^2 \right).$

This barbaric expression involves derivatives of the log-densities of the predictive distributions, instead of the log-densities themselves. It can then be checked in the Normal location model that the score is well-defined even in the limit where the prior variance goes to infinity. Thankfully, it can also be checked that the Hyvärinen score is *proper*. Note that variants of the score have been proposed for discrete observations, but there are cases where the Hyvärinen score is inapplicable, namely when predictive densities are not smooth enough, e.g. Laplace distributions.
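For intuition, take a univariate Gaussian predictive, written (hypothetically) as $p(y_t \mid y_{1:t-1}, M) = \mathcal{N}(y_t;\, \mu_t, \lambda_t^2)$; the $t$-th term of the score is then explicit:

```latex
2\,\frac{\partial^2}{\partial y_t^2} \log p(y_t \mid y_{1:t-1}, M)
+ \left(\frac{\partial}{\partial y_t} \log p(y_t \mid y_{1:t-1}, M)\right)^2
= -\frac{2}{\lambda_t^2} + \frac{(y_t - \mu_t)^2}{\lambda_t^4},
```

which involves no normalizing constant, and each such term converges to a finite limit as the prior variance grows, unlike the log-evidence.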

In the paper, we show how sequential Monte Carlo samplers can approximate this scoring rule, for a wide range of models including nonlinear state space models. We also show the consistency of this scoring rule for model selection, as the number of observations goes to infinity; our proof relies on strong regularity assumptions, but the numerical experiments indicate that the results hold under weaker conditions. Finally, we investigate an example of a population growth model applied to kangaroos, and a Lévy-driven stochastic volatility model, which we use to illustrate the consistency result. Both of these cases feature intractable likelihoods approximated by particle filters within an SMC^2 algorithm.

The code producing the figures of the paper is available on Github: https://github.com/pierrejacob/bayeshscore
