## nrow, references and copies

Hi all,

This post deals with a strange phenomenon in R that I have noticed while working on unbiased MCMC. Reducing the problem to a simple form, consider the following code, which iteratively samples a vector ‘x’ and stores it in a row of a large matrix called ‘chain’ (I’ve kept the MCMC terminology).

dimstate = 100 nmcmc = 1e4 chain = matrix(0, nrow = nmcmc, ncol = dimstate) for (imcmc in 1:nmcmc){ if (imcmc == nrow(chain)){ #call to nrow } x = rnorm(dimstate, mean = 0, sd = 1) chain[imcmc,] = x #copying of x in chain }

If you execute this code, you will see that it is surprisingly slow: it takes close to a minute on my computer. Now, consider the next block, which does exactly the same except that the vector ‘x’ is not copied into the matrix ‘chain’.

dimstate = 100 nmcmc = 1e4 chain = matrix(0, nrow = nmcmc, ncol = dimstate) for (imcmc in 1:nmcmc){ if (imcmc == nrow(chain)){ #call to nrow } x = rnorm(dimstate, mean = 0, sd = 1) # chain[imcmc,] = x #no more copying }

This code runs nearly instantaneously. Could it be that just copying a vector in a matrix takes a lot of time? Sounds unlikely. Now consider this third block.

dimstate = 100 nmcmc = 1e4 chain = matrix(0, nrow = nmcmc, ncol = dimstate) for (imcmc in 1:nmcmc){ if (imcmc == nmcmc){ #no call to nrow } x = rnorm(dimstate, mean = 0, sd = 1) chain[imcmc,] = x #copying of x in chain }

This code runs nearly instantaneously as well; this time ‘x’ is copied into ‘chain’, but the call to the nrow function is removed….?! What is nrow doing? It is meant to simply return dim(chain)[1], the first dimension of chain. So consider this fourth block.

dimstate = 100 nmcmc = 1e4 chain = matrix(0, nrow = nmcmc, ncol = dimstate) for (imcmc in 1:nmcmc){ if (imcmc == dim(chain)[1]){ #call to dim instead of nrow } x = rnorm(dimstate, mean = 0, sd = 1) chain[imcmc,] = x #copying of x in chain }

This one also runs instantaneously! So replacing nrow(chain) by dim(chain)[1] solves the problem. Why?

The answer comes from R guru and terrific statistician Louis Aslett. I directly quote from an exchange of emails, since he brilliantly explains the phenomenon.

You probably know R stores everything by reference, so if I do:

x <- matrix(0, nrow=1e5, ncol=100)

y <- xI actually only have one copy of the matrix in memory with two references to it. If I then do:

x[1,1] <- 1

R will first make a copy of the whole matrix, update x to point to that and then change the first element to one. This idea is used when you pass a variable to a standard (i.e. non-core, non-primitive) R function, which nrow is: it creates a reference to the variable you pass so that it doesn’t have to copy and the function call is very fast …. as long as you don’t write to it inside the function, no copy need ever happen. But the “bad design” bit is that R makes a decision whether to copy on write based only on a reference count and crucially that reference count stays increased even after a function returns, irrespective of whether or not the function has touched the variable.

So:

x <- matrix(0, nrow=1e5, ncol=100) # matrix has ref count 1

x[1,1] <- 1 # ref count is 1, so write with no copy

nrow(x) # ref count is 2 even though nothing was touched

x[1,1] <- 1 # ref count still 2, so R copies before writing first element. Now the ref count drops to 1 again

x[2,2] <- 1 # this writes without a copy as ref count got reset on last line

nrow(x) # ref count jumps

x[3,3] <- 1 # copy invoked again! Aaaargh!So by calling nrow in the loop for the first example, the chain matrix is being copied in full on every iteration. In the second example, chain is never written to so there is no negative side effect to the ref count having gone up. In the third example, chain only ever has ref count 1 so there are no copies and each row is written in-place. I did a quick bit of profiling and indeed in the slow example, the R garbage collector allocates and tidies up nearly 9GB of RAM when executing the loop!

The crazy thing is that dim(chain)[1] works full speed even though that is all that nrow is doing under the hood, but the reason is that dim is a so-called “primitive” core R function which is special because it doesn’t affect the reference counter of its arguments. If you want to dig into this yourself, there’s a function refs() in the pryr package which tells you the current reference count to any variable.

Thanks Louis!

## New R user community in Grenoble, France

Nine R user communities already exist in France and there is a much large number of R communities around the world. It was time for Grenoble to start its own!

The goal of the R user group is to facilitate the identification of local useRs, to initiate contacts, and to organise experience and knowledge sharing sessions. The group is open to any local useR interested in learning and sharing knowledge about R.

The group’s website features a map and table with members of the R group. Members with specific skills related to the use of R are referenced in a table and can be contacted by other members. A gitter allows members to discuss R issues and a calendar presents the upcoming events. (more…)

## Likelihood calculation for the g-and-k distribution

Hello,

An example often used in the ABC literature is the g-and-k distribution (e.g. reference [1] below), which is defined through the inverse of its cumulative distribution function (cdf). It is easy to simulate from such distributions by drawing uniform variables and applying the inverse cdf to them. However, since there is no closed-form formula for the probability density function (pdf) of the g-and-k distribution, the likelihood is often considered intractable. It has been noted in [2] that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cdf, and 2) numerically differentiating the cdf, using finite differences, for instance. As it happens, this is very easy to implement, and I coded up an R tutorial at:

github.com/pierrejacob/winference/blob/master/inst/tutorials/tutorial_gandk.pdf

for anyone interested. This is part of the winference package that goes with our tech report on ABC with the Wasserstein distance (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, to be updated very soon!). This enables standard MCMC algorithms for the g-and-k example. It is also very easy to compute the likelihood for the multivariate extension of [3], since it only involves a fixed number of one-dimensional numerical inversions and differentiations (as opposed to a multivariate inversion).

Surprisingly, most of the papers that present the g-and-k example do not compare their ABC approximations to the posterior; instead, they typically compare the proposed ABC approach to existing ones. Similarly, the so-called Ricker model is commonly used in the ABC literature, and its posterior can be tackled efficiently using particle MCMC methods; as well as the M/G/1 model, which can be tackled either with particle MCMC methods or with tailor-made MCMC approaches such as [4].

These examples can still have great pedagogical value in ABC papers, but it would perhaps be nice to see more comparisons to the ground truth when it’s available; ground truth here being the actual posterior distribution.

- Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B, 74, 419–474.
- Rayner, G. D. and MacGillivray, H. L. (2002) Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Statistics and Computing, 12, 57–75.
- Drovandi, C. C. and Pettitt, A. N. (2011) Likelihood-free Bayesian estimation of multivari- ate quantile distributions. Computational Statistics & Data Analysis, 55, 2541–2556.
- Shestopaloff, A. Y. and Neal, R. M. (2014) On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548.

## 3D density plot in R with Plotly

In Bayesian nonparametrics, many models address the problem of *density regression*, including covariate dependent processes. These were settled by the pioneering works by [current ISBA president] MacEachern (1999) who introduced the general class of dependent Dirichlet processes. The literature on dependent processes was developed in numerous models, such as nonparametric regression, time series data, meta-analysis, to cite but a few, and applied to a wealth of fields such as, e.g., epidemiology, bioassay problems, genomics, finance. For references, see for instance the chapter by David Dunson in the Bayesian nonparametrics textbook (edited in 2010 by Nils Lid Hjort, Chris Holmes, Peter Müller and Stephen G. Walker). With Kerrie Mengersen and Judith Rousseau, we have proposed a dependent model in the same vein for modeling the influence of fuel spills on species diversity (arxiv).

Several densities can be plotted on the same 3D plot thanks to the Plotly R library, *“an interactive, browser-based charting library built on the open source JavaScript graphing library, plotly.js.”*

In our ecological example, the model provides a series of densities on the *Y* axis (in our case, posterior density of species diversity), indexed by some covariate *X* (a pollutant). See file density_plot.txt. The following Plotly R code provides a graph as below. For the interactive version, see the RPubs page here.

library(plotly) mydata = read.csv("density_plot.txt") df = as.data.frame(mydata) plot_ly(df, x = Y, y = X, z = Z, group = X, type = "scatter3d", mode = "lines")

## Statistics journals network

Xian blogged recently on the incoming RSS read paper: Statistical Modelling of Citation Exchange Between Statistics Journals, by Cristiano Varin, Manuela Cattelan and David Firth. Following the last JRSS B read paper by one of us! The data that are used in the paper (and can be downloaded here) are quite *fascinating* for us, *academics fascinated by academic rankings, for better or for worse *(ironic here). They consist in cross citations counts for 47 statistics journals (see list and abbreviations page 5): is the number of citations from articles published in journal in 2010 to papers published in journal in the 2001-2010 decade. The choice of the list of journals is discussed in the paper. Major journals missing include *Bayesian Analysis* (published from 2006), *The Annals of Applied Statistics* (published from 2007).

I looked at the ratio of Total Citations Received by Total Citations made. This is a super simple descriptive statistic which happen to look rather similar to Figure 4 which plots Export Scores from Stigler model (can’t say more about it, I haven’t read in detail). The top five is the same modulo the swap between *Annals of Statistics* and *Biometrika*. Of course a big difference is that the Cited/Citation ratio isn’t endowed with a measure of uncertainty (below, left is my making, right is Fig. 4 in the paper).

I was surprised not to see a graph / network representation of the data in the paper. As it happens I wanted to try the gephi software for drawing graphs, used for instance by François Caron and Emily Fox in their sparse graphs paper. I got the above graph, where:

- for the data, I used the citations matrix renormalized by the total number of citations made, which I denote by . This is a way to account for the size (number of papers published) of the journal. This is just a proxy though since the actual number of papers published by the journal is not available in the data. Without that correction,
*CSDA*is way ahead of all the others. - the node size represents the Cited/Citing ratio
- the edge width represents the renormalized . I’m unsure of what gephi does here, since it converts my directed graph into an undirected graph. I suppose that it displays only the largest of the two edges and .
- for a better visibility I kept only the first decile of heaviest edges.
- the clusters identified by four colors are modularity classes obtained by the Louvain method.

**Some remarks**

The two software journals included in the dataset are quite outliers:

- the
*Journal of Statistical Software (JSS)*is disconnected from the others, meaning it has no normalized citations in the first decile. Except from its self citations which are quite big and make it the 4th Impact Factor from the total list in 2010 (and apparently the first in 2015). - the largest is the self citations of the
*STATA Journal (StataJ).*

Centrality:

*CSDA*is the most central journal in the sense of the highest (unweighted) degree.

**Some further thoughts**

All that is just for the fun of it. As mentioned by the authors, citation counts are heavy-tailed, meaning that just a few papers account for much of the citations of a journal while most of the papers account for few citations. As a matter of fact, the total of citations received is mostly driven by a few super-cited papers, and also is the Cited/Citations matrix that I use throughout for building the graph. A reason one could put forward about why JRSS B makes it so well is the read papers: for instance, Spiegelhalter et al. (2002), DIC, received alone 11.9% of all JRSS B citations in 2010. Who’d bet the number of citation this new read paper (JRSS A though) will receive?

## momentify R package at BAYSM14

I presented an arxived paper of my postdoc at the big success Young Bayesian Conference in Vienna. The big picture of the talk is simple: there are situations in Bayesian nonparametrics where you don’t know how to sample from the posterior distribution, but you can only compute posterior expectations (so-called *marginal methods*). So e.g. you cannot provide credible intervals. But sometimes all the moments of the posterior distribution are available as posterior expectations. So morally, you should be able to say more about the posterior distribution than just reporting the posterior mean. To be more specific, we consider a hazard (h) mixture model

where is a kernel, and the mixing distribution is random and discrete (Bayesian nonparametric approach).

We consider the survival function which is recovered from the hazard rate by the transform

and some possibly censored survival data having survival . Then it turns out that all the posterior moments of the survival curve evaluated at any time can be computed.

The nice trick of the paper is to use the representation of a distribution in a [Jacobi polynomial] basis where the coefficients are linear combinations of the moments. So one can sample from [an approximation of] the posterior, and with a posterior sample we can do everything! Including credible intervals.

I’ve wrapped up the few lines of code in an R package called momentify (not on CRAN). With a sequence of moments of a random variable supported on [0,1] as an input, the package does two things:

- evaluates the approximate density
- samples from it

A package example for a mixture of beta and 2 to 7 moments gives that result:

## Moustache target distribution and Wes Anderson

Today I am going to introduce the moustache target distribution (*moustarget* distribution for brievety). Load some packages first.

library(wesanderson) # on CRAN library(RShapeTarget) # available on https://github.com/pierrejacob/RShapeTarget/ library(PAWL) # on CRAN

Let’s invoke the *moustarget* distribution.

shape <- create_target_from_shape( file_name=system.file(package = "RShapeTarget", "extdata/moustache.svg"), lambda=5) rinit <- function(size) matrix(rnorm(2*size), ncol = 2) moustarget <- target(name = "moustache", dimension = 2, rinit = rinit, logdensity = shape$logd, parameters = shape$algo_parameters)

This defines a target distribution represented by a SVG file using RShapeTarget. The target probability density function is defined on and is proportional to on the segments described in the SVG files, and decreases exponentially fast to away from the segments. The density function of the *moustarget* is plotted below, a picture being worth a thousand words.

## Rasmus Bååth’s Bayesian first aid

Besides having coded a pretty cool MCMC app in Javascript, this guy Rasmus Bååth has started the Bayesian first aid project. The idea is that if there’s an R function called **blabla.test** performing test “blabla”, there should be a function **bayes.blabla.test** performing a similar test in a Bayesian framework, and showing the output in a similar way so that the user can easily compare both approaches.This post explains it all. Jags and BEST seem to be the two main workhorses under the hood.

Kudos to Rasmus for this very practical approach, potentially very impactful. Maybe someday people will have to specify if they want a frequentist approach and not the other way around! (I had a dream, etc).

## From SVG to probability distributions [with R package]

Hey,

To illustrate generally complex probability density functions on continuous spaces, researchers always use the same examples, for instance mixtures of Gaussian distributions or a banana shaped distribution defined on with density function:

If we draw a sample from this distribution using MCMC we obtain a [scatter]plot like this one:

Clearly it doesn’t really look like a banana, even if you use yellow to colour the dots like here. Actually it looks more like a boomerang, if anything. I was worried about this for a while, until I came up with a more realistic banana shaped distribution:

See how the shape is well defined compared to the first figure? And there’s even the little tail, that proves so convenient when we want to peel off the fruit. More generally we might want to create target density functions based on general shapes. For this you can now try RShapeTarget, which you can install directly from R using devtools:

library(devtools) install_github(repo="RShapeTarget", username="pierrejacob")

The package parses SVG files representing shapes, and creates target densities from them. More precisely, a SVG files contains “paths”, which are sequence of points (for instance the above banana is a single closed path). The associated log density at any point is defined by where is the closest path of the shape from and is the distance between the point and the path. The parameter specifies the rate at which the density decays when the point goes away from the shape. With this you can define the maple leaf distribution, as a tribute to JSM 2013:

In the package you can get a distribution from a SVG file using the following code:

library(RShapeTarget) # create target from file my_shape_target <- create_target_from_shape(my_svg_file_name, lambda =1) # test the log density function on 25 randomly generated points my_shape_target$logd(matrix(rnorm(50), ncol = 2), my_shape_target$algo_parameters)

Since characters are just a bunch of paths, you can also define distributions based on words, for instance:

which is done as follows (warning you’re only allowed a-z and A-Z, no numbers no space no punctuation for now):

library(RShapeTarget) word_target <- create_target_from_word("Hodor")

For the words, I defined the target density function as before, except that it’s constant on the letters: so if a point is outside a letter its density is computed based on the distance to the nearest path; if it’s inside a letter it’s just constant, so that the letters are “filled” with some constant density. I thought it’d look better.

Now I’m not worried about the banana shaped distribution any more, but by the fact that the only word I could think of was “Hodor” (with whom you can chat over there).

## Using R in LaTeX with knitr and RStudio

I presented today at INSEE R user group (FLR) how to use **knitr** (Sweave evolution) for writing documents which are self contained with respect to the source code: your data changed? No big deal, just compile your .Rnw file again and you are done with an updated version of your paper![Ctrl+Shift+I] is easy. Some benefits with respect to having two separate .R and .tex files: it is integrated in a single software (RStudio), you can call variables in your text with the \Sexpr{} command. The slow speed at compilation is no more a real matter as one can put “cache=TRUE” in code chunk options not to reevaluate unchanged chunks, which fastens things.

I share the (brief) slides below. They won’t help much those who already use knitr, but they give the first steps for those who would like to give it a try.

5comments