## Leave the Pima indians alone: the R package

Hi there,

while everyone was away in July, James Ridgway and I posted our “leave (the) pima paper alone” paper on arxiv, in which we discuss to which extent probit/logit regression and not too big datasets (such as the now famous Pima Indians dataset) constitute a relevant benchmark for Bayesian computation.

The actual title of the paper is “Leave Pima Indians alone…”, but xian changed it to “Leave *the* Pima Indians alone…” when discussing it on his blog. Any opinion on whether it does sound better with “the”?

On a different note, one of our findings is that Expectation-Propagation works wonderfully for such models; yes it is an approximate method, but it is very fast, and the approximation error is consistently negligible on all the datasets we looked at.

James has just posted on CRAN the EPGLM package, which computes an EP approximation of the posterior of a logit or probit model. The documentation is a bit terse at the moment, but it is very straightforward to use.

Comments on the package, the paper, its grammar or Pima Indians are most welcome!

## Who has the biggest in Bayesian Nonparametrics?

This very fine title quotes a pretty hilarious banquet speech by David Dunson at the last BNP conference held in Raleigh last June. The graph is by François Caron who used it in his talk there. See below for his explanation.

After the summer break, back to work. The academic year to come looks promising from a BNP point of view. Not least that three special issues have been announced, in Statistics & Computing (guest editors: Tamara Broderick (MIT), Katherine Heller (Duke), Peter Mueller (UT Austin)), the Electronic Journal of Statistics (guest editor: Subhashis Ghoshal (NCSU)), and in the International Journal of Approximate Reasoning (proposal deadline December 1st, guest editors: Alessio Benavoli (Lugano), Antonio Lijoi (Pavia) and Antonietta Mira (Lugano)).

BNP is also going to infiltrate MCMSki V, Lenzerheide, Switzerland, January 4-7 2016, with three sessions with a BNP flavor, in addition to plenary speakers David Dunson and Michael Jordan. The International Society for Bayesian Analysis World Meeting, 13 -17 June, 2016, should also host plenty of BNP sessions. And a De Finetti Lecture by Persi Diaconis (Stanford University). (more…)

## slides of SMC2015

## Turing revisited in Turin, and Oxford

With colleagues Stefano Favaro and Bernardo Nipoti from Turin and Yee Whye Teh from Oxford, we have just arXived an article on discovery probabilities. If you are looking for some info on a space shuttle, a cycling team or a TV channel, it’s the wrong place. Instead, discovery probabilities are central to ecology, biology and genomics where data can be seen as a population of individuals belonging to an (ideally) infinite number of species. Given a sample of size , the -discovery probability is the probability that the next individual observed matches a species with frequency in the -sample. For instance, the probability of observing a new species is key for devising sampling experiments.

By the way, why Alan Turing? Because with his fellow researcher at Bletchley Park Irving John Good, starred in The Imitation Game too, Turing is also known for the so-called *Good-Turing estimator* of the discovery probability

which involves , the number of species with frequency in the sample (ie frequencies frequency, if you follow me). As it happens, this estimator defined in Good 1953 Biometrika paper became wildly popular among ecology-biology-genomics communities since then, at least in the small circles where wild popularity and probability aren’t mutually exclusive.

Simple explicit estimators of discovery probabilities in the Bayesian nonparametric (BNP) framework of Gibbs-type priors were given by Lijoi, Mena and Prünster in a 2007 Biometrika paper. The main difference between the two estimators of is that Good-Turing involves and only, while the BNP involves , (instead of ), and , the total number of observed species. It has been shown in the literature that the BNP estimators are more reliable than Good-Turing estimators.

How do we contribute? (i) we describe the posterior distribution of the discovery probabilities in the BNP model, which is pretty useful for deriving exact credible intervals of the estimates, and (ii) we investigate large asymptotic behavior of the estimators.

## Who is Julia ?

Hi there !

Unfortunately this post is indeed about statistics…

If you are randomly walking around the statistics blogs, you probably have certainly heard of this new language called Julia. It is said by the developers to be as easy to write as R and as fast as C (!) which is quite a catchy way of selling their work. After talking with a Julia enthusiastic user in Amsterdam, I decided to give it a try. And here I am sharing my first impressions.

Fist thing first, the installation is as easy as any other language, plus there is a neat Package management that allows you to get started quite easily. In this respect it is very similar to R.

On the minus side I became a big fan of RStudio Julian (… oupsy Julyan) told you about a long time ago. These kind of programs really make your life easier. I thus tried Juno which turned out to be cumbersome and terribly slow. I would have loved to have an IDE for Julia that would be up to the RStudio standard. Nevermind.

No lets talk a little about what is really interesting : “Is their catch phrase false advertising or not?!”.

There is a bunch of relatively good tutorials online which are really helpful to learn the basic vocabulary, but indeed if like me you are use to code in R and/or Python, you should get it pretty fast and can almost copy-paste your favourite code into Julia and with a few adjustments, it will work. So as easy to write as R : quite so.

I then tried to compare computational times for some of my latest codes and there came the good surprise ! A code that would take a handful of minutes to run in R mainly due to unavoidable loops took a couple of seconds to run in Julia, without any other sorts of optimization. The handling of big objects is smooth and I did not ran into memory problems that R was suffering from.

So far so good ! But of course there has to be some drawbacks. The first one is the poor package repository compare to CRAN or even what you can get for Python. This might of course improve in the next few years as the language is still quite new. However, it is bothering to have to re-code something when you are used to simply load a package in R. Another, probably less important problem, is the lack of data visualization methods and especially the absence of ggplot2 that we have grown quite found of around here. There is of course Gadfly, which is quite close but once again, it is up to now very limited compared to what I was used to…

All in all, I am happy to have tried Julia, and I am quite sure that I will be using it quite a lot from now on. However, even if from a efficiency point of view, it is great, and it is way easier to learn than C (which I should have done a while ago), R and its tremendous package repository is far from beaten.

Oh and by the way, it uses PyPlot based on MatplotLib that allow you to make some xkcd-like plots, which can make your presentations a lot more fun.

## Reading Bayesian classics — presentations

The students did a great job in presenting some Bayesian classics. I enjoyed reading the papers (pdfs can be found here), most of which I hadn’t read before, and enjoyed also the students’ talks. I share here some of the best ones, as well as some demonstrative excerpts from the papers. In chronological order (presentations on slideshare below):

- W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications.
*Biometrika*, 57(1):97–109, 1970.

In this paper, we shall consider Markov chain methods of sampling that are generalizations of a method proposed by Metropolis et al. (1953), which has been used extensively for numerical problems in statistical mechanics.

- Dennis V. Lindley and Adrian F.M. Smith. Bayes estimates for the linear model.
*Journal of the Royal Statistical Society: Series B (Statistical Methodology),*with discussion, 1–41, 1972.

From Prof. B. de Finetti discussion (note the *valliant* collaborator Smith!):

I think that the main point to stress about this interesting and important paper is its significance for the philosophical questions underlying the acceptance of the Bayesian standpoint as the true foundation for inductive reasoning, and in particular for statistical inference. So far as I can remember, the present paper is the first to emphasize the role of the Bayesian standpoint as a logical framework for the analysis of intricate statistical situation. […] I would like to express my warmest congratulations to my friend Lindley and his valiant collaborator Smith.

## Statistics journals network

Xian blogged recently on the incoming RSS read paper: Statistical Modelling of Citation Exchange Between Statistics Journals, by Cristiano Varin, Manuela Cattelan and David Firth. Following the last JRSS B read paper by one of us! The data that are used in the paper (and can be downloaded here) are quite *fascinating* for us, *academics fascinated by academic rankings, for better or for worse *(ironic here). They consist in cross citations counts for 47 statistics journals (see list and abbreviations page 5): is the number of citations from articles published in journal in 2010 to papers published in journal in the 2001-2010 decade. The choice of the list of journals is discussed in the paper. Major journals missing include *Bayesian Analysis* (published from 2006), *The Annals of Applied Statistics* (published from 2007).

I looked at the ratio of Total Citations Received by Total Citations made. This is a super simple descriptive statistic which happen to look rather similar to Figure 4 which plots Export Scores from Stigler model (can’t say more about it, I haven’t read in detail). The top five is the same modulo the swap between *Annals of Statistics* and *Biometrika*. Of course a big difference is that the Cited/Citation ratio isn’t endowed with a measure of uncertainty (below, left is my making, right is Fig. 4 in the paper).

I was surprised not to see a graph / network representation of the data in the paper. As it happens I wanted to try the gephi software for drawing graphs, used for instance by François Caron and Emily Fox in their sparse graphs paper. I got the above graph, where:

- for the data, I used the citations matrix renormalized by the total number of citations made, which I denote by . This is a way to account for the size (number of papers published) of the journal. This is just a proxy though since the actual number of papers published by the journal is not available in the data. Without that correction,
*CSDA*is way ahead of all the others. - the node size represents the Cited/Citing ratio
- the edge width represents the renormalized . I’m unsure of what gephi does here, since it converts my directed graph into an undirected graph. I suppose that it displays only the largest of the two edges and .
- for a better visibility I kept only the first decile of heaviest edges.
- the clusters identified by four colors are modularity classes obtained by the Louvain method.

**Some remarks**

The two software journals included in the dataset are quite outliers:

- the
*Journal of Statistical Software (JSS)*is disconnected from the others, meaning it has no normalized citations in the first decile. Except from its self citations which are quite big and make it the 4th Impact Factor from the total list in 2010 (and apparently the first in 2015). - the largest is the self citations of the
*STATA Journal (StataJ).*

Centrality:

*CSDA*is the most central journal in the sense of the highest (unweighted) degree.

**Some further thoughts**

All that is just for the fun of it. As mentioned by the authors, citation counts are heavy-tailed, meaning that just a few papers account for much of the citations of a journal while most of the papers account for few citations. As a matter of fact, the total of citations received is mostly driven by a few super-cited papers, and also is the Cited/Citations matrix that I use throughout for building the graph. A reason one could put forward about why JRSS B makes it so well is the read papers: for instance, Spiegelhalter et al. (2002), DIC, received alone 11.9% of all JRSS B citations in 2010. Who’d bet the number of citation this new read paper (JRSS A though) will receive?

## [Meta-]Blogging as young researchers

*Hello all,*

*This is an article intended for the ISBA bulletin, jointly written by us all at Statisfaction, Rasmus Bååth from Publishable Stuff, Boris Hejblum from Research side effects, Thiago G. Martins from tgmstat@wordpress, Ewan Cameron from Another Astrostatistics Blog and Gregory Gandenberger from gandenberger.org. *

Inspired by established blogs, such as the popular Statistical Modeling, Causal Inference, and Social Science or Xi’an’s Og, each of us began blogging as a way to diarize our learning adventures, to share bits of R code or LaTeX tips, and to advertise our own papers and projects. Along the way we’ve come to a new appreciation of the world of academic blogging: a never-ending international seminar, attended by renowned scientists and anonymous users alike. Here we share our experiences by weighing the pros and cons of blogging from the point of view of young researchers.

## Unfortunate typos in read paper

Mathieu and I have just realised that the version of our SQMC paper made available on the RSS web site contains several unfortunate typos. In particular, the symbol for “small o” has been replaced by a “big O” by editors. For instance, Theorem 9 should state the QMC beats standard SMC; i.e. the MSE (mean square error) of an SQMC estimator is

but in the RSS version, it reads

.

Well, that’s a bummer. For now, I recommend anyone to read instead the arxiv version (updated on Monday).

## SQMC read paper

Almost 10 months since my latest post? I guess bloggin’ ain’t my thing… In my defense, Mathieu Gerber and I were quite busy revising our SQMC paper. I am happy to announce that it has just been accepted as a read paper in JRSSB. If all goes as planned, we should present the paper at the RSS ordinary meeting on Dec 10. Everybody is welcome to attend, and submit an oral or written discussion (or both). More details soon, when the event is officially announced on the RSS web-site.

What is SQMC? It is a QMC (Quasi-Monte Carlo) version of particle filtering. For the same CPU cost, it typically generates much more accurate estimators. Interested? consider reading the paper here (more recent version coming soon), checking this video where I present SQMC, or, even better, attending our talk in London!

leave a comment