Xian blogged recently on the incoming RSS read paper: Statistical Modelling of Citation Exchange Between Statistics Journals, by Cristiano Varin, Manuela Cattelan and David Firth. Following the last JRSS B read paper by one of us! The data that are used in the paper (and can be downloaded here) are quite fascinating for us, academics fascinated by academic rankings, for better or for worse (ironic here). They consist in cross citations counts for 47 statistics journals (see list and abbreviations page 5): is the number of citations from articles published in journal in 2010 to papers published in journal in the 2001-2010 decade. The choice of the list of journals is discussed in the paper. Major journals missing include Bayesian Analysis (published from 2006), The Annals of Applied Statistics (published from 2007).
I looked at the ratio of Total Citations Received by Total Citations made. This is a super simple descriptive statistic which happen to look rather similar to Figure 4 which plots Export Scores from Stigler model (can’t say more about it, I haven’t read in detail). The top five is the same modulo the swap between Annals of Statistics and Biometrika. Of course a big difference is that the Cited/Citation ratio isn’t endowed with a measure of uncertainty (below, left is my making, right is Fig. 4 in the paper).
I was surprised not to see a graph / network representation of the data in the paper. As it happens I wanted to try the gephi software for drawing graphs, used for instance by François Caron and Emily Fox in their sparse graphs paper. I got the above graph, where:
- for the data, I used the citations matrix renormalized by the total number of citations made, which I denote by . This is a way to account for the size (number of papers published) of the journal. This is just a proxy though since the actual number of papers published by the journal is not available in the data. Without that correction, CSDA is way ahead of all the others.
- the node size represents the Cited/Citing ratio
- the edge width represents the renormalized . I’m unsure of what gephi does here, since it converts my directed graph into an undirected graph. I suppose that it displays only the largest of the two edges and .
- for a better visibility I kept only the first decile of heaviest edges.
- the clusters identified by four colors are modularity classes obtained by the Louvain method.
The two software journals included in the dataset are quite outliers:
- the Journal of Statistical Software (JSS) is disconnected from the others, meaning it has no normalized citations in the first decile. Except from its self citations which are quite big and make it the 4th Impact Factor from the total list in 2010 (and apparently the first in 2015).
- the largest is the self citations of the STATA Journal (StataJ).
- CSDA is the most central journal in the sense of the highest (unweighted) degree.
Some further thoughts
All that is just for the fun of it. As mentioned by the authors, citation counts are heavy-tailed, meaning that just a few papers account for much of the citations of a journal while most of the papers account for few citations. As a matter of fact, the total of citations received is mostly driven by a few super-cited papers, and also is the Cited/Citations matrix that I use throughout for building the graph. A reason one could put forward about why JRSS B makes it so well is the read papers: for instance, Spiegelhalter et al. (2002), DIC, received alone 11.9% of all JRSS B citations in 2010. Who’d bet the number of citation this new read paper (JRSS A though) will receive?
This week I’ll start my Bayesian Statistics master’s course at the Collegio Carlo Alberto. I realized that some of last year students got PhD positions in prestigious US universities. So I thought that letting this year’s students have a first grasp of some great Bayesian papers wouldn’t do harm. The idea is that in addition to the course, the students will pick a paper from a list and present it (or rather part of it) to the others and to me. Which will let them earn some extra points for the final exam mark. It’s in the spirit of Xian’s Reading Classics Seminar (his list here).
I’ve made up the list below, inspired by two textbooks references lists and biased by personal tastes: Xian’s Bayesian Choice and Peter Hoff’s First Course in Bayesian Statistical Methods. See the pdf list and zipped folder for papers. Comments on the list are much welcome!
PS: reference n°1 isn’t a joke!
This is an article intended for the ISBA bulletin, jointly written by us all at Statisfaction, Rasmus Bååth from Publishable Stuff, Boris Hejblum from Research side effects, Thiago G. Martins from tgmstat@wordpress, Ewan Cameron from Another Astrostatistics Blog and Gregory Gandenberger from gandenberger.org.
Inspired by established blogs, such as the popular Statistical Modeling, Causal Inference, and Social Science or Xi’an’s Og, each of us began blogging as a way to diarize our learning adventures, to share bits of R code or LaTeX tips, and to advertise our own papers and projects. Along the way we’ve come to a new appreciation of the world of academic blogging: a never-ending international seminar, attended by renowned scientists and anonymous users alike. Here we share our experiences by weighing the pros and cons of blogging from the point of view of young researchers.
I presented an arxived paper of my postdoc at the big success Young Bayesian Conference in Vienna. The big picture of the talk is simple: there are situations in Bayesian nonparametrics where you don’t know how to sample from the posterior distribution, but you can only compute posterior expectations (so-called marginal methods). So e.g. you cannot provide credible intervals. But sometimes all the moments of the posterior distribution are available as posterior expectations. So morally, you should be able to say more about the posterior distribution than just reporting the posterior mean. To be more specific, we consider a hazard (h) mixture model
where is a kernel, and the mixing distribution is random and discrete (Bayesian nonparametric approach).
We consider the survival function which is recovered from the hazard rate by the transform
and some possibly censored survival data having survival . Then it turns out that all the posterior moments of the survival curve evaluated at any time can be computed.
The nice trick of the paper is to use the representation of a distribution in a [Jacobi polynomial] basis where the coefficients are linear combinations of the moments. So one can sample from [an approximation of] the posterior, and with a posterior sample we can do everything! Including credible intervals.
I’ve wrapped up the few lines of code in an R package called momentify (not on CRAN). With a sequence of moments of a random variable supported on [0,1] as an input, the package does two things:
- evaluates the approximate density
- samples from it
A package example for a mixture of beta and 2 to 7 moments gives that result:
With Alexandre Thiéry we’ve been working on non-negative unbiased estimators for a while now. Since I’ve been talking about it at conferences and since we’ve just arXived the second version of the article, it’s time for a blog post. This post is kind of a follow-up of a previous post from July, where I was commenting on Playing Russian Roulette with Intractable Likelihoods by Mark Girolami, Anne-Marie Lyne, Heiko Strathmann, Daniel Simpson, Yves Atchade.
It’s been a while I haven’t written about parallelization and GPUs. With colleagues Lawrence Murray and Anthony Lee we have just arXived a new version of Parallel resampling in the particle filter. The setting is that, on modern computing architectures such as GPUs, thousands of operations can be performed in parallel (i.e. simultaneously) and therefore the rest of the calculations that cannot be parallelized quickly becomes the bottleneck. In the case of the particle filter (or any sequential Monte Carlo method such as SMC samplers), that bottleneck is the resampling step. The article investigates this issue and numerically compares different resampling schemes.
Kudos to Rasmus for this very practical approach, potentially very impactful. Maybe someday people will have to specify if they want a frequentist approach and not the other way around! (I had a dream, etc).
A recently arxived paper by Pier Bissiri, Chris Holmes and Steve Walker piqued my curiosity about “pseudo-Bayesian” approaches, that is, statistical approaches based on a pseudo-posterior:
where is some pseudo-likelihood. Pier, Chris and Steve use in particular
where is some empirical risk function. A good example is classification; then could be the proportion of properly classified points:
where is some score function parametrised by , and . (Side note: I find the ML convention for the more convenient than the stats convention.)
It turns out that this particular kind of pseudo-posterior has already been encountered before, but with different motivations:
- Chernozhukov and Hong (JoE, 2003) used it to define new Frequentist estimators based on moment estimation ideas (i.e. take above to be some empirical moment constraint). Focus is on establishing Frequentist properties of say the expectation of the pseudo-posterior. (It seems to me that few people have heard about this this paper in Stats).
- the PAC-Bayesian approach which originates from Machine Learning also relies on this kind of pseudo-posterior. To be more precise, PAC-Bayes usually starts by minimising the upper bound of an oracle inequality within a class of randomised estimators. Then, as a result, you obtain as a possible solution, say, a single draw for the pseudo-posterior defined above. A good introduction is this book by Olivier Catoni.
- Finally, Pier, Chris and Steve’s approach is by far the most Bayesian of these three pseudo-Bayesian approaches, in the sense that they try to maintain an interpretation of the pseudo-posterior as a representation on the uncertainty on . Crudely speaking, they don’t look only at the expectation, like the two approaches aboves, but also at the spread of the pseudo-posterior.
Let me mention briefly that quite a few papers have considered using other types of pseudo-likelihood in a pseudo-posterior, such as empirical likelihood, composite likelihood, and so on, but I will shamefully skip them for now.
To which extent this growing interest in “Pseudo-Bayes” should have an impact on Bayesian computation? For one thing, more problems to throw at our favourite algorithms should be good news. In particular, Chernozhukov and Hong mention the possibility to use MCMC as a big advantage for their approach, because typically the function they consider could be difficult to minimise directly by optimisation algorithms. PAC-Bayesians also seem to recommend MCMC, but I could not find so many PAC-Bayesian papers that go beyond the theory and actually implement it; an exception is this.
On the other hand, these pseudo posteriors might be quite nasty. First, given the way they are defined, they should not have the kind of structure that makes it possible to use Gibbs sampling. Second, many interesting choices for seem to be irregular or multimodal. Again, in the classification example, the 0-1 loss function is typically not continuous. Hopefully the coming years will witness some interesting research on which computational approaches are more fit for pseudo-Bayes computation, but readers will not be surprised if I put my Euros on (some form of) SMC!
Hi Statisfied readers,
I am Nicolas Chopin, a Professor of Statistics at the ENSAE, and my colleagues and good friends that manage Statisfaction kindly agreed that I would join their blog. I work mostly on “Bayesian Computation”, i.e. Monte Carlo and non-Monte Carlo methods to compute Bayesian quantities; a strong focus of my research is on Sequential Monte Carlo (aka particle filters).
I don’t plan to blog very regularly, and only on stuff related to my research, at least in some way. Well, that’s the idea for now. Stay tuned!
To illustrate generally complex probability density functions on continuous spaces, researchers always use the same examples, for instance mixtures of Gaussian distributions or a banana shaped distribution defined on with density function:
If we draw a sample from this distribution using MCMC we obtain a [scatter]plot like this one:
Clearly it doesn’t really look like a banana, even if you use yellow to colour the dots like here. Actually it looks more like a boomerang, if anything. I was worried about this for a while, until I came up with a more realistic banana shaped distribution:
See how the shape is well defined compared to the first figure? And there’s even the little tail, that proves so convenient when we want to peel off the fruit. More generally we might want to create target density functions based on general shapes. For this you can now try RShapeTarget, which you can install directly from R using devtools:
library(devtools) install_github(repo="RShapeTarget", username="pierrejacob")
The package parses SVG files representing shapes, and creates target densities from them. More precisely, a SVG files contains “paths”, which are sequence of points (for instance the above banana is a single closed path). The associated log density at any point is defined by where is the closest path of the shape from and is the distance between the point and the path. The parameter specifies the rate at which the density decays when the point goes away from the shape. With this you can define the maple leaf distribution, as a tribute to JSM 2013:
In the package you can get a distribution from a SVG file using the following code:
library(RShapeTarget) # create target from file my_shape_target <- create_target_from_shape(my_svg_file_name, lambda =1) # test the log density function on 25 randomly generated points my_shape_target$logd(matrix(rnorm(50), ncol = 2), my_shape_target$algo_parameters)
Since characters are just a bunch of paths, you can also define distributions based on words, for instance:
which is done as follows (warning you’re only allowed a-z and A-Z, no numbers no space no punctuation for now):
library(RShapeTarget) word_target <- create_target_from_word("Hodor")
For the words, I defined the target density function as before, except that it’s constant on the letters: so if a point is outside a letter its density is computed based on the distance to the nearest path; if it’s inside a letter it’s just constant, so that the letters are “filled” with some constant density. I thought it’d look better.
Now I’m not worried about the banana shaped distribution any more, but by the fact that the only word I could think of was “Hodor” (with whom you can chat over there).