Statisfaction

Statistics journals network

Posted in General, R, Statistics by Julyan Arbel on 16 April 2015
Statistical journals friendship (click for SVG format)

Xian blogged recently on the forthcoming RSS read paper Statistical Modelling of Citation Exchange Between Statistics Journals, by Cristiano Varin, Manuela Cattelan and David Firth, which follows the last JRSS B read paper, by one of us! The data used in the paper (and which can be downloaded here) are quite fascinating for us academics, fascinated as we are by academic rankings, for better or for worse (irony intended). They consist of cross-citation counts C = (C_{ij}) for 47 statistics journals (see the list and abbreviations on page 5): C_{ij} is the number of citations from articles published in journal j in 2010 to papers published in journal i in the 2001-2010 decade. The choice of the list of journals is discussed in the paper. Major missing journals include Bayesian Analysis (published since 2006) and The Annals of Applied Statistics (published since 2007).

I looked at the ratio of total citations received to total citations made. This is a very simple descriptive statistic which happens to look rather similar to Figure 4 of the paper, which plots export scores from the Stigler model (I can't say more about it, as I haven't read that part in detail). The top five is the same, modulo a swap between the Annals of Statistics and Biometrika. Of course, a big difference is that the Cited/Citing ratio isn't endowed with a measure of uncertainty (below, the left panel is my plot, the right one is Fig. 4 in the paper).

[Figure: Cited/Citing ratio (left) and export scores from Fig. 4 of the paper (right)]
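
This ratio takes one line of R. A minimal sketch, assuming the 47 x 47 citation matrix has been loaded as C, with C[i, j] the number of citations from journal j (citing) to journal i (cited), as defined above:

received <- rowSums(C)   # total citations received by each journal
made <- colSums(C)       # total citations made by each journal
sort(received / made, decreasing = TRUE)[1:5]   # top five journals by the Cited/Citing ratio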

I was surprised not to see a graph/network representation of the data in the paper. As it happens, I wanted to try the Gephi software for drawing graphs, used for instance by François Caron and Emily Fox in their sparse graphs paper. I got the graph above, where:

  • for the data, I used the citation matrix C renormalized by the total number of citations made, which I denote by \tilde C (see the sketch after this list). This is a way to account for the size (number of papers published) of each journal. It is only a proxy, though, since the actual number of papers published by each journal is not available in the data. Without that correction, CSDA is way ahead of all the others.
  • the node size represents the Cited/Citing ratio
  • the edge width represents the renormalized \tilde C_{ij}. I'm unsure of what Gephi does here, since it converts my directed graph into an undirected one; I suppose it displays only the larger of the two edges \tilde C_{ij} and \tilde C_{ji}.
  • for better visibility I kept only the heaviest decile of edges.
  • the clusters identified by four colors are modularity classes obtained by the Louvain method.
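
For reference, here is a rough sketch of the preprocessing, continuing the assumptions above (a citation matrix C with the journal abbreviations as row and column names). The edge-list CSV is a format Gephi can import; the file name and the details are my own choices, not taken from the original post:

C_tilde <- sweep(C, 2, colSums(C), FUN = "/")   # renormalize by total citations made

# keep only the heaviest decile of edges and export an edge list for Gephi
keep <- which(C_tilde >= quantile(C_tilde, 0.9), arr.ind = TRUE)
edge_list <- data.frame(Source = colnames(C_tilde)[keep[, "col"]],   # citing journal
                        Target = rownames(C_tilde)[keep[, "row"]],   # cited journal
                        Weight = C_tilde[keep])
write.csv(edge_list, "citation_edges.csv", row.names = FALSE)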

Some remarks

The two software journals included in the dataset are clear outliers:

  • the Journal of Statistical Software (JSS) is disconnected from the others, meaning that none of its normalized citations \tilde C_{ij} fall in the heaviest decile, except for its self-citations, which are quite large and give it the 4th highest Impact Factor of the whole list in 2010 (and apparently the highest in 2015).
  • the largest \tilde C_{ij} of all corresponds to the self-citations of the Stata Journal (StataJ).

Centrality:

  • CSDA is the most central journal, in the sense that it has the highest (unweighted) degree.

Some further thoughts

All that is just for the fun of it. As mentioned by the authors, citation counts are heavy-tailed, meaning that a few papers account for much of a journal's citations while most papers account for very few. As a matter of fact, the total of citations received is mostly driven by a few super-cited papers, and so is the normalized matrix \tilde C that I use throughout for building the graph. One reason one could put forward for why JRSS B does so well is its read papers: for instance, Spiegelhalter et al. (2002), the DIC paper, alone received 11.9% of all JRSS B citations in 2010. Who'd bet on the number of citations this new read paper (a JRSS A one, though) will receive?

momentify R package at BAYSM14

Posted in General, R, Seminar/Conference, Statistics by Julyan Arbel on 20 September 2014

I presented an arXived paper from my postdoc at the very successful Young Bayesian Conference in Vienna. The big picture of the talk is simple: there are situations in Bayesian nonparametrics where you don't know how to sample from the posterior distribution and can only compute posterior expectations (so-called marginal methods), so you cannot, for example, provide credible intervals. But sometimes all the moments of the posterior distribution are available as posterior expectations. So, morally, you should be able to say more about the posterior distribution than just reporting the posterior mean. To be more specific, we consider a hazard (h) mixture model

\displaystyle h(t)=\int k(t;y)\mu(dy)

where k is a kernel, and the mixing distribution \mu is random and discrete (Bayesian nonparametric approach).

We consider the survival function S which is recovered from the hazard rate h by the transform

\displaystyle S(t)=\exp\Big(-\int_0^t h(s)ds\Big)

and some possibly censored survival data with survival function S. It then turns out that all the posterior moments of the survival curve S(t), evaluated at any time t, can be computed.
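
As a side note, the hazard-to-survival transform is easy to evaluate numerically. Below is a toy sketch with a hypothetical discrete mixing measure (atoms y, weights w) and an exponential kernel, only meant to illustrate the two formulas above; it is not the model of the paper:

y_atoms <- c(0.5, 2); w <- c(1, 0.3)                             # atoms and weights of mu
k <- function(t, y) exp(-t / y) / y                              # a simple kernel choice
h <- function(t) sapply(t, function(s) sum(w * k(s, y_atoms)))   # h(t) = sum_j w_j k(t; y_j)
S <- function(t) exp(-integrate(h, lower = 0, upper = t)$value)  # S(t) = exp(-int_0^t h(s) ds)
sapply(c(0.5, 1, 2), S)                                          # survival at a few time points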

The nice trick of the paper is to use the representation of a distribution in a [Jacobi polynomial] basis where the coefficients are linear combinations of the moments. So one can sample from [an approximation of] the posterior, and with a posterior sample we can do everything! Including credible intervals.

I've wrapped up the few lines of code in an R package called momentify (not on CRAN). With a sequence of moments of a random variable supported on [0,1] as input, the package does two things (a rough sketch of the underlying idea follows the list below):

  • evaluates the approximate density
  • samples from it
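
Here is a self-contained sketch of the underlying idea (not the momentify package itself): approximate a density on [0,1] from its first few moments, using shifted Legendre polynomials, a special case of the Jacobi basis; the Beta(2, 5) toy example is my own choice, for illustration only.

legendre_coefs <- function(n) {
  # coefficients, in increasing powers of x, of the shifted Legendre polynomial P_n on [0, 1]
  k <- 0:n
  (-1)^n * choose(n, k) * choose(n + k, k) * (-1)^k
}

approx_density_from_moments <- function(moments, x) {
  # moments[j] = E[X^j] for j = 1, ..., K; the zeroth moment E[X^0] = 1 is prepended
  mom <- c(1, moments)
  K <- length(moments)
  dens <- numeric(length(x))
  for (n in 0:K) {
    a <- legendre_coefs(n)                          # polynomial coefficients of P_n
    c_n <- (2 * n + 1) * sum(a * mom[1:(n + 1)])    # c_n = (2n + 1) E[P_n(X)]
    P_n_x <- sapply(x, function(t) sum(a * t^(0:n)))
    dens <- dens + c_n * P_n_x
  }
  pmax(dens, 0)   # crude truncation of small negative values of the expansion
}

# toy example: recover a Beta(2, 5) density from its first 7 moments
beta_moments <- sapply(1:7, function(j) prod((2 + 0:(j - 1)) / (7 + 0:(j - 1))))
xs <- seq(0, 1, length.out = 200)
plot(xs, approx_density_from_moments(beta_moments, xs), type = "l", xlab = "x", ylab = "density")
curve(dbeta(x, 2, 5), add = TRUE, lty = 2)   # exact density, for comparison

Since the Beta(2, 5) density is itself a degree-5 polynomial, five or more moments recover it essentially exactly; for general densities the approximation typically improves as more moments are included.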

A package example for a mixture of betas, using 2 to 7 moments, gives the following result:

[Figure: approximate densities of the beta mixture, reconstructed from 2 to 7 moments]

Moustache target distribution and Wes Anderson

Posted in Art, Geek, R by Pierre Jacob on 31 March 2014

Today I am going to introduce the moustache target distribution (moustarget distribution for brevity). Load some packages first.

library(wesanderson) # on CRAN
library(RShapeTarget) # available on https://github.com/pierrejacob/RShapeTarget/
library(PAWL) # on CRAN

Let’s invoke the moustarget distribution.

shape <- create_target_from_shape(
  file_name = system.file(package = "RShapeTarget", "extdata/moustache.svg"),
  lambda = 5)
rinit <- function(size) matrix(rnorm(2 * size), ncol = 2)
moustarget <- target(name = "moustache", dimension = 2,
                     rinit = rinit, logdensity = shape$logd,
                     parameters = shape$algo_parameters)

This defines a target distribution represented by an SVG file, using RShapeTarget. The target probability density function is defined on \mathbb{R}^2 and is proportional to 1 on the segments described in the SVG file, and decreases exponentially fast to 0 away from those segments. The density function of the moustarget is plotted below, a picture being worth a thousand words.


Rasmus Bååth’s Bayesian first aid

Posted in Project, R, Statistics by Pierre Jacob on 23 January 2014

Besides having coded a pretty cool MCMC app in JavaScript, this guy Rasmus Bååth has started the Bayesian First Aid project. The idea is that if there's an R function called blabla.test performing test "blabla", there should be a function bayes.blabla.test performing a similar test in a Bayesian framework and showing the output in a similar way, so that the user can easily compare both approaches. This post explains it all. JAGS and BEST seem to be the two main workhorses under the hood.

Kudos to Rasmus for this very practical approach, potentially very impactful. Maybe someday people will have to specify if they want a frequentist approach and not the other way around! (I had a dream, etc).

From SVG to probability distributions [with R package]

Posted in R, Statistics by Pierre Jacob on 25 August 2013

Hey,

To illustrate general, complex probability density functions on continuous spaces, researchers always use the same examples, for instance mixtures of Gaussian distributions or a banana-shaped distribution defined on \mathbb{R}^2 with density function:

f(x,y) = \exp\left(-\frac{x^2}{200} - \frac{1}{2}(y+Bx^2-100B)^2\right)

If we draw a sample from this distribution using MCMC we obtain a [scatter]plot like this one:

Fig. 1: a sample from the very lame banana shaped distribution
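
(For completeness, such a sample can be drawn with a few lines of random-walk Metropolis; the value B = 0.03, the proposal scales and the chain length below are arbitrary choices of mine, not taken from the original post.)

log_banana <- function(xy, B = 0.03) {
  -xy[1]^2 / 200 - 0.5 * (xy[2] + B * xy[1]^2 - 100 * B)^2
}
n_iter <- 50000
chain <- matrix(NA, nrow = n_iter, ncol = 2)
current <- c(0, 0)
for (i in 1:n_iter) {
  proposal <- current + rnorm(2, sd = c(5, 1))   # symmetric random-walk proposal
  if (log(runif(1)) < log_banana(proposal) - log_banana(current)) {
    current <- proposal                          # accept
  }
  chain[i, ] <- current
}
plot(chain, pch = 20, col = "gold", xlab = "x", ylab = "y")   # yellow dots, as in Fig. 1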

Clearly it doesn’t really look like a banana, even if you use yellow to colour the dots like here. Actually it looks more like a boomerang, if anything. I was worried about this for a while, until I came up with a more realistic banana shaped distribution:

Fig. 2: a sample from the realistic banana shaped distribution

See how well defined the shape is compared to the first figure? And there's even the little tail, which proves so convenient when we want to peel the fruit. More generally, we might want to create target density functions based on arbitrary shapes. For this you can now try RShapeTarget, which you can install directly from R using devtools:

library(devtools)
install_github(repo="RShapeTarget", username="pierrejacob")

The package parses SVG files representing shapes and creates target densities from them. More precisely, an SVG file contains "paths", which are sequences of points (for instance the above banana is a single closed path). The associated log-density at any point x is defined by -1/(2\lambda) \times d(x, P), where P is the path of the shape closest to x and d(x, P) is the distance between the point and that path. The parameter \lambda specifies the rate at which the density decays as the point moves away from the shape. With this you can define the maple leaf distribution, as a tribute to JSM 2013:

Fig. 3: a sample from the "O Canada" probability distribution.
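
Before the package calls, here is a toy computation of that log-density for a single straight path given by two endpoints a and b (a sketch of mine, not code from RShapeTarget):

dist_to_segment <- function(x, a, b) {
  # distance from point x to the segment [a, b]: project x onto the segment, then measure
  t <- sum((x - a) * (b - a)) / sum((b - a)^2)
  t <- min(max(t, 0), 1)
  sqrt(sum((x - (a + t * (b - a)))^2))
}
logd_segment <- function(x, a, b, lambda = 1) {
  -dist_to_segment(x, a, b) / (2 * lambda)   # the -1/(2*lambda) * d(x, P) rule
}
logd_segment(c(1, 2), a = c(0, 0), b = c(3, 0))   # returns -1: the point is at distance 2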

In the package you can get a distribution from an SVG file using the following code:

library(RShapeTarget)
# create a target from an SVG file
my_shape_target <- create_target_from_shape(my_svg_file_name, lambda = 1)
# test the log density function on 25 randomly generated points
my_shape_target$logd(matrix(rnorm(50), ncol = 2), my_shape_target$algo_parameters)

Since characters are just a bunch of paths, you can also define distributions based on words, for instance:

Fig. 5: Hodor.

Hodor: Hodor.

which is done as follows (warning: you're only allowed a-z and A-Z; no numbers, no spaces, no punctuation for now):

library(RShapeTarget)
word_target <- create_target_from_word("Hodor")

For the words, I defined the target density function as before, except that it’s constant on the letters: so if a point is outside a letter its density is computed based on the distance to the nearest path; if it’s inside a letter it’s just constant, so that the letters are “filled” with some constant density. I thought it’d look better.

Now I'm not worried about the banana-shaped distribution any more, but rather about the fact that the only word I could think of was "Hodor" (with whom you can chat over there).

Using R in LaTeX with knitr and RStudio

Posted in Geek, LaTeX, R by Julyan Arbel on 28 February 2013

Hi,

I presented today at the INSEE R user group (FL\tauR) how to use knitr (the evolution of Sweave) to write \LaTeX documents that are self-contained with respect to the source code: your data changed? No big deal, just compile your .Rnw file again ([Ctrl+Shift+I] makes it easy) and you are done with an updated version of your paper. Some benefits over keeping two separate .R and .tex files: everything is integrated in a single piece of software (RStudio), and you can call variables in your text with the \Sexpr{} command. Slow compilation is no longer a real issue, since one can set "cache=TRUE" in the chunk options so that unchanged chunks are not re-evaluated, which speeds things up.
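
For those who have never seen a .Rnw file, here is a minimal sketch (the chunk name and the toy computation are made up); cache=TRUE and \Sexpr{} are the two features mentioned above:

\documentclass{article}
\begin{document}
<<simulation, cache=TRUE>>=
# re-run only when the code of this chunk changes, thanks to cache=TRUE
x <- rnorm(1000)
m <- mean(x)
@
The sample mean is \Sexpr{round(m, 3)}.
\end{document}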

I share the (brief) slides below. They won’t help much those who already use knitr, but they give the first steps for those who would like to give it a try.

Next R meeting in Paris INSEE: ggplot2 and parallel computing

Posted in R by Julyan Arbel on 12 June 2012

Hi,
our group of R users from INSEE, aka FL\tauR, meets monthly in Paris. The next meeting is on Wednesday 13 June (tomorrow), 1-2 pm, room 539 (an ID is needed to come in, see the map to access INSEE \tauR), about ggplot2 and parallel computing. Since the first meeting in February, presentations have covered hot topics like web scraping, C in R, RStudio, SQLite databases and cartography (most of them in French). See you there!

Daily casualties in Syria

Posted in Dataset, R by Julyan Arbel on 9 February 2012

Every new day brings its statistics of new deaths in Syria… Here is an attempt to learn about the Syrian uprising from the figures. Data vary among sources: the Syrian opposition provides the number of casualties by day (here on Dropbox), updated on 8 February 2012, with a total exceeding 8,000.

We note first that the attacks are accelerating, as the cumulative graph is mostly convex (click to enlarge):

Plotting the numbers by day shows the bloody situation on Fridays, a gathering day in the Muslim calendar. This was especially true at the beginning of the uprising, but lately any other day can be equally deadly:

On average there are almost twice as many deaths on Fridays as on any other day. Here are boxplots of the logarithm of daily casualties by day of the week:

and their density estimates, first coloured by day of the week, then by Friday vs rest of the week:

Here is the code (with some clumsy parts for massaging the data frames for ggplot; do not hesitate to comment on it):

library(ggplot2)
input=read.csv("http://dl.dropbox.com/u/1391912/Blog%20statisfaction/data/syria.txt",
               sep="\t",header=TRUE,stringsAsFactors=FALSE)
input$LogicalFriday=factor(input$WeekDay =="Friday",levels = c(FALSE, TRUE),
                           labels = c("Not Friday", "Friday"))
input$Date=as.Date(input$History,"%d/%m/%Y")
input$WeekDays=factor(input$WeekDay,
                      levels=unique(as.character(input$WeekDay[7:13]))) # trick to sort the legend
qplot(x=Date, y=cumsum(Number), data=input, geom="line", color=I("red"), xlab="", ylab="", lwd=I(1))  # cumulative death toll
qplot(x=as.factor(Date), y=Number, data=input, geom="bar", fill=LogicalFriday, xlab="", ylab="")      # daily counts, Fridays highlighted
qplot(log(Number+1), data=input, geom="density", fill=LogicalFriday, xlab="", ylab="", alpha=I(.2))   # density, Friday vs the rest
qplot(log(Number+1), data=input, geom="density", fill=WeekDay, xlab="", ylab="", alpha=I(.2))         # density by day of the week
qplot(WeekDays, log(Number+1), data=input, geom="boxplot", xlab="", ylab="", colour=WeekDays)         # boxplots by day of the week



Coming R meetings in Paris

Posted in R, Seminar/Conference by Julyan Arbel on 4 February 2012

If you live in Paris and are interested in R, there will be two meetings for you this week.

First a Semin-R session, organized at the Muséum National d’Histoire Naturelle on Tuesday 7 Feb (too bad, the Museum is closed on Tuesdays). Presentations will be about colors, phylogenies and maps, while I will speak about (my beloved) RStudio. The slides of previous sessions can be found here (most of them are in French).

The following day, 8 Feb, a group of R users from INSEE will have its first meeting (13-14h, INSEE, room R12), about SQLite data in R, maps, and \LaTeX in R.

I guess anyone can join!

UPDATE: Here is a colorful map to access INSEE \tau R. Come with an ID, and say you are visiting the meeting organizer Matthieu Cornec. Room R12 is on the ground floor (left).

Psycho dice and Monte Carlo

Posted in R, Statistics by Julyan Arbel on 16 December 2011

Following Pierre's post on psycho dice, I want to see here by which average margin repeated plays would have to deviate before being called influenced by the mind. The rules are the following (excerpt from the novel Midnight in the Garden of Good and Evil, by John Berendt):

You take four dice and call out four numbers between one and six–for example, a four, a three, and two sixes. Then you throw the dice, and if any of your numbers come up, you leave those dice standing on the board. You continue to roll the remaining dice until all the dice are sitting on the board, showing your set of numbers. You’re eliminated if you roll three times in succession without getting any of the numbers you need. The object is to get all four numbers in the fewest rolls.

Simplify the game by forgetting the elimination step. Suppose first that one plays with a single fair die with 1/p faces. The probability that it shows the right face is p (for somebody with no psychic power). Denote by X the time to first success with one die; by independence of the rolls, X follows a geometric distribution Geom(p) (with the convention that the support starts at 1). X has the following probability mass and cumulative distribution functions, with q = 1 - p:

f_X(k)=pq^{k-1},\quad F_X(k)=1-q^k.

Now denote by Y the time to success in the game with n dice. This simultaneous case is the same as playing n times independently with one die and taking Y to be the maximum of the individual times to success. So Y's cdf is

F_Y(k)=F_X(k)^n=(1-q^k)^n.

Its pmf can be obtained either exactly by difference, or up to a normalizing constant C by differentiation:

f_Y(k)=Cq^k(1-q^k)^{n-1}.

As it is not too far from the Geom(p) pmf, one can use the latter as the proposal in a (self-normalized importance sampling) Monte Carlo estimate. If the X_i's are N independent Geom(p) variables, then

E(Y) \approx \frac{\sum_i X_i(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}} and E(Y^2) \approx \frac{\sum_i X_i^2(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}}.

The following R lines produce the estimates \mu_Y=E(Y) = 11.4 and \sigma_Y=sd(Y) = 6.5.

p = 1/6
q = 1 - p
n = 4                                         # number of dice
rgeom1 = function(n, p) { rgeom(n, p) + 1 }   # Geom(p) starting at 1
h = function(x) { (1 - q^x)^(n - 1) }         # importance weight, up to a constant

N = 10^6
X = rgeom1(N, p)                 # sample from the Geom(p) proposal
(C = 1/mean(h(X)))               # estimate of the normalizing constant
(m1_Y = C*mean(X*h(X)))          # estimate of E(Y)
(m2_Y = C*mean(X^2*h(X)))        # estimate of E(Y^2)
(sd_Y = sqrt(m2_Y - m1_Y^2))     # estimate of sd(Y)


Now it is possible to use a test (from classical test theory) to estimate the average margin by which repeated games should deviate in order to provide statistical evidence of psychic power. We are interested in testing H_0\,:\,E(Y)=\mu_Y against H_1\,:\,E(Y)<\mu_Y, for repeated plays.

If the game is played k times, then one rejects H_0 if the sample mean \bar{Y} is less than \mu_Y - \frac{\sigma_Y}{\sqrt{k}}q_{.95}, where q_{.95} is the 95% standard normal quantile. To indicate the presence of psychic power, someone playing k = 20 times should therefore score about 2 rolls below the predicted value \mu_Y = 11.4 (about 1 roll below when playing k = 80 times). I can't wait, I'm going to grab some dice!
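
As a quick sanity check of those margins, using the Monte Carlo estimates obtained above:

mu_Y <- 11.4; sigma_Y <- 6.5            # Monte Carlo estimates from the code above
k <- c(20, 80)                          # number of games played
mu_Y - sigma_Y / sqrt(k) * qnorm(0.95)  # rejection thresholds: about 9.0 and 10.2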
