Statisfaction

RIP library.nu

Posted in General by Pierre Jacob on 23 February 2012

Hey,

I needed to read a book, to learn something new, isn’t that exciting, and I discovered that library.nu was dead. Here you’ll find an excellent post about it, see also the comments for an interesting discussion.

It’s stupid and really sad, but bookfi dot org. So I guess it’s OK.

Pierre

Valentine Day and lonely people in France

Posted in Dataset, General by Julyan Arbel on 14 February 2012

Insee published recently a paper (in French), well in line with the Valentine Day, which characterizes people living alone or in couple by socio-professional category, along with the data.

Between 1990 and 2008 (two population surveys), the proportion of people living alone mostly increased for people under 60. After 60, 38% of women live alone, for only 17% of men,  because women are married to older men, and live longer than them, in average. See that proportion by age:

Spatially, there is a kind of South/North opposition. During the working life, lonely people live in the South (left), while lonely retired people live in the North (right), with an exception for Île-de-France (Paris) with a high proportion whatever the age:

Next stop: Sydney

Posted in General, Seminar/Conference by Pierre Jacob on 11 February 2012

Hey hey,

After a very interesting week in the Department of Statistics and Applied Probability at the National University of Singapore, where I’ve started some projects about the mathematics behind Sequential Monte Carlo methods with the expert Ajay Jasra, I’m heading towards Sydney, Australia. I love my life at the moment!

Tip for the travelers among you: ScribTeX is a very convenient website if you travel without a laptop. You can type LaTeX documents online and also compile them, and see the resulting PDF. They seem to have all the necessary packages and I could easily type beamer presentations. The editor is a bit simple but it could still save your life!

Speaking of slides, I will be talking next Thursday at the MCQMC conference. My office mate Christian Schaefer will be there too if you wish to speak to an expert in SMC on binary spaces. As for me, I’ll be talking about the properties of the stochastic schedule in the Wang-Landau algorithm, linked with our recent article with Robin Ryder.

Then on February 20th, I’ll give an informal talk at the Centre for Language Science in Macquarie University. That’s quite random, yes. Many thanks to Mark Johnson who contacted me through the blog! There I’ll speak about recent advances in computational Bayesian statistics. These linguists already use SMC for word segmentation and show interest in the latest advances. I’ll hence talk about PMCMC/SMC^2 and also ABC.

Bye bye.

Daily casualties in Syria

Posted in Dataset, R by Julyan Arbel on 9 February 2012

Every new day brings its statistics of new deaths in Syria… Here is an attempt to learn about the Syrian uprising by the figures. Data vary among sources: the Syrian opposition provides the number of casualties by day (here on Dropbox), updated on 8 February 2012, with a total exceeding 8 000.

We note first that the attacks accelerate, as the cumulated graph is mostly convex (click to enlarge):

Plotting the numbers by day shows the bloody situation of  Fridays, a gathering day in the Muslin calendar. This point was especially true at the beginning of the uprising, but lately any other day can be equally deadly:

There are almost twice as much deaths on Fridays as any other day in average:
Here are boxplots for the logarithm of daily casualties by day of the week:

and their density estimates, first coloured by day of the week, then by Friday vs rest of the week:

Here is the code (with clumsy parts for fitting the data frames for ggplot, do not hesitate to comment on it)

library(ggplot2)
input=read.csv("http://dl.dropbox.com/u/1391912/Blog%20statisfaction/data/syria.txt",
               sep="\t",header=TRUE,stringsAsFactors=FALSE)
input$LogicalFriday=factor(input$WeekDay =="Friday",levels = c(FALSE, TRUE),
                           labels = c("Not Friday", "Friday"))
input$Date=as.Date(input$History,"%d/%m/%Y")
input$WeekDays=factor(input$WeekDay,
                      levels=unique(as.character(input$WeekDay[7:13]))) # trick to sort the legend
qplot(x=Date,y=cumsum(Number), data=input, geom="line",color=I("red"),xlab="",ylab="",lwd=I(1))
qplot(x=as.factor(Date),y=Number, data=input, geom="bar",fill=LogicalFriday,xlab="",ylab="")
qplot(log(Number+1), data=input, geom="density",fill=LogicalFriday,xlab="",ylab="",alpha=I(.2))
qplot(log(Number+1), data=input, geom="density",fill=WeekDay,xlab="",ylab="",alpha=I(.2))
qplot(WeekDays,log(Number+1),data=input,geom="boxplot",xlab="",ylab="",colour=WeekDays)

Created by Pretty R at inside-R.org

Tagged with: ,

Coming R meetings in Paris

Posted in R, Seminar/Conference by Julyan Arbel on 4 February 2012

If you live in Paris and are interested in R, there will be two meetings for you this week.

First a Semin-R session, organized at the Muséum National d’Histoire Naturelle on Tuesday 7 Feb (too bad, the Museum is closed on Tuesdays). Presentations will be about colors, phylogenies and maps, while I will speak about (my beloved) RStudio. The slides of previous sessions can be found here (most of them are in French).

The following day, 8 Feb, a group of R users from INSEE will have its first meeting (13-14h, INSEE, room R12), about SQLite data in R, maps, and \LaTeX in R.

I guess anyone can join!

UPDATE: Here is a colorful map to access INSEE \tau R. Come with an ID, and say you are visiting the meeting organizer Matthieu Cornec. Room R12 is on the ground floor (left).

Heading to the National University of Singapore

Posted in Seminar/Conference by Pierre Jacob on 3 February 2012

Hey there,

Next Wednesday (Feb 8th) I’ll give a talk on SMC² at the National University of Singapore (3pm, S16-06-118, DSAP Seminar Room if you’re interested).

By the way, the SMC² paper has been hugely revised, along with the python package and the supplement. Among the main changes, we made a thorough comparison with Liu & West SMC method (which is based on extending the hidden state to include the parameter), and we added a fair bit of justification of the computational cost of the method, which we believe is very reasonable considering the difficulty of the problem (ie estimation in HMM) and the cost of concurrent algorithms. We also considered the validity of SMC² on a sequence of annealed posterior densities, following a reviewer’s interesting comment. It’s good to be finished with it (for now at least), and to move to new projects!

I’ll spend next week in Singapore working with Ajay Jasra, and then I’ll head off to the Monte Carlo and Quasi Monte Carlo conference in Sydney, where I’ll stay for ~2 weeks. I’ll be worth another blog post of course.

Cheers!

GPUs in Computational Statistics

Posted in Geek, General, Seminar/Conference by Pierre Jacob on 17 January 2012

Hey,

Next week the Centre for Research in Statistical Methodology (CRiSM, in Warwick, UK) will be hosting a workshop on the use of graphics processing units in statistics, a quickly expanding area that I’ve blogged about here. Xian and I are going to talk about Parallel IMH and Parallel Wang Landau. We’ll be able to interact with top researchers in methodological statistics and early adopters of GPUs like Chris Holmes, whose talk at Valencia 9 was quite influential in the field (and for my PhD!), and Christophe Andrieu, who is one of the to-be-praised Particle MCMC guys (see e.g. here)

The programme for the workshop is here: http://www2.warwick.ac.uk/fac/sci/statistics/crism/workshops/gpu/programme

…and registration fee is only 50£, ie not even a half GPU!

By the way this page

http://www.louisaslett.com/Talks/GPU_Programming_Basics_Getting_Started.html

should help you a lot if you want to use plain CUDA C code within R.

As for me, thanks to Anthony Lee I will spend a week in Warwick prior to the meeting, working at CRiSM and enjoying the West Midlands.

Cheers!

New Year and many things to do

Posted in General by Pierre Jacob on 2 January 2012

Happy New Year to all of you, dear readers!

We’ve been quite busy lately, Julyan and I being nearer and nearer to the end of our PhDs and Robin being a new lecturer at Université Paris-Dauphine… but we hope to go on blogging in 2012. There is quite a few things to blog about, and we’re proud to have more and more readers with more than 500 views a week, even though the blog is not always very active.

2012 is going to be the year of the ISBA meeting in Japan, where a bunch of us at CREST and Dauphine intend to go. It’s also the year of MCQMC in Sydney, the largest meeting on Monte Carlo methods, which a couple of us PhD students will also attend. For some of us it will hopefully be our final year as students, and thus a particularly exciting (if busy) year.

Cheers,

Pierre

Psycho dice and Monte Carlo

Posted in R, Statistics by Julyan Arbel on 16 December 2011

Following Pierre’s post on psycho dice, I want here to see by which average margin repeated plays might be called influenced by mind will. The rules are the following (exerpt from the novel Midnight in the Garden of Good and Evil, by John Berendt):

You take four dice and call out four numbers between one and six–for example, a four, a three, and two sixes. Then you throw the dice, and if any of your numbers come up, you leave those dice standing on the board. You continue to roll the remaining dice until all the dice are sitting on the board, showing your set of numbers. You’re eliminated if you roll three times in succession without getting any of the numbers you need. The object is to get all four numbers in the fewest rolls.

Simplify the game by forgetting the elimination step. Suppose first one plays with an even dice of 1/p faces. The probability of it to show the right face is p (for somebody with no psy power). Denote X the time to first success with one dice, which follows, by independence, a geometric distribution Geom(p) (with the starting-to-1 convention). X has the following probability mass and cumulative distribution functions, with q=1-p:

f_X(k)=pq^{k-1},\quad F_X(k)=1-q^k.

Now denote Y the time to success in the game with n dice. This simultaneous case is the same as playing n times independently with 1 dice, and then taking Y as the sample maximum of the different times to success. So Y‘s cdf is

F_Y(k)=F_X(k)^n=(1-q^k)^n.

Its pmf can be obtained either exactly by difference, or up to a normalizing constant C by differentiation:

f_Y(k)=Cq^k(1-q^k)^{n-1}.

As it is not too far from the Geom(p) pmf, one can use the latter as the proposal in a Monte Carlo estimate. If X_i‘s are N independent Geom(p) variables, then

E(Y) \approx \frac{\sum_i X_i(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}} and E(Y^2) \approx \frac{\sum_i X_i^2(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}}.

The following R lines produce the estimates \mu_Y=E(Y) = 11.4 and \sigma_Y=sd(Y) = 6.5.

p=1/6
q=1-p
n=4
rgeom1=function(n,p){rgeom(n,p)+1}
h=function(x){(1-q^x)^(n-1)}

N=10^6
X=rgeom1(N,p)
(C=1/mean(h(X)))
(m1_Y=C*mean(X*h(X)))
(m2_Y=C*mean(X^2*h(X)))
(sd_Y=sqrt(m2_Y-m1_Y^2))

Created by Pretty R at inside-R.org

Now it is possible to use a test (from classical test theory) to estimate the average margin with which repeated games should deviate in order to detect statistical evidence of psy power. We are interested in testing H_0\,:\,E(Y)=\mu_Y against H_1\,:\,E(Y)<\mu_Y, for repeated plays.

If the game is played k times, then one rejects H_0 if the sampled mean \bar{Y} is less than \mu_Y -\frac{\sigma_Y}{\sqrt{k}}q_{.95}, where q_{.95} is the 95% standard normal quantile. To indicate the presence of a psy power, someone playing k=20 times should perform in 2 rolls less than the predicted value \mu_Y= 11.4 (in 1 roll less if playing k=80 times). I can’t wait, I’m going to grab a dice!

Create maps with maptools R package

Posted in Dataset, R by Julyan Arbel on 13 December 2011

Baptiste Coulmont explains on his blog how to use the R package maptools. It is based on shapefile files, for example the ones offered by the French geography agency IGN (at départements and communes level). Some additional material like roads and railways are provided by the OpenStreetMap project, here. For the above map, you need to dowload and dezip the files departements.shp.zip and ile-de-france.shp.zip. The red dots correspond to points of interest longitude / latitude, here churches stored in a vector eglises (use e.g. this to geolocalise places of interest). Then run this code from Baptiste’s tutorial

library(maptools)
france<-readShapeSpatial("departements.shp",
  proj4string=CRS("+proj=longlat"))
routesidf<-readShapeLines(
  "ile-de-france.shp/roads.shp",
  proj4string=CRS("+proj=longlat")
  )
trainsidf<-readShapeLines(
  "ile-de-france.shp/railways.shp",
  proj4string=CRS("+proj=longlat")
  )
plot(france,xlim=c(2.2,2.4),ylim=c(48.75,48.95),lwd=2)
plot(routesidf[routesidf$type=="secondary",],add=TRUE,lwd=2,col="lightgray")
plot(routesidf[routesidf$type=="primary",],add=TRUE,lwd=2,col="lightgray")
plot(trainsidf[trainsidf$type=="rail",],add=TRUE,lwd=1,col="burlywood3")
points(eglises$lon,eglises$lat,pch=20,col="red")

Created by Pretty R at inside-R.org

Follow

Get every new post delivered to your Inbox.