Coming R meetings in Paris
If you live in Paris and are interested in R, there will be two meetings for you this week.
First, a Semin-R session will be held at the Muséum National d’Histoire Naturelle on Tuesday 7 Feb (too bad, the Museum is closed on Tuesdays). Presentations will be about colors, phylogenies and maps, while I will speak about (my beloved) RStudio. The slides of previous sessions can be found here (most of them are in French).
The following day, 8 Feb, a group of R users from INSEE will have its first meeting (13-14h, INSEE, room R12), about SQLite data in R, maps, and in R.
I guess anyone can join!
UPDATE: Here is a colorful map to access INSEE. Come with an ID, and say you are visiting the meeting organizer Matthieu Cornec. Room R12 is on the ground floor (left).
Heading to the National University of Singapore

Hey there,
Next Wednesday (Feb 8th) I’ll give a talk on SMC² at the National University of Singapore (3pm, S16-06-118, DSAP Seminar Room if you’re interested).
By the way, the SMC² paper has been heavily revised, along with the Python package and the supplement. Among the main changes, we made a thorough comparison with the Liu & West SMC method (which is based on extending the hidden state to include the parameter), and we added a fair bit of justification of the computational cost of the method, which we believe is very reasonable considering the difficulty of the problem (i.e. estimation in HMMs) and the cost of competing algorithms. We also considered the validity of SMC² on a sequence of annealed posterior densities, following a reviewer’s interesting comment. It’s good to be finished with it (for now at least), and to move on to new projects!
I’ll spend next week in Singapore working with Ajay Jasra, and then I’ll head off to the Monte Carlo and Quasi Monte Carlo conference in Sydney, where I’ll stay for ~2 weeks. It’ll be worth another blog post, of course.
Cheers!
GPUs in Computational Statistics

Hey,
Next week the Centre for Research in Statistical Methodology (CRiSM, in Warwick, UK) will be hosting a workshop on the use of graphics processing units in statistics, a quickly expanding area that I’ve blogged about here. Xian and I are going to talk about Parallel IMH and Parallel Wang Landau. We’ll be able to interact with top researchers in methodological statistics and early adopters of GPUs like Chris Holmes, whose talk at Valencia 9 was quite influential in the field (and for my PhD!), and Christophe Andrieu, who is one of the to-be-praised Particle MCMC guys (see e.g. here).
The programme for the workshop is here: http://www2.warwick.ac.uk/fac/sci/statistics/crism/workshops/gpu/programme
…and the registration fee is only £50, i.e. not even half a GPU!
By the way this page
http://www.louisaslett.com/Talks/GPU_Programming_Basics_Getting_Started.html
should help you a lot if you want to use plain CUDA C code within R.
As for me, thanks to Anthony Lee I will spend a week in Warwick prior to the meeting, working at CRiSM and enjoying the West Midlands.
Cheers!
New Year and many things to do

Happy New Year to all of you, dear readers!
We’ve been quite busy lately, Julyan and I being nearer and nearer to the end of our PhDs and Robin being a new lecturer at Université Paris-Dauphine… but we hope to go on blogging in 2012. There are quite a few things to blog about, and we’re proud to have more and more readers, with more than 500 views a week, even though the blog is not always very active.
2012 is going to be the year of the ISBA meeting in Japan, where a bunch of us at CREST and Dauphine intend to go. It’s also the year of MCQMC in Sydney, the largest meeting on Monte Carlo methods, which a couple of us PhD students will also attend. For some of us it will hopefully be our final year as students, and thus a particularly exciting (if busy) year.
Cheers,
Pierre
Psycho dice and Monte Carlo
Following Pierre’s post on psycho dice, I want to work out here by what average margin repeated plays must beat chance before they can be said to be influenced by willpower. The rules are the following (excerpt from the novel Midnight in the Garden of Good and Evil, by John Berendt):
You take four dice and call out four numbers between one and six–for example, a four, a three, and two sixes. Then you throw the dice, and if any of your numbers come up, you leave those dice standing on the board. You continue to roll the remaining dice until all the dice are sitting on the board, showing your set of numbers. You’re eliminated if you roll three times in succession without getting any of the numbers you need. The object is to get all four numbers in the fewest rolls.
Simplify the game by forgetting the elimination step. Suppose first that one plays with a fair die with 1/p faces. The probability that it shows the right face is p (for somebody with no psy power). Denote by X the time to first success with one die, which follows, by independence of the rolls, a geometric distribution Geom(p) (with the starting-at-1 convention). X has the following probability mass and cumulative distribution functions, with q=1-p:
$$P(X = k) = p\,q^{k-1}, \qquad P(X \le k) = 1 - q^{k}, \qquad k = 1, 2, \dots$$
Now denote by Y the time to success in the game with n dice. This simultaneous case is the same as playing n times independently with one die, and then taking Y as the sample maximum of the different times to success. So Y’s cdf is
$$P(Y \le k) = (1 - q^{k})^{n}.$$
Its pmf can be obtained either exactly by difference, or up to a normalizing constant C by differentiation:
$$P(Y = k) = (1 - q^{k})^{n} - (1 - q^{k-1})^{n} \approx C\,q^{k}\,(1 - q^{k})^{n-1}.$$
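As it is not too far from the Geom(p) pmf, one can use the latter as the proposal in a Monte Carlo (importance sampling) estimate. If the $X_i$’s are N independent Geom(p) variables, with weights $w_i = P(Y = X_i)/P(X = X_i)$, then
$$E(Y) \approx \frac{\sum_{i=1}^{N} w_i X_i}{\sum_{i=1}^{N} w_i}$$
and
$$E(Y^2) \approx \frac{\sum_{i=1}^{N} w_i X_i^2}{\sum_{i=1}^{N} w_i}.$$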
The following R lines produce the estimates of the mean and of the standard deviation of Y.
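A minimal sketch of such an importance sampling estimate, not the original listing. It assumes n = 4 fair six-sided dice (so p = 1/6), uses the exact pmf of Y obtained by difference, and self-normalizes the weights; all variable names are mine. The truncated exact series is only there as a sanity check.

```r
set.seed(42)                 # for reproducibility
n <- 4                       # number of dice (assumption: the game from the novel)
p <- 1 / 6                   # success probability of one fair six-sided die
q <- 1 - p
N <- 1e5                     # number of proposal draws

# Draw from the Geom(p) proposal; rgeom() counts failures, so add 1
x <- rgeom(N, p) + 1

# Importance weights: exact pmf of Y (by difference) over the Geom(p) pmf
pmf_y <- function(k) (1 - q^k)^n - (1 - q^(k - 1))^n
w <- pmf_y(x) / dgeom(x - 1, p)

# Self-normalized estimates of the mean and standard deviation of Y
mu_hat <- sum(w * x) / sum(w)
sd_hat <- sqrt(sum(w * x^2) / sum(w) - mu_hat^2)

# Reference values from the truncated exact series, as a sanity check
k <- 1:2000
mu_exact <- sum(k * pmf_y(k))
sd_exact <- sqrt(sum(k^2 * pmf_y(k)) - mu_exact^2)

print(c(mu_hat = mu_hat, mu_exact = mu_exact,
        sd_hat = sd_hat, sd_exact = sd_exact))
```

The weights stay bounded because the tail of Y decays at the same geometric rate as the proposal, which is what makes Geom(p) a workable proposal here.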
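Now it is possible to use a test (from classical test theory) to estimate the average margin by which repeated games should deviate in order to provide statistical evidence of psy power. Writing $\mu$ and $\sigma$ for the mean and standard deviation of Y without psy power, we are interested in testing $H_0: E(Y) = \mu$ against $H_1: E(Y) < \mu$, for repeated plays. If the game is played k times, then one rejects $H_0$ if the sample mean $\bar{Y}_k$ is less than $\mu - z_{0.95}\,\sigma/\sqrt{k}$, where $z_{0.95}$ is the 95% standard normal quantile. To indicate the presence of a psy power, someone playing k times should therefore average $z_{0.95}\,\sigma/\sqrt{k}$ rolls fewer than the predicted value $\mu$: a margin of 2 rolls requires $k = (z_{0.95}\,\sigma/2)^2$ plays, and a margin of 1 roll requires four times as many. I can’t wait, I’m going to grab a die!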
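To put numbers on the test, here is a small sketch under the same assumptions as the simplified game (n = 4 fair six-sided dice, no elimination step, normal approximation for the sample mean). The values it produces are my own computation from the exact pmf, not figures quoted from the post, and the helper names `threshold` and `plays_needed` are hypothetical.

```r
n <- 4                       # number of dice
p <- 1 / 6                   # success probability of one fair six-sided die
q <- 1 - p

# Exact pmf of Y by difference, summed over a long truncated range
k_vals <- 1:2000
pmf <- (1 - q^k_vals)^n - (1 - q^(k_vals - 1))^n
mu <- sum(k_vals * pmf)                        # mean number of rolls under H0
sigma <- sqrt(sum(k_vals^2 * pmf) - mu^2)      # standard deviation under H0

z <- qnorm(0.95)             # 95% standard normal quantile

# Rejection threshold for the sample mean after k plays
threshold <- function(k) mu - z * sigma / sqrt(k)

# Smallest number of plays for which a given margin (in rolls) is significant
plays_needed <- function(margin) ceiling((z * sigma / margin)^2)

print(c(mu = mu, sigma = sigma))
print(c(margin_2 = plays_needed(2), margin_1 = plays_needed(1)))
```

Since the margin shrinks like 1/sqrt(k), halving the margin to detect costs roughly four times as many plays.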
Create maps with maptools R package
Baptiste Coulmont explains on his blog how to use the R package maptools. It is based on shapefiles, for example the ones offered by the French geography agency IGN (at départements and communes level). Some additional material, like roads and railways, is provided by the OpenStreetMap project, here. For the above map, you need to download and unzip the files departements.shp.zip and ile-de-france.shp.zip. The red dots correspond to the longitude / latitude of points of interest, here churches stored in an object eglises with lon and lat components (use e.g. this to geolocalise places of interest). Then run this code from Baptiste’s tutorial:
library(maptools)

# Read the shapefiles (coordinates are longitude / latitude)
france <- readShapeSpatial("departements.shp",
                           proj4string = CRS("+proj=longlat"))
routesidf <- readShapeLines("ile-de-france.shp/roads.shp",
                            proj4string = CRS("+proj=longlat"))
trainsidf <- readShapeLines("ile-de-france.shp/railways.shp",
                            proj4string = CRS("+proj=longlat"))

# Base map of the départements, zoomed in on Paris
plot(france, xlim = c(2.2, 2.4), ylim = c(48.75, 48.95), lwd = 2)

# Overlay secondary and primary roads, then railways
plot(routesidf[routesidf$type == "secondary", ], add = TRUE, lwd = 2, col = "lightgray")
plot(routesidf[routesidf$type == "primary", ], add = TRUE, lwd = 2, col = "lightgray")
plot(trainsidf[trainsidf$type == "rail", ], add = TRUE, lwd = 1, col = "burlywood3")

# Points of interest (churches) as red dots
points(eglises$lon, eglises$lat, pch = 20, col = "red")
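The code above uses eglises$lon and eglises$lat, but the post does not show how eglises is built. A minimal sketch, assuming a plain data frame is enough; the two churches and their approximate coordinates are my own illustrative picks, not Baptiste's data, and the column name nom is hypothetical.

```r
# Hypothetical points of interest: approximate longitude / latitude in degrees
eglises <- data.frame(
  nom = c("Notre-Dame de Paris", "Saint-Sulpice"),
  lon = c(2.3499, 2.3349),
  lat = c(48.8530, 48.8510)
)

# Check that every point falls inside the plotting window used for the map
in_window <- with(eglises,
                  lon >= 2.2 & lon <= 2.4 & lat >= 48.75 & lat <= 48.95)
print(eglises[in_window, ])
```

Any object with numeric lon and lat components would work the same way with points().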
Google Fusion Tables
A quick post about another Google service that I discovered recently called Fusion Tables. There you can store, share and visualize data up to 250 MB, of course in the cloud. With Google Docs, Google Trends and Google Public Data Explorer, it is another example of Google’s efforts to gain ground in data management. Has anyone tried it out?
France open data at data.gouv.fr

Today sees the launch of the (beta version of the) brand new French website for open data, at data.gouv.fr (do not misunderstand the url, it is in French!). On the prime minister’s initiative, it collects data from various resources, among which the institute for statistics INSEE, most of the ministries (Finance, Culture, etc.), and several big companies (like the state-owned railway company SNCF), for an open, transparent and collaborative model of governance. Datasets are available as CSV files or Excel spreadsheets, under an open licence for most of them, and should be updated frequently (monthly).
Some of the datasets you can find: a list of 3000+ train stations with geolocalisation (later with traffic?); geolocalisation of road accidents; the comprehensive list of books in the Bibliothèque Nationale de France (which scans and stores each and every book published in France); the number of students in Latin and ancient Greek classes; France’s national budget, etc., among 350,000 others.
I have added the link in our datasets page.
Psycho dice

In a failed attempt to escape from statistics by reading a novel (Midnight in the Garden of Good and Evil, by John Berendt), I discovered a game called psycho dice. One of the main characters, Jim Williams, explains it as follows.
“I believe in mind control,” he said. “I think you can influence events by mental concentration. I’ve invented a game called Psycho Dice. It’s very simple. You take four dice and call out four numbers between one and six–for example, a four, a three, and two sixes. Then you throw the dice, and if any of your numbers come up, you leave those dice standing on the board. You continue to roll the remaining dice until all the dice are sitting on the board, showing your set of numbers. You’re eliminated if you roll three times in succession without getting any of the numbers you need. The object is to get all four numbers in the fewest rolls.”
Williams was sure he could improve the odds by sheer concentration. “Dice have six sides,” he said, “so you have a one-in-six chance of getting your number when you throw them. If you do any better than that, you beat the law of averages. Concentration definitely helps. That’s been proved. Back in the nineteen-thirties, Duke University did a study with a machine that could throw dice. First they had it throw dice when nobody was in the building, and the numbers came up strictly according to the law of averages. Then they put a man in the next room and had him concentrate on various numbers to see if that would beat the odds. It did. Then they put him in the same room, still concentrating, and the machine beat the odds again, by an even wider margin. When the man rolled the dice himself, using a cup, he did better still. When he finally rolled the dice with his bare hand, he did best of all.”
Power-laws: choose your x and y variables carefully
This is a follow-up to the post Power of running world records.
As suggested by Andrew, plotting running world records could benefit from a change of variables. More exactly the use of different variables sheds light on a [now] well-known [to me] sports result provided in a 2000 Nature paper by Sandra Savaglio and Vincenzo Carbone (thanks Ken): the dependence between time and distance in log-log scale is not linear on the whole range of races, but piecewise linear. There is one break-point around time 2’33’’ (or equivalently distance around 1100 m). As mentioned in the article, this threshold corresponds to a physiological critical change in the athlete’s energy expenditure: in short races (less than 1000 m) the effort follows an anaerobic metabolism, whereas it switches to aerobic metabolism for middle and long distances (or longer…). Interestingly, the energy is more efficiently consumed in the second regime than in the first: the decay in speed slows down for endurance races.
The reason for this graphical/visual difference is simple. Denote distance, time and speed by D, T and S. I have plotted the log T~ log D relation, which gave with
. When using the speed S as one of the variables, the relations are
and
with
and
to the first order because
is close to 1. With Nature paper findings (with the opposite sign convention), the two
s are
(anaerobic) and
(aerobic), ie
and
. My improper
is indeed in between. The slope ratio is much larger (larger than 2) on a plot involving the speed, clearly showing the two regimes, than on my original plot (a few 10%), which is the reason why it appears almost linear (although, in hindsight, and with good goggles, two lines might have been detected).
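The magnification of the break on a speed plot can be sketched as follows. The notation is an assumption of mine (a single power law $T = c\,D^{\alpha}$ within each regime, with $\alpha$ close to 1) and may not match the symbols of the original formulas:

```latex
T = c\,D^{\alpha}
\quad\Longrightarrow\quad
S = \frac{D}{T} = \frac{1}{c}\,D^{-(\alpha-1)},
\qquad
S = c^{-1/\alpha}\,T^{-(\alpha-1)/\alpha}
  \approx c^{-1/\alpha}\,T^{-(\alpha-1)}
\quad\text{for } \alpha \text{ close to } 1 .
```

With two regimes of exponents $\alpha_1$ and $\alpha_2$, the slope ratio is $\alpha_2/\alpha_1$ on the log T ~ log D plot but $(\alpha_2-1)/(\alpha_1-1)$ on a log S ~ log D plot. For purely illustrative values $\alpha_1 = 1.05$ and $\alpha_2 = 1.10$ (not the paper's estimates), the first ratio is about 1.05 while the second is 2.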
Below is the S ~ log D relation (click to enlarge) on which it appears clearly that 100 m and 100 km races are two outliers. It takes time to erase the loss of time due to the start of the race (100 m and 200 m are run at the same speed…), whereas the 100 km suffers from a lack of interest among athletes.
Achim Zeileis also provides an extended world records table and R code in his comment.
As an aside, Andrew and Cosma Shalizi also comment and resolve an ambiguity of mine: one usually speaks about power-laws without much precision of context, but there are mainly two separate sets of power-law models. Either power-law regressions, where you plot y~x for two different variables (this is the case here); or power-law distributions, i.e. the probability distribution of a single variable x is $p(x) \propto x^{-\alpha}$, or extensions of that (with lots of natural examples, ranging from the size of cities to the number of deaths in attacks in wars).



