Coming R meetings in Paris

Posted in R, Seminar/Conference by Julyan Arbel on 4 February 2012

If you live in Paris and are interested in R, there will be two meetings for you this week.

First a Semin-R session, organized at the Muséum National d’Histoire Naturelle on Tuesday 7 Feb (too bad, the Museum is closed on Tuesdays). Presentations will be about colors, phylogenies and maps, while I will speak about (my beloved) RStudio. The slides of previous sessions can be found here (most of them are in French).

The following day, 8 Feb, a group of R users from INSEE will have its first meeting (13-14h, INSEE, room R12), about SQLite data in R, maps, and \LaTeX in R.

I guess anyone can join!

UPDATE: Here is a colorful map to access INSEE \tau R. Come with an ID, and say you are visiting the meeting organizer Matthieu Cornec. Room R12 is on the ground floor (left).

Psycho dice and Monte Carlo

Posted in R, Statistics by Julyan Arbel on 16 December 2011

Following Pierre’s post on psycho dice, I want here to see by which average margin repeated plays might be called influenced by mind will. The rules are the following (exerpt from the novel Midnight in the Garden of Good and Evil, by John Berendt):

You take four dice and call out four numbers between one and six–for example, a four, a three, and two sixes. Then you throw the dice, and if any of your numbers come up, you leave those dice standing on the board. You continue to roll the remaining dice until all the dice are sitting on the board, showing your set of numbers. You’re eliminated if you roll three times in succession without getting any of the numbers you need. The object is to get all four numbers in the fewest rolls.

Simplify the game by forgetting the elimination step. Suppose first one plays with an even dice of 1/p faces. The probability of it to show the right face is p (for somebody with no psy power). Denote X the time to first success with one dice, which follows, by independence, a geometric distribution Geom(p) (with the starting-to-1 convention). X has the following probability mass and cumulative distribution functions, with q=1-p:

f_X(k)=pq^{k-1},\quad F_X(k)=1-q^k.

Now denote Y the time to success in the game with n dice. This simultaneous case is the same as playing n times independently with 1 dice, and then taking Y as the sample maximum of the different times to success. So Y‘s cdf is


Its pmf can be obtained either exactly by difference, or up to a normalizing constant C by differentiation:


As it is not too far from the Geom(p) pmf, one can use the latter as the proposal in a Monte Carlo estimate. If X_i‘s are N independent Geom(p) variables, then

E(Y) \approx \frac{\sum_i X_i(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}} and E(Y^2) \approx \frac{\sum_i X_i^2(1-q^{X_i})^{n-1}}{\sum_i (1-q^{X_i})^{n-1}}.

The following R lines produce the estimates \mu_Y=E(Y) = 11.4 and \sigma_Y=sd(Y) = 6.5.



Created by Pretty R at

Now it is possible to use a test (from classical test theory) to estimate the average margin with which repeated games should deviate in order to detect statistical evidence of psy power. We are interested in testing H_0\,:\,E(Y)=\mu_Y against H_1\,:\,E(Y)<\mu_Y, for repeated plays.

If the game is played k times, then one rejects H_0 if the sampled mean \bar{Y} is less than \mu_Y -\frac{\sigma_Y}{\sqrt{k}}q_{.95}, where q_{.95} is the 95% standard normal quantile. To indicate the presence of a psy power, someone playing k=20 times should perform in 2 rolls less than the predicted value \mu_Y= 11.4 (in 1 roll less if playing k=80 times). I can’t wait, I’m going to grab a dice!

Create maps with maptools R package

Posted in Dataset, R by Julyan Arbel on 13 December 2011

Baptiste Coulmont explains on his blog how to use the R package maptools. It is based on shapefile files, for example the ones offered by the French geography agency IGN (at départements and communes level). Some additional material like roads and railways are provided by the OpenStreetMap project, here. For the above map, you need to dowload and dezip the files and The red dots correspond to points of interest longitude / latitude, here churches stored in a vector eglises (use e.g. this to geolocalise places of interest). Then run this code from Baptiste’s tutorial


Created by Pretty R at

Google Fusion Tables

Posted in Dataset, Geek by Julyan Arbel on 8 December 2011

A quick post about another Google service that I discovered recently called Fusion Tables. There you can store, share and visualize data up to 250 MB, of course in the cloud. With Google Docs, Google Trends and Google Public Data Explore, it is another example of Google’s efforts to gain ground in data management. Has anyone tried it out?

Tagged with:

France open data at

Posted in Dataset by Julyan Arbel on 5 December 2011

Today is launched the (beta version of the) brand new French website for open data, at (do not misunderstand the url, it is in French!). On prime minister’s initiative, it collects data from various ressources, among which the institute for statistics INSEE, most of the ministries (Finance, Culture, etc), several big companies (like the state-owned railway company SNCF) for an open, transparent and collaborative model of governance. Datasets are available in CSV or Excel spreadsheet, under open licence for most of them, and should be updated frequently (monthly).

Some of the datasets you can find: list of 3000+ train stations with geolocalisation (later with traffic?); geolocalisation of road accidents; the comprehensive list of books in Bibliothèque Nationale de France (which scans and stores each and every book published in France); the number of students in Latin and ancient Greek classes; France national budget, etc… among 350 000 others.

I have added the link in our datasets page.

Power-laws: choose your x and y variables carefully

Posted in R, Sport by Julyan Arbel on 16 November 2011

This is a follow-up of the post Power of running world records

As suggested by Andrew, plotting running world records could benefit from a change of variables. More exactly the use of different variables sheds light on a [now] well-known [to me] sports result provided in a 2000 Nature paper by Sandra Savaglio and Vincenzo Carbone (thanks Ken): the dependence between time and distance in log-log scale is not linear on the whole range of races, but piecewise linear. There is one break-point around time 2’33’’ (or equivalently distance around 1100 m). As mentioned in the article, this threshold corresponds to a physiological critical change in the athlete’s energy expenditure: in short races (less than 1000 m) the effort follows an anaerobic metabolism, whereas it switches to aerobic metabolism for middle and long distances (or longer…). Interestingly, the energy is more efficiently consumed in the second regime than in the first: the decay in speed slows down for endurance races.

The reason of this graphical/visual difference is simple. Denote distance, time and speed by D, T and S. I have plotted the log T~ log D relation, which gave T\propto D^{\alpha} with \alpha=1.11. When using the speed S as one of the variables, the relations are S\propto D^{\gamma} and S\propto T^{\beta} with \gamma=1-\alpha and \beta=\frac{1}{\alpha}-1\approx 1-\alpha to the first order because \alpha is close to 1. With Nature paper findings (with the opposite sign convention), the two \betas are \beta_{\text{an}}=-0.165 (anaerobic) and \beta_{\text{ac}}=-0.072 (aerobic), ie \alpha_{\text{an}}=1.20 and \alpha_{\text{ac}}=1.08. My improper \alpha=1.11 is indeed in between. The slope ratio is much larger (larger than 2) on a plot involving the speed, clearly showing the two regimes, than on my original plot (a few 10%), which is the reason why it appear almost linear (although afterthought, and with good goggles, two lines might have been detected).

Below is the S ~ log D relation (click to enlarge) on which it appears clearly that 100 m and 100 km races are two outliers. It takes time to erase the loss of time due to the start of the race (100 m and 200 m are run at the same speed…), whereas the 100 km suffers from a lack of interest among athletes.Achim Zeileis also provides an extended world records table and R code in his comment.

As an aside, Andrew and Cosma Shalizi also comment and resolve an ambiguity of mine: one usually speaks about power-laws without much precision of context, but there are mainly two separate sets of power-law models. Either power-law regressions, where you plot y~x for two different variables (this is the case here); or power-law distributions, ie the probability distribution of a single variable x is p(x)\propto x^{-a}, or extensions of that (with lots of natural examples, ranging from the size of cities to the number of deaths in attacks in wars).

World Statistics Day

Posted in General by Julyan Arbel on 21 October 2011

As recommended by the United Nations Statistics Commission, a World Statistics Day is observed every 5 years. The first one occured last year, so no stats day to celebrate this year. Nevertheless, cheers to all of you readers: recent activity (a post by Robin about his article with Pierre, adverts by Xian, and a reference in a statistics blogs list) made it the busiest day ever for the 1+ year old Statisfaction:

Random art on the web

Posted in Art, R by Julyan Arbel on 15 October 2011

Since we explored some statitics of an abstract painting with Pierre (we even have an article in Variances last issue!), I became more sensitive to art linked to randomness. Here are some pointers to related websites I have digged out., mentioned here by Pierre, is, at it reads, a true random number service that generates randomness via atmospheric noise. The American writer Eric Hoffer is quoted about creativity:

Creativity is the ability to introduce order into the randomness of nature.

You will find there contributed pages of users of the service about varied forms of arts, like pages which generate Samuel Beckett-like prose, or Jazz Scales. In visual arts, you can find for example the Bryce girl 1, a fractal landscape by Fuller Thompson of Bryce Canyon (with an extra sexy girl by the way); and nice pastel Richter-like pictures by Dave Nelson (to be compared with an excerpt of Richter’s 1024 colors): is a program which randomly generates a picture with a given character seed. You can visit the gallery, or make your own. Sadly, “Bayes” generates an ugly flashy pic:

Day to day data gather together artists who collect, list, database and absurdly analyse the data of everyday life. You can find there links to artists like Abigail Reynolds and her Mount Fear of crimes in London, and many others.

R users produced great outputs too. Interestingly, the two following graphs feel like 3D, although only made up of lines and curves. Paul Butler’s visualization of Facebook connections (with a bit of post processing):

and Christophe Ladroue’s representation of individual rankings across lags in a triathlon:

Check the R Graph Gallery if you want to enhance your data visualization in R!

Triathlon data with ggplot2

Posted in Dataset, Sport by Julyan Arbel on 11 October 2011

As Jérôme and I like so much to play with triathlon data, it is a pleasure to see that we are not alone. Christophe Ladroue, from the university of Bristol, wrote this post yesterday: An exercise in plyr and ggplot2 using triathlon results, followed by part II, way better than ours, here and here. For example, the time distributions by age, “faceted” by discipline (swim, cycle, run and total), look like this

As the number of participants to the Stratford triathlon (400 or so) is a bit small for this number of age categories, it would be nice to compare with the Paris triathlon results (about 4000).

Here is the rank for the 3 disciplines and for the total time, “colored” by the final quartile (check the full part II post for colors by quartile in the 3 disciplines):

We see that the rank at the swim lag is not much informative for the final performance, all 4 colors being pretty mixed at that stage, and that it is tidied by the cycle lag. It is the longer one, and as such, the more predictive for the final perf. It is nice to see that some of the poor swimmers  finally reach the first quartile (in orange). Check those ones whit sawtooth patterns: first quartile at swimming, last cycling, first running, and last at the end!

An interesting thing to do with that kind of sports databases would be to build panel data. As most race websites provide historical data with the participants names and age, identification is possible. It is the case for Ipitos, or for Paris 20 km race, with data from 2004 to 2010 (and soon 2011). Remains to check if enough people compete in all the races in a row, my guess is that the answer is yes. The next steps would be to study the impact of the age on the progress, and on the way ones manages the effort from the beginning to the end of the race (thanks to intermediate times in running races, or discipline times in triathlon). Well, maybe in a later post.

Tagged with: , ,

Artist view of crimes in London

Posted in Art, R by Julyan Arbel on 10 October 2011

At first sight, one could think this picture is a scale model of some narrow moutains, like Bryce Canyon… Actually it represents crimes in East London, an cardboard artwork by the Londoner artist Abigail Reynolds, called Mount Fear.  Here is what can be read on the artist’s webpage:

The terrain of Mount Fear is generated by data sets relating to the frequency and position of urban crimes. Precise statistics are provided by the police. Each individual incident adds to the height of the model, forming a mountainous terrain.

All Mount Fear models are built on the same principals. The imaginative fantasy space seemingly proposed by the scupture is subverted by the hard facts and logic of the criteria that shape it. The object does not describe an ideal other-worldly space separated from lived reality, but conversely describes in relentless detail the actuality of life on the city streets.

No mention of the statistical method used (kernel, Dirichlet process density estimation?). Some crime data can be found on UNdata for example, or here for an interactive map. It reminds a great work by David Kahle about crime in Houston, combining ggplot2 and GoogleMap. He won a ggplot2 case study competition for this. His code is available here. I like in particular the contour plot, with cool rainbow colors, where both the crime level and the map background are clearly visible.

Tagged with: ,

Get every new post delivered to your Inbox.

Join 55 other followers

%d bloggers like this: