Valentine's Day and lonely people in France

Posted in Dataset, General by Julyan Arbel on 14 February 2012

Insee recently published a paper (in French), well timed for Valentine's Day, which characterizes people living alone or in a couple by socio-professional category, along with the data.

Between 1990 and 2008 (two population surveys), the proportion of people living alone increased mainly among people under 60. After 60, 38% of women live alone, against only 17% of men, because women tend to be married to older men and live longer than them, on average. See that proportion by age:

Spatially, there is a kind of South/North opposition. During working life, lonely people live in the South (left), while lonely retired people live in the North (right), with the exception of Île-de-France (Paris), where the proportion is high at any age:


Daily casualties in Syria

Posted in Dataset, R by Julyan Arbel on 9 February 2012

Every new day brings its statistics of new deaths in Syria… Here is an attempt to learn about the Syrian uprising through the figures. Data vary among sources: the Syrian opposition provides the number of casualties by day (here on Dropbox), updated on 8 February 2012, with a total exceeding 8 000.

We note first that the attacks are accelerating, as the cumulative graph is mostly convex:

Plotting the numbers by day shows the bloody situation on Fridays, a gathering day in the Muslim calendar. This was especially true at the beginning of the uprising, but lately any other day can be equally deadly:

There are almost twice as many deaths on Fridays as on any other day, on average:
Here are boxplots for the logarithm of daily casualties by day of the week:

and their density estimates, first coloured by day of the week, then by Friday vs rest of the week:

Here is the code (with clumsy parts for fitting the data frames for ggplot, do not hesitate to comment on it)

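library(ggplot2) # provides qplot
# input: data frame with one row per day and columns Date, Number (casualties) and WeekDay
input$LogicalFriday=factor(input$WeekDay =="Friday",levels = c(FALSE, TRUE),
                           labels = c("Not Friday", "Friday"))
input$WeekDay=factor(input$WeekDay, # start of this statement was lost in the post, reconstructed here
                      levels=unique(as.character(input$WeekDay[7:13]))) # trick to sort the legend
# cumulative number of casualties
qplot(x=Date,y=cumsum(Number), data=input, geom="line",color=I("red"),xlab="",ylab="",lwd=I(1))
# daily casualties, Fridays highlighted (geom="col" plots the values as given; recent ggplot2 rejects geom="bar" with a y value)
qplot(x=as.factor(Date),y=Number, data=input, geom="col",fill=LogicalFriday,xlab="",ylab="")
# density estimates of the log casualties, Friday vs rest of the week, then by day of the week
qplot(log(Number+1), data=input, geom="density",fill=LogicalFriday,xlab="",ylab="",alpha=I(.2))
qplot(log(Number+1), data=input, geom="density",fill=WeekDay,xlab="",ylab="",alpha=I(.2))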
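
The "almost twice as many deaths on Fridays" figure is simply a ratio of daily means. Here is a minimal sketch of that computation, on made-up counts rather than the actual data; the data frame toy and the values in Number are assumptions chosen for the illustration.

```r
# Synthetic example: four weeks of made-up daily counts, NOT the Syrian data
dates <- seq(as.Date("2011-03-18"), by = "day", length.out = 28)  # starts on a Friday
is_friday <- format(dates, "%u") == "5"      # ISO weekday number, locale independent
number <- ifelse(is_friday, 20, 10)          # made-up counts, Fridays set to twice the others
toy <- data.frame(Date = dates, Number = number,
                  LogicalFriday = factor(is_friday, levels = c(FALSE, TRUE),
                                         labels = c("Not Friday", "Friday")))

# Mean daily count in each group, then the Friday / non-Friday ratio
daily_mean <- tapply(toy$Number, toy$LogicalFriday, mean)
friday_ratio <- unname(daily_mean["Friday"] / daily_mean["Not Friday"])
print(daily_mean)
print(friday_ratio)
```

With the real data frame, replacing toy by input should give the ratio discussed above, provided the columns are named as in the code of the post.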


Create maps with maptools R package

Posted in Dataset, R by Julyan Arbel on 13 December 2011

Baptiste Coulmont explains on his blog how to use the R package maptools. It is based on shapefiles, for example the ones offered by the French geography agency IGN (at the département and commune levels). Some additional material, like roads and railways, is provided by the OpenStreetMap project, here. For the above map, you need to download and unzip the files. The red dots correspond to points of interest given by longitude / latitude, here churches stored in a vector eglises (use e.g. this to geolocate places of interest). Then run this code from Baptiste’s tutorial



Google Fusion Tables

Posted in Dataset, Geek by Julyan Arbel on 8 December 2011

A quick post about another Google service that I discovered recently, called Fusion Tables. There you can store, share and visualize data up to 250 MB, of course in the cloud. With Google Docs, Google Trends and Google Public Data Explorer, it is another example of Google’s efforts to gain ground in data management. Has anyone tried it out?


France open data at

Posted in Dataset by Julyan Arbel on 5 December 2011

Today sees the launch of the (beta version of the) brand new French website for open data, at (do not misunderstand the url, it is in French!). On the prime minister’s initiative, it collects data from various sources, among which the institute for statistics INSEE, most of the ministries (Finance, Culture, etc.) and several big companies (like the state-owned railway company SNCF), for an open, transparent and collaborative model of governance. Datasets are available as CSV or Excel spreadsheets, under an open licence for most of them, and should be updated frequently (monthly).

Some of the datasets you can find: a list of 3000+ train stations with geolocation (later with traffic?); the geolocation of road accidents; the comprehensive list of books in the Bibliothèque Nationale de France (which scans and stores each and every book published in France); the number of students in Latin and ancient Greek classes; France's national budget, etc… among 350 000 others.

I have added the link to our datasets page.

Triathlon data with ggplot2

Posted in Dataset, Sport by Julyan Arbel on 11 October 2011

As Jérôme and I like playing with triathlon data so much, it is a pleasure to see that we are not alone. Christophe Ladroue, from the University of Bristol, wrote this post yesterday: An exercise in plyr and ggplot2 using triathlon results, followed by part II, way better than ours, here and here. For example, the time distributions by age, “faceted” by discipline (swim, cycle, run and total), look like this

As the number of participants in the Stratford triathlon (400 or so) is a bit small for this number of age categories, it would be nice to compare with the Paris triathlon results (about 4000).

Here is the rank for the 3 disciplines and for the total time, “colored” by the final quartile (check the full part II post for colors by quartile in the 3 disciplines):

We see that the rank at the swim leg is not very informative for the final performance, all 4 colors being pretty mixed at that stage, and that it is tidied by the cycle leg. It is the longest one, and as such, the most predictive for the final performance. It is nice to see that some of the poor swimmers finally reach the first quartile (in orange). Check those with sawtooth patterns: first quartile at swimming, last at cycling, first at running, and last at the end!

An interesting thing to do with that kind of sports database would be to build panel data. As most race websites provide historical data with the participants' names and ages, identification is possible. It is the case for Ipitos, or for the Paris 20 km race, with data from 2004 to 2010 (and soon 2011). It remains to check whether enough people compete in all the races in a row; my guess is that the answer is yes. The next steps would be to study the impact of age on progress, and on the way one manages the effort from the beginning to the end of the race (thanks to intermediate times in running races, or discipline times in triathlon). Well, maybe in a later post.


World Tourism Day, and Google Public Data Explorer

Posted in Dataset, R by Julyan Arbel on 27 September 2011

Today is the World Tourism Day! So let’s speak about some tourism related datasets – and others.

Among other nice functions, Google offers a Public Data Explorer in a beta version, which provides a collection of datasets from OECD, IMF, Eurostat, World Bank, US Census Bureau, etc. (cf. our datasets page as well). It is possible to plot these data directly online, with the following (limited) types: lines, barplots, maps and scatterplots.

The page reads “Data visualizations for a changing world“… nothing less! As someone writes on Andrew’s blog, it is very reminiscent of Hans Rosling's work with Gapminder‘s motion charts: “Unveiling the beauty of statistics for a fact based world view“.

It is really easy to use, and a good opportunity for high school math teachers to show nice graphs to students before they learn how to use R. The link to R is straightforward, as it displays the same plots as the googleVis R package. For example, for tourism data, the number of nights spent in European countries looks like this in 2009 (click to get the motion chart version!)

The barplot goes like this:

Power of running world records

Posted in Dataset, R, Sport by Julyan Arbel on 8 August 2011

Following a few entries on sports here and there, I was wondering what kind of law running records follow with respect to distance. The data are available on Wikipedia, or here for a tidied version. They cover 18 distances, from 100 meters to 100 kilometers. A log-log scale is in order:

It is nice to find a clear power law: the relation between the logarithms of time T and of distance D is linear. Its slope (given by the lm function) defines the power in the following relation:

T\propto D^{1.11}
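
The fit itself takes a couple of lines. Below is a minimal sketch of the log-log regression; the times are synthetic values generated from an exact power law (they are not the actual records), and the names dist_m and time_s as well as the constant 0.06 are assumptions made for the illustration.

```r
# Distances in meters; times are SYNTHETIC, built from an exact power law
dist_m <- c(100, 200, 400, 800, 1500, 5000, 10000, 42195, 100000)
time_s <- 0.06 * dist_m^1.11

# Linear regression of log(time) on log(distance): the slope is the power
fit <- lm(log(time_s) ~ log(dist_m))
slope <- unname(coef(fit)[2])
print(round(slope, 2))
```

On these synthetic values the slope recovers the exponent used to generate them; on the real records the points scatter slightly around the line, and the slope is the estimate of the power.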

Another type of race consists in running backwards (or retrorunning). The linear link is similar

with a slightly larger power

T\propto D^{1.13}

So it gets harder to run longer distances backwards than forwards…

It would be interesting to compare the powers for other sports like swimming and cycling.


Triathlon in three colors

Posted in Dataset, Sport by Julyan Arbel on 23 November 2010

Jérôme Lê and I are planning to swim/bike/run the Paris triathlon next July. Before beginning the training, we want to know where to concentrate our efforts. Let us look at some data.

The race distance is known as Intermediate, or Standard, or Olympic distance, with 1.5 km swim, 40 km ride and 10 km run. Data for the 2010 Open race (i.e. not the Elite race) can be found on a site of running race results called Ipitos, after free registration. It consists of 1412 finisher times, for the three parts of the race. Gender is available. Histograms normalized as probabilities are as follows, for time in minutes:



Times for swimming are shorter than for the two other parts (resp. 30, 70 and 50 minutes on average). The largest standard deviation is for cycling (resp. 4, 8 and 7 minutes). So the largest differences in time are made in this part of the race.

It appears that the skew is positive for the three parts of the race, which is usual for that kind of event: it is open to everyone, and most newcomers end up swelling the right tail. The cycling histogram is the most skewed (resp. .5, 1.3 and .9). We can see that with boxplots and density estimates. These are done with centered data:

As expected, no outlier is found on the left of the distributions: this is the “no-superman” effect. On the contrary, outliers crowd the other side of the boxes: the “newcomer” effect.
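
The summary statistics quoted above (mean, standard deviation and skewness) are easy to compute. Here is a small sketch, assuming the usual moment-based definition of skewness (third central moment divided by the cubed standard deviation); the helper skewness and the toy vectors are made up for the illustration and are not the race data.

```r
# Moment-based sample skewness: third central moment over (second central moment)^(3/2)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric <- c(-1, 0, 1)        # toy data, symmetric around 0
right_tailed <- c(0, 0, 0, 4)   # toy data with one slow "newcomer"

print(skewness(symmetric))
print(skewness(right_tailed))
```

A symmetric sample has zero skewness, while a few slow finishers on the right pull it above zero, which is the pattern seen in the three disciplines.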

As an aside I have plotted the normalized 3 dimensional data in a square array, with squares of a color defined by data in the RGB model. Sampling 1024 of the 1412 finishers, this provides this (pointless) Richter-like plot:
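
A plot of this kind can be sketched in base R as follows. This is only an illustration under assumptions: the times are simulated from normal distributions with the means and standard deviations quoted above (so they are not the Ipitos data), and the mapping swim/bike/run to red/green/blue and the helper rescale01 are choices made here.

```r
set.seed(1)
n <- 1412
# SIMULATED times in minutes, not the real results
times <- data.frame(swim = rnorm(n, 30, 4),
                    bike = rnorm(n, 70, 8),
                    run  = rnorm(n, 50, 7))

# Rescale each discipline to [0, 1] so that it can be used as a colour channel
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

idx <- sample(n, 1024)                       # 1024 of the 1412 finishers
cols <- rgb(rescale01(times$swim[idx]),      # red   = swim
            rescale01(times$bike[idx]),      # green = bike
            rescale01(times$run[idx]))       # blue  = run
colour_grid <- matrix(cols, nrow = 32, ncol = 32)

# One coloured square per finisher, on a 32 x 32 array
plot(as.raster(colour_grid), interpolate = FALSE)
```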


The following triangle is obtained as in this post:

The fact that the point cloud is on the left illustrates the massive skewness of cycling. The few points outside the cloud correspond to poor performers in the corresponding sport, with swimming at the bottom left, cycling at the bottom right, and running at the top. For example, the three light green points are lousy bikers, but rather good at swimming and running.

Random Colours (part 2)

Posted in Art, Dataset by Pierre Jacob on 22 September 2010

In this previous post, Julyan presented the paintings of Gerhard Richter, and asked whether the colours were really “randomly chosen”, as claimed by the painter. To answer the question from a statistical point of view (i.e. whether the colours are uniformly distributed in the (r,g,b) space or in the (x, y, r, g, b) space for instance, where x, y is the position and r, g, b the 3 colour components), we need to extract the data. Let’s take for example the following 1024 colours painting.

The data corresponding to this painting would be a 32*32 table, and in each cell of the table there would be a colour, represented for instance by 3 numbers, like in the RGB colour model. Tonight I’ve made a Python script that extracts this data, with Julien‘s help. I took the marginal mean colour along both axes and converted it into grey scale. This gives two lines with white segments and grey segments. From that it is easy to find the middle of the segments, which gives the squares’ centres. Once the square centres were found, I simply took the mean colour of a smaller square around each centre.
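
To make the procedure concrete, here is a sketch of the same idea in R (not the Python script itself), run on a small synthetic image instead of a real painting. Everything in it is an assumption made for the illustration: the 4*4 layout, the white gaps, the threshold 0.99 and the helper segment_centres.

```r
# Synthetic "painting": k x k coloured squares of side s, separated by white gaps of width g
set.seed(42)
k <- 4; s <- 10; g <- 3
size <- k * s + (k + 1) * g
img <- array(1, dim = c(size, size, 3))          # white background, values in [0, 1]
true_cols <- array(runif(k * k * 3, 0, 0.8), dim = c(k, k, 3))
starts <- g + (0:(k - 1)) * (s + g) + 1          # first pixel of each square
for (i in 1:k) for (j in 1:k) for (ch in 1:3)
  img[starts[i]:(starts[i] + s - 1), starts[j]:(starts[j] + s - 1), ch] <- true_cols[i, j, ch]

# Centres of the non-white segments of a 1-D grey profile
segment_centres <- function(profile, white = 0.99) {
  runs <- rle(profile < white)                   # TRUE runs are the grey segments
  ends <- cumsum(runs$lengths)
  begins <- ends - runs$lengths + 1
  round((begins[runs$values] + ends[runs$values]) / 2)
}

grey <- apply(img, c(1, 2), mean)                # simple grey scale: mean of the 3 channels
row_centres <- segment_centres(rowMeans(grey))   # marginal mean along one axis
col_centres <- segment_centres(colMeans(grey))   # marginal mean along the other axis

# Mean colour of a smaller square (half-width h) around each centre
h <- 2
extracted <- array(NA_real_, dim = c(length(row_centres), length(col_centres), 3))
for (i in seq_along(row_centres)) for (j in seq_along(col_centres)) {
  rows <- (row_centres[i] - h):(row_centres[i] + h)
  cols <- (col_centres[j] - h):(col_centres[j] + h)
  extracted[i, j, ] <- apply(img[rows, cols, , drop = FALSE], 3, mean)
}
print(max(abs(extracted - true_cols)))           # extraction error on the synthetic image
```

On a photograph of a painting the gaps are not perfectly white and the squares not perfectly uniform, so the threshold and the size of the inner square would need tuning.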

As an output the script creates a BMP file with one pixel per colour (so it’s a tiny image, obviously 32*32 pixels), and an R file with 3 matrices called “R”, “G” and “B”, available here. This format is usually convenient since it’s plain text, but if you want another one just ask me. If we zoom in on the output BMP file we get:

The script is available here if you want to try it or modify it. I fear that there might be a slight mistake in the script because the colours don’t seem to be exactly the same in the output as in the input, but hopefully it’s close enough. The script needs an image to work on, for instance you can try on the pictures from this gallery. I tested it on two other pictures:

So now we have the data for three pictures (10, 192 and 1024 colours), and we can start to do some real stats. Are we going to find the same results in the RGB model as in the HSV model, for instance? If not, which colour model should we use?

To be continued!

