Jérôme Lê and I are planning to swim/bike/run the Paris triathlon next July. Before beginning the training, we want to know where to concentrate our efforts. Let us look at some data.
The race distance is known as Intermediate, Standard, or Olympic distance, with a 1.5 km swim, a 40 km ride and a 10 km run. Data for the 2010 Open race (i.e. not the Elite race) can be found on a race results site called Ipitos, after free registration. It consists of 1412 finisher times for each of the three parts of the race. Gender is also available. Histograms normalized as probabilities are as follows, with times in minutes:
Swimming times are shorter than those of the two other parts (about 30, 70 and 50 minutes on average for swimming, cycling and running, respectively). The largest standard deviation is for cycling (4, 8 and 7 minutes, respectively), so the largest time differences between finishers are made in this part of the race.
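To make the comparison concrete, here is a minimal sketch of how such means and standard deviations can be computed. The Ipitos data is not reproduced here, so the `times` dictionary below is a hypothetical stand-in, simulated with roughly the figures quoted above; only the `summarize` step is the point.

```python
import random
import statistics

# Hypothetical stand-in for the Ipitos results: the real finisher times are
# not reproduced here, so we simulate times (in minutes) with roughly the
# means and standard deviations quoted in the text.
random.seed(0)
times = {
    "swim": [random.gauss(30, 4) for _ in range(1412)],
    "bike": [random.gauss(70, 8) for _ in range(1412)],
    "run": [random.gauss(50, 7) for _ in range(1412)],
}

def summarize(sample):
    """Return (mean, sample standard deviation) of a list of times."""
    return statistics.mean(sample), statistics.stdev(sample)

summary = {part: summarize(sample) for part, sample in times.items()}
for part, (mean, sd) in summary.items():
    print(f"{part}: mean = {mean:.1f} min, sd = {sd:.1f} min")

# The part with the largest spread is where finishers differ the most.
widest = max(summary, key=lambda part: summary[part][1])
print("largest spread:", widest)
```

With the real data, `times` would simply hold the three columns of the results table.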
It appears that the skew is positive for the three parts of the race, which sounds usual for that kind of event: it is open to everyone, and the many newcomers fatten the right tail. The cycling histogram is the most skewed (skewness of .5, 1.3 and .9, respectively). We can see that with boxplots and density estimates, drawn here with centered data:
As expected, no outlier is found on the left of the distributions: this is the “no-superman” effect. On the contrary, the other side of the boxes is overcrowded with outliers: the “newcomer” effect.
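The skewness figures and the one-sided outliers can be checked with a few lines. This is only a sketch: the `bike` sample is a hypothetical right-skewed stand-in for the real cycling times, the skewness is the plain moment-based one, and outliers are counted with the usual 1.5 × IQR whisker rule, which I assume is what the boxplots use.

```python
import random
import statistics

def skewness(sample):
    """Moment-based skewness: mean of the cubed standardized deviations."""
    m = statistics.fmean(sample)
    s = statistics.pstdev(sample)
    return sum(((x - m) / s) ** 3 for x in sample) / len(sample)

def tukey_outliers(sample):
    """Count points beyond the boxplot whiskers (1.5 * IQR rule), left and right."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    low = sum(x < q1 - 1.5 * iqr for x in sample)
    high = sum(x > q3 + 1.5 * iqr for x in sample)
    return low, high

# Hypothetical right-skewed cycling times (minutes), not the real data.
random.seed(1)
bike = [55 + random.lognormvariate(2.6, 0.5) for _ in range(1412)]

# Centering shifts the sample but changes neither skewness nor outlier counts.
center = statistics.fmean(bike)
centered = [x - center for x in bike]

low, high = tukey_outliers(centered)
print(f"skewness = {skewness(centered):.2f}")
print(f"outliers: {low} on the left, {high} on the right")
```

A positive skewness together with outliers on the right only is the pattern described above.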
As an aside, I have plotted the normalized three-dimensional data in a square array, where each square has the color defined by the data in the RGB model. Sampling 1024 of the 1412 finishers gives this (pointless) Richter-like plot:
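For the curious, the construction can be sketched as follows. Several things here are assumptions: the finisher times are simulated, "normalized" is taken to mean a min-max rescaling to 0..255, and the swim/bike/run to R/G/B assignment is just one possible choice.

```python
import random

def to_channel(values):
    """Min-max rescale a list of times to integers in 0..255."""
    lo, hi = min(values), max(values)
    return [round(255 * (v - lo) / (hi - lo)) for v in values]

random.seed(4)
# Hypothetical finisher times in minutes (swim, bike, run), not the real data.
swim = [random.gauss(30, 4) for _ in range(1412)]
bike = [random.gauss(70, 8) for _ in range(1412)]
run = [random.gauss(50, 7) for _ in range(1412)]

# One (R, G, B) triple per finisher; the sport-to-channel mapping is a choice.
colors = list(zip(to_channel(swim), to_channel(bike), to_channel(run)))

# Sample 1024 finishers without replacement and lay them out as a 32 x 32 grid.
sample = random.sample(colors, 32 * 32)
grid = [sample[row * 32:(row + 1) * 32] for row in range(32)]
```

Each cell of `grid` can then be drawn as a square filled with its color.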
The following triangle is obtained as in this post:
The fact that the point cloud sits on the left illustrates the massive skewness of cycling. The few points outside the cloud correspond to poor performers in the corresponding sport, with swimming at the bottom left, cycling at the bottom right, and running at the top. For example, the three light green points are lousy bikers who are rather good at swimming and running.
Thanks to Pierre, we now have a new playground for spatial stats; see this post. Before that, let’s see what basic things we can find without spatial information.
The data consist of three 32*32 tables, R, G and B, of numbers between 0 and 255. Certainly, the tables should be considered together as a single 32*32 table of (r,g,b) vectors. Still, the first basic thing to do is to plot three separate histograms for R, G and B:
compared with uniformly simulated data:
We see that the painter has a bias for darker colors, and rather misses light green and light blue ones.
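The comparison with uniformly simulated data can be sketched like this. The real tables R, G and B are not reproduced here, so `dark_channel` is a hypothetical dark-biased stand-in; the number of bins is also an arbitrary choice.

```python
import random

def channel_histogram(values, n_bins=8):
    """Proportion of values (integers in 0..255) in each of n_bins equal bins."""
    width = 256 / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v // width), n_bins - 1)] += 1
    return [c / len(values) for c in counts]

random.seed(2)
# Uniformly simulated channel: 32*32 = 1024 integers in 0..255.
uniform_channel = [random.randint(0, 255) for _ in range(32 * 32)]
# Hypothetical dark-biased channel, standing in for a real table R, G or B.
dark_channel = [int(255 * random.random() ** 2) for _ in range(32 * 32)]

for name, channel in [("uniform", uniform_channel), ("dark-biased", dark_channel)]:
    print(name, [round(p, 2) for p in channel_histogram(channel)])
```

A bias for darker colors shows up as extra mass in the first bins, compared with the roughly flat uniform histogram.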
Then, what can we do to represent (r,g,b) vectors? I guess that a good visualization is the color triangle:
A few words to explain where it comes from. Say the (r,g,b) data is normalized in the unit cube. Then the corners of the color triangle correspond to (plain) red, green and blue, from bottom left, to bottom right, to top. It is a section of the cube, perpendicular to the diagonal joining two opposite points, black (0,0,0) and white (1,1,1), each of which is equidistant from the three corners. This triangle is said to be a simplex: the coordinates of any of its points sum to 1. Now the data in the triangle is obtained as the (r,g,b) points, divided by (r+g+b). It took me a while to compute the coordinates (x,y) of those points in a basis of the triangle (I did that stuff more easily back in high school!). It should give something like this:
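The change of coordinates described above can be sketched as follows. The placement of the corners, red at (0, 0), green at (1, 0) and blue at (1/2, √3/2), is my assumption for an equilateral triangle with unit sides; it matches the bottom left, bottom right, top layout, but any similar triangle would do.

```python
import math

def to_triangle(r, g, b):
    """Map an (r, g, b) point to (x, y) coordinates in the color triangle.

    Assumed corners: red at (0, 0), green at (1, 0), blue at (1/2, sqrt(3)/2).
    """
    total = r + g + b
    if total == 0:
        raise ValueError("black (0, 0, 0) has no projection on the simplex")
    # Divide by r + g + b so that the three weights sum to 1.
    g_w, b_w = g / total, b / total
    # Weighted average of the corners; red sits at the origin, so it drops out.
    x = g_w + 0.5 * b_w
    y = (math.sqrt(3) / 2) * b_w
    return x, y

print(to_triangle(255, 0, 0))     # plain red: bottom left corner
print(to_triangle(0, 255, 0))     # plain green: bottom right corner
print(to_triangle(0, 0, 255))     # plain blue: top corner
print(to_triangle(128, 128, 128)) # any gray: center of the triangle
```

Since only the proportions matter, the function works equally well with values in 0..255 or in the unit cube, and all grays (white included) land on the center of the triangle.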
What do we see? The colors do not look uniformly distributed, because 1) points are much more concentrated in the center, and 2) the painter favoured red colors in comparison with green ones (very few in the bottom right corner) and, to a lesser extent, blue ones. An argument against point 1) could be that projecting (r,g,b) points onto the simplex naturally implies a higher density in the center. That is right, but it would not be that dense, as we see with uniformly simulated data:
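The projection effect itself is easy to check by simulation. The sketch below compares points drawn uniformly in the RGB cube and then projected, with points drawn uniformly on the triangle itself; the corner placement and the radius of 0.1 around the center are arbitrary choices of mine.

```python
import math
import random

def project(r, g, b):
    """(r, g, b) -> (x, y), triangle corners at (0,0), (1,0), (1/2, sqrt(3)/2)."""
    total = r + g + b
    return (g + 0.5 * b) / total, (math.sqrt(3) / 2) * b / total

def share_near_center(points, radius=0.1):
    """Fraction of projected points within `radius` of the triangle's center."""
    cx, cy = 0.5, math.sqrt(3) / 6
    near = sum(math.hypot(x - cx, y - cy) <= radius for x, y in points)
    return near / len(points)

random.seed(3)
n = 10_000
# Uniform in the RGB cube, then projected onto the triangle.
cube = [project(random.random(), random.random(), random.random())
        for _ in range(n)]
# Uniform on the triangle itself: normalized exponentials are Dirichlet(1, 1, 1).
flat = [project(*(random.expovariate(1.0) for _ in range(3)))
        for _ in range(n)]

print(f"uniform in cube, projected: {share_near_center(cube):.3f}")
print(f"uniform on the triangle:    {share_near_center(flat):.3f}")
```

The first share comes out larger than the second, which is the "natural" concentration in the center; the question is then whether the painting's colors are even more concentrated than that.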
So colors are not uniform in the RGB model. There should be a cognitive interpretation out there, no? It is not obvious that human eyes perceive colors on the same scale as the RGB model does. If they do not, there is no reason for human sight to perceive uniformity in the same way as a computer. As Pierre pointed out, what we find in the RGB model might be different in the HSV model. We’ll look at this model later.
Next step, spatial autocorrelation tests?