Triathlon data with ggplot2
As Jérôme and I like so much to play with triathlon data, it is a pleasure to see that we are not alone. Christophe Ladroue, from the university of Bristol, wrote this post yesterday: An exercise in plyr and ggplot2 using triathlon results, followed by part II, way better than ours, here and here. For example, the time distributions by age, “faceted” by discipline (swim, cycle, run and total), look like this
As the number of participants to the Stratford triathlon (400 or so) is a bit small for this number of age categories, it would be nice to compare with the Paris triathlon results (about 4000).
Here is the rank for the 3 disciplines and for the total time, “colored” by the final quartile (check the full part II post for colors by quartile in the 3 disciplines):
We see that the rank at the swim lag is not much informative for the final performance, all 4 colors being pretty mixed at that stage, and that it is tidied by the cycle lag. It is the longer one, and as such, the more predictive for the final perf. It is nice to see that some of the poor swimmers finally reach the first quartile (in orange). Check those ones whit sawtooth patterns: first quartile at swimming, last cycling, first running, and last at the end!
An interesting thing to do with that kind of sports databases would be to build panel data. As most race websites provide historical data with the participants names and age, identification is possible. It is the case for Ipitos, or for Paris 20 km race, with data from 2004 to 2010 (and soon 2011). Remains to check if enough people compete in all the races in a row, my guess is that the answer is yes. The next steps would be to study the impact of the age on the progress, and on the way ones manages the effort from the beginning to the end of the race (thanks to intermediate times in running races, or discipline times in triathlon). Well, maybe in a later post.