Champions League eight of finals’ draw: what are the odds?

Posted in General, Sport by Julyan Arbel on 11 December 2015


[This is a guest post by my friend and colleague Bernardo Nipoti from Collegio Carlo Alberto, Juventus Turin.]

The group stage of the UEFA Champions League has just finished, and next Monday, 14 December 2015, the draw in Nyon will decide the eight matches that make up the first round of the knockout phase.

As explained on the UEFA website, the rules are simple:

  1. two seeding pots have been formed: one consisting of group winners and the other of runners-up;
  2. no team can play a club from their group or any side from their own association;
  3. due to a decision by the UEFA Executive Committee, teams from Russia and Ukraine cannot meet.

The two pots are:

Group winners: Real Madrid (ESP), Wolfsburg (GER), Atlético Madrid (ESP), Manchester City (ENG), Barcelona (ESP, holders), Bayern München (GER), Chelsea (ENG), Zenit (RUS);
Group runners-up: Paris Saint-Germain (FRA), PSV Eindhoven (NED), Benfica (POR), Juventus (ITA), Roma (ITA), Arsenal (ENG), Dynamo Kyiv (UKR), Gent (BEL).

Given these few constraints, are some matches more likely to be drawn than others? For example, supporters of Barcelona might wonder whether the seven possible teams (PSG, PSV, Benfica, Juventus, Arsenal, Dynamo Kyiv and Gent) are all equally likely to be the next opponent of their favorite team.
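One way to get a feel for the question is to enumerate every draw that satisfies the three rules and count how often each pairing occurs. The sketch below rests on two assumptions that are not stated in the post: that the two pots are listed in group order (so the k-th winner and the k-th runner-up come from the same group, which is consistent with Roma being excluded for Barcelona), and that every valid complete draw is equally likely. The actual UEFA procedure draws balls sequentially, so its probabilities can differ somewhat from these. The names allowed, counts and assign_next are illustrative.

```r
winners <- c("Real Madrid", "Wolfsburg", "Atlético Madrid", "Manchester City",
             "Barcelona", "Bayern München", "Chelsea", "Zenit")
w_assoc <- c("ESP", "GER", "ESP", "ENG", "ESP", "GER", "ENG", "RUS")
runners <- c("Paris Saint-Germain", "PSV Eindhoven", "Benfica", "Juventus",
             "Roma", "Arsenal", "Dynamo Kyiv", "Gent")
r_assoc <- c("FRA", "NED", "POR", "ITA", "ITA", "ENG", "UKR", "BEL")

# allowed[i, j] is TRUE when winner i may face runner-up j.
# Assumption: same position in the two lists means same group.
allowed <- outer(1:8, 1:8, function(i, j) {
  i != j &                                         # not from the same group
    w_assoc[i] != r_assoc[j] &                     # not from the same association
    !(w_assoc[i] == "RUS" & r_assoc[j] == "UKR")   # Russia and Ukraine kept apart
})
dimnames(allowed) <- list(winners, runners)

counts <- matrix(0, 8, 8, dimnames = dimnames(allowed))
total <- 0

# Backtracking over winners: give winner i any free, allowed runner-up.
assign_next <- function(i, used, current) {
  if (i > 8) {
    total <<- total + 1
    idx <- cbind(1:8, current)
    counts[idx] <<- counts[idx] + 1
    return(invisible(NULL))
  }
  for (j in which(allowed[i, ] & !used)) {
    used[j] <- TRUE
    current[i] <- j
    assign_next(i + 1, used, current)
    used[j] <- FALSE
  }
}
assign_next(1, rep(FALSE, 8), integer(8))

# Share of valid draws in which each pairing appears
probs <- counts / total
round(probs["Barcelona", ], 3)
```

Under the uniform assumption, unequal entries in a row of probs mean that the constraints alone make some opponents more likely than others.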

Power-laws: choose your x and y variables carefully

Posted in R, Sport by Julyan Arbel on 16 November 2011

This is a follow-up to the post Power of running world records.

As suggested by Andrew, plotting running world records could benefit from a change of variables. More exactly the use of different variables sheds light on a [now] well-known [to me] sports result provided in a 2000 Nature paper by Sandra Savaglio and Vincenzo Carbone (thanks Ken): the dependence between time and distance in log-log scale is not linear on the whole range of races, but piecewise linear. There is one break-point around time 2’33’’ (or equivalently distance around 1100 m). As mentioned in the article, this threshold corresponds to a physiological critical change in the athlete’s energy expenditure: in short races (less than 1000 m) the effort follows an anaerobic metabolism, whereas it switches to aerobic metabolism for middle and long distances (or longer…). Interestingly, the energy is more efficiently consumed in the second regime than in the first: the decay in speed slows down for endurance races.

The reason for this visual difference is simple. Denote distance, time and speed by D, T and S. I had plotted the log T ~ log D relation, which gave T\propto D^{\alpha} with \alpha=1.11. When the speed S=D/T is used as one of the variables, the relations become S\propto D^{\gamma} and S\propto T^{\beta} with \gamma=1-\alpha and \beta=\frac{1}{\alpha}-1\approx 1-\alpha to first order, because \alpha is close to 1. With the Nature paper's findings (with the opposite sign convention), the two \betas are \beta_{\text{an}}=-0.165 (anaerobic) and \beta_{\text{ae}}=-0.072 (aerobic), ie \alpha_{\text{an}}=1.20 and \alpha_{\text{ae}}=1.08. My improper \alpha=1.11 is indeed in between. The ratio of the two slopes is much larger on a plot involving the speed (larger than 2), which clearly shows the two regimes, than on my original plot (about 1.1, ie a difference of some 10%), which is why the latter appears almost linear (although in hindsight, and with good goggles, two lines might have been detected).
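The conversion between the two parametrizations is easy to check numerically. The short sketch below only uses the two exponents quoted from the Nature paper; the variable names are illustrative.

```r
# Exponents of S ∝ T^beta quoted from the Nature paper
beta <- c(anaerobic = -0.165, aerobic = -0.072)

# Invert beta = 1/alpha - 1 to recover the exponent of T ∝ D^alpha
alpha <- 1 / (1 + beta)
round(alpha, 2)

# Contrast between the two regimes on each kind of plot
ratio_speed_plot <- beta[["anaerobic"]] / beta[["aerobic"]]    # slopes of log S ~ log T
ratio_time_plot  <- alpha[["anaerobic"]] / alpha[["aerobic"]]  # slopes of log T ~ log D
c(speed = ratio_speed_plot, time = ratio_time_plot)
```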

Below is the S ~ log D relation, on which it clearly appears that the 100 m and 100 km races are two outliers. It takes time to make up for the time lost at the start of the race (the 100 m and 200 m are run at the same speed…), whereas the 100 km suffers from a lack of interest among athletes. Achim Zeileis also provides an extended world records table and R code in his comment.

As an aside, Andrew and Cosma Shalizi also comment and resolve an ambiguity of mine: one usually speaks about power-laws without much precision of context, but there are mainly two separate sets of power-law models. Either power-law regressions, where you plot y~x for two different variables (this is the case here); or power-law distributions, ie the probability distribution of a single variable x is p(x)\propto x^{-a}, or extensions of that (with lots of natural examples, ranging from the size of cities to the number of deaths in attacks in wars).

How should you run a half-marathon or a 20 km race?

Posted in French, Sport by Jérôme Lê on 28 October 2011

On Sunday 9 October 2011, I ran the 20 km de Paris. On the starting line, as we waited in the cold and the rain, a young and dynamic “personal coach” gave us some advice to follow during the race. The main tip was to start slowly and speed up towards the end. For people running their first race, the advice is probably justified, to avoid any trouble or dropping out. But is it really the optimal strategy?


Triathlon data with ggplot2

Posted in Dataset, Sport by Julyan Arbel on 11 October 2011

As Jérôme and I like playing with triathlon data so much, it is a pleasure to see that we are not alone. Christophe Ladroue, from the University of Bristol, wrote this post yesterday: An exercise in plyr and ggplot2 using triathlon results, followed by part II, way better than ours, here and here. For example, the time distributions by age, “faceted” by discipline (swim, cycle, run and total), look like this:

As the number of participants in the Stratford triathlon (400 or so) is a bit small for this number of age categories, it would be nice to compare with the Paris triathlon results (about 4000).

Here is the rank for the 3 disciplines and for the total time, “colored” by the final quartile (check the full part II post for colors by quartile in the 3 disciplines):

We see that the rank at the swim leg is not very informative about the final performance, all 4 colors being pretty mixed at that stage, and that it is tidied up by the cycle leg. That leg is the longest one and, as such, the most predictive of the final performance. It is nice to see that some of the poor swimmers finally reach the first quartile (in orange). Check the ones with sawtooth patterns: first quartile at swimming, last at cycling, first at running, and last at the end!

An interesting thing to do with this kind of sports database would be to build panel data. As most race websites provide historical data with the participants' names and ages, identification is possible. This is the case for Ipitos, or for the Paris 20 km race, with data from 2004 to 2010 (and soon 2011). It remains to check whether enough people compete in all the races in a row; my guess is that the answer is yes. The next steps would be to study the impact of age on progress, and on the way one manages the effort from the beginning to the end of the race (thanks to intermediate times in running races, or discipline times in triathlon). Well, maybe in a later post.

Power of running world records

Posted in Dataset, R, Sport by Julyan Arbel on 8 August 2011

Following a few entries on sports here and there, I was wondering what kind of law running records follow with respect to distance. The data are available on Wikipedia, or here for a tidied version. They cover 18 distances, from 100 meters to 100 kilometers. A log-log scale is in order:

It is nice to find a clear power law: the relation between the logarithms of time T and of distance D is linear. Its slope (given by the lm function) defines the power in the following relation:

T\propto D^{1.11}
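The fit can be sketched as follows. The data frame is an assumption: a handful of men's world records as they stood in mid-2011, entered by hand for illustration, not the 18-distance table used in the post, so the slope need not match 1.11 exactly. The names records and fit are illustrative.

```r
# Illustrative subset of men's world records as of mid-2011
# (assumption: entered by hand, not the post's 18-distance table)
records <- data.frame(
  distance = c(100, 200, 400, 800, 1500, 5000, 10000, 42195),              # meters
  time     = c(9.58, 19.19, 43.18, 101.01, 206.00, 757.35, 1577.53, 7439)  # seconds
)

# Linear regression on the log-log scale: log T = intercept + alpha * log D
fit <- lm(log(time) ~ log(distance), data = records)

# The slope is the power alpha in T ∝ D^alpha
alpha <- unname(coef(fit)[2])
alpha
```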

Another type of race consists in running backwards (or retrorunning). The linear link is similar

with a slightly larger power

T\propto D^{1.13}

So it gets harder to run longer distances backwards than forwards…

It would be interesting to compare the powers for other sports like swimming and cycling.

Some statistics on the 2011 Paris Triathlon

Posted in Addiction, French, Sport by Jérôme Lê on 12 July 2011

This Sunday, 10 July 2011, Julyan and I tried our hand at the Paris Triathlon. This year the event consisted of a 1.6 km swim in the Seine, a 38.5 km bike ride and a 10 km run. Apart from very dirty and cold water, the experience turned out to be very enjoyable!

As Juju mentions in his previous post, the detailed results of this kind of competition are available for download. We learn that, out of nearly 3000 entrants, only 2342 finished the race and were ranked. The great majority of participants are men (91.55%), of French nationality (87.62%), and mostly aged 30 to 45 (58% of competitors). Compared with other events of this type (half-marathon, marathon, trail…), the average level is higher, with a relatively large share of club-licensed athletes (45.6%). This is probably because triathlon requires buying specific and expensive equipment: a wetsuit for swimming, a high-end racing bike…

As for race times, as one might expect, cycling dominates the other two legs. Out of an average total time of 2h38, it accounts for nearly half of the effort (1h11). However, on closer inspection, it is also the leg whose dispersion is relatively the smallest: its coefficient of variation (= standard deviation/mean) is around 13%, against 16.75% for swimming and 15.55% for running. In other words, relative to the duration of each leg, it is easier to open a gap in swimming than in cycling.
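The coefficient of variation is a one-liner in R. In the sketch below, toy_times is a made-up vector used only to show the call; the second part simply multiplies the two bike figures quoted above (a mean of 1h11, ie 71 minutes, and a coefficient of variation of about 13%) to recover the implied standard deviation.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Hypothetical leg times in minutes (made up for illustration only)
toy_times <- c(60, 70, 80)
cv(toy_times)

# Standard deviation of the bike leg implied by the quoted figures,
# in minutes: coefficient of variation times mean
implied_sd_bike <- 0.13 * 71
implied_sd_bike
```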

(NB: the swim and bike times include the “transition” times from one discipline to the next, ie the 800 m crossing of the bike park and the change of outfit.)

To follow up on Julyan's training advice, let us compare the performances of licensed and unlicensed athletes using the plots below. We first observe that those who train in a club are better on average in every discipline: about 6 minutes faster in the swim, 8 on the bike and 4'30 in the run. These gaps are indeed large, but their interpretation should be put in perspective. They may be due just as well to club training making people improve as to club members being “naturally” better. In other words, this may be a selection effect on people who would have performed well even without training. For example, if you only admit Russian models to your fitness club, it is not certain that the weight gap observed ex post between your members and the average woman is due to your fabulous coaching!

More seriously, if we compare the shapes of the distributions, we observe that the dispersion of performances is roughly the same for amateurs and licensed athletes in swimming and running, up to a translation. For cycling, on the other hand, licensed athletes do achieve better times, but above all their times are much more homogeneous than those of amateurs. One possible interpretation is that club training focuses more on cycling than on the other disciplines. The gaps in swimming and running could then mostly reflect a selection effect, since club practice does not appear to homogenize performances.

The same analysis by gender reveals that the gap between men and women is mostly made on the bike: women are about 8 minutes behind on average, and their performances are much less homogeneous than men's (standard deviation of 11 minutes against 9 for men). In running, the gap is about 5 minutes, but with a similar dispersion. Surprisingly, it is in swimming that the difference between men and women is the smallest. Big arms are not everything, after all!

(NB: the proportion of licensed athletes is the same among men and women.)

Wilcoxon Champagne test

Posted in R, Sport by Julyan Arbel on 14 June 2011

As an appetizer for the Paris triathlon, Jérôme and I ran as a team last weekend in an adventure race in the Champagne region (it mainly consists of running, cycling and canoeing, with a flavor of orienteering, and Champagne is kept for the end). It was organized by Ecole Polytechnique students who, for the first time, divided Saturday’s legs into two parts: in order to reduce congestion on each leg, the odd-numbered teams were to perform part 1, then part 2, and the even-numbered teams the reverse order.

As the results came out, we wondered whether the order had favored one of the groups. A crucial question for us, as we were the only odd-numbered team in the top five. Using ggplot2 and a data frame donnees including time and Group variables, the code for the (non-normalized) histograms of the two groups (even: 0, odd: 1) looks like this:

library(ggplot2)  # provides qplot()
# One histogram panel per group, with 2-hour bins
qplot(time, data = donnees, geom = "histogram", binwidth = 2,
      colour = Group, facets = Group ~ ., xlim = c(0, 30))

There are roughly the same number of teams in each group (36 and 38). Time is in hours; the effective racing time is around 12 to 15 hours, but penalties and bonuses are added to or subtracted from it, which explains total times between 5 and 30 hours. The overall impression is that the even group has a flatter histogram, and that it might be slightly more to the left than the odd one. To test this last hypothesis, I used a non-parametric test, the Wilcoxon test (or Mann-Whitney test): the null hypothesis is that the distributions of the two groups (say time0 and time1) do not differ by a location shift, and the alternative is that they differ by some non-zero location shift.

> wilcox.test(time0,time1, paired=FALSE)

	Wilcoxon rank sum test

data:  time0 and time1
W = 640, p-value = 0.64
alternative hypothesis: true location shift is not equal to 0

The p-value is greater than 0.05, so we cannot reject the null hypothesis: there is no significant location shift in time. This test is certainly not the most appropriate one, since the two distributions do not have the same shape.
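As a sanity check, the reported p-value can be roughly recovered from W and the group sizes alone, using the normal approximation to the rank sum statistic. This is only a sketch: it assumes no ties and applies a continuity correction, whereas wilcox.test may well have computed an exact p-value for samples of this size. The variable names are illustrative.

```r
n0 <- 36; n1 <- 38   # group sizes quoted in the post (which is which does not matter here)
W  <- 640            # statistic reported by wilcox.test

# Mean and standard deviation of W under the null hypothesis (no ties assumed)
mu    <- n0 * n1 / 2
sigma <- sqrt(n0 * n1 * (n0 + n1 + 1) / 12)

# Two-sided p-value from the normal approximation with continuity correction
z <- (abs(W - mu) - 0.5) / sigma
p_approx <- 2 * pnorm(z, lower.tail = FALSE)
round(p_approx, 2)
```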

Triathlon in three colors

Posted in Dataset, Sport by Julyan Arbel on 23 November 2010

Jérôme Lê and I are planning to swim/bike/run the Paris triathlon next July. Before beginning the training, we want to know where to concentrate our efforts. Let us look at some data.

The race distance is known as Intermediate, Standard or Olympic distance, with a 1.5 km swim, a 40 km ride and a 10 km run. Data for the 2010 Open race (ie not the Elite race) can be found on Ipitos, a site for running race results, after free registration. They consist of 1412 finisher times for the three parts of the race. Gender is available. Histograms normalized as probabilities are as follows, for time in minutes:



Swim times are shorter than those of the other two parts (resp. 30, 70 and 50 minutes on average). The largest standard deviation is for cycling (resp. 4, 8 and 7 minutes), so the largest time differences are made in this part of the race.

It appears that the skew is positive for the three parts of the race, which is usual for that kind of event: it is open to everyone, and newcomers fatten the right tail. The cycling histogram is the most skewed (resp. .5, 1.3 and .9). We can see this with boxplots and density estimates, done here with centered data:

As expected, no outlier is found on the left of the distributions: this is the “no-superman” effect. On the contrary, outliers crowd the other side of the boxes: the “newcomer” effect.

As an aside I have plotted the normalized 3 dimensional data in a square array, with squares of a color defined by data in the RGB model. Sampling 1024 of the 1412 finishers, this provides this (pointless) Richter-like plot:


The following triangle is obtained as in this post:

The fact that the point cloud is on the left illustrates the massive skewness of cycling. The few points outside the cloud correspond to poor performers in the corresponding sport, with swimming at the bottom left, cycling at the bottom right, and running at the top. For example, the three light green points are lousy bikers, but rather good at swimming and running.

Penalty shootout in FIFA World Cups

Posted in Sport by Julyan Arbel on 10 July 2010

While looking for a post on octopus Paul, Jérôme told me about this surprising football stat:

the team which begins a penalty shootout is more likely to win than the team shooting second.

Here is a painful memory: Trézeguet’s shot against the crossbar… And yes, Italy shot first!

What do data say about that?

Here are results on penalty shootouts from five main tournaments over the past decades: the World Cup, the European Championships, the Copa América, the African Cup of Nations and the Asian Cup. The trouble is that a crucial point is missing: there is no mention of which team shot first. I managed to gather 14 data points in the following way (for the World Cups only): either by using the number-of-penalties-taken column (which provides the order when the numbers differ), or by watching the relevant YouTube video. The outcomes are shown below (please comment if I misinterpreted any result, or if you know a clever way to get more data; I am no football scholar!): the team shooting first won 11 (out of 14) times!

Now, the question is to see whether this difference is statistically significant or not.

A simple model is the following: the random variable X_i is 1 when the match labelled by i is won by the first team to shoot, and is 0 otherwise (n=14,\,i=1,\dots,n). Denote by p the probability that X_i=1. Then, testing H_0 :\,p=1/2 against H_1 :\,p\neq 1/2 can be done via a \chi^2 test (or a Wald test, with the same statistic in the case of the Bernoulli model).

Under hypothesis H_0, the statistic T_n=4n(\bar X-1/2)^2 follows a \chi^2(1) distribution asymptotically. Here \bar X=11/14, so T_n=4.57. Since this exceeds the 95% quantile of a \chi^2(1) variable, q=3.84, we can state that the probability of success is significantly higher for the first team.
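The computation is short enough to reproduce in R. The sketch below uses only the counts given in the post (11 wins for the first shooter out of 14 shootouts); the exact binomial test at the end is not part of the original analysis and is added as a small-sample cross-check.

```r
n <- 14            # shootouts collected in the post
wins_first <- 11   # won by the team shooting first

# Test statistic T_n = 4 n (x_bar - 1/2)^2 and its asymptotic chi-square(1) reference
x_bar <- wins_first / n
T_n <- 4 * n * (x_bar - 1/2)^2
q95 <- qchisq(0.95, df = 1)
p_asymptotic <- pchisq(T_n, df = 1, lower.tail = FALSE)

# Exact binomial test of p = 1/2 (cross-check, not in the original post)
p_exact <- binom.test(wins_first, n, p = 0.5)$p.value

c(T_n = T_n, q95 = q95, p_asymptotic = p_asymptotic, p_exact = p_exact)
```

With only 14 shootouts the asymptotic approximation is rough, so the exact p-value is worth reading alongside the \chi^2 one.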

Any explanation? We can guess that the second team is under more pressure than the first, and fails more often when trying to equalize. Indeed, a player whose shot wins the match for his team if he scores converts 93% on average, against 52% or so when a miss means defeat…

PS: should Spain – Netherlands end in a penalty shootout, football analysts say it would be in Spain's interest. Indeed, the Netherlands are among the 5 worst nations at this (along with the UK). What if the Netherlands win the coin toss?

3 second-in winners:

Year  Round      Winner        Loser        Score     Shootout  Penalties taken
1990  SF         Argentina     Italy        1-1       4-3       4 & 5
1990  SF         West Germany  England      1-1       4-3       4 & 5
1994  Final      Brazil        Italy        0-0       3-2       4 & 5

against 11 first-in winners:

Year  Round      Winner        Loser        Score     Shootout  Penalties taken
1986  QF         West Germany  Mexico       0-0       4-1       4 & 3
1998  Last 16    Argentina     England      2-2       4-3       5 each
1998  QF         France        Italy        0-0       4-3       5 each
1998  SF         Brazil        Netherlands  1-1       4-2       4 each
2002  QF         South Korea   Spain        0-0       5-3       5 & 4
2006  Qualifier  Australia     Uruguay      0-1, 1-0  4-2       5 & 4
2006  Last 16    Ukraine       Switzerland  0-0       3-0       4 & 3
2006  QF         Germany       Argentina    1-1       4-2       4 each
2006  QF         Portugal      England      0-0       3-1       5 & 4
2006  Final      Italy         France       1-1       5-3       5 & 4
2010  Last 16    Paraguay      Japan        0-0       5-3       5 & 4