Categorical distribution, structure of the second kind and Gumbel-max trick

Hi all,

This post is about a way of sampling from a Categorical distribution, which appears in Arthur Dempter‘s approach to inference as a generalization of Bayesian inference (see Figure 1 in “A Generalization of Bayesian Inference”, 1968), under the name “structure of the second kind”. It’s the starting point of my on-going work with Ruobin Gong and Paul Edlefsen, which I’ll write about on another day. This sampling mechanism turns out to be strictly equivalent to the “Gumbel-max” trick that got some attention in machine learning see e.g. this blog post by Francis Bach.

Continue reading “Categorical distribution, structure of the second kind and Gumbel-max trick”

Course on Couplings and Monte Carlo

Hi everyone,

This short post is just to point to a course on “Couplings and Monte Carlo”, available here Versions of the course were given in Université Paris-Dauphine in February 2020 (thanks Robin Ryder and Christian P. Robert), at the University of Bristol in March 2020 (thanks Anthony Lee) and at the University of Torino for the M.Sc. in Stochastics and Data Science in May 2020 (thanks Matteo Ruggiero). I am grateful to these colleagues and their institutions for supporting this course. The course website points to about 100 pages of lecture notes, and 16 videos are available on youtube. It is intended for advanced undergraduate students or graduate students, with some previous exposure to Monte Carlo methods. This is work in progress, and as I am hoping to develop the course over the coming years, feedback would be much welcome.

Follow-up on my previous post on covid deaths in France

In my previous post, I compared two sources of data regarding death counts (one from SPF, Santé Publique France, for covid deaths in hospitals; one from INSEE, for all-cause deaths), in order to get a better idea of the actual death toll of covid-19 in France.

In this post, I would like to do the following:

  1. describe briefly a new, richer data-set recently published by INSEE (and do some graphs);
  2. use the updated data (from both sources) to repeat my analysis, with some variants (weekly aggregates, separating men and women);
  3. reply to a few comments I got on LinkedIn and elsewhere;
  4. provide a few pointers regarding death counts in other countries (particularly the UK).

New INSEE data

INSEE now provides every Friday an exhaustive data-set that records, for each death that has occurred since 01-01-2018, the following variables: date of birth, date of death, sex, département of death, and so on. Neat. Let’s take this opportunity to do a few plots, such as this one:

Number of deaths in France each day in 2018, 2019, 2020, until 13th Apr 2020 (as recored in INSEE dataset described above)

(it’s nice to observe this sharp drop) or that one:

Death counts during weeks 13, 14 and 15 per age, for years 2018-20 (same INSEE dataset

The latter plot covers the same period (weeks 13 to 15, 23rd March to 12th April) as in the analysis below. As expected, over-mortality seems to affect mostly people above 60.

Updated analysis

Ok, now let’s repeat my previous analysis, based on merging the SPF data (daily covid death counts in hospitals, in each département and each sex) and the aforementioned INSEE data (all-cause deaths). Except this time:

  1. The overlap between the two datasets now covers more than three weeks (18th March, first date in SPF dataset, to 12th April, latest date in INSEE dataset) so I decided to consider weekly aggregates, for two reasons: they are more stable than daily aggregates, and less affected by artifacts such as delays (e.g. a death occurring during a week-end is reported to the next Monday).
  2. I also separated men and women.
  3. I am going to simplify a bit the model, and simply regress excess deaths (number of deaths in 2020 minus the average over 2018 and 2019) on hospital deaths.

First, a joint plot:

So, to recap, each point in this plot corresponds to a pair of death counts, for each département in France, each week between 13 and 15, and for each sex. The corresponding linear regression (without an intercept) gives a slope estimate of 1.79 (95% confidence interval: [1.73, 1.85]). The basic interpretation would be: in each département, when 100 covid deaths occur in hospitals, the number of covid-related (see below) deaths should be approximately 179. The current total number of covid deaths reported by SPF is 22, 614, which is 60% above that the number of covid deaths in hospitals (14050). So this estimate suggests the actual death toll might be a tad larger. More about the interpretation below.

Now for something more interesting: let’s redo the previous plot, but with a different colour for each sex:

Clearly the two linear trends are different; see below the OLS estimates.

sexslope estimateslope 95% confidence intervalR^2
F2.40[2.30, 2.50]89%
M1.56[1.50, 1.62]90%

What is going on? Well, women tend to live longer than men. And the proportion of women in EHPADS (French retirement homes) is 74%. Since the main reason behind the discrepancy between hospital deaths and excess deaths is covid death occurring in pension homes, these results make sense.

Reply to previous comments

What’s the point?

Fair enough, since 4th April, SPF does include in its total estimate both hospital deaths and pension home deaths, and the proportion of the latter is not too far from my estimate. Note that however that:

  • it’s really hard to estimate properly the number of covid deaths occurring in pension homes. Apparently several pension homes did not provide any data, while others marked as “covid” all the deaths that have occurred after the first covid deaths.
  • My estimate might measure other direct or indirect effects of the pandemic, such as people dying at home, people not receiving proper care because the health system is at capacity and so on.
  • The fact that data from two different institutions may be compared, and seem to be somehow consistent, is, in my opinion, a good piece of news which deserves to be reported!

Accounting for car accidents

Boy, that one was popular. Please have a look a the plot on the front page of ONIRS (click on “tués”): yes, the number car-related deaths dropped sharply thanks to the lock-down… But in March of last year, this number was around 250, that is, 1% of the current covid death count. “Fun” fact: this point would have been quite relevant in the 70s! In those years, the number of car-related deaths was about five times larger (18 034 deaths in 1972).

Better predictive model

The idea of comparing the 2020 deaths to the average of the two previous years is a bit crude, and demographers have better models to predict death counts based on age repartition and so on. That said, the notion of “excess deaths” seems quite popular in various countries, as I explain below, so I guess that my approach is not so daft after all.

In other countries

To be honest, I was hoping to apply the same approach to the UK, a country where the official estimate is still limited to hospital deaths, and thus clearly quite biased; see e.g. this Guardian paper. Sadly, Public Health England only reports daily hospital death counts … per nation (nation=England, Scotland, Wales, or Northern Ireland). On the other hand, the Office of National Statistics reports every week the number of “excess deaths” (relative to the five year average), and the proportion of these deaths where the word “covid” is mentioned on the death certificate.

Interestingly, the Guardian paper I mentioned above first complains that the UK only reports hospital deaths, and then claims erroneously that the UK is still behind France in terms of covid mortality. It’s not, if you compare in terms of hospital deaths (UK: 20,319 on Saturday, while France: 14,050). The fact that even journalists reporting on this issue may get it wrong seems indicative of how confusing are covid death data.

More generally, my impression is that looking at “excess deaths” makes far more sense for most countries at the moment: it’s easier to measure (albeit with a delay of course), and easier to interpret. This is also more or less the point made by this NYT paper. (Notice how their plot for France only covers January to April; for the complete plot, see my first plot above!).

A quick, preliminary study of COVID death under-reporting in France

I guess I am not the only data scientist who cannot help checking frantically how COVID data evolve daily, looking at this dashboard, or this nice visualisation.

However, case counts per country are not very reliable, given that countries have very different policies regarding testing and so on; see e.g. Nate Silver’s opinion on case counts here.

You would think that death counts are far more reliable. In France, however,
Santé Publique France got criticized for reporting only COVID deaths that occurred in hospitals. Very recently, they started to include also deaths that occurred in retirement homes. However, they do so only at the national level (current count as of April 12th: 13832; 66% from hospitals). At a finer level (i.e. “régions” or “départements”), the data they provide (here) remains restricted to hospitals.

INSEE (French institute of official Statistics) decided to publish at the same time daily death counts at the département level. Note that INSEE is not a public health institute; the death counts they report are for all deaths, whatever the cause. See also this authoritative post (in French) explaining the challenges behind death counts reporting. In case, you wonder, a “département” is a regional unit (we have about 100 of those), see this wikipedia article.

I decided to compare both datasets using a very, very simple methodology. First, I merged both datasets, so as to obtain, for each département, and each day with a certain period:

  • the number of covid deaths reported in a hospital, call it h_{it} (where i is the département, t is the day);
  • the total number of deaths d_{it} (whatever the cause) on the same day t, in département i;
  • the total number of death d_{it}^{(19)}, d_{it}^{(18)}, on the same day, respectively one year ago (in 2019), and two years ago (in 2018).

SPF data starts on the 18 of March, and INSEE publishes its data every Friday with a one week delay, so my merged dataset currently covers the period: 18 to 30 of March (13 days). And we have about 90 départements in the dataset; the sample size is 1200.

The model I have in mind is simply: d_{it} = \frac{1}{2}(d_{it}^{(18)}+d_{it}^{(19)}) + \beta \times h_{it} + \varepsilon_{it}.

The first term is a basic predictor of 2020 counts, in case they were no pandemic. It is pretty basic, but counts deaths are quite stable over the years. Granted, there is some variation in winter, due to the flu, but this seems to affect mostly February. For the record, here is a plot of the daily number of deaths in France in 2018, 2019 and 2020, for the period covered by the data:

Daily mortality in France in March and April, in 2018, 2019 and 2020. Source: INSEE

The coefficient \beta of course measure under-reporting.

Thus, I fitted a linear regression model to predict 2020 deaths as a function of the 2018 and 2019 deaths, and the CH deaths (no intercept). Here are the results:

Output of statsmodels.OLS in Python. Variable h_it is denoted dh here.

Look in particular at the estimate of \beta: 1.596 (95% confidence interval: [1.51, 1.68] ). In other words, on average, one should add something between 50% and 70% to the reported number of covid deaths in hospitals to get an estimate of all covid deaths.

I tried other models; for instance by forcing the coefficients of the two years to be exactly equal to one half (how to do this is left as a simple exercise!). I got similar results. I’d like to repeat the analysis on weekly aggregated data. We don’t have yet two full weeks of data, so it’s too early for that. The usual caveats regarding linear regression apply; e.g. there should be some heteroscedasticity, given that the size of département vary significantly.

I will update these results as I get more data. I find it interesting that merging these two datasets already gives results that are reasonable and easy to interpret. In particular, I got similar results using only the first six days that were available one week ago. The secret here is we compensate the small number of days by a large number of “départements”.

I am not an expert on public health data, so I do not want to comment on why SPF reports only hospital data; I guess it is much harder to determine that a death is covid-related outside of a hospital, but again I am out of my depth here.

On the other hand, I think it is commendable that INSEE decided to make their own reporting. Of course, both institutions report different things. But the fact that we are able to compare and combine two sources of data potentially gives a clearer picture.

Comments more welcome. I would be curious in particular to know whether other countries provide this kind of double reporting.

SIR models with Kermack and McKendrick

Time to bake and to read Kermack and McKendrick!


It seems about the right time to read Kermack & McKendrick, 1927, “A contribution to the Mathematical Theory of Epidemics”. It is an early article on the “Susceptible-Infected-Removed” or “SIR” model, a milestone in the mathematical modelling of infectious disease. In this blog post, I will go through the article, describe the model and the data considered by the authors (plague in Bombay in 1905-1906), which will turn out to be a questionable choice. Some references and R code are given at the end of the article. All of this comes with the disclaimer that I have no expertise in epidemiology.

Continue reading “SIR models with Kermack and McKendrick”

Remote seminars


Hi all,

It seems like the current environment is perfect for the growth of remote seminars. Most of them seem to be free to attend, some of them require registration. I’ve collected some links to seminars with topics related to statistics on this page: I will try to keep the page up to dates with more links as new seminars are being created, with an emphasis on topics at least loosely related to the topics usually covered in our blog posts. Don’t hesitate to send links via comments or emails.

Urgent call for modellers to support epidemic modelling


Hi all,

As many scientists who are not usually working in epidemiology are trying to contribute to the fight against the current pandemic while getting magnets stuck in their nose,  the Royal Society has a call here:  for “modellers to support epidemic modelling” with a deadline on April 2nd (5pm British Summer Time).

More details are given here: and they specifically welcome non-UK based scientists.

Statistical inference on MCMC traces


Hi everyone,

and Happy New Year! This post is about some statistical inferences that one can do using as “data” the output of MCMC algorithms. Consider the trace plot above. It has been generated by Metropolis–Hastings using a Normal random walk proposal, with a standard deviation “sigma”, on a certain target. Suppose that you are given a function that evaluates the pdf of that target. Can you retrieve the value of sigma used to generate that chain?

Continue reading “Statistical inference on MCMC traces”

Course on Bayesian machine learning in Paris

Trends for papers appearing in JMLR and at NeurIPS.

Rémi Bardenet and I are starting a new course on Bayesian machine learning at the MVA master (Mathématiques, Vision et Apprentissage) at ENS Paris-Saclay. Details on the syllabus can be found on the MVA webpage and on this Github repository. In this post, I shortly describe what motivated us for proposing this course and provide results of topic modeling of recent machine learning (conference) papers mentioning Bayesian in their abstract.Continue reading “Course on Bayesian machine learning in Paris”

Turning a sum of expectations into a single expectation with geometric series

Screenshot 2020-01-07 at 17.11.07
A technical lemma that is eager to learn how to turn a sum of expectations into a single expectation with geometric series.

At the dawn of 2020, in case anyone in the stat/ML community is not aware yet of Francis Bach’s blog started last year: this is a great place to learn about general tricks in machine learning explained with easy words. This month’s post The sum of a geometric series is all you need! shows how ubiquitous geometric series are in stochastic gradient descent, among others. In this post, I describe just another situation where the sum of a geometric series can be useful in statistics.

Turning a sum of expectations into a single expectation

I also found the sum of geometric series useful for turning a sum of expectations into a single expectation, by the linearity of expectation. More specifically, for a random variable X compactly supported on [-1,1],

\sum_{k=1}^\infty \mathbb{E}[X^k] = \mathbb{E}\big[\sum_{k=1}^\infty X^k\big] = \mathbb{E}\big[\frac{X}{1-X}\big],

where the sum can be infinite. Continue reading “Turning a sum of expectations into a single expectation with geometric series”

%d bloggers like this: