A quick, preliminary study of COVID death under-reporting in France

I guess I am not the only data scientist who cannot help checking frantically how COVID data evolve daily, looking at this dashboard, or this nice visualisation.

However, case counts per country are not very reliable, given that countries have very different policies regarding testing and so on; see e.g. Nate Silver’s opinion on case counts here.

You would think that death counts are far more reliable. In France, however, Santé Publique France (SPF) was criticized for reporting only COVID deaths that occurred in hospitals. Very recently, they also started to include deaths that occurred in retirement homes. However, they do so only at the national level (count as of April 12th: 13832, of which 66% from hospitals). At a finer level (i.e. “régions” or “départements”), the data they provide (here) remain restricted to hospitals.

INSEE (the French institute of official statistics) decided at the same time to publish daily death counts at the département level. Note that INSEE is not a public health institute; the death counts they report are for all deaths, whatever the cause. See also this authoritative post (in French) explaining the challenges behind death count reporting. In case you wonder, a “département” is a regional unit (we have about 100 of those); see this Wikipedia article.

I decided to compare both datasets using a very, very simple methodology. First, I merged both datasets, so as to obtain, for each département and each day within a certain period:

  • the number of covid deaths reported in a hospital, call it h_{it} (where i is the département, t is the day);
  • the total number of deaths d_{it} (whatever the cause) on the same day t, in département i;
  • the total numbers of deaths d_{it}^{(19)} and d_{it}^{(18)} on the same day one year earlier (in 2019) and two years earlier (in 2018), respectively.

SPF data start on the 18th of March, and INSEE publishes its data every Friday with a one-week delay, so my merged dataset currently covers the period from the 18th to the 30th of March (13 days). We have about 90 départements in the dataset, so the sample size is about 1200.
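The merge step can be sketched with pandas as follows. This is only an illustration: the column names (dep, date, h, d18, d19, d), the table layout and all the numbers are made up, and the real SPF and INSEE files are organised differently.

```python
import pandas as pd

# Hypothetical SPF extract: COVID deaths in hospitals, per département and day.
spf = pd.DataFrame({
    "dep": ["01", "01", "02", "02"],
    "date": pd.to_datetime(["2020-03-18", "2020-03-19",
                            "2020-03-18", "2020-03-19"]),
    "h": [1, 2, 0, 3],
})

# Hypothetical INSEE extract: all-cause deaths on the same calendar day
# in 2018 (d18), 2019 (d19) and 2020 (d). Numbers are invented.
insee = pd.DataFrame({
    "dep": ["01", "01", "02", "02", "03"],
    "date": pd.to_datetime(["2020-03-18", "2020-03-19",
                            "2020-03-18", "2020-03-19", "2020-03-18"]),
    "d18": [14, 12, 20, 18, 9],
    "d19": [13, 15, 19, 21, 8],
    "d": [16, 17, 20, 25, 9],
})

# Keep only (département, day) pairs present in both sources, and check
# that each pair appears at most once on each side.
merged = spf.merge(insee, on=["dep", "date"], how="inner",
                   validate="one_to_one")
print(merged)
```

The inner join silently drops any département or day that is missing from one of the two sources (here, the invented département "03"), which is why the merged period is shorter than either dataset.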

The model I have in mind is simply: d_{it} = \frac{1}{2}(d_{it}^{(18)}+d_{it}^{(19)}) + \beta \times h_{it} + \varepsilon_{it}.

The first term is a basic predictor of the 2020 counts, had there been no pandemic. It is pretty crude, but death counts are quite stable over the years. Granted, there is some variation in winter, due to the flu, but this seems to affect mostly February. For the record, here is a plot of the daily number of deaths in France in 2018, 2019 and 2020, for the period covered by the data:

Daily mortality in France in March and April, in 2018, 2019 and 2020. Source: INSEE

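The coefficient \beta of course measures under-reporting: \beta = 1 would mean that hospital counts account for all the excess deaths, while \beta > 1 means that each COVID death reported in a hospital comes with additional deaths that are not reported.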

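Thus, I fitted a linear regression model to predict 2020 deaths as a function of the 2018 deaths, the 2019 deaths, and the COVID hospital deaths (no intercept; the coefficients of the two previous years are estimated rather than fixed at one half). Here are the results: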
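For readers who want to reproduce this kind of fit, here is a minimal sketch on synthetic data. The results in this post come from statsmodels' OLS; the sketch swaps in plain NumPy least squares (same point estimates, classical standard errors computed by hand). All data are simulated, and the "true" coefficient 1.6 is an arbitrary choice for the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1200  # roughly the sample size quoted in the post

# Synthetic stand-ins for the real columns (all values are simulated).
d18 = rng.poisson(18, size=n).astype(float)   # deaths, same day in 2018
d19 = rng.poisson(18, size=n).astype(float)   # deaths, same day in 2019
h = rng.poisson(3, size=n).astype(float)      # COVID hospital deaths
beta_true = 1.6                               # arbitrary simulation value
d = 0.5 * (d18 + d19) + beta_true * h + rng.normal(0.0, 2.0, size=n)

# No-intercept regression of d on (d18, d19, h).
X = np.column_stack([d18, d19, h])
coef, *_ = np.linalg.lstsq(X, d, rcond=None)
a18, a19, beta_hat = coef

# Classical (homoscedastic) standard errors and a normal-approximation
# 95% interval for beta.
resid = d - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
ci = (beta_hat - 1.96 * se[2], beta_hat + 1.96 * se[2])

print("coefficients:", coef)
print("95% interval for beta:", ci)
```

With real data, the three columns would simply be read from the merged table instead of being simulated.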

Output of statsmodels.OLS in Python. Variable h_it is denoted dh here.

Look in particular at the estimate of \beta: 1.596 (95% confidence interval: [1.51, 1.68]). In other words, on average, one should add something between 50% and 70% to the reported number of COVID deaths in hospitals to get an estimate of all COVID deaths.
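The arithmetic behind this reading is spelled out below, together with a rough sanity check against the national figures quoted earlier (66% of reported deaths occurring in hospitals). The comparison is only indicative: the national share is dated April 12th and adds only retirement homes to hospitals, whereas the regression covers the 18th to the 30th of March and all causes of death.

```python
beta_hat, ci_low, ci_high = 1.596, 1.51, 1.68  # values reported above

def extra_percent(beta):
    """Percentage to add to hospital deaths if total = beta * hospital."""
    return round((beta - 1.0) * 100, 1)

for label, beta in [("lower", ci_low), ("estimate", beta_hat), ("upper", ci_high)]:
    print(label, extra_percent(beta), "%")

# National figures: if hospitals account for 66% of reported COVID deaths,
# the implied total-to-hospital ratio is 1 / 0.66.
national_ratio = 1 / 0.66
print("national total/hospital ratio:", round(national_ratio, 2))
print("inside the 95% interval:", ci_low <= national_ratio <= ci_high)
```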

I tried other models; for instance by forcing the coefficients of the two years to be exactly equal to one half (how to do this is left as a simple exercise!). I got similar results. I’d like to repeat the analysis on weekly aggregated data. We don’t yet have two full weeks of data, so it’s too early for that. The usual caveats regarding linear regression apply; e.g. there should be some heteroscedasticity, given that the sizes of the départements vary significantly.
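One possible solution to the exercise, as a sketch: with both yearly coefficients fixed at one half, the model reduces to a regression through the origin of the excess deaths d_{it} - (d_{it}^{(18)} + d_{it}^{(19)})/2 on h_{it}, whose least-squares solution is \sum h_{it} e_{it} / \sum h_{it}^2, where e_{it} denotes the excess. The function name and the toy numbers below are made up for illustration.

```python
import numpy as np

def fit_beta_fixed_baseline(d, d18, d19, h):
    """Least-squares beta in d = 0.5 * (d18 + d19) + beta * h + noise."""
    d, d18, d19, h = (np.asarray(x, dtype=float) for x in (d, d18, d19, h))
    excess = d - 0.5 * (d18 + d19)      # deaths above the fixed baseline
    return float(h @ excess / (h @ h))  # regression through the origin

# Toy, noise-free numbers (invented) constructed with beta = 1.5.
d18 = [10, 12, 8, 10]
d19 = [12, 10, 10, 12]
h = [0, 1, 2, 3]
d = [11.0, 12.5, 12.0, 15.5]

beta_fixed = fit_beta_fixed_baseline(d, d18, d19, h)
print(beta_fixed)
```

On noise-free toy data the function recovers the coefficient used to build the data; on real data it would give a single estimate of \beta to compare with the unconstrained fit.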

I will update these results as I get more data. I find it interesting that merging these two datasets already gives results that are reasonable and easy to interpret. In particular, I got similar results using only the first six days that were available one week ago. The secret here is that we compensate for the small number of days with a large number of “départements”.

I am not an expert on public health data, so I do not want to comment on why SPF reports only hospital data; I guess it is much harder to determine that a death is covid-related outside of a hospital, but again I am out of my depth here.

On the other hand, I think it is commendable that INSEE decided to do their own reporting. Of course, the two institutions report different things. But the fact that we are able to compare and combine two sources of data potentially gives a clearer picture.

Comments are most welcome. I would be curious in particular to know whether other countries provide this kind of double reporting.

Published by Nicolas Chopin

Professor of Statistics at ENSAE, IPP

8 thoughts on “A quick, preliminary study of COVID death under-reporting in France”

    1. Cool dashboard! Yes, that ratio and my coefficient are very close but note that:
      a) even on this dashboard, the only time series data you get are covid deaths *in hospitals*;
      b) the number of COVID deaths in retirement homes is apparently very difficult to estimate, as explained in e.g. this Le Monde article:
      https://www.lemonde.fr/les-decodeurs/article/2020/04/17/infections-tests-courbes-ou-donnees-brutes-bien-lire-les-chiffres-sur-le-coronavirus_6036957_4355770.html
      which is why SPF offers an estimate only at the national level, and only since April 2. (And other countries like the UK do not…)
      c) on the other hand, the all-cause number of deaths is pretty reliable.

      So, by and large, I think it’s useful to compare both datasets, and observe that, despite the difficulties regarding the estimation of EHPAD (retirement home) COVID deaths:
      (a) the current total estimate of SPF is very reasonable;
      (b) beyond EHPAD deaths, there is currently no observable effect of the epidemic on the total number of deaths.

      Anyway, I plan to update my post this weekend, when more data are available, and I may add extra comments.

  1. Hello,
    A remark on the model. In your model, the coronavirus alone explains the difference in mortality. That may be too simplistic. Could the lockdown also have a non-negligible (or even major) effect on mortality in France? The halt in economic activity plays a major role in:
    – road accidents (10 deaths per day): down 40% because of the lockdown
    – air quality: mortality due to air quality is estimated at between 100 and 150 deaths per day! What is the impact?
    – job losses and unemployment: estimated at between 25 and 50 deaths per day. What is the impact of the lockdown?
    – crime and everyday violence (domestic violence, etc.).
    So there are quite a few intersecting effects that may alter this result (in my opinion, rather upwards).
    What do you think?

    1. For road accidents, see the new post. For the other factors, if you have sources, I would be glad to see them. One must be careful with daily averages, however. Suppose that there are 36,500 deaths per year due to air quality; the daily average would then be 100 deaths per day. BUT this does not prove at all that improving air quality for a single day would save 100 lives. These effects are hard to measure precisely, and are probably long-term.
