Andras Fulop, Jeremy Heng (both ESSEC), and I (Nicolas Chopin, ENSAE, IPP) are currently advertising a post-doc position to work on developing SMC methods for challenging models found in Finance and Econometrics. If you are interested, click here for more details, and get in touch with us.

Ever wanted to learn more about particle filters, sequential Monte Carlo, state-space/hidden Markov models, PMCMC (particle MCMC), SMC samplers, and related topics?

In that case, you might want to check the following book by Omiros Papaspiliopoulos and me, which has just been released by Springer, and which may be ordered from their website, or from your favourite book store.

The aim of the book is to cover the many facets of SMC: the algorithms, their practical uses in different areas, the underlying theory, how they may be implemented in practice, etc. Each chapter contains a “Python corner” which discusses the practical implementation of the covered methods in Python, a set of exercises, and bibliographical notes. Speaking of chapters, here is the table of contents:

- Introduction
- Introduction to state-space models
- Beyond state-space models
- Introduction to Markov processes
- Feynman-Kac models: definition, properties and recursions
- Finite state-spaces and hidden Markov models
- Linear-Gaussian state-space models
- Importance sampling
- Importance resampling
- Particle filtering
- Convergence and stability of particle filters
- Particle smoothing
- Sequential quasi-Monte Carlo
- Maximum likelihood estimation of state-space models
- Markov chain Monte Carlo
- Bayesian estimation of state-space models and particle MCMC
- SMC samplers
- SMC^2, sequential inference in state-space models
- Advanced topics and open problems

And here is one fancy plot taken from the book. (For some explanation, you will have to read it!)

A big thanks to all the colleagues who took the time to read draft versions and send feedback (see the introduction for a list of names). Also, don’t write books, folks. Seriously, it takes WAY too much time…

Hi all,

This post is about a way of sampling from a Categorical distribution, which appears in Arthur Dempster's approach to inference, a generalization of Bayesian inference (see Figure 1 in "A Generalization of Bayesian Inference", 1968), under the name "structure of the second kind". It's the starting point of my on-going work with Ruobin Gong and Paul Edlefsen, which I'll write about another day. This sampling mechanism turns out to be strictly equivalent to the "Gumbel-max" trick that has received some attention in machine learning; see e.g. this blog post by Francis Bach.

Let's look at the figure above: the encompassing triangle is equivalent to the "simplex" with 3 vertices (K vertices more generally). Any point within the triangle is a convex combination $\sum_{k=1}^K w_k v_k$ of the vertices, where the $w_k$ are non-negative "weights" summing to one, and the $v_k$ are the vertices. The weights are the "barycentric coordinates" of the point. Any point $w$ in the triangle induces a partition into K sets $\Delta_1, \dots, \Delta_K$. Each "sub-simplex" $\Delta_k$ can be obtained by considering the entire simplex and replacing vertex $v_k$ by $w$. It has a volume equal to $w_k$ relative to the volume of the entire simplex. Can you see why? If not, it's OK, great scientific endeavors require a certain degree of trust and optimism.

Since the volume of each $\Delta_k$ is $p_k$ when the partition is induced by the point with barycentric coordinates $(p_1, \dots, p_K)$, if we sample a point uniformly within the encompassing simplex, it will land within $\Delta_k$ with probability $p_k$. In other words we can sample from a Categorical distribution with probabilities $(p_1, \dots, p_K)$ by sampling uniformly within the simplex, and by identifying the index $k$ such that the point lands in $\Delta_k$. This appears in various places in Arthur Dempster's articles (see references below), because Categorical distributions provide a pedagogical setting for new methods of statistical inference, and because this sampling mechanism does not rely on any arbitrary ordering of the categories (contrarily to "inverse transform sampling").

How does this relate to the Gumbel-max trick? One way of sampling uniformly within the simplex is to sample $E_1, \dots, E_K \sim$ Exponential(1) and to define weights $u_k = E_k / \sum_{j=1}^K E_j$. Furthermore, a point $u$ is within $\Delta_k$ for a given $k$ if and only if $u_k / p_k \le u_j / p_j$ for all $j$. The next figure illustrates such inequalities: the points with coordinates satisfying $u_k / p_k \le u_j / p_j$ are under/above some line that originates from the vertex opposite the segment $[v_j, v_k]$ and goes through the point with coordinates $(p_1, \dots, p_K)$.

An Exponential(1) is also minus the logarithm of a Uniform(0,1). Putting all these pieces together, a uniform point in the simplex is within $\Delta_k$ if and only if, for all $j$,

$\log p_k - \log E_k \ge \log p_j - \log E_j$.

Since $-\log E_j$ is a Gumbel variable, the above mechanism is equivalent to selecting $\arg\max_j \{\log p_j + G_j\}$, where the $G_j$ are independent Gumbel variables. It's the Gumbel-max trick!
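As a quick numerical check (my own sketch, with illustrative probabilities; not code from the post), the simplex mechanism and the Gumbel-max trick can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])  # Categorical probabilities (illustrative)

# Simplex mechanism: uniform barycentric coordinates via normalized
# Exponentials, then find the sub-simplex containing the point.
def sample_via_simplex(p, rng):
    E = rng.exponential(size=p.size)   # E_j ~ Exponential(1)
    u = E / E.sum()                    # uniform point in the simplex
    return np.argmin(u / p)            # index k such that u lies in Delta_k

# Gumbel-max trick: argmax of log p_j + G_j with G_j standard Gumbel.
def sample_via_gumbel(p, rng):
    G = rng.gumbel(size=p.size)
    return np.argmax(np.log(p) + G)

# Fed the same Exponentials, the two mechanisms pick the same index:
E = rng.exponential(size=p.size)
assert np.argmin((E / E.sum()) / p) == np.argmax(np.log(p) - np.log(E))

# And empirical frequencies approach p:
draws = np.array([sample_via_gumbel(p, rng) for _ in range(100_000)])
freq = np.bincount(draws, minlength=p.size) / draws.size
```

Since normalizing by the sum of the exponentials does not change the argmin, the two mechanisms coincide draw by draw, not just in distribution.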

- It's hard to trace back the first instance of this sampling mechanism, but it appears in several of Arthur Dempster's articles, e.g. "New methods for reasoning towards posterior distributions based on sample data", 1966, and it is discussed at length in "A class of random convex polytopes", 1972.
- The connection occurred to me while reading Xi'an's blog post, which points to this interesting article on Emil Gumbel, academic in Heidelberg up to his exile in 1932, "pioneer of modern data journalism" and active opponent of the Nazis. Quoting from the article, "His fate was sealed when, at a speech in memory of the 700,000 who had perished of hunger in the winter of 1916/17, he remarked that a rutabaga would certainly be a better memorial than a scantily clad virgin with a palm frond".
- The Gumbel-max trick is interesting for many reasons: it amounts to viewing sampling as an optimization program, it can be "relaxed" in various useful ways, etc. In Art Dempster's work that sampling mechanism is appealing because of its invariance under relabeling of the categories ("category 2" is not between "category 1" and "category 3"). This matters when performing inference on Categorical distributions (i.e. on count data) using Art Dempster's approach, because the estimation depends on the choice of sampling mechanism and not simply on the likelihood function.

Hi everyone,

This short post is just to point to a course on "Couplings and Monte Carlo", available here: https://sites.google.com/site/pierrejacob/cmclectures. Versions of the course were given at Université Paris-Dauphine in February 2020 (thanks Robin Ryder and Christian P. Robert), at the University of Bristol in March 2020 (thanks Anthony Lee) and at the University of Torino for the M.Sc. in Stochastics and Data Science in May 2020 (thanks Matteo Ruggiero). I am grateful to these colleagues and their institutions for supporting this course. The course website points to about 100 pages of lecture notes, and 16 videos are available on YouTube. It is intended for advanced undergraduate or graduate students with some previous exposure to Monte Carlo methods. This is work in progress, and as I am hoping to develop the course over the coming years, feedback would be most welcome.

In this post, I would like to do the following:

- describe briefly a new, richer data-set recently published by INSEE (and do some graphs);
- use the updated data (from both sources) to repeat my analysis, with some variants (weekly aggregates, separating men and women);
- reply to a few comments I got on LinkedIn and elsewhere;
- provide a few pointers regarding death counts in other countries (particularly the UK).

INSEE now provides every Friday an exhaustive data-set that records, for each death that has occurred since 01-01-2018, the following variables: date of birth, date of death, sex, département of death, and so on. Neat. Let’s take this opportunity to do a few plots, such as this one:

(it’s nice to observe this sharp drop) or that one:

The latter plot covers the same period (weeks 13 to 15, 23rd March to 12th April) as in the analysis below. As expected, over-mortality seems to affect mostly people above 60.

Ok, now let’s repeat my previous analysis, based on merging the SPF data (daily covid death counts in hospitals, in each département and each sex) and the aforementioned INSEE data (all-cause deaths). Except this time:

- The overlap between the two datasets now covers more than three weeks (18th March, first date in the SPF dataset, to 12th April, latest date in the INSEE dataset), so I decided to consider **weekly aggregates**, for two reasons: they are more stable than daily aggregates, and less affected by artifacts such as reporting delays (e.g. a death occurring during a week-end being reported on the next Monday).
- I also separated **men and women**.
- I am going to simplify the model a bit, and simply regress **excess deaths** (number of deaths in 2020 minus the average over 2018 and 2019) on **hospital deaths**.

First, a joint plot:

So, to recap, each point in this plot corresponds to a pair of death counts, for each département in France, each week from 13 to 15, and for each sex. The corresponding linear regression (without an intercept) gives a slope estimate of 1.79 (95% confidence interval: [1.73, 1.85]). The basic interpretation would be: in each département, when 100 covid deaths occur in hospitals, the number of covid-related (see below) deaths should be approximately 179. The current total number of covid deaths reported by SPF is 22,614, which is 60% above the number of covid deaths in hospitals (14,050). So this estimate suggests the actual death toll might be a tad larger. More about the interpretation below.
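For readers wishing to reproduce this kind of computation, here is a minimal sketch of a no-intercept regression with its 95% confidence interval, on synthetic data (the counts and the true slope below are invented; this is not the actual SPF/INSEE data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                          # departement x week x sex cells
hosp = rng.poisson(20.0, n).astype(float)        # invented hospital covid deaths
excess = 1.8 * hosp + rng.normal(0.0, 10.0, n)   # invented excess deaths

# No-intercept OLS: slope = sum(x*y) / sum(x^2)
slope = (hosp @ excess) / (hosp @ hosp)
resid = excess - slope * hosp
se = np.sqrt(resid @ resid / (n - 1)) / np.sqrt(hosp @ hosp)
ci = (slope - 1.96 * se, slope + 1.96 * se)
```

With only one regressor and no intercept, the slope and its standard error have these simple closed forms, which is convenient for a quick sanity check of library output.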

Now for something more interesting: let’s redo the previous plot, but with a different colour for each sex:

Clearly the two linear trends are different; see below the OLS estimates.

| sex | slope estimate | slope 95% confidence interval | R² |
|-----|----------------|-------------------------------|-----|
| F   | 2.40           | [2.30, 2.50]                  | 89% |
| M   | 1.56           | [1.50, 1.62]                  | 90% |

What is going on? Well, women tend to live longer than men, and the proportion of women in EHPADs (French retirement homes) is 74%. Since the main reason behind the discrepancy between hospital deaths and excess deaths is covid deaths occurring in retirement homes, these results make sense.

Fair enough: since 4th April, SPF has included in its total count both hospital deaths and retirement home deaths, and the proportion of the latter is not too far from my estimate. Note however that:

- it's really hard to estimate properly the number of covid deaths occurring in retirement homes. Apparently several retirement homes did not provide any data, while others marked as "covid" all the deaths that occurred after the first covid death.
- My estimate might measure other direct or indirect effects of the pandemic, such as people dying at home, people not receiving proper care because the health system is at capacity and so on.
- The fact that data from two different institutions may be compared, and seem to be somehow consistent, is, in my opinion, a good piece of news which deserves to be reported!

Boy, that one was popular. Please have a look at the plot on the front page of the ONISR (click on "tués"): yes, the number of car-related deaths dropped sharply thanks to the lock-down… But in March of last year, this number was around 250, that is, about 1% of the current covid death count. "Fun" fact: this point would have been quite relevant in the 70s! In those years, the number of car-related deaths was about five times larger (18,034 deaths in 1972).

The idea of comparing the 2020 deaths to the average of the two previous years is a bit crude, and demographers have better models to predict death counts based on age structure and so on. That said, the notion of "excess deaths" seems quite popular in various countries, as I explain below, so I guess my approach is not so daft after all.

To be honest, I was hoping to apply the same approach to the UK, a country where the official estimate is still limited to hospital deaths, and is thus clearly quite biased; see e.g. this Guardian article. Sadly, Public Health England only reports daily hospital death counts… per nation (nation = England, Scotland, Wales, or Northern Ireland). On the other hand, the Office for National Statistics reports every week the number of "excess deaths" (relative to the five-year average), and the proportion of these deaths where the word "covid" is mentioned on the death certificate.

Interestingly, the Guardian article I mentioned above first complains that the UK only reports hospital deaths, and then erroneously claims that the UK is still behind France in terms of covid mortality. It's not, if you compare hospital deaths (UK: 20,319 on Saturday; France: 14,050). The fact that even journalists reporting on this issue may get it wrong seems indicative of how confusing covid death data are.

More generally, my impression is that looking at "excess deaths" makes far more sense for most countries at the moment: it's easier to measure (albeit with a delay, of course), and easier to interpret. This is also more or less the point made by this NYT article. (Notice how their plot for France only covers January to April; for the complete plot, see my first plot above!)

However, case counts per country are not very reliable, given that countries have very different policies regarding testing and so on; see e.g. Nate Silver’s opinion on case counts here.

You would think that death counts are far more reliable. In France, however, Santé Publique France (SPF) got criticized for reporting only covid deaths that occurred in hospitals. Very recently, they started to also include deaths that occurred in retirement homes. However, they do so only at the national level (current count as of April 12th: 13,832; 66% from hospitals). At a finer level (i.e. "régions" or "départements"), the data they provide (here) remain restricted to hospitals.

INSEE (the French institute of official statistics) decided to publish, at the same time, daily death counts at the département level. Note that INSEE is not a public health institute; the death counts they report are for *all* deaths, whatever the cause. See also this authoritative post (in French) explaining the challenges behind death counts reporting. In case you wonder, a "département" is a regional unit (we have about 100 of those); see this wikipedia article.

I decided to compare both datasets using a very, very simple methodology. First, I merged both datasets, so as to obtain, for each département, and each day within a certain period:

- the number of covid deaths reported in a hospital, call it $CH_{d,t}$ (where $d$ is the département, $t$ is the day);
- the total number of deaths (whatever the cause) $D^{2020}_{d,t}$ on the same day $t$, in département $d$;
- the total numbers of deaths $D^{2019}_{d,t}$ and $D^{2018}_{d,t}$ on the same day, respectively one year ago (in 2019), and two years ago (in 2018).

SPF data start on the 18th of March, and INSEE publishes its data every Friday with a one-week delay, so my merged dataset currently covers the period from the 18th to the 30th of March (13 days). And with about 90 départements in the dataset, the sample size is about 1,200.

The model I have in mind is simply: $D^{2020}_{d,t} = \alpha\, D^{2018}_{d,t} + \alpha'\, D^{2019}_{d,t} + \beta\, CH_{d,t} + \varepsilon_{d,t}$.

The first part is a basic predictor of the 2020 counts, in case there were no pandemic. It is pretty basic, but death counts are quite stable over the years. Granted, there is some variation in winter, due to the flu, but this seems to affect mostly February. For the record, here is a plot of the daily number of deaths in France in 2018, 2019 and 2020, for the period covered by the data:

The coefficient $\beta$ of course measures under-reporting.

Thus, I fitted a linear regression model to predict 2020 deaths as a function of the 2018 and 2019 deaths, and the CH deaths (no intercept). Here are the results:

Look in particular at the estimate of $\beta$: 1.596 (95% confidence interval: [1.51, 1.68]). In other words, on average, one should add something between 50% and 70% to the reported number of covid deaths in hospitals to get an estimate of all covid deaths.

I tried other models; for instance, forcing the coefficients of the two years to be exactly equal to one half (how to do this is left as a simple exercise!). I got similar results. I'd like to repeat the analysis on weekly aggregated data, but we don't yet have two full weeks of data, so it's too early for that. The usual caveats regarding linear regression apply; e.g. there should be some heteroscedasticity, given that the sizes of départements vary significantly.
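The exercise above (forcing the coefficients of the two years to one half) amounts to regressing the excess deaths on the hospital counts. A sketch on synthetic data (the counts and the true coefficient below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1200                                       # ~90 departements x 13 days
d18 = rng.poisson(15.0, n).astype(float)       # invented 2018 daily death counts
d19 = rng.poisson(15.0, n).astype(float)       # invented 2019 daily death counts
ch = rng.poisson(3.0, n).astype(float)         # invented hospital covid deaths
beta_true = 1.6
d20 = 0.5 * d18 + 0.5 * d19 + beta_true * ch + rng.normal(0.0, 2.0, n)

# Unconstrained regression (no intercept): d20 ~ a*d18 + a'*d19 + beta*ch
X = np.column_stack([d18, d19, ch])
coef, *_ = np.linalg.lstsq(X, d20, rcond=None)

# Forcing the coefficients of the two years to 1/2:
# regress the excess deaths on the hospital counts alone
excess = d20 - 0.5 * (d18 + d19)
beta_constrained = (ch @ excess) / (ch @ ch)
```

Both versions recover a coefficient close to the true value on this synthetic data; the constrained version is exactly the "excess deaths on hospital deaths" regression used in the follow-up analysis.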

I will update these results as I get more data. I find it interesting that merging these two datasets already gives results that are reasonable and easy to interpret. In particular, I got similar results using only the first *six* days that were available one week ago. The secret here is that we compensate for the small number of days with the large number of départements.

I am not an expert on public health data, so I do not want to comment on why SPF reports only hospital data; I guess it is much harder to determine that a death is covid-related outside of a hospital, but again I am out of my depth here.

On the other hand, I think it is commendable that INSEE decided to do its own reporting. Of course, the two institutions report different things. But the fact that we are able to compare and combine two sources of data potentially gives a clearer picture.

Comments are most welcome. I would be curious in particular to know whether other countries provide this kind of double reporting.

Hi!

It seems about the right time to read Kermack & McKendrick, 1927, “A contribution to the Mathematical Theory of Epidemics”. It is an early article on the “Susceptible-Infected-Removed” or “SIR” model, a milestone in the mathematical modelling of infectious disease. In this blog post, I will go through the article, describe the model and the data considered by the authors (plague in Bombay in 1905-1906), which will turn out to be a questionable choice. Some references and R code are given at the end of the article. All of this comes with the disclaimer that I have no expertise in epidemiology.

The article starts by crediting Ronald Ross and Hilda Hudson for their articles in the 1910s on malaria; other related works are cited in Anderson (1991) and Bacaër (2011b) (full references are given below). The topic is the following. Some individuals in a closed population get infected by some new disease. Over time they go through various stages, might infect other people, and eventually recover, or die. People who get infected might, in turn, infect other people, and thus the disease can spread to a large part of the population. But it is also possible that the first infected individuals recover before infecting anyone, and the disease could rapidly disappear. The goal here is to try to understand what drives the spread of a disease: why some become large epidemics and some don’t, how many people get affected, etc. The paper is about a model, so “to understand” here means something like “to propose a convincing, simple and intuitive model that is still rich enough to describe faithfully some aspects of reality”.

The authors first explain a general model before delving into special cases, including the celebrated SIR model. An infected person eventually recovers or dies, but never gets infected again. The population contains N individuals in a given area where people are in contact with one another, e.g. we can think of a (medieval walled) city. It is really helpful to think of N as indicating a population density in the context of the article, rather than just a population size. What does this mean? It means that the area under consideration remains fixed as we imagine variations in N; thus the density varies linearly with the number of individuals. Let's assume that at time zero, a single person is infected: $y(0) = 1$. The other N−1 individuals are "susceptible" to becoming infected, $x(0) = N - 1$, and initially no one has yet recovered, $z(0) = 0$. Throughout the disease outbreak, the population size remains constant: $x(t) + y(t) + z(t) = N$.

The general model of the article is first formulated in discrete time. It differentiates infected people according to the duration of their infection. At time t, the number of infected individuals can be written $y_t = \sum_{\theta} v_{t,\theta}$. Here $v_{t,\theta}$ counts the people who have been infected for $\theta$ units of time already. Over one time interval, for each $\theta$, we assume that these individuals generate $\varphi_\theta\, x_t\, v_{t,\theta}$ new infections and $\psi_\theta\, v_{t,\theta}$ removals.

- Indeed $\varphi_\theta\, x_t\, v_{t,\theta}$ can be understood as the product of a rate of transmission $\varphi_\theta$, which accounts for both the infectiousness of the pathogen and the contact rate between individuals, with $x_t\, v_{t,\theta}$ counting all pairs of individuals with one susceptible and one infected for $\theta$ time units, i.e. all possible contacts that can lead to a new infection.
- Meanwhile $\psi_\theta\, v_{t,\theta}$ is the product of a rate of recovery/death $\psi_\theta$ and the number of people infected for $\theta$ time units.

The rates of transmission and recovery, $\varphi_\theta$ and $\psi_\theta$, are generally allowed to vary with the length of infection $\theta$. Indeed an individual infected for seven days might be more or less infectious than an individual infected for one day, for instance. Counting all the transfers of individuals from and to the different "compartments" (susceptible, infected for one unit of time, infected for two units of time, …, or recovered), the paper gives formulae that describe what happens to these numbers of individuals as time progresses.

From there the authors send the time period to zero. This means that they look at the time period in the eyes and they say: go to zero! Thus they go from discrete to continuous time. The numbers $(x_t, y_t, z_t)$, representing the numbers of susceptible, infected and removed individuals, are then shown to follow a system of differential equations. These equations do not have an analytical solution, so there are no explicit formulae giving $(x_t, y_t, z_t)$ as a function of t. The authors comment on various aspects of the equations, including connections with Volterra integral equations, and Fredholm integral equations. There are also some remarks on limits as time goes to infinity, how these limits depend on the parameters of the model, and how the equations behave for small time t and large population density N.

The celebrated SIR model is obtained as a special case, when the rates of transmission and removal are assumed constant, equal to $\beta$ and $\gamma$ say. The system of differential equations becomes:

$\dfrac{dx}{dt} = -\beta x y, \qquad \dfrac{dy}{dt} = \beta x y - \gamma y, \qquad \dfrac{dz}{dt} = \gamma y.$
Often the letters S, I, R are used in place of x, y, z. Sometimes the model is written in terms of the proportions of individuals of each type, whereas here it describes the numbers of individuals of each type (per unit area). In my experience, it is easy to get confused by this. Since a product $xy$ appears on the right-hand side, replacing the numbers (x, y, z) by proportions (x/N, y/N, z/N) requires an extra factor N to multiply the right-hand side. This is sometimes referred to as "density-dependent versus frequency-dependent", see e.g. this blog post.

An interesting aspect of the equations is that the sign of the change in the number of infected individuals, $dy/dt$, depends on $\beta x / \gamma$ being larger or smaller than one, where $\beta$ is the rate of transmission and $\gamma$ the rate of removal. At the start of the epidemic this ratio is very close to $\beta N / \gamma$. If this is less than one, the number of infected people will decrease; if it is larger than one, it will increase… until $\beta x / \gamma = 1$, and then it will decrease.

Thus what drives the occurrence of an epidemic is 1) the parameters $\beta$ and $\gamma$, that appear directly in the equations, but also 2) the population density N, in this model. An epidemic occurs or not according to how large N is relative to $\gamma / \beta$. There is a "critical threshold" of population density for any $(\beta, \gamma)$, above which epidemics occur. In later works, $\beta N / \gamma$ would be called the basic reproduction number "R0", and epidemics occur when it is larger than one. To illustrate this, here are solution curves against time, all obtained with fixed values of $\beta$ and $\gamma$, and with N varying between 200 and 1200.

The graph serves to illustrate several crucial points:

- the population density N plays a key role in the occurrence of an epidemic under the model,
- the model is rich enough to generate widespread epidemics or non-epidemics depending on the parameters,
- contrary to Kermack & McKendrick, we might not be too concerned about the lack of analytical solutions for the differential equations; we’re used to it and our computers can compute accurate numerical solutions, e.g. using deSolve,
- it’s fun to play with gganimate.
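The threshold behaviour described above is easy to check numerically (we no longer need closed-form solutions). Here is a minimal sketch; the values of β and γ are my own illustrative choices, giving a threshold density γ/β = 200:

```python
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 0.01, 2.0    # illustrative rates; threshold density gamma/beta = 200

def sir(t, u):
    x, y, z = u
    return [-beta * x * y, beta * x * y - gamma * y, gamma * y]

def final_removed(N, t_max=50.0):
    """Total number removed, z(t_max), starting from a single infected individual."""
    sol = solve_ivp(sir, (0.0, t_max), [N - 1.0, 1.0, 0.0], rtol=1e-8, atol=1e-8)
    return sol.y[2, -1]

small = final_removed(150.0)    # below the threshold: the outbreak fizzles
large = final_removed(1200.0)   # well above it: a large epidemic
```

With N = 150 the ratio βN/γ is below one and almost nobody is ever removed, while with N = 1200 the epidemic sweeps through most of the population, mirroring the family of curves above.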

I was initially surprised about the emphasis on population density in the article. Kermack and McKendrick speculate (towards the end of the article) that epidemics might “regulate” population densities and that many cities in the world might have population densities around the critical threshold (for many pathogens?)… otherwise they would be liable to catastrophic epidemics. Some sentences are quite chilling: “The longer the epidemic is withheld the greater will be the catastrophe, provided that the population continues to increase, and the threshold density remains unchanged”. Does it apply to the current pandemic? Well, if you take the first letter of each of the first thirteen paragraphs of the article, you get “coronavirus”. Or maybe you don’t.

Anderson (1991) provides some useful context: “Two explanations for the termination of an epidemic were most in favour amongst medical circles at that time [circa 1927], namely: (1) that the supply of susceptible people had been exhausted and (2) that during the course of the epidemic the virulence of the infectious agent had gradually (or rapidly) decreased.”

In this debate, the model of Kermack & McKendrick describes an alternative hypothesis: the removal of susceptible people lowers the density below some critical threshold, leading to the termination of the epidemic, even when the number of susceptible individuals might remain large in absolute terms, and even if the virulence of the pathogen remains constant throughout the epidemic.

Let’s now look at the data set considered by Kermack & McKendrick, shown above. It shows the weekly deaths from the plague during thirty weeks over 1905-1906 in Bombay. Some context is provided by Bacaër (2011a), who mentions some interesting concerns about the use of a simple SIR model for this particular outbreak. The plague appeared in Bombay in 1896 and reappeared with a “strong seasonal character” in the following years. This seasonal aspect is not accounted for by a simple SIR model, but could play a big role in the decrease of the infections after week ~20. With the simple SIR model, it is possible to obtain a good fit of the curve to the data points, but the associated parameter values are unrealistic. For example, you will find a nice fit with a population density N far below one million, whereas the population of Bombay was around one million individuals at that time. Bacaër (2011a) proposes a fix: a modified SIR model with seasonal components that provides more satisfactory results. The SIR model seems to provide a useful template for more sophisticated models that account for various other factors, specific to each outbreak, but might not be an adequate model out of the box.

The basic SIR model can be an OK model for certain data sets. An example is the classical “boarding school” data set, reported in the British Medical Journal in 1978, which concerns an influenza outbreak in a boarding school in the north of England. The data include the number of children confined to bed day after day. There were N = 763 boys in that school. The curve of infected individuals can be made to match the data quite closely with suitable values of the transmission and removal rates.
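A quick least-squares fit of the two SIR parameters, in the spirit of the R script linked below, can be sketched as follows. To keep the example self-contained I fit synthetic data simulated from the model itself (the parameter values are invented, chosen to give an outbreak on a 14-day scale in a population of N = 763), rather than the actual boarding-school counts:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

N = 763.0                       # boarding-school population size
t_obs = np.arange(0.0, 14.0)    # daily observations over two weeks

def infected_curve(beta, gamma):
    """Number of infected individuals y(t) at the observation times."""
    sir = lambda t, u: [-beta * u[0] * u[1],
                        beta * u[0] * u[1] - gamma * u[1],
                        gamma * u[1]]
    sol = solve_ivp(sir, (0.0, t_obs[-1]), [N - 1.0, 1.0, 0.0],
                    t_eval=t_obs, rtol=1e-8)
    return sol.y[1]

# Synthetic data: invented "true" parameters plus observation noise
rng = np.random.default_rng(2)
beta_true, gamma_true = 0.0022, 0.45
data = infected_curve(beta_true, gamma_true) + rng.normal(0.0, 5.0, t_obs.size)

# Least squares over (beta, gamma), solving the ODE inside the residual
fit = least_squares(lambda th: infected_curve(th[0], th[1]) - data,
                    x0=[0.001, 0.3], bounds=([1e-5, 1e-2], [1e-2, 2.0]),
                    x_scale=[1e-3, 0.5])
beta_hat, gamma_hat = fit.x
```

The same residual function, pointed at real daily counts instead of `data`, gives the kind of fit shown in the figures.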

Some final thoughts:

- As illustrated by the two data examples above, measurements about disease outbreaks can be in the form of case counts per time unit, or numbers of individuals “removed”; we might know the size of the susceptible population exactly or not; we might know the exact times of infection, or not, etc. There seem to be as many scenarios as disease outbreaks.
- In the early works, models are either in discrete or continuous time and they are often deterministic. We can interpret some quantities probabilistically if we want (e.g. the chance that some individual gets infected in some small time interval), but there are no random variables in the description of the model. We can fit curves to data by minimizing least squares, but there are no stochastic processes, likelihood functions or probabilistic models for measurement errors.
- The choice of example made by Kermack and McKendrick is questionable: their data set would have been better modelled with the consideration of seasonal effects. Clearly that did not prevent the article from becoming extremely influential, and the SIR model from being widely used to this day.
- Brauer (2005) mentions that “One of the products of the SARS epidemic of 2002-2003 was a variety of epidemic models including general contact rates, quarantine, and isolation.” Hopefully, these developments are proving useful now? What modelling developments will follow from the current epidemic?
- According to Breda et al. (2012), the original article of Kermack & McKendrick is unfortunately hardly ever read. I have certainly found it useful to read it; certain parts, about differential equations, are a bit technical, but it is overall a very well-written article. It’s always interesting to see how pioneers explain their works themselves, and whether the writing style has aged. I was also glad to have, as reading companions, Anderson (1991) and Bacaër (2011a,b).

Here’s a link to an R script producing the above figures and performing a quick least-squares fit: https://github.com/pierrejacob/statisfaction-code/blob/master/2020-04-sir.R

**To read more on the topic:**

- Ronald Ross and Hilda P. Hudson (1917) An Application of the Theory of Probabilities to the Study of a priori Pathometry. Part I, II and III.
- Roy Anderson (1991) Discussion: The Kermack-McKendrick epidemic threshold theorem. [link]
- Fred Brauer (2005) The Kermack–McKendrick epidemic model revisited. [link]
- Nicolas Bacaër (2011a) The model of Kermack and McKendrick for the plague epidemic in Bombay and the type reproduction number with seasonality. [link]
- Nicolas Bacaër (2011b) A Short History of Mathematical Population Dynamics. [Chapter 16 is on McKendrick and Kermack, link]
- D. Breda , O. Diekmann , W. F. de Graaf , A. Pugliese & R. Vermiglio (2012) On the formulation of epidemic models (an appraisal of Kermack and McKendrick). [link]
- H. Heesterbeek et al (2015) Modeling infectious disease dynamics in the complex landscape of global health [a fairly recent review on the topic by some of the world experts https://science.sciencemag.org/content/347/6227/aaa4339]
- Textbooks on the topic include
- Bailey (1975) The mathematical theory of infectious diseases and its applications.
- Anderson and May (1991) Infectious diseases of humans: dynamics and control.
- Britton and Pardoux (2019) Stochastic Epidemic Models with Inference.

**On the Internet:**

- There are interactive applications to play with the SIR model, e.g. https://www.public.asu.edu/~hnesse/classes/sir.html
- There’s a package with a Shiny app here: https://cran.r-project.org/web/packages/shinySIR/vignettes/Vignette.html
- Some python code here: https://scipython.com/book/chapter-8-scipy/additional-examples/the-sir-epidemic-model/
- Arthur Charpentier’s blog series with R code: Modeling pandemics (1) and (2) and (3)
- Tom Britton (Stockholm University, author of one of the books listed above) on the modeling of epidemics including coronavirus https://www.youtube.com/watch?v=gSqIwXl6IjQ
- Nicholas P. Jewell (London School of Hygiene and Tropical Medicine and UC Berkeley) in a video called “COVID-19: The Exponential Power of Now” https://www.youtube.com/watch?v=MZ957qhzcjI
- Robin Thompson (Oxford) in a video called “How do mathematicians model infectious disease outbreaks?” https://livestream.com/oxuni/thompson/videos/204239496
- Aimee Mann – Patient Zero https://www.youtube.com/watch?v=en8HZ6X20Og

Hi all,

It seems like the current environment is perfect for the growth of remote seminars. Most of them seem to be free to attend; some require registration. I’ve collected links to seminars on topics related to statistics on this page: https://statisfaction.wordpress.com/remote-seminars/. I will try to keep the page up to date with more links as new seminars are created, with an emphasis on topics at least loosely related to those usually covered in our blog posts. Don’t hesitate to send links via comments or emails.

Hi all,

As many scientists who do not usually work in epidemiology are trying to contribute to the fight against the current pandemic (while getting magnets stuck in their noses), the Royal Society has a call here: https://royalsociety.org/news/2020/03/urgent-call-epidemic-modelling/ for “modellers to support epidemic modelling”, with a deadline on April 2nd (5pm British Summer Time).

More details are given here: https://epcced.github.io/ramp/ and they specifically welcome non-UK based scientists.

Hi everyone,

and Happy New Year! This post is about some statistical inference that one can do using as “data” the output of MCMC algorithms. Consider the trace plot above. It was generated by a Metropolis–Hastings algorithm using a Normal random walk proposal, with standard deviation “sigma”, on a certain target. Suppose that you are given a function that evaluates the pdf of that target. Can you retrieve the value of sigma used to generate the chain?

As a statistical problem this is a well-defined question. We view the chain as a time series, and, for once, the model is well-specified! But the difficulty comes from the likelihood function being intractable; see that classic paper by Tierney, equation (1), for an expression of the transition kernel of MH. Specifically, the issue occurs whenever two consecutive states in the chain are identical, which indicates that some proposal was rejected during the course of the algorithm. This results in a term in the likelihood equal to the “rejection probability” from that state, namely

$r(x) = 1 - \int q(x' \mid x)\, \alpha(x, x')\, dx',$

where $q(\cdot \mid x)$ is the Normal proposal density and $\alpha(x, x')$ is the acceptance probability of state $x'$ from state $x$. That term is intractable because of the integral. But we can estimate r(x)!

A naive estimator is obtained by drawing $x'$ from the Normal proposal distribution in the integral, and evaluating $1 - \alpha(x, x')$. The issue with that estimator is that it can be exactly equal to zero, with non-negligible probability. If many such estimators are multiplied together to estimate the full likelihood, then there is a large chance that at least one of them will be zero, resulting in an overall likelihood estimator equal to zero. This is a bit problematic since we want to compare the likelihoods associated with different values of sigma!

There’s a nice trick in “The Alive Particle Filter” by Jasra, Lee, Yau, Zhang which exploits a property of Negative Binomial variables established by Neuts and Zacks in 1967. The estimator is provided by the algorithm below.

The output of that algorithm has expectation r(x) and is guaranteed to never be equal to zero. Equipped with this, we can obtain unbiased, non-negative estimators of the full likelihood of sigma. In combination with some prior information, we can run a pseudo-marginal Metropolis-Hastings algorithm on the sigma space, the output of which is in the figure below.
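Here is my own sketch of that kind of estimator, using the negative binomial identity (if N is the number of Bernoulli(r) trials needed to observe two successes, then 1/(N−1) has expectation r and is strictly positive); the target and tuning values are illustrative, and this is a reconstruction of the idea rather than the authors’ exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(x):
    return -0.5 * x ** 2          # standard Normal target, up to a constant

def accept_prob(x, xp):
    return min(1.0, float(np.exp(log_target(xp) - log_target(x))))

def rejection_prob_estimator(x, sigma, rng):
    """Unbiased, strictly positive estimator of r(x): run Bernoulli(r) trials
    (one trial = propose and see whether it would be rejected) until two
    rejections occur; if N trials were needed, return 1/(N-1)."""
    rejections, trials = 0, 0
    while rejections < 2:
        xp = x + sigma * rng.normal()
        trials += 1
        if rng.uniform() >= accept_prob(x, xp):   # a rejection event, prob r(x)
            rejections += 1
    return 1.0 / (trials - 1)

x, sigma = 1.0, 2.0
estimates = np.array([rejection_prob_estimator(x, sigma, rng)
                      for _ in range(20_000)])

# Brute-force Monte Carlo estimate of r(x), for comparison
xp = x + sigma * rng.normal(size=200_000)
r_brute = np.mean(1.0 - np.minimum(1.0, np.exp(log_target(xp) - log_target(x))))
```

Since the estimator always lies in (0, 1], a product of such estimators over the whole chain can never collapse to zero, which is exactly what the pseudo-marginal algorithm on sigma needs.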

At this point, a new “meta” problem would be the inference of the standard deviation used in the pseudo-marginal algorithm defined on the sigma space!…

The problem is related to work on the modeling of animal movements, for instance “Inference in MCMC step selection models” by Michelot, Blackwell, Chamaillé-Jammes and Matthiopoulos, where MCMC-type algorithms are used as statistical models for animal movements. Their appeal is that they provide simple mechanisms to describe local moves, while being guaranteed to admit a specified global stationary distribution that might describe where animals roam “on average”.

The code producing the above figures is here: https://github.com/pierrejacob/statisfaction-code/blob/master/2020-01-inferenceMCMC.R
