Hello hello,

I have just arXived a review article, written for ESAIM: Proceedings and Surveys, called Sequential Bayesian inference for implicit hidden Markov models and current limitations. The topic is sequential Bayesian estimation: you want to perform inference (say, parameter inference, or prediction of future observations), taking into account parameter and model uncertainties, using hidden Markov models. I hope that the article can be useful for some people: I have tried to stay at a general level, but there are more than 90 references if you’re interested in learning more (sorry in advance for not having cited your article on the topic!). Below I’ll comment on a few points.

Hidden Markov models are very flexible tools to model time series: the observations are assumed to be noisy measurements of a Markov process. The Markov process can represent the complex dynamics of the underlying phenomenon (in the example of the article, it is a prey-predator model for the population growth of planktons). The noise in the measurements accounts for the error of the measuring devices, the fact that the underlying process is partially observed, etc.

The term “implicit”, introduced in Time series analysis via mechanistic models, refers to models where the latent process is a “black box”: we can simulate it, but that’s it. On the other hand, we assume that we can evaluate the probability density function of the measurement distribution.

Sequential inference refers to the ability to update the estimation as new observations arrive. For instance, the observations might be acquired on a daily basis, and thus we might want to update our predictions every day. If the predictions were obtained using “batch techniques” (e.g. MCMC), we would need to re-run the algorithms “from scratch” every day. With sequential methods (such as SMC), we can assimilate the latest observation, for a hopefully small cost every day. Unfortunately, even the recent techniques reviewed in the article fail to be “truly online”, in the sense that the statistical error will eventually blow up when parameter uncertainty is taken into account. If the parameters are kept as fixed values, then the problem becomes easier and can be dealt with in a truly online way. This is one of the current challenges that I’m discussing in the article.
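To make the sequential assimilation idea concrete, here is a minimal bootstrap particle filter on a toy linear-Gaussian hidden Markov model, assimilating one observation at a time with the parameters kept fixed (the "easier", truly online setting). The model, parameter values and variable names are all illustrative, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden Markov model: AR(1) latent state, noisy observations.
rho, sig_x, sig_y = 0.9, 1.0, 1.0
T, N = 100, 1000  # time steps, number of particles

# Simulate the latent Markov process and the observations.
x = np.zeros(T)
x[0] = rng.normal(0.0, sig_x)
for t in range(1, T):
    x[t] = rho * x[t - 1] + rng.normal(0.0, sig_x)
y = x + rng.normal(0.0, sig_y, size=T)

# Bootstrap particle filter: assimilate one observation at a time.
particles = rng.normal(0.0, sig_x, size=N)
filter_means = np.zeros(T)
for t in range(T):
    if t > 0:  # propagate through the (simulable) latent dynamics
        particles = rho * particles + rng.normal(0.0, sig_x, size=N)
    # weight by the measurement density, which we assume we can evaluate
    logw = -0.5 * ((y[t] - particles) / sig_y) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # multinomial resampling
    particles = particles[rng.choice(N, size=N, p=w)]
    filter_means[t] = particles.mean()

corr = np.corrcoef(filter_means, x)[0, 1]
print(f"correlation between filtering means and true states: {corr:.3f}")
```

Each new observation costs a fixed amount of work here; the difficulties discussed in the article appear once the fixed parameters above are replaced by unknowns with their own posterior distribution.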

I also want to comment on model uncertainty: in many scenarios, we have a few possible models to deal with a given time series. People often quote George E. P. Box: “all models are wrong, but some are useful”. Wrong here means that the observations are actually not realizations of one of the models, and this is undoubtedly correct. A common belief is that statistical inference naively assumes that one of the models is true; this is far from correct. A lot of articles have investigated the “misspecified setting” in detail, and many statistical procedures (including MLE, Bayesian inference and model comparison techniques) provide justifiable answers without assuming that the model is true. In the article, I discuss the Bayes factor between two models. It has a perfectly reasonable justification as a prior predictive criterion; in other words, it compares models on the grounds of how likely the observations are under the prior distributions. Thus one does not need to assume anything about the data-generating process in order to use Bayes factors.

Another interesting aspect of the Bayes factor, by the way, is Occam’s razor principle. Simpler models are favoured over more complex models until enough data have been gathered to say otherwise. The same principle motivates the AIC and BIC criteria, which can be seen as asymptotic approximations of Bayes factors for particular choices of priors. In the figure, we can see that a “wrong model” is better than the true data-generating model until about 50 to 100 observations are assimilated (in the prior predictive sense of the Bayes factor). The figure shows the estimated Bayes factors for five independent runs of one of the numerical methods reviewed in the article (SMC^2): we see that the runs diverge because the errors accumulate over time, hence the method is not “online”.
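A toy illustration of this Occam effect (my own coin-tossing example, unrelated to the plankton model): model M1 fixes the probability of heads at 1/2, model M2 puts a uniform prior on it, and the data show 60% heads. Both prior predictive (marginal) likelihoods are available in closed form, and the simpler M1 wins until roughly a hundred tosses:

```python
from math import lgamma, log

def log_ml_simple(heads, tails):
    # M1: p = 1/2 fixed; the marginal likelihood is (1/2)^n
    return (heads + tails) * log(0.5)

def log_ml_uniform(heads, tails):
    # M2: p ~ Uniform(0,1); the marginal likelihood is the Beta function
    # B(heads + 1, tails + 1) = heads! * tails! / (n + 1)!
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)

def log_bayes_factor_12(n, frac_heads=0.6):
    # log Bayes factor of M1 over M2 after n tosses with 60% heads
    h = round(frac_heads * n)
    return log_ml_simple(h, n - h) - log_ml_uniform(h, n - h)

for n in [10, 50, 100, 500]:
    print(n, log_bayes_factor_12(n))
```

With these numbers the log Bayes factor is positive (favouring the fixed-coin model) for small samples and turns negative once a few hundred observations have accumulated, mirroring the behaviour seen in the figure.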


- W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications.
*Biometrika*, 57(1):97–109, 1970.

In this paper, we shall consider Markov chain methods of sampling that are generalizations of a method proposed by Metropolis et al. (1953), which has been used extensively for numerical problems in statistical mechanics.

- Dennis V. Lindley and Adrian F.M. Smith. Bayes estimates for the linear model.
*Journal of the Royal Statistical Society: Series B (Methodological)*, 34(1):1–41 (with discussion), 1972.

From Prof. B. de Finetti’s discussion (note the *valiant* collaborator Smith!):

I think that the main point to stress about this interesting and important paper is its significance for the philosophical questions underlying the acceptance of the Bayesian standpoint as the true foundation for inductive reasoning, and in particular for statistical inference. So far as I can remember, the present paper is the first to emphasize the role of the Bayesian standpoint as a logical framework for the analysis of intricate statistical situations. […] I would like to express my warmest congratulations to my friend Lindley and his valiant collaborator Smith.

- Persi Diaconis and Donald Ylvisaker. Quantifying prior opinion.
*Bayesian Statistics*, 2:133–156, 1985.

About the binomial model of spinning a coin:

Let us distinguish three categories of Bayesians (certainly a crude distinction in light of Good’s (1971) 46,656 lower bound on the possible types of Bayesians).

- Classical Bayesians (like Bayes, Laplace and Gauss) took a uniform distribution for the prior, the so-called flat prior.
- Modern Parametric Bayesians (Raiffa, Lindley, Mosteller) took the prior as a beta density […]
- Subjective Bayesians (Ramsey, de Finetti, Savage) take the prior as a quantification of what is known about the coin and spinning process.

- Eric L. Lehmann. Model specification: the views of Fisher and Neyman, and later developments.
*Statistical Science*, 5(2):160–168, 1990.

Where do probability models come from? To judge by the resounding silence over this question on the part of most statisticians, it seems highly embarrassing. In general, the theoretician is happy to accept that his abstract probability triple was found under a gooseberry bush, while the applied statistician’s model “just growed”. (quoting A. P. Dawid, 1982)

- William H. Jefferys and James O. Berger. Ockham’s razor and Bayesian analysis.
*American Scientist*, 64–72, 1992.

William of Ockham, the 14th-century English philosopher, stated the principle thus: “Pluralitas non est ponenda sine necessitate”, which can be translated as: “Plurality must not be posited without necessity.” […] Ironically, whereas Bayesian methods have been criticized for introducing subjectivity into statistical analysis, the Bayesian approach can turn Ockham’s razor into a less subjective and even “automatic” rule of inference.

- Eugene Seneta. Lewis Carroll’s “pillow problems”: on the 1993 centenary.
*Statistical Science*, 180–186, 1993.

All 72 [*Pillow Problems*] are claimed to have been formulated and worked out at night while in bed, mentally, and the answer written down afterward. [C. L. Dodgson, a.k.a. Lewis Carroll]’s work reflects the nature, standing and understanding of probability within the wider English mathematical community of the time.

- James O. Berger. Bayesian analysis: A look at today and thoughts of tomorrow.
*Journal of the American Statistical Association*, 95(452):1269–1276, 2000.

Life was simple when I became a Bayesian in the 1970s; it was possible to track virtually all Bayesian activity. Preparing this paper on Bayesian statistics was humbling, as I realized that I have lately been aware of only about 10% of the ongoing activity in Bayesian analysis.

Below, the seven presentations.


Xian blogged recently on the forthcoming RSS read paper: Statistical Modelling of Citation Exchange Between Statistics Journals, by Cristiano Varin, Manuela Cattelan and David Firth. Following the last JRSS B read paper by one of us! The data that are used in the paper (and can be downloaded here) are quite *fascinating* for us, *academics fascinated by academic rankings, for better or for worse* (ironic here). They consist of cross-citation counts for 47 statistics journals (see the list and abbreviations on page 5): each entry counts the citations from articles published in one journal in 2010 to papers published in another journal during the 2001–2010 decade. The choice of the list of journals is discussed in the paper. Major journals missing include *Bayesian Analysis* (published from 2006) and *The Annals of Applied Statistics* (published from 2007).

I looked at the ratio of Total Citations Received to Total Citations Made. This is a super simple descriptive statistic which happens to look rather similar to Figure 4, which plots Export Scores from the Stigler model (I can’t say more about it, I haven’t read that part in detail). The top five is the same modulo the swap between *Annals of Statistics* and *Biometrika*. Of course a big difference is that the Cited/Citing ratio isn’t endowed with a measure of uncertainty (below, left is my making, right is Fig. 4 in the paper).

I was surprised not to see a graph / network representation of the data in the paper. As it happens I wanted to try the Gephi software for drawing graphs, used for instance by François Caron and Emily Fox in their sparse graphs paper. I got the above graph, where:

- for the data, I used the citation matrix renormalized by the total number of citations made by each journal. This is a way to account for the size (number of papers published) of the journal. It is just a proxy though, since the actual number of papers published by each journal is not available in the data. Without that correction, *CSDA* is way ahead of all the others.
- the node size represents the Cited/Citing ratio.
- the edge width represents the renormalized citation counts. I’m unsure of what Gephi does here, since it converts my directed graph into an undirected graph. I suppose that it displays only the larger of the two directed edges between any two journals.
- for better visibility I kept only the first decile of heaviest edges.
- the clusters identified by four colors are modularity classes obtained by the Louvain method.
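For concreteness, the quantities used above can be computed in a few lines; here on a made-up 3×3 count matrix (journal labels and counts are invented, the real data cover 47 journals):

```python
import numpy as np

journals = ["A", "B", "C"]  # invented journal labels
# C[i, j]: citations from articles in journal i (2010) to papers in journal j
C = np.array([[50., 10.,  5.],
              [20., 80., 15.],
              [ 5., 25., 40.]])

citing = C.sum(axis=1)   # total citations made by each journal
cited = C.sum(axis=0)    # total citations received by each journal
ratio = cited / citing   # Cited/Citing ratio (node size in the graph)

# citation matrix renormalized by the total citations made by each journal,
# a proxy for journal size (edge weights in the graph)
C_norm = C / citing[:, None]

print(dict(zip(journals, ratio)))
```

Note that the diagonal of the count matrix holds the self-citations, which is why the software journals stand out so much below.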

**Some remarks**

The two software journals included in the dataset are clear outliers:

- the *Journal of Statistical Software (JSS)* is disconnected from the others, meaning it exchanges no normalized citations in the first decile, except for its self-citations, which are quite large and give it the 4th highest Impact Factor of the whole list in 2010 (and apparently the first in 2015).
- the largest edge is the self-citations of the *STATA Journal (StataJ)*.

Centrality:

*CSDA* is the most central journal in the sense of having the highest (unweighted) degree.

**Some further thoughts**

All that is just for the fun of it. As mentioned by the authors, citation counts are heavy-tailed, meaning that a few papers account for much of the citations of a journal while most of the papers account for few citations. As a matter of fact, the total of citations received is mostly driven by a few super-cited papers, and so is the Cited/Citing matrix that I use throughout for building the graph. A reason one could put forward for why JRSS B does so well is the read papers: for instance, Spiegelhalter et al. (2002), DIC, alone received 11.9% of all JRSS B citations in 2010. Who’d bet on the number of citations this new read paper (JRSS A, though) will receive?


This week I’ll start my Bayesian Statistics master’s course at the Collegio Carlo Alberto. I realized that some of last year’s students got PhD positions in prestigious US universities. So I thought that letting this year’s students have a first grasp of some great Bayesian papers wouldn’t do any harm. The idea is that, in addition to the course, the students will pick a paper from a list and present it (or rather part of it) to the others and to me. This will let them earn some extra points for the final exam mark. It’s in the spirit of Xian’s Reading Classics Seminar (his list here).

I’ve made up the list below, inspired by the reference lists of two textbooks and biased by personal tastes: Xian’s Bayesian Choice and Peter Hoff’s First Course in Bayesian Statistical Methods. See the pdf list and zipped folder for the papers. Comments on the list are most welcome!

Julyan

PS: reference n°1 isn’t a joke!


*Hello all,*

*This is an article intended for the ISBA bulletin, jointly written by us all at Statisfaction, Rasmus Bååth from Publishable Stuff, Boris Hejblum from Research side effects, Thiago G. Martins from tgmstat@wordpress, Ewan Cameron from Another Astrostatistics Blog and Gregory Gandenberger from gandenberger.org. *

Inspired by established blogs, such as the popular Statistical Modeling, Causal Inference, and Social Science or Xi’an’s Og, each of us began blogging as a way to diarize our learning adventures, to share bits of R code or LaTeX tips, and to advertise our own papers and projects. Along the way we’ve come to a new appreciation of the world of academic blogging: a never-ending international seminar, attended by renowned scientists and anonymous users alike. Here we share our experiences by weighing the pros and cons of blogging from the point of view of young researchers.

At least at face value blogging has some notable advantages over traditional academic communication: publication is instantaneous and thus proves efficient in sparking discussions and debates; it allows all sorts of technological sorcery (hyperlinks, animations, applications), while many journals are still adapting to grayscale plots; and it allows for humorous and colourful writing styles, freeing the writer from the constraints of the impersonal academic prose. Last but not least, it is acceptable to blog about almost any topic, from office politics to funding bodies, from complaints about the absurdity of p-values to debates on the net profits of publishing companies, not to mention quarrels about the term “data science”.

For young researchers, some aspects are particularly appealing. By putting academics directly in touch with one another through comments and replies, blogs give young researchers the opportunity to “talk” about technical subjects directly with some of the most renowned names in their fields—and indeed a surprising number of senior researchers are avid blog readers! This often proves much more efficient than trying to awkwardly stalk the same professors at conferences. Through such interactions, young academics can show off their many interests and skills, which can do much to fill out the picture painted by their academic CV.

Beyond those low and careerist considerations, we see blogging as a good tool to learn and to share scientific ideas. According to popular belief, only a third of all started research projects end up in a publication; but all of them can at least end up on a blog. So if you indulge in a bit of off-topic study, or burn a few hours playing around with a new methodology, it need not fuel your performance anxiety: a blog post explaining it will still feel like a delivered product. And you will very likely get some interesting feedback—though rarely to the depth given in journal reviews.

Finally, using blogs to advertise articles and packages seems particularly useful at the early stage of a career, where you might not be invited to that many conferences, or might only be given some dark corner of a giant poster session to talk about your work.

Some cautionary notes now: blogging can be risky! As the adage goes, “better to keep your mouth shut and appear a fool than to open it and remove all doubt”. Beyond the quality of the content being shared, blogs are also sometimes disregarded by academics as a frivolous medium; there is a risk that your colleagues will see your blogging hobby as a pure waste of time.

A second risk is disclosing too much information about promising research leads. There should be some balance between ideas shared and ideas kept secret, so that blogging does not jeopardize publication. Other platforms that formally establish precedence (such as arXiv) might be better suited for the initial presentation of new and exciting work. For this reason it seems wisest to blog a posteriori, though blogs then lose some of their potential to function as real-time research diaries.

A third risk is genuine time-wasting. For those who have never tried, it can be surprising to discover how many hours are needed to write each post. It can be frustrating in the beginning, when reader statistics indicate an audience of just one or two spam-bots and some curious relatives. On the other hand, there is still only a limited number of academic blogs on statistics, so the market is far from saturated: any new blog can quickly garner a decent amount of attention. Of course it can be hard to keep a regular posting schedule, which is necessary to maintain a stable reader base.

To conclude, blogging can be a clever way to bypass the hierarchical structure of academia. It gives everyone a direct and fast access to everyone else. In some respects it helps to alleviate key problems affecting young researchers, such as the lengthy reviewing process of top journals and the lack of communication space.


but in the RSS version, it reads incorrectly.

Well, that’s a bummer. For now, I recommend reading the arXiv version instead (updated on Monday).


Almost 10 months since my last post? I guess bloggin’ ain’t my thing… In my defense, Mathieu Gerber and I were quite busy revising our SQMC paper. I am happy to announce that it has just been accepted as a read paper in JRSS B. If all goes as planned, we should present the paper at the RSS ordinary meeting on Dec 10. Everybody is welcome to attend, and to submit an oral or written discussion (or both). More details soon, when the event is officially announced on the RSS website.

What is SQMC? It is a QMC (Quasi-Monte Carlo) version of particle filtering. For the same CPU cost, it typically generates much more accurate estimators. Interested? Consider reading the paper here (a more recent version is coming soon), checking this video where I present SQMC, or, even better, attending our talk in London!


The hazard rate is modelled as a kernel mixture, $h(t) = \int k(t; x)\,\tilde\mu(\mathrm{d}x)$, where $k$ is a kernel and the mixing distribution $\tilde\mu$ is random and discrete (Bayesian nonparametric approach).

We consider the survival function, which is recovered from the hazard rate by the transform $S(t) = \exp\left(-\int_0^t h(s)\,\mathrm{d}s\right)$, and some possibly censored survival data having survival function $S$. Then it turns out that all the posterior moments of the survival curve evaluated at any time $t$ can be computed.

The nice trick of the paper is to use the representation of a distribution in a Jacobi polynomial basis, where the coefficients are linear combinations of the moments. So one can sample from [an approximation of] the posterior, and with a posterior sample we can do everything! Including credible intervals.

I’ve wrapped up the few lines of code in an R package called momentify (not on CRAN). With a sequence of moments of a random variable supported on [0,1] as input, the package does two things:

- evaluates the approximate density
- samples from it

A package example for a beta mixture, using 2 to 7 moments, gives the result shown in the figure.
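A minimal Python sketch of the underlying moment-to-density idea (my own illustration, not the momentify package itself, and using shifted Legendre rather than Jacobi polynomials): expand the density of a [0,1]-valued random variable on an orthogonal polynomial basis, with coefficients that are linear combinations of the moments.

```python
import numpy as np
from math import comb
from numpy.polynomial import legendre

def shifted_legendre_monomial_coeffs(n):
    # monomial coefficients a_k of P_n(2x - 1) = sum_k a_k x^k on [0, 1]
    c = legendre.leg2poly([0.0] * n + [1.0])  # P_n(t) in powers of t
    a = np.zeros(n + 1)
    for j, cj in enumerate(c):
        for k in range(j + 1):  # expand (2x - 1)^j by the binomial theorem
            a[k] += cj * comb(j, k) * 2.0**k * (-1.0)**(j - k)
    return a

def density_from_moments(moments, xs):
    # f(x) ~= sum_n (2n + 1) E[P_n(2X - 1)] P_n(2x - 1), where each
    # expectation E[P_n(2X - 1)] is a linear combination of E[X^k]
    fx = np.zeros_like(xs)
    for n in range(len(moments)):
        a = shifted_legendre_monomial_coeffs(n)
        coef = a @ moments[: n + 1]
        fx += (2 * n + 1) * coef * legendre.legval(2 * xs - 1, [0.0] * n + [1.0])
    return fx

def beta_moments(a, b, kmax):
    # E[X^k] for k = 0..kmax, when X ~ Beta(a, b)
    m = [1.0]
    for k in range(kmax):
        m.append(m[-1] * (a + k) / (a + b + k))
    return np.array(m)

xs = np.linspace(0.01, 0.99, 99)
approx = density_from_moments(beta_moments(2, 5, 6), xs)
```

Here the Beta(2,5) density is itself a polynomial of degree 5, so the expansion based on the first moments recovers it essentially exactly; for general densities the quality of the approximation depends on the number of moments used.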


Hey hey,

With Alexandre Thiéry we’ve been working on non-negative unbiased estimators for a while now. Since I’ve been talking about it at conferences and since we’ve just arXived the second version of the article, it’s time for a blog post. This post is kind of a follow-up to a previous post from July, where I was commenting on Playing Russian Roulette with Intractable Likelihoods by Mark Girolami, Anne-Marie Lyne, Heiko Strathmann, Daniel Simpson and Yves Atchadé.

The setting is the combination of two components.

**1°)** There are techniques to “debias” consistent estimators. Consider a sequence of random variables $(X_n)_{n \ge 0}$ converging to some quantity $\mu$, in the sense that $\mathbb{E}[X_n] \to \mu$. Introduce an integer-valued random variable $N$, independent of the sequence, with survival probabilities $p_n = \mathbb{P}(N \ge n) > 0$. Then the random variable $Z = \sum_{n=0}^{N} (X_n - X_{n-1})/p_n$ (with the convention $X_{-1} = 0$) is an unbiased estimator of $\mu$, i.e. its expectation is $\mu$. Under additional assumptions it has a finite variance and a finite expected computational time… wow. We’ve just removed the bias off a sequence of biased estimators. We’ve reached the limit, we’ve reached infinity, we’re beyond heaven. That random truncation trick has been invented and reinvented (from Von Neumann and Ulam!) over the years, but the most thorough and general study is found in Rhee & Glynn (2013). See for instance Rychlik (1990) for an early example of the same trick.
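As a toy numerical check of this random truncation trick (my own example, not from the paper): take the deterministic “estimators” $X_n = \mu(1 - 2^{-(n+1)})$, each biased for $\mu$, truncate at an independent geometric random variable $N$ with $\mathbb{P}(N \ge n) = 2^{-n}$, and reweight the telescoping increments by those survival probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0

def biased(n):
    # toy biased "estimator": deterministic here, with bias -mu * 2^-(n+1)
    return mu * (1.0 - 0.5 ** (n + 1))

def debiased_draw():
    # random truncation: N lives on {0, 1, ...} with P(N >= n) = 0.5^n
    N = rng.geometric(0.5) - 1
    z, prev = 0.0, 0.0
    for n in range(N + 1):
        z += (biased(n) - prev) / 0.5 ** n  # increment / survival probability
        prev = biased(n)
    return z

draws = np.array([debiased_draw() for _ in range(100000)])
print(draws.mean())  # close to mu = 2.0, although every X_n is biased
```

Each draw only evaluates finitely many terms of the sequence, yet the average converges to the limit $\mu$ rather than to any of the biased values.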

**2°)** Now, since there’s one way to debias estimators, there might be others. In particular there might be some way to remove the bias *and* to guarantee some positivity constraint. That is, write $\lambda$ for the quantity of interest and assume now that $\lambda$ is non-negative. We might want an unbiased estimator of $\lambda$ that takes almost surely non-negative values. A motivating example is precisely the Russian Roulette paper mentioned above, and in general the pseudo-marginal methods. With those methods we can perform “exact inference” on a posterior distribution, as long as we have access to non-negative unbiased estimators of pointwise evaluations of its probability density function.

Our results identify the cases where non-negative unbiased estimators can be obtained, in the following sense. For instance, assume that we have access to a real-valued unbiased estimator of $\lambda$, from which we can draw independent copies. We show that there is no algorithm taking those estimators as input and producing almost surely non-negative unbiased estimators of that same $\lambda$. So it’s impossible to “positivate” an unbiased estimator just like that. To prove such a result we rely on a precise definition of algorithm, which we believe is not restrictive.

More generally we show that if we have unbiased estimators of $\lambda$ and want to obtain non-negative unbiased estimators of $f(\lambda)$ for some function $f$, well that’s impossible in general. We are sorry.

However if you have an unbiased estimator of $\lambda$ taking values in an interval $[a, b]$, then it can be possible to have a non-negative unbiased estimator of $f(\lambda)$, depending on the function $f$ considered, and in this case the problem is very much related to the Bernoulli Factory problem of Von Neumann (again! Damn you v.N.). In other words, if you have more knowledge on the unbiased estimator used as input (in this case lower and upper bounds), the problem might have a solution. In practice this type of knowledge would be model specific.
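To illustrate how knowing an interval for the input estimators helps (a classical construction in this literature, not a result specific to the paper): if $W_1, W_2, \dots$ are i.i.d. unbiased estimators of $\lambda$ taking values in $[0, 1]$, and $N \sim \text{Poisson}(1)$ independently, then $\prod_{i=1}^{N}(1 - W_i)$ is almost surely non-negative, and its expectation is $\mathbb{E}[(1-\lambda)^N] = e^{-\lambda}$: a non-negative unbiased estimator of $f(\lambda) = e^{-\lambda}$.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.5  # quantity of interest; estimated by W ~ Uniform(0, 1), E[W] = 0.5

def nonneg_unbiased_exp_minus_lambda():
    # product over a Poisson(1) number of independent [0,1]-valued unbiased
    # estimators of lambda; each factor (1 - W) lies in [0, 1], so the
    # product is always non-negative (empty product = 1)
    n = rng.poisson(1.0)
    w = rng.uniform(0.0, 1.0, size=n)
    return np.prod(1.0 - w)

draws = np.array([nonneg_unbiased_exp_minus_lambda() for _ in range(200000)])
print(draws.mean(), np.exp(-lam))  # the two numbers should be close
```

The $[0, 1]$ bounds on the $W_i$ are exactly the extra knowledge that makes a non-negative construction possible here; with unbounded real-valued inputs the impossibility result above applies.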

When there aren’t any non-negative unbiased estimators available, pseudo-marginal methods cannot be directly applied. Since those methods have proven very successful in some important areas such as hidden Markov models, we believe it’s interesting to characterize the other settings in which they might be applied. In the paper we discuss exact simulation of diffusions, inference for big data, doubly intractable distributions and inference based on reference priors. In those fields (at least the first three) people have tried to come up with general non-negative unbiased estimators, so we hope to save them some time!


Hey there,

It’s been a while since I’ve written about parallelization and GPUs. With colleagues Lawrence Murray and Anthony Lee we have just arXived a new version of Parallel resampling in the particle filter. The setting is that, on modern computing architectures such as GPUs, thousands of operations can be performed in parallel (i.e. simultaneously), and therefore whatever part of the calculations cannot be parallelized quickly becomes the bottleneck. In the case of the particle filter (or any sequential Monte Carlo method, such as SMC samplers), that bottleneck is the resampling step. The article investigates this issue and numerically compares different resampling schemes.

In the resampling step, given a vector of “weights” $(w_1, \ldots, w_N)$ (non-negative real numbers), a vector of integers called “offspring counts”, $(o_1, \ldots, o_N)$, is drawn such that for all $i$, $\mathbb{E}[o_i] = N w_i / \sum_{j=1}^{N} w_j$. That is, on average a particle has a number of offspring proportional to its normalized weight. Most implementations of the resampling step require a collective operation, such as computing the sum of the weights to normalize them. On top of being a collective operation, computing the sum of the weights is not numerically stable when the weight vector is very large. Numerical results in the article show that in single precision floating point format (as preferred for fast execution on the GPU) and for vectors of size half a million or more, a typical implementation of the resampling step (multinomial, residual, systematic…) exhibits a non-negligible bias due to numerical instability.

Two resampling strategies come to the rescue: Metropolis and rejection resampling. These methods, described in detail in the article, rely only on pairwise weight comparisons and thus 1) are numerically stable and 2) bypass collective operations. Interestingly enough, the Metropolis resampler is theoretically biased but, when numerical stability is taken into account in single precision, proves “less biased” than the traditional resampling strategies (which are theoretically unbiased!), again when using half a million particles or more. It’s not too crazy to imagine that particle filters will soon be commonly run with millions of particles, hence the interest of studying the behaviour of resampling schemes in that regime.
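Here is a sketch of the idea behind the Metropolis resampler (a simplified, vectorized version of my own making; see the article for the actual GPU implementation and for guidance on the number of iterations): every offspring index is produced by a short Metropolis chain over particle indices that only ever compares pairs of weights, so the sum of the weights is never computed.

```python
import numpy as np

rng = np.random.default_rng(3)

def metropolis_resample(weights, num_offspring, num_iters=50):
    # Draw ancestor indices distributed (approximately) proportionally to
    # the weights, using only pairwise weight comparisons: no collective
    # sum, hence no normalization step.
    n = len(weights)
    idx = np.arange(num_offspring) % n  # arbitrary starting indices
    for _ in range(num_iters):
        prop = rng.integers(n, size=num_offspring)  # uniform proposals
        u = rng.uniform(size=num_offspring)
        # accept with probability min(1, w_prop / w_idx)
        accept = u * weights[idx] <= weights[prop]
        idx = np.where(accept, prop, idx)
    return idx

w = np.array([0.1, 0.2, 0.3, 0.4])  # toy weights (already normalized here)
anc = metropolis_resample(w, num_offspring=20000)
freq = np.bincount(anc, minlength=len(w)) / 20000.0
print(freq)  # close to w for a long enough chain
```

The bias mentioned above comes from truncating each chain at a finite number of iterations; the point of the article is that, in single precision and with very many particles, this bias can be smaller than the numerical error incurred by summing the weights.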

Other practical aspects of resampling implementations are discussed in the article, such as whether the resampling step should be done on the CPU or on the GPU, taking into account the cost of copying the vectors into memory. Decision matrices are given (figure above), indicating the best strategy both in terms of where to perform resampling (CPU or GPU) and of which resampling scheme to use.

All the numerical results of the article can be reproduced using the Resampling package for LibBi.
