“Grazie mille! Un grande piacere e un grande onore per me!”

I attended both, the first because I am acting as a research advisor for Math en Jeans groups. Villani spoke about his book, *Birth of a Theorem*, or *Théorème Vivant*. He also shared a list of se7en thoughts/tips about doing research, with illustrations. I find them quite inspiring; here they are.

**1. Documentation/literature**

Illustrated by showing the Wikipedia page for Faà di Bruno’s formula. I like this example, since the formula enters moment computations for objects I use every day, and also because Faà di Bruno lived in Italian Piedmont, in Turin to be precise.

**2. Motivation**

*“The most important and the most mysterious.”*

**3. Favorable environment**

Showing pictures of several places where he worked, including Institut Henri Poincaré. Not sure that this one is the most favorable environment for scientific productivity (as a Director, I mean).

**4. Exchanges**

Meaning exchanges between scientists, not trade. He briefly explained the polymath projects and displayed a snapshot of Gowers’s Weblog to illustrate how diverse the exchanges he has in mind are. I also believe that blogs are a great information medium :)

**5. Constraints**

With snapshots of the Musica Ricercata sheet music, and a paragraph of *La disparition*, a novel without the letter *e* by Georges Perec. Writing this makes me realize how foolish such an enterprise would look in mathematics.

**6. Work & Intuition**

Interesting to see these two at the same level.

**7. Perseverance & Luck**

Same comment as for point 6.

Julyan

]]>

El Capitan is a very nice mountain. It’s also the latest OS X version, which messes things up with LaTeX. Be aware of this before you update. I wasn’t!

I quote from a fix explained here:

Under OS X 10.11, El Capitan, writing to “/usr” is no longer allowed, even with Administrator privileges. The usual symbolic link to the active Distribution, “/usr/texbin”, is therefore removed (if it was there from a previous OS version) and cannot be installed. Many GUI applications have the path to those binaries set to “/usr/texbin” by default and will no longer find the binaries there.

I had to reinstall MacTeX, then update my GUI application for LaTeX (Texmaker), and finally replace every “/usr/texbin” by “/Library/TeX/texbin”.
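If you have several configuration files to fix, the replacement can be scripted. Here is a minimal Python sketch of the idea; the settings fragment is made up for illustration, and where your own editor actually stores its paths is something you would have to check yourself.

```python
OLD_TEXBIN = "/usr/texbin"
NEW_TEXBIN = "/Library/TeX/texbin"


def retarget_texbin(settings_text: str) -> str:
    """Return the settings text with every old TeX binary path replaced."""
    return settings_text.replace(OLD_TEXBIN, NEW_TEXBIN)


# Hypothetical settings fragment, for illustration only.
example = 'latex="/usr/texbin/latex"\npdflatex="/usr/texbin/pdflatex"'
print(retarget_texbin(example))
```

Make a backup of any real settings file before rewriting it this way.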

Cheers

Julyan

]]>

Hello there !

While I was in Amsterdam, I took the opportunity to go and work with the Leiden crowd, and more particularly with Stéphanie van der Pas and Johannes Schmidt-Hieber. Since Stéphanie had already obtained neat results for the Horseshoe prior and Johannes had obtained some super cool results for the spike and slab prior, they were the first choice to team up with to work on sparse models. And guess what? We have just arXived a paper in which we study the sparse Gaussian sequence model

$$X_i = \theta_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1), \qquad i = 1, \dots, n,$$

where only a small number of the $\theta_i$ are non-zero.

There is a rapidly growing literature on shrinkage priors for such models: just look at Polson and Scott (2012), Caron and Doucet (2008), Carvalho, Polson, and Scott (2010) among many, many others, or simply have a look at the program of the last BNP conference. There is also a growing literature on the theoretical properties of some of these priors. The Horseshoe prior was studied in van der Pas, Kleijn, and van der Vaart (2014), an extension of the Horseshoe was then studied in Ghosh and Chakrabarti (2015), and recently the spike and slab Lasso was studied in Ročková (2015) (see also Xi’an’s Og).

All these results are super nice, but we still want to know **why some shrinkage priors shrink so well and others do not?!** As we are *all* mathematicians here, I will reformulate this last question: **what are the conditions on the prior under which the posterior contracts at the minimax rate?**¹

We considered a Gaussian scale mixture prior on the sequence $\theta = (\theta_1, \dots, \theta_n)$,

$$\theta_i \mid \sigma_i^2 \sim \mathcal{N}(0, \sigma_i^2), \qquad \sigma_i^2 \sim \pi, \qquad \text{independently for } i = 1, \dots, n,$$

since this family of priors encompasses all the ones studied in the papers mentioned above (and more), so it seemed to be general enough.

Our main contribution is to give conditions on $\pi$ such that the posterior contracts at the right rate. We showed that in order to recover the parameters that are non-zero, the prior should have tails that decay at most exponentially fast, which is similar to the condition imposed for the Spike and Slab prior. Another expected condition is that the prior should put enough mass around 0, since our assumption is that the vector of parameters is nearly black, i.e. most of its components are 0.
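To get a feel for what a Gaussian scale mixture is, here is a small Python sketch (not the code used for the paper). It draws each parameter as a centred Gaussian whose variance is itself random. As an example of mixing law I take an exponential distribution with rate $\lambda^2/2$, which is the classical representation of the Laplace prior with parameter $\lambda$; the function names and the value of `lam` are mine, chosen for illustration.

```python
import random
import statistics


def sample_scale_mixture(draw_variance, n_draws, rng):
    """Draw theta ~ N(0, sigma2), with sigma2 drawn from draw_variance(rng)."""
    draws = []
    for _ in range(n_draws):
        sigma2 = draw_variance(rng)
        draws.append(rng.gauss(0.0, sigma2 ** 0.5))
    return draws


def exponential_variance(lam):
    """Mixing law sigma2 ~ Exp(rate = lam^2 / 2), giving a Laplace(lam) marginal."""
    return lambda rng: rng.expovariate(lam ** 2 / 2.0)


rng = random.Random(0)
lam = 2.0  # illustrative value
thetas = sample_scale_mixture(exponential_variance(lam), 200_000, rng)

# The Laplace(lam) distribution has variance 2 / lam^2; compare with the draws.
empirical_var = statistics.fmean(t * t for t in thetas)
print(empirical_var, 2.0 / lam ** 2)
```

Swapping `exponential_variance` for another mixing law changes the prior while keeping the same two-step sampling scheme, which is what makes the family convenient to work with.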

More surprisingly, in order to recover the zero parameters correctly, one also needs some conditions on the tail of the prior. More specifically, the prior’s tails cannot be too heavy: if they are, we can construct a prior that puts enough mass near 0 but does not concentrate at the minimax rate.

We showed that these conditions are satisfied for many priors including the Horseshoe, the Horseshoe+, the Normal-Gamma and the Spike and Slab Lasso.

Gaussian scale mixtures are also quite simple to use in practice. As explained in Caron and Doucet (2008), a *simple* Gibbs sampler can be implemented to sample from the posterior. We conducted a simulation study to evaluate the *sharpness* of our conditions. We computed the loss for the Laplace prior, the global-local scale mixture of Gaussians (called hereafter the *bad* prior for simplicity), the Horseshoe and the Normal-Gamma prior. The first two do not satisfy our conditions, and the last two do. The results are reported in the following picture.

As we can see, priors that do and do not satisfy our condition show different behaviour (it seems that the priors that do not fit our conditions have a risk larger than the minimax rate of a factor of ). This seems to indicate that our conditions are sharp.

At the end of the day, our results expand the class of shrinkage priors with theoretical guarantees for the posterior contraction rate. Not only can they be used to obtain the optimal posterior contraction rate for the horseshoe+, the inverse-Gaussian and normal-gamma priors, but the conditions also provide some characterization of the properties of sparsity priors that lead to desirable behaviour. Essentially, the tails of the prior on the local variance should be at least as heavy as Laplace, but not too heavy, and there needs to be a sizable amount of mass around zero compared to the amount of mass in the tails, in particular when the underlying mean vector grows to be more sparse.

Caron, François, and Arnaud Doucet. 2008. “Sparse Bayesian Nonparametric Regression.” In *Proceedings of the 25th International Conference on Machine Learning*, 88–95. ICML ’08. New York, NY, USA: ACM.

Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. 2010. “The Horseshoe Estimator for Sparse Signals.” *Biometrika* 97 (2): 465–80.

Ghosh, Prasenjit, and Arijit Chakrabarti. 2015. “Posterior Concentration Properties of a General Class of Shrinkage Estimators Around Nearly Black Vectors.”

Pas, S.L. van der, B.J.K. Kleijn, and A.W. van der Vaart. 2014. “The Horseshoe Estimator: Posterior Concentration Around Nearly Black Vectors.” *Electron. J. Stat.* 8: 2585–2618.

Polson, Nicholas G., and James G. Scott. 2012. “Good, Great or Lucky? Screening for Firms with Sustained Superior Performance Using Heavy-Tailed Priors.” *Ann. Appl. Stat.* 6 (1): 161–85.

Ročková, Veronika. 2015. “Bayesian Estimation of Sparse Signals with a Continuous Spike-and-Slab Prior.”

1. For those wondering why the heck we care about the minimax rate here, just remember that a posterior that contracts at the minimax rate induces an estimator which converges at the same rate. It also tells us that confidence regions will not be too large.

]]>

While everyone was away in July, James Ridgway and I posted our “leave (the) pima paper alone” paper on arXiv, in which we discuss to what extent probit/logit regression and moderately sized datasets (such as the now famous Pima Indians dataset) constitute a relevant benchmark for Bayesian computation.

The actual title of the paper is “Leave Pima Indians alone…”, but Xi’an changed it to “Leave *the* Pima Indians alone…” when discussing it on his blog. Any opinions on whether it sounds better with “the”?

On a different note, one of our findings is that Expectation-Propagation works wonderfully for such models; yes it is an approximate method, but it is very fast, and the approximation error is consistently negligible on all the datasets we looked at.

James has just posted on CRAN the EPGLM package, which computes an EP approximation of the posterior of a logit or probit model. The documentation is a bit terse at the moment, but it is very straightforward to use.

Comments on the package, the paper, its grammar or Pima Indians are most welcome!

]]>

This very fine title quotes a pretty hilarious banquet speech by David Dunson at the last BNP conference held in Raleigh last June. The graph is by François Caron who used it in his talk there. See below for his explanation.

After the summer break, back to work. The academic year to come looks promising from a BNP point of view, not least because three special issues have been announced: in Statistics & Computing (guest editors: Tamara Broderick (MIT), Katherine Heller (Duke), Peter Mueller (UT Austin)), in the Electronic Journal of Statistics (guest editor: Subhashis Ghosal (NCSU)), and in the International Journal of Approximate Reasoning (proposal deadline December 1st, guest editors: Alessio Benavoli (Lugano), Antonio Lijoi (Pavia) and Antonietta Mira (Lugano)).

BNP is also going to infiltrate MCMSki V, Lenzerheide, Switzerland, January 4-7, 2016, with three sessions with a BNP flavor, in addition to plenary speakers David Dunson and Michael Jordan. The International Society for Bayesian Analysis World Meeting, June 13-17, 2016, should also host plenty of BNP sessions, as well as a De Finetti Lecture by Persi Diaconis (Stanford University).

Below, François’ description of his graph

- nodes are speakers at BNP9 and / or BNP10
- edges link co-authors
- node and text sizes are proportional to node degree (nb of co-authors)
- visualization with gephi (spatialization Yifan Hu)

Some comments (by François)

- it’s most probable that he missed connections
- there’s obviously a selection bias by only taking on speakers of the last two BNP meetings
- the graph is obtained by a simple “one-mode projection” of the bipartite authors-articles graph; this projection isn’t optimal, since two authors of a six-author paper may not have really collaborated; Newman proposed another type of projection, which weights by the number of co-authors (e.g. a weight of 1/3 each for a three-author publication)
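For the curious, here is a small Python sketch of the two projections mentioned in the last comment. It is not the code behind the graph above: the toy data and the function name are made up, and the default weighting (1 per shared paper) is only one possible reading of the “simple” projection.

```python
from collections import defaultdict
from itertools import combinations


def one_mode_projection(papers, weight=lambda n_authors: 1.0):
    """Project a {paper: [authors]} mapping onto a weighted co-author graph.

    `weight` maps the number of authors of a paper to the weight that this
    paper adds to each of its co-author edges.
    """
    edges = defaultdict(float)
    for authors in papers.values():
        unique = sorted(set(authors))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += weight(len(unique))
    return dict(edges)


# Hypothetical toy data, for illustration only.
papers = {
    "paper1": ["Ann", "Bob"],
    "paper2": ["Ann", "Bob", "Cat"],
}

simple = one_mode_projection(papers)
# Weighted variant following the example in the text: 1/3 per edge for a
# three-author paper. Other normalisations, e.g. 1 / (n - 1), fit the same slot.
weighted = one_mode_projection(papers, weight=lambda n: 1.0 / n)
print(simple)
print(weighted)
```

With the weighted version, a link coming from a paper with many authors counts for less than a link coming from a two-author paper, which addresses the objection about six-author papers.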

Julyan

]]>

Adam Johansen, Thomas Schön and I co-organised SMC2015, a workshop on Sequential Monte Carlo methods that took place at ENSAE last week. In case you missed it, I’ve just uploaded the slides of most talks here. Enjoy!

]]>

With colleagues Stefano Favaro and Bernardo Nipoti from Turin and Yee Whye Teh from Oxford, we have just arXived an article on discovery probabilities. If you are looking for some info on a space shuttle, a cycling team or a TV channel, it’s the wrong place. Instead, discovery probabilities are central to ecology, biology and genomics, where data can be seen as a population of individuals belonging to an (ideally) infinite number of species. Given a sample of size $n$, the $l$-discovery probability $D_n(l)$ is the probability that the next individual observed matches a species with frequency $l$ in the $n$-sample. For instance, the probability $D_n(0)$ of observing a new species is key for devising sampling experiments.

By the way, why Alan Turing? Because, together with Irving John Good, his fellow researcher at Bletchley Park (who also appears in The Imitation Game), Turing is known for the so-called *Good-Turing estimator* of the discovery probability

$$\check{D}_n(l) = (l+1)\,\frac{m_{l+1,n}}{n},$$

which involves $m_{l+1,n}$, the number of species with frequency $l+1$ in the sample (i.e. frequencies of frequencies, if you follow me). As it happens, this estimator, defined in Good’s 1953 Biometrika paper, has been wildly popular among the ecology-biology-genomics communities ever since, at least in the small circles where wild popularity and probability aren’t mutually exclusive.

Simple explicit estimators of discovery probabilities in the Bayesian nonparametric (BNP) framework of Gibbs-type priors were given by Lijoi, Mena and Prünster in a 2007 Biometrika paper. The main difference between the two estimators of $D_n(l)$ is that Good-Turing involves $n$ and $m_{l+1,n}$ only, while the BNP estimator involves $n$, $m_{l,n}$ (instead of $m_{l+1,n}$), and $k_n$, the total number of observed species. It has been shown in the literature that the BNP estimators are more reliable than Good-Turing estimators.
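To make the “frequencies of frequencies” concrete, here is a short Python sketch of the Good-Turing estimator, assuming the standard form $(l+1)\,m_{l+1,n}/n$. The function names and the toy sample are mine; the BNP estimators are not implemented here.

```python
from collections import Counter


def frequency_counts(sample):
    """Return a Counter {l: m_l}: the number of species seen exactly l times."""
    species_counts = Counter(sample)
    return Counter(species_counts.values())


def good_turing(sample, l):
    """Good-Turing estimate of the l-discovery probability: (l + 1) * m_{l+1} / n."""
    n = len(sample)
    if n == 0:
        raise ValueError("the sample must contain at least one individual")
    m = frequency_counts(sample)
    return (l + 1) * m[l + 1] / n


# Toy sample of n = 7 individuals from 4 species (illustration only).
sample = ["a", "a", "b", "c", "c", "c", "d"]
print(frequency_counts(sample))
print(good_turing(sample, 0))  # estimated probability of a new species
```

Note that the estimate for frequency $l$ only looks at the species seen $l+1$ times, so it is zero whenever no species has that frequency.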

How do we contribute? (i) We describe the posterior distribution of the discovery probabilities in the BNP model, which is pretty useful for deriving exact credible intervals for the estimates, and (ii) we investigate the large-$n$ asymptotic behavior of the estimators.

We are not aware of any non-asymptotic method for deriving credible intervals for the BNP estimators. We fill this gap by describing the posterior distribution of the discovery probability $D_n(l)$. More specifically, we derive all posterior moments of $D_n(l)$. Since this random variable has compact support, $[0,1]$, it is characterized by its moments. So one can use a moment-based technique for sampling draws; see e.g. our momentify R package, written for another article. We also show that the posterior distribution is explicit in two special cases of Gibbs-type priors, known as the two-parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior. The posterior distribution is in fact shamelessly simple (once you know it) since it essentially amounts to [[a random] fraction of] a Beta distribution [with random coefficients].

As for the large-$n$ asymptotic behavior of the BNP estimators $\hat{D}_n(l)$, we prove the following asymptotic equivalences, denoted by $\simeq$,

$$\hat{D}_n(0) \simeq \sigma\,\frac{k_n}{n},$$

and for $l \geq 1$,

$$\hat{D}_n(l) \simeq (l - \sigma)\,\frac{m_{l,n}}{n},$$

where $\sigma$ is a parameter of the Gibbs-type prior. These can serve as approximations. In the cases of the two-parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior, we also provide a second-order term in the asymptotic expansion of the estimators, both for $l = 0$ and for $l \geq 1$, where the second-order term is either a constant or a quantity which converges almost surely to a random variable. In both cases, we show that it involves the second (and last) parameter of the priors, whereas the asymptotic equivalence given before involves only $\sigma$. Whether similar asymptotic expansions also hold in the whole Gibbs-type class remains an open problem!

If you have read till this point, then you may also be interested in listening to Stefano Favaro about it at the 10th Conference on Bayesian Nonparametrics next week in Raleigh, NC :-)

Cheers,

Julyan

]]>

Hi there !

Unfortunately this post is indeed about statistics…

If you randomly walk around the statistics blogs, you have almost certainly heard of this new language called Julia. It is said by its developers to be as easy to write as R and as fast as C (!), which is quite a catchy way of selling their work. After talking with an enthusiastic Julia user in Amsterdam, I decided to give it a try. And here I am, sharing my first impressions.

First things first: the installation is as easy as for any other language, plus there is a neat package manager that allows you to get started quite easily. In this respect it is very similar to R.

On the minus side: I have become a big fan of RStudio, which Julian (… oopsy, Julyan) told you about a long time ago. These kinds of programs really make your life easier. I thus tried Juno, which turned out to be cumbersome and terribly slow. I would have loved to have an IDE for Julia that would be up to the RStudio standard. Never mind.

Now let’s talk a little about what is really interesting: “Is their catch phrase false advertising or not?!”

There are a bunch of relatively good tutorials online, which are really helpful for learning the basic vocabulary. But if, like me, you are used to coding in R and/or Python, you should get it pretty fast: you can almost copy-paste your favourite code into Julia and, with a few adjustments, it will work. So, as easy to write as R: quite so.

I then tried to compare computational times for some of my latest codes, and here came the nice surprise! A code that would take a handful of minutes to run in R, mainly due to unavoidable loops, took a couple of seconds to run in Julia, without any other sort of optimization. The handling of big objects is smooth, and I did not run into the memory problems that R was suffering from.

So far so good! But of course there have to be some drawbacks. The first one is the poor package repository compared to CRAN, or even to what you can get for Python. This might of course improve in the next few years, as the language is still quite new. However, it is bothersome to have to re-code something when you are used to simply loading a package in R. Another, probably less important, problem is the lack of data visualization methods, and especially the absence of ggplot2, which we have grown quite fond of around here. There is of course Gadfly, which is quite close, but once again it is so far very limited compared to what I was used to…

All in all, I am happy to have tried Julia, and I am quite sure that I will be using it quite a lot from now on. However, even if it is great from an efficiency point of view, and way easier to learn than C (which I should have done a while ago), R and its tremendous package repository are far from beaten.

Oh, and by the way, it has PyPlot, based on Matplotlib, which allows you to make some xkcd-like plots that can make your presentations a lot more fun.

]]>

Hello hello,

I have just arXived a review article, written for ESAIM: Proceedings and Surveys, called Sequential Bayesian inference for implicit hidden Markov models and current limitations. The topic is sequential Bayesian estimation: you want to perform inference (say, parameter inference, or prediction of future observations), taking into account parameter and model uncertainties, using hidden Markov models. I hope that the article can be useful for some people: I have tried to stay at a general level, but there are more than 90 references if you’re interested in learning more (sorry in advance for not having cited your article on the topic!). Below I’ll comment on a few points.

Hidden Markov models are very flexible tools to model time series: the observations are assumed to be noisy measurements of a Markov process. The Markov process can represent the complex dynamics of the underlying phenomenon (in the example of the article, it is a prey-predator model for the population growth of plankton). The noise in the measurements accounts for the error of the measuring devices, the fact that the underlying process is partially observed, etc.

The term “implicit”, introduced in Time series analysis via mechanistic models, refers to models where the latent process is a “black box”: we can simulate it, but that’s it. On the other hand, we assume that we can evaluate the probability density function of the measurement distribution.

Sequential inference refers to the ability to update the estimation as new observations arrive. For instance, the observations might be acquired on a daily basis, and thus we might want to update our predictions every day. If the predictions were obtained using “batch techniques” (e.g. MCMC), we would need to re-run the algorithms “from scratch” every day. With sequential methods (such as SMC), we can assimilate the latest observation, for a hopefully small cost every day. Unfortunately, even the recent techniques reviewed in the article fail to be “truly online”, in the sense that the statistical error will eventually blow up when parameter uncertainty is taken into account. If the parameters are kept as fixed values, then the problem becomes easier and can be dealt with in a truly online way. This is one of the current challenges that I’m discussing in the article.
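As an illustration of what “assimilating the latest observation” means when the parameters are kept fixed, here is a minimal bootstrap particle filter in Python. It only needs to simulate the latent process and to evaluate the measurement density, which matches the implicit setting. This is a generic textbook sketch and not one of the specific methods reviewed in the article; the toy linear Gaussian model and all names are assumptions made for the example.

```python
import math
import random


def bootstrap_filter(observations, init, transition, log_obs_density,
                     n_particles, rng):
    """Return the running log-likelihood estimate after each observation."""
    particles = [init(rng) for _ in range(n_particles)]
    log_lik = 0.0
    history = []
    for y in observations:
        # Propagate: the latent process is a black box we can only simulate.
        particles = [transition(x, rng) for x in particles]
        # Weight: the measurement density is assumed to be tractable.
        log_w = [log_obs_density(y, x) for x in particles]
        max_lw = max(log_w)
        weights = [math.exp(lw - max_lw) for lw in log_w]
        log_lik += max_lw + math.log(sum(weights) / n_particles)
        history.append(log_lik)
        # Resample (multinomial) so that particles follow the filtering law.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return history


# Toy model (illustration only): x_t = 0.9 x_{t-1} + N(0, 1), y_t = x_t + N(0, 1).
def init(rng):
    return rng.gauss(0.0, 1.0)


def transition(x, rng):
    return 0.9 * x + rng.gauss(0.0, 1.0)


def log_obs_density(y, x):
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - x) ** 2


rng = random.Random(1)
state, observations = init(rng), []
for _ in range(50):
    state = transition(state, rng)
    observations.append(state + rng.gauss(0.0, 1.0))

history = bootstrap_filter(observations, init, transition, log_obs_density,
                           n_particles=1000, rng=rng)
print(history[-1])
```

Each new observation costs one pass over the particles, whatever the number of past observations. The difficulty discussed above appears once the parameters (here the coefficient 0.9 and the noise levels) are unknown and have to be learned at the same time.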

I also want to comment on model uncertainty: in many scenarios, we have a few possible models to deal with a given time series. People often quote George E. P. Box: “all models are wrong, but some are useful”. Wrong here means that the observations actually are not realizations of one of the models, and this is undoubtedly correct. A common belief is that statistical inference naively assumes that one of the models is true; this is far from correct. A lot of articles have investigated the “misspecified setting” in detail, and many statistical procedures (including MLE, Bayesian inference and model comparison techniques) provide justifiable answers without assuming that the model is true. In the article, I discuss the Bayes factor between two models. It has a perfectly reasonable justification as a prior predictive criterion: in other words, it compares models on the grounds of how likely the observations are under the prior distributions. Thus one does not need to assume anything about the data-generating process in order to use Bayes factors.

Another interesting aspect of the Bayes factor, by the way, is Occam’s razor principle. The simpler models are favoured over more complex models until enough data have been gathered to say otherwise. The same principle motivates the AIC and BIC criteria, which can be seen as asymptotic approximations of Bayes factors for particular choices of priors. In the figure, we can see that a “wrong model” is better than the true data-generating model until about 50 to 100 observations are assimilated (in the prior predictive sense of the Bayes factor). The figure shows the estimated Bayes factors for five independent runs of one of the numerical methods reviewed in the article (SMC^2): we see that the runs diverge because the errors accumulate over time, hence the method is not “online”.
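To make the prior predictive reading and the Occam’s razor effect concrete, here is a toy Python example that has nothing to do with hidden Markov models: a sequence of coin flips, with a simple model (fair coin) against a more flexible one (unknown bias with a uniform prior). Both prior predictive densities are available in closed form; the choice of models and the function names are mine.

```python
from math import comb, exp, log


def log_evidence_fair(n, k):
    """Log prior predictive of one given sequence of n flips under p = 1/2."""
    return n * log(0.5)


def log_evidence_uniform(n, k):
    """Log prior predictive of the same sequence (k heads) when p ~ Uniform(0, 1).

    The integral of p^k (1 - p)^(n - k) over [0, 1] equals 1 / ((n + 1) * C(n, k)).
    """
    return -log((n + 1) * comb(n, k))


def log_bayes_factor(n, k):
    """Log Bayes factor of the fair-coin model against the uniform-prior model."""
    return log_evidence_fair(n, k) - log_evidence_uniform(n, k)


for n, k in [(10, 5), (10, 9), (100, 50), (100, 90)]:
    print(n, k, exp(log_bayes_factor(n, k)))
```

When the flips look balanced, the simpler model is preferred even though the flexible one contains it as a special case. The preference is reversed only once the data are clearly unbalanced, and nothing in the computation assumes that either model generated the data.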

]]>

- W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications.
*Biometrika*, 57(1):97–109, 1970.

In this paper, we shall consider Markov chain methods of sampling that are generalizations of a method proposed by Metropolis et al. (1953), which has been used extensively for numerical problems in statistical mechanics.

- Dennis V. Lindley and Adrian F.M. Smith. Bayes estimates for the linear model.
*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, with discussion, 1–41, 1972.

From Prof. B. de Finetti’s discussion (note the *valiant* collaborator Smith!):

I think that the main point to stress about this interesting and important paper is its significance for the philosophical questions underlying the acceptance of the Bayesian standpoint as the true foundation for inductive reasoning, and in particular for statistical inference. So far as I can remember, the present paper is the first to emphasize the role of the Bayesian standpoint as a logical framework for the analysis of intricate statistical situation. […] I would like to express my warmest congratulations to my friend Lindley and his valiant collaborator Smith.

- Persi Diaconis and Donald Ylvisaker. Quantifying prior opinion.
*Bayesian Statistics*, 2:133–156, 1985.

About the binomial model of spinning a coin:

Let us distinguish three categories of Bayesians (certainly a crude distinction in light of Good’s (1971) 46,656 lower bound on the possible types of Bayesians).

- Classical Bayesians. (Like Bayes, Laplace and Gauss) took . A so called flat prior.
- Modern Parametric Bayesians. (Raiffa, Lindley, Mosteller) took as a beta density […]
- Subjective Bayesians. (Ramsey, de Finetti, Savage) take the prior as a quantification of what is known about the coin and spinning process.

- Eric L. Lehmann. Model specification: the views of Fisher and Neyman, and later developments.
*Statistical Science*, 5(2):160–168, 1990.

Where do probability models come from? To judge by the resounding silence over this question on the part of most statisticians, it seems highly embarrassing. In general, the theoretician is happy to accept that his abstract probability triple was found under a gooseberry bush, while the applied statistician’s model “just growed”. (quoting A. P. Dawid, 1982)

- William H. Jefferys and James O. Berger. Ockham’s razor and Bayesian analysis.
*American Scientist*, 64–72, 1992.

William of Ockham, the 14th-century English philosopher, stated the principle thus: “Pluralitas non est ponenda sine necessitate”, which can be translated as: “Plurality must not be posited without necessity.” […] Ironically, whereas Bayesian methods have been criticized for introducing subjectivity into statistical analysis, the Bayesian approach can turn Ockham’s razor into a *less* subjective and even “automatic” rule of inference.

- Eugene Seneta. Lewis Carroll’s “pillow problems”: on the 1993 centenary.
*Statistical Science*, 180–186, 1993.

All 72 [*Pillow Problems*] are claimed to have been formulated and worked out at night while in bed, mentally, and the answer written down afterward. [C. L. Dodgson’s, a.k.a. Lewis Carroll’s] work reflects the nature, standing and understanding of probability within the wider English mathematical community of the time.

- James O. Berger. Bayesian analysis: A look at today and thoughts of tomorrow.
*Journal of the American Statistical Association*, 95(452):1269–1276, 2000.

Life was simple when I became a Bayesian in the 1970s; it was possible to track virtually all Bayesian activity. Preparing this paper on Bayesian statistics was humbling, as I realized that I have lately been aware of only about 10% of the ongoing activity in Bayesian analysis.

Below, the seven presentations.

]]>