[This is a guest post by my friend and colleague Bernardo Nipoti from Collegio Carlo Alberto.]
The group stage of the UEFA Champions League has just finished, and next Monday, 14 December 2015, in Nyon, a draw will decide the eight matches that make up the first round of the knockout phase.
As explained on the UEFA website, rules are simple:
- two seeding pots have been formed: one consisting of group winners and the other of runners-up;
- no team can play a club from their group or any side from their own association;
- due to a decision by the UEFA Executive Committee, teams from Russia and Ukraine cannot meet.
The two pots are:
Group winners: Real Madrid (ESP), Wolfsburg (GER), Atlético Madrid (ESP), Manchester City (ENG), Barcelona (ESP, holders), Bayern München (GER), Chelsea (ENG), Zenit (RUS);
Group runners-up: Paris Saint-Germain (FRA), PSV Eindhoven (NED), Benfica (POR), Juventus (ITA), Roma (ITA), Arsenal (ENG), Dynamo Kyiv (UKR), Gent (BEL).
Given these few constraints, are some matches more likely to be drawn than others? For example, Barcelona supporters might wonder whether the seven possible teams (PSG, PSV, Benfica, Juventus, Arsenal, Dynamo Kyiv and Gent) are all equally likely to be their team’s next opponent.
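To get a feel for the question, here is a minimal Python sketch that enumerates every complete draw compatible with the three rules and counts how often each pairing appears. It rests on two assumptions that are not stated above: the two pots are listed in group order (so the k-th winner and the k-th runner-up come from the same group, labelled A to H here), and all valid complete draws are treated as equally likely. The actual sequential drawing procedure need not produce exactly these probabilities, so the numbers are only indicative.

```python
from fractions import Fraction
from itertools import permutations

# (team, association, group). The group letters are an assumption:
# both pots are taken to be listed in group order.
winners = [
    ("Real Madrid", "ESP", "A"), ("Wolfsburg", "GER", "B"),
    ("Atlético Madrid", "ESP", "C"), ("Manchester City", "ENG", "D"),
    ("Barcelona", "ESP", "E"), ("Bayern München", "GER", "F"),
    ("Chelsea", "ENG", "G"), ("Zenit", "RUS", "H"),
]
runners_up = [
    ("Paris Saint-Germain", "FRA", "A"), ("PSV Eindhoven", "NED", "B"),
    ("Benfica", "POR", "C"), ("Juventus", "ITA", "D"),
    ("Roma", "ITA", "E"), ("Arsenal", "ENG", "F"),
    ("Dynamo Kyiv", "UKR", "G"), ("Gent", "BEL", "H"),
]


def allowed(winner, runner_up):
    """Check the three draw rules for a single pairing."""
    _, w_assoc, w_group = winner
    _, r_assoc, r_group = runner_up
    if w_group == r_group:                      # same group
        return False
    if w_assoc == r_assoc:                      # same association
        return False
    if {w_assoc, r_assoc} == {"RUS", "UKR"}:    # Russia vs Ukraine
        return False
    return True


def match_probabilities(winners, runners_up):
    """Probability of each pairing, assuming valid complete draws are uniform."""
    counts, total = {}, 0
    for perm in permutations(runners_up):
        pairs = list(zip(winners, perm))
        if all(allowed(w, r) for w, r in pairs):
            total += 1
            for w, r in pairs:
                counts[(w[0], r[0])] = counts.get((w[0], r[0]), 0) + 1
    return {pair: Fraction(c, total) for pair, c in counts.items()}, total


probs, n_valid = match_probabilities(winners, runners_up)
print("number of valid complete draws:", n_valid)
for (w, r), p in sorted(probs.items()):
    if w == "Barcelona":
        print(f"{w} - {r}: {float(p):.3f}")
```

Brute force is fine here, since there are only 8! = 40320 permutations to check.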
I was recently invited to referee a paper for a journal I had never heard of before: the International Journal of Biological Instrumentation, published by VIBGYOR Online Publishers. This publisher happens to be on Jeffrey Beall’s blacklist of predatory publishers, which inventories:
Potential, possible, or probable predatory scholarly open-access publishers.
I have kindly declined the invitation. Thanks Igor for the link.
Some time ago, Cédric Villani came to Turin to deliver two talks: one intended for youngsters (high school level, say), and another for a wider audience, as a recipient of the Peano Prize. He commented live, in Italian per favore:
“Grazie mille! Un grande piacere e un grande onore per me!”
I attended both. I attended the first because I act as a research advisor for Math en Jeans groups. Villani spoke about his book, Birth of a Theorem, or Théorème Vivant. He also shared a list of se7en thoughts/tips about doing research, with illustrations. I find them quite inspiring; here they are.
Illustrated by the Wikipedia page for Faà di Bruno’s formula. I like this quote, since the formula enters moment computations for objects I use every day. And also because Faà di Bruno lived in Italian Piedmont, precisely in Turin.
“The most important and the most mysterious.”
- Favorable environment
Showing pictures of several places where he worked, including Institut Henri Poincaré. Not sure that this one is the most favorable environment for scientific productivity (as a Director I mean).
Meaning between scientists, not trade. He briefly explained polymath projects, and displayed a snapshot of Gowers’s Weblog as an illustration of the diversity of exchanges he means. I also believe that blogs are a great information medium :)
With snapshots of Musica Ricercata sheet music, and a paragraph of La disparition, a novel without the letter e by Georges Perec. Writing this makes me realize how foolish such an enterprise would look in mathematics.
- Work & Intuition
Interesting to see these two at the same level.
- Perseverance & Luck
Same comment as for point 6.
El Capitan is a very nice mountain. It’s also the latest OS X version, which messes things up with LaTeX. Be aware of this before you update. I wasn’t!
I quote from a fix explained here:
Under OS X 10.11, El Capitan, writing to “/usr” is no longer allowed, even with Administrator privileges. The usual symbolic link to the active Distribution, “/usr/texbin”, is therefore removed (if it was there from a previous OS version) and cannot be installed. Many GUI applications have the path to those binaries set to “/usr/texbin” by default and will no longer find the binaries there.
I had to reinstall MacTeX, then update my GUI application (Texmaker), and finally replace every “/usr/texbin” by “/Library/TeX/texbin”, as shown below.
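If you have many such paths to fix, the replacement can be scripted. This is only a sketch in Python: the helper name `migrate_texbin` and the sample command line are made up for illustration, and I make no claim about where any particular editor stores its settings.

```python
OLD_TEXBIN = "/usr/texbin"
NEW_TEXBIN = "/Library/TeX/texbin"


def migrate_texbin(text: str) -> str:
    """Replace every occurrence of the old TeX binary path with the new one."""
    return text.replace(OLD_TEXBIN, NEW_TEXBIN)


# Hypothetical command line, as it might appear in an editor's settings.
example = '"/usr/texbin/pdflatex" -interaction=nonstopmode %.tex'
print(migrate_texbin(example))
```

Back up any settings file before running such a replacement over its contents.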
This very fine title quotes a pretty hilarious banquet speech by David Dunson at the last BNP conference held in Raleigh last June. The graph is by François Caron who used it in his talk there. See below for his explanation.
After the summer break, back to work. The academic year to come looks promising from a BNP point of view, not least because three special issues have been announced: in Statistics & Computing (guest editors: Tamara Broderick (MIT), Katherine Heller (Duke), Peter Mueller (UT Austin)), the Electronic Journal of Statistics (guest editor: Subhashis Ghosal (NCSU)), and the International Journal of Approximate Reasoning (proposal deadline December 1st, guest editors: Alessio Benavoli (Lugano), Antonio Lijoi (Pavia) and Antonietta Mira (Lugano)).
BNP is also going to infiltrate MCMSki V, Lenzerheide, Switzerland, January 4-7 2016, with three sessions with a BNP flavor, in addition to plenary speakers David Dunson and Michael Jordan. The International Society for Bayesian Analysis World Meeting, 13-17 June 2016, should also host plenty of BNP sessions, along with a De Finetti Lecture by Persi Diaconis (Stanford University).
With colleagues Stefano Favaro and Bernardo Nipoti from Turin and Yee Whye Teh from Oxford, we have just arXived an article on discovery probabilities. If you are looking for some info on a space shuttle, a cycling team or a TV channel, it’s the wrong place. Instead, discovery probabilities are central to ecology, biology and genomics, where data can be seen as a population of individuals belonging to an (ideally) infinite number of species. Given a sample of size $n$, the $l$-discovery probability $D_n(l)$ is the probability that the next individual observed matches a species with frequency $l$ in the $n$-sample. For instance, the probability of observing a new species, $D_n(0)$, is key for devising sampling experiments.
By the way, why Alan Turing? Because, together with Irving John Good, his fellow researcher at Bletchley Park (who also features in The Imitation Game), Turing is also known for the so-called Good-Turing estimator of the discovery probability
$$\check{D}_n(l) = (l+1)\,\frac{m_{l+1,n}}{n},$$
which involves $m_{l+1,n}$, the number of species with frequency $l+1$ in the sample (i.e. the frequency of frequencies, if you follow me). As it happens, this estimator, defined in Good’s 1953 Biometrika paper, has since become wildly popular in the ecology, biology and genomics communities, at least in the small circles where wild popularity and probability aren’t mutually exclusive.
Simple explicit estimators of discovery probabilities in the Bayesian nonparametric (BNP) framework of Gibbs-type priors were given by Lijoi, Mena and Prünster in a 2007 Biometrika paper. The main difference between the two estimators of $D_n(l)$ is that Good-Turing involves $n$ and $m_{l+1,n}$ only, while the BNP estimator involves $n$, $m_{l,n}$ (instead of $m_{l+1,n}$), and $k_n$, the total number of observed species. It has been shown in the literature that the BNP estimators are more reliable than Good-Turing estimators.
How do we contribute? (i) We describe the posterior distribution of the discovery probabilities in the BNP model, which is pretty useful for deriving exact credible intervals of the estimates, and (ii) we investigate the large $n$ asymptotic behavior of the estimators.
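To make the comparison concrete, here is a small Python sketch of both kinds of estimator. The Good-Turing part follows the formula above. For the BNP part I assume one particular Gibbs-type prior, the two-parameter Poisson-Dirichlet (Pitman-Yor) process, whose predictive rule gives $(\theta + \sigma k_n)/(\theta + n)$ for a new species and $(l - \sigma)\, m_{l,n}/(\theta + n)$ for $l \geq 1$. The values of $\sigma$ and $\theta$ and the toy sample are arbitrary choices for illustration, and this is not the code used in the paper.

```python
from collections import Counter


def frequency_counts(sample):
    """Return m, where m[l] is the number of species observed exactly l times."""
    species_frequencies = Counter(sample)
    return Counter(species_frequencies.values())


def good_turing(sample, l):
    """Good-Turing estimate of the l-discovery probability: (l+1) * m_{l+1,n} / n."""
    n = len(sample)
    m = frequency_counts(sample)
    return (l + 1) * m[l + 1] / n


def pitman_yor(sample, l, sigma=0.5, theta=1.0):
    """BNP estimate under a Pitman-Yor prior (an assumed special case)."""
    n = len(sample)
    m = frequency_counts(sample)
    k = len(set(sample))  # k_n, the number of distinct species observed
    if l == 0:
        return (theta + sigma * k) / (theta + n)
    return (l - sigma) * m[l] / (theta + n)


sample = list("aabcccd")  # toy sample: 7 individuals, 4 species
for l in range(4):
    print(l, round(good_turing(sample, l), 3), round(pitman_yor(sample, l), 3))
```

Note how the Good-Turing estimate for frequency $l$ looks at $m_{l+1,n}$, so it is zero as soon as no species has frequency $l+1$, whereas the Pitman-Yor estimate uses $m_{l,n}$ and, for new species, the number $k_n$ of distinct species.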
The students did a great job in presenting some Bayesian classics. I enjoyed reading the papers (pdfs can be found here), most of which I hadn’t read before, and enjoyed also the students’ talks. I share here some of the best ones, as well as some demonstrative excerpts from the papers. In chronological order (presentations on slideshare below):
- W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
In this paper, we shall consider Markov chain methods of sampling that are generalizations of a method proposed by Metropolis et al. (1953), which has been used extensively for numerical problems in statistical mechanics.
- Dennis V. Lindley and Adrian F.M. Smith. Bayes estimates for the linear model. Journal of the Royal Statistical Society: Series B (Methodological), with discussion, 34(1):1–41, 1972.
From Prof. B. de Finetti’s discussion (note the valiant collaborator Smith!):
I think that the main point to stress about this interesting and important paper is its significance for the philosophical questions underlying the acceptance of the Bayesian standpoint as the true foundation for inductive reasoning, and in particular for statistical inference. So far as I can remember, the present paper is the first to emphasize the role of the Bayesian standpoint as a logical framework for the analysis of intricate statistical situation. […] I would like to express my warmest congratulations to my friend Lindley and his valiant collaborator Smith.
Xian blogged recently on the upcoming RSS read paper: Statistical Modelling of Citation Exchange Between Statistics Journals, by Cristiano Varin, Manuela Cattelan and David Firth. Following the last JRSS B read paper by one of us! The data used in the paper (which can be downloaded here) are quite fascinating for us, academics fascinated by academic rankings, for better or for worse (ironic here). They consist of cross-citation counts for 47 statistics journals (see list and abbreviations page 5): $c_{ij}$ is the number of citations from articles published in journal $j$ in 2010 to papers published in journal $i$ in the 2001-2010 decade. The choice of the list of journals is discussed in the paper. Major journals missing include Bayesian Analysis (published from 2006) and The Annals of Applied Statistics (published from 2007).
I looked at the ratio of Total Citations Received to Total Citations Made. This is a super simple descriptive statistic which happens to look rather similar to Figure 4, which plots Export Scores from the Stigler model (I can’t say more about it, as I haven’t read it in detail). The top five is the same, modulo the swap between Annals of Statistics and Biometrika. Of course, a big difference is that the Cited/Citing ratio isn’t endowed with a measure of uncertainty (below, left is my making, right is Fig. 4 in the paper).
I was surprised not to see a graph / network representation of the data in the paper. As it happens, I wanted to try the Gephi software for drawing graphs, used for instance by François Caron and Emily Fox in their sparse graphs paper. I got the above graph, where:
- for the data, I used the citation matrix $c_{ij}$ renormalized by the total number of citations made, which I denote by $\tilde{c}_{ij}$. This is a way to account for the size (number of papers published) of the journal. This is just a proxy though, since the actual number of papers published by the journal is not available in the data. Without that correction, CSDA is way ahead of all the others.
- the node size represents the Cited/Citing ratio
- the edge width represents the renormalized $\tilde{c}_{ij}$. I’m unsure of what Gephi does here, since it converts my directed graph into an undirected graph. I suppose that it displays only the larger of the two edges $\tilde{c}_{ij}$ and $\tilde{c}_{ji}$.
- for better visibility, I kept only the top decile of edges by weight.
- the clusters identified by four colors are modularity classes obtained by the Louvain method.
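The preprocessing behind the graph can be sketched in a few lines of Python. Everything here is illustrative: the 3×3 matrix is made up (it is not the data from the paper), I assume rows index the cited journal and columns the citing journal, and I assume "renormalized by the total number of citations made" means dividing each column by its sum. The community detection step (Louvain) is left to Gephi.

```python
from math import ceil

# Toy cross-citation matrix: c[i][j] = citations from journal j to journal i.
# These numbers are invented for illustration only.
journals = ["J1", "J2", "J3"]
c = [
    [10, 4, 1],
    [2, 20, 3],
    [0, 6, 5],
]
size = len(c)

# Citations made by journal j (column sums) and received by journal i (row sums).
made = [sum(c[i][j] for i in range(size)) for j in range(size)]
received = [sum(c[i]) for i in range(size)]

# Node size: Cited/Citing ratio.
ratio = [received[i] / made[i] for i in range(size)]

# Edge weight: citations renormalized by the citing journal's total.
normalized = [[c[i][j] / made[j] for j in range(size)] for i in range(size)]


def top_edges(weights, fraction=0.1):
    """Keep the heaviest `fraction` of the nonzero off-diagonal directed edges."""
    n = len(weights)
    edges = [
        (i, j, weights[i][j])
        for i in range(n) for j in range(n)
        if i != j and weights[i][j] > 0
    ]
    edges.sort(key=lambda edge: edge[2], reverse=True)
    return edges[:ceil(fraction * len(edges))]


for i, j, w in top_edges(normalized):
    print(f"{journals[j]} cites {journals[i]}: {w:.3f}")
```

Whether self-citations should count in the row and column totals is a modelling choice; here they are included.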
The two software journals included in the dataset are clear outliers:
- the Journal of Statistical Software (JSS) is disconnected from the others, meaning it has no normalized citations in the top decile, except for its self-citations, which are quite large and give it the 4th highest Impact Factor of the whole list in 2010 (and apparently the first in 2015).
- the largest $\tilde{c}_{ij}$ is the self-citations of the STATA Journal (StataJ).
- CSDA is the most central journal in the sense of the highest (unweighted) degree.
Some further thoughts
All that is just for the fun of it. As mentioned by the authors, citation counts are heavy-tailed: a few papers account for much of a journal’s citations, while most papers account for few. As a matter of fact, the total number of citations received is mostly driven by a few super-cited papers, and so is the Cited/Citations matrix that I use throughout for building the graph. One reason JRSS B does so well might be the read papers: for instance, Spiegelhalter et al. (2002), DIC, alone received 11.9% of all JRSS B citations in 2010. Who’d bet on the number of citations this new read paper (JRSS A though) will receive?
This week I’ll start my Bayesian Statistics master’s course at the Collegio Carlo Alberto. I realized that some of last year’s students got PhD positions in prestigious US universities. So I thought that letting this year’s students have a first grasp of some great Bayesian papers wouldn’t do any harm. The idea is that, in addition to the course, the students will pick a paper from a list and present it (or rather part of it) to the others and to me, which will let them earn some extra points for the final exam mark. It’s in the spirit of Xian’s Reading Classics Seminar (his list here).
I’ve made up the list below, inspired by the reference lists of two textbooks and biased by personal taste: Xian’s Bayesian Choice and Peter Hoff’s First Course in Bayesian Statistical Methods. See the pdf list and zipped folder for papers. Comments on the list are very welcome!
PS: reference n°1 isn’t a joke!
I presented a paper from my postdoc, now on arXiv, at the very successful Young Bayesian Conference in Vienna. The big picture of the talk is simple: there are situations in Bayesian nonparametrics where you don’t know how to sample from the posterior distribution and can only compute posterior expectations (so-called marginal methods). So, for example, you cannot provide credible intervals. But sometimes all the moments of the posterior distribution are available as posterior expectations. So morally, you should be able to say more about the posterior distribution than just reporting the posterior mean. To be more specific, we consider a hazard ($h$) mixture model
$$h(t) = \int k(t; y)\, \mu(dy),$$
where $k$ is a kernel, and the mixing distribution $\mu$ is random and discrete (Bayesian nonparametric approach).
We consider the survival function $S$, which is recovered from the hazard rate $h$ by the transform
$$S(t) = \exp\left(-\int_0^t h(s)\, ds\right),$$
and some possibly censored survival data having survival $S$. Then it turns out that all the posterior moments of the survival curve $S(t)$ evaluated at any time $t$ can be computed.
The nice trick of the paper is to use the representation of a distribution in a [Jacobi polynomial] basis where the coefficients are linear combinations of the moments. So one can sample from [an approximation of] the posterior, and with a posterior sample we can do everything! Including credible intervals.
I’ve wrapped up the few lines of code in an R package called momentify (not on CRAN). With a sequence of moments of a random variable supported on [0,1] as an input, the package does two things:
- evaluates the approximate density
- samples from it
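To give an idea of how such a moment-based approximation can work, here is a self-contained Python sketch. It is not the code of the package: it uses the simplest member of the Jacobi family, the shifted Legendre polynomials on [0,1], for which the expansion coefficients are (2k+1) times a linear combination of the first k moments. The function names and the grid-based sampler (which clips negative values of the approximate density to zero) are my own choices for illustration.

```python
import random
from math import comb


def shifted_legendre(k):
    """Monomial coefficients of the degree-k shifted Legendre polynomial on [0, 1]."""
    return [(-1) ** (k + j) * comb(k, j) * comb(k + j, j) for j in range(k + 1)]


def density_from_moments(moments):
    """Approximate density from moments[j] = E[X^j], j = 0..N (moments[0] == 1)."""
    polys = [shifted_legendre(k) for k in range(len(moments))]
    # Coefficient of the k-th polynomial: (2k + 1) * E[P_k(X)], linear in the moments.
    weights = [
        (2 * k + 1) * sum(a * m for a, m in zip(poly, moments))
        for k, poly in enumerate(polys)
    ]

    def density(x):
        return sum(
            w * sum(a * x ** j for j, a in enumerate(poly))
            for w, poly in zip(weights, polys)
        )

    return density


def sample_from_density(density, size, grid_size=1000, seed=0):
    """Draw approximate samples by weighting a fine grid with the clipped density."""
    rng = random.Random(seed)
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    mass = [max(density(x), 0.0) for x in grid]
    return rng.choices(grid, weights=mass, k=size)


# Example: Beta(2, 1) has density 2x and moments E[X^j] = 2 / (j + 2).
moments = [2 / (j + 2) for j in range(4)]
approx = density_from_moments(moments)
draws = sample_from_density(approx, size=2000)
print(approx(0.25), sum(draws) / len(draws))
```

With few moments the approximation can dip below zero for less regular densities, which is why the sampler clips it; using more moments refines the fit but makes the alternating-sign sums more sensitive to rounding.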
A package example for a mixture of betas, using 2 to 7 moments, gives the following result: