# Statisfaction

## Turing revisited in Turin, and Oxford

Posted in General by Julyan Arbel on 18 June 2015

Are you paying attention? Good. If you are not listening carefully, you will miss things. Important things.

With colleagues Stefano Favaro and Bernardo Nipoti from Turin and Yee Whye Teh from Oxford, we have just arXived an article on discovery probabilities. If you are looking for some info on a space shuttle, a cycling team or a TV channel, you are in the wrong place. Instead, discovery probabilities are central to ecology, biology and genomics, where the data can be seen as a population of individuals belonging to an (ideally) infinite number of species. Given a sample of size $n$, the $l$-discovery probability $D_{n}(l)$ is the probability that the next individual observed belongs to a species with frequency $l$ in the $n$-sample. For instance, the probability of observing a new species, $D_{n}(0)$, is key for devising sampling experiments.
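To fix the notation in code (this little sketch is mine, not from the paper), here is a minimal Python helper that extracts from a sample of species labels the quantities used throughout: the sample size $n$, the number of distinct species $k_n$, and the counts $m_{l,n}$ of species with frequency $l$.

```python
from collections import Counter

def sample_summaries(sample):
    """Summarize a sample of species labels.

    Returns (n, k_n, m) where n is the sample size, k_n the number
    of distinct observed species, and m a dict mapping each
    frequency l to m_{l,n}, the number of species appearing
    exactly l times in the sample.
    """
    freqs = Counter(sample)         # species -> frequency
    m = Counter(freqs.values())     # frequency l -> m_{l,n}
    return len(sample), len(freqs), dict(m)

# Toy sample: three species with frequencies 3, 2 and 1
n, k_n, m = sample_summaries(["a", "a", "a", "b", "b", "c"])
# -> n = 6, k_n = 3, m = {3: 1, 2: 1, 1: 1}
```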

By the way, why Alan Turing? Because, together with Irving John Good, his fellow researcher at Bletchley Park (who also featured in The Imitation Game), Turing is also known for the so-called Good-Turing estimator of the discovery probability

$(l+1)\frac{m_{l+1,n}}{n}$

which involves $m_{l+1,n}$, the number of species with frequency $l+1$ in the sample (i.e. the frequency of frequencies, if you follow me). As it happens, this estimator, defined in Good's 1953 Biometrika paper, has been wildly popular among the ecology-biology-genomics communities ever since, at least in the small circles where wild popularity and probability aren't mutually exclusive.
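In code, the Good-Turing estimator above is essentially a one-liner once the frequency counts are in hand; here is a hedged Python sketch (the function name and the toy data are mine).

```python
from collections import Counter

def good_turing(sample, l):
    """Good-Turing estimate of the l-discovery probability:
    (l + 1) * m_{l+1, n} / n, where m_{l+1, n} is the number
    of species observed exactly l + 1 times in the sample."""
    n = len(sample)
    m = Counter(Counter(sample).values())   # frequency of frequencies
    return (l + 1) * m[l + 1] / n           # Counter gives 0 for missing keys

sample = ["a", "a", "a", "b", "b", "c"]
# New-species probability: 1 * m_{1,6} / 6 = 1/6
print(good_turing(sample, 0))
```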

Simple explicit estimators $\widehat{\mathcal{D}}_{n}(l)$ of discovery probabilities in the Bayesian nonparametric (BNP) framework of Gibbs-type priors were given by Lijoi, Mena and Prünster in a 2007 Biometrika paper. The main difference between the two estimators of $D_{n}(l)$ is that the Good-Turing estimator involves $n$ and $m_{l+1,n}$ only, while the BNP estimator involves $n$, $m_{l,n}$ (instead of $m_{l+1,n}$), and $k_n$, the total number of observed species. It has been shown in the literature that the BNP estimators are more reliable than the Good-Turing estimators.
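For concreteness, here is a sketch of a BNP estimator in the special case of the two-parameter Poisson-Dirichlet (Pitman-Yor) prior with parameters $(\sigma, \theta)$, for which the predictive probabilities are explicit. The formulas below are my addition, assuming the standard Pitman-Yor predictive rule; they are not taken verbatim from the papers cited above, and the general Gibbs-type estimators are more involved. Note that, as stated in the text, the estimate depends on $n$, $k_n$ and $m_{l,n}$, not on $m_{l+1,n}$.

```python
from collections import Counter

def bnp_discovery_py(sample, l, sigma=0.5, theta=1.0):
    """BNP estimate of the l-discovery probability under a
    two-parameter Poisson-Dirichlet (Pitman-Yor) prior, using the
    standard predictive rule (an assumption of this sketch):
      l = 0:   (theta + sigma * k_n) / (theta + n)
      l >= 1:  (l - sigma) * m_{l,n} / (theta + n)
    """
    n = len(sample)
    freqs = Counter(sample)              # species -> frequency
    k_n = len(freqs)                     # number of distinct species
    m = Counter(freqs.values())          # frequency l -> m_{l,n}
    if l == 0:
        return (theta + sigma * k_n) / (theta + n)
    return (l - sigma) * m[l] / (theta + n)

sample = ["a", "a", "a", "b", "b", "c"]
# With sigma = 0.5, theta = 1: (1 + 0.5 * 3) / (1 + 6) = 2.5 / 7
print(bnp_discovery_py(sample, 0))
```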

How do we contribute? (i) We describe the posterior distribution of the discovery probabilities in the BNP model, which is pretty useful for deriving exact credible intervals for the estimates, and (ii) we investigate the large-$n$ asymptotic behavior of the estimators.

We are not aware of any non-asymptotic method for deriving credible intervals for the BNP estimators. We fill this gap by describing the posterior distribution of $D_{n}(l)$. More specifically, we derive all the posterior moments of $D_{n}(l)$. Since this random variable has compact support, $[0,1]$, it is characterized by its moments, so one can use a moment-based technique for sampling posterior draws, see e.g. our momentify R package written for another article. We also show that the posterior distribution is explicit in two special cases of Gibbs-type priors known as the two-parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior. The posterior distribution is in fact shamelessly simple (once you know it) since it essentially amounts to [[a random] fraction of] a Beta distribution [with random coefficients].

As for the large-$n$ asymptotic behavior of the estimators, we prove the following asymptotic equivalences, denoted by $\simeq$:

$\widehat{\mathcal{D}}_{n}(0) \simeq\frac{\sigma k_n}{n}$  and  $\widehat{\mathcal{D}}_{n}(l)\simeq (l-\sigma)\frac{m_{l,n}}{n}$ for $l\geq 1$,

where $\sigma\in(0,1)$ is a parameter of the Gibbs-type prior. These equivalences can serve as approximations. In the cases of the two-parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior, we also provide a second-order term in the asymptotic expansion of the estimators

$\widehat{\mathcal{D}}_{n}(0)=\frac{\sigma k_n}{n}+\frac{S_n}{n}+o\big(\frac{1}{n}\big)$  and  $\widehat{\mathcal{D}}_{n}(l)=(l-\sigma)\frac{m_{l,n}}{n}\big(1-\frac{S_n}{n}+o\big(\frac{1}{n}\big)\big)$ for $l\geq 1$,

where the second-order term $S_n$ is either a constant or a quantity which converges almost surely to a random variable. In both cases, we show that it involves the second (and last) parameter of the priors, whereas the first-order asymptotic equivalence above involves only $\sigma$. Whether similar asymptotic expansions hold for the whole Gibbs-type class remains an open problem!
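To get a feel for the first-order approximation $\widehat{\mathcal{D}}_{n}(0) \simeq \sigma k_n / n$, here is a hedged numerical check, again using the two-parameter Poisson-Dirichlet (Pitman-Yor) case as a stand-in, with samples simulated via its sequential (Chinese-restaurant-style) predictive rule; the simulator and parameter values are my own illustration, not code from the paper. The gap between the exact estimate $(\theta + \sigma k_n)/(\theta + n)$ and the approximation $\sigma k_n / n$ should shrink as $n$ grows.

```python
import random

def pitman_yor_counts(n, sigma=0.5, theta=1.0, seed=0):
    """Simulate species frequencies from a two-parameter
    Poisson-Dirichlet prior via its sequential predictive rule
    (a standard construction, used here purely for illustration)."""
    rng = random.Random(seed)
    counts = []                                  # counts[j] = frequency of species j
    for i in range(n):
        # New species with probability (theta + sigma * k) / (theta + i)
        if rng.random() * (theta + i) < theta + sigma * len(counts):
            counts.append(1)
        else:
            # Join species j with probability (c_j - sigma) / (theta + i)
            r = rng.random() * (i - sigma * len(counts))
            acc = 0.0
            for j, c in enumerate(counts):
                acc += c - sigma
                if r < acc:
                    counts[j] += 1
                    break
            else:
                counts[-1] += 1                  # floating-point safety net
    return counts

sigma, theta = 0.5, 1.0
for n in (100, 10_000):
    k = len(pitman_yor_counts(n, sigma, theta))
    exact = (theta + sigma * k) / (theta + n)    # PY new-species estimate
    approx = sigma * k / n                       # first-order approximation
    print(n, exact, approx)
```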

If you have read this far, you may also be interested in listening to Stefano Favaro talk about it at the 10th Conference on Bayesian Nonparametrics next week in Raleigh, NC 🙂

Cheers,
Julyan