Intractable likelihoods, unbiased estimators and sign problem

Posted in Statistics by Pierre Jacob on 1 July 2013
Computing a Taylor expansion with random truncation can be done Swiftly.

Computing a Taylor expansion with random truncation can be done Swiftly.

Hey all,

We’re at the Big Data era blablabla, but the advanced computational methods usually don’t scale well enough to match the increasing sizes of datasets. For instance, even in a simple case of i.i.d. data y_1, y_2, \ldots y_n and an associated likelihood function \mathcal{L}(\theta; y_1, y_2, \ldots, y_n), the cost of evaluating the likelihood function at any parameter \theta is typically growing at least linearly with n. If you then plug that likelihood into an optimization technique to find the Maximum Likelihood Estimate, or into a sampling technique such as Metropolis-Hastings to sample from the posterior distribution, the computational cost grows accordingly for a fixed number of iterations. However you can get unbiased estimates of the log-likelihood by drawing m < n points i_1, \ldots, i_m uniformly in the index set \{1, \ldots, n\} and by computing (n/m) \log \mathcal{L}(\theta; y_{i_1}, \ldots, y_{i_m}). This way you sub-sample from the whole dataset, and you can choose m according to your computational budget. However is it possible to perform inference with these estimates instead of the complete log-likelihood?

An approach to this problem is provided in a particularly remarkable paper that studies a Metropolis-Hastings algorithm using unbiased estimates of the target density instead of the target density itself, and shows that the result is *correct* in the sense that you get a Markov chain converging towards the exact target distribution. The problem is that it is not so trivial to get unbiased likelihood estimates. In the i.i.d case mentioned above, you get unbiased estimates of the log-likelihood, but of course the unbiased property is lost if you simply take the exponential of your estimates. Abstracting from the statistical context, suppose you have a random variable X with expectation \mu. Then \exp(X) doesn’t have expectation \exp(\mu), but is it still possible to construct an unbiased estimate of \exp(\mu) by doing something else? More generally, for any function f and not only for the exponential?

This very relevant article appeared recently on arXiv: Playing Russian Roulette with Intractable Likelihoods by Mark Girolami, Anne-Marie Lyne, Heiko Strathmann, Daniel Simpson, Yves Atchade, see also Xian’s comments on it. The motivation of the paper is slightly different than the one described above but the techniques are similar. Let me summarize the main ideas used in that paper. First develop the function f into a Taylor series:

f(\mu) = \sum_{i=0}^\infty a_i \mu^i

and take an infinite triangular array (X_{i,j})_{1\leq i \leq \infty, 1\leq j \leq i} of independent copies of X. Then the random variable

a_0 + \sum_{i=1}^\infty a_i \left(\prod_{j=1}^i X_{i,j}\right)

has expectation f(\mu). Of course it is not practical because the series is infinite, so the second idea is about making the series finite. As my new colleague Alex Thiery remarked, if you want to compute an infinite series

\sum_{i=0}^\infty u_i

you can always draw an integer-valued random variable N and compute

\sum_{i=0}^N \frac{u_i}{\mathbb{P}(N \geq i)}


\mathbb{E}\left[ \sum_{i=0}^N \frac{u_i}{\mathbb{P}(N \geq i)}\right] = \mathbb{E}\left[ \sum_{i=0}^\infty \frac{u_i}{\mathbb{P}(N \geq i)} 1_{N \geq i} \right] = \sum_{i=0}^\infty \frac{u_i}{\mathbb{P}(N \geq i)}\mathbb{E} [1_{N \geq i}] =\sum_{i=0}^\infty u_i

Combining the Taylor expansion and the random truncation, you can get unbiased estimate of f(\mu) for any function f, given the ability to sample independent copies of a random variable X with expectation \mu. The mentioned paper deals with more specific functions f and different truncation schemes, but essentially it’s the same deal. So is that it? Problem solved?

One big remaining issue with this scheme is that you do not control the sign of the resulting estimates; so that if you are estimating a positive quantity such as a likelihood in the i.i.d. case mentioned above, applying the expansion/truncation scheme does not guarantee that your estimates are going to take almost-surely non-negative values. In the Russian Roulette paper it is not a big issue but it could potentially be one. So let me conclude with this problem which is still open to the best of my knowledge.

Suppose you have a real-valued random variable X with expectation \mu and a function f:\mathbb{R}\to (0,\infty). You know how to sample any finite number of independent copies of X. Can you construct an almost-surely non-negative random variable Y such that \mathbb{E}[Y] = f(\mu)?

Spoiler alert: the answer is no in general… more about this soon!


15 Responses

Subscribe to comments with RSS.

  1. Umberto Picchini said, on 1 July 2013 at 09:22

    Thanks for the great post. Papers closely related to the first paragraph of your post are and
    Christan Robert also commented on the first one:

    • Pierre Jacob said, on 1 July 2013 at 10:23

      Ah thanks a lot, somehow I had missed the austerity papers!

  2. xi'an said, on 1 July 2013 at 16:28

    Euh…, why the picture?!

    • Dan Simpson said, on 1 July 2013 at 17:51

      It’s Taylor Swift!

      • xi'an said, on 1 July 2013 at 17:54


        • Dan Simpson said, on 1 July 2013 at 18:27

          She’s a very young country singer from America. Among other things, she has a song called Red which would be genuinely useful if you ever needed to teach the concept of “similes” to someone. As a song, it kinda skates by on charm.

  3. Mark Girolami said, on 1 July 2013 at 18:19

    I fear xi’an this may be a generational issue – I have no idea who Taylor Swift is – nor have I the interest to look her up on Google…. a picture of the be-wigged Colin Maclaurin would have been so much better!! – I really like Pierre’s suggestion about using the ideas of unbiased estimators and Russian Roulette from our paper to perform exact Bayesian / Likelihood based inference in the so-called “BIG DATA” regime – the good thing is that exactness would be preserved which I don’t think is the case in the austerity paper…. would be interesting to see how it actually works in practice — might just take a look.

    • Dan Simpson said, on 1 July 2013 at 18:31

      Hmmm… but could you bound the cost? There’d also be a problem with variance explosion if you tried to keep it too low, i would suspect. But it would maybe be very useful to “fix” the “bias” in aggressive lasso pre-selection schemes. It turns out you can work out a priori which covariates will be zero, but the “safe” scheme is too expensive, so people use “lossy” ones. Maybe that could be the “big data” version of “don’t truncate until the kth term”…..

      • Alex Thiery said, on 2 July 2013 at 03:58

        Hi Dan, do you mind to expand on your comment:
        ‘maybe be very useful to “fix” the “bias” in aggressive lasso pre-selection schemes”
        I am intrigued…

        • Dan Simpson said, on 2 July 2013 at 11:44

          It’s not much of a thing, really – I was just fee-associating…

          It turns out that there are a bunch of pre-selection routines for lasso-type estimators that are necessary in the situation where you have so many covariates that you could never compute the estimate in the standard way (that’s to memory/ linear algebra issues). The first provably “correct” one was called SAFE, but some more aggressive schemes that occasionally drop variables that would otherwise be selected have been proposed (one recently in Series B by a bunch of famous people). So I was thinking in a Bayesian context that you could probably do something similar – instead of truncating a series, you would be truncating a dataset. Instead of the first terms being ranked highest, it would be those with highest pre-selection weights. I don’t think the current ranking technology is good enough for this – as said below, you’d want a lot of control over the quality (some sort of epsilon-alpha bounds would be perfect), but I suspect that there’s some nice work that could be done on this.

          (nb – bias was a silly word to use…. lasso is biased, that’s the whole point :p)

  4. Pierre Jacob said, on 2 July 2013 at 07:45

    I’d also be interested to know more, Dan!

    xi’an, I admit having trouble sometimes to find suitable pictures for the posts.

    Mark, if you try this out I would be interested to see the frequencies of negative values in the likelihood estimates as a function of the size “m” of the sub-sample.

    • Dan Simpson said, on 2 July 2013 at 11:15

      So, we didn’t talk about it in the paper, but in some cases you can really really control this. For example, in the log-determinant case, there is some old work on sharp-ish epsilon-alpha bounds ( So you could probably balance the expected number of negatives with the cost of the estimator and the variance inflation due to the extra negatives.

  5. Robin Ryder said, on 5 July 2013 at 11:31

    Best illustrative photo ever!

  6. Radu Craiu said, on 13 December 2013 at 16:15

    Very interesting post!

  7. […] version of the article, it’s time for a blog post. This post is kind of a follow-up of a previous post from July, where I was commenting on Playing Russian Roulette with Intractable Likelihoods by Mark Girolami, […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: