Derivative-free estimate of derivatives
Arnaud Doucet, Sylvain Rubenthaler and I have just posted a technical report on arXiv about estimating the first- and second-order derivatives of the log-likelihood (also called the score and, up to a sign, the observed information matrix, respectively) in general (intractable) statistical models, and in particular in (non-linear non-Gaussian) state-space models. We call them “derivative-free” estimates because they can be computed even if the user cannot compute any kind of derivatives related to the model (as opposed to e.g. this paper and this paper). Actually in some cases of interest we cannot even evaluate the log-likelihood point-wise (we do not have a formula for it), so forget about explicit derivatives. Would you like to know more?
Our tech report builds heavily upon the Iterated Filtering series of papers (see this first PNAS paper, then this technical Annals of Stats paper). It simply extends it to the second-order derivatives (actually we also propose an alternate estimate for the score). The main idea can be interpreted in terms of Bayesian asymptotics. Say you have some (univariate, for clarity) parameter $\theta$, and you want to evaluate the derivatives of the log-likelihood $\ell(\theta)$ at some point $\theta^\star$; introduce a prior distribution with mean $\theta^\star$ and variance $\sigma^2$. Consider the behaviour of the posterior distribution when 1) the dataset is fixed (hence the likelihood is also fixed) and 2) the prior distribution concentrates, that is $\sigma \to 0$. What happens to the posterior distribution?
As you can imagine it also shrinks: it looks more and more like the prior distribution when $\sigma$ decreases. Now interestingly enough, under mild regularity assumptions and a Gaussian prior distribution we have the two following inequalities:

$$\left| \frac{\mathbb{E}\left[\theta \mid Y\right] - \theta^\star}{\sigma^2} - \nabla \ell(\theta^\star) \right| \leq C \sigma^2, \qquad \left| \frac{\mathbb{V}\left[\theta \mid Y\right] - \sigma^2}{\sigma^4} - \nabla^2 \ell(\theta^\star) \right| \leq C \sigma^2,$$

for some constant $C > 0$. It means that the shift from the prior mean to the posterior mean is proportional to the first derivative of the log-likelihood (already known from the IF papers), while the shift from the prior variance to the posterior variance is proportional to its second-order derivative (new stuff). All of this up to an error term going to zero at the speed of $\sigma^2$.
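To see the two identities at work, here is a minimal sketch on a toy conjugate model, $y_i \sim \mathcal{N}(\theta, 1)$, where the posterior under a Gaussian prior is available in closed form, as are the exact score and second derivative. This is only an illustrative check of the asymptotics, not the estimator from our report; all variable names are ad hoc.

```python
import random

# Toy conjugate check of the two identities: y_i ~ N(theta, 1), prior
# N(theta_star, sigma^2). The posterior is Gaussian in closed form, so we
# can compare the rescaled shifts in mean/variance to the exact score and
# second derivative of the log-likelihood at theta_star.
random.seed(1)
n, theta_true = 20, 0.3
y = [theta_true + random.gauss(0, 1) for _ in range(n)]
S = sum(y)

theta_star = 0.0
score = S - n * theta_star   # exact first derivative of log-likelihood
hessian = -float(n)          # exact second derivative (constant here)

for sigma in [0.5, 0.1, 0.02]:
    prec = 1.0 / sigma**2 + n                     # posterior precision
    post_mean = (theta_star / sigma**2 + S) / prec
    post_var = 1.0 / prec
    score_est = (post_mean - theta_star) / sigma**2
    hess_est = (post_var - sigma**2) / sigma**4
    print(sigma, score_est, hess_est)
```

As $\sigma$ shrinks, `score_est` and `hess_est` converge to the exact values (the residual bias is of order $\sigma^2$, as announced).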
In practical terms, it means that the problem of computing the first two derivatives is turned into a problem of computing posterior expectations, which can be tackled with Monte Carlo methods for a very broad class of models.
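As a hypothetical illustration of that last point, here is a sketch where the posterior mean and variance are approximated by self-normalized importance sampling with the prior as proposal, on the same toy Gaussian model as above. In the state-space settings targeted by the report one would instead plug in particle methods (the likelihood there is intractable); everything below is an assumption-laden toy, not our actual algorithm.

```python
import math
import random

# Monte Carlo version of the idea: estimate the posterior mean and
# variance under a tight prior N(theta_star, sigma^2) by self-normalized
# importance sampling (prior as proposal), then plug them into the two
# identities. Toy model: y_i ~ N(theta, 1), so we can compare with the
# exact score (sum(y) - n * theta_star) and second derivative (-n).
random.seed(2)
n, theta_true = 20, 0.3
y = [theta_true + random.gauss(0, 1) for _ in range(n)]

def loglik(theta):
    return -0.5 * sum((yi - theta) ** 2 for yi in y)

theta_star, sigma, N = 0.0, 0.1, 100_000
thetas = [theta_star + sigma * random.gauss(0, 1) for _ in range(N)]
logw = [loglik(t) for t in thetas]
m = max(logw)
w = [math.exp(lw - m) for lw in logw]   # stabilize before exponentiating
W = sum(w)
post_mean = sum(wi * t for wi, t in zip(w, thetas)) / W
post_var = sum(wi * (t - post_mean) ** 2 for wi, t in zip(w, thetas)) / W

score_est = (post_mean - theta_star) / sigma**2
hess_est = (post_var - sigma**2) / sigma**4
print(score_est, hess_est)   # compare with the exact sum(y) and -n
```

With a finite $\sigma$ the estimates carry the $O(\sigma^2)$ bias from the identities on top of the Monte Carlo noise, which is exactly the bias/variance trade-off one faces when choosing $\sigma$ in practice.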