Statistical learning in models made of modules

 

[Figure: graph of variables in a model made of two modules, the first with parameter theta1 and data Y1, and the second with parameter theta2 and data Y2, defined conditionally upon theta1.]

 

Hi,

With Lawrence Murray, Chris Holmes and Christian Robert, we have recently arXived a paper entitled “Better together? Statistical learning in models made of modules”. Christian has already blogged about it. The context is the following: parameters of a first model appear as inputs in another model. The question is whether to follow a “joint model approach”, where all parameters are estimated simultaneously with all of the data, or instead a “modular approach”, where the first parameters are estimated with the first model only, ignoring the second model. Examples of modular approaches include the “cut distribution”, or “two-step estimators” (e.g. Chapter 6 of Newey & McFadden (1994)). In many fields, modular approaches are preferred, because the second model is suspected of being more misspecified than the first one. Misspecification of the second model can “contaminate” the joint model, with dire consequences for inference, as described e.g. in Bayarri, Berger & Liu (2009). Other reasons include computational constraints and the fact that all models and associated data are not always available simultaneously. In the paper, we try to make sense of the defects of the joint model approach, and we propose a principled, quantitative way of choosing between joint and modular approaches.
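To fix ideas, here is a sketch of the two objects in the notation of the figure above (my notation here, which may differ slightly from the paper's). The joint model approach targets the full posterior

\[
\pi(\theta_1, \theta_2 \mid Y_1, Y_2) \;\propto\; p_1(\theta_1)\, p_1(Y_1 \mid \theta_1)\, p_2(\theta_2 \mid \theta_1)\, p_2(Y_2 \mid \theta_1, \theta_2),
\]

whereas the cut distribution is defined as

\[
\pi_{\mathrm{cut}}(\theta_1, \theta_2) \;=\; \pi_1(\theta_1 \mid Y_1)\, \pi_2(\theta_2 \mid Y_2, \theta_1),
\]

where \(\pi_1(\theta_1 \mid Y_1)\) denotes the posterior in the first module alone. Under the cut distribution, Y2 never feeds back into the estimation of theta1.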

Lawrence and I started wondering about these questions in the context of models of plankton population growth, back in 2012. Plankton growth is affected by ocean temperatures. These temperatures are not measured everywhere at all times, but one can first use a geophysics model to infer temperatures at the desired locations and times. These estimated temperatures can then be used as inputs (or “forcings”) in a model of plankton growth. We were wondering whether we should instead define a joint model of temperatures + plankton, to take the uncertainty in the temperatures into account. Parslow, Cressie, Campbell, Jones & Murray (2013) provide an example of a plankton model where temperatures are considered fixed, which is common practice.

An initial difficulty with the joint model approach is of a computational nature: if the two stages (e.g. geophysics and plankton) both involve large models, each requiring weeks of computation, the joint model might simply be impossible to deal with. Indeed, the number of parameters adds up, and the cost of computational methods typically grows super-linearly with the number of parameters, which is the dimension of the space to explore (i.e. to sample from or to optimize over). It's even worse than that, since extra difficulties accumulate, such as multimodality, intractable likelihoods, etc.

Interestingly, the difficulties are not only computational but also statistical: the joint model approach is not always preferable in terms of estimation. This has been reported in various contexts, such as pharmacokinetics-pharmacodynamics, where the PD part is usually considered more misspecified than the PK part. A good reference is Lunn, Best, Spiegelhalter, Graham & Neuenschwander (2009). There seems to have been enough demand from practitioners for WinBUGS to include a “cut” function, which is an attempt at estimating some parameters irrespective of other parts of the model (see the WinBUGS manual). Martyn Plummer (of JAGS) wrote a very interesting paper on the cut distribution, on the issues associated with existing algorithms to sample from it, and on proposals to fix them. Another super relevant article is Bayarri, Berger & Liu (2009), which considers the issue in multiple cases of Bayesian inference in misspecified settings. The link between modular approaches and model misspecification has also been thoroughly investigated in the context of causal inference with propensity scores, by Corwin Zigler and others.
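To make the cut distribution concrete, here is a minimal Python sketch of the naive two-stage construction, in a toy conjugate Gaussian two-module model of my own (not an example from the paper), where both stages can be sampled exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate two-module model (illustration only, not from the paper):
# module 1: theta1 ~ N(0,1), Y1_i | theta1 ~ N(theta1, 1)
# module 2: theta2 ~ N(0,1), Y2_i | theta1, theta2 ~ N(theta1 + theta2, 1)
Y1 = rng.normal(1.0, 1.0, size=50)   # first-module data, simulated with theta1 = 1.0
Y2 = rng.normal(0.5, 1.0, size=50)   # second-module data, simulated with theta1 + theta2 = 0.5

def sample_cut(Y1, Y2, ndraws=10_000):
    """Two-stage sampling from the cut distribution:
    theta1 ~ pi1(theta1 | Y1), then theta2 ~ pi2(theta2 | Y2, theta1)."""
    # first-module posterior on theta1 (conjugate Gaussian update, ignores Y2)
    v1 = 1.0 / (1.0 + len(Y1))
    theta1 = rng.normal(v1 * Y1.sum(), np.sqrt(v1), size=ndraws)
    # second-module posterior on theta2, conditional on each theta1 draw
    v2 = 1.0 / (1.0 + len(Y2))
    theta2 = rng.normal(v2 * (Y2.sum() - len(Y2) * theta1), np.sqrt(v2))
    return theta1, theta2

theta1_cut, theta2_cut = sample_cut(Y1, Y2)
print(theta1_cut.mean(), theta2_cut.mean())
```

In this graph structure, drawing theta1 from the full posterior given both Y1 and Y2 in the first stage would instead recover the joint model approach, since theta2 is conditionally independent of Y1 given theta1. In realistic, non-conjugate models the second stage requires a separate MCMC run for each draw of theta1, which is related to the sampling issues discussed by Plummer.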

Departing from the joint model approach seems to pose difficulties for some statisticians. Indeed, the cut distribution is a strange object: it inserts directions into the graph relating the variables of the model, so that A can impact B without B impacting A, whereas information is usually modeled as flowing both ways. In the paper, we argue that the cut distribution can nevertheless be a reasonable choice, in terms of decision-theoretic principles such as predictive performance assessed with the logarithmic scoring rule. This leads to a quantitative criterion that can be computed to decide whether to “cut” or not. Following this type of decision-theoretic reasoning seems to me very much in the spirit of the Bayesian paradigm, as opposed to always trusting joint models and their associated posterior distributions.
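The paper makes this precise; as a rough illustration in that spirit (my own sketch, not the paper's exact criterion, and the names Y2_test, theta1_joint, theta2_joint below are placeholders), one could compare the cut and joint approaches by the logarithmic score they assign to held-out data from the second module, estimated by Monte Carlo from parameter draws:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_predictive_score(y_new, theta1_draws, theta2_draws):
    """Monte Carlo estimate of the log predictive density of held-out
    second-module data: log of the average likelihood over parameter draws."""
    # log-likelihood of y_new for each (theta1, theta2) draw, in the toy model above
    logliks = norm.logpdf(y_new[:, None],
                          loc=theta1_draws + theta2_draws, scale=1.0).sum(axis=0)
    # log of the Monte Carlo average of the likelihoods
    return logsumexp(logliks) - np.log(len(theta1_draws))

# With draws from the cut distribution and from the full posterior, one would
# prefer the approach achieving the higher score on held-out data Y2_test, e.g.:
# log_predictive_score(Y2_test, theta1_cut, theta2_cut)
# log_predictive_score(Y2_test, theta1_joint, theta2_joint)
```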

On the other hand, the issue might appear trivial to econometricians, who are used to two-step estimators and to misspecification in general; see e.g. White (1982) and Robustness by Hansen & Sargent on misspecification, and Pagan (1984) on two-step estimators. The cut distribution is a probabilistic version of these estimators (see the small sketch below), so I wonder why it has not been studied in more detail earlier. If you know of early references, please let me know! In passing, in the recent unbiased MCMC paper (blog post here), we describe new ways of approximating the cut distribution, which hopefully resolve some of the issues raised by Plummer.
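For contrast, a classical two-step estimator in the toy Gaussian sketch above (again my own illustration) simply plugs a point estimate of theta1 into the second stage; the cut distribution replaces that point estimate with the whole first-module posterior, thereby propagating the uncertainty about theta1 to theta2.

```python
# Two-step ("plug-in") estimation, reusing Y1 and Y2 from the toy sketch above:
# step 1: point estimate of theta1 from the first module only
theta1_hat = Y1.sum() / (1.0 + len(Y1))             # posterior mean given Y1 alone
# step 2: estimate theta2 with theta1_hat treated as if it were known
theta2_hat = (Y2 - theta1_hat).sum() / (1.0 + len(Y2))
```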

In the end, our article is an attempt at starting a discussion on modular versus joint approaches. It is likely that more and more situations will require the combination of models (e.g. merging heterogeneous data sources), in ways that take model misspecification into account.

Published by Pierre Jacob

Professor of statistics, ESSEC Business School
