Parallel resampling in the particle filter
It’s been a while I haven’t written about parallelization and GPUs. With colleagues Lawrence Murray and Anthony Lee we have just arXived a new version of Parallel resampling in the particle filter. The setting is that, on modern computing architectures such as GPUs, thousands of operations can be performed in parallel (i.e. simultaneously) and therefore the rest of the calculations that cannot be parallelized quickly becomes the bottleneck. In the case of the particle filter (or any sequential Monte Carlo method such as SMC samplers), that bottleneck is the resampling step. The article investigates this issue and numerically compares different resampling schemes.
In the resampling step, given a vector of “weights” (non-negative real numbers), a vector of integers called “offspring counts”, , is drawn such that for all , . That is, in average a particle has a number of offprings proportional to its normalized weight. Most implementations of the resampling step require a collective operation, such as computing the sum of the weights to normalize them. On top of being a collective operation, computing the sum of the weights is not a numerically stable operation, if the weight vector is very large. Numerical results in the article show that in single precision floating point format (as preferred for fast execution on the GPU) and for vectors of size half a million or more, a typical implementation of the resampling step (multinomial, residual, systematic…) exhibits a non-negligible bias due to numerical instability.
Two resampling strategies come to the rescue: Metropolis and Rejection resampling. These methods, described in details in the article, rely only on pair-wise weight comparisons and thus 1) are numerically stable and 2) bypass collective operations. Interestingly enough, the Metropolis resampler is theoretically biased but, when numerical stability is taken into account in single precision, proves “less biased” than the traditional resampling strategies (which are theoretically unbiased!), again when using half a million particles or more. It’s not too crazy to imagine that particle filters will soon be commonly run with millions of particles, hence the interest of studying the behaviour of resampling schemes in that regime.
Other practical aspects of resampling implementations are discussed in the article, such as whether the resampling step should be done on the CPU or on the GPU, taking into account the cost of copying the vectors into memory. Decision matrices are given (figure above), giving some indication on which is the best strategy in terms of performing resampling on CPU or GPU, and which resampling scheme to use.
All the numerical results of the article can be reproduced using the Resampling package for Libbi.