Playing with pyCUDA: fast(er) generation of normal variables
Some Monte Carlo methods (especially MCMC and SMC) involve drawing a large number of random variables. While trying to make some of my Python code faster, I realized that generating normal variables takes a significant amount of time, especially if you want 10^10 of them or more. There are mainly two reasons why it takes time: first, NumPy uses the Mersenne Twister to generate uniform drawings (which are then used to generate the other random variables); it is a good generator but fairly expensive compared to others (see this discussion). Second, NumPy uses the Box-Muller transform to turn the uniforms into normal drawings. There are faster algorithms (like the Ziggurat) but they are not implemented in the NumPy library. The Box-Muller transform is easy to code but involves computing transcendental functions such as the logarithm and the cosine, which are expensive for the CPU.
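For reference, here is a minimal NumPy sketch of the Box-Muller transform (my own illustration, not the code benchmarked below): each pair of independent uniforms is mapped to a pair of independent standard normals, and the log, sqrt and cos/sin calls are exactly the transcendental functions that make it costly on the CPU.

```python
import numpy as np

def box_muller(n):
    # Two independent batches of uniforms on (0, 1].
    # (1 - u keeps us away from u = 0, where the log would blow up.)
    u1 = 1.0 - np.random.uniform(size=n)
    u2 = np.random.uniform(size=n)
    r = np.sqrt(-2.0 * np.log(u1))   # radius
    theta = 2.0 * np.pi * u2         # angle
    # Each (u1, u2) pair yields two independent N(0, 1) drawings.
    return r * np.cos(theta), r * np.sin(theta)

z1, z2 = box_muller(10**6)
```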
As shown on the graph above, numpy.random.normal (brown line) is about twice as slow as numpy.random.uniform (green line). The vertical axis shows the time in seconds spent generating 100,000 variables; the horizontal axis shows the vector sizes I tried: 10^5, 10^6 and 10^7. The times were measured with the cProfile module.
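If you want to reproduce this kind of measurement, here is a rough sketch of how it could be done (the exact benchmark script isn't shown in this post), profiling both generators with cProfile for the three vector sizes:

```python
import cProfile
import numpy as np

def draw(n):
    np.random.uniform(size=n)
    np.random.normal(size=n)

for n in (10**5, 10**6, 10**7):
    print("--- n = %d ---" % n)
    cProfile.run("draw(n)", sort="cumtime")
```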
Since it’s my temporary hobby (see here for instance), I’ve coded the Box-Muller transform with pyCUDA to take advantage of the GPU. The result is the blue line in the graph above. Clearly, if you want to draw 100,000 normal variables or fewer, don’t bother using the GPU. From 1,000,000 onwards, however, the GPU starts to be an interesting option. The decreasing line stems from the huge fixed overhead of using the GPU: the vectors have to be copied from the standard RAM to the GPU memory. For big vectors, as the graph shows, it’s about twice as fast as numpy.random.normal. It’s a modest improvement, but it’s “free” in the sense that it doesn’t mess with the random number generator itself, and it’s easy to implement.
Indeed, this code uses numpy.random.uniform and only performs the Box-Muller transformation in parallel; the uniform generation itself is not parallelized. This way, you can use the standard NumPy random generator in the whole program, whereas if the uniform generator were coded in pyCUDA, you’d have to stick with it and stop using the NumPy generator; otherwise you’d be running two random generators simultaneously, and that’s BAD.
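To give an idea of what this looks like, here is a simplified sketch of that approach (the actual code I’ve put on-line may differ): the uniforms come from numpy.random.uniform on the CPU, and only the transform runs on the GPU, via pyCUDA’s ElementwiseKernel.

```python
import numpy as np
import pycuda.autoinit  # creates the CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# Element-wise GPU kernel: one standard normal per pair of uniforms.
box_muller_gpu = ElementwiseKernel(
    "float *z, float *u1, float *u2",
    "z[i] = sqrtf(-2.0f * logf(u1[i])) * cosf(6.2831853f * u2[i])",
    "box_muller",
)

def gpu_normal(n):
    # Uniforms from the standard NumPy generator, so the rest of the
    # program keeps using a single random number generator.
    u1 = (1.0 - np.random.uniform(size=n)).astype(np.float32)  # avoid log(0)
    u2 = np.random.uniform(size=n).astype(np.float32)
    d_u1 = gpuarray.to_gpu(u1)   # RAM -> GPU copies: this is the overhead
    d_u2 = gpuarray.to_gpu(u2)
    d_z = gpuarray.empty(n, dtype=np.float32)
    box_muller_gpu(d_z, d_u1, d_u2)
    return d_z.get()             # copy the result back to the CPU

z = gpu_normal(10**6)
```

Since the transform is purely element-wise, an ElementwiseKernel is enough here; there is no need to write a raw CUDA kernel by hand.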
I’ve put the code on-line if you’re interested, but it’s just for testing purposes; you’d better check it closely before actually using it (because I didn’t!).