## Playing with pyCUDA: fast(er) generation of normal variables

Hi again!

Some Monte Carlo methods (especially MCMC and SMC) involve drawing a large amount of random variables. While trying to make some of my Python code faster, I realized that generating normal variables takes a significant time. Especially if you want 10^10 of them or more. There are mainly two reasons why it takes time: first NumPy uses the Mersenne Twister to generate uniform drawings (which are used to generate the other random variables), which is a good generator but fairly expensive compared to others (see this discussion). Second, NumPy uses Box-Muller transform to generate normal drawings using the uniforms. There are faster algorithms (like Ziggurat) but there are not implemented in the NumPy library. Box-Muller transform is easy to code but involves computing transcendental functions such as logarithm and cosinus, which are expensive for the CPU.

As shown on the graph above, numpy.random.normal (brown line) is twice slower than numpy.random.uniform (green line). On the vertical axis is the time spent to compute 100,000 variables in seconds. On the horizontal axis are the vector sizes I’ve tried: 10^5, 10^6 and 10^7. The time was measured using the cProfile module.

Since it’s my temporary hobby (see here for instance), I’ve coded the Box-Muller transform with pyCUDA to take advantage of the GPU. The result is the blue line in the graph above. Clearly if you want to draw 100,000 normal variables or less, don’t bother using the GPU. However, from 1,000,000 onwards the GPU starts to be an interesting option. The decreasing line stems from the huge overhead when using the GPU, since the vectors have to be copied from the standard RAM to the GPU memory. For big vectors, as the graph shows it’s about twice faster than numpy.random.normal. It’s somewhat of a small improvement, but it’s “free” in the sense that it doesn’t mess with the random number generator itself, and it’s easy to implement.

Indeed this code uses numpy.random.uniform and performs the Box-Muller transformation in parallel; the uniform generation itself is not parallelized. This way, you can use the standard numpy random generator in the whole program, whereas if the uniform generator was coded in pyCUDA, you’d have to stick with it and not use the numpy generator anymore; otherwise you’d be using two random generators simultaneously, and that’s BAD.

I’ve put the code on-line if you’re interested, but it’s just for testing purpose, you’d better check it closely before actually using it (because I didn’t!).

xi'ansaid, on 24 October 2010 at 06:21Cool stuff! Do you use Box-Muller without transcendental functions like the following version?

1. Generate

until .

2. Generate

and take

And there is this other algorithm

pierrejacobsaid, on 25 October 2010 at 03:24Hi,

Thanks for your comment! These algorithms don’t rely on transcendental functions indeed, but once they’re parallelized on the GPU transcendental functions are fast. Moreover they would be less convenient to parallelize since they both use an “until” statement, which implies that all the parallel tasks (“threads”) might take a different time to be finished with the transformation of their respective uniform / exponential variable. In the end, instead of (roughly) dividing your time by the number of cores, you would replace the total time by the maximum of each individual transformation time.

The same would hold for the Ziggurat method.

xi'ansaid, on 25 October 2010 at 08:58I see the trouble with “until” but would not many runs on each thread tend to equalise the cost across tasks and give you the average number of attemps?

pierrejacobsaid, on 26 October 2010 at 18:51I guess it would, yes! Although, from what I understood GPUs are efficient on computations, including complicated/transcendental functions, but struggle with “if” statements (and more generally branch instructions), see here for instance:

http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter34.html

GPUs are not faster than CPUs for everything, and branch instructions are one of those things where CPU is superior.

In the end it’s hard to predict which algorithm is going to be worth coding on the GPU and which one isn’t, in my opinion.

And everything could change if CPUs and GPUs merge into a common device, such as this:

http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx

quantum probabilitysaid, on 13 November 2010 at 08:35This is fascinating. But why is the GPU involved in particular?

pierrejacobsaid, on 13 November 2010 at 17:46Because the GPU is made of hundreds of cores that can work in parallel. See e.g. here:

http://www.nvidia.com/object/GPU_Computing.html

CPU and GPU trends over time « Why?said, on 25 January 2011 at 18:05[...] seem to be all the rage these days. At the last Bayesian Valencia meeting, Chris Holmes gave a nice talk on how GPUs could be [...]

Reviewing a paper that uses GPUs « Why?said, on 17 July 2011 at 23:53[...] processing units (GPUs) are all the rage these days. Most journal issues would be incomplete if at least one article didn’t mention the word [...]