Playing with pyCUDA: fast(er) generation of normal variables

Pierre Jacob19 October 201019 October 2010Geek

Hi again!

Some Monte Carlo methods (especially MCMC and SMC) involve drawing a large amount of random variables. While trying to make some of my Python code faster, I realized that generating normal variables takes a significant time. Especially if you want 10^10 of them or more. There are mainly two reasons why it takes time: first NumPy uses the Mersenne Twister to generate uniform drawings (which are used to generate the other random variables), which is a good generator but fairly expensive compared to others (see this discussion). Second, NumPy uses Box-Muller transform to generate normal drawings using the uniforms. There are faster algorithms (like Ziggurat) but there are not implemented in the NumPy library. Box-Muller transform is easy to code but involves computing transcendental functions such as logarithm and cosinus, which are expensive for the CPU.

As shown on the graph above, numpy.random.normal (brown line) is twice slower than numpy.random.uniform (green line). On the vertical axis is the time spent to compute 100,000 variables in seconds. On the horizontal axis are the vector sizes I’ve tried: 10^5, 10^6 and 10^7. The time was measured using the cProfile module.

Since it’s my temporary hobby (see here for instance), I’ve coded the Box-Muller transform with pyCUDA to take advantage of the GPU. The result is the blue line in the graph above. Clearly if you want to draw 100,000 normal variables or less, don’t bother using the GPU. However, from 1,000,000 onwards the GPU starts to be an interesting option. The decreasing line stems from the huge overhead when using the GPU, since the vectors have to be copied from the standard RAM to the GPU memory. For big vectors, as the graph shows it’s about twice faster than numpy.random.normal. It’s somewhat of a small improvement, but it’s “free” in the sense that it doesn’t mess with the random number generator itself, and it’s easy to implement.

Indeed this code uses numpy.random.uniform and performs the Box-Muller transformation in parallel; the uniform generation itself is not parallelized. This way, you can use the standard numpy random generator in the whole program, whereas if the uniform generator was coded in pyCUDA, you’d have to stick with it and not use the numpy generator anymore; otherwise you’d be using two random generators simultaneously, and that’s BAD.

I’ve put the code on-line if you’re interested, but it’s just for testing purpose, you’d better check it closely before actually using it (because I didn’t!).

Download the code | See the code online.

Published by Pierre Jacob

Professor of statistics, ESSEC Business School View more posts

9 thoughts on “Playing with pyCUDA: fast(er) generation of normal variables”

xi'an says:

24 October 2010 at 06:21
Cool stuff! Do you use Box-Muller without transcendental functions like the following version?
1. Generate
$Y_1,Y_2\sim\mathcal{E}(1)$
until $Y_2 >(1-Y_1)^2/2$ .
2. Generate
$U\sim\mathcal{U}([0,1])$
and take
$X=\begin{cases}Y_1 \text{ if }\; U \geq 0.5 \\ -Y_1 \text{ otherwise} \end{cases}$

And there is this other algorithm
1. Generate $U_1,U_2\sim U([-1,1])$ until $S=U_1^2+U_2^2\le 1$ .
2. Define $Z=\sqrt{-2\log(S)/S}$ and take $X_1=Z\; U_1,\qquad X_2=Z\; U_2$
Reply
1. pierrejacob says:
  
  25 October 2010 at 03:24
  
  Hi,
  
  Thanks for your comment! These algorithms don’t rely on transcendental functions indeed, but once they’re parallelized on the GPU transcendental functions are fast. Moreover they would be less convenient to parallelize since they both use an “until” statement, which implies that all the parallel tasks (“threads”) might take a different time to be finished with the transformation of their respective uniform / exponential variable. In the end, instead of (roughly) dividing your time by the number of cores, you would replace the total time by the maximum of each individual transformation time.
  
  The same would hold for the Ziggurat method.
  
  Reply
xi'an says:

25 October 2010 at 08:58

I see the trouble with “until” but would not many runs on each thread tend to equalise the cost across tasks and give you the average number of attemps?

Reply
1. pierrejacob says:
  
  26 October 2010 at 18:51
  
  I guess it would, yes! Although, from what I understood GPUs are efficient on computations, including complicated/transcendental functions, but struggle with “if” statements (and more generally branch instructions), see here for instance:
  http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter34.html
  
  GPUs are not faster than CPUs for everything, and branch instructions are one of those things where CPU is superior.
  
  In the end it’s hard to predict which algorithm is going to be worth coding on the GPU and which one isn’t, in my opinion.
  
  And everything could change if CPUs and GPUs merge into a common device, such as this:
  http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx
  
  Reply
quantum probability says:

13 November 2010 at 08:35

This is fascinating. But why is the GPU involved in particular?

Reply
1. pierrejacob says:
  
  13 November 2010 at 17:46
  
  Because the GPU is made of hundreds of cores that can work in parallel. See e.g. here:
  http://www.nvidia.com/object/GPU_Computing.html
  
  Reply
Pingback: CPU and GPU trends over time « Why?
Pingback: Reviewing a paper that uses GPUs « Why?
caroline says:

14 June 2018 at 04:41

any chance these codes are still available? I have just set up my cuda and want to adapt my Monte Carlo codes. Thank you so much. The links to see/download the code are not working. feel free to email Caroline.gorham@gmail

Reply