Playing with pyCUDA: fast(er) generation of normal variables

Hi again!

Some Monte Carlo methods (especially MCMC and SMC) involve drawing a large number of random variables. While trying to make some of my Python code faster, I realized that generating normal variables takes a significant amount of time, especially if you want 10^10 of them or more. There are mainly two reasons why: first, NumPy uses the Mersenne Twister to generate uniform draws (from which the other random variables are obtained); it is a good generator but fairly expensive compared to others (see this discussion). Second, NumPy uses the Box-Muller transform to turn those uniforms into normal draws. There are faster algorithms (like the Ziggurat), but they are not implemented in NumPy. The Box-Muller transform is easy to code but involves computing transcendental functions such as the logarithm and the cosine, which are expensive for the CPU.
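For reference, here is what the Box-Muller transform looks like in plain NumPy (a minimal sketch of the textbook algorithm, not NumPy's internal implementation):

```python
import numpy as np

def box_muller(n):
    """Generate n standard normal draws from pairs of uniforms."""
    m = (n + 1) // 2
    u1 = 1.0 - np.random.uniform(size=m)  # shift [0, 1) to (0, 1] so log(u1) is finite
    u2 = np.random.uniform(size=m)
    # the transcendental part: log, sqrt, cos and sin on every draw
    r = np.sqrt(-2.0 * np.log(u1))
    z = np.concatenate((r * np.cos(2.0 * np.pi * u2),
                        r * np.sin(2.0 * np.pi * u2)))
    return z[:n]
```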

As shown on the graph above, numpy.random.normal (brown line) is about twice as slow as numpy.random.uniform (green line). The vertical axis is the time, in seconds, spent generating 100,000 variables; the horizontal axis shows the vector sizes I tried: 10^5, 10^6 and 10^7. The timings were measured with the cProfile module.
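If you want to reproduce a comparison like this, here is a quick sketch (the post measured with cProfile; timeit is used here for brevity, so the exact numbers will differ):

```python
import timeit
import numpy as np

for n in (10**5, 10**6, 10**7):
    t_unif = timeit.timeit(lambda: np.random.uniform(size=n), number=10) / 10
    t_norm = timeit.timeit(lambda: np.random.normal(size=n), number=10) / 10
    scale = 1e5 / n  # normalize to seconds per 100,000 variables, as on the graph
    print(f"n={n:>8}: uniform {t_unif * scale:.4f}s, normal {t_norm * scale:.4f}s per 1e5 draws")
```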

Since it’s my temporary hobby (see here for instance), I’ve coded the Box-Muller transform with pyCUDA to take advantage of the GPU. The result is the blue line in the graph above. Clearly, if you want to draw 100,000 normal variables or fewer, don’t bother using the GPU. From 1,000,000 onwards, however, the GPU starts to be an interesting option. The decreasing line stems from the large fixed overhead of using the GPU, since the vectors have to be copied from standard RAM to GPU memory. For big vectors, as the graph shows, it’s about twice as fast as numpy.random.normal. It’s a modest improvement, but it’s “free” in the sense that it doesn’t mess with the random number generator itself, and it’s easy to implement.

Indeed, this code uses numpy.random.uniform and only performs the Box-Muller transformation in parallel; the uniform generation itself is not parallelized. This way, you can use the standard NumPy random generator throughout the program, whereas if the uniform generator were coded in pyCUDA, you’d have to stick with it and stop using the NumPy generator; otherwise you’d be using two random generators simultaneously, and that’s BAD.
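To make the design concrete, here is a minimal sketch of that architecture using pyCUDA’s ElementwiseKernel (an illustration of the idea with my own names like gpu_randn, not the code linked below):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- sets up the CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# Element-wise Box-Muller: thread i turns the pair (u1[i], u2[i])
# into two independent standard normals (z1[i], z2[i]).
box_muller_gpu = ElementwiseKernel(
    "float *u1, float *u2, float *z1, float *z2",
    """
    float r = sqrtf(-2.0f * logf(u1[i]));
    z1[i] = r * cosf(2.0f * 3.14159265f * u2[i]);
    z2[i] = r * sinf(2.0f * 3.14159265f * u2[i]);
    """,
    "box_muller_gpu")

def gpu_randn(n):
    """Draw uniforms with NumPy on the CPU, transform them on the GPU."""
    m = (n + 1) // 2
    u1 = gpuarray.to_gpu((1.0 - np.random.uniform(size=m)).astype(np.float32))
    u2 = gpuarray.to_gpu(np.random.uniform(size=m).astype(np.float32))
    z1 = gpuarray.empty(m, np.float32)
    z2 = gpuarray.empty(m, np.float32)
    box_muller_gpu(u1, u2, z1, z2)
    return np.concatenate((z1.get(), z2.get()))[:n]
```

The copies to and from the GPU (to_gpu and get) are exactly the overhead that makes the small-vector case not worth it.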

I’ve put the code online if you’re interested, but it’s just for testing purposes; you’d better check it closely before actually using it (because I didn’t!).

Download the code | See the code online.

Published by Pierre Jacob

Professor of statistics, ESSEC Business School

9 thoughts on “Playing with pyCUDA: fast(er) generation of normal variables”

  1. Cool stuff! Do you use Box-Muller without transcendental functions, like the following version?

    1. Generate $Y_1, Y_2 \sim \mathcal{E}(1)$ until $Y_2 > (1-Y_1)^2/2$.
    2. Generate $U \sim \mathcal{U}([0,1])$ and take $X = \begin{cases} Y_1 & \text{if } U \geq 0.5 \\ -Y_1 & \text{otherwise} \end{cases}$

    And there is this other algorithm:

    1. Generate $U_1, U_2 \sim \mathcal{U}([-1,1])$ until $S = U_1^2 + U_2^2 \le 1$.
    2. Define $Z = \sqrt{-2\log(S)/S}$ and take $X_1 = Z\,U_1$, $X_2 = Z\,U_2$.
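    A quick Python sketch of the polar method just described (an illustration based on the comment, not code from the post):

    ```python
    import numpy as np

    def polar_normal():
        # Marsaglia's polar method: rejection-sample a point in the unit
        # disc, then map it to two independent standard normals.
        while True:
            u1, u2 = np.random.uniform(-1.0, 1.0, size=2)
            s = u1 ** 2 + u2 ** 2
            if 0.0 < s <= 1.0:  # reject points outside the disc (and s == 0)
                z = np.sqrt(-2.0 * np.log(s) / s)
                return z * u1, z * u2
    ```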
    1. Hi,

      Thanks for your comment! Indeed, these algorithms don’t rely on transcendental functions, but once the transform is parallelized on the GPU, transcendental functions are fast anyway. Moreover, they would be less convenient to parallelize since they both use an “until” statement, which implies that the parallel tasks (“threads”) may take different amounts of time to finish transforming their respective uniform / exponential variables. In the end, instead of (roughly) dividing the total time by the number of cores, the total time would be the maximum of the individual transformation times.

      The same would hold for the Ziggurat method.

  2. I see the trouble with “until”, but wouldn’t many runs on each thread tend to equalise the cost across tasks and give you the average number of attempts?

    1. I guess it would, yes! Although, from what I understood, GPUs are efficient at computation, including complicated/transcendental functions, but struggle with “if” statements (and more generally with branch instructions); see here for instance:
      http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter34.html

      GPUs are not faster than CPUs at everything, and branch instructions are one of those things where the CPU is superior.

      In the end it’s hard to predict which algorithm is going to be worth coding on the GPU and which one isn’t, in my opinion.

      And everything could change if CPUs and GPUs merge into a common device, such as this:
      http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx

  3. Any chance these codes are still available? I have just set up CUDA and want to adapt my Monte Carlo codes. Thank you so much. The links to see/download the code are not working. Feel free to email Caroline.gorham@gmail
