Statisfaction

Playing with pyCUDA: fast(er) generation of normal variables

Posted in Geek by Pierre Jacob on 19 October 2010

Hi again!

Some Monte Carlo methods (especially MCMC and SMC) involve drawing a large amount of random variables. While trying to make some of my Python code faster, I realized that generating normal variables takes a significant time. Especially if you want 10^10 of them or more. There are mainly two reasons why it takes time: first NumPy uses the Mersenne Twister to generate uniform drawings (which are used to generate the other random variables), which is a good generator but fairly expensive compared to others (see this discussion). Second, NumPy uses Box-Muller transform to generate normal drawings using the uniforms. There are faster algorithms (like Ziggurat) but there are not implemented in the NumPy library. Box-Muller transform is easy to code but involves computing transcendental functions such as logarithm and cosinus, which are expensive for the CPU.

As shown on the graph above, numpy.random.normal (brown line) is twice slower than numpy.random.uniform (green line). On the vertical axis is the time spent to compute 100,000 variables in seconds. On the horizontal axis are the vector sizes I’ve tried: 10^5, 10^6 and 10^7. The time was measured using the cProfile module.

Since it’s my temporary hobby (see here for instance), I’ve coded the Box-Muller transform with pyCUDA to take advantage of the GPU. The result is the blue line in the graph above. Clearly if you want to draw 100,000 normal variables or less, don’t bother using the GPU. However, from 1,000,000 onwards the GPU starts to be an interesting option. The decreasing line stems from the huge overhead when using the GPU, since the vectors have to be copied from the standard RAM to the GPU memory. For big vectors, as the graph shows it’s about twice faster than numpy.random.normal. It’s somewhat of a small improvement, but it’s “free” in the sense that it doesn’t mess with the random number generator itself, and it’s easy to implement.

Indeed this code uses numpy.random.uniform and performs the Box-Muller transformation in parallel; the uniform generation itself is not parallelized. This way, you can use the standard numpy random generator in the whole program, whereas if the uniform generator was coded in pyCUDA, you’d have to stick with it and not use the numpy generator anymore; otherwise you’d be using two random generators simultaneously, and that’s BAD.

I’ve put the code on-line if you’re interested, but it’s just for testing purpose, you’d better check it closely before actually using it (because I didn’t!).

Download the code | See the code online.

About these ads

8 Responses

Subscribe to comments with RSS.

  1. xi'an said, on 24 October 2010 at 06:21

    Cool stuff! Do you use Box-Muller without transcendental functions like the following version?
    1. Generate
    Y_1,Y_2\sim\mathcal{E}(1)
    until Y_2 >(1-Y_1)^2/2.
    2. Generate
    U\sim\mathcal{U}([0,1])
    and take
    X=\begin{cases}Y_1 \text{ if }\; U \geq 0.5 \\ -Y_1 \text{ otherwise} \end{cases}

    And there is this other algorithm

    1. Generate U_1,U_2\sim U([-1,1]) until S=U_1^2+U_2^2\le 1.
    2. Define Z=\sqrt{-2\log(S)/S} and take X_1=Z\; U_1,\qquad X_2=Z\; U_2
    • pierrejacob said, on 25 October 2010 at 03:24

      Hi,

      Thanks for your comment! These algorithms don’t rely on transcendental functions indeed, but once they’re parallelized on the GPU transcendental functions are fast. Moreover they would be less convenient to parallelize since they both use an “until” statement, which implies that all the parallel tasks (“threads”) might take a different time to be finished with the transformation of their respective uniform / exponential variable. In the end, instead of (roughly) dividing your time by the number of cores, you would replace the total time by the maximum of each individual transformation time.

      The same would hold for the Ziggurat method.

  2. xi'an said, on 25 October 2010 at 08:58

    I see the trouble with “until” but would not many runs on each thread tend to equalise the cost across tasks and give you the average number of attemps?

    • pierrejacob said, on 26 October 2010 at 18:51

      I guess it would, yes! Although, from what I understood GPUs are efficient on computations, including complicated/transcendental functions, but struggle with “if” statements (and more generally branch instructions), see here for instance:

      http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter34.html

      GPUs are not faster than CPUs for everything, and branch instructions are one of those things where CPU is superior.

      In the end it’s hard to predict which algorithm is going to be worth coding on the GPU and which one isn’t, in my opinion.

      And everything could change if CPUs and GPUs merge into a common device, such as this:

      http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx

  3. quantum probability said, on 13 November 2010 at 08:35

    This is fascinating. But why is the GPU involved in particular?

  4. CPU and GPU trends over time « Why? said, on 25 January 2011 at 18:05

    […] seem to be all the rage these days. At the last Bayesian Valencia meeting, Chris Holmes gave a nice talk on how GPUs could be […]

  5. […] processing units (GPUs) are all the rage these days. Most journal issues would be incomplete if at least one article didn’t mention the word […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 53 other followers

%d bloggers like this: