Source Information¶


Created by: Abe Stern

Updated: October 25, 2024 by Gloria Seo

Resources: http://numba.pydata.org/


Goal¶

The goal of this Jupyter Notebook is to demonstrate how to compute the value of Pi using the Monte Carlo method with GPU acceleration, specifically by leveraging CUDA programming through the Numba library in Python.

CUDA-Python with Numba¶

Computing the value of Pi with GPU-accelerated Monte Carlo¶

Numba exposes the CUDA programming API, which makes it easy to program GPUs directly from Python. Here we will compute the value of Pi via Monte Carlo sampling. Recall that the area of a circle of radius $r$ is $\pi r^2$. If we circumscribe a square about the circle, the square has a side length of $2r$ and an area of $4r^2$. So the ratio $R$ of the area of the circle to the area of the square is $\pi/4$, independent of the radius.
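Written out, with the radius canceling, the ratio and the resulting estimator for Pi are:

$$ R = \frac{\pi r^2}{(2r)^2} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4}, \qquad \text{so} \qquad \pi = 4R. $$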

To numerically compute this ratio $R$, we construct our own circle inscribed within a square. Since we are doing this numerically, we choose a convenient radius of 1 and center the circle at the origin. Sampling points in this 2D space means drawing two uniformly distributed random numbers in the range [0, 1). This corresponds to sampling in only one quadrant of the figure, but the ratio is unchanged.

To determine whether a point falls within the circle, we compute its distance from the origin, i.e., the vector norm. For a 2D point with coordinates $x$ and $y$, this is $(x^2 + y^2)^{1/2}$. If the norm is less than 1, the point falls within the circle; if it is greater than 1, it falls outside. The ratio $R$ is simply the fraction of sampled points that fall within the circle, and Pi is then $4R$.
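As a quick sanity check of the estimator itself, here is a minimal CPU-only sketch with a small, hypothetical sample size (the names n_small, pts, and inside are illustrative; the full NumPy and GPU implementations follow below):

In [ ]:
import numpy as np

# Hypothetical small sample just to illustrate the estimator;
# the notebook proper uses a much larger N.
n_small = 100_000
pts = np.random.random(size=(n_small, 2))     # points in the unit square [0, 1)^2
inside = pts[:, 0]**2 + pts[:, 1]**2 < 1.0    # True if the point lies inside the unit circle
print(4 * inside.mean())                      # rough estimate of Pi, e.g. ~3.14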

Required Modules for the Jupyter Notebook¶

To run this notebook, make sure the following Python libraries are installed in your environment:

Modules: numba, numpy (math is part of the Python standard library, and cuda is provided by the numba package)

In [1]:
import numba
import math
import numpy as np
from numba import cuda
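
Optionally, you can verify that Numba can see a CUDA-capable GPU before proceeding; a quick check:

In [ ]:
# Optional: confirm that a CUDA-capable device is visible to Numba.
print(cuda.is_available())
cuda.detect()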

Define CUDA Kernels¶

We define two CUDA kernels to perform the computation:

  1. compute_norm: takes an array of 2D points and fills a result array indicating whether each point falls within the unit circle (radius 1), by computing its distance from the origin.

  2. sum_reduce: a GPU-accelerated reduction that sums the results of compute_norm, i.e., counts how many points fall inside the circle.

In compute_norm, the CUDA-provided variables cuda.threadIdx.x, cuda.blockIdx.x, and cuda.blockDim.x are combined to index into the input array ar. Note the amount of work performed by each thread: each thread operates on a single point, computes its norm, and stores a single result.
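As an aside, Numba also provides cuda.grid(1), which returns the same flattened global thread index as cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x; the kernel below spells the arithmetic out explicitly for clarity.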

In [2]:
@cuda.jit
def compute_norm(ar, res):
    '''
    determine if points fall within a circle of radius 1.0
    ---
    From a 2D array of shape (N,2), compute the vector norm
    of each point.  If norm is less than 1.0, store True,
    else, store False.
    
    Result array must be shaped (N)
    '''
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    bx = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    px = tx + bx * bw
    
    if px < ar.shape[0]:  # check array boundaries (ar has shape (N, 2))

        # compute vector norm (distance from origin)
        norm = math.sqrt(ar[px, 0]**2 + ar[px, 1]**2)
        res[px] = norm < 1.0
        
@cuda.reduce
def sum_reduce(a, b):
    return a + b
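
As a quick sanity check, the reduction can be applied to a small host array and compared against NumPy's sum (a hypothetical check, separate from the main computation; the variable small is illustrative):

In [ ]:
# Optional sanity check: sum_reduce should agree with np.sum.
small = np.arange(1024, dtype=np.float64)
print(sum_reduce(small), small.sum())   # both should print 523776.0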

Generate Input Data¶

N holds the total number of points that we will sample; the input array of random points will be shaped (N, 2):

In [ ]:
# set the size of our input array
N = int(5e8)

# generate our input array of random numbers
P = np.random.random(size=(N, 2))
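
Note that with N = 5e8, P holds $10^9$ double-precision values, roughly 8 GB of host memory; if that exceeds your available memory, a smaller N (for example, 5e7) still demonstrates the method. A quick footprint check:

In [ ]:
# Memory footprint of the input array, in GB.
print(P.nbytes / 1e9)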

NumPy Implementation (CPU)¶

Before we use the GPU, we first implement the Monte Carlo method using only NumPy to confirm that our algorithm works. We time this run so we can compare it with the GPU-accelerated version later:

In [ ]:
%%timeit -n1 -r1 -o
# flag points inside the unit circle, then take the fraction
inside = np.linalg.norm(P, axis=1) <= 1.0
R = np.sum(inside) / N
print(R * 4)
In [ ]:
# store the timing result (%%timeit -o returns a TimeitResult; `_` holds the last output)
CPU_RESULT = _

Numba Implementation (GPU)¶

Next, we use CUDA to accelerate the Monte Carlo computation on the GPU.

First, we need to send the input data to the GPU and allocate space on the device for the result array; only the final reduced value comes back to the host at the end.

Next, we'll define enough threads to perform all the necessary work. There are other strategies for doing this, but for simplicity we just make sure that nblocks * nthreads > N.
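For example, with the values used in the timed cell below (N = 5e8 points and 1024 threads per block), the launch configuration works out as follows (an illustrative check only):

In [ ]:
# Illustrative launch-configuration arithmetic (duplicated from the timed cell below).
nthreads = 1024
nblocks = math.floor(N / nthreads) + 1
print(nblocks, nblocks * nthreads >= N)   # 488282 True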

We launch the kernel with this configuration to compute the boolean array stored in result_gpu, then call the reduction kernel on that data, which is already on the GPU. The return value is the number of True entries, so $R$ is simply that count divided by N.

In [ ]:
%%timeit -n2 -r5 -o
array_gpu = cuda.to_device(P)
result_gpu = cuda.device_array(N, dtype=np.int32)

nthreads = 1024
nblocks = math.floor(N / nthreads) + 1

# Launch the kernel
compute_norm[nblocks, nthreads](array_gpu, result_gpu)

R = sum_reduce(result_gpu)/N
print(R*4)
In [ ]:
# store the timing result for the GPU run
GPU_RESULT = _
In [ ]:
print('Speedup factor: ', CPU_RESULT.average / GPU_RESULT.average, 'X')

Submit Ticket¶

If you find anything that needs to be changed or edited, or if you would like to provide feedback or contribute to the notebook, please submit a ticket by contacting us at:

Email: consult@sdsc.edu

We appreciate your input and will review your suggestions promptly!
