Source Information¶


Created by: Abe Stern

Updated: October 25, 2024 by Gloria Seo

Resources: http://numba.pydata.org/


Goal¶

This notebook demonstrates the power of GPU acceleration for scientific computations and provides a hands-on example of CUDA programming with Numba.

CUDA-Python with Numba¶

2D Threadblocks¶

CUDA allows us to use 1-, 2-, or 3-dimensional grids and threadblocks. In this example, we'll use a two-dimensional threadblock to simplify our kernel definition.

The kernel that we'll write computes the distance between all pairs of atoms within single molecules. That is, we'll compute intramolecular distances for each molecule in a list of molecules.
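
Concretely, each GPU thread will compute a 3-D Euclidean distance between one pair of atoms. As a minimal sketch of that quantity for a single molecule (the names mol, i, and j below are illustrative, not part of the notebook's data):

In [ ]:
# Sketch: distance between atoms i and j of one hypothetical molecule,
# where mol is an (ATOM_MAX, 3) array of coordinates.
import numpy as np
mol = np.random.random((29, 3))
i, j = 0, 1
d_ij = np.sqrt(np.sum((mol[i] - mol[j]) ** 2))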

Required Modules for the Jupyter Notebook¶

To execute this notebook, the following Python modules and packages are required.

Modules: numba (including numba.cuda), numpy, math

In [1]:
import numba
from numba import cuda
import numpy as np
import math

Input Data¶

For a list of N molecules, each with at most ATOM_MAX atoms and 3 spatial coordinates per atom, our input data will be shaped (N, ATOM_MAX, 3). We'll generate some random coordinates and store them in the array crds.

The result array will be shaped (N, ATOM_MAX, ATOM_MAX).

In [2]:
# CONSTANTS
ATOM_MAX = 29   # maximum number of atoms per molecule
N = int(1e4)    # number of molecules
In [3]:
# random coordinates for N molecules with ATOM_MAX atoms each
crds = np.random.random((N, ATOM_MAX, 3))
# per-molecule atom counts (here, every molecule has ATOM_MAX atoms)
crds_natoms = np.array([ATOM_MAX] * N)

# create result array on the GPU
shape = (N, ATOM_MAX, ATOM_MAX)
distance_mat_gpu = cuda.device_array(shape, dtype=np.float32)

Kernel Design¶

We'll leverage the block-thread hierarchy in CUDA to make our kernel design simple. Specifically, we'll use a 2-D threadblock such that each thread will compute a distance between a unique pair of atoms in our array. In other words, threadIdx.x and threadIdx.y will index into atoms i and j of each molecule. It follows that blockIdx.x should be 1-D and index over molecules.

In [4]:
@cuda.jit
def compute_distance_mat(crds, natoms, dmat):
    # thread indices select the atom pair (tx, ty);
    # the block index selects the molecule
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x

    # zero the entry first, so padded atom pairs stay 0.0
    dmat[bx, tx, ty] = 0.0

    # bounds check: skip threads beyond this molecule's atom count
    N = natoms[bx]
    if tx >= N or ty >= N:
        return

    # squared Euclidean distance over the 3 spatial coordinates
    for i in range(3):
        dmat[bx, tx, ty] += (crds[bx, tx, i] - crds[bx, ty, i]) ** 2
    dmat[bx, tx, ty] = math.sqrt(dmat[bx, tx, ty])

Call CUDA Kernel¶

First, we'll move our input data to the GPU. We have to do this for the crds and crds_natoms arrays; the result array never exists on the host, since we already allocated space for it directly on the GPU above.

Next, we launch the kernel. To do so, we must supply a launch configuration that specifies how many blocks and threads to use. As discussed above, the threadblock is 2-D, with threadIdx.x and threadIdx.y indexing atoms i and j of each molecule, and the grid is 1-D, with blockIdx.x indexing over molecules. With ATOM_MAX = 29, each block contains 29 × 29 = 841 threads, comfortably under CUDA's limit of 1024 threads per block.

In [5]:
%%timeit -n1 -r1 -o

# transfer structure data (coordinates and atom counts) to the GPU
crds_gpu = cuda.to_device(crds)
crds_natoms_gpu = cuda.to_device(crds_natoms)

# launch kernel: one block per molecule, one thread per atom pair
compute_distance_mat[N, (ATOM_MAX, ATOM_MAX)](crds_gpu, crds_natoms_gpu, distance_mat_gpu)

# copy the data back
distance_mat_cpu = distance_mat_gpu.copy_to_host()
Note: the original run of this cell failed inside Numba's CUDA compiler with TypeError: not enough arguments for format string, raised from get_arch_option in numba/cuda/cudadrv/nvvm.py. This is an environment issue, typically an installed Numba version that does not recognize the GPU's compute capability, rather than a problem with the kernel itself; upgrading Numba or running on a supported GPU resolves it. On a working installation, the cell prints its timing, and the -o flag lets the next cell capture the result.
In [ ]:
# store the timing result
GPU_TIMING = _
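
A note on timing: CUDA kernel launches are asynchronous, but copy_to_host() in the timed cell blocks until the kernel has finished, so the measurement above includes the kernel plus both transfers. If you time a launch without a device-to-host copy, insert an explicit synchronization first; a minimal sketch:

In [ ]:
# Block until all queued GPU work has completed before stopping a timer.
cuda.synchronize()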

Naive NumPy Version¶

There are faster ways to do this calculation in NumPy, using reshapes and broadcasting (a sketch follows below). For the sake of simplicity, and to be sure we compute a correct reference result, we use explicit loops.
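
For reference, the broadcast-based version mentioned above might look like the following sketch. It materializes an intermediate array of shape (N, ATOM_MAX, ATOM_MAX, 3), roughly 200 MB at these sizes, trading memory for speed; we still time the loop version, which serves as our ground truth.

In [ ]:
# Sketch: vectorized pairwise distances via broadcasting (not timed here).
# diff[m, i, j, :] = crds[m, i, :] - crds[m, j, :]
diff = crds[:, :, None, :] - crds[:, None, :, :]
dmat_broadcast = np.sqrt((diff ** 2).sum(axis=-1))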

In [ ]:
dmat_np_cpu = np.zeros((N, ATOM_MAX, ATOM_MAX))
In [ ]:
%%timeit -n1 -r1 -o

for m in range(N):
    for i in range(ATOM_MAX):
        for j in range(ATOM_MAX):
            dmat_np_cpu[m,i,j] = np.sqrt(np.sum((crds[m, i] - crds[m,j])**2))
In [ ]:
# store the timing result
CPU_TIMING = _

Speedup Factor¶

In [ ]:
print('Speedup factor: ', CPU_TIMING.average / GPU_TIMING.average, 'X')

Checking Results¶

In [ ]:
# copy the data back (again)
distance_mat_cpu = distance_mat_gpu.copy_to_host()

# check results: flag any entry whose absolute difference exceeds tol
tol = 1e-4
error = 0
for m in range(N):
    for i in range(ATOM_MAX):
        for j in range(ATOM_MAX):
            r = abs(dmat_np_cpu[m, i, j] - distance_mat_cpu[m, i, j])
            if r > tol:
                error = 1
if not error:
    print('results agree')
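
Equivalently, the whole loop-based comparison can be expressed in one line with np.allclose:

In [ ]:
# One-line equivalent of the loop-based comparison above
print('results agree' if np.allclose(dmat_np_cpu, distance_mat_cpu, atol=tol) else 'results differ')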

Submit Ticket¶

If you find anything that needs to be changed, edited, or if you would like to provide feedback or contribute to the notebook, please submit a ticket by contacting us at:

Email: consult@sdsc.edu

We appreciate your input and will review your suggestions promptly!
