Source Information¶


Created by: Bob Sinkovits

Updated by: October 24, 2024 by Gloria Seo

Resources: https://github.com/sinkovit/PythonSeries


Goal¶

This notebook provides an introduction of NumPy library, covering array creation, manipulation, basic linear algebra, and random sampling for high-performance computing.

NumPy¶

The NumPy library provides support for arrays, matricies and mathematical functions to operate on them. We'll start with the basics and then cover a few more advanced topics.

The N-dimensional array (ndarray) and array basics¶

The core of NumPy is the n-dimensional array (ndarray). Although it may seem similar to Python lists or the array.array class, there are some important differences. ndarrays ...

  • are homogenous, all elements must be of the same type (unlike list)
  • can be multidimensional (unlike array.array)
  • have much more functionality

We'll dive in with an example. The arange function is analogous to Python's range function, while reshape does exactly what you would expect and reshapes an array into a specified shape.

  • ndarray.ndim: number of dimensions or rank
  • ndarray.shape: tuple containing the dimensions of the array
  • ndarray.dtype: data type of elements
  • ndarray.size: total number of elements (product of dimensions)

Required Modules for the Jupyter Notebook¶

Module: numpy, add, matplotlib, collections, Counter

Before running the notebook, we need to load the following modules.

In [1]:
import numpy as np
from operator import add
import matplotlib.pyplot as plt
import collections
from collections import Counter
In [2]:
a = np.arange(20).reshape(5,4)
In [3]:
print(a)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
In [4]:
a.ndim
Out[4]:
2
In [5]:
a.shape
Out[5]:
(5, 4)
In [6]:
a.dtype
Out[6]:
dtype('int64')
In [7]:
a.size
Out[7]:
20
In [8]:
type(a)
Out[8]:
numpy.ndarray

Arrays creation¶

Arrays can be created from lists or tuples using the array method

In [9]:
np.array([1,2,3])
Out[9]:
array([1, 2, 3])
In [10]:
np.array([[1,2,3], [4,5,6], [7,8,9]])
Out[10]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

NumPy provides functions for creating arrays of arbitrary dimensions containing all zeros or ones. Note that the argument is a tuple of dimensions.

In [11]:
np.zeros( (3,4) )
Out[11]:
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
In [12]:
np.ones ( (5,3) )
Out[12]:
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

The zeros_like and ones_like functions create arrays filled with zeros or ones, respectively, of the same shape as the argument.

In [13]:
a = np.arange(20).reshape(5,4)
np.zeros_like(a)
Out[13]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
In [14]:
a = np.arange(12).reshape(6,2)
np.ones_like(a)
Out[14]:
array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]])

Printing arrays¶

As we've seen, arrays can be printed using the built-in print function. To avoid cluttering you screen, only the top and bottom of very large arrays are displayed.

In [15]:
a = np.arange(10000).reshape(2000,5)
print(a)
[[   0    1    2    3    4]
 [   5    6    7    8    9]
 [  10   11   12   13   14]
 ...
 [9985 9986 9987 9988 9989]
 [9990 9991 9992 9993 9994]
 [9995 9996 9997 9998 9999]]

Array math¶

Basic math operations are performed elementwise on conforming arrays (arrays of the same shape). Scalar arguments are applied to each element

Universal functions (ufunc) are applied to each element. Note in the examples below that we did not need to use the map function since the NumPy ufuncs are designed to operate on ndarray arguments

In [16]:
a = np.arange(9).reshape(3,3)
b = np.arange(9,18).reshape(3,3)
print(a)
print()
print(b)
[[0 1 2]
 [3 4 5]
 [6 7 8]]

[[ 9 10 11]
 [12 13 14]
 [15 16 17]]
In [17]:
a+b + 100
Out[17]:
array([[109, 111, 113],
       [115, 117, 119],
       [121, 123, 125]])

Note that the "*" operator does an elementwise multiplication and is not the linear algebra matrix multiply

In [18]:
a*b
Out[18]:
array([[  0,  10,  22],
       [ 36,  52,  70],
       [ 90, 112, 136]])
In [19]:
np.sin(a) + np.log(b)
Out[19]:
array([[2.19722458, 3.14405608, 3.3071927 ],
       [2.62602666, 1.80814686, 1.68013305],
       [2.4286347 , 3.42957532, 3.82257159]])

A brief digression - how would we have done this with lists?¶

Arithmetic on Python lists doesn't work the way you might expect. Let's define a couple of conformable lists and add them element by element. Sounds easy enough

In [20]:
a = [ 1,  2,  3]
b = [10, 20, 30]
In [21]:
a+b
Out[21]:
[1, 2, 3, 10, 20, 30]

That's not what we wanted. The '+' operator concatenates the elements of the two lists to create a new list with length equal to the sum of the lengths of the original lists. There are multiple ways we can do this, two based on map are shown below.

In [22]:
#from operator import add
list(map(add, a, b))
Out[22]:
[11, 22, 33]
In [23]:
list(map(lambda x,y: x+y, a, b))
Out[23]:
[11, 22, 33]

The second choice is more general since we're not limited to simply adding the elements. Of course, if you're going to be doing a lot of this sort of thing, just use numpy arrays.

Sum, min, max, argmin, argmax, cumsum¶

The ndarray class has methods for finding the min and max values in an array, their locations (argmin and argmax), the sum and the cumulative sum. When applied to multidimensional arrays, the array is first flattened into a 1d array.

In [24]:
np.random.seed(123)
a = np.random.rand(6)
print(a)
[0.69646919 0.28613933 0.22685145 0.55131477 0.71946897 0.42310646]
In [25]:
print("Min :", a.min())
print("Max :", a.max())
print("Sum :", a.sum())
print("Argmin :", a.argmin())
print("Argmax :", a.argmax())

print("\nCumlative sum :\n", a.cumsum())
Min : 0.2268514535642031
Max : 0.7194689697855631
Sum : 2.9033501731053595
Argmin : 2
Argmax : 4

Cumlative sum :
 [0.69646919 0.98260852 1.20945997 1.76077474 2.48024371 2.90335017]

For multidimensional arrays, many of these functions can take an argument specifying the axis along which the operations should be applied. The result is an array of dimension one lower than the array.

In [26]:
np.random.seed(123)
a = np.random.rand(16).reshape(4,4)
print(a)
[[0.69646919 0.28613933 0.22685145 0.55131477]
 [0.71946897 0.42310646 0.9807642  0.68482974]
 [0.4809319  0.39211752 0.34317802 0.72904971]
 [0.43857224 0.0596779  0.39804426 0.73799541]]
In [27]:
# By column (axis = 0)
print("Min :", a.min(axis=0))
print("Max :", a.max(axis=0))
print("Sum :", a.sum(axis=0))
print("Argmin :", a.argmin(axis=0))
print("Argmax :", a.argmax(axis=0))
Min : [0.43857224 0.0596779  0.22685145 0.55131477]
Max : [0.71946897 0.42310646 0.9807642  0.73799541]
Sum : [2.3354423  1.16104121 1.94883792 2.70318962]
Argmin : [3 3 0 0]
Argmax : [1 1 1 3]
In [28]:
# By row (axis = 1)
print("Min :", a.min(axis=1))
print("Max :", a.max(axis=1))
print("Sum :", a.sum(axis=1))
print("Argmin :", a.argmin(axis=1))
print("Argmax :", a.argmax(axis=1))
Min : [0.22685145 0.42310646 0.34317802 0.0596779 ]
Max : [0.69646919 0.9807642  0.72904971 0.73799541]
Sum : [1.76077474 2.80816937 1.94527714 1.6342898 ]
Argmin : [2 1 2 1]
Argmax : [0 2 3 3]
In [29]:
# By entire array 
# Note that argmin/max returns index for flattened array (row major order)
print("Min :", a.min())
print("Max :", a.max())
print("Sum :", a.sum())
print("Argmin :", a.argmin())
print("Argmax :", a.argmax())
Min : 0.05967789660956835
Max : 0.9807641983846155
Sum : 8.148511055639919
Argmin : 13
Argmax : 6

Stacking and splitting arrays¶

NumPy provides methods for stacking arrays (hstack, vstack) and splitting arrays (hsplit, vsplit). The stacking methods take as input a tuple containing an arbitrary number of arrays.

In [30]:
a = np.arange(16).reshape(4,4)
b = np.arange(16,32).reshape(4,4)
print(a)
print()
print(b)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

[[16 17 18 19]
 [20 21 22 23]
 [24 25 26 27]
 [28 29 30 31]]
In [31]:
np.vstack( (a,b) )
Out[31]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
In [32]:
c = np.hstack( (a,b,a,b) )
print(c)
[[ 0  1  2  3 16 17 18 19  0  1  2  3 16 17 18 19]
 [ 4  5  6  7 20 21 22 23  4  5  6  7 20 21 22 23]
 [ 8  9 10 11 24 25 26 27  8  9 10 11 24 25 26 27]
 [12 13 14 15 28 29 30 31 12 13 14 15 28 29 30 31]]
In [33]:
(d, e) = np.hsplit(c,(3,))
print(d)
print()
print(e)
[[ 0  1  2]
 [ 4  5  6]
 [ 8  9 10]
 [12 13 14]]

[[ 3 16 17 18 19  0  1  2  3 16 17 18 19]
 [ 7 20 21 22 23  4  5  6  7 20 21 22 23]
 [11 24 25 26 27  8  9 10 11 24 25 26 27]
 [15 28 29 30 31 12 13 14 15 28 29 30 31]]

Copies¶

Be aware that ndarrays behave the same way as other Python objects when doing a simple assignment

In [34]:
a = np.arange(4)
b = a
In [35]:
print(a)
b[0] = 7
print(a)
[0 1 2 3]
[7 1 2 3]

If you really want to copy an array, use the copy method

In [36]:
a = np.arange(4)
b = a.copy()
In [37]:
print(a)
b[0] = 7
print(a)
[0 1 2 3]
[0 1 2 3]

Histograms¶

Although most plotting packages have the capabilities to directly generate a histogram from a data set, sometimes you just want the result and not the plot. This can be done using the NumPy histogram method. In its simplest form, the method just takes an array and the number of bins

In [38]:
mu, sigma = 2, 0.5
v = np.random.normal(mu,sigma,10000)
(hist, bin_edges) = np.histogram(v, bins=50)

#import matplotlib.pyplot as plt
plt.plot(.5*(bin_edges[1:]+bin_edges[:-1]), hist)
plt.show()
No description has been provided for this image

Linear algebra¶

NumPy provides a comprehensive set of linear algebra capabilities. We demonstrate a few of these below. A convenient summary of of these capabilities can be found here https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.linalg.html

In this first example, we calculate the inverse of a matrix and then confirm that the product of the matrix and its inverse is equal to the identity matrix

In [39]:
A = np.random.rand(4,4)      # Populate matrix with random values
Ainv = np.linalg.inv(A)      # Calculate the inverse
I = np.eye(4)                # Identity matrix used for testing 
AxAinv = np.matmul(A, Ainv)  # Matrix multiplication
np.all(AxAinv == I)          # Test if all elements of A x Ainv are true
Out[39]:
False

What happened? Wouldn't we expect the matrix multipied by its inverse to be equal to the identity matrix? Let's take a look at the content of the arrays.

In [40]:
print("Matrix A")
print(A, "\n")
print("Inverse of matrix A")
print(Ainv, "\n")
print("Product A x Ainv")
print(AxAinv, "\n")
print("Identity matrix")
print(I)
Matrix A
[[0.90553761 0.3491035  0.01878307 0.21474258]
 [0.42990531 0.43566402 0.54816385 0.8233592 ]
 [0.33177785 0.78546986 0.46148833 0.20164153]
 [0.36477242 0.63998744 0.69290713 0.47342497]] 

Inverse of matrix A
[[ 1.42991476 -1.02893323 -1.5228803   1.78949998]
 [-0.52207461  1.11761668  3.48566067 -3.19151373]
 [ 0.09321778 -2.35441611 -3.12018731  5.38136317]
 [-0.53242551  2.72790331  1.02809474 -2.82836043]] 

Product A x Ainv
[[ 1.00000000e+00  7.96662729e-17  5.71395677e-17 -6.97831754e-17]
 [-9.01683109e-17  1.00000000e+00 -4.17752095e-16 -5.53943842e-17]
 [-1.41696994e-16  1.22311500e-16  1.00000000e+00 -4.76805001e-17]
 [-9.21524294e-17 -1.64944443e-17 -1.66299248e-16  1.00000000e+00]] 

Identity matrix
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

We forgot to take into account the fact that floating point math involves rounding errors. Fortunately, these floating point comparisons are so common that numpy provides a function for testing whether results agree to within a tolerance.

In [41]:
np.allclose(AxAinv, I)
Out[41]:
True

As a second example, we'll solve the eigenvalue problem Hx = λx and confirm that the sum over the eigenvalues is equal to the trace of the matrix. Don't worry if you're unfamiliar with eigenvalue problems. These come up in many scientific and engineering problems and are used in the machine learning technique principal component analysis (PCA).

In [42]:
eigvals = np.linalg.eigvals(A)
eigvals_sum = eigvals.sum()
trace = np.trace(A)
np.allclose(eigvals_sum, trace)
Out[42]:
True

Random sampling¶

Numpy contains an extensive set of functions for generating samplings from random distributions. For the full set of functions, see https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html

The random.rand() function returns an array of floating point values between 0 ≤ x < 1, with the argument specifying the dimensions of the array. We saw this earlier, in the discussion of linear algeba funcions, but without explanation.

In [43]:
np.random.rand(10)
Out[43]:
array([0.10907356, 0.28531668, 0.93306458, 0.66983283, 0.53323553,
       0.61277927, 0.86410959, 0.98047986, 0.26844615, 0.8989367 ])
In [44]:
np.random.rand(4,4)
Out[44]:
array([[0.2697794 , 0.07437439, 0.64219329, 0.7948267 ],
       [0.7027967 , 0.34430167, 0.04163668, 0.22708323],
       [0.36225732, 0.99812157, 0.56178248, 0.58425461],
       [0.21611783, 0.81738305, 0.005409  , 0.65411136]])
In [45]:
np.random.rand(3,3,3)
Out[45]:
array([[[0.38373161, 0.64473773, 0.26193407],
        [0.87365637, 0.53666328, 0.29251293],
        [0.51022266, 0.79682803, 0.88425962]],

       [[0.84423737, 0.85855229, 0.85831302],
        [0.18222367, 0.64214835, 0.25397529],
        [0.46093213, 0.93434437, 0.7036276 ]],

       [[0.40167112, 0.15803781, 0.69315088],
        [0.61848046, 0.64817606, 0.3902677 ],
        [0.0442045 , 0.81781409, 0.77493736]]])

random.randimt(low, [,high ,size]) generates random integers between low (inclusive) and high (exclusive), with size setting the number of random integers returned

In [46]:
np.random.randint(1,7,100)
Out[46]:
array([2, 2, 3, 6, 2, 5, 3, 3, 4, 6, 1, 6, 5, 6, 4, 2, 6, 2, 6, 3, 6, 1,
       3, 5, 6, 5, 6, 5, 5, 2, 4, 2, 1, 5, 1, 1, 6, 5, 3, 6, 4, 2, 3, 1,
       5, 5, 6, 4, 3, 6, 6, 4, 6, 1, 6, 4, 6, 2, 6, 6, 4, 4, 5, 5, 6, 4,
       6, 2, 1, 4, 6, 5, 6, 1, 4, 4, 1, 3, 6, 3, 5, 1, 5, 4, 4, 1, 6, 3,
       3, 3, 4, 6, 5, 2, 4, 2, 1, 5, 3, 6])

random.randn() samples a normal distribution with mean 0 and standard deviation 1

In [47]:
np.random.randn(5,5)
Out[47]:
array([[ 0.63116972, -1.73836663, -0.44176693,  0.59618155, -1.70313142],
       [ 0.33847276,  1.05857895,  0.68398476, -0.91119956, -0.69131798],
       [ 0.73665637,  0.33307578,  1.24909313, -1.37381538, -1.3639383 ],
       [-0.65155781, -0.99280722, -0.37658655, -0.00511713, -1.49188858],
       [-0.35612259,  0.19870999, -0.72351776, -0.21422377,  0.40551945]])

To convert to a normal distribution with a different mean ($\mu$) and standard deviation ($\sigma$), multiply results by sigma and add mu

In [48]:
sigma, mu = 0.5, 5
np.random.randn(25)*sigma + mu
Out[48]:
array([5.14938549, 5.38897374, 5.51315471, 5.16802521, 4.81803226,
       5.07241934, 5.2719953 , 5.53741929, 6.02608204, 5.13784721,
       5.01754471, 4.48931958, 4.02599793, 5.96125335, 5.91920031,
       5.74117287, 5.19562785, 5.54516599, 5.40298004, 4.94522871,
       4.61516916, 4.88477814, 5.0694626 , 4.10695006, 5.38603891])

random.binomial(n,p [,size]) samples a binomial distribution with n trials, probability p. A third optional argument (size) specifies the number of samples to be generated

In [49]:
# This will return the number of heads from six (n) flips of a fair coin (p)
n, p = 6, 0.5
np.random.binomial(n, p, 100)
Out[49]:
array([6, 3, 2, 3, 3, 5, 1, 1, 0, 2, 3, 4, 3, 3, 3, 5, 4, 2, 1, 1, 2, 3,
       4, 1, 4, 3, 2, 5, 3, 1, 4, 4, 4, 3, 2, 2, 2, 3, 3, 2, 5, 2, 1, 4,
       2, 2, 2, 3, 4, 3, 1, 3, 4, 1, 2, 5, 5, 4, 4, 2, 1, 2, 3, 3, 4, 5,
       1, 2, 5, 1, 2, 4, 4, 3, 2, 2, 5, 1, 6, 4, 1, 4, 2, 2, 4, 3, 4, 2,
       3, 5, 4, 4, 3, 4, 3, 3, 1, 3, 5, 2])

Many machine learning applications require that we either permute the data or draw a subsample of the data. This can be done with random.permutation() and random.choice() functions.

In [50]:
# Generate permutations of a list
x = np.array(['a', 'b', 'c', 'd', 'e'])
for i in range(10):
    print(np.random.permutation(x))
['c' 'a' 'b' 'd' 'e']
['a' 'd' 'e' 'b' 'c']
['b' 'e' 'd' 'a' 'c']
['a' 'c' 'e' 'd' 'b']
['a' 'c' 'e' 'd' 'b']
['a' 'd' 'e' 'b' 'c']
['e' 'd' 'b' 'c' 'a']
['d' 'e' 'c' 'b' 'a']
['d' 'a' 'e' 'b' 'c']
['a' 'b' 'e' 'c' 'd']
In [51]:
# Random select 4 elemnts from a list with replacement allowed
y = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
for i in range(10):
    print(np.random.choice(y, 4))
['b' 'e' 'd' 'c']
['b' 'h' 'g' 'g']
['h' 'g' 'a' 'h']
['a' 'd' 'd' 'd']
['b' 'f' 'h' 'c']
['e' 'g' 'd' 'h']
['b' 'c' 'f' 'e']
['d' 'e' 'h' 'h']
['e' 'f' 'b' 'b']
['c' 'g' 'c' 'h']
In [52]:
# Random select 4 elemnts from a list without replacement
y = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
for i in range(10):
    print(np.random.choice(y, 4, replace=False))
['f' 'a' 'c' 'g']
['a' 'h' 'b' 'f']
['c' 'h' 'd' 'a']
['d' 'h' 'c' 'b']
['g' 'b' 'd' 'e']
['a' 'e' 'd' 'c']
['e' 'g' 'h' 'f']
['f' 'e' 'c' 'b']
['f' 'c' 'h' 'b']
['h' 'g' 'd' 'b']

A brief digression: Python collections - High-performance container datatypes¶

We've already explored Python's major built in data types: lists, dictionaries, sets and tuples. The collections package adds some additional containers such as Counter, OrderedDict, and dequeue. We'll take a look at a few of these here.

The Counter container is a dict subclass for counting hashable objects

In [53]:
mylist = ['A', 'B', 'C', 'D', 'A', 'A', 'A', 'B', 'C']
collections.Counter(mylist)
Out[53]:
Counter({'A': 4, 'B': 2, 'C': 2, 'D': 1})

The deque (pronounced "deck") is a double-ended queue that allows pushing and popping data onto either end.

In [54]:
mydeq = collections.deque(['C','D','E'])
mydeq.append('F')
mydeq.appendleft('B')
mydeq
Out[54]:
deque(['B', 'C', 'D', 'E', 'F'])

The OrderedDict container has all the properties of a Python dict except that the order in which items (key-value pairs) are added is preserved. Recall that order is not guaranteed for the standard built-in dictionary type.

In [55]:
myord = collections.OrderedDict({'A':1, 'B':2, 'C':3})
myord
Out[55]:
OrderedDict([('A', 1), ('B', 2), ('C', 3)])

Back to sampling - choosing elements from an uneven probability distribution¶

An especially useful feature of np.random.choice is the ability to choose elements from an uneven probability distribution. This might arise, for example, in bioinformatics applications where we want to select amino acids with the probabilities that they appear in an organism.

In [56]:
# from collections import Counter

x = ['A', 'B', 'C']
probs = [0.1, 0.2, 0.7]
sampling = np.random.choice(x, size=10000, p=probs)
Counter(sampling)
Out[56]:
Counter({'C': 7082, 'A': 938, 'B': 1980})

If you really need your distribution of elements to exactly match the probability distribution and have a random ordering, just create a list and then apply the permutation operation. Note that we took advantage of the way that list addition really works to create a list containing the correct numbers of A, B and C.

In [57]:
x = ['A']*10 + ['B']*5 + ['C']*10
np.random.permutation(x)
Out[57]:
array(['B', 'A', 'A', 'B', 'A', 'C', 'C', 'A', 'B', 'A', 'A', 'C', 'C',
       'A', 'C', 'C', 'C', 'A', 'B', 'A', 'C', 'C', 'A', 'C', 'B'],
      dtype='<U1')
In [58]:
Counter(np.random.permutation(x))
Out[58]:
Counter({'A': 10, 'B': 5, 'C': 10})

Submit Ticket¶

If you find anything that needs to be changed, edited, or if you would like to provide feedback or contribute to the notebook, please submit a ticket by contacting us at:

Email: consult@sdsc.edu

We appreciate your input and will review your suggestions promptly!

In [ ]: