pm21-dragon/lectures/lecture-07/1 - numpy.ipynb
2024-12-02 10:47:50 +01:00

<html> <head> </head>

"All participants in biological profile modules will be registered for the coursework assessment (Studienleistung)."

Numpy introduction

numpy basics and creating arrays

Numpy is a widely used library for handling arrays of data, especially numerical data. It would not be an exaggeration to say it is fundamental to the Python data science ecosystem.

The most important part of numpy is the numpy array type. A numpy array is conceptually similar to a Python list or tuple but each element has the same data type and the array has a fixed size.

Typically numpy is imported as np, a conventional shorthand that saves a bit of typing.

In [4]:
import numpy as np

We can create an array from any sequence type, such as lists and tuples:

In [5]:
x = np.array([1,2,3,4])
x
Out[5]:
array([1, 2, 3, 4])
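
As a small illustration of the "same data type" rule described above, the sketch below builds one array from integers only and one from a list that mixes integers and a float. The variable names `ints` and `mixed` are chosen here for illustration and are not part of the lecture.

```python
import numpy as np

# A list of integers gives an integer array.
ints = np.array([1, 2, 3])

# A single float in the input makes numpy store every element as a float.
mixed = np.array([1, 2.5, 3])

print(ints.dtype)   # an integer dtype (int64 on most platforms)
print(mixed.dtype)  # float64
```

The dtype attribute used here is covered in more detail in the "array dtype" section.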

We can create an array of n integers (from 0 to n-1) with the arange function. arange also accepts optional start and step arguments; the stop value itself is never included.

In [6]:
x = np.arange(10)
x
Out[6]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [7]:
x = np.arange(4,10)
x
Out[7]:
array([4, 5, 6, 7, 8, 9])
In [8]:
x = np.arange(0,10,1)
x
Out[8]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [9]:
x = np.arange(4,10,2)
x
Out[9]:
array([4, 6, 8])

We can create an array of n equally spaced elements from start to stop (both included) with np.linspace. For example, here start is 100, stop is 120 and n is 11.

In [12]:
np.linspace(100, 120, 11)
Out[12]:
array([100., 102., 104., 106., 108., 110., 112., 114., 116., 118., 120.])
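
The difference between arange and linspace is easy to mix up, so here is a minimal side-by-side sketch. The names `by_step` and `by_count` are illustrative only.

```python
import numpy as np

# arange takes a step size and excludes the stop value.
by_step = np.arange(100, 120, 2)

# linspace takes a number of points and includes the stop value by default.
by_count = np.linspace(100, 120, 11)

print(len(by_step), by_step[-1])    # 10 118
print(len(by_count), by_count[-1])  # 11 120.0
```

As a rule of thumb, arange fits when the step matters and linspace fits when the number of points matters.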

We can also create arrays filled with zeros or ones with a given shape:

In [13]:
np.zeros((3,))
Out[13]:
array([0., 0., 0.])
In [14]:
np.zeros((3,5)) # shape parameter - (number of rows, number of columns) in this case
Out[14]:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])
In [15]:
np.ones((5,3))
Out[15]:
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

array shape

In addition to one-dimensional numpy arrays, which are very similar to lists or tuples, numpy arrays may also have two or more dimensions. The shape attribute of a numpy array may be used to get or set the number of dimensions and the size along each dimension. The total number of elements must stay the same when the shape is changed.

In [16]:
x = np.arange(12)
x
Out[16]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
In [17]:
x.shape
Out[17]:
(12,)
In [18]:
x = np.arange(12)
x.shape = (3,4)
x
Out[18]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [19]:
x.shape
Out[19]:
(3, 4)
In [20]:
x.shape = (3,2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 1
----> 1 x.shape = (3,2)

ValueError: cannot reshape array of size 12 into shape (3,2)
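
Besides assigning to the shape attribute, numpy also offers the reshape method, which is not used in this lecture but follows the same size rule. The sketch below assumes only standard numpy behaviour; the names `a` and `b` are illustrative.

```python
import numpy as np

x = np.arange(12)

# reshape returns an array with the requested shape; x itself keeps its shape.
a = x.reshape(3, 4)

# -1 asks numpy to infer that dimension from the total size (12 / 2 = 6).
b = x.reshape(-1, 2)

print(x.shape, a.shape, b.shape)  # (12,) (3, 4) (6, 2)

# A shape whose product is not 12 raises ValueError, as with shape assignment.
try:
    x.reshape(3, 2)
except ValueError as err:
    print("ValueError:", err)
```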

The ndim attribute is the dimensionality of the array (and thus equal to the length of the array's shape attribute):

In [21]:
x.ndim
Out[21]:
2
In [23]:
x = np.arange(12)
x.shape = (3,2,2)
x
Out[23]:
array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])
In [24]:
x.ndim
Out[24]:
3

array operations

Numpy arrays support mathematical operations with other numpy arrays and with single numbers ("scalars").

With scalars, the operation behaves as if the scalar were first converted to an array with the same shape as the numpy array, after which an element-wise operation is performed.

With other arrays of the same shape, an element-wise operation is performed.

In [25]:
x = np.arange(10)
x
Out[25]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [26]:
y = 2 * np.arange(10)
y
Out[26]:
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
In [27]:
z = x + y
z
Out[27]:
array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])
In [28]:
x + 3
Out[28]:
array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
In [29]:
x + 3.0
Out[29]:
array([ 3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.])
In [30]:
x
Out[30]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [31]:
x/5
Out[31]:
array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8])
In [32]:
1/5
Out[32]:
0.2
In [33]:
x
Out[33]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [34]:
4*x
Out[34]:
array([ 0,  4,  8, 12, 16, 20, 24, 28, 32, 36])
In [35]:
np.array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])/3
Out[35]:
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
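
To make the scalar rule concrete, the following sketch builds the array that the scalar 3 conceptually stands for and checks that both forms give the same result. np.full is used here only for illustration; numpy does not need to allocate such an array when it handles a scalar.

```python
import numpy as np

x = np.arange(10)

# Explicitly build the array that the scalar 3 conceptually stands for.
threes = np.full(x.shape, 3)

via_scalar = x + 3
via_array = x + threes
print(np.array_equal(via_scalar, via_array))  # True

# True division of an integer array produces floats, as 1/5 does in plain Python.
ratio = x / 5
print(ratio.dtype)  # float64
```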

array dtype

Just like lists or tuples, every element in a numpy array has a data type. As mentioned above, however, every element in a numpy array has the same data type, and thus we can refer to the "data type of the array". This can be set when the array is created with the dtype keyword argument and read from the dtype attribute:

In [36]:
x = np.arange(10)
x
Out[36]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [37]:
x.dtype
Out[37]:
dtype('int64')
In [38]:
x = np.arange(10, dtype=np.float64)
x
Out[38]:
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
In [39]:
x.dtype
Out[39]:
dtype('float64')
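
The dtype of an existing array cannot be changed by the dtype keyword, but a converted copy can be made. The sketch below uses the astype method, which is not shown in the lecture, and also shows how dtypes combine in arithmetic.

```python
import numpy as np

ints = np.arange(5)                # integer dtype (int64 on most platforms)
floats = ints.astype(np.float64)   # astype returns a converted copy

# Mixing an integer array with a float64 array gives a float64 result.
mixed = ints + floats

print(ints.dtype, floats.dtype, mixed.dtype)
```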

array indexing and slicing

Numpy arrays can be indexed and sliced just like other Python sequence types such as lists, tuples, and strings.

Just like with Python lists, the elements selected by an index or slice can be both read and written. In other words, numpy arrays are mutable.

In [40]:
x = np.arange(10)
x
Out[40]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [41]:
x[4]
Out[41]:
4
In [42]:
x[:4]
Out[42]:
array([0, 1, 2, 3])
In [56]:
x[4:]
Out[56]:
array([4, 5, 6, 7, 8, 9])
In [54]:
tmp = slice(4)
x[tmp]
Out[54]:
array([0, 1, 2, 3])
In [57]:
tmp = slice(4, 7)
x[tmp]
Out[57]:
array([4, 5, 6])
In [58]:
x[4:7]
Out[58]:
array([4, 5, 6])
In [59]:
tmp = slice(2, 7, 2)
x[tmp]
Out[59]:
array([2, 4, 6])
In [60]:
tmp = slice(2, None, 2)
x[tmp]
Out[60]:
array([2, 4, 6, 8])
In [61]:
x[2::2]
Out[61]:
array([2, 4, 6, 8])
In [62]:
x[4::3]
Out[62]:
array([4, 7])
In [63]:
x[4:7]
Out[63]:
array([4, 5, 6])
In [64]:
x[2::2.2]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[64], line 1
----> 1 x[2::2.2]

TypeError: slice indices must be integers or None or have an __index__ method
In [65]:
np.arange(0, 10, 2.2)
Out[65]:
array([0. , 2.2, 4.4, 6.6, 8.8])
In [66]:
x[::2]
Out[66]:
array([0, 2, 4, 6, 8])

Because numpy arrays can have two or more dimensions, we can also index and slice them in higher dimensions. For two-dimensional arrays, the first index is always the row index and the second index is always the column index.

In [67]:
x = np.arange(12)
x.shape = (3,4)
x
Out[67]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [68]:
x[1:, :]
Out[68]:
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [69]:
x[:, 1:]
Out[69]:
array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11]])
In [70]:
x[1:, 2:]
Out[70]:
array([[ 6,  7],
       [10, 11]])
In [71]:
x[1:2:]
Out[71]:
array([[4, 5, 6, 7]])
In [72]:
x
Out[72]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [73]:
x[1:, 2:] = 99
x
Out[73]:
array([[ 0,  1,  2,  3],
       [ 4,  5, 99, 99],
       [ 8,  9, 99, 99]])
In [74]:
x.dtype
Out[74]:
dtype('int64')
In [75]:
x[1:, 2:] = 99.5
x
Out[75]:
array([[ 0,  1,  2,  3],
       [ 4,  5, 99, 99],
       [ 8,  9, 99, 99]])
In [76]:
x[1:, 2:] = 99.9
x
Out[76]:
array([[ 0,  1,  2,  3],
       [ 4,  5, 99, 99],
       [ 8,  9, 99, 99]])
In [77]:
x[:, :] = 99.9
x
Out[77]:
array([[99, 99, 99, 99],
       [99, 99, 99, 99],
       [99, 99, 99, 99]])
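
The cells above assign 99.5 and 99.9 into an int64 array and the stored value is 99 each time: the array keeps its dtype, so the fractional part is dropped rather than rounded. A minimal sketch of this behaviour, and of one way around it (converting to a float dtype first with astype, which is an addition not shown in the lecture):

```python
import numpy as np

x = np.zeros((2, 2), dtype=np.int64)
x[:, :] = 99.9    # the array stays int64, so the fractional part is dropped
print(x)

# If fractional values matter, work with a float array instead.
xf = x.astype(np.float64)
xf[:, :] = 99.9
print(xf)
```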

References to arrays

Remember that variable assignment in Python does not create a new object but only creates a variable which points to an existing object. This is very important with numpy, because slicing an array likewise gives a view onto the existing data rather than a copy.

In [78]:
x = np.arange(20)
x
Out[78]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
In [79]:
# Here we create a variable which references the first 10 elements of `x`.
y = x[:10]
y
Out[79]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [80]:
x[4] = 9999
In [81]:
x
Out[81]:
array([   0,    1,    2,    3, 9999,    5,    6,    7,    8,    9,   10,
         11,   12,   13,   14,   15,   16,   17,   18,   19])
In [82]:
y
Out[82]:
array([   0,    1,    2,    3, 9999,    5,    6,    7,    8,    9])
In [83]:
# Now we assign all the elements of `y` to have the value of 123.
# We do this by creating a slice into the array `y` and assigning to it.
y[:] = 123
y
Out[83]:
array([123, 123, 123, 123, 123, 123, 123, 123, 123, 123])
In [84]:
# How does this affect the original array `x`?
x
Out[84]:
array([123, 123, 123, 123, 123, 123, 123, 123, 123, 123,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19])
In [85]:
y[-1] = 999
In [86]:
x
Out[86]:
array([123, 123, 123, 123, 123, 123, 123, 123, 123, 999,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19])
In [87]:
y[::2] = -1
In [88]:
x
Out[88]:
array([ -1, 123,  -1, 123,  -1, 123,  -1, 123,  -1, 999,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19])
In [89]:
x[:3] = -100
In [90]:
x
Out[90]:
array([-100, -100, -100,  123,   -1,  123,   -1,  123,   -1,  999,   10,
         11,   12,   13,   14,   15,   16,   17,   18,   19])
In [91]:
y
Out[91]:
array([-100, -100, -100,  123,   -1,  123,   -1,  123,   -1,  999])
In [92]:
z = y.copy()
In [93]:
y
Out[93]:
array([-100, -100, -100,  123,   -1,  123,   -1,  123,   -1,  999])
In [94]:
z
Out[94]:
array([-100, -100, -100,  123,   -1,  123,   -1,  123,   -1,  999])
In [95]:
y[:2] = -9999
In [96]:
y
Out[96]:
array([-9999, -9999,  -100,   123,    -1,   123,    -1,   123,    -1,
         999])
In [97]:
z
Out[97]:
array([-100, -100, -100,  123,   -1,  123,   -1,  123,   -1,  999])
In [98]:
z = y[:]
In [99]:
y
Out[99]:
array([-9999, -9999,  -100,   123,    -1,   123,    -1,   123,    -1,
         999])
In [100]:
z
Out[100]:
array([-9999, -9999,  -100,   123,    -1,   123,    -1,   123,    -1,
         999])
In [101]:
y[:2] = 444444
In [102]:
y
Out[102]:
array([444444, 444444,   -100,    123,     -1,    123,     -1,    123,
           -1,    999])
In [103]:
z
Out[103]:
array([444444, 444444,   -100,    123,     -1,    123,     -1,    123,
           -1,    999])
In [104]:
z = np.arange(12)
z
Out[104]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
In [105]:
zref = z[::3]
zref[:] = 999
In [106]:
z
Out[106]:
array([999,   1,   2, 999,   4,   5, 999,   7,   8, 999,  10,  11])
In [107]:
zrefref = zref[::2]
zrefref[:] = -1000
In [108]:
z
Out[108]:
array([-1000,     1,     2,   999,     4,     5, -1000,     7,     8,
         999,    10,    11])

Array slices - a key difference between a numpy array and a Python list

With a numpy array, slicing with [:] (e.g. my_array[:]) returns a view onto the same data, so changes made through one name are visible through the other. With a plain Python list, [:] returns a copy of the list.

For both numpy arrays and Python lists, the .copy() method will make a copy, so this is preferred if you want to be sure you are making a copy.

In [109]:
# First with a list
a = [1,2,3]
In [110]:
b = a[:]
In [111]:
a[0] = 100
In [112]:
b
Out[112]:
[1, 2, 3]
In [113]:
a
Out[113]:
[100, 2, 3]
In [114]:
# Now with a numpy array
a = np.array([1,2,3])
In [115]:
b = a[:]
In [116]:
a[0] = 100
In [117]:
b
Out[117]:
array([100,   2,   3])
In [118]:
a
Out[118]:
array([100,   2,   3])
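
When it is unclear whether two arrays refer to the same data, numpy can be asked directly. The sketch below uses np.shares_memory, which is not part of the lecture; the names `view` and `copy` are illustrative.

```python
import numpy as np

a = np.arange(5)
view = a[:]       # basic slicing gives a view onto the same data
copy = a.copy()   # copy() allocates independent data

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False

# A write through `a` is visible in the view but not in the copy.
a[0] = 100
print(view[0], copy[0])  # 100 0
```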

Efficient data processing with numpy

Because operations on numpy arrays happen for all elements with a single Python expression, these operations can be performed very fast and efficiently by the computer. For example, if x is a numpy array with 10,000 elements, we can avoid a Python for loop with 10,000 iterations by performing our work with numpy.

Below we use the Jupyter "magic command" %timeit to measure how long a single expression takes, in this case performing an element-wise multiplication.

In [119]:
x = np.arange(10000, dtype=np.float64)
In [120]:
x*x
Out[120]:
array([0.0000000e+00, 1.0000000e+00, 4.0000000e+00, ..., 9.9940009e+07,
       9.9960004e+07, 9.9980001e+07])
In [121]:
%timeit x*x
1.6 μs ± 15.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [122]:
y = x*x
len(y)
Out[122]:
10000
In [124]:
assert y[2] == 4
assert y[3] == 9

Now let's do the same as above with a Python list. We need to create a list_mul function.

In [125]:
def list_mul(a,b):
    """element-wise product of `a` and `b`"""
    n = len(a)
    assert n==len(b) 
    c = []
    for i in range(n):
        c.append(a[i] * b[i])
    return c

Now convert x from a numpy array to a list.

In [126]:
x = list(x)
In [127]:
type(x)
Out[127]:
list
In [128]:
x[:10]
Out[128]:
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
In [129]:
%timeit list_mul(x,x)
483 μs ± 24.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [130]:
y = list_mul(x,x)
In [131]:
assert y[3] == 9
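
The %timeit magic only works inside Jupyter or IPython. Outside a notebook, a similar comparison can be sketched with the standard library timeit module. The helper `list_square` and the repetition count of 100 are assumptions made for this example, and the measured times depend on the machine.

```python
import timeit
import numpy as np

n = 10_000
arr = np.arange(n, dtype=np.float64)
lst = arr.tolist()  # plain Python floats

def list_square(values):
    """Element-wise square of a list using a comprehension."""
    return [v * v for v in values]

# Time 100 repetitions of each approach.
t_numpy = timeit.timeit(lambda: arr * arr, number=100)
t_list = timeit.timeit(lambda: list_square(lst), number=100)

print(f"numpy: {t_numpy:.4f} s, list: {t_list:.4f} s")
```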

Elementwise numpy operations

Above you have already seen element-wise multiplication, which multiplies the corresponding elements of two inputs. Similarly, other operations operate element-wise on a single input array.

In [132]:
np.sqrt( np.array([1, 4, 9]))
Out[132]:
array([1., 2., 3.])
In [133]:
np.linspace( 0, 2*np.pi, 30) 
Out[133]:
array([0.        , 0.21666156, 0.43332312, 0.64998469, 0.86664625,
       1.08330781, 1.29996937, 1.51663094, 1.7332925 , 1.94995406,
       2.16661562, 2.38327719, 2.59993875, 2.81660031, 3.03326187,
       3.24992343, 3.466585  , 3.68324656, 3.89990812, 4.11656968,
       4.33323125, 4.54989281, 4.76655437, 4.98321593, 5.1998775 ,
       5.41653906, 5.63320062, 5.84986218, 6.06652374, 6.28318531])
In [134]:
np.cos( np.linspace( 0, 2*np.pi, 30) )
Out[134]:
array([ 1.        ,  0.97662056,  0.90757542,  0.79609307,  0.64738628,
        0.46840844,  0.26752834,  0.05413891, -0.161782  , -0.37013816,
       -0.56118707, -0.72599549, -0.85685718, -0.94765317, -0.99413796,
       -0.99413796, -0.94765317, -0.85685718, -0.72599549, -0.56118707,
       -0.37013816, -0.161782  ,  0.05413891,  0.26752834,  0.46840844,
        0.64738628,  0.79609307,  0.90757542,  0.97662056,  1.        ])

More numpy operations

In addition to elementwise operations such as np.cos(x) or x * y where x and y are same-shaped arrays, numpy can also perform operations on entire arrays.

Take for example the mean() function.

In [135]:
x = np.arange(10)
x
Out[135]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [136]:
np.mean(x)
Out[136]:
4.5

We can also take the mean of a 2D array, either over the entire array or separately for each column or each row:

In [137]:
x = np.arange(30)
x.shape = (5,6)
x
Out[137]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])
In [138]:
np.mean(x)
Out[138]:
14.5
In [139]:
# take the mean across the rows, (i.e. mean of each column), which is axis 0.
np.mean(x,axis=0)
Out[139]:
array([12., 13., 14., 15., 16., 17.])
In [140]:
# take the mean across the columns (i.e. mean of each row), which is axis 1.
np.mean(x,axis=1)
Out[140]:
array([ 2.5,  8.5, 14.5, 20.5, 26.5])
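
One way to remember the axis argument is that the named axis is the one that disappears from the result. The sketch below checks this on the same 5 by 6 array; the names `col_means` and `row_means` are illustrative.

```python
import numpy as np

x = np.arange(30)
x.shape = (5, 6)   # 5 rows, 6 columns

col_means = x.mean(axis=0)  # axis 0 (rows) is collapsed: one value per column
row_means = x.mean(axis=1)  # axis 1 (columns) is collapsed: one value per row

print(col_means.shape)  # (6,)
print(row_means.shape)  # (5,)
```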

In addition to mean(), numpy provides std(), sum(), and more.

In [141]:
np.std(x)
Out[141]:
8.65544144839919
In [142]:
np.sum(x)
Out[142]:
435
In [143]:
np.max(x)
Out[143]:
29
In [144]:
x.mean()
Out[144]:
14.5
In [145]:
np.mean(x)
Out[145]:
14.5

argmin and argmax

The argmin and argmax functions are important in many scientific computing applications. They return the index of the smallest or largest value, respectively. If that value occurs more than once, the index of its first occurrence is returned.

In [146]:
x = np.array([0, 10, -10, 4, 3, 2, 100, 2, 2, -1])
x
Out[146]:
array([  0,  10, -10,   4,   3,   2, 100,   2,   2,  -1])
In [147]:
min_idx = np.argmin(x)
min_idx
Out[147]:
2
In [148]:
x[min_idx]
Out[148]:
-10
In [149]:
np.min(x)
Out[149]:
-10
In [150]:
x
Out[150]:
array([  0,  10, -10,   4,   3,   2, 100,   2,   2,  -1])
In [151]:
max_idx = np.argmax(x)
max_idx
Out[151]:
6
In [152]:
x[max_idx]
Out[152]:
100
In [153]:
x = np.array([0, 100, 0, 4, 3, 2, 100, 2, 2, -1])
x
Out[153]:
array([  0, 100,   0,   4,   3,   2, 100,   2,   2,  -1])
In [154]:
np.argmax(x)
Out[154]:
1
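
In the last cell the maximum 100 appears at indices 1 and 6, and argmax reports only the first. The sketch below shows one possible way to find every such index, and how to turn the flat index that argmax returns for a 2D array back into a row and column. np.flatnonzero and np.unravel_index are not used in the lecture, and the array `grid` is made up for this example.

```python
import numpy as np

x = np.array([0, 100, 0, 4, 3, 2, 100, 2, 2, -1])

first = np.argmax(x)                     # first occurrence of the maximum
all_max = np.flatnonzero(x == x.max())   # every index holding the maximum

print(first)    # 1
print(all_max)  # [1 6]

# For a 2D array, argmax without an axis indexes the flattened array;
# np.unravel_index converts that flat index back to (row, column).
grid = np.array([[1, 9, 3],
                 [4, 5, 6]])
row, col = np.unravel_index(np.argmax(grid), grid.shape)
print(row, col)  # 0 1
```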

Because of its speed, numpy makes it possible to use Python for scientific computing.

You can read more about numpy at its User Guide and its Reference Guide.

Live coding example: calculate distance between 2D points.

In [155]:
import matplotlib.pyplot as plt
In [156]:
a = (10, 20)
b = (13, 24)

plt.plot([a[0]], [a[1]], 'o', label='a')
plt.plot([b[0]], [b[1]], 'o', label='b')
plt.plot([5, 15, 15, 5, 5], [15, 15, 25, 25, 15], 'k-')
plt.legend();
In [157]:
def compute_distance(a,b):
    dx = a[0] - b[0]
    dy = a[1] - b[1]
    return np.sqrt(dx*dx + dy*dy)
In [158]:
compute_distance(a,b)
Out[158]:
5.0
In [159]:
assert compute_distance(a,b)==5.0
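
The compute_distance function above handles one pair of points. As a possible extension in the spirit of the earlier sections, the sketch below computes the distance for many pairs at once without a Python loop. The function name `pairwise_distance` and the sample points are assumptions made for this example.

```python
import numpy as np

def pairwise_distance(p, q):
    """Euclidean distance between matching rows of two (n, 2) arrays."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    diff = p - q
    # Sum the squared differences along axis 1 (x and y of each point).
    return np.sqrt(np.sum(diff * diff, axis=1))

starts = [(10, 20), (0, 0), (1, 1)]
ends = [(13, 24), (6, 8), (1, 1)]

# The three distances are 5, 10 and 0.
print(pairwise_distance(starts, ends))
```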
</html>