NumPy for Data Science in Python


In this tutorial, you’ll learn how to use Python’s NumPy library for data science. You’ll learn why the library matters in the realm of data science and how it’s foundational for many other libraries. You’ll learn about the NumPy ndarray data structure and how it works. By the end of the tutorial, you’ll have learned:

  • How NumPy arrays are different from Python lists
  • Why NumPy is so fast and efficient
  • How to create one-dimensional and multi-dimensional NumPy arrays
  • How to apply methods and functions to NumPy arrays
  • How to create NumPy arrays programmatically

Why Use NumPy for Data Science in Python

NumPy is one of the core packages for scientific computing in Python. The library is so important to Python’s data science community, in fact, that it is at the core of many other data science libraries, like Pandas and Matplotlib.

NumPy provides a key object, the ndarray. The ndarray is an n-dimensional array of homogeneous data. It enables the creation of arrays of a single dimension, two dimensions (like a table or matrix), and higher dimensions.

One of the important benefits of NumPy is its speed. Why is NumPy so fast? NumPy allows you to vectorize your code, providing you with methods to modify, transform, and aggregate your arrays without writing explicit Python loops. Vectorization is fast because NumPy runs these operations in optimized, pre-compiled C code.

NumPy also provides you with tools that allow you to broadcast your operations (a concept you’ll learn more about later). This results in much more readable code, because NumPy handles these operations for you instead of requiring you to write explicit for loops.

Installing and Importing NumPy in Python

Let’s start off by learning how to install NumPy. Since NumPy isn’t part of the Python standard library, you need to install it before you can use it. It’s easy to install using the pip package installer. To install the library, simply run the command below in your terminal:

# Installing NumPy with pip
pip install numpy

pip will handle installing NumPy and all of its dependencies. Once the installation is complete, you can import the library. Conventionally, NumPy is imported with the alias np. While you don’t have to follow the convention, you’ll encounter it virtually everywhere. Following it will make troubleshooting your code much, much easier should you run into issues.

Try writing the code below into a Python file and running the code. If it runs without issue, then you’re ready to start working with NumPy in Python!

# Importing NumPy
import numpy as np

Let’s get started exploring the wonderful world of NumPy arrays.

Creating Python NumPy Arrays

NumPy ndarray objects are n-dimensional arrays. On the surface, they appear to be quite similar to Python lists, but they work quite differently. Let’s work on creating our first array:

# Creating your first array
import numpy as np
array = np.array([1,2,3,4,5])

Let’s check the type of this array by using the type() function:

# Checking the type of the array
print(type(array))

# Returns: <class 'numpy.ndarray'>

Because NumPy arrays are homogeneous, you can actually define a data type of the array when you create it. Let’s see what the data type of the array you created above is. You can do this using the .dtype attribute:

# Checking the data type of an array
print(array.dtype)

# Returns: int64 (on most platforms)

Similarly, you can define the data type when you create the array by passing in the dtype= parameter. Let’s recreate our array using the data type float64:

# Creating an array with a data type
array = np.array([1,2,3,4,5], dtype='float64')
print(array.dtype)

# Returns: float64
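
Because an array holds a single data type, NumPy converts mixed inputs to one common type. The short sketch below illustrates this, along with the .astype() method for converting an existing array. The variable names are only illustrative:

```python
# How NumPy settles on a single data type
import numpy as np

# Mixing integers and a float: every item is stored as a float
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# Returns: float64

# .astype() returns a new array with the requested data type
floats = np.array([1.7, 2.2, 3.9])
ints = floats.astype('int64')
print(ints)

# Returns: [1 2 3]
```

Note that converting floats to integers drops the decimal portion rather than rounding.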

Now that you have a preliminary understanding of how to create NumPy arrays, let’s take a look at how they differ from lists.

NumPy Arrays versus Python Lists

On the surface, NumPy arrays may look quite similar to the Python list object. In fact, you can even use lists to create arrays (and vice versa). However, NumPy arrays are quite different from Python lists. Let’s take a look at some of the key differences between them.

  1. Fixed Size: NumPy arrays have a fixed size when they are created. On the other hand, Python lists can grow dynamically. When you change the size of a NumPy array, the original is destroyed and a new one is created.
  2. Homogeneous: items in a NumPy array are required to be of the same data type. Python lists, on the other hand, don’t enforce this. (There is one exception: when NumPy arrays contain objects, the objects can contain different data types)
  3. NumPy arrays are built around mathematical operations: the functions and methods that can be applied to a NumPy array are focused around math and efficiency
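
The first two differences can be seen in a small sketch. It assumes the np.append() function as one way of "changing the size" of an array, and the variable names are only illustrative:

```python
# Fixed size and homogeneity in practice
import numpy as np

array = np.array([1, 2, 3])

# np.append() doesn't grow the array in place; it returns a new array
longer = np.append(array, 4)
print(array)        # Returns: [1 2 3]
print(longer)       # Returns: [1 2 3 4]

# A list can grow in place and hold mixed data types
items = [1, 2, 3]
items.append('four')
print(items)        # Returns: [1, 2, 3, 'four']

# The exception: an array of objects can hold mixed Python objects
objects = np.array([1, 'two', 3.0], dtype=object)
print(objects.dtype)    # Returns: object
```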

NumPy arrays are wonderful because they can be written with the simplicity of Python, but achieve the speed of compiled C code. This is because, under the hood, NumPy uses compiled C code for many of its operations. This allows you to attain incredibly efficient programming speeds, with the ease and simplicity that Python coding provides.

Let’s take a look at an example of how this works in practice. We’ll dive into this more later on, too.

Multiplying an Array by a Scalar

For our example, let’s take a look at how we would multiply every item in a list, or every item in an array, by a scalar. Let’s imagine we have the items [1, 2, 3, 4, 5] and want to multiply each of them by 2. With an array, we can simply multiply the array by that value:

# Multiplying an array by a scalar
array = np.array([1,2,3,4,5])
array2 = 2 * array
print(array2)

# Returns: [ 2  4  6  8 10]

Let’s try that same operation with a list:

list1 = [1,2,3,4,5]
list2 = 2 * list1
print(list2)

# Returns: [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

Multiplying a list by 2 repeats the list rather than doubling its items. In order to accomplish the same thing with a list, we need to iterate over each item and multiply it by that scalar:

list1 = [1,2,3,4,5]
list2 = []
for item in list1:
    list2.append(2 * item)
print(list2)

# Returns: [2, 4, 6, 8, 10]

The benefit of using NumPy is two-fold:

  1. We achieve greater readability of what we’re hoping to accomplish
  2. Our processing of the data is vectorized through the use of C under the hood. This allows the operations to happen significantly faster!
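
If you want to see the speed difference for yourself, one rough way is to time both approaches with Python’s built-in timeit module. This is only a sketch: the array size and number of repetitions are arbitrary choices, and the timings will vary from machine to machine, so no expected timings are shown:

```python
# Timing a Python loop against a vectorized NumPy operation
import timeit

import numpy as np

values = list(range(100_000))
array = np.array(values)

def double_with_loop():
    # Pure-Python approach: visit every item one at a time
    return [2 * item for item in values]

def double_with_numpy():
    # Vectorized approach: a single expression run in compiled code
    return 2 * array

loop_seconds = timeit.timeit(double_with_loop, number=20)
numpy_seconds = timeit.timeit(double_with_numpy, number=20)

print(f'Loop:  {loop_seconds:.4f} seconds')
print(f'NumPy: {numpy_seconds:.4f} seconds')

# Both approaches produce the same values
print(double_with_loop() == double_with_numpy().tolist())

# Returns: True
```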

Now that you have a good understanding of how lists and arrays are different, let’s take a look at multi-dimensional NumPy arrays in Python!

Multi-Dimensional NumPy Arrays

In this section, you’ll learn how multi-dimensional arrays work. Primarily, you’ll learn how to create two-dimensional arrays, since these are the easiest to display on a computer screen.

In the previous example, you created a one-dimensional array by passing in a list. Similarly, you can create a two-dimensional array by passing in a list of lists. Let’s take a look at a simple example:

# Creating a two-dimensional array
array = np.array([[1,2,3],[4,5,6]])
print(array)

# Returns:
# [[1 2 3]
#  [4 5 6]]

We can check the number of dimensions of the array by using the .ndim attribute, which returns a single integer:

# Checking the dimensions of an array
array = np.array([[1,2,3],[4,5,6]])
print(array.ndim)

# Returns: 2

Similarly, we can use the .shape attribute to return the number of elements stored along each dimension of the array.

array = np.array([[1,2,3],[4,5,6]])
print(array.shape)

# Returns: (2, 3)

Finally, you can use the .size attribute to find the total number of elements in the array. This attribute is the product of the values in the array’s shape.

array = np.array([[1,2,3],[4,5,6]])
print(array.size)

# Returns: 6
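
The same three attributes work for arrays with more than two dimensions. As a small illustration, the sketch below builds a three-dimensional array from a list of lists of lists:

```python
# Checking the attributes of a 3-dimensional array
import numpy as np

# Two "blocks", each with two rows and two columns
array = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(array.ndim)       # Returns: 3
print(array.shape)      # Returns: (2, 2, 2)
print(array.size)       # Returns: 8

# .size is the product of the values in .shape
print(array.size == 2 * 2 * 2)      # Returns: True
```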

In the next section, you’ll learn how to access items in a NumPy array using indexing, slicing, and boolean indexing.

Indexing, Slicing, and Boolean Indexing NumPy Arrays

So far, you’ve learned how to create one-dimensional arrays as well as multi-dimensional arrays. In this section, you’ll learn how to access data in these arrays using indexing, slicing, and boolean indexing.

Let’s start off by accessing items in a one-dimensional array. Indexing and slicing NumPy arrays works very similarly to indexing and slicing Python lists:

  • Indices start at 0 and continue through to the end of the array
  • Negative indices start at -1, which refers to the last item
  • Arrays can be sliced using a colon, using either positive or negative indices (or both)
  • Omitting the start or end of a slice implies the beginning or end of the array, respectively

Let’s look at a few indices and slices:

# Indexing and Slicing a 1-D NumPy Array
import numpy as np
array = np.array([1,2,3,4,5])

print(array[0])     # Returns: 1
print(array[-1])    # Returns: 5
print(array[1:3])   # Returns: [2 3]
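
The slicing rules above, including omitted boundaries and negative indices, can be sketched as follows. The step value (the number after a second colon) isn’t covered in the list above; it is included here on the assumption that you’re familiar with it from slicing Python lists:

```python
# More ways to slice a 1-D NumPy array
import numpy as np
array = np.array([1,2,3,4,5])

print(array[:2])     # Returns: [1 2]
print(array[2:])     # Returns: [3 4 5]
print(array[-2:])    # Returns: [4 5]
print(array[::2])    # Returns: [1 3 5]
print(array[::-1])   # Returns: [5 4 3 2 1]
```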

You can also apply boolean indexing to arrays. This means that you can filter the array based on a boolean condition.

Let’s see what this looks like. We’ll apply a condition on the array to filter values based on the value being greater than 2:

# Creating a Boolean Index
import numpy as np
array = np.array([1,2,3,4,5])
bool_array = array > 2

print(bool_array)

# Returns: [False False  True  True  True]

This returns an array containing only boolean values, where True represents that the condition is met. On the surface, this may not appear to be immediately useful. However, you can use that array as an index on your existing array to filter down the values:

# Filtering an array
import numpy as np
array = np.array([1,2,3,4,5])
bool_array = array > 2

filtered = array[bool_array]
# Same as: filtered = array[array > 2]
print(filtered)

# Returns: [3 4 5]
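
Boolean conditions can also be combined. The sketch below uses the element-wise operators & (and), | (or), and ~ (not). Each condition needs its own parentheses, because these operators bind more tightly than the comparisons:

```python
# Combining boolean conditions
import numpy as np
array = np.array([1,2,3,4,5])

both = array[(array > 2) & (array < 5)]
either = array[(array < 2) | (array > 4)]
negated = array[~(array > 2)]

print(both)         # Returns: [3 4]
print(either)       # Returns: [1 5]
print(negated)      # Returns: [1 2]
```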

Indexing, Slicing, and Boolean Indexing Multi-Dimensional NumPy Arrays

In the previous section, indexing, slicing and boolean indexing a one-dimensional array was compared to working with Python lists. Similarly, indexing, slicing and boolean indexing a multi-dimensional NumPy array can be compared to working with Python lists of lists.

Let’s create a multi-dimensional NumPy array to work with:

# Creating a 2-dimensional NumPy array
import numpy as np

array = np.array([[1,2,3], [4,5,6]])

Now, let’s try and apply the indexing you learned earlier:

print(array[0])

# Returns: [1 2 3]

Instead of returning the first value (1), the indexing method returned an array. This is actually quite helpful, since we can simply index that array again!

print(array[0][0])

# Returns: 1

Other than the note around indexing the inner arrays, indexing and slicing works exactly the same.
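
NumPy also lets you pass the row and column positions inside a single pair of square brackets, separated by a comma. This is something lists of lists can’t do, and it makes selecting a whole column possible. A short sketch:

```python
# Indexing and slicing a 2-D array with a single set of brackets
import numpy as np

array = np.array([[1,2,3], [4,5,6]])

print(array[0, 0])      # Returns: 1
print(array[1, -1])     # Returns: 6
print(array[:, 1])      # Returns: [2 5]
print(array[:, 1:])

# Returns:
# [[2 3]
#  [5 6]]
```

Here, a lone colon means "every item along this dimension", so array[:, 1] selects the second column from every row.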

On the other hand, boolean indexing works a little differently. Let’s try to apply the same filter as earlier (that the item is greater than 2):

# Boolean indexing a 2-dimensional array
array = np.array([[1,2,3], [4,5,6]])
filtered = array[array > 2]
print(filtered)

# Returns: [3 4 5 6]

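Applying a boolean index on a multi-dimensional array returns a flattened, 1-dimensional array. In order to maintain the original dimensionality of the array, you can use the np.where() function, which keeps the values that meet the condition and substitutes a replacement value everywhere else. Let’s apply the same condition, replacing the values that don’t meet it with NaN: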

# Using np.where() to filter an array
import numpy as np

array = np.array([[1,2,3], [4,5,6]])
filtered = np.where(array > 2, array, np.nan)

print(filtered)

# Returns:
# [[nan nan  3.]
#  [ 4.  5.  6.]]
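
Notice that the values above were converted to floats. This happens because NaN is itself a floating point value. If you’d rather keep integers, one option is to use an integer as the replacement value. The sketch below uses 0, which is an arbitrary choice for illustration:

```python
# Using np.where() with an integer replacement value
import numpy as np

array = np.array([[1,2,3], [4,5,6]])
filtered = np.where(array > 2, array, 0)

print(filtered)

# Returns:
# [[0 0 3]
#  [4 5 6]]

# The result is still an integer array
print(np.issubdtype(filtered.dtype, np.integer))

# Returns: True
```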

Applying Functions on NumPy Arrays

In this section, you’ll learn how to apply functions and methods to a NumPy array. Previously, you briefly learned how multiplying a NumPy array by a scalar works differently from a Python list. This is true for many other operations. Let’s take a look at a few:

# Applying Operations to a NumPy array
array = np.array([1,2,3,4,5])
print(array * 2)        # [ 2  4  6  8 10]
print(array + 1)        # [2 3 4 5 6]
print(array % 2)        # [1 0 1 0 1]

Similarly, you can add, subtract, and multiply (and more) different arrays together:

# Adding, Subtracting, and Multiplying Arrays
array1 = np.array([1,2,3])
array2 = np.array([4,5,6])

print(array1 + array2)          # [5 7 9]
print(array1 - array2)          # [-3 -3 -3]
print(array1 * array2)          # [ 4 10 18]
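
The arrays don’t always need to have identical shapes. Through broadcasting, NumPy can stretch a smaller array across a larger one when their shapes are compatible. The sketch below is a minimal illustration, and the names matrix, row, and column are only illustrative:

```python
# Broadcasting a smaller array across a larger one
import numpy as np

matrix = np.array([[1,2,3], [4,5,6]])
row = np.array([10,20,30])
column = np.array([[100], [200]])

# The row is added to every row of the matrix
print(matrix + row)

# Returns:
# [[11 22 33]
#  [14 25 36]]

# The column is added to every column of the matrix
print(matrix + column)

# Returns:
# [[101 102 103]
#  [204 205 206]]
```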

NumPy arrays also have a number of very helpful methods. For example, you can calculate the sum of all values or the mean of all values easily by applying the corresponding method. Let’s take a look at a few:

# NumPy Array Methods
array = np.array([1,2,3,4,5])

print(array.mean())         # 3.0
print(array.sum())          # 15

There are many different methods available in NumPy. This tutorial isn’t meant to provide an overview of all of them, but rather to give you enough information to apply them yourself.

When working with array methods on multi-dimensional arrays, the concept of an axis becomes important. When you don’t pass in an axis, NumPy uses the default of axis=None, which treats the multi-dimensional array as if it were flattened and aggregates over every element.

With axis=0, NumPy aggregates down the rows, producing one result for each “column” of a matrix. With axis=1, it aggregates across the columns, producing one result for each “row”. Because of this, you can pass these axes to a method to calculate different aggregations. Let’s load a multi-dimensional array and apply a method along different axes:

# Applying Methods to 2-D Arrays
array = np.array([[1,2], [3,4], [5,6]])

print(array.sum(axis=None))     # 21
print(array.sum(axis=0))        # [ 9 12]
print(array.sum(axis=1))        # [ 3  7 11]

You can see how this works. Remember, the array looks like the output below, where each sublist that is passed in is a “row” in the matrix:

print(array)

# Returns:
# [[1 2]
#  [3 4]
#  [5 6]]

By passing in different axes, the following interpretations can be made:

  • axis=None flattens the array and returns the sum of all elements
  • axis=0 sums down the rows, returning one total per column
  • axis=1 sums across the columns, returning one total per row
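
The axis argument works the same way for other aggregation methods. As a quick sketch, here are .mean() and .max() applied to the same array, along with the shape of each result:

```python
# Using axes with other methods
import numpy as np

array = np.array([[1,2], [3,4], [5,6]])

print(array.mean(axis=0))       # Returns: [3. 4.]
print(array.max(axis=1))        # Returns: [2 4 6]

# One value per column, then one value per row
print(array.sum(axis=0).shape)  # Returns: (2,)
print(array.sum(axis=1).shape)  # Returns: (3,)
```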

Concatenating NumPy Arrays

In this section, you’ll learn how concatenating NumPy arrays works. NumPy arrays have a concept of an axis, which can help instruct NumPy how to concatenate different arrays.

Let’s take a look at two arrays:

# Two Sample Arrays
a = np.array([[1,2], [3,4]])
b = np.array([[5,6]])

You can pass these arrays into the np.concatenate() function as a tuple. The function also accepts an axis argument, which defaults to 0. When we pass in an axis of None, the arrays are flattened first, so the concatenated array has a single dimension:

# Concatenating with an axis of None
a = np.array([[1,2], [3,4]])
b = np.array([[5,6]])

c = np.concatenate((a, b), axis=None)
print(c)

# Returns: [1 2 3 4 5 6]

If you wanted to add the second array as another “row” in the matrix, you can pass in axis=0. It’s important that the arrays have the same number of columns; more generally, their shapes must match along every axis except the one being concatenated.

# Concatenating with axis=0
a = np.array([[1,2], [3,4]])
b = np.array([[5,6]])

c = np.concatenate((a, b), axis=0)
print(c)

# Returns:
# [[1 2]
#  [3 4]
#  [5 6]]

Now, if you wanted to add the array as a “column” to the first array, you can apply the axis=1 parameter. However, because b has a shape of (1, 2) while a has two rows, you first need to transpose b so that its shape becomes (2, 1). This can be done using the .T attribute, which returns a transposed array:

# Concatenating with axis=1
a = np.array([[1,2], [3,4]])
b = np.array([[5,6]])

c = np.concatenate((a, b.T), axis=1)
print(c)

# Returns:
# [[1 2 5]
#  [3 4 6]]
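
NumPy also offers np.vstack() and np.hstack() as shorthand for stacking arrays vertically (as rows) and horizontally (as columns). The sketch below reproduces the two results above, and also shows the ValueError that is raised when the shapes don’t line up:

```python
# Stacking arrays with np.vstack() and np.hstack()
import numpy as np

a = np.array([[1,2], [3,4]])
b = np.array([[5,6]])

print(np.vstack((a, b)))

# Returns:
# [[1 2]
#  [3 4]
#  [5 6]]

print(np.hstack((a, b.T)))

# Returns:
# [[1 2 5]
#  [3 4 6]]

# Without transposing b, the shapes don't match along axis 0
try:
    np.concatenate((a, b), axis=1)
    shape_error = False
except ValueError:
    shape_error = True

print(shape_error)

# Returns: True
```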

In the next section, you’ll learn how to use NumPy to generate arrays programmatically.

Generating NumPy Arrays Programmatically

There are a number of easy ways to generate NumPy arrays programmatically. This can be incredibly helpful when you need placeholder data or need to create identity matrices. Let’s take a look at a few functions you can use in Python’s NumPy:

Creating Arrays of Zeroes in NumPy

To create an array of only zeroes in Python’s NumPy, you can use the aptly named np.zeros() function. The function takes the shape of your array as an argument. If you pass in a single integer, the function returns a one-dimensional array of that length. Passing in a tuple will produce an array of zeroes with the shape passed in. Let’s take a look at some examples:

# Creating arrays of zeroes
array1 = np.zeros(3)
array2 = np.zeros((2,3))

print('array1 looks like:')
print(array1)
print('\narray2 looks like:')
print(array2)

# Returns:
# array1 looks like:
# [0. 0. 0.]

# array2 looks like:
# [[0. 0. 0.]
#  [0. 0. 0.]]

Creating Arrays of Ones in NumPy

Similarly, NumPy comes with a function to generate an array of ones, the equally aptly named np.ones(). The function works in the same way as np.zeros(), except it returns 1s instead of 0s:

# Creating arrays of 1s
array1 = np.ones(3)
array2 = np.ones((2,3))

print('array1 looks like:')
print(array1)
print('\narray2 looks like:')
print(array2)

# Returns:
# array1 looks like:
# [1. 1. 1.]

# array2 looks like:
# [[1. 1. 1.]
#  [1. 1. 1.]]
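
You may have noticed the trailing decimal points in the output above: both functions return floats by default. As a small sketch, the dtype= parameter you saw earlier can be passed to these functions as well:

```python
# Creating arrays of zeroes and ones with a specific data type
import numpy as np

int_zeros = np.zeros((2,3), dtype='int64')
print(int_zeros)

# Returns:
# [[0 0 0]
#  [0 0 0]]

bool_ones = np.ones(3, dtype='bool')
print(bool_ones)

# Returns: [ True  True  True]
```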

Creating Identity Matrices in NumPy

The identity matrix is an n by n matrix where the values along the main diagonal are all 1s and the remaining values are 0s. This can be created using the np.eye() function. Because the matrix is required to be square, you simply need to pass in a single dimension. Let’s see what this looks like:

array = np.eye(4)
print(array)

# Returns:
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]

Creating an Array of a Range in NumPy

NumPy also has a helpful function to produce an array of a range of values. This function is called np.arange(), which can be used to create a range of values from 0 up to (but not including) the input number. Let’s create an array containing the values from 0 through 5:

array = np.arange(6)
print(array)

# Returns: [0 1 2 3 4 5]

Similar to Python’s range() function, you can specify start, stop, and step parameters. To create an array containing the values from 0 through 10, stepping at a value of 2, you could write:

array = np.arange(0, 11, 2)
print(array)

# Returns: [ 0  2  4  6  8 10]
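
A few more combinations of start, stop, and step are sketched below. Unlike Python’s range(), np.arange() also accepts a non-integer step:

```python
# More examples of np.arange()
import numpy as np

print(np.arange(2, 6))          # Returns: [2 3 4 5]
print(np.arange(5, 0, -1))      # Returns: [5 4 3 2 1]
print(np.arange(0, 1, 0.25))    # Returns: [0.   0.25 0.5  0.75]
```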

Generating Random Numbers in NumPy

NumPy also comes with powerful functions to produce arrays of random values. For example, you can create uniformly random distributions or normal (Gaussian) distributions.

Creating Uniformly Random Values in NumPy

To create an array of uniformly distributed random values, you can use the np.random.random() function. Similar to the above examples, passing in a single value returns a one-dimensional array of that length. Passing in a tuple generates a multi-dimensional array of the shape passed in. Let’s create a 3x2 array of random values between 0 (inclusive) and 1 (exclusive). Because the values are random, your output will differ:

# Uniformly Random Values
array = np.random.random((3, 2))
print(array)

# Returns:
# [[0.56942196 0.55263432]
#  [0.12823255 0.60557413]
#  [0.36275958 0.46599701]]

Creating a Normal (Gaussian) Distribution in NumPy

To create an array of normally distributed random values in Python’s NumPy, you can use the np.random.randn() function. The values are drawn from the standard normal distribution, which has a mean of 0 and unit variance. Let’s see how we can create a 3x2 array of normally distributed values:

# Normal Distribution in NumPy
array = np.random.randn(3,2)
print(array)

# Returns:
# [[-1.07816465  1.3593095 ]
#  [ 0.5428646  -0.55262844]
#  [-0.46369626  0.56692646]]

Creating an Array of Random Integers in NumPy

Finally, let’s take a look at how to create an array of random integers in NumPy. For this, you can use the np.random.randint() function. The function takes a low= argument, a high= argument, and a size= argument. The high argument is exclusive, meaning the values will go up to that value but not include it. Let’s create a 5x5 array with random values from 4 through 12.

# Random Integer Arrays
array = np.random.randint(low=4, high=13, size=(5,5))
print(array)

# Returns:
# [[ 8  7  9  6  9]
#  [ 4  5 10 12 11]
#  [10  6 12  6  9]
#  [11 12 11  6 12]
#  [ 4  7 12  6  4]]
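
Because these values are random, you’ll get different results each time you run the code. If you need reproducible results, one option is NumPy’s Generator interface, created with np.random.default_rng() and a seed. The sketch below assumes an arbitrary seed of 42. The Generator methods are named slightly differently from the functions above:

```python
# Reproducible random values with a seeded Generator
import numpy as np

# Assumption: 42 is an arbitrary seed chosen for illustration
rng = np.random.default_rng(42)

uniform = rng.random((3, 2))                            # Values from 0 up to (not including) 1
normal = rng.standard_normal((3, 2))                    # Standard normal values
integers = rng.integers(low=4, high=13, size=(5, 5))    # Integers from 4 through 12

# Creating the Generator again with the same seed repeats the same values
rng_again = np.random.default_rng(42)
print(np.array_equal(uniform, rng_again.random((3, 2))))

# Returns: True
```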

Exercises

Now it’s time to test your understanding! Try to complete the exercises below and check your work against the solutions provided.

Use the following arrays to answer the questions:

array1 = np.array([1,2,3,4,5])
array2 = np.array([[1,2,3], [4,5,6], [7,8,9]])

You can filter using the modulus operator to only get even numbers, since any even number with modulus 2 applied to it will be 0.

array = np.array([1,2,3,4,5])
filtered = array[array % 2 == 0]

print(filtered)

# Returns: [2 4]

You can use either positive or negative indexing:

print(array2[1][2])
print(array2[1][-1])

Use axis=1 in the .mean() method:

print(array2.mean(axis=1))

Conclusion and Recap

In this tutorial, you learned how to get up and running with the NumPy library and how to use its array data structure. The section below provides a recap of what you learned:

  • NumPy is an important, foundational library for data science in Python
  • NumPy can be installed using the pip package installer
  • Arrays can look like Python lists but function quite differently
  • NumPy uses precompiled C code to increase its speed and efficiency
  • NumPy arrays can be sliced, indexed, and boolean indexed similarly to Python lists
  • NumPy arrays use an axis to identify “rows” and “columns” of a matrix
  • You can apply functions and methods efficiently to arrays
