Data Cleaning and Preparation in Pandas and Python


In this tutorial, you’ll learn how to clean and prepare data in a Pandas DataFrame. You’ll learn how to work with missing data, how to work with duplicate data, and how to clean messy string data. Being able to effectively clean and prepare a dataset is an important skill. Many data scientists estimate that they spend 80% of their time cleaning and preparing their datasets.

Pandas provides you with several fast, flexible, and intuitive ways to clean and prepare your data. By the end of this tutorial, you’ll have learned all you need to know to get started with:

  • Working with missing data using methods such as .fillna()
  • Working with duplicate data using the .drop_duplicates() method
  • Cleaning string data using the .str accessor.

Handling Missing Data in Pandas

To follow along with this section of the tutorial, let’s load a messy Pandas DataFrame that we can use to explore ways in which we can handle missing data. If you want to follow along line by line, simply copy the code below to load the DataFrame:

# Loading a Sample Pandas DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', np.nan],
    'Age': [33, 32, 40, 57, np.nan],
    'Location': ['Toronto', 'London', 'New York', np.nan, np.nan]
})
print(df)

# Returns:
#    Name   Age  Location
# 0   Nik  33.0   Toronto
# 1  Kate  32.0    London
# 2  Evan  40.0  New York
# 3  Kyra  57.0       NaN
# 4   NaN   NaN       NaN

By printing out the DataFrame, we can see that we have three columns. Each column contains at least one missing value.

Understanding the Pandas isnull Method

Pandas comes with an incredibly helpful method, .isnull(), that identifies whether a value is missing or not. The method returns a boolean value, either True or False. We can apply the method either to an entire DataFrame or to a single column. The method will broadcast correctly to either the Series or the DataFrame, depending on what it’s applied to. Let’s take a quick look:

# Exploring the .isnull() method
print(df.isnull())

# Returns:
#     Name    Age  Location
# 0  False  False     False
# 1  False  False     False
# 2  False  False     False
# 3  False  False      True
# 4   True   True      True

Since we applied the method to the entire DataFrame, each value was evaluated to see if it was missing or not missing. In the following sections, you’ll learn how to make use of this method to transform your DataFrame.
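The same method can be applied to a single column. As a quick sketch (re-creating the sample DataFrame so the snippet runs on its own), applying .isnull() to a column returns a boolean Series:

```python
# Applying .isnull() to a single column
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', np.nan],
    'Age': [33, 32, 40, 57, np.nan],
    'Location': ['Toronto', 'London', 'New York', np.nan, np.nan]
})

# Each element is checked individually, returning a boolean Series
print(df['Location'].isnull())

# Returns:
# 0    False
# 1    False
# 2    False
# 3     True
# 4     True
# Name: Location, dtype: bool
```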

Counting Missing Values in a Pandas DataFrame

One of the first steps you’ll want to take is to understand how many missing values you actually have in your DataFrame. One way to do this is to use a chained version of the .isnull() method and the .sum() method:

print(df.isnull().sum())

# Returns:
# Name        1
# Age         1
# Location    2
# dtype: int64

The reason that this works is that the value of True is actually represented by the value of 1, while False is represented by the value of 0. Because of this, we can apply the .sum() method to the DataFrame. This returns a Series containing the counts of missing items in each column.

Pandas comes with a negation function of .isnull(). In order to count data that isn’t missing in each column, you can chain together the .notnull() and .sum() methods. This returns a Series containing the counts of non-missing data in each column.
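A minimal sketch of that chain, re-creating the same sample data so the snippet is self-contained:

```python
# Counting Non-Missing Values with .notnull()
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', np.nan],
    'Age': [33, 32, 40, 57, np.nan],
    'Location': ['Toronto', 'London', 'New York', np.nan, np.nan]
})

# .notnull() is the negation of .isnull(), so summing counts present values
print(df.notnull().sum())

# Returns:
# Name        4
# Age         4
# Location    3
# dtype: int64
```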

Dropping Missing Data in a Pandas DataFrame

When working with missing data, it’s often good to do one of two things: either drop the records or find ways to fill the data. In this section, you’ll learn how to take on the former of the two. Pandas provides a method, .dropna(), which is used to drop missing data. Let’s take a look at the method:

# Exploring the Pandas .dropna() method
df.dropna(
    axis=0,         # Whether to drop rows (0) or columns (1)
    how='any',      # Drop if 'any' value is missing or only if 'all' are
    thresh=None,    # Minimum number of non-missing values required to keep
    subset=None,    # Which columns (or rows) to consider
    inplace=False   # Whether to drop in place (i.e., without needing to re-assign)
)

Let’s see how some of these parameters can be used to modify the behaviour of the method. First, let’s simply apply the method with all default arguments and explore the results:

# Dropping Values with Default Arguments
df = df.dropna()
print(df)

# Returns:
#    Name   Age  Location
# 0   Nik  33.0   Toronto
# 1  Kate  32.0    London
# 2  Evan  40.0  New York

By default, Pandas will drop records where any value is missing. Because of this, it also removed the fourth row, where only one value was missing.

We can modify this behaviour to drop a record only if all of its values are missing. Reloading the original DataFrame first, let’s see how we can make this happen:

# Dropping Records Only if All Records are Missing
df = df.dropna(how='all')
print(df)

# Returns:
#    Name   Age  Location
# 0   Nik  33.0   Toronto
# 1  Kate  32.0    London
# 2  Evan  40.0  New York
# 3  Kyra  57.0       NaN

Similarly, we can use the following parameters to further customize how the method works:

  • thresh=: the minimum number of non-missing values a row (or column) needs in order to be kept
  • subset=: the names of the columns to consider when looking for missing values
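As a brief sketch of both parameters (re-loading the original DataFrame so the snippet is self-contained):

```python
# Using the thresh= and subset= Parameters
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', np.nan],
    'Age': [33, 32, 40, 57, np.nan],
    'Location': ['Toronto', 'London', 'New York', np.nan, np.nan]
})

# Keep only rows that have at least two non-missing values
print(df.dropna(thresh=2))    # drops only the all-missing row (index 4)

# Consider only the 'Location' column when deciding what to drop
print(df.dropna(subset=['Location']))    # drops rows 3 and 4
```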

Filling Missing Data in a Pandas DataFrame

Removing missing data also removes any associated data from those records. Because of this, it can be helpful to fill in missing values. You can do this using the .fillna() method. The method can be applied to either an entire DataFrame or to a single column. By filling a constant in as a parameter, all missing values are replaced with that value:

# Using .fillna() to Fill Missing Data
df = df.fillna(0)
print(df)

# Returns:
#    Name   Age  Location
# 0   Nik  33.0   Toronto
# 1  Kate  32.0    London
# 2  Evan  40.0  New York
# 3  Kyra  57.0         0
# 4     0   0.0         0

When working with different data types, it may not always be prudent to fill all missing data with the same value. Because of this, you can pass in a dictionary to fill data with different values. The keys of the dictionary represent the columns and the values represent the values to fill with:

# Filling Columns with Different Values
df = df.fillna({'Name': 'Someone', 'Age': 25, 'Location': 'USA'})
print(df)

# Returns:
#       Name   Age  Location
# 0      Nik  33.0   Toronto
# 1     Kate  32.0    London
# 2     Evan  40.0  New York
# 3     Kyra  57.0       USA
# 4  Someone  25.0       USA

One very handy way to deal with missing values is to impute them. This refers to filling missing values with estimated values. A simple example of this is filling missing ages with the average age, which we can do by passing in the mean of that column (applied here to the original DataFrame):

# Imputing a Missing Value
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)

# Returns:
#    Name   Age  Location
# 0   Nik  33.0   Toronto
# 1  Kate  32.0    London
# 2  Evan  40.0  New York
# 3  Kyra  57.0       NaN
# 4   NaN  40.5       NaN

In the following section, you’ll learn how to deal with duplicate data in a Pandas DataFrame.

Working with Duplicate Data in Pandas

Duplicate data can be introduced into a dataset for a number of reasons. Sometimes this data can be valid, while other times it can present serious problems in your data’s integrity. Because of this, it’s important to understand how to find and deal with duplicate data. Let’s load a sample dataset that contains different types of duplicate data:

# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', 'Nik', 'Kate'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Toronto', 'London', 'New York', 'Atlanta', 'Toronto', 'Paris'],
    'Date Modified': ['2022-01-01', '2022-02-24', '2022-08-12', '2022-09-12', '2022-01-01', '2022-12-09']
})

print(df)

# Returns:
#    Name  Age  Location Date Modified
# 0   Nik   33   Toronto    2022-01-01
# 1  Kate   32    London    2022-02-24
# 2  Evan   40  New York    2022-08-12
# 3  Kyra   57   Atlanta    2022-09-12
# 4   Nik   33   Toronto    2022-01-01
# 5  Kate   32     Paris    2022-12-09

In the DataFrame you loaded above, there are a number of records that are completely unique and others that are partially or completely duplicated. In the following section, you’ll learn how to identify duplicate records.

Identifying Duplicate Records in a Pandas DataFrame

Pandas provides a helpful method, .duplicated(), which allows you to identify duplicate records in a dataset. The method, similar to the .isnull() method you learned about above, returns a boolean Series indicating whether each record is a duplicate of an earlier one:

# Identifying Duplicate Records in a Pandas DataFrame
print(df.duplicated())

# Returns:
# 0    False
# 1    False
# 2    False
# 3    False
# 4     True
# 5    False
# dtype: bool

On its own, this may not seem particularly useful. However, as you’ll soon learn, this method allows us to drop duplicate records. What’s more, it allows us to count duplicate records: because True is represented by 1, we can simply sum the Series to determine how many duplicates exist.

# Counting Duplicate Records in a DataFrame
print(df.duplicated().sum())

# Returns: 1

This allows you to understand the extent of duplicate records in a dataset. Knowing how many records are duplicate can give you a better sense of any potential data integrity issues.
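It can also be useful to inspect the duplicated rows themselves. One way to sketch this is to pass keep=False, which flags every member of a duplicate group (not just the later copies), and use the result as a boolean filter:

```python
# Inspecting Duplicate Records with keep=False
import pandas as pd

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', 'Nik', 'Kate'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Toronto', 'London', 'New York', 'Atlanta', 'Toronto', 'Paris'],
    'Date Modified': ['2022-01-01', '2022-02-24', '2022-08-12', '2022-09-12', '2022-01-01', '2022-12-09']
})

# keep=False marks all copies of a duplicated record as True
print(df[df.duplicated(keep=False)])  # rows 0 and 4: the same record appears twice
```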

Removing Duplicate Data in a Pandas DataFrame

Pandas makes it easy to remove duplicate records using the .drop_duplicates() method. Let’s take a look at what parameters the method has available:

# The Pandas .drop_duplicates() method
df.drop_duplicates(
    subset=None,            # Which columns to consider 
    keep='first',           # Which duplicate record to keep
    inplace=False,          # Whether to drop in place
    ignore_index=False      # Whether to relabel the index
)

Let’s see what happens when we apply the method with all default parameters:

# Dropping Duplicates with Default Arguments
df = df.drop_duplicates()
print(df)

# Returns:
#    Name  Age  Location Date Modified
# 0   Nik   33   Toronto    2022-01-01
# 1  Kate   32    London    2022-02-24
# 2  Evan   40  New York    2022-08-12
# 3  Kyra   57   Atlanta    2022-09-12
# 5  Kate   32     Paris    2022-12-09

We can see that this dropped only the record in which every value matched an earlier record. It kept the first occurrence of the duplicate (index 0).
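The keep= parameter can also retain the last occurrence instead. A small sketch, re-loading the sample DataFrame so the snippet runs on its own:

```python
# Keeping the Last Occurrence with keep='last'
import pandas as pd

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', 'Nik', 'Kate'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Toronto', 'London', 'New York', 'Atlanta', 'Toronto', 'Paris'],
    'Date Modified': ['2022-01-01', '2022-02-24', '2022-08-12', '2022-09-12', '2022-01-01', '2022-12-09']
})

# keep='last' drops index 0 and keeps index 4, the later copy
print(df.drop_duplicates(keep='last'))
```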

Now, let’s see how we can expand the functionality of this method. We can see that we have two quite similar records: index 1 and 5. They are duplicated across only two of the columns, and one of the records is more recent.

We can modify the behavior of the method to keep the most recent record by first sorting the data based on the last modified date. Then, we can ask Pandas to drop based on a subset of relevant columns. Let’s see what this looks like:

# Dropping Based on a Subset of Columns
df = df.sort_values(by='Date Modified', ascending=False)
df = df.drop_duplicates(subset=['Name', 'Age'], keep='first')
print(df)

# Returns:
#    Name  Age  Location Date Modified
# 5  Kate   32     Paris    2022-12-09
# 3  Kyra   57   Atlanta    2022-09-12
# 2  Evan   40  New York    2022-08-12
# 0   Nik   33   Toronto    2022-01-01

Let’s break down what we did here:

  1. We sorted the data by 'Date Modified' in descending order. This places more recent records first.
  2. We then dropped duplicates based on 'Name' and 'Age', keeping the first instance.

In the next section, you’ll learn how to clean strings in Pandas.

Cleaning Strings in Pandas

One of the perks of working with Pandas is its strong ability to work with text data. This is made even more powerful by being able to take any string method and apply it directly to an entire array of data. In this section, you’ll learn how to trim whitespace, split strings into columns, and replace text in strings.

Let’s load a Pandas DataFrame that contains some string data to work with:

# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.DataFrame.from_dict({
    'Name': ['Tranter, Melvyn', 'Lana, Courtney', 'Abel, Shakti', 'Vasu, Imogene', 'Aravind, Shelly'],
    'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
    'Location': ['TORONTO', 'LONDON', 'New york', 'ATLANTA', 'toronto'],
    'Favorite Color': ['   green  ', 'red', '  yellow', 'blue', 'purple  ']
})

print(df)

# Returns:
#               Name    Region  Location Favorite Color
# 0  Tranter, Melvyn  Region A   TORONTO        green  
# 1   Lana, Courtney  Region A    LONDON            red
# 2     Abel, Shakti  Region B  New york         yellow
# 3    Vasu, Imogene  Region C   ATLANTA           blue
# 4  Aravind, Shelly  Region D   toronto       purple  

We can see that our DataFrame has some messy string data! For example, some columns contain multiple data points (first and last name), while others contain redundant data (the word ‘Region’), messy capitalization ('Location'), or extra whitespace ('Favorite Color').

In order to apply string methods to an entire Pandas Series, you need to use the .str attribute, which gives you access to apply vectorized string methods.
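As a tiny sketch of what the accessor does, calling a method through .str applies it element-wise across the whole Series, with no explicit loop:

```python
# The .str Accessor Applies String Methods Element-Wise
import pandas as pd

locations = pd.Series(['TORONTO', 'LONDON', 'New york'])

# One call transforms every element at once
print(locations.str.lower().tolist())

# Returns: ['toronto', 'london', 'new york']
```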

Trimming White Space in Pandas Strings

Let’s start off by removing whitespace from text in Pandas. We can see that the column 'Favorite Color' has extra whitespace on either end of the color. Python comes with a number of methods to strip whitespace from the front of a string, the back of a string, or either end. Because the whitespace exists on either end of the string, we will make use of the .strip() method.

# Trimming Whitespace from a Pandas Column
df['Favorite Color'] = df['Favorite Color'].str.strip()
print(df)

# Returns:
#               Name    Region  Location Favorite Color
# 0  Tranter, Melvyn  Region A   TORONTO          green
# 1   Lana, Courtney  Region A    LONDON            red
# 2     Abel, Shakti  Region B  New york         yellow
# 3    Vasu, Imogene  Region C   ATLANTA           blue
# 4  Aravind, Shelly  Region D   toronto         purple

Here, we were able to successfully strip whitespace from a column by re-assigning it to itself.
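If the whitespace only needs removing from one side, the one-sided variants work the same way through the accessor. A small sketch using similarly messy colors:

```python
# Stripping Only One Side with .lstrip() and .rstrip()
import pandas as pd

colors = pd.Series(['   green  ', '  yellow', 'purple  '])

# .lstrip() removes leading whitespace only
print(colors.str.lstrip().tolist())
# Returns: ['green  ', 'yellow', 'purple  ']

# .rstrip() removes trailing whitespace only
print(colors.str.rstrip().tolist())
# Returns: ['   green', '  yellow', 'purple']
```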

Splitting Strings into Columns in Pandas

The 'Name' column contains both the person’s last and first names. In many cases, you may want to split this column into two – one each for the first and last name. This approach works a little differently, as we will want to assign two columns, rather than just one.

Let’s see what happens when we apply the .str.split() method on the 'Name' column:

# Applying .split on a column
print(df['Name'].str.split(','))

# Returns:
# 0    [Tranter,  Melvyn]
# 1     [Lana,  Courtney]
# 2       [Abel,  Shakti]
# 3      [Vasu,  Imogene]
# 4    [Aravind,  Shelly]
# Name: Name, dtype: object

We can see that this returned a Series in which each element is a list of strings. What we want to do, however, is assign these values to multiple columns. In order to do this, we need to pass in the expand=True argument to instruct Pandas to split the values into separate columns. From there, we can assign the values into two columns:

# Splitting a Column into Two Columns
# (splitting on ', ' prevents a leading space in the first names)
df[['Last Name', 'First Name']] = df['Name'].str.split(', ', expand=True)

print(df)
# Returns:
#               Name    Region  Location Favorite Color Last Name First Name
# 0  Tranter, Melvyn  Region A   TORONTO        green     Tranter     Melvyn
# 1   Lana, Courtney  Region A    LONDON            red      Lana   Courtney
# 2     Abel, Shakti  Region B  New york         yellow      Abel     Shakti
# 3    Vasu, Imogene  Region C   ATLANTA           blue      Vasu    Imogene
# 4  Aravind, Shelly  Region D   toronto       purple     Aravind     Shelly

Make note here of the use of the double square brackets. This is required since we are passing in a list of columns we want to create!

Replacing Text in Strings in Pandas

In the 'Region' column, the word “Region” is redundant. In this example, you’ll learn how to replace some text in a column. In particular, you’ll learn how to remove a given substring in a larger string. For this, we can use the aptly-named .replace() method. The method takes a string we want to replace and a string that we want to substitute with. Because we want to remove a substring, we’ll simply pass in an empty string to substitute with.

# Replacing a Substring in Pandas
df['Region'] = df['Region'].str.replace('Region ', '')
print(df)

# Returns:
#               Name Region  Location Favorite Color
# 0  Tranter, Melvyn      A   TORONTO        green  
# 1   Lana, Courtney      A    LONDON            red
# 2     Abel, Shakti      B  New york         yellow
# 3    Vasu, Imogene      C   ATLANTA           blue
# 4  Aravind, Shelly      D   toronto       purple  

Changing String Case in Pandas

In this section, we’ll learn how to fix the odd and inconsistent casing that exists in the 'Location' column. Pandas provides access to a number of methods that allow us to change cases of strings:

  • .upper() will convert a string to all upper case
  • .lower() will convert a string to all lower case
  • .title() will convert a string to title case
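A quick sketch comparing the three methods on a single value:

```python
# Comparing the Three Case Methods
import pandas as pd

s = pd.Series(['New york'])

print(s.str.upper().tolist())    # Returns: ['NEW YORK']
print(s.str.lower().tolist())    # Returns: ['new york']
print(s.str.title().tolist())    # Returns: ['New York']
```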

In this case, we want our locations to be in title case, so we can apply the .str.title() method to the column:

# Changing Text to Title Case in Pandas
df['Location'] = df['Location'].str.title()
print(df)

# Returns:
#               Name    Region  Location Favorite Color
# 0  Tranter, Melvyn  Region A   Toronto        green  
# 1   Lana, Courtney  Region A    London            red
# 2     Abel, Shakti  Region B  New York         yellow
# 3    Vasu, Imogene  Region C   Atlanta           blue
# 4  Aravind, Shelly  Region D   Toronto       purple  

We can see that applying the .str.title() method capitalized the first letter of each word.

Exercises

It’s time to check your learning! Try to solve the exercises below, then compare your approach against the sample solutions. Load the sample DataFrame below to answer the questions:

# Loading a DataFrame
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict({
    'Name': ['Tranter; Melvyn', 'Lana; Courtney', 'Abel; Shakti', 'Vasu; Imogene', 'Aravind; Shelly', 'Tranter; Melvyn'],
    'Location': ['TORONTO', 'LONDON', 'New york', np.nan, 'toronto', 'Madrid'],
    'Sales': [123, 243, 654, np.nan, 345, np.nan]
})

Exercise 1: What proportion of each column’s values is missing?

Divide the output of df.isnull().sum() by the length of the DataFrame:

print(df.isnull().sum() / len(df))

# Returns:
# Name        0.000000
# Location    0.166667
# Sales       0.333333
# dtype: float64

Exercise 2: Split the 'Name' column into 'Last Name' and 'First Name' columns.

Use the .str.split() method with expand=True, splitting on the '; ' separator:

df[['Last Name', 'First Name']] = df['Name'].str.split('; ', expand=True)

Conclusion and Recap

In this tutorial, you learned how to use Pandas for data cleaning! The section below provides a quick recap of what you learned in this tutorial:

  • Pandas provides a large variety of methods aimed at manipulating and cleaning your data
  • Missing data can be identified using the .isnull() method. Missing data can be counted per column by chaining in the .sum() method, which returns a Series of counts per column.
  • Missing data can be removed using the .dropna() method, which can be customized using different parameters
  • Duplicate data can be found using the .duplicated() method
  • The .drop_duplicates() method can be used to drop duplicate records, providing significant customization on which records to consider as duplicate and which records to keep.
  • Pandas can access Python string methods using the .str attribute. This can be helpful in applying vectorized cleaning methods on messy text data.

Additional Resources

To learn more about related topics, check out the following tutorials: