In this tutorial, you’ll learn how to clean and prepare data in a Pandas DataFrame. You’ll learn how to work with missing data, how to work with duplicate data, and how to clean messy string data. Being able to effectively clean and prepare a dataset is an important skill. Many data scientists estimate that they spend 80% of their time cleaning and preparing their datasets.
Pandas provides you with several fast, flexible, and intuitive ways to clean and prepare your data. By the end of this tutorial, you’ll have learned all you need to know to get started with:
- Working with missing data using methods such as .fillna()
- Working with duplicate data using methods such as the .drop_duplicates() method
- Cleaning string data using the .str accessor
Handling Missing Data in Pandas
To follow along with this section of the tutorial, let’s load a messy Pandas DataFrame that we can use to explore ways in which we can handle missing data. If you want to follow along line by line, simply copy the code below to load the DataFrame:
# Loading a Sample Pandas DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', np.nan],
    'Age': [33, 32, 40, 57, np.nan],
    'Location': ['Toronto', 'London', 'New York', np.nan, np.nan]
})
print(df)
# Returns:
# Name Age Location
# 0 Nik 33.0 Toronto
# 1 Kate 32.0 London
# 2 Evan 40.0 New York
# 3 Kyra 57.0 NaN
# 4 NaN NaN NaN
By printing out the DataFrame, we can see that we have three columns. Each column contains at least one missing value.
Understanding the Pandas isnull Method
Pandas comes with an incredibly helpful method, .isnull(), that identifies whether a value is missing or not. The method returns a boolean value, either True or False. We can apply the method either to an entire DataFrame or to a single column. The method will broadcast correctly to either the Series or the DataFrame, depending on what it’s applied to. Let’s take a quick look:
# Exploring the .isnull() method
print(df.isnull())
# Returns:
# Name Age Location
# 0 False False False
# 1 False False False
# 2 False False False
# 3 False False True
# 4 True True True
Since we applied the method to the entire DataFrame, each value was evaluated to see if it was missing or not missing. In the following sections, you’ll learn how to make use of this method to transform your DataFrame.
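The method can also be applied to a single column. As a small sketch using the same DataFrame, this returns a boolean Series rather than a DataFrame:
# A sketch: applying .isnull() to a single column
print(df['Location'].isnull())

# Returns:
# 0    False
# 1    False
# 2    False
# 3     True
# 4     True
# Name: Location, dtype: bool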
Counting Missing Values in a Pandas DataFrame
One of the first steps you’ll want to take is to understand how many missing values you actually have in your DataFrame. One way to do this is to chain together the .isnull() method and the .sum() method:
print(df.isnull().sum())
# Returns:
# Name 1
# Age 1
# Location 2
# dtype: int64
The reason that this works is that the value of True is actually represented by the value of 1, while False is represented by the value of 0. Because of this, we can apply the .sum() method to the DataFrame. This returns a Series containing the counts of missing items in each column.
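By default, the sum is taken column-wise. If you want to count missing values per row instead, you can pass in axis=1; a small sketch using the same DataFrame:
# A sketch: counting missing values per row instead of per column
print(df.isnull().sum(axis=1))

# Returns:
# 0    0
# 1    0
# 2    0
# 3    1
# 4    3
# dtype: int64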
How can you count non-missing data in a Pandas DataFrame?
Pandas comes with .notnull(), the negation of .isnull(). In order to count the data that isn’t missing in each column, you can chain together the .notnull() and .sum() methods. This returns a Series containing the counts of non-missing data in each column.
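As a quick sketch of this against the sample DataFrame loaded above:
# A sketch: counting non-missing values in each column
print(df.notnull().sum())

# Returns:
# Name        4
# Age         4
# Location    3
# dtype: int64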
Dropping Missing Data in a Pandas DataFrame
When working with missing data, it’s often good to do one of two things: either drop the records or find ways to fill the data. In this section, you’ll learn how to take on the former of the two. Pandas provides a method, .dropna(), which is used to drop missing data. Let’s take a look at the method:
# Exploring the Pandas .dropna() method
df.dropna(
    axis=0,          # Whether to drop rows (0) or columns (1)
    how='any',       # Drop a record if 'any' or 'all' of its values are missing
    thresh=None,     # The minimum number of non-missing values required to keep a record
    subset=None,     # Which columns to consider when looking for missing values
    inplace=False    # Whether to drop in place (i.e., without needing to re-assign)
)
Let’s see how some of these parameters can be used to modify the behaviour of the method. First, let’s simply apply the method with all default arguments and explore the results:
# Dropping Values with Default Arguments
print(df.dropna())
# Returns:
# Name Age Location
# 0 Nik 33.0 Toronto
# 1 Kate 32.0 London
# 2 Evan 40.0 New York
By default, Pandas will drop records where any value is missing. Because of this, it also removed the fourth row, where only one value was missing.
We can modify this behaviour so that a record is only removed if all of its values are missing. Let’s see how we can make this happen:
# Dropping Records Only if All Values are Missing
print(df.dropna(how='all'))
print(df)
# Returns:
# Name Age Location
# 0 Nik 33.0 Toronto
# 1 Kate 32.0 London
# 2 Evan 40.0 New York
# 3 Kyra 57.0 NaN
Similarly, we can use the following parameters to further modify how the method works, as shown in the sketch that follows the list:
- thresh=: the minimum number of non-missing values a record must have in order to be kept
- subset=: the names of the columns to look at when considering missing values
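A brief sketch of both parameters, assuming the original sample DataFrame is still loaded:
# A sketch of the thresh= and subset= parameters
# Keep only rows with at least two non-missing values (this drops only the empty final row)
print(df.dropna(thresh=2))

# Only consider the 'Location' column when deciding which rows to drop
print(df.dropna(subset=['Location']))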
Filling Missing Data in a Pandas DataFrame
Removing missing data also removes any associated data from those records. Because of this, it can be helpful to fill in missing values instead. You can do this using the .fillna() method. The method can be applied to either an entire DataFrame or to a single column. By passing a constant in as a parameter, all missing values are replaced with that value:
# Using .fillna() to Fill Missing Data
print(df.fillna(0))
# Returns:
# Name Age Location
# 0 Nik 33.0 Toronto
# 1 Kate 32.0 London
# 2 Evan 40.0 New York
# 3 Kyra 57.0 0
# 4 0 0.0 0
When working with different data types, it may not always be prudent to fill all missing data with the same value. Because of this, you can pass in a dictionary to fill data with different values. The keys of the dictionary represent the columns and the values represent the values to fill with:
# Filling Columns with Different Values
print(df.fillna({'Name': 'Someone', 'Age': 25, 'Location': 'USA'}))
# Returns:
# Name Age Location
# 0 Nik 33.0 Toronto
# 1 Kate 32.0 London
# 2 Evan 40.0 New York
# 3 Kyra 57.0 USA
# 4 Someone 25.0 USA
One very handy way to deal with missing values is to impute them. Imputing refers to filling missing values with estimated values. A simple example of this could be filling the missing age values with the average age, which we can do by passing in the mean for that column:
# Imputing a Missing Value
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
# Returns:
# Name Age Location
# 0 Nik 33.0 Toronto
# 1 Kate 32.0 London
# 2 Evan 40.0 New York
# 3 Kyra 57.0 NaN
# 4 NaN 40.5 NaN
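Because the mean is sensitive to outliers, imputing with the median is a common alternative. A minimal sketch of the same idea, applied to a column that still contains missing values:
# A sketch: imputing with the median, which is more robust to outliers
df['Age'] = df['Age'].fillna(df['Age'].median())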
In the following section, you’ll learn how to deal with duplicate data in a Pandas DataFrame.
Working with Duplicate Data in Pandas
Duplicate data can be introduced into a dataset for a number of reasons. Sometimes this data can be valid, while other times it can present serious problems in your data’s integrity. Because of this, it’s important to understand how to find and deal with duplicate data. Let’s load a sample dataset that contains different types of duplicate data:
# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', 'Nik', 'Kate'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Toronto', 'London', 'New York', 'Atlanta', 'Toronto', 'Paris'],
    'Date Modified': ['2022-01-01', '2022-02-24', '2022-08-12', '2022-09-12', '2022-01-01', '2022-12-09']
})
print(df)
# Returns:
# Name Age Location Date Modified
# 0 Nik 33 Toronto 2022-01-01
# 1 Kate 32 London 2022-02-24
# 2 Evan 40 New York 2022-08-12
# 3 Kyra 57 Atlanta 2022-09-12
# 4 Nik 33 Toronto 2022-01-01
# 5 Kate 32 Paris 2022-12-09
In the DataFrame you loaded above, some records are completely unique, while others are partially or completely duplicated. In the following section, you’ll learn how to identify duplicate records.
Identifying Duplicate Records in a Pandas DataFrame
Pandas provides a helpful method, .duplicated(), which allows you to identify duplicate records in a dataset. The method, similar to the .isnull() method you learned about above, returns boolean values. It returns a single boolean Series indicating whether each record is a duplicate of an earlier one:
# Identifying Duplicate Records in a Pandas DataFrame
print(df.duplicated())
# Returns:
# 0 False
# 1 False
# 2 False
# 3 False
# 4 True
# 5 False
# dtype: bool
On its own, this may not seem particularly useful. However, this method, as you’ll soon learn, allows us to drop duplicate records. What’s more, it allows us to count duplicate records. For example, we can simply add up the Series to determine how many duplicate records exist.
# Counting Duplicate Records in a DataFrame
print(df.duplicated().sum())
# Returns: 1
This allows you to understand the extent of duplicate records in a dataset. Knowing how many records are duplicate can give you a better sense of any potential data integrity issues.
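If you want to inspect the duplicated records themselves, one approach (a small sketch, beyond the original example) is to filter the DataFrame using keep=False, which marks every copy of a duplicate:
# A sketch: viewing every copy of the duplicated records
print(df[df.duplicated(keep=False)])

# Returns:
#   Name  Age Location Date Modified
# 0  Nik   33  Toronto    2022-01-01
# 4  Nik   33  Toronto    2022-01-01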
Removing Duplicate Data in a Pandas DataFrame
Pandas makes it easy to remove duplicate records using the .drop_duplicates() method. Let’s take a look at what parameters the method has available:
# The Pandas .drop_duplicates() method
df.drop_duplicates(
    subset=None,         # Which columns to consider
    keep='first',        # Which duplicate record to keep
    inplace=False,       # Whether to drop in place
    ignore_index=False   # Whether to relabel the index
)
Let’s see what happens when we apply the method with all default parameters:
# Dropping Duplicates with Default Arguments
df = df.drop_duplicates()
print(df)
# Returns:
# Name Age Location Date Modified
# 0 Nik 33 Toronto 2022-01-01
# 1 Kate 32 London 2022-02-24
# 2 Evan 40 New York 2022-08-12
# 3 Kyra 57 Atlanta 2022-09-12
# 5 Kate 32 Paris 2022-12-09
We can see that this returned a DataFrame in which the fully duplicated record was removed. Pandas kept the first occurrence of the duplicate (index 0) and dropped the repeated row (index 4).
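If you wanted to keep the last occurrence of each duplicate instead, you could pass in keep='last'; a quick sketch (passing keep=False would drop every copy):
# A sketch: keeping the last occurrence of each duplicate instead of the first
# (returns a new DataFrame; the original is left unchanged)
deduped = df.drop_duplicates(keep='last')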
Now, let’s see how we can expand the functionality of this method. We can see that we have two quite similar records: index 1 and index 5. They are duplicated across only two of the columns, and one of the records is more recent.
We can modify the behavior of the method to keep the most recent record by first sorting the data based on the last modified date. Then, we can ask Pandas to drop based on a subset of relevant columns. Let’s see what this looks like:
# Dropping Based on a Subset of Columns
df = df.sort_values(by='Date Modified', ascending=False)
df = df.drop_duplicates(subset=['Name', 'Age'], keep='first')
print(df)
# Returns:
# Name Age Location Date Modified
# 5 Kate 32 Paris 2022-12-09
# 3 Kyra 57 Atlanta 2022-09-12
# 2 Evan 40 New York 2022-08-12
# 0 Nik 33 Toronto 2022-01-01
Let’s break down what we did here:
- We sorted the data by 'Date Modified' in descending order. This places more recent records first.
- We then dropped duplicates based on 'Name' and 'Age', keeping the first instance.
In the next section, you’ll learn how to clean strings in Pandas.
Cleaning Strings in Pandas
One of the perks of working with Pandas is its strong ability to work with text data. This is made even more powerful by being able to take any string method and apply it directly to an entire array of data. In this section, you’ll learn how to trim white space, split strings into columns, and replace text in strings.
Let’s load a Pandas DataFrame that contains some string data to work with:
# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.DataFrame.from_dict({
    'Name': ['Tranter, Melvyn', 'Lana, Courtney', 'Abel, Shakti', 'Vasu, Imogene', 'Aravind, Shelly'],
    'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
    'Location': ['TORONTO', 'LONDON', 'New york', 'ATLANTA', 'toronto'],
    'Favorite Color': [' green ', 'red', ' yellow', 'blue', 'purple ']
})
print(df)
# Returns:
# Name Region Location Favorite Color
# 0 Tranter, Melvyn Region A TORONTO green
# 1 Lana, Courtney Region A LONDON red
# 2 Abel, Shakti Region B New york yellow
# 3 Vasu, Imogene Region C ATLANTA blue
# 4 Aravind, Shelly Region D toronto purple
We can see that our DataFrame has some messy string data! For example, some columns contain multiple data points (first and last name), while others contain redundant data (the word ‘Region’), messy capitalization (the locations), or added whitespace (the favorite colors).
In order to apply string methods to an entire Pandas Series, you need to use the .str accessor, which gives you access to vectorized string methods.
Trimming White Space in Pandas Strings
Let’s start off by removing whitespace from text in Pandas. We can see that the column 'Favorite Color' has extra whitespace on either end of the color. Python comes with a number of methods to strip whitespace from the front of a string, the back of a string, or both ends. Because the whitespace exists on both ends of the string, we will make use of the .strip() method.
# Trimming Whitespace from a Pandas Column
df['Favorite Color'] = df['Favorite Color'].str.strip()
print(df)
# Returns:
# Name Region Location Favorite Color
# 0 Tranter, Melvyn Region A TORONTO green
# 1 Lana, Courtney Region A LONDON red
# 2 Abel, Shakti Region B New york yellow
# 3 Vasu, Imogene Region C ATLANTA blue
# 4 Aravind, Shelly Region D toronto purple
Here, we were able to successfully strip whitespace from a column by re-assigning it to itself.
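If the whitespace only existed on one side of the string, you could use the one-sided variants instead. A minimal sketch:
# A sketch: one-sided whitespace stripping
df['Favorite Color'] = df['Favorite Color'].str.lstrip()  # strips leading whitespace only
df['Favorite Color'] = df['Favorite Color'].str.rstrip()  # strips trailing whitespace only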
Splitting Strings into Columns in Pandas
The 'Name' column contains both the person’s last and first names. In many cases, you may want to split this column into two – one each for the first and last name. This approach will work a little differently, as we will want to assign two columns, rather than just one.
Let’s see what happens when we apply the .str.split() method to the 'Name' column:
# Applying .split on a column
print(df['Name'].str.split(','))
# Returns:
# 0 [Tranter, Melvyn]
# 1 [Lana, Courtney]
# 2 [Abel, Shakti]
# 3 [Vasu, Imogene]
# 4 [Aravind, Shelly]
# Name: Name, dtype: object
We can see that this returned a Series of lists of strings. What we want to do, however, is assign these values to multiple columns. In order to do this, we need to pass in the expand=True argument to instruct Pandas to split the values into separate columns. From there, we can assign the values into two columns:
# Splitting a Column into Two Columns
df[['Last Name', 'First Name']] = df['Name'].str.split(',', expand=True)
print(df)
# Returns:
# Name Region Location Favorite Color Last Name First Name
# 0 Tranter, Melvyn Region A TORONTO green Tranter Melvyn
# 1 Lana, Courtney Region A LONDON red Lana Courtney
# 2 Abel, Shakti Region B New york yellow Abel Shakti
# 3 Vasu, Imogene Region C ATLANTA blue Vasu Imogene
# 4 Aravind, Shelly Region D toronto purple Aravind Shelly
Make note here of the use of the double square brackets. This is required since we are passing in a list of columns we want to create!
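One thing worth noting: because we split on ',' alone, the first names keep a leading space. A quick sketch to clean that up:
# A sketch: the split on ',' leaves a leading space on the first names
df['First Name'] = df['First Name'].str.strip()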
Replacing Text in Strings in Pandas
In the 'Region' column, the word “Region” is redundant. In this example, you’ll learn how to replace some text in a column. In particular, you’ll learn how to remove a given substring from a larger string. For this, we can use the aptly-named .str.replace() method. The method takes the string we want to replace and the string we want to substitute it with. Because we want to remove a substring, we’ll simply pass in an empty string to substitute with.
# Replacing a Substring in Pandas
df['Region'] = df['Region'].str.replace('Region ', '')
print(df)
# Returns:
# Name Region Location Favorite Color
# 0 Tranter, Melvyn A TORONTO green
# 1 Lana, Courtney A LONDON red
# 2 Abel, Shakti B New york yellow
# 3 Vasu, Imogene C ATLANTA blue
# 4 Aravind, Shelly D toronto purple
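If you need more flexible matching, .str.replace() can also accept a regular expression when you pass in regex=True. A small sketch (the pattern here is just one way to express the same cleanup):
# A sketch: the same cleanup using a regular expression
df['Region'] = df['Region'].str.replace(r'^Region\s+', '', regex=True)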
Changing String Case in Pandas
In this section, we’ll learn how to fix the odd and inconsistent casing that exists in the 'Location' column. Pandas provides access to a number of methods that allow us to change the case of strings:
- .upper() will convert a string to all upper case
- .lower() will convert a string to all lower case
- .title() will convert a string to title case
In this case, we want our locations to be in title case, so we can apply the .str.title() method to the column:
# Changing Text to Title Case in Pandas
df['Location'] = df['Location'].str.title()
print(df)
# Returns:
# Name Region Location Favorite Color
# 0 Tranter, Melvyn Region A Toronto green
# 1 Lana, Courtney Region A London red
# 2 Abel, Shakti Region B New York yellow
# 3 Vasu, Imogene Region C Atlanta blue
# 4 Aravind, Shelly Region D Toronto purple
We can see that, by applying the .str.title() method, every word was capitalized.
Exercises
It’s time to check your learning! Try to solve the exercises below. If you want to verify your work, a sample solution follows each question. Load the sample DataFrame below to answer the questions:
# Loading a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
    'Name': ['Tranter; Melvyn', 'Lana; Courtney', 'Abel; Shakti', 'Vasu; Imogene', 'Aravind; Shelly', 'Tranter; Melvyn'],
    'Location': ['TORONTO', 'LONDON', 'New york', np.nan, 'toronto', 'Madrid'],
    'Sales': [123, 243, 654, np.nan, 345, np.nan]
})
Create a First Name and a Last Name column. Note that there is a semi-colon between names.
df[['Last Name', 'First Name']] = df['Name'].str.split(';', expand=True)
Drop any duplicate records based only on the Name column, keeping the last record.
df = df.drop_duplicates(subset='Name', keep='last')
Calculate the percentage of missing records in each column.
Divide the output of df.isnull().sum() by the length of the DataFrame:
print(df.isnull().sum() / len(df))
# Returns:
# Name 0.000000
# Location 0.166667
# Sales 0.333333
# dtype: float64
Conclusion and Recap
In this tutorial, you learned how to use Pandas for data cleaning! The section below provides a quick recap of what you learned in this tutorial:
- Pandas provides a large variety of methods aimed at manipulating and cleaning your data
- Missing data can be identified using the .isnull() method. Missing data can be counted per column by chaining in the .sum() method, which returns a Series of counts per column.
- Missing data can be removed using the .dropna() method, which can be customized using different parameters
- Duplicate data can be found using the .duplicated() method
- The .drop_duplicates() method can be used to drop duplicate records, providing significant customization over which records to consider duplicates and which records to keep
- Pandas can access Python string methods using the .str accessor. This can be helpful in applying vectorized cleaning methods to messy text data.