In this tutorial, you’ll learn how to use the Pandas drop_duplicates method to drop duplicate records in a DataFrame. Understanding how to work with duplicate values is an important skill for any data analyst or data scientist. Because data cleaning can take up to 80% of the time of an analytics project, knowing how to work with duplicate values can make your analytics process faster.
By the end of this tutorial, you’ll have learned:
- How the Pandas
.drop_duplicates()
method works - How to customize which record to keep when dropping duplicates
- How to specify which columns to search for duplicates in
- How to keep specific records, such as those with the max value
- How to drop duplicate records in place in Pandas
Table of Contents
Understanding the Pandas drop_duplicates() Method
Before diving into how the Pandas .drop_duplicates()
method works, it can be helpful to understand what options the method offers. Let’s first take a look at the different parameters and default arguments in the Pandas .drop_duplicates()
method:
# Understanding the Pandas .drop_duplicates Method
import pandas as pd
df = pd.DataFrame()
df.drop_duplicates(
subset=None,
keep='first',
inplace=False,
ignore_index=False
)
From the code block above, we can see that the method offers four parameters, each with a default argument provided. This means that we can simply call the method without needing to provide any additional information.
It’s important to understand what these parameters do. The table below breaks down the behavior of each of these parameters:
Parameter | Description | Default Argument | Potential Values |
---|---|---|---|
subset= | Which column(s) to consider when identifying duplicate records | None | Column label or sequence of column labels. The default value of None will consider all columns. |
keep= | Which duplicate record to keep | ‘first’ | {‘first’, ‘last’, False} |
inplace= | Whether to drop duplicates in place or to return a copy of the resulting DataFrame | False | Boolean |
ignore_index= | Whether to relabel the resulting index axis | False | Boolean |
.drop_duplicates()
methodNow that you have a strong understanding of the different parameters that the method provides, let’s dive into how to use the method to drop duplicate records in Pandas.
Loading a Sample Pandas DataFrame
If you’re not using your own dataset, I have provided a sample DataFrame below that you can use to follow along. Simply copy and paste the code below into your code editor of choice:
# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
print(df.head())
# Product Location Amount
# 0 A USA 100
# 1 A Canada 125
# 2 A USA 100
# 3 B USA 200
# 4 B USA 175
In the code block above, we loaded a sample Pandas DataFrame with three columns. We can see that we have a number of records that are either duplicate records across all columns or only a subset of columns.
In the following section, you’ll learn how to start using the Pandas .drop_duplicates()
method to drop duplicates across all columns.
Using Pandas drop_duplicates to Keep the First Row
In order to drop duplicate records and keep the first row that is duplicated, we can simply call the method using its default parameters. Because the keep=
parameter defaults to 'first'
, we do not need to modify the method to behave differently.
Let’s see what this looks like in Python:
# Using Pandas .drop_duplicates() to Keep the First Record
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
df = df.drop_duplicates()
print(df.head())
# Returns:
# Product Location Amount
# 0 A USA 100
# 1 A Canada 125
# 3 B USA 200
# 4 B USA 175
We can see from the code block above that the record at index 2 was dropped. Because the records at positions 0 and 2 were complete duplicates, the default behavior of the method was to keep the first instance.
In the following section, you’ll learn how to drop duplicates that are identified across a subset of specific columns.
Use Pandas drop_duplicates to Check Across Specific Columns
In some cases, you’ll only want to drop duplicate records across specific columns. Thankfully, the Pandas .drop_duplicates()
method makes this an easy task! In order to this, you can pass either a single column label or a list of columns into the subset=
parameter.
Let’s see how we can scan only across the ['Product', 'Location']
columns:
# Dropping Duplicates Based on a Subset of Columns
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
df = df.drop_duplicates(subset=['Product', 'Location'])
print(df.head())
# Returns:
# Product Location Amount
# 0 A USA 100
# 1 A Canada 125
# 3 B USA 200
In the example above, we considered only two of the columns when dropping duplicates. Because of this, the values that were different in the Amount column were ignored. Recall, that by default, Pandas will keep the first record it encounters. Because of this, records 0 and 3 were kept, while 2 and 4 were dropped.
In the next section, you’ll learn how to customize this behavior and keep the last row when dropping duplicate records.
Using Pandas drop_duplicates to Keep the Last Row
Pandas also allows you to easily keep the last instance of a duplicated record. This behavior can be modified by passing in keep='last'
into the method. This is more intuitive when dropping duplicates based on a subset of columns. Let’s take a look at an example:
# Keeping Last Records When Dropping Duplicates
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
df = df.drop_duplicates(subset=['Product', 'Location'], keep='last')
print(df.head())
# Returns:
# Product Location Amount
# 1 A Canada 125
# 2 A USA 100
# 4 B USA 175
In the example above, we passed in a subset of columns to check and identified that we wanted to keep the last record of a duplicated set of records. Because of this, we dropped the first (as well as any subsequent) instance of a duplicated row.
How to Remove All Duplicate Rows in Pandas
In some cases, you’ll want to drop every record that is duplicated, rather than keeping the first or the last record. This can also be controlled using the keep=
parameter, though you’ll pass in False
.
Let’s see how you can remove all duplicated rows in Pandas:
# Drop All Duplicated Rows in Pandas
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
df = df.drop_duplicates(keep=False)
print(df.head())
# Returns:
# Product Location Amount
# 1 A Canada 125
# 3 B USA 200
# 4 B USA 175
We can see in the code block above, that by passing keep=False
into the method, none of the duplicated records were kept.
This can also be done using a subset of columns, where duplicates are evaluated only based on some columns.
Let’s see what this would look like:
# Drop All Duplicated Rows in Pandas Using a Subset
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
df = df.drop_duplicates(keep=False, subset=['Product', 'Location'])
print(df.head())
# Returns:
# Product Location Amount
# 1 A Canada 125
In the code above, any records where Product and Location were duplicated were dropped. In the following section, you’ll learn how to keep the row with the maximum value in a given column.
Use Pandas drop_duplicates to Keep Row with Max Value
Pandas doesn’t provide out-of-the-box functionality to keep a row with the maximum value in a column. However, we can combine the .drop_duplicates()
method with the .sort_values()
method to achieve this.
If we wanted to remove duplicates based on the Location and Product column, but keep the record with the highest value in the Amount column, we could write the following:
# Use Pandas drop_duplicates to Keep Row with Max Value
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
df = df.sort_values(by='Amount', ascending=False).drop_duplicates(
keep='first', subset=['Product', 'Location'])
print(df.head())
# Returns:
# Product Location Amount
# 3 B USA 200
# 1 A Canada 125
# 0 A USA 100
Let’s break down what we did in the code above:
- We first applied the
.sort_values()
method, sorting by the Amount column in descending order (meaning larger values will come before smaller values) - Then, we applied the
.drop_duplicates()
method, searching for duplicates in the Product and Location columns. We also identified that we wanted to keep the first record (which in this case would be the larger Amount value).
We can also use this to keep the row with the minimum value, by simply switching the sort order. Let’s take a look at what this would look like:
# Use Pandas drop_duplicates to Keep Row with Min Value
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
df = df.sort_values(by='Amount') \\
.drop_duplicates(keep='first', subset=['Product', 'Location'])
print(df.head())
# Returns:
# Product Location Amount
# 0 A USA 100
# 1 A Canada 125
# 4 B USA 175
In the code block above, we applied the .sort_values()
method with its default ascending=True
argument. This sorted values from smallest to largest, allowing us to keep the smallest value in the Amount column. Alternatively, we could have asked Pandas to sort in descending order and keep the last record.
Use Pandas to Remove Duplicate Records In Place
The Pandas .drop_duplicates()
method also provides the option to drop duplicate records in place. This means that the DataFrame is modified and nothing is returned. In the previous sections, we’ve dropped duplicate records by reassigning the DataFrame to itself.
Let’s see how we can drop duplicate records in Pandas in place:
# Drop Duplicate Records In Place
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
df.drop_duplicates(inplace=True)
print(df.head())
# Returns:
# Product Location Amount
# 0 A USA 100
# 1 A Canada 125
# 3 B USA 200
# 4 B USA 175
In the section above, we dropped the duplicate records in place. We did this by modifying the DataFrame directly. This means that the method returns nothing (or, more specifically, None
), so we do not need to reassign it to itself.
How to Reset an Index When Dropping Duplicate Records in Pandas
We can also reset the index of the resulting DataFrame when dropping duplicates using the Pandas .drop_duplicates()
method. This allows us to relabel the index axis from 0, 1, … n-1.
Let’s see how we can reset an index when dropping duplicate records:
# Reset an Index When Dropping Duplicate Records
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'A', 'A', 'B', 'B'],
'Location': ['USA', 'Canada', 'USA', 'USA', 'USA'],
'Amount': [100, 125, 100, 200, 175]})
df = df.drop_duplicates(ignore_index=True)
print(df.head())
# Returns:
# Product Location Amount
# 0 A USA 100
# 1 A Canada 125
# 2 B USA 200
# 3 B USA 175
We can see in the code block above that the values of the original indices are changed when we pass ignore_index=True
into the method.
Frequently Asked Questions
To drop duplicate records in Pandas, you can apply the .drop_duplicates()
method to the DataFrame. By default, this drops any duplicate records across all columns.
You can customize which columns are considered when dropping duplicate records using the subset=
parameter, which allows you to pass in a single column label or a list of column labels.
By default, the Pandas .drop_duplicates()
method returns a copy of the modified DataFrame. Because of this it needs to be reassigned to another variable. If you want to modify the DataFrame in place, you can set inplace=True
.
Conclusion
In this tutorial, you learned how to drop duplicate records in Pandas using the .drop_duplicates()
method. You first learned how to understand the different parameters and default arguments of the method. Then, you learned how to use the method to drop duplicate records, based on all or some columns.
Following that, you learned how to use different parameters to further customize the behavior of the method. This included modifying which record to keep, how to reset the DataFrame index, and how to drop duplicate records in place.
Additional Resources
To learn more about related topics, check out the tutorials below:
Pingback: Data Cleaning and Preparation in Pandas and Python • datagy