How to Drop Duplicates in Pandas


When working with data in Pandas, you’ll often need to drop duplicate records. In this post, you’ll learn how to drop duplicates in Pandas, including how to limit the check to certain columns and how to choose which record to keep.

Loading our Dataset

Throughout this tutorial, we’ll use a sample dataset. To get started, let’s load Pandas and create a dataframe:

import pandas as pd
df = pd.DataFrame.from_dict({'Name': ['Nik', 'Evan', 'Sam', 'Nik', 'Sam'], 'Age': [30, 31, 29, 30, 30], 'Height':[180, 185, 160, 180, 160]})
print(df)

This returns the following table:

   Name  Age  Height
0   Nik   30     180
1  Evan   31     185
2   Sam   29     160
3   Nik   30     180
4   Sam   30     160

Pandas Drop Duplicates

To remove duplicates in Pandas, you can use the .drop_duplicates() method. By default, it treats a row as a duplicate only when its values match an earlier row in every column, and it keeps the first occurrence:

# Store the result in a new variable so the original df stays available
deduped = df.drop_duplicates()
print(deduped)

This returns the following dataframe:

   Name  Age  Height
0   Nik   30     180
1  Evan   31     185
2   Sam   29     160
4   Sam   30     160
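
If you want to see which rows Pandas considers duplicates before removing them, the related .duplicated() method returns a boolean mask. The following is a minimal sketch, assuming the same sample dataframe as above; the variable names mask and deduped are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Evan', 'Sam', 'Nik', 'Sam'],
    'Age': [30, 31, 29, 30, 30],
    'Height': [180, 185, 160, 180, 160],
})

# True marks a row that repeats an earlier row in every column
mask = df.duplicated()
print(mask.tolist())

# Keeping only the rows that are not flagged gives the same result
# as calling df.drop_duplicates() with its default arguments
deduped = df[~mask]
print(deduped)
```

Only the row at index 3 is flagged, since it is the only row that matches an earlier row in all three columns.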

Drop Duplicates of Certain Columns in Pandas

By default, Pandas only treats rows as duplicates when the values in every column match. If you want to remove records that match in only some columns, you can use the subset argument.

For example, if you wanted to remove duplicate rows based only on the Name column, you could write:

# Print the result directly so the original df is not overwritten
print(df.drop_duplicates(subset='Name'))

This returns the following:

   Name  Age  Height
0   Nik   30     180
1  Evan   31     185
2   Sam   29     160

The subset argument also accepts a list of columns. In that case, a row is treated as a duplicate only if it matches an earlier row in every column in the list.

For example, you can remove rows that share the same values in both the Name and Age columns by writing:

# Print the result directly so the original df is not overwritten
print(df.drop_duplicates(subset=['Name', 'Age']))

This returns:

   Name  Age  Height
0   Nik   30     180
1  Evan   31     185
2   Sam   29     160
4   Sam   30     160

Keep First or Last Value – Pandas Drop Duplicates

When removing duplicates, Pandas gives you the option of choosing which record to keep. The keep argument accepts ‘first’ (the default) and ‘last’, which keep either the first or last instance of a duplicated record. You can combine this with sorting the data first, so that the record you want to retain ends up in the first or last position.

If you wanted to keep the first record when removing duplicates based on the Name column, you could write:

# Print the result directly so the original df is not overwritten
print(df.drop_duplicates(subset='Name', keep='first'))

This returns:

   Name  Age  Height
0   Nik   30     180
1  Evan   31     185
2   Sam   29     160
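
To illustrate combining sorting with keep=‘last’, here is a minimal sketch that keeps the record with the highest Age for each Name. It assumes the same sample dataframe as above; the variable names sorted_df and oldest_per_name are illustrative, and a stable sort (kind='mergesort') is used so that rows with equal ages keep their original relative order.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Evan', 'Sam', 'Nik', 'Sam'],
    'Age': [30, 31, 29, 30, 30],
    'Height': [180, 185, 160, 180, 160],
})

# Sort by Age so that, within each Name, the highest Age comes last.
# A stable sort keeps tied rows in their original order.
sorted_df = df.sort_values('Age', kind='mergesort')

# keep='last' retains the final occurrence of each Name in the sorted order
oldest_per_name = sorted_df.drop_duplicates(subset='Name', keep='last')

# Restore the original row order for display
print(oldest_per_name.sort_index())
```

With this approach, the Sam record with an Age of 30 is retained instead of the one with an Age of 29, because it comes last after sorting.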

Conclusion

In this post, you learned how to remove duplicates in Pandas, including removing duplicates based on a subset of columns, and identifying whether to keep the first or last instance.

To learn more about the .drop_duplicates() method, check out the official Pandas documentation for DataFrame.drop_duplicates.
