In this tutorial, you’ll learn how to use the Pandas dropna() method to drop missing values in a Pandas DataFrame. Working with missing data is one of the essential skills in cleaning your data before analyzing it. Because data cleaning can take up to 80% of a data analyst’s / data scientist’s time, being able to do this work effectively and efficiently is an important skill.
By the end of this tutorial, you’ll have learned:
- How to use the Pandas
.dropna()
method effectively - How to drop rows missing (NaN) values in Pandas
- How to drop columns missing (NaN) values in Pandas
- How to use the Pandas
.dropna()
method only on specific columns - How to set thresholds when dropping missing values in a Pandas DataFrame
- How to fix common errors when working with the Pandas
.dropna()
method
Table of Contents
Understanding the Pandas dropna() Method
The Pandas .dropna()
method is an essential method for a data analyst or data scientist of any level. Because cleaning data is an essential preprocessing step, knowing how to work with missing data will make you a stronger programmer.
Before diving into how to use the method, let’s take a minute to understand how the Pandas .dropna()
method works. We can do this by taking a look at the parameters and default arguments that method provides:
# Understanding the Pandas .dropna() Method
import pandas as pd
df = pd.DataFrame()
df.dropna(
axis=0,
how='any',
thresh=None,
subset=None,
inplace=False
)
We can see that the Pandas .dropna()
method offers five different parameters. All of these parameters have default arguments provided. This means that you can simply call the method and it will execute.
However, understanding what the different parameters do will ensure that you get the result you’re hoping for! Let’s break these parameters down a little further:
Argument | Description | Default Value | Accepted Values |
---|---|---|---|
axis= | Determines whether rows or columns are removed if they contain missing data. | 0 | {0, 1, ‘index’, ‘columns’} |
how= | Determines whether a column or row must be missing any or all values in order to be removed. | ‘any’ | {‘any’, ‘all’} |
thresh= | How many non-NA values are required for a column or row to be dropped. | None | integers |
subset= | Labels along the other axis to consider when checking for missing values. | None | column labels or sequence of labels |
inplace= | Whether to complete the operation in place or not. | False | booleans |
.dropna()
parameters and default argumentsBased on these parameters, we can see that the Pandas .dropna()
method offers a lot of flexibility in how to drop records with missing values.
Now that you have a strong understanding of what’s possible with the method, let’s dive into using the method to drop all rows with missing data.
Loading a Sample Pandas DataFrame
In order to follow along with this tutorial, I have provided a sample Pandas DataFrame. If you’re not using your own dataset, feel free to copy and paste the code below into your code editor of choice.
# Loading a Sample Pandas DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
print(df.head())
# Returns:
# Name Age Active Country Missing
# 0 Evan 36.0 True USA NaN
# 1 Kyra NaN False Canada NaN
# 2 Kate 33.0 NaN Canada NaN
# 3 Nik 27.0 True USA NaN
# 4 NaN NaN NaN NaN NaN
We can see that the dataset has five different columns, some of which contain some or all missing values. In the following section, you’ll learn how to drop rows containing missing data.
How to Drop Rows with Missing Data in Pandas Using .dropna()
The Pandas dropna()
method makes it very easy to drop all rows with missing data in them. By default, the Pandas dropna()
will drop any row with any missing record in it. This is because the how=
parameter is set to 'any'
and the axis=
parameter is set to 0
.
Let’s see what happens when we apply the .dropna()
method to our DataFrame:
# Applying the .dropna() Method to Our DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df = df.dropna()
print(df.head())
# Returns:
# Empty DataFrame
# Columns: [Name, Age, Active, Country, Missing]
# Index: []
Because every record in our DataFrame contains a missing value, all of the records in our DataFrame are removed.
We can modify the behavior of the function to only drop records where all values are missing by passing how='all'
. Let’s see what this looks like:
# Dropping Records using .dropna() If All Records Are Missing
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df = df.dropna(how='all')
print(df.head())
# Returns:
# Name Age Active Country Missing
# 0 Evan 36.0 True USA NaN
# 1 Kyra NaN False Canada NaN
# 2 Kate 33.0 NaN Canada NaN
# 3 Nik 27.0 True USA NaN
By modifying the method’s behavior, we were able to instruct Pandas to only drop records where all values were missing.
In the next section, you’ll learn how to instruct Pandas to only check for missing records in some columns.
How to Use Pandas dropna() With A Subset or Specific Columns
Pandas makes it very easy to use the .dropna()
method to check only a subset of columns or a specific column. This can be incredibly helpful when some columns are more critical to your analysis than others.
In order to specify which columns to check for, you can use the subset=
parameter, which takes either a single column label or a list of column labels.
Let’s see how we can first check only for missing values in a single column, the 'Age'
column:
# Dropping Records in Pandas Using .dropna() Based on a Single Column
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df = df.dropna(subset='Age')
print(df.head())
# Returns:
# Name Age Active Country Missing
# 0 Evan 36.0 True USA NaN
# 2 Kate 33.0 NaN Canada NaN
# 3 Nik 27.0 True USA NaN
In the code block above, we passed a single string into the subset=
parameter. Because we only wanted to check for missing values in a single column, a string worked. In this case, Pandas ignored that there were missing values in the other columns.
Now, let’s see how we can check for missing values across multiple columns. In order to do this, we will need to pass a list (or sequence) of column labels into the subset=
parameter. Let’s see how we can check across the 'Age'
and 'Active'
columns:
# Dropping Records in Pandas Using .dropna() Based on Multiple Columns
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df = df.dropna(subset=['Age', 'Active'])
print(df.head())
# Returns:
# Name Age Active Country Missing
# 0 Evan 36.0 True USA NaN
# 3 Nik 27.0 True USA NaN
In this case, we were able to check for missing records in two columns. In the following section, you’ll learn how to drop columns with missing data in a Pandas DataFrame.
How to Drop Columns with Missing Data in Pandas Using .dropna()
Dropping columns in a Pandas DataFrame works in a very similar way to dropping records. In order to this, we simply set the axis=
parameter to either 1
or 'columns'
. While dropping columns based on missing data is, perhaps, a less common exercise, it’s important to know how to do this.
Let’s see what happens when we pass in axis=1
into our .dropna()
method call:
# Dropping All Columns with Missing Data
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df = df.dropna(axis=1)
print(df.head())
# Returns:
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4]
Because every column contained some missing data, all of the columns were dropped. At this point, only an empty DataFrame and the indices remain.
We can modify the behavior to only drop columns where all of the values are missing by passing in how='any'
.
# Dropping Columns with All Missing Data
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df = df.dropna(axis=1, how='all')
print(df.head())
# Returns:
# Name Age Active Country
# 0 Evan 36.0 True USA
# 1 Kyra NaN False Canada
# 2 Kate 33.0 NaN Canada
# 3 Nik 27.0 True USA
# 4 NaN NaN NaN NaN
By modifying the behavior of the method, we were able to drop only columns where all the records were missing.
In many cases, you’ll want to drop records or columns based on some threshold of values – this is what you’ll learn in the following section.
How to Set a Threshold of Missing Values for Dropping Records in Pandas
By using the thresh=
parameter, you’re able to set a minimum number of values that need to be not missing for a record to be dropped. This allows you to specify how many data points each record must have, at a minimum, for it be kept in the dataset.
Because the Pandas .dropna()
method will drop records by default, we can simply apply the thresh=
parameter to run our method. Let’s see how we can specify that we want at least four data points per record:
# Setting a Minimum Threshold for Keeping Records
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df = df.dropna(thresh=4)
print(df.head())
# Returns:
# Name Age Active Country Missing
# 0 Evan 36.0 True USA NaN
# 3 Nik 27.0 True USA NaN
In the example above, only records where at least four of the values were not missing were kept. In the following section, you’ll learn how to use this parameter to drop columns with a given number of missing records.
Dropping Columns With a Given Number of Missing Records
Similar to the example above, you may want to drop entire columns if a set number of records is missing. Because we can use the .dropna()
method to also drop columns, we can combine this with the thresh=
parameter to drop columns.
Because the tresh=
parameter works by specifying the number of records that must be not-null, we need to specify that as the value.
Let’s see how we can drop columns that don’t have at least 4 non-missing data points:
# Dropping Columns Based on Missing Data
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df = df.dropna(axis=1, thresh=4)
print(df.head())
# Returns:
# Name Country
# 0 Evan USA
# 1 Kyra Canada
# 2 Kate Canada
# 3 Nik USA
# 4 NaN NaN
In many cases, however, you will want to drop columns based on a percentage of missing values. We can do this using the same method, but simply calculate the value in our method call.
Let’s see how we can drop columns where at least 10% of values are missing:
# Dropping Columns Where At Least 10% of Values are Missing
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df = df.dropna(axis=1, thresh=int(0.9*len(df)))
print(df.head())
# Returns:
# Name Country
# 0 Evan USA
# 1 Kyra Canada
# 2 Kate Canada
# 3 Nik USA
# 4 NaN NaN
In the example above, we calculate the thresh=
parameter in the method call itself by calculating the minimum number of non-missing records. Because the thresh=
parameter looks at how many records are not missing, we multiply the length of the DataFrame by 1 - percentage
.
How to Drop Missing Values in Pandas In Place
In the examples above, we dropped records or columns by re-assigning the DataFrame to itself. However, if you want to drop the records or columns in place, you can do this by setting inplace=True
.
By modifying the DataFrame in place, the method doesn’t return anything. Instead, it modifies the underlying DataFrame itself.
Let’s see how we can drop missing rows where all the records are missing, in place:
# Dropping Records in a DataFrame In Place
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Evan', 'Kyra', 'Kate', 'Nik', np.NaN],
'Age': [36, np.NaN, 33, 27, np.NaN],
'Active': [True, False, np.NaN, True, np.NaN],
'Country': ['USA', 'Canada', 'Canada', 'USA', np.NaN],
'Missing': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df.dropna(how='all', inplace=True)
print(df.head())
# Returns:
# Name Age Active Country Missing
# 0 Evan 36.0 True USA NaN
# 1 Kyra NaN False Canada NaN
# 2 Kate 33.0 NaN Canada NaN
# 3 Nik 27.0 True USA NaN
In the code block above, we dropped records in place. This avoids needing to re-assign the DataFrame to itself.
Frequently Asked Questions
The Pandas .dropna()
method is used to drop either records or columns with missing data. The method gives you flexibility in terms of how the records are dropped.
The Pandas .dropna()
method may not work if the DataFrame isn’t reassigned to itself or if the method isn’t done in place. If you’re not dropping values in place, you need to re-assign the DataFrame to itself.
The .dropna()
method doesn’t offer an option to reset the index. However, you can chain the resetting using the .reset_index()
method. For example, you could write: df.dropna().reset_index()
.
Conclusion
In this post, you learned how to use the Pandas .dropna()
method to drop missing records or columns in a DataFrame. You first learned about the different parameters available in the method. Then, you learned how to use the method to drop records with missing data as well as columns with missing data. Then, you learned how to use the method to drop records based on a threshold of not missing values and how to modify the DataFrame in place.
Additional Resources
To learn more about related topics, check out the tutorials below: