Pandas fillna: A Guide for Tackling Missing Data in DataFrames

Welcome to our comprehensive guide on using the Pandas fillna method! Handling missing data is an essential step in the data-cleaning process. It ensures that your analysis provides reliable, accurate, and consistent results. Luckily, using the Pandas .fillna() method can make dealing with those pesky “NaN” or “null” values a breeze. In this tutorial, we’ll delve deep into .fillna(), covering its parameters, usage, and various ways to help you maintain the integrity of your data.

By the end of this tutorial, you’ll have learned the following:

  • What the Pandas .fillna() method is and why it’s crucial in handling missing data
  • Detailed descriptions and use cases for each .fillna() parameter
  • Different ways to fill missing data using .fillna(), such as forward fill or backward fill
  • Answers to frequently asked questions regarding the usage of .fillna()

Want to learn how to drop missing data instead? Check out my complete guide to the Pandas dropna method, which provides huge flexibility in dropping missing data.

Understanding the Pandas fillna() Method

Before diving into using the Pandas .fillna() method, let’s take a moment to understand how the method works. The code block below breaks down the different available parameters of the method:

# Understanding the Pandas fillna() Method
import pandas as pd
df = pd.DataFrame()
# Method signature, shown for reference only (the bare * marks keyword-only parameters):
# df.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=None)

Let’s explore the parameters and default arguments a little further. The table below breaks down each of the parameters of the .fillna() method as well as their default values and accepted values:

| Parameter | Description | Default Value | Accepted Values |
|---|---|---|---|
| value | Specifies the value(s) to be used to fill missing data | None | Scalar, dict, Series, or DataFrame |
| method | Indicates the method to fill missing data (forward fill or backward fill) | None | 'pad' or 'ffill' (forward fill), 'bfill' or 'backfill' (backward fill) |
| axis | Determines the axis along which to fill missing values (rows or columns) | None (behaves like 0) | 0 (index/rows) or 1 (columns) |
| inplace | If True, will fill missing data in-place without creating a new DataFrame | False | True, False |
| limit | Sets the maximum number of consecutive missing values to fill in | None | Positive integer value |
| downcast | Optionally provide a dictionary for downcasting filled values to datatypes | None | Dictionary in the form {'column name': 'datatype'} or {'column name': 'infer'} for inferring datatype from the other values |
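
The axis= parameter is the one you are least likely to have seen in practice, so a small sketch may help. The example below is an illustration only: it uses a hypothetical numeric DataFrame (the q1, q2, and q3 column names are made up) and the .ffill() shortcut, which accepts the same axis= argument.

```python
import pandas as pd

# Hypothetical data: two rows of quarterly values with gaps
scores = pd.DataFrame({
    "q1": [1.0, None],
    "q2": [None, 5.0],
    "q3": [3.0, None],
})

# axis=0 fills down each column, using the value from the row above
filled_down = scores.ffill(axis=0)

# axis=1 fills across each row, using the value from the column to the left
filled_across = scores.ffill(axis=1)

print(filled_down)
print(filled_across)
```

Filling down leaves the first row's q2 value missing, because there is no earlier row to copy from. Filling across leaves the second row's q1 value missing for the same reason.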

Now that you have learned about the different parameters of the .fillna() method, let’s start learning how to use the method.

Loading a Sample Pandas DataFrame

Let’s take a look at the DataFrame that we’ll be using for this tutorial. I have kept the dataset simple on purpose. In my experience, when learning something new, it’s best to start simple and build to more complex use cases after.

Let’s load a DataFrame by passing in a dictionary of data:

# Load a Sample Dataset
import pandas as pd
df = pd.DataFrame({
    "Name": ['Alice', 'Bob', None, 'David', None, 'Fiona', 'George'],
    "Age": [25, None, 23, 35, None, 31, 28],
    "Gender": ['F', 'M', 'M', None, 'F', 'F', 'M'],
    "Years": [3, None, None, None, 7, None, 2]
})

print(df.head())

# Returns:
#     Name   Age Gender  Years
# 0  Alice  25.0      F    3.0
# 1    Bob   NaN      M    NaN
# 2   None  23.0      M    NaN
# 3  David  35.0   None    NaN
# 4   None   NaN      F    7.0

We can see that we have four columns, each of which contains at least one missing value. Let’s now dive into how to use the .fillna() method to fill missing data in a single column.
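
Before filling anything, it can help to confirm how many values are missing in each column. Here is a minimal sketch that rebuilds the same sample data and counts the gaps with .isna().sum():

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ['Alice', 'Bob', None, 'David', None, 'Fiona', 'George'],
    "Age": [25, None, 23, 35, None, 31, 28],
    "Gender": ['F', 'M', 'M', None, 'F', 'F', 'M'],
    "Years": [3, None, None, None, 7, None, 2]
})

# .isna() marks each missing cell as True; .sum() counts the True values per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```

For this dataset, the counts are 2 for Name, 2 for Age, 1 for Gender, and 4 for Years.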

Using Pandas fillna() to Fill Missing Values in a Single DataFrame Column

The Pandas .fillna() method can be applied to a single column (or, rather, a Pandas Series) to fill all missing values with a value. To fill missing values, you can simply pass a value into the value= parameter.

This gives you a ton of flexibility in terms of how you want to fill your missing values. Let’s explore a few of these by looking at how to fill with 0, another constant value, the mean of the column, or with a string.

Using Pandas fillna() To Fill with 0

To fill all missing values in a Pandas column with 0, you can pass in .fillna(0) and apply it to the column. Let’s see how we can fill all missing values in the Years column:

# Fill Missing Values with 0
df['Years'] = df['Years'].fillna(0)
print(df.head())

# Returns:
#     Name   Age Gender  Years
# 0  Alice  25.0      F    3.0
# 1    Bob   NaN      M    0.0
# 2   None  23.0      M    0.0
# 3  David  35.0   None    0.0
# 4   None   NaN      F    7.0

In the code block above, we applied the .fillna() method to the Years column. Note, in particular, that we re-assigned the column here, thereby overwriting the original Pandas Series.

Using Pandas fillna() To Fill with a Constant Value

Similar to the example above, to fill all missing values in a Pandas column with a constant value, we simply pass that value into the .fillna() method’s value= parameter. Pandas will attempt to match the fill value to the data type of the column.

Let’s see how we can fill all missing values in the Age column with 99:

# Fill Missing Values with a Constant Value
df['Age'] = df['Age'].fillna(99)
print(df.head())

# Returns:
#     Name   Age Gender  Years
# 0  Alice  25.0      F    3.0
# 1    Bob  99.0      M    NaN
# 2   None  23.0      M    NaN
# 3  David  35.0   None    NaN
# 4   None  99.0      F    7.0

In the example above, we filled all missing values in the 'Age' column using the value 99. In many cases, this wouldn’t actually be how we would infer missing data. Because of this, let’s see how we can pass in the mean (or average) value of a column.

Using Pandas fillna() To Fill with the Mean

In order to fill all missing values of a column with the mean of that column, you can apply .fillna() with the mean value of that column. Let’s see how we can use the Pandas .mean() method to replace missing values with the mean:

# Fill Missing Values with the Mean of a Column
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df.head())

# Returns:
#     Name   Age Gender  Years
# 0  Alice  25.0      F    3.0
# 1    Bob  28.4      M    NaN
# 2   None  23.0      M    NaN
# 3  David  35.0   None    NaN
# 4   None  28.4      F    7.0

In the code block above, rather than passing in a direct value, we pass in the .mean() method applied to the column. This allows the code to be reusable and adaptive as the data changes.

What’s great about this approach is that it allows us to use any other type of calculated value, such as the median or the mode of a dataset.
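
As a quick illustration of that flexibility, the sketch below swaps the mean for the median on the same Age values. Only the statistic changes; the .fillna() call stays the same.

```python
import pandas as pd

age = pd.Series([25, None, 23, 35, None, 31, 28], name="Age")

# The median of the non-missing values (23, 25, 28, 31, 35) is 28.0
median_age = age.median()

# Fill the two missing ages with the median
age_filled = age.fillna(median_age)
print(age_filled.tolist())
# [25.0, 28.0, 23.0, 35.0, 28.0, 31.0, 28.0]
```

The median is often a safer choice than the mean when a column contains outliers, since a single extreme value doesn't pull it as far.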

Using Pandas fillna() To Fill with a String

Similarly, we can pass in a string to fill all missing values with the given string. This works in the same way as passing in a constant value. Let’s see how we can pass in the string 'Missing' to fill all missing values in the 'Name' column:

# Fill Missing Values with a String
df['Name'] = df['Name'].fillna('Missing')
print(df.head())

# Returns:
#       Name   Age Gender  Years
# 0    Alice  25.0      F    3.0
# 1      Bob   NaN      M    NaN
# 2  Missing  23.0      M    NaN
# 3    David  35.0   None    NaN
# 4  Missing   NaN      F    7.0

We can see that this works in the same way as our previous examples. Note that if we passed a string into a numeric column (such as an integer or float column), the entire column’s data type would change to object.
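
To see that data type change for yourself, here is a small sketch using a stand-alone numeric Series (a shortened, hypothetical version of the Years column):

```python
import pandas as pd

years = pd.Series([3, None, 2], name="Years")
print(years.dtype)  # float64, since the missing value forces a float column

# Filling a numeric column with a string mixes numbers and text
filled = years.fillna("Missing")
print(filled.dtype)  # object
```

Once a column becomes object, numeric operations such as .mean() no longer work on it directly, so this is usually something to avoid for columns you plan to calculate with.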

Using Pandas fillna() to Fill Missing Values in an Entire DataFrame

In order to fill missing values in an entire Pandas DataFrame, we can simply pass a fill value into the value= parameter of the .fillna() method. The method will attempt to maintain the data type of the original column, if possible.

Let’s see how we can fill all of the missing values across the DataFrame using the value 0:

# Filling Missing Values in a Pandas DataFrame with One Value
df = df.fillna(0)
print(df.head())

# Returns:
#     Name   Age Gender  Years
# 0  Alice  25.0      F    3.0
# 1    Bob   0.0      M    0.0
# 2      0  23.0      M    0.0
# 3  David  35.0      0    0.0
# 4      0   0.0      F    7.0

We can see in the code block above that by passing a single value into the .fillna() method, that value was used in every column. In the numeric columns, the 0 was stored as 0.0 to match the float data type. In the Name and Gender columns, however, the integer 0 now sits alongside strings, which is rarely what you want.

Using Pandas fillna() to Fill Missing Values in Specific DataFrame Columns

So far, we have explored filling missing data either for one column at a time or for the entire DataFrame. Pandas allows you to pass in a dictionary of column-value pairs to fill missing values in identified columns with specific values.

This can be tremendously helpful when you want to clean missing data across the DataFrame, without needing to call the method multiple times.

Let’s see how we can use this approach to fill missing values with different values:

# Fill Missing Values in Specific DataFrame Columns
df = df.fillna({
    'Name': 'Missing',
    'Age': df['Age'].mean(),
    'Years': 0
})

print(df.head())

# Returns:
#       Name   Age Gender  Years
# 0    Alice  25.0      F    3.0
# 1      Bob  28.4      M    0.0
# 2  Missing  23.0      M    0.0
# 3    David  35.0   None    0.0
# 4  Missing  28.4      F    7.0

In the code block above, we passed in a dictionary mapping of the columns we wanted to fill missing data in and the values with which we wanted to fill them.

Notice that we used a number of different approaches: a string, the average of a column, and a constant value.

The column we left out of the dictionary (Gender) was left untouched. This approach allows you to write clean code, without successive calls to the .fillna() method.
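
Because value= also accepts a Series, a related pattern is to pass a Series whose index holds column names. The sketch below is one way to fill every numeric column with its own mean in a single call. It assumes the same sample data as above, and the numeric_means name is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ['Alice', 'Bob', None, 'David', None, 'Fiona', 'George'],
    "Age": [25, None, 23, 35, None, 31, 28],
    "Gender": ['F', 'M', 'M', None, 'F', 'F', 'M'],
    "Years": [3, None, None, None, 7, None, 2]
})

# A Series indexed by column name: one mean per numeric column
numeric_means = df.mean(numeric_only=True)

# Columns that don't appear in the Series (Name, Gender) are not filled
filled = df.fillna(numeric_means)
print(filled)
```

This works much like the dictionary approach, but the fill values are computed for you rather than typed out column by column.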

Using Pandas fillna() to Back Fill or Forward Fill Data

The Pandas .fillna() method also allows you to fill holes in your data by using the last or next observations. This process is called forward-filling or back-filling the data.

In doing this, we have the following options to pass into the method= parameter:

  • 'ffill' or 'pad' will use the previous value to fill missing values in a gap
  • 'bfill' or 'backfill' will use the next value to fill missing values in a gap

Let’s see how we can use this to fill missing values in the 'Years' column:

# Forward fill missing data using .fillna()
df['Years'] = df['Years'].fillna(method='ffill')
print(df)

# Returns:
#      Name   Age Gender  Years
# 0   Alice  25.0      F    3.0
# 1     Bob   NaN      M    3.0
# 2    None  23.0      M    3.0
# 3   David  35.0   None    3.0
# 4    None   NaN      F    7.0
# 5   Fiona  31.0      F    7.0
# 6  George  28.0      M    2.0

In the code block above, we apply .fillna(method='ffill') to the 'Years' column. Notice that all gaps are filled with the last value that preceded the gap. This approach is particularly helpful with time series data.
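
One caveat worth knowing: to the best of my knowledge, pandas 2.1 and later deprecate the method= argument of .fillna() in favor of the dedicated .ffill() and .bfill() methods. The sketch below shows both on the same Years values, which also gives us a look at back-filling:

```python
import pandas as pd

years = pd.Series([3, None, None, None, 7, None, 2], name="Years")

# Forward fill: each gap takes the last value seen before it
forward = years.ffill()
print(forward.tolist())
# [3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 2.0]

# Back fill: each gap takes the next value that follows it
backward = years.bfill()
print(backward.tolist())
# [3.0, 7.0, 7.0, 7.0, 7.0, 2.0, 2.0]
```

If you see a FutureWarning when running the method= examples in this tutorial, switching to .ffill() or .bfill() should give the same result.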

If you want to fill gaps in data with interpolated values, you can check out the .interpolate() method which can be used to fill missing data by calculating what these values should be.
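
As a brief sketch of the difference, here is what .interpolate() does with its default linear method on the same Years values. Rather than copying a neighboring value, it spaces the filled values evenly between the known points on either side of each gap.

```python
import pandas as pd

years = pd.Series([3, None, None, None, 7, None, 2], name="Years")

# Linear interpolation between 3 and 7, and then between 7 and 2
interpolated = years.interpolate()
print(interpolated.tolist())
# [3.0, 4.0, 5.0, 6.0, 7.0, 4.5, 2.0]
```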

Limiting the Number of Consecutive Missing Data Filled with Pandas fillna()

When using the method= parameter of the .fillna() method, you may not want to fill an entire gap in your data. By using the limit= parameter, you can specify the maximum number of consecutive missing values to forward-fill or back-fill.

Let’s see how we can use this parameter to limit the number of values filled in a gap in our data:

# Limiting the Number of Data Filled
df['Years'] = df['Years'].fillna(method='ffill', limit=2)
print(df)

# Returns:
#      Name   Age Gender  Years
# 0   Alice  25.0      F    3.0
# 1     Bob   NaN      M    3.0
# 2    None  23.0      M    3.0
# 3   David  35.0   None    NaN
# 4    None   NaN      F    7.0
# 5   Fiona  31.0      F    7.0
# 6  George  28.0      M    2.0

In the example above, we used the same method to forward-fill missing data in our dataset. However, we specified that we would want to fill a maximum of two missing records, by passing in limit=2. We can see that the third missing value in the gap isn’t filled in.

For this parameter to work, the value passed in must be greater than 0 and not None.
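
The limit works in the same way when back-filling, except that the count starts from the value that follows the gap. Here is a minimal sketch using the .bfill() shortcut with limit=1 on the same Years values:

```python
import pandas as pd

years = pd.Series([3, None, None, None, 7, None, 2], name="Years")

# Only the one missing value closest to the next known value is filled in each gap
limited_backfill = years.bfill(limit=1)
print(limited_backfill.tolist())
# [3.0, nan, nan, 7.0, 7.0, 2.0, 2.0]
```

In the three-value gap, only the record right before the 7 is filled, while the two records before it stay missing.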

Using Pandas fillna() with groupby and transform

In this section, we’re going to explore using the Pandas .fillna() method to fill data across different categories. Recall that in our earlier example, we filled the missing data in the Age column using the average of that column.

To make our filled values more representative, we can split the data into groups first. For example, we can fill each missing age with the average age of that record’s group in the Gender column.

In order to do this, we use the Pandas groupby method to calculate the average age of each group, then pass those values back to the DataFrame using the .transform() method.

# Calculate the mean age for each gender
mean_age_by_gender = df.groupby('Gender')['Age'].transform('mean')

# Fill the missing age values with the mean age of each gender
df['Age'] = df['Age'].fillna(mean_age_by_gender)
print(df)

# Returns:
#      Name   Age Gender  Years
# 0   Alice  25.0      F    3.0
# 1     Bob  25.5      M    NaN
# 2    None  23.0      M    NaN
# 3   David  35.0   None    NaN
# 4    None  28.0      F    7.0
# 5   Fiona  31.0      F    NaN
# 6  George  28.0      M    2.0

In the example above, we first calculated the average age by gender and created a resulting Series with those values mapped to the index. We then used the .fillna() method to pass in that Series to fill all missing data.
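
It can help to look at that intermediate Series directly. The sketch below prints it for the Gender and Age columns of our sample data, and then adds an optional fallback to the overall mean. The fallback is my own addition for illustration: in recent pandas versions, rows with a missing Gender get no group mean at all, so a missing age in such a row would otherwise stay missing.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ['F', 'M', 'M', None, 'F', 'F', 'M'],
    "Age": [25, None, 23, 35, None, 31, 28],
})

# One value per row: the mean age of that row's Gender group
group_means = df.groupby('Gender')['Age'].transform('mean')
print(group_means.tolist())

# Fill from the group mean first, then fall back to the overall mean
overall_mean = df['Age'].mean()
age_filled = df['Age'].fillna(group_means).fillna(overall_mean)
print(age_filled.tolist())
# [25.0, 25.5, 23.0, 35.0, 28.0, 31.0, 28.0]
```

In this dataset the fallback is never triggered, since the one record without a Gender already has an age. It only matters when a record is missing both values.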

Using Pandas fillna() to Fill Missing Data In Place

So far, we have explored using the Pandas .fillna() method by re-assigning either the DataFrame to itself or a Pandas Series to itself.

The .fillna() method allows you to fill missing values in place, by setting inplace=True. In my experience, there is some degree of contention as to whether this approach is faster or more memory efficient.

I prefer to run the operations by re-assigning the DataFrame/Series, since this can be used more consistently.

Let’s take a look at how we can fill missing values in place using the .fillna() method:

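# Fill Missing Values In Place
# Calling .fillna() on the DataFrame with a dictionary modifies df itself.
# The chained form df['Name'].fillna('Missing', inplace=True) can act on a
# temporary Series and leave df unchanged in recent pandas versions.
df.fillna({'Name': 'Missing'}, inplace=True)
print(df.head())

# Returns:
#       Name   Age Gender  Years
# 0    Alice  25.0      F    3.0
# 1      Bob   NaN      M    NaN
# 2  Missing  23.0      M    NaN
# 3    David  35.0   None    NaN
# 4  Missing   NaN      F    7.0

We can see that by using the inplace=True argument, we didn’t need to re-assign the result.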

Frequently Asked Questions

What is the difference between fillna and dropna in Pandas?

Both fillna and dropna are methods for handling missing data in a Pandas DataFrame or Series, but they work differently. fillna replaces the missing values (NaN or None) with specified values, while dropna eliminates the rows or columns containing missing values. Generally, use fillna when you want to maintain the shape and size of your dataset while filling missing data. On the other hand, choose dropna when you prefer to remove data with missing values entirely.

How can I use fillna to replace missing values with the mean, median, or mode of a column?

To fill missing values with the mean, median or mode of a column, simply pass the respective statistical measure as the ‘value’ parameter in the fillna method.
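
The mode needs one extra step, since .mode() returns a Series rather than a single value (there can be ties). Here is a minimal sketch using the Gender values from this tutorial, where 'F' and 'M' happen to be tied:

```python
import pandas as pd

gender = pd.Series(['F', 'M', 'M', None, 'F', 'F', 'M'], name="Gender")

# .mode() returns every most-frequent value in sorted order; take the first one
most_common = gender.mode()[0]

gender_filled = gender.fillna(most_common)
print(gender_filled.tolist())
# ['F', 'M', 'M', 'F', 'F', 'F', 'M']
```

Selecting [0] picks the first of the tied values, which is 'F' here. If ties matter for your analysis, you may want to choose the fill value more deliberately.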

Can I use fillna on a specific subset of columns or rows in my DataFrame?

Yes, you can apply fillna to a subset of columns or rows. To do this, select the subset you want to work on (for example, with a list of column names or with the DataFrame.loc indexer), apply fillna to that selection, and assign the result back.
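
Here is one way this could look, sketched on the sample data from this tutorial. The by_columns, by_rows, and is_female names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ['Alice', 'Bob', None, 'David', None, 'Fiona', 'George'],
    "Age": [25, None, 23, 35, None, 31, 28],
    "Gender": ['F', 'M', 'M', None, 'F', 'F', 'M'],
    "Years": [3, None, None, None, 7, None, 2]
})

# Subset of columns: fill only Age and Years, leaving Name and Gender alone
cols = ['Age', 'Years']
by_columns = df.copy()
by_columns[cols] = by_columns[cols].fillna(0)

# Subset of rows: fill Years only where Gender is 'F'
by_rows = df.copy()
is_female = by_rows['Gender'] == 'F'
by_rows.loc[is_female, 'Years'] = by_rows.loc[is_female, 'Years'].fillna(0)

print(by_columns)
print(by_rows)
```

In both cases the filled selection is assigned back to the same selection, so the rest of the DataFrame is left as it was.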

Conclusion

In this comprehensive tutorial, we’ve explored the power of the Pandas fillna method in tackling missing data. To recap, we’ve covered the following:

  1. The importance of handling missing data and the role of the Pandas fillna method in this process
  2. The various parameters of the fillna method and their usage, such as value, method, axis, inplace, and limit
  3. Examples demonstrating different ways to use fillna, including filling missing values with constants, dictionaries, forward fill, backward fill, and groupby combined with transform

Remember that handling missing data effectively is essential for ensuring the accuracy and consistency of your data analysis. Don’t hesitate to refer to this tutorial if you need guidance or inspiration while working with missing data in your future projects. Good luck, and happy data cleaning!

To learn more about the Pandas .fillna() method, check out the official documentation.

Nik Piepenbreier

Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials.

2 thoughts on “Pandas fillna: A Guide for Tackling Missing Data in DataFrames”

  1. Nice article, thanks for posting. It has some good information for handling null values.
    One thing you didn’t cover that seems to be an ongoing issue with pandas fillna() is that null datetime values (NaT) do not get handled with fillna(value=''). Something to do with the particular way pandas handles datetime vs other dtypes.
    There is a discussion on the pandas github repository:
    https://github.com/pandas-dev/pandas/issues/11953
    It seems you can force it to accept a string or number or even a space – just not ''.
    Thought you might find it of interest.
    Thanks again.
