Pandas Fillna – Dealing with Missing Values

In this post, you’ll learn about the Pandas Fillna function and how to deal with missing values.

No dataset is perfect. Learning how to deal with missing values is an important step in retaining useful data.

Loading the Dataset

Let’s start the tutorial by loading a dataset. We’ll import pandas and load a dataset specifically made for this tutorial. The dataset covers the temperature and humidity in Toronto, Ontario for a period of days.

import pandas as pd
df = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/fillna.xlsx')
print(df.head())

This returns:

        Time  Temperature (F)  Humidity
0 2020-07-01             73.3      74.3
1 2020-07-02             83.7      47.5
2 2020-07-03             81.2       NaN
3 2020-07-04              NaN       NaN
4 2020-07-05             74.5       NaN

Identifying Missing Values in Pandas

An easy way to count the missing values in each column of a Pandas DataFrame is to chain the isna and sum methods.

df.isna().sum()

This returns:

Time               0
Temperature (F)    3
Humidity           3
dtype: int64

The way this works is that the isna method returns a boolean DataFrame of the same shape as the original, with True wherever a value is missing. When you apply the sum method, True values are counted as 1 and False values as 0, so each column's total is its number of missing values.
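To make the two steps visible, here is a minimal sketch on a small hypothetical DataFrame called sample, modeled on the five rows printed earlier (it is not the full Excel file, so the counts differ from the output above). The names mask, missing_counts and missing_share are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample modeled on the first five rows shown earlier
sample = pd.DataFrame({
    "Time": pd.to_datetime(
        ["2020-07-01", "2020-07-02", "2020-07-03", "2020-07-04", "2020-07-05"]
    ),
    "Temperature (F)": [73.3, 83.7, 81.2, np.nan, 74.5],
    "Humidity": [74.3, 47.5, np.nan, np.nan, np.nan],
})

mask = sample.isna()         # boolean DataFrame, True where a value is missing
missing_counts = mask.sum()  # True counts as 1, False as 0, summed per column
missing_share = mask.mean()  # fraction of missing values per column

print(missing_counts)
print(missing_share)
```

Taking the mean of the same boolean mask is a handy variation: it gives the share of missing values per column rather than the raw count.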

Pandas Fillna Overview

Let’s take a quick moment to explore the Pandas fillna function:

DataFrame.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) 

The function has several useful parameters:

Parameter   Description
value       The value to fill with (a single value, or a dict keyed by column)
method      Propagates the last (ffill) or next (bfill) valid value
axis        The axis to fill along (rows or columns)
inplace     Whether to replace NaNs in place
limit       How many consecutive values to propagate when a method is specified
downcast    A dict of item -> dtype describing what to downcast

Pandas Fillna to Fill Values

There are a number of options that you can use to fill values using the Pandas fillna function.

You can pass in either a single value or a dictionary of values, where the keys represent the columns to replace values in.

Tip! None of the code below modifies the original dataframe; each call returns a new, filled copy. To keep the changes, either set inplace=True or re-assign the result to a variable.
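The difference between the two options can be sketched as follows. The tiny sample DataFrame and the variable names are hypothetical and only serve to illustrate the behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame with a single missing value
sample = pd.DataFrame({"Humidity": [74.3, 47.5, np.nan]})

sample.fillna(0)  # the result is discarded, so sample is unchanged
still_missing = int(sample["Humidity"].isna().sum())

filled = sample.fillna(0)       # option 1: store the returned copy
sample.fillna(0, inplace=True)  # option 2: modify sample itself
missing_after = int(sample["Humidity"].isna().sum())

print(still_missing, missing_after)
```

Re-assigning is usually the easier option to reason about, because every step of the cleaning process leaves a value you can inspect.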

Let’s replace all missing values with a single value first:

df.fillna(0)

This returns:

         Time  Temperature (F)  Humidity
0  2020-07-01             73.3      74.3
1  2020-07-02             83.7      47.5
2  2020-07-03             81.2       0.0
3  2020-07-04              0.0       0.0
4  2020-07-05             74.5       0.0

Now let’s try replacing different columns’ missing values with different values:

df.fillna({'Temperature (F)': 99, 'Humidity':0})

This returns:

         Time  Temperature (F)  Humidity
0  2020-07-01             73.3      74.3
1  2020-07-02             83.7      47.5
2  2020-07-03             81.2       0.0
3  2020-07-04             99.0       0.0
4  2020-07-05             74.5       0.0

Fill Missing Values with the Mean (Average)

Replacing missing values with one fixed value isn't always practical. Instead, you may want to replace them with the average value of their column.

Let’s see how to do this:

df.fillna({
    'Temperature (F)': df['Temperature (F)'].mean(),
    'Humidity': df['Humidity'].mean()
})

Here, you pass in the dictionary of values as before, and assign the mean of each column to the missing values.

This returns:

         Time  Temperature (F)  Humidity
0  2020-07-01            73.30  74.30000
1  2020-07-02            83.70  47.50000
2  2020-07-03            81.20  76.94375
3  2020-07-04            76.05  76.94375
4  2020-07-05            74.50  76.94375
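If you have many numeric columns, writing out the dictionary by hand gets tedious. One possible shortcut, sketched here on a hypothetical five-row sample rather than the tutorial's Excel file, is to pass the Series of column means straight to fillna. Because the sample is smaller, the means differ from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample modeled on the first five rows of the dataset
sample = pd.DataFrame({
    "Time": pd.to_datetime(
        ["2020-07-01", "2020-07-02", "2020-07-03", "2020-07-04", "2020-07-05"]
    ),
    "Temperature (F)": [73.3, 83.7, 81.2, np.nan, 74.5],
    "Humidity": [74.3, 47.5, np.nan, np.nan, np.nan],
})

# Mean of every numeric column; missing values are skipped by default
column_means = sample.mean(numeric_only=True)

# A Series works like the dictionary: its index labels are matched to columns
filled = sample.fillna(column_means)

print(filled)
```

The numeric_only=True argument keeps the Time column out of the calculation, so only Temperature (F) and Humidity are filled.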

Fill Missing Values with Previous or Following Values (ffill, bfill)

Instead of filling with a computed value such as the mean, it can be helpful to use the previous or the following valid value in the column. These approaches are known as forward fill and backward fill, respectively.

To choose between them, set the method parameter to either ffill (forward fill, which carries the previous valid value down) or bfill (backward fill, which carries the next valid value up).

Let’s say we wanted to fill each gap with the value that follows it. We would then use bfill:

df.fillna(method='bfill')

This returns:

         Time  Temperature (F)  Humidity
0  2020-07-01             73.3      74.3
1  2020-07-02             83.7      47.5
2  2020-07-03             81.2      72.5
3  2020-07-04             74.5      72.5
4  2020-07-05             74.5      72.5
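To compare the two directions side by side, here is a small sketch on a hypothetical Series called temps. It uses the dedicated ffill and bfill methods, which behave like the corresponding method values. Newer pandas releases (2.1 and later, as far as we're aware) flag the method argument of fillna as deprecated and recommend these methods instead.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with two gaps
temps = pd.Series([73.3, np.nan, 81.2, np.nan, 74.5])

forward = temps.ffill()   # each gap takes the previous valid value
backward = temps.bfill()  # each gap takes the next valid value

print(forward.tolist())   # [73.3, 73.3, 81.2, 81.2, 74.5]
print(backward.tolist())  # [73.3, 81.2, 81.2, 74.5, 74.5]
```

Keep in mind that a forward fill cannot fill a gap at the very start of a column, and a backward fill cannot fill one at the very end, since there is no value to carry over.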

There may be times when we don’t want to fill every value in a long run of missing data. Here, we can use the limit parameter, which caps how many consecutive NaNs are filled. For example, if we set the limit to 1, only the first NaN in each gap is filled and the remaining ones stay NaN:

df.fillna(method='ffill', limit=1)

This returns:

        Time  Temperature (F)  Humidity
0 2020-07-01             73.3      74.3
1 2020-07-02             83.7      47.5
2 2020-07-03             81.2      47.5
3 2020-07-04             81.2       NaN
4 2020-07-05             74.5       NaN
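The limit applies to each run of missing values separately, not to the column as a whole. A minimal sketch with a hypothetical Series containing two gaps shows this, using the ffill method (which accepts the same limit argument):

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two separate gaps of two NaNs each
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan])

# Fill at most one consecutive NaN per gap
limited = readings.ffill(limit=1)

print(limited.tolist())  # [1.0, 1.0, nan, 4.0, 4.0, nan]
```

Both gaps get their first value filled, and the second NaN in each gap is left alone.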

Conclusion

In this post, you learned how to work with missing data. We started off by learning how to count missing values. We then learned how to fill missing values using the Pandas fillna function, including filling with fixed values, filling with the column mean, and using the backward fill and forward fill methods.

To learn more, check out the official documentation.
