Pandas Variance: Calculating Variance of a Pandas Dataframe Column


In this tutorial, you’ll learn how to calculate the Pandas variance, including how to calculate the variance of a single column, multiple columns, and an entire Pandas Dataframe.

The Quick Answer: Use Pandas .var()
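
As a minimal sketch of the quick answer, assuming a small made-up dataframe (the `scores` column and its values are hypothetical and are not part of this tutorial's sample data):

```python
import pandas as pd

# Hypothetical data, used only to illustrate the call
df = pd.DataFrame({'scores': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Sample variance (n - 1 denominator) of a single column
scores_variance = df['scores'].var()
print(scores_variance)

# Returns: 7.5
```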


What is the Variance Statistic?

Variance is a measure of how spread out the numbers in a dataset are. Specifically, it measures how far each number is from the mean of all the numbers, which gives us a way to quantify that spread.

The variance is calculated by:

  1. Calculating the difference between each number and the mean
  2. Calculating the square of each difference
  3. Dividing the sum of the squared differences by the number of observations in your sample minus 1

The formula for the variance looks like this:

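$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Here, $x_i$ is each observation, $\bar{x}$ is the sample mean, and $n$ is the number of observations.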

Now that you have a good understanding of what the variance measure is, let’s learn how to calculate it using Python.


How to Calculate Variance in Python

Before we dive into how to calculate the variance using Pandas, let’s first see how you can calculate the variance from scratch using Python.

# Calculate the variance from scratch in Python

numbers = [1,2,3,4,5,6,7,8,9]

def variance(observations):
    mean = sum(observations) / len(observations)
    squared_differences = 0
    for number in observations:
        difference = mean - number
        squared_difference = difference ** 2
        squared_differences += squared_difference
    variance = squared_differences / (len(observations) - 1)
    
    return variance

print(variance(numbers))

# Returns 7.5

In this code, we’ve done the following:

  1. Created a function that takes observations in the form of a list
  2. Calculated the mean of the observations by dividing their sum by the number of observations
  3. Created a variable, initialized at 0, to hold the running sum of squared differences
  4. Looped over each observation, calculated its difference from the mean, squared it, and added the result to the running sum
  5. Divided the sum of squared differences by the number of observations minus one
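
As a sanity check on the function above, Python's standard library also provides a sample-variance function, `statistics.variance()`, which uses the same n-1 denominator. A short sketch comparing the two (the from-scratch function is repeated in condensed form so the block runs on its own):

```python
import statistics

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def variance(observations):
    # Same logic as the from-scratch function, condensed
    mean = sum(observations) / len(observations)
    squared_differences = sum((mean - number) ** 2 for number in observations)
    return squared_differences / (len(observations) - 1)

from_scratch = variance(numbers)
from_stdlib = statistics.variance(numbers)

print(from_scratch, from_stdlib)

# Returns: 7.5 7.5
```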

Thankfully, you don’t need to write this every time you want to calculate the variance of a Pandas dataset. In the next section, you’ll learn how to easily calculate the variance of a single column using Pandas.


Loading a Sample Pandas Dataframe

If you want to follow along with the tutorial, feel free to load the dataframe below. We’ll include a variety of columns, including one containing strings, one with missing data, and two numerical columns.

Let’s load the dataframe by using the code below:

# Loading a Sample Pandas Dataframe
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'ages': [30, 40, 32, 67, 43],
    'ages_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})

print(df)

# Returns:
#       name  ages  ages_missing_data  income
# 0    James    30               30.0  100000
# 1     Jane    40               40.0   80000
# 2  Melissa    32               32.0   55000
# 3       Ed    67               67.0   62000
# 4     Neil    43                NaN  120000

Now that we have a dataframe to work with, let’s begin calculating the variance for the Pandas dataframe.

How to Calculate Variance in Pandas for a Single Column

Pandas makes it very easy to calculate the variance for a single column. For our first example, we’ll begin by calculating the variance for a single column that does not contain any missing data.

Let’s see how we can calculate the variance for the income column:

# Calculating a Pandas variance for a single column
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'ages': [30, 40, 32, 67, 43],
    'ages_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})

income_variance = df['income'].var()

print(income_variance)

# Returns: 722800000.0

By default, Pandas uses n-1 as the denominator, which gives the sample variance. If we instead want to use n as the denominator (the population variance), we can set the ddof (delta degrees of freedom) argument to 0.

Let’s see what this would look like:

# Calculating a Pandas variance for a single column
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'ages': [30, 40, 32, 67, 43],
    'ages_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})

income_variance = df['income'].var(ddof=0)

print(income_variance)

# Returns: 578240000.0
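
The two results differ only by a factor of (n-1)/n. The sketch below checks this, and also shows that NumPy's `np.var()` defaults to ddof=0 when given a plain array, so its default does not match the Pandas default. The variable names are illustrative:

```python
import numpy as np
import pandas as pd

income = pd.Series([100000, 80000, 55000, 62000, 120000])
n = len(income)

sample_variance = income.var()            # ddof=1 (Pandas default)
population_variance = income.var(ddof=0)  # ddof=0

# Rescaling the sample variance by (n - 1) / n gives the population variance
print(np.isclose(sample_variance * (n - 1) / n, population_variance))

# NumPy defaults to ddof=0, so this matches the population variance
print(np.isclose(np.var(income.to_numpy()), population_variance))

# Returns:
# True
# True
```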

In the next section, you’ll learn how to deal with missing values when calculating a Pandas variance.


How to Deal with Missing Data in Calculating a Pandas Variance

In many cases, you may be working with imperfect data, where some values are missing. Because of this, you will need to decide how to treat missing data in your calculations. By default, Pandas excludes missing values from its variance calculation (skipna=True), so both the mean and the n-1 denominator are based only on the non-missing values.

Let’s take a look at calculating the variance of a column with missing data.

# Calculating a Pandas variance for a single column with missing data
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'ages': [30, 40, 32, 67, 43],
    'ages_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})

missing_data_variance = df['ages_missing_data'].var()

print(missing_data_variance)

# Returns: 290.9166666666667

Now let’s take a look at what happens when we tell Pandas not to skip the missing data by passing skipna=False:

# Calculating a Pandas variance for a single column with missing data
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'ages': [30, 40, 32, 67, 43],
    'ages_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})

missing_data_variance = df['ages_missing_data'].var(skipna=False)

print(missing_data_variance)

# Returns: nan

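We can see here that when a column contains missing data and skipna=False is used, NaN is returned. To work around this, we could fill in the missing data before calculating the variance, for example with an imputed value such as the column mean. Filling with 0 is also possible, but it can distort the variance considerably when 0 is not a plausible value for the column.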
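
To illustrate how much the choice of fill value matters, here is a small sketch using the same ages as the sample dataframe. It uses `.fillna()`; the variable names are illustrative, and mean imputation is only one of several possible strategies:

```python
import pandas as pd

ages = pd.Series([30, 40, 32, 67, None], name='ages_missing_data')

# Default behaviour: the missing value is skipped
skipped = ages.var()

# Fill the missing value with 0 (pulls the mean down and inflates the spread)
filled_with_zero = ages.fillna(0).var()

# Fill the missing value with the mean of the observed values
filled_with_mean = ages.fillna(ages.mean()).var()

print(skipped, filled_with_zero, filled_with_mean)
```

With this data, filling with 0 roughly doubles the variance (about 575.2 versus about 290.9). Mean imputation shrinks it to about 218.2, because the filled value adds nothing to the squared differences while still increasing the denominator.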

In the next section, you’ll learn how to calculate the variance for multiple columns in Pandas.


How to Calculate Variance in Pandas for Multiple Columns

There may also be many times when you want to calculate the variance for multiple columns, in order to see the dispersion across related variables.

In order to do this, we can simply index the columns we want to calculate the variance for by using double square brackets [[]] and then use the .var() method.

Let’s see what this looks like:

# Calculating a Pandas variance for multiple columns
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'ages': [30, 40, 32, 67, 43],
    'ages_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})

variances = df[['ages', 'income']].var()
print(variances)

# Returns:
# ages            218.3
# income    722800000.0
# dtype: float64

We can see here that a Series is returned, with the column names as the index and the variance of each column as the values.
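
Because variance is expressed in the squared units of the original column, values like the income variance above can be hard to read directly. One option, sketched here, is to take the square root to get the standard deviation, which Pandas also exposes as `.std()` with the same default ddof=1:

```python
import pandas as pd

df = pd.DataFrame({
    'ages': [30, 40, 32, 67, 43],
    'income': [100000, 80000, 55000, 62000, 120000]
})

variances = df[['ages', 'income']].var()

# The standard deviation is the square root of the variance
std_from_variance = variances ** 0.5
std_direct = df[['ages', 'income']].std()

print(std_from_variance)
print(std_direct)
```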


How to Calculate Variance in Pandas for a Dataframe

It’s even easier to calculate the variances for an entire dataframe. Because our dataframe contains a column of strings, we need to tell Pandas to skip non-numeric columns by passing numeric_only=True. In older versions of Pandas (before 2.0), non-numeric columns were dropped automatically, but newer versions raise an error instead.

Simply call the .var() method on the dataframe and Pandas will return a Series containing the variances of the numerical columns.

Let’s take a look at what this looks like:

# Calculating a Pandas variance for an entire dataframe
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'ages': [30, 40, 32, 67, 43],
    'ages_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})

# numeric_only=True skips the non-numeric 'name' column
variances = df.var(numeric_only=True)
print(variances)

# Returns:
# ages                 2.183000e+02
# ages_missing_data    2.909167e+02
# income               7.228000e+08
# dtype: float64
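
If you prefer to make the column selection explicit, an alternative sketch is to select the numeric columns first with `.select_dtypes()` and then call `.var()`. This should give the same result as calling `df.var(numeric_only=True)`:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'ages': [30, 40, 32, 67, 43],
    'ages_missing_data': [30, 40, 32, 67, None],
    'income': [100000, 80000, 55000, 62000, 120000]
})

# Keep only numeric columns, then compute the variance of each
numeric_variances = df.select_dtypes(include='number').var()
print(numeric_variances)
```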


Conclusion

In this post, you learned what the variance statistic is, how to calculate it from scratch using Python, and how to easily calculate a Pandas variance for a single or multiple columns or for an entire dataframe.

To learn more about the Pandas .var() method, check out the official documentation here.
