Python Standard Deviation Tutorial: Explanation & Examples


The standard deviation is a measure of how spread out the values in a data set are. In Python, it can be calculated in several ways; the easiest is to use either the stdev function from the statistics module or the std function from NumPy.

In this tutorial, you’ll learn what the standard deviation is, how to calculate it using built-in functions, and how to use Python to generate the statistics from scratch!

What is Standard Deviation?

Standard deviation is a helpful way to measure how “spread out” values in a data set are.

But how do you interpret a standard deviation?

A small standard deviation means that most of the numbers are close to the mean (average) value. However, a large standard deviation means that the values are further away from the mean.

The mean alone doesn’t tell you this. Two data sets could have the same average value but be entirely different in how those values are distributed, and the standard deviation is what captures that difference.

The standard deviation formula looks like this:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}}$$

Let’s break this down a bit:

  • σ (“sigma”) is the symbol for standard deviation
  • Σ is a fun way of writing “sum of”
  • xi represents every value in the data set
  • μ is the mean (average) value in the data set
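  • n is the sample size; dividing by n - 1 rather than n is what makes this the sample (rather than population) standard deviation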

Why is Standard Deviation Important?

As explained above, standard deviation is a key measure that explains how spread out values are in a data set. A small standard deviation happens when data points are fairly close to the mean. However, a large standard deviation happens when values are less clustered around the mean.

A data set can have the same mean as another data set, but be very different. Let’s take a look at this with an example:

  • Data set #1 = [1,1,1,1,1,1,1,1,2,10]
  • Data set #2 = [2,2,2,2,2,2,2,2,2,2]

Both of these datasets have the same average value (2), but are actually very different.

The standard deviation makes this difference visible: data set #2 has no spread at all, while data set #1 is pulled away from its mean by the single value 10.
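As a quick preview, the two data sets above can be compared with the standard library's statistics module, which is covered in the next section. This is a minimal sketch; the variable names are chosen here purely for illustration.

```python
import statistics

data_set_1 = [1, 1, 1, 1, 1, 1, 1, 1, 2, 10]
data_set_2 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

# Both data sets share the same mean (2)
mean_1 = statistics.mean(data_set_1)
mean_2 = statistics.mean(data_set_2)

# ...but their sample standard deviations are very different
stdev_1 = statistics.stdev(data_set_1)
stdev_2 = statistics.stdev(data_set_2)

print(mean_1, mean_2)
print(round(stdev_1, 2))  # 2.83
print(stdev_2 == 0)       # True
```

The first data set has a standard deviation of roughly 2.83 because of the single value 10, while the second has no spread at all.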

How to Calculate Standard Deviation in Python?

The easiest way to calculate standard deviation in Python is to use either the statistics module or the NumPy library.

Using the Statistics Module

The statistics module has a built-in function called stdev, which follows the syntax below:

standard_deviation = statistics.stdev(data, xbar=None)
  • data is the collection of data points (at least two values are needed)
  • xbar is an optional number: the mean of the data, if you have already calculated it. If it is left out, stdev calculates the mean itself

Let’s try this with an example:

import statistics

sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = statistics.stdev(sample)
print(standard_deviation)

# Returns 2.55
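The optional xbar argument can be sketched as follows. This assumes you have already calculated the mean elsewhere and want to reuse it; the names sample_mean, with_xbar and without_xbar are illustrative only.

```python
import math
import statistics

sample = [1, 2, 3, 4, 5, 5, 5, 5, 10]

# Calculate the mean once, then hand it to stdev through xbar
sample_mean = statistics.mean(sample)
with_xbar = statistics.stdev(sample, xbar=sample_mean)

# Leaving xbar out lets stdev calculate the mean itself
without_xbar = statistics.stdev(sample)

# Compare with a tolerance, since floating-point rounding may differ slightly
print(math.isclose(with_xbar, without_xbar))  # True
```

Note that xbar should be the actual mean of the data; passing some other number would give a meaningless result.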

Check out some other Python tutorials on datagy, including our complete guide to styling Pandas and our comprehensive overview of Pivot Tables in Pandas!

Using NumPy to Calculate Standard Deviation

NumPy has a function named std, which calculates the standard deviation of an array or list. By default, it treats the data as an entire population and divides by n; to get the standard deviation of a sample, you need to pass ddof=1.

The function follows this syntax:

standard_deviation = np.std(data, ddof=1)

The call above takes two arguments:

  1. data is the sample of data
  2. ddof is the “delta degrees of freedom”: NumPy divides by n - ddof. We pass 1, since we are calculating the standard deviation for a sample (rather than an entire population)

Now, let’s try this with an example:

import numpy as np

sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = np.std(sample, ddof=1)
print(standard_deviation)

# Returns 2.55
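To see what ddof actually changes, here is a small sketch that runs np.std on the same sample with and without ddof=1. The cross-check against the statistics module's pstdev (population) and stdev (sample) functions is added here for illustration.

```python
import statistics
import numpy as np

sample = [1, 2, 3, 4, 5, 5, 5, 5, 10]

population_sd = np.std(sample)      # default ddof=0, divides by n
sample_sd = np.std(sample, ddof=1)  # divides by n - 1

print(round(population_sd, 2))  # 2.41
print(round(sample_sd, 2))      # 2.55

# The statistics module offers the same pair as pstdev and stdev
print(np.isclose(population_sd, statistics.pstdev(sample)))  # True
print(np.isclose(sample_sd, statistics.stdev(sample)))       # True
```

Forgetting ddof=1 is an easy mistake to make, and on small samples the difference between the two results is noticeable.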

Calculate Standard Deviation for List

To calculate the standard deviation for a list that holds the values of a sample, we can use either of the methods we explored above. For this example, let’s use NumPy:

import numpy as np

sample_list = [10,30,43,23,67,49,78,98]
standard_deviation = np.std(sample_list, ddof=1)
print(standard_deviation)

# Returns 29.65

Calculate Standard Deviation for Dictionary Values

To calculate the standard deviation for dictionary values in Python, you need to let Python know you only want the values of that dictionary.

For the example below, we’ll be working with people’s heights in centimetres and calculating the standard deviation:

import numpy as np

sample_dictionary = {'John': 170, 'Meaghan': 155, 'Kate': 160, 'Peter': 185, 'Jane': 145}
standard_deviation = np.std(list(sample_dictionary.values()), ddof=1)
print(standard_deviation)

# Returns 15.25

This is very similar to the list example, except we call .values() to get only the dictionary’s values and use the list function to turn them into a list.
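The same calculation can also be sketched with the statistics module. As far as I know, stdev accepts any iterable of numbers, so the dictionary's values view can be passed in without converting it to a list first.

```python
import statistics

sample_dictionary = {'John': 170, 'Meaghan': 155, 'Kate': 160, 'Peter': 185, 'Jane': 145}

# Pass the values view straight to stdev; the keys (names) are ignored
standard_deviation = statistics.stdev(sample_dictionary.values())

print(round(standard_deviation, 2))  # 15.25
```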

Pandas Standard Deviation

If you are working with Pandas, you may be wondering if Pandas has a function for standard deviations.

Pandas lets you calculate a standard deviation for either a series, or even an entire dataframe!

In recent versions of Pandas (2.0 and later), the syntax looks like this:

DataFrame.std(axis=0, skipna=True, ddof=1, numeric_only=False)

Let’s explore these parameters:

  • axis is either 0 (or 'index'), which calculates one value per column, or 1 (or 'columns'), which calculates one value per row
  • skipna determines whether null/NA values are excluded from the calculation (by default, they are)
  • ddof defaults to 1, so the sample formula is used; the divisor is n - ddof
  • numeric_only restricts the calculation to numeric columns when set to True

Older versions of Pandas also accepted a level parameter for multi-index data; it was removed in Pandas 2.0.

Let’s try this out with an example, using people’s heights and weights:

import pandas as pd

dataframe_dictionary = {'Name': ['John', 'Meaghan', 'Kate', 'Peter', 'Jane'], 
                        'Height': [170,155,160,185,145], 
                        'Weight': [160, 120, 125, 200, 135]}
df = pd.DataFrame(data = dataframe_dictionary)
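# numeric_only=True skips the text column 'Name'; recent Pandas versions raise an error without it
standard_deviation = df.std(numeric_only=True)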

print(standard_deviation)

# Returns
# Height    15.247951
# Weight    32.901368

If you wanted to return the standard deviation for only one column, say height, you could write:

import pandas as pd

dataframe_dictionary = {'Name': ['John', 'Meaghan', 'Kate', 'Peter', 'Jane'], 
                        'Height': [170,155,160,185,145], 
                        'Weight': [160, 120, 125, 200, 135]}
df = pd.DataFrame(data = dataframe_dictionary)
standard_deviation = df['Height'].std()

print(standard_deviation)

# Returns 15.247951
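The ddof parameter described earlier can be sketched on the same data. Here the numeric columns are selected explicitly, so the text column never reaches std; the names numeric_columns, sample_sd and population_sd are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Meaghan', 'Kate', 'Peter', 'Jane'],
    'Height': [170, 155, 160, 185, 145],
    'Weight': [160, 120, 125, 200, 135],
})

# Select only the numeric columns, then compare the two divisors
numeric_columns = df[['Height', 'Weight']]
sample_sd = numeric_columns.std()            # ddof=1 (default), divides by n - 1
population_sd = numeric_columns.std(ddof=0)  # divides by n

print(sample_sd)
print(population_sd)
```

With only five rows, the population figures come out noticeably smaller than the sample figures (roughly 13.64 versus 15.25 for Height).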

You can learn more about the Pandas std function in the official Pandas documentation.

Python Standard Deviation From Scratch

For our final example, let’s build the standard deviation from scratch, to see what is really going on.

To begin, let’s take another look at the formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}}$$

In the code below, the steps needed are broken out:

import math

sample_list = [170,155,160,185,145]

# Need: (1) mean value, (2) sum of the squared differences between each value and the mean, (3) sample size

# Finding the mean value
sums = 0
for value in sample_list:
    sums += value

mean = sums / len(sample_list)

# Summing the squared difference between each value and the mean
difference_squared = 0
for value in sample_list:
    difference_squared += (value - mean) ** 2

# Dividing by n - 1 and taking the square root
standard_deviation = math.sqrt(difference_squared / (len(sample_list) - 1))

print(standard_deviation)
# Returns 15.25
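The steps above can also be wrapped into a reusable function. This is a minimal sketch: sample_standard_deviation is a name chosen here for illustration, the guard against fewer than two values is an added assumption, and the result is cross-checked against statistics.stdev.

```python
import math
import statistics

def sample_standard_deviation(values):
    """Sample standard deviation, dividing by n - 1 (illustrative helper)."""
    n = len(values)
    if n < 2:
        # With a single value, n - 1 would be zero
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    squared_differences = sum((value - mean) ** 2 for value in values)
    return math.sqrt(squared_differences / (n - 1))

sample_list = [170, 155, 160, 185, 145]
result = sample_standard_deviation(sample_list)

print(round(result, 2))                                      # 15.25
print(math.isclose(result, statistics.stdev(sample_list)))  # True
```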

Conclusion

In this post, we learned all about the standard deviation. We started off by learning what it is, how it’s calculated, and why it’s significant. Then, we learned how to calculate the standard deviation in Python using the statistics module, NumPy, and finally Pandas. We closed the tutorial off by demonstrating how the standard deviation can be calculated from scratch using basic Python!

I hope you learned a lot! If you did, sharing this post would help me out immensely!
