How to Calculate a Z-Score in Python (4 Ways)

In this tutorial, you’ll learn how to use Python to calculate a z-score for an array of numbers. You’ll get a brief overview of what the z-score represents in statistics and how it’s relevant to machine learning. You’ll then learn how to calculate a z-score from scratch in Python as well as how to use different Python modules to calculate the z-score.

By the end of this tutorial, you’ll have learned how to use scipy and pandas modules to calculate the z-score. Each of these approaches has different benefits and drawbacks. In large part, determining which approach works best for you depends on a number of different factors. For example, you may not want to import a different library only to calculate a statistical measure. Alternatively, you may want more control over how to calculate z-scores and rely on the flexibility that scipy gives you.

The Quick Answer: scipy.stats’ zscore() to Calculate a z-score in Python

# Calculate the z-score with scipy
import scipy.stats as stats
values = [4,5,6,6,6,7,8,12,13,13,14,18]

zscores = stats.zscore(values)
print(zscores)
# Returns: [-1.2493901  -1.01512945 -0.78086881 -0.78086881 -0.78086881 -0.54660817 -0.31234752  0.62469505  0.85895569  0.85895569  1.09321633  2.0302589 ]

What is the Z-Score and how is it used in Machine Learning?

The z-score is a score that measures how many standard deviations a data point is away from the mean. The z-score allows us to determine how usual or unusual a data point is in a distribution. It also allows us to more easily compare data points for a record across features, especially when the different features have significantly different ranges.

A z-score can be calculated for any numerical data, but it is easiest to interpret when the data is roughly normally distributed. We know that in a normal distribution, about 99.7% of values fall within 3 standard deviations of the mean. Because of this, we can assume that if the absolute value of a z-score is larger than 3, the value is quite unusual.
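
As a quick check of the “within 3 standard deviations” figure, the sketch below uses the standard library’s statistics.NormalDist class (Python 3.8 and later) to compute the share of a normal distribution that lies within 1, 2, and 3 standard deviations of the mean. The helper name share_within is made up for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# Prints:
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

These shares only hold for data that is close to normally distributed, so a cutoff of 3 is a rule of thumb rather than a guarantee.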

The benefit of this standardization is that it doesn’t rely on the original units or scale of the feature in the dataset. Because of this, we’re able to more easily compare the impact of one feature to another.

The z-score is generally calculated for each value in a given feature. It takes into account the standard deviation and the mean of the feature. The formula for the z-score looks like this:

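The formula for a z-score: z = (x - μ) / σ, where x is the value, μ is the mean of the feature, and σ is its standard deviation.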

For each value in an array, the z-score is calculated by dividing the difference between the value and the mean by the standard deviation of the distribution. Because of this, the z-score can be either positive or negative, indicating whether the value is larger or smaller than the mean.
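
To make the formula concrete, here is a minimal sketch that applies it to a single value. The function name z_score and the numbers used are made up for illustration.

```python
def z_score(value, mean, standard_deviation):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / standard_deviation

# A value above the mean gets a positive z-score
print(z_score(80, mean=70, standard_deviation=5))   # 2.0

# A value below the mean gets a negative z-score
print(z_score(60, mean=70, standard_deviation=5))   # -2.0
```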

In the next section, you’ll learn how to calculate the z-score from scratch in Python.

How to Calculate a Z-Score from Scratch in Python

In order to calculate the z-score, we need to first calculate the mean and the standard deviation of an array. To learn how to calculate the standard deviation in Python, check out my guide here.

To calculate the standard deviation from scratch, let’s use the code below:

# Calculate the standard deviation in Python
values = [4, 5, 6, 6, 6, 7, 8, 12, 13, 13, 14, 18]

mean = sum(values) / len(values)
differences = [(value - mean)**2 for value in values]
sum_of_differences = sum(differences)
# Divide by n (population standard deviation) to match scipy's default
standard_deviation = (sum_of_differences / len(values)) ** 0.5

print(round(standard_deviation, 4))
# Returns: 4.2687

Now that we have the mean and the standard deviation, we can loop over the list of values and calculate the z-scores. We can do this by subtracting the mean from the value and dividing this by the standard deviation. Note that dividing by len(values) - 1 instead would give the sample standard deviation and slightly smaller z-scores.

In order to do this, let’s use a Python list comprehension to loop over each value:

# Calculate the z-score from scratch
zscores = [(value - mean) / standard_deviation for value in values]

print([round(zscore, 4) for zscore in zscores])
# Returns: [-1.2494, -1.0151, -0.7809, -0.7809, -0.7809, -0.5466, -0.3123, 0.6247, 0.859, 0.859, 1.0932, 2.0303]

This approach works, but it’s a bit verbose. I wanted to cover it here to provide a means of calculating the z-score with just pure Python. It can also be a good method to demonstrate in Python coding interviews.
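
If you want to stay in the standard library but skip the manual arithmetic, one possible shortcut is the built-in statistics module. The sketch below assumes Python 3.8 or later (for fmean()) and uses pstdev(), the population standard deviation, so the results line up with the default behavior of scipy.stats.zscore().

```python
import statistics

values = [4, 5, 6, 6, 6, 7, 8, 12, 13, 13, 14, 18]

mean = statistics.fmean(values)
# pstdev() is the population standard deviation (divides by n);
# statistics.stdev() would give the sample version (divides by n - 1)
standard_deviation = statistics.pstdev(values)

zscores = [(value - mean) / standard_deviation for value in values]
print([round(zscore, 4) for zscore in zscores])
# Prints: [-1.2494, -1.0151, -0.7809, -0.7809, -0.7809, -0.5466, -0.3123, 0.6247, 0.859, 0.859, 1.0932, 2.0303]
```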

That being said, there are much easier ways to accomplish this. In the next section, you’ll learn how to calculate the z-score with scipy.

How to Use Scipy to Calculate a Z-Score

The most common way to calculate z-scores in Python is to use the scipy module. The module has numerous statistical functions available through the scipy.stats module, including the one we’ll be using in this tutorial: zscore().

The zscore() function takes an array of values and returns an array containing their z-scores. It implicitly handles calculating the mean and the standard deviation, so we don’t need to calculate those ourselves. This not only saves us many lines of code, but also makes our code more readable.

Let’s see how we can use the scipy.stats package to calculate z-scores:

# Calculate the z-score with scipy
import scipy.stats as stats
values = [4,5,6,6,6,7,8,12,13,13,14,18]

zscores = stats.zscore(values)
print(zscores)
# Returns: [-1.2493901  -1.01512945 -0.78086881 -0.78086881 -0.78086881 -0.54660817 -0.31234752  0.62469505  0.85895569  0.85895569  1.09321633  2.0302589 ]

We can see how easy it was to calculate the z-scores in Python using scipy! One important thing to note here is that the scipy.stats.zscore() function doesn’t return a list. It actually returns a numpy array.
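
If you need a plain Python list, or the sample standard deviation rather than the population one, a sketch along these lines may help. It relies on the ddof parameter of scipy.stats.zscore(), which defaults to 0, and on the .tolist() method of numpy arrays. The variable names are chosen for this illustration only.

```python
import scipy.stats as stats

values = [4, 5, 6, 6, 6, 7, 8, 12, 13, 13, 14, 18]

# Default: population standard deviation (ddof=0), returned as a numpy array
population_z = stats.zscore(values)

# ddof=1 divides by n - 1, i.e. uses the sample standard deviation
sample_z = stats.zscore(values, ddof=1)

# Convert the numpy array to a regular Python list when needed
as_list = population_z.tolist()
print(type(population_z).__name__, type(as_list).__name__)
# Prints: ndarray list
```

Because the sample standard deviation is slightly larger, the z-scores in sample_z sit a little closer to zero than those in population_z.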

In the next section, you’ll learn how to use Pandas and scipy to calculate z-scores for a Pandas Dataframe.

How to Use Pandas to Calculate a Z-Score

There may be many times when you want to calculate the z-scores for a Pandas Dataframe. In this section, you’ll learn how to calculate the z-score for a Pandas column as well as for an entire dataframe. To accomplish this, we’ll be using the scipy library.

Let’s load a sample Pandas Dataframe to calculate our z-scores:

# Loading a Sample Pandas Dataframe
import pandas as pd

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Joe', 'Mitch', 'Alana'],
    'Age': [32, 30, 67, 34, 20],
    'Income': [80000, 90000, 45000, 23000, 12000],
    'Education' : [5, 7, 3, 4, 4]
})

print(df.head())

# Returns:
#     Name  Age  Income  Education
# 0    Nik   32   80000          5
# 1   Kate   30   90000          7
# 2    Joe   67   45000          3
# 3  Mitch   34   23000          4
# 4  Alana   20   12000          4

By printing the dataframe with the Pandas .head() method, we can see that it has four columns. Three of these are numerical columns, for which we can calculate the z-score.

We can use the scipy.stats.zscore() function to calculate the z-scores on a Pandas dataframe column. Let’s create a new column that contains the values from the Income column normalized using the z-score:

df['Income zscore'] = stats.zscore(df['Income'])
print(df.head())

# Returns:
#     Name  Age  Income  Education  Income zscore
# 0    Nik   32   80000          5       0.978700
# 1   Kate   30   90000          7       1.304934
# 2    Joe   67   45000          3      -0.163117
# 3  Mitch   34   23000          4      -0.880830
# 4  Alana   20   12000          4      -1.239687

One of the benefits of calculating z-scores is that it puts values from different features on a common scale. Because of this, it’s often useful to calculate the z-scores for all numerical columns in a dataframe.

Let’s see how we can convert our dataframe columns to z-scores using the Pandas .apply() method:

df = df.drop(columns='Income zscore').select_dtypes(include='number').apply(stats.zscore)
print(df.head())

# Returns:
#         Age    Income  Education
# 0 -0.288493  0.978700   0.294884
# 1 -0.413925  1.304934   1.769303
# 2  1.906565 -0.163117  -1.179536
# 3 -0.163061 -0.880830  -0.442326
# 4 -1.041085 -1.239687  -0.442326

In the example above, we first drop the Income zscore column we created earlier so that it isn’t standardized a second time. We then select only the numeric columns using the .select_dtypes() method and use the .apply() method to apply the zscore function to each column.

The benefit of this is that we’re now able to compare the features in relation to one another in a way that isn’t impacted by their original scales.
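
If you would rather not depend on scipy for this step, the same standardization can be sketched with Pandas alone. The example below builds a small dataframe of its own (reusing the Age and Income values from the sample data) and passes ddof=0 to .std(), since Pandas defaults to the sample standard deviation while scipy.stats.zscore() defaults to the population one.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [32, 30, 67, 34, 20],
    'Income': [80000, 90000, 45000, 23000, 12000],
})

# pandas' .std() defaults to ddof=1 (sample standard deviation),
# so pass ddof=0 to match scipy.stats.zscore()'s default
zscores = (df - df.mean()) / df.std(ddof=0)
print(zscores.round(4))
```

Subtracting df.mean() and dividing by df.std() works column by column, so each feature is standardized against its own mean and standard deviation.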

Calculate a z-score From a Mean and Standard Deviation in Python

In this final section, you’ll learn how to calculate a z-score when you already know the mean and standard deviation of a distribution. This lets you see how far from the mean a given value is without needing the full dataset. This approach is available only in Python 3.9 onwards, since that is when the zscore() method was added.

For this approach, we can use the statistics library, which comes built into Python. The module provides a class, NormalDist, which allows us to pass in both a mean and a standard deviation. This creates a NormalDist object, whose .zscore() method we can call with the value we want to score.

Let’s take a look at an example:

# Calculate a z-score from a provided mean and standard deviation
import statistics
mean = 7
standard_deviation = 1.3

zscore = statistics.NormalDist(mean, standard_deviation).zscore(5)
print(zscore)

# Returns: -1.5384615384615383

We can see that this returns a value of -1.538, meaning that the value is roughly 1.5 standard deviations below the mean.
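
The same NormalDist object can also tell you how unusual that value is. As a rough sketch, the .cdf() method returns the share of the distribution at or below a given value, assuming the data really does follow a normal distribution with the mean and standard deviation from the example. The variable names are chosen for this illustration only.

```python
from statistics import NormalDist

distribution = NormalDist(mu=7, sigma=1.3)

z = distribution.zscore(5)          # requires Python 3.9+
share_below = distribution.cdf(5)   # probability of a value at or below 5

print(round(z, 2))            # -1.54
print(round(share_below, 3))  # 0.062
```

In other words, under these assumptions only about 6% of values would be expected to fall at or below 5.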

Conclusion

In this tutorial, you learned how to use Python to calculate a z-score. You learned how to calculate it from scratch, how to use the scipy module to calculate a z-score, and how to use Pandas to calculate it for a column and an entire dataframe. Finally, you learned how to use the statistics library to calculate a z-score when you know a mean, a standard deviation, and a value.

To learn more about the scipy zscore function, check out the official documentation here.

