Calculate and Plot a Correlation Matrix in Python and Pandas

In this tutorial, you’ll learn how to calculate a correlation matrix in Python and how to plot it as a heat map. You’ll learn what a correlation matrix is and how to interpret it, along with a short review of what the coefficient of correlation is. You’ll then learn how to calculate a correlation matrix with the pandas library and how to plot it as a heat map using Seaborn. Finally, you’ll learn how to customize these heat maps to include only certain values.

The Quick Answer: Use Pandas’ df.corr() to Calculate a Correlation Matrix in Python

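# Calculating a Correlation Matrix with Pandas
import pandas as pd

# df is any DataFrame; numeric_only=True skips non-numeric columns
# (in pandas 2.0+, non-numeric columns otherwise raise an error)
matrix = df.corr(numeric_only=True)
print(matrix)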

# Returns:
#                    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
# bill_length_mm           1.000000      -0.235053           0.656181     0.595110
# bill_depth_mm           -0.235053       1.000000          -0.583851    -0.471916
# flipper_length_mm        0.656181      -0.583851           1.000000     0.871202
# body_mass_g              0.595110      -0.471916           0.871202     1.000000

What a Correlation Matrix is and How to Interpret it

A correlation matrix is a common tool used to compare the coefficients of correlation between different features (or attributes) in a dataset. It allows us to see how much (or how little) correlation exists between different variables. This is an important step in pre-processing data for machine learning pipelines. Since the correlation matrix allows us to identify variables that are highly correlated with one another, it can help us reduce the number of features in a dataset. This is often referred to as dimensionality reduction and can be used to improve the runtime and effectiveness of our models.

That’s the theory of our correlation matrix. But what does it actually look like? A correlation matrix has the same number of rows and columns as our dataset has columns. This means that if we have a dataset with 10 columns, then our matrix will have ten rows and ten columns. Each row and column represents a variable (or column) in our dataset and the value in the matrix is the coefficient of correlation between the corresponding row and column.

What is a Correlation Coefficient? A coefficient of correlation is a value between -1 and +1 that denotes both the strength and the direction of a relationship between two variables. The closer the value is to 1 (or -1), the stronger the relationship. The closer the value is to 0, the weaker the relationship. A negative coefficient tells us that the relationship is negative, meaning that as one value increases, the other decreases. Similarly, a positive coefficient indicates that as one value increases, so does the other.
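
To make the definition concrete, here is a minimal sketch that computes the Pearson coefficient by hand with NumPy and compares it with NumPy’s built-in np.corrcoef. The arrays x and y are made-up illustration values, not part of the penguins dataset used later.

```python
import numpy as np

# Made-up illustration data (an assumption, not from the penguins dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 3.0, 4.0, 1.0, 2.0])

# Pearson's r: the sum of the products of the deviations from each mean,
# divided by the square root of the product of the summed squared deviations
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 4))  # -0.8
print(round(r_numpy, 4))   # -0.8
```

The result of -0.8 is negative and close to -1, so y tends to fall as x rises, and the relationship is fairly strong.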

Let’s see what a correlation matrix looks like when we plot it as a heat map. Here, we have a simple 4×4 matrix, meaning that we have 4 columns and 4 rows.

[Figure: A sample correlation matrix visualized as a heat map]

The values in our matrix are the correlation coefficients between the pairs of features. We can see that we have a diagonal line of 1s. This is because these values represent the correlation between a column and itself. Because a column is always perfectly correlated with itself, these values will always be 1.

If you have a keen eye, you’ll notice that the values in the top right are the mirror image of the values in the bottom left of the matrix. This is because correlation is symmetric: the correlation between A and B is the same as the correlation between B and A. It’s common practice to remove one of these halves from a heat map in order to better visualize the data. This is something you’ll learn in later sections of the tutorial.
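
These properties are easy to check in code. The sketch below uses a small made-up dataframe (the column names a, b and c and their values are hypothetical) to confirm the shape, the diagonal of 1s, and the symmetry.

```python
import numpy as np
import pandas as pd

# Small made-up dataset (hypothetical values for illustration only)
data = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

corr = data.corr()
values = corr.to_numpy()

# One row and one column per variable
print(corr.shape)  # (3, 3)

# Every variable is perfectly correlated with itself
print(np.allclose(np.diag(values), 1.0))  # True

# The matrix equals its own transpose, so the two triangles mirror each other
print(np.allclose(values, values.T))  # True
```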

Calculate a Correlation Matrix in Python with Pandas

Pandas makes it incredibly easy to create a correlation matrix using the dataframe method, .corr(). The method takes a number of parameters. Let’s explore them before diving into an example:

matrix = df.corr(
    method='pearson',    # The method of correlation
    min_periods=1,       # Min number of observations required
    numeric_only=True    # Skip non-numeric columns
)

By default, the corr method uses the Pearson coefficient of correlation, though you can select the Kendall or Spearman methods as well. Similarly, you can set the minimum number of observations that each pair of columns must have in order to produce a result.
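
To see what these two parameters change, here is a small sketch on made-up data (the dataframes curve and gaps are hypothetical). It compares Pearson with Spearman on a relationship that always increases but is not a straight line, and then shows how min_periods turns a result into NaN when too few complete pairs exist.

```python
import numpy as np
import pandas as pd

# Made-up data: y grows with x, but not in a straight line
curve = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

pearson = curve.corr(method="pearson").loc["x", "y"]
spearman = curve.corr(method="spearman").loc["x", "y"]
print(pearson < 1.0)       # True: the relationship is not perfectly linear
print(round(spearman, 4))  # 1.0: the rank orders agree perfectly

# Made-up data with gaps: only three rows have both x and y present
gaps = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, np.nan, 6, np.nan, 10]})

loose = gaps.corr(min_periods=1).loc["x", "y"]   # uses the 3 complete pairs
strict = gaps.corr(min_periods=4).loc["x", "y"]  # fewer than 4 pairs available
print(np.isnan(loose))   # False
print(np.isnan(strict))  # True
```

Spearman works on ranks, so it rewards any consistently increasing relationship, while Pearson only reaches 1 for a straight line.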

Loading a Sample Pandas Dataframe

Now that you have an understanding of how the method works, let’s load a sample Pandas Dataframe. For this, we’ll use the Seaborn load_dataset function, which loads sample datasets based on real-world data. We’ll load the penguins dataset. Seaborn also allows us to create very useful Python visualizations, providing an easy-to-use high-level wrapper on Matplotlib.

# Loading a sample Pandas dataframe
import pandas as pd
import seaborn as sns

df = sns.load_dataset('penguins')
print(df.head())

# Returns:
#   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
# 0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
# 1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
# 2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
# 3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
# 4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female

Let’s break down what we’ve done here:

  • We imported the Pandas library using the alias pd. We also imported the Seaborn library using the alias sns.
  • We then created a dataframe, df, using the load_dataset function and passing in 'penguins' as the argument.
  • Finally, we printed the first five rows of the dataframe using the .head() method.

We can see that our dataframe has 7 columns. Some of these columns are numeric and others are strings.

Calculating a Correlation Matrix with Pandas

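Now that we have our Pandas Dataframe loaded, let’s use the corr method to calculate our correlation matrix. We’ll apply the method directly to the entire dataframe, passing numeric_only=True so that the string columns are skipped (in pandas 2.0 and later, leaving them in raises an error):

# Calculating a Correlation Matrix with Pandas
matrix = df.corr(numeric_only=True)
print(matrix)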

# Returns:
#                    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
# bill_length_mm           1.000000      -0.235053           0.656181     0.595110
# bill_depth_mm           -0.235053       1.000000          -0.583851    -0.471916
# flipper_length_mm        0.656181      -0.583851           1.000000     0.871202
# body_mass_g              0.595110      -0.471916           0.871202     1.000000

We can see that while our original dataframe had seven columns, the matrix only includes the four numeric columns. Each of these columns appears as both a row and a column, and each cell holds the correlation between that row-column pair.
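
If you prefer to be explicit about which columns are used, you can select the numeric columns yourself before calling corr. The sketch below uses a small made-up dataframe (mixed, with hypothetical values) as a stand-in for a dataset with both string and numeric columns, and assumes pandas 1.5 or later for the numeric_only parameter.

```python
import numpy as np
import pandas as pd

# Small made-up stand-in for a mixed-type dataset (hypothetical values)
mixed = pd.DataFrame({
    "species": ["A", "A", "B", "B"],
    "length": [39.1, 39.5, 46.1, 50.0],
    "mass": [3750.0, 3800.0, 4500.0, 5700.0],
})

# Option 1: ask corr() to ignore non-numeric columns
corr_flag = mixed.corr(numeric_only=True)

# Option 2: select the numeric columns explicitly, then correlate
corr_select = mixed.select_dtypes(include="number").corr()

print(list(corr_flag.columns))  # ['length', 'mass']
print(np.allclose(corr_flag.to_numpy(), corr_select.to_numpy()))  # True
```

Both options give the same matrix; the second one makes the column selection visible in the code.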

For example, we can see that the coefficient of correlation between the body_mass_g and flipper_length_mm variables is 0.87. This indicates that there is a relatively strong, positive relationship between the two variables.
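
When you only care about a single pair like this, you don’t need the whole matrix. As a small sketch, the Series .corr() method returns the coefficient for two columns directly. The dataframe sample and its values below are hypothetical and only serve to show that both approaches agree.

```python
import pandas as pd

# Made-up measurements (hypothetical values for illustration only)
sample = pd.DataFrame({
    "flipper": [181.0, 186.0, 195.0, 193.0, 190.0],
    "mass": [3750.0, 3800.0, 3250.0, 3450.0, 3650.0],
})

# Correlation between two columns, computed directly
pair_value = sample["flipper"].corr(sample["mass"])

# The same coefficient, looked up in the full matrix
matrix_value = sample.corr().loc["flipper", "mass"]

print(abs(pair_value - matrix_value) < 1e-9)  # True
```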

Rounding our Correlation Matrix Values with Pandas

We can round the values in our matrix to two decimal places to make them easier to read. The matrix that’s returned is actually a Pandas Dataframe. This means that we can apply different dataframe methods to the matrix itself. We can use the Pandas round method to round our values.

matrix = df.corr(numeric_only=True).round(2)
print(matrix)

# Returns:
#                    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
# bill_length_mm               1.00          -0.24               0.66         0.60
# bill_depth_mm               -0.24           1.00              -0.58        -0.47
# flipper_length_mm            0.66          -0.58               1.00         0.87
# body_mass_g                  0.60          -0.47               0.87         1.00

While we lose a bit of precision doing this, it does make the relationships easier to read.

In the next section, you’ll learn how to use the Seaborn library to plot a heat map based on the matrix.

How to Plot a Heat map Correlation Matrix with Seaborn

In many cases, you’ll want to visualize a correlation matrix. A heat map is a natural fit, since colour makes the strength and direction of each relationship easy to see at a glance. The Seaborn library makes creating a heat map very easy, using the heatmap function.

Let’s now import pyplot from matplotlib in order to visualize our data. While we’ll actually be using Seaborn to visualize the data, Seaborn relies heavily on matplotlib for its visualizations.

# Visualizing a Pandas Correlation Matrix Using Seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('penguins')
matrix = df.corr(numeric_only=True).round(2)
sns.heatmap(matrix, annot=True)
plt.show()

Here, we have imported the pyplot module as plt, which allows us to display our plot. We then used the sns.heatmap() function, passing in our matrix and asking the library to annotate our heat map with the values using the annot= parameter. This returned the following graph:

[Figure: Visualizing a correlation matrix with mostly default parameters]

We can see that a number of odd things have happened here. Firstly, we know that a correlation coefficient can take values from -1 through +1. Our graph currently only shows values from roughly -0.5 through +1. Because of this, unless we’re careful, we may infer that negative relationships are stronger than they actually are. Further, the colours aren’t diverging from a neutral midpoint. We want our colours to become stronger as relationships become stronger, whether positive or negative. Instead, the colours weaken as the values approach +1.

We can modify a few additional parameters here:

  1. vmin=, vmax= are used to anchor the colormap. If neither is passed, the values are inferred from the data, which is why the negative end of the scale stopped at roughly -0.5. Since we know that coefficients of correlation range from -1 to +1, we can pass these in.
  2. center= specifies the value at which to centre the colormap when we plot divergent data. Since we want the colours to diverge from 0, we should specify 0 as the argument here.
  3. cmap= allows us to pass in a different color map. Because we want the colors to be stronger at either end of the divergence, we can pass in vlag as the argument to show colours that go from blue to red.

Let’s try this again, passing in these three new arguments:

# Visualizing a Pandas Correlation Matrix Using Seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('penguins')
matrix = df.corr(numeric_only=True).round(2)
sns.heatmap(matrix, annot=True, vmax=1, vmin=-1, center=0, cmap='vlag')
plt.show()

This returns the following heat map. The colour scale now runs from -1 to +1, diverges from 0, and conveniently darkens at either pole.

[Figure: A properly formatted heat map with divergent colours]

In this section, you learned how to format a heat map generated using Seaborn to better visualize relationships between columns.

Plot Only the Lower Half of a Correlation Matrix with Seaborn

One thing that you’ll notice is how redundant it is to show both the upper and lower half of a correlation matrix. Our minds can only interpret so much, so it may be helpful to only show the bottom half of our visualization. Similarly, it can make sense to remove the diagonal line of 1s, since it carries no real information.

In order to accomplish this, we can use the numpy triu function, which keeps the upper triangle of a matrix and zeroes out everything below it. Let’s begin by importing numpy and adding a mask variable to our script. We can then pass this mask into our Seaborn function, which hides every cell where the mask is True and leaves only the values we want to see:

# Showing only the bottom half of our correlation matrix
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = sns.load_dataset('penguins')
matrix = df.corr(numeric_only=True).round(2)
mask = np.triu(np.ones_like(matrix, dtype=bool))
sns.heatmap(matrix, annot=True, vmax=1, vmin=-1, center=0, cmap='vlag', mask=mask)
plt.show()

This returns the following image:

[Figure: Displaying only the bottom half of a matrix using a numpy mask]

We can see how much easier it is to understand the strength of our dataset’s relationships here. Because we’ve removed a significant amount of visual clutter (over half!), we can much better interpret the meaning behind the visualization.
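
To see exactly which cells the mask hides, it can help to print it. The sketch below builds the same kind of mask for a 3×3 shape and also shows a variant, using the k=1 argument of np.triu, that leaves the diagonal visible. The variable names are hypothetical.

```python
import numpy as np

# A 3x3 all-True boolean array stands in for the shape of a correlation matrix
ones = np.ones((3, 3), dtype=bool)

# k=0 (the default): the upper triangle, including the diagonal, is True,
# so those cells are hidden by the heat map
mask_with_diagonal = np.triu(ones)
print(mask_with_diagonal)
# [[ True  True  True]
#  [False  True  True]
#  [False False  True]]

# k=1: only cells strictly above the diagonal are True,
# so the diagonal of 1s stays visible
mask_above_diagonal = np.triu(ones, k=1)
print(mask_above_diagonal)
# [[False  True  True]
#  [False False  True]
#  [False False False]]
```

Passing the second mask to sns.heatmap() through the mask= parameter would keep the diagonal in the plot while still hiding the mirrored upper half.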

How to Save a Correlation Matrix to a File in Python

There may be times when you want to save the correlation matrix heat map programmatically. So far, we have used the plt.show() function to display our graph. You can then, of course, manually save the result to your computer. But matplotlib makes it easy to save the graph programmatically using the savefig() function.

The function allows us to pass in a file path to indicate where we want to save the file. Say we wanted to save it in the current working directory, we can pass in a relative path like below:

# Saving a Heatmap
# Call savefig() before plt.show(); after show() the figure may already be cleared
plt.savefig('heatmap.png')

In the code shown above, we save the file as a png file with the name heatmap. Because the path is relative, the file will be saved in the current working directory, which is usually the directory the script is run from.
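
Here is a fuller sketch of saving a heat map from start to finish, without opening a window. It uses a small made-up dataframe and writes to a temporary directory; both are assumptions chosen so the example runs anywhere. The dpi and bbox_inches arguments are optional savefig parameters for controlling resolution and trimming whitespace.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up dataset (hypothetical values for illustration only)
data = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(data.corr().round(2), annot=True, vmin=-1, vmax=1,
            center=0, cmap="vlag", ax=ax)

# Hypothetical output location: a file inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "heatmap.png")

# dpi sets the resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)

print(os.path.exists(out_path))  # True
```

Calling savefig on the figure object, rather than on plt, makes it explicit which figure is being written.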

Selecting Only Strong Correlations in a Correlation Matrix

In some cases, you may only want to select strong correlations in a matrix. As a common rule of thumb, a correlation is considered strong when its absolute value is greater than or equal to 0.7. Since the matrix that gets returned is a Pandas Dataframe, we can use Pandas filtering methods to filter our dataframe.

Since we want to select strong relationships, we need to be able to select values greater than or equal to 0.7 and less than or equal to -0.7. Rather than writing two conditions, we can simply filter on the absolute value of our correlation coefficient.

Let’s take a look at how we can do this:

matrix = df.corr(numeric_only=True)
matrix = matrix.unstack()
matrix = matrix[abs(matrix) >= 0.7]

print(matrix)

# Returns:
# bill_length_mm     bill_length_mm       1.000000
# bill_depth_mm      bill_depth_mm        1.000000
# flipper_length_mm  flipper_length_mm    1.000000
#                    body_mass_g          0.871202
# body_mass_g        flipper_length_mm    0.871202
#                    body_mass_g          1.000000

Here, we first take our matrix and apply the unstack method, which converts the matrix into a 1-dimensional series of values with a multi-index. This means that each index entry indicates both the row and column of the previous matrix. We can then filter the series based on the absolute value. Note that the result still includes each column’s correlation with itself (always 1) and lists each pair twice.
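
If the self-correlations and mirrored duplicates get in the way, one possible approach is to keep a pair only when its first label sorts before its second. This is a sketch on a small made-up dataframe (the columns a to d and their values are hypothetical), not the only way to do it.

```python
import pandas as pd

# Small made-up dataset (hypothetical values for illustration only)
data = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "d": [1, 3, 2, 3, 1],
})

pairs = data.corr().unstack()

# Keep a pair only when the first label sorts before the second; this drops
# self-correlations such as (a, a) and mirrored duplicates such as (b, a)
keep = [first < second for first, second in pairs.index]
unique_pairs = pairs[keep]

# Filter on the absolute value, as before
strong = unique_pairs[unique_pairs.abs() >= 0.7]
print(strong.round(2))
# Three pairs remain: (a, b), (a, c) and (b, c)
```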

Selecting Only Positive / Negative Correlations in a Correlation Matrix

In some cases, you may want to select only positive correlations in a dataset or only negative correlations. We can, again, do this by first unstacking the dataframe and then selecting either only positive or negative relationships.

Let’s first see how we can select only positive relationships:

matrix = df.corr(numeric_only=True)
matrix = matrix.unstack()
matrix = matrix[matrix > 0]

print(matrix)

# Returns:
# bill_length_mm     bill_length_mm       1.000000
#                    flipper_length_mm    0.656181
#                    body_mass_g          0.595110
# bill_depth_mm      bill_depth_mm        1.000000
# flipper_length_mm  bill_length_mm       0.656181
#                    flipper_length_mm    1.000000
#                    body_mass_g          0.871202
# body_mass_g        bill_length_mm       0.595110
#                    flipper_length_mm    0.871202
#                    body_mass_g          1.000000

We can see here that this process is nearly the same as selecting only strong relationships. We simply change our filter of the series to only include relationships where the coefficient is greater than zero.

Similarly, if we wanted to select only negative relationships, we only need to change one character. We can change the > to a < comparison:

matrix = matrix[matrix < 0]

This is a helpful tool, allowing us to see which relationships run in either direction. We can even combine these and select only strong positive relationships or strong negative relationships.
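
Combining the two filters can be sketched as follows, joining the conditions with the & operator. The dataframe and its values are made up for illustration, and the 0.7 threshold is the same rule of thumb used earlier.

```python
import pandas as pd

# Small made-up dataset (hypothetical values for illustration only)
data = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "d": [1, 3, 2, 3, 1],
})

pairs = data.corr().unstack()

# Strong AND positive: wrap each condition in parentheses and join with &
strong_positive = pairs[(pairs > 0) & (pairs.abs() >= 0.7)]

# Strong AND negative
strong_negative = pairs[(pairs < 0) & (pairs.abs() >= 0.7)]

print(len(strong_positive))  # 6: four self-correlations plus (a, b) and (b, a)
print(len(strong_negative))  # 4: (a, c), (c, a), (b, c) and (c, b)
```

The parentheses around each condition matter, because & binds more tightly than the comparison operators in Python.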

Conclusion

In this tutorial, you learned how to use Python and Pandas to calculate a correlation matrix. You learned, briefly, what a correlation matrix is and how to interpret it. You then learned how to use the Pandas corr method to calculate a correlation matrix and how to filter it based on different criteria. You also learned how to use the Seaborn library to visualize a matrix using the heatmap function, allowing you to better visualize and understand the data at a glance.

To learn more about the Pandas .corr() dataframe method, check out the official documentation here.
