
PCA in Python: Understanding Principal Component Analysis


Principal Component Analysis (PCA) is a cornerstone technique in data analysis, machine learning, and artificial intelligence, offering a systematic approach to handle high-dimensional datasets by reducing complexity. By distilling data into uncorrelated dimensions called principal components, PCA retains essential information while mitigating the effects of high dimensionality. With diverse applications including dimensionality reduction, feature selection, data compression, and exploration, PCA plays a pivotal role in extracting meaningful insights from large datasets.

By the end of this tutorial, you’ll have learned the following:

  • What Principal Component Analysis (PCA) is and why it matters
  • How Principal Component Analysis works
  • How to conduct PCA in Sklearn
  • How to visualize principal components and better understand your data

Understanding Principal Component Analysis

Principal Component Analysis (or PCA for short) is a technique used in data analysis, machine learning, and artificial intelligence, for reducing the dimensionality of datasets while retaining important information.

PCA works by transforming higher-dimensionality data into new, uncorrelated dimensions called principal components. It has many widespread applications, including:

  • Dimensionality Reduction: PCA extracts the essential information from data, while allowing you to remove the less relevant information
  • Feature Selection: in PCA, the components are ranked by the variance in data that is explained by each of them; this allows you to remove less relevant components
  • Data Compression: because PCA can reduce the dimensionality of your data by finding the most variable information, it can be used to compress data, including images
  • Data Exploration: PCA can help you find new patterns and insights in your data through visual means that wouldn’t have been possible using higher dimensions

How Does Principal Component Analysis Work?

There are quite a few inner workings to Principal Component Analysis. In this section, we’ll break down how PCA works in plain terms, allowing you to build your intuition.


  1. Centering Data

    PCA begins by centering the data, which involves subtracting the mean of each feature from every data point. This ensures that the mean of each feature becomes zero, which makes it easier to identify the directions of maximum variance later on. (Note: it’s often a good idea to scale your data as well!)

  2. Calculating a Covariance Matrix

    The covariance matrix describes the relationships between different features and how they change together.

  3. Eigenvalue Decomposition

    Next, we perform eigenvalue decomposition on the covariance matrix, which yields eigenvectors and eigenvalues. The eigenvectors represent the directions of maximum variance in the data (the principal components), and the eigenvalues represent the amount of variance explained by each principal component.

  4. Selecting Principal Components

    Next, we sort the eigenvectors by their corresponding eigenvalues in decreasing order. We then choose the top n eigenvectors, the ones with the highest eigenvalues, to capture most of the variance in the dataset.

  5. Projection of Data

    We then project the original data onto the subspace spanned by the selected principal components. To do this, we multiply the centered data matrix by the matrix of principal components. The resulting transformed data has reduced dimensionality, where each data point is represented by the principal components (see the NumPy sketch below).
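
To make these five steps concrete, here is a minimal from-scratch sketch using NumPy. It uses a small, hypothetical data array X purely for illustration; in practice, you would usually rely on sklearn’s PCA class, which we cover next.

# A From-Scratch Sketch of PCA Using NumPy (illustrative only)
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4))  # hypothetical data: 50 samples, 4 features

# 1. Center the data by subtracting each feature's mean
X_centered = X - X.mean(axis=0)

# 2. Calculate the covariance matrix (rows are observations)
cov_matrix = np.cov(X_centered, rowvar=False)

# 3. Eigenvalue decomposition (np.linalg.eigh suits symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# 4. Sort the components by eigenvalue (descending) and keep the top 2
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
top_two = eigenvectors[:, :2]

# 5. Project the centered data onto the selected components
X_projected = X_centered @ top_two
print(X_projected.shape)  # (50, 2)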

In the next section, you’ll learn how to use Scikit-Learn to reduce the dimensionality of your data using PCA! Let’s get started.

Principal Component Analysis in Sklearn

In Python, we can abstract away many of the complexities of Principal Component Analysis using sklearn. The popular machine-learning library provides many helpful algorithms for supervised and unsupervised learning, as well as tools for decomposing your datasets.

In order to conduct PCA in sklearn, we can make use of the PCA() class. The class allows you to conduct and explore PCA of your dataset to find the optimal number of components.

This will make more sense as we walk through an example. We’ll use a dataset covering decathlon data from the 1988 Olympics. Each athlete in the dataset competed in ten different events:

  • Running: 100m, 110m hurdles, 400m, and 1500m
  • Jumping: long jump, high jump, and pole vault
  • Throwing: javelin, discus, and shot put

You can find the dataset here, if you want to follow along.

Scaling Data for PCA

We can load the data into a Pandas DataFrame to simplify viewing our data. Let’s see how we can do this using the read_csv function:

# Reading Our Dataset into a DataFrame
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/nik-pi/Datasets/main/decathlon.csv', index_col='Athlete')
print(df.head())

# Returns:
#           100m  110m Hurdle   400m  1500m  Pole Vault  Long Jump  High Jump  \
# Athlete                                                                       
# SEBRLE   11.04        14.69  49.81  291.7        5.02       7.58       2.07   
# CLAY     10.76        14.05  49.37  301.5        4.92       7.40       1.86   
# KARPOV   11.02        14.09  48.37  300.2        4.92       7.30       2.04   
# BERNARD  11.02        14.99  48.93  280.1        5.32       7.23       1.92   
# YURKOV   11.34        15.31  50.42  276.4        4.72       7.09       2.10   

I have included a truncated version of the data in the code block above. One thing you’ll notice right away is that the scales of our data are quite different. We can verify this by using the .describe() method to get high-level statistics of our dataset:

# Getting Descriptive Statistics for Our Data
print(df.describe())

# Returns:
#             100m  110m Hurdle       400m       1500m  Pole Vault  Long Jump  \
# count  41.000000    41.000000  41.000000   41.000000   41.000000  41.000000   
# mean   10.998049    14.605854  49.616341  279.024878    4.762439   7.260000   
# std     0.263023     0.471789   1.153451   11.673247    0.278000   0.316402   
# min    10.440000    13.970000  46.810000  262.100000    4.200000   6.610000   
# 25%    10.850000    14.210000  48.930000  271.020000    4.500000   7.030000   
# 50%    10.980000    14.480000  49.400000  278.050000    4.800000   7.300000   
# 75%    11.140000    14.980000  50.300000  285.100000    4.920000   7.480000   
# max    11.640000    15.670000  53.200000  317.000000    5.400000   7.960000   

#        High Jump    Javelin   Shot Put     Discus  
# count  41.000000  41.000000  41.000000  41.000000  
# mean    1.976829  58.316585  14.477073  44.325610  
# std     0.088951   4.826820   0.824428   3.377845  
# min     1.850000  50.310000  12.680000  37.920000  
# 25%     1.920000  55.270000  13.880000  41.900000  
# 50%     1.950000  58.360000  14.570000  44.410000  
# 75%     2.040000  60.890000  14.970000  46.070000  
# max     2.150000  70.520000  16.360000  51.650000  

We can see that across different events, our values range from being clustered around 1 (say for High Jump) to being clustered around 270 (say for 1500m). Thankfully, we don’t have any missing data, but if you did, you could explore different ways of dealing with missing data.
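
If you want to double-check for missing values before moving on, a quick optional check looks like this:

# Checking for Missing Values
print(df.isna().sum())  # returns 0 for every column in this dataset

# If any values were missing, one simple option would be mean imputation:
# df = df.fillna(df.mean())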

It’s a good idea to scale our data so that it doesn’t inadvertently throw off our analysis. Thankfully, we can use sklearn’s StandardScaler class to do this! Let’s take a look at how we can do this:

# Scaling Our Dataset
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

print(scaled_data[0])

# Returns:
# [ 0.16147779  0.18057156  0.16998066  1.09931561  0.93798841  1.02393687
#   1.06045671  1.022196    0.43340505 -0.17252435]

In the code above, we first imported the StandardScaler class from the preprocessing module. We then instantiated the class into the scaler variable. Finally, we used the .fit_transform() method to pass in our DataFrame. This returns a NumPy array, where I’m showing only the first record of the scaled data.
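
As a quick, optional sanity check, every scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1:

# Verifying the Scaled Data
import numpy as np

print(np.round(scaled_data.mean(axis=0), 2))  # each event's mean is now ~0
print(np.round(scaled_data.std(axis=0), 2))   # each event's standard deviation is now ~1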

Now that our data is scaled, we can begin our PCA work!

Exploring PCA in Sklearn

In this section, we’ll take a look at how we can use the PCA class in sklearn to make informed decisions about decomposing our data. Let’s begin by creating a PCA object and fitting it to our data:

# Fitting Our Data to a PCA Model
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(scaled_data)

In the code block above, we first imported the PCA class. We then created a PCA object and fitted our data to it, using the .fit() method.

When you run this code, it may look like nothing is happening. Under the hood, however, there is quite a lot going on. We can access many different attributes of the fitted model. For example, we can see how much variance each component explains by using the .explained_variance_ratio_ attribute, as shown below:

# Exploring Explained Variance
print(pca.explained_variance_ratio_)

# Returns:
# [0.32719055 0.1737131  0.14049167 0.10568504 0.06847735 0.05992687
#  0.04512353 0.03968766 0.02148149 0.01822275]

This is incredibly useful information! We can see that the first component explains 32.7% of the data’s variance. However, the raw array is a little hard to read, so let’s create a Pandas DataFrame out of the data:

# Exploring our PCA Data
expl_var = pca.explained_variance_ratio_
df_expl_var = pd.DataFrame(
    data=zip(range(1, len(expl_var) + 1), expl_var, expl_var.cumsum()), 
    columns=['PCA', 'Explained Variance (%)', 'Total Explained Variance (%)']
    ).set_index('PCA').mul(100).round(1)
print(df_expl_var)

# Returns:
#      Explained Variance (%)  Total Explained Variance (%)
# PCA                                                      
# 1                      32.7                          32.7
# 2                      17.4                          50.1
# 3                      14.0                          64.1
# 4                      10.6                          74.7
# 5                       6.8                          81.6
# 6                       6.0                          87.5
# 7                       4.5                          92.1
# 8                       4.0                          96.0
# 9                       2.1                          98.2
# 10                      1.8                         100.0

In the code block above, we created a DataFrame that contains each principal component, the percentage of variance it explains, and the cumulative explained variance.

From this, we can see that 50% of the data’s variance is explained by only two components! We can also see that 100% of the variance is explained by ten components (which makes sense since we are working with ten columns of data!).
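
A common rule of thumb is to keep enough components to reach some target share of the variance. The sketch below uses 90% as an arbitrary illustration, showing both the manual calculation and sklearn’s built-in shortcut of passing a float to PCA():

# Finding the Number of Components Needed to Explain 90% of the Variance
import numpy as np

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cumulative >= 0.90) + 1)  # smallest number of components reaching 90%

# sklearn shortcut: a float between 0 and 1 keeps however many components
# are needed to reach that explained-variance ratio
pca_90 = PCA(n_components=0.90)
pca_90.fit(scaled_data)
print(pca_90.n_components_)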

Let’s plot this data to get a better sense of it:

# Plotting our explained variance
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8,8))
ax.bar(x=df_expl_var.index, height=df_expl_var['Explained Variance (%)'], label='Explained Variance', width=0.9, color='#AAD8D3')
ax.plot(df_expl_var['Total Explained Variance (%)'], label='Total Explained Variance', marker='o', c='#37B6BD')

plt.ylim(0, 100)
plt.ylabel('Explained Variance (%)')
plt.xlabel('PCA')
plt.grid(True, axis='y')
plt.title('Understanding Explained Variance in PCA\ndatagy.io')
plt.legend()

This returns the following graph:

Scree Plot for PCA Analysis in Python

While explaining 50% of the variance with 2 components may not seem like a lot, we should consider the context of our data. We have very few data points compared to larger datasets, and the events cover quite different disciplines. Because of this, let’s go ahead with 2 principal components to better visualize our data.

There are a number of ways to accomplish this, but since we are working with a small dataset, I would recommend simply re-creating our PCA object:

# Fitting and Transforming Our Data Using PCA
pca = PCA(2)
X_r = pca.fit_transform(scaled_data)

In the code block above, we passed in 2 as our number of components into the PCA class. We then used the .fit_transform() method to both fit and transform our data into the principal components. We can explore this data by slicing it:

# Exploring our transformed data
print(X_r[:2])

# Returns:
# [[0.79162772 0.7716112 ]
#  [1.23499056 0.57457807]]

By diving into the data, we can see that each record has been reduced to two dimensions (compared to the previous ten). The values themselves look a little abstract compared to the original data, but let’s go with it for now!
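
Even though the transformed values look abstract, they still carry most of the original information. As a small optional sketch, sklearn’s inverse_transform maps the two-component scores back into the ten-dimensional scaled space, which shows how well just two components approximate the data:

# Reconstructing an Approximation of the Scaled Data from 2 Components
import numpy as np

reconstructed = pca.inverse_transform(X_r)
print(reconstructed.shape)  # (41, 10)

# Mean squared reconstruction error: with roughly half of the variance kept,
# the approximation is coarse but captures the dominant structure
print(round(float(np.mean((scaled_data - reconstructed) ** 2)), 2))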

One question you may have at this point is how to understand these components. What variables, for example, feed into each component? How can you explain them to yourself and to the end consumers of your data? This is where the concept of loadings comes into play.

Understanding Loadings in PCA

Recall, in PCA, the original variables are transformed into a new set of uncorrelated variables called principal components. These components represent linear combinations of the original features and are ordered by the amount of variance that they explain.

Loading scores are the coefficients of the original variables in the linear combinations that form each principal component. This means that they represent the weights assigned to each variable to create the principal component.

When we look at loading scores, we can glean a lot of information about the influence of one variable on a principal component:

  • Large loading values (either positive or negative) indicate that the corresponding original variable has a strong influence on that particular principal component.
  • Small loading values indicate that the original variable has a weaker influence on the principal component.
  • Loadings close to zero suggest that the original variable has little relevance to the formation of that principal component.

What’s important to understand here is that loadings help us interpret the meaning of each principal component. By examining which variables have high loadings on which principal components, we can begin to understand the underlying structure of the data and find meaningful patterns.

In sklearn, we can access the loading scores for each principal component by using the .components_ attribute. When we print the loading scores, we get back an array with one row per principal component (in our case, 2), where each row has as many items as there are columns in the original dataset (in our case, 10).

# Printing Loading Scores
print(pca.components_)

# Returns:
# [[-0.42829627 -0.41255442 -0.3757157  -0.03210733  0.02783081  0.41015201
#    0.31619436  0.15319802  0.34414444  0.30542571]
#  [ 0.14198909  0.17359096  0.432046    0.35980486 -0.13684105 -0.26207936
#    0.2657761   0.24050715  0.45394697  0.4600244 ]]
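
Because the loadings are the weights in these linear combinations, we can sanity-check this relationship directly: multiplying the (mean-centered) scaled data by the transposed component matrix should reproduce the transformed scores from earlier. A minimal check, assuming the pca and X_r objects created above:

# Verifying That the Scores Are Linear Combinations of the Original Variables
import numpy as np

scores = (scaled_data - pca.mean_) @ pca.components_.T
print(np.allclose(scores, X_r))  # True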

We can see that by printing out the components, we can get information about each principal component. However, reviewing raw arrays like this can be difficult, so let’s visualize the data instead. This is a great case for a heat map generated with Seaborn:

# Plotting a Heatmap of Our Loadings
import seaborn as sns
fig, ax = plt.subplots(figsize=(8,8))

ax = sns.heatmap(
    pca.components_,
    cmap='coolwarm',
    yticklabels=[f'PCA{x}' for x in range(1,pca.n_components_+1)],
    xticklabels=list(df.columns),
    linewidths=1,
    annot=True,
    fmt=',.2f',
    cbar_kws={"shrink": 0.8, "orientation": 'horizontal'}
    )

ax.set_aspect("equal")
plt.title('Loading for Each Variable and Component', weight='bold')
plt.show()

The code above returns the following image:

Understanding Loadings in PCA

We can see from the image above which variables (the x-axis) influence which components (the y-axis). Keep in mind from earlier that large values (either positive or negative) indicate a stronger influence on a principal component. Similarly, keep in mind that each component captures a different amount of variation.

In simple terms, we can see that variables such as short distance running and long jump have large effects on PC1 (which explains a significant amount of the data’s variation), while variables like 1500m and pole vault influence it very little.

Since we only have two components, it can also be helpful to visualize the loadings in a scatterplot. To do this, we’ll first collect the loadings into a small DataFrame (one row per variable) and then plot them using Matplotlib, as shown below:

# Visualizing Loadings in a Scatterplot
loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=df.columns)

fig, ax = plt.subplots(figsize=(10, 8))
plt.scatter(
    x=loadings['PC1'],
    y=loadings['PC2'],
)

plt.axvline(x=0, c="black", label="x=0")
plt.axhline(y=0, c="black", label="y=0")

for label, x_val, y_val in zip(loadings.index, loadings['PC1'], loadings['PC2']):
    plt.annotate(label, (x_val, y_val), textcoords="offset points", xytext=(0,10), ha='center')

plt.title('Visualizing PCA1 and PCA2 Loadings', weight='bold')
ax.spines[['right', 'top', ]].set_visible(False)
plt.show()

This returns the following image:

Plotting Data Along PCAs

What we can see from this image is how closely correlated some variables are. For example, 100m and 110m Hurdle are clustered very closely together. Similarly, Discus and Shot Put sit very close together. Intuitively this makes sense, since these events overlap in the physical skills they require.

Visualizing Data Following PCA

The next step in PCA is to visualize our data and to see if we can find underlying meaning behind it. To make this process simpler, we can turn our transformed array into a Pandas DataFrame that contains the information about each athlete.

# Converting Our PCA Array into a DataFrame
X_r_df = pd.DataFrame(X_r, columns=['PCA1', 'PCA2'], index=df.index)
print(X_r_df.head())

# Returns:
#              PCA1      PCA2
# Athlete                    
# SEBRLE   0.791628  0.771611
# CLAY     1.234991  0.574578
# KARPOV   1.358215  0.484021
# BERNARD -0.609515 -0.874629
# YURKOV  -0.585968  2.130954

Let’s plot this data using a scatterplot to see if we can get some interesting insights from the data:

# Plotting Our Data Using a Scatterplot
fig, ax = plt.subplots(figsize=(10,10))
sns.scatterplot(data=X_r_df, x='PCA1', y='PCA2', ax=ax, s=100, alpha=0.5)
plt.title('Visualizing Original Data Following PCA\ndatagy.io')
sns.despine()

This returns the following image:

Visualizing Data in PCA

We can see that there are some notable outliers in the data. Let’s take a look at three points in particular, as shown below:

Picking Relevant Data from PCA

We can see that Bourguignon and Karpov vary dramatically along the first principal component. Meanwhile, Bourguignon and Casarsa are similar along PC1, but vary quite a bit along PC2.

From our loading scores, we know that there is a lot of variation in events such as the 100m and shot put. Let’s take a look at how Bourguignon and Karpov compare across these two disciplines:

# Looking at two athletes
print(df[['100m', 'Shot Put']].rank()[df.index.isin(['Karpov', 'BOURGUIGNON'])])

# Returns:
#              100m  Shot Put
# Athlete                    
# BOURGUIGNON  38.5       5.0
# Karpov        2.0      40.0

In the code block above, we ranked the two columns to see how the two athletes compared. Because they’re showing as quite different along the first component, we would expect that their rankings in 100m running and shot put would be quite different. By reviewing the rankings, we can see that this is true!

Now, let’s take a look at two other athletes: Bourguignon and Casarsa. They are similar along the first component, but quite different along the second component. We know that the second component is influenced quite a bit by throwing events, such as Discus and Shot Put. Let’s compare the rankings of these two athletes to see how well that holds:

# Comparing Athletes Along the Second Component
print(df[['Discus', 'Shot Put']].rank()[df.index.isin(['Casarsa', 'BOURGUIGNON'])])

# Returns:
#              Discus  Shot Put
# Athlete                      
# BOURGUIGNON     7.0       5.0
# Casarsa        35.0      30.0

We can see that both of these athletes ranked quite differently along sports that vary along the second component.

What Can Visualizing PCA Tell Us?

Visualizing the data along different principal components can tell us a number of different things, including:

  1. Identifying outliers: In the example above, we can see that Casarsa is a bit of an outlier, since he is on the extreme side of both principal components.
  2. Identifying clusters: points that are close together should share very similar profiles.
  3. Identifying differences: just as we saw that Bourguignon and Karpov were on different ends of the first component, visualizing PCA results can tell us which points are very dissimilar.

In short, visualizing data along principal components can help tell us which data points are most similar to and most dissimilar from one another along the variables with the greatest variation.

Conclusion

In conclusion, Principal Component Analysis (PCA) offers a powerful toolkit for dimensionality reduction, feature selection, data compression, and data exploration in various fields including data analysis, machine learning, and artificial intelligence. By transforming high-dimensional data into uncorrelated dimensions called principal components, PCA enables us to extract essential information while discarding less relevant details.

Throughout this exploration, we’ve delved into the inner workings of PCA, from centering and scaling data to calculating covariance matrices, performing eigenvalue decomposition, and selecting principal components. We’ve also demonstrated practical implementation using Python’s Scikit-Learn library, examining a dataset on decathlon performances from the 1988 Olympics.

By visualizing explained variances, loading scores, and transformed data, we gained insights into the underlying structure of our dataset. Through scatterplots and heatmaps, we identified outliers, clusters, and differences among data points, facilitating a deeper understanding of athlete performances and their relationships across different events.

In essence, PCA serves as a valuable tool for uncovering patterns, reducing complexity, and gaining actionable insights from high-dimensional datasets, ultimately enhancing decision-making processes across various domains.


