Normalize a Pandas Column or Dataframe (w/ Pandas or sklearn)

Learn how to normalize a Pandas column or dataframe, using either Pandas or scikit-learn.

Normalization is an important skill for any data analyst or data scientist. Normalization involves adjusting values that exist on different scales into a common scale, allowing them to be more readily compared. This is especially important when building machine learning models, as you want to ensure that the distribution of a column’s values doesn’t get over- or under-represented in your models.

In this tutorial, you’ll learn how to use Pandas and scikit-learn to normalize both a column and an entire dataframe using maximum absolute scaling, min-max feature scaling, and the z-score scaling method. You’ll also learn what these methods represent, as well as when and why to use each one.

What is Data Normalization in Machine Learning?

Data normalization takes features (or columns) of different scales and changes the scales of the data to be common. For example, if you’re comparing the height and weight of an individual, the values may be extremely different between the two scales. Because of this, if you’re attempting to create a machine learning model, one column may be weighted more heavily than the other simply because its values are larger.

This is where normalization comes into play: the values of the different columns are adjusted, so that they exist on a common scale, allowing them to be more easily compared.

In the following sections, you’ll learn how to apply data normalization to a Pandas Dataframe, meaning that you adjust numeric columns to a common scale. This prevents the model from favouring values with a larger scale and gives every variable a similar influence on the model, which can make the model more stable and more effective.

Let’s begin by loading a sample Pandas Dataframe that we’ll use throughout the tutorial.

Loading a Sample Pandas Dataframe

If you want to follow along with the tutorial line by line, copy the code below to create the sample dataframe.

We’ll load a dataframe that has three columns: age, weight, and height. Let’s see how we can do this in Python and Pandas:

import pandas as pd

df = pd.DataFrame.from_dict({
    'Age': [10, 35, 34, 23, 70, 55, 89],
    'Height': [130, 178, 155, 133, 195, 150, 205],
    'Weight': [80, 200, 220, 150, 140, 95, 180]
})

We can print the first five rows of our dataframe by using the print(df.head()) command. This will return the following dataframe:

   Age  Height  Weight
0   10     130      80
1   35     178     200
2   34     155     220
3   23     133     150
4   70     195     140
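
Before rescaling anything, it can help to see how different the column scales actually are. The sketch below rebuilds the same sample dataframe and summarizes each column’s minimum, maximum, and range. The summary variable name is only used for illustration here.

```python
import pandas as pd

# Same sample data as in the tutorial
df = pd.DataFrame({
    'Age': [10, 35, 34, 23, 70, 55, 89],
    'Height': [130, 178, 155, 133, 195, 150, 205],
    'Weight': [80, 200, 220, 150, 140, 95, 180]
})

# One row per column, with its smallest and largest value
summary = df.agg(['min', 'max']).T

# The range (max - min) shows how widely each column is spread
summary['range'] = summary['max'] - summary['min']

print(summary)
```

The Weight column spans a range of 140, while Age and Height span 79 and 75, which is the kind of mismatch that normalization is meant to remove.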

In the next section, you’ll learn what maximum absolute scaling is.

What is Maximum Absolute Scaling?

The maximum absolute scaling method rescales each feature to be a value between -1 and 1.

Each value is calculated using the formula below:

$$x_{scaled} = \frac{x}{\max(|x|)}$$

Each scaled value is calculated by dividing the value itself by the largest absolute value in the feature. Just because the scale can go from -1 to 1 doesn’t mean it will. In fact, both -1 and +1 will only appear when the feature contains both the negative and the positive version of its largest absolute value. Otherwise, only one of the two appears. Either way, at least one of -1 or +1 will exist (as long as the feature isn’t made up entirely of zeros).
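
To make the -1 and +1 behaviour concrete, here is a minimal sketch in plain Python. The max_abs_scale helper and the three small lists are made up for illustration and aren’t part of the tutorial’s dataset.

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the list."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]

# Only positive values: +1 appears, -1 does not
positive_only = max_abs_scale([10, 20, 40])
print(positive_only)   # [0.25, 0.5, 1.0]

# The largest absolute value is negative: -1 appears, +1 does not
negative_max = max_abs_scale([-40, 10, 20])
print(negative_max)    # [-1.0, 0.25, 0.5]

# Both -40 and 40 are present: both -1 and +1 appear
both_extremes = max_abs_scale([-40, 10, 40])
print(both_extremes)   # [-1.0, 0.25, 1.0]
```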

In the next section, you’ll learn how to normalize a Pandas column with maximum absolute scaling using Pandas.

Normalize a Pandas Column with Maximum Absolute Scaling using Pandas

Pandas makes it easy to normalize a column using maximum absolute scaling. For this process, we can use the .max() method and the .abs() method. To learn more about the absolute function and how to use it in Python, check out my in-depth post here.

Let’s see how we can develop a function that allows us to apply the maximum absolute scaling method to a column:

def absolute_maximum_scale(series):
    return series / series.abs().max()

for col in df.columns:
    df[col] = absolute_maximum_scale(df[col])

print(df)

# Returns:
#         Age    Height    Weight
# 0  0.112360  0.634146  0.363636
# 1  0.393258  0.868293  0.909091
# 2  0.382022  0.756098  1.000000
# 3  0.258427  0.648780  0.681818
# 4  0.786517  0.951220  0.636364
# 5  0.617978  0.731707  0.431818
# 6  1.000000  1.000000  0.818182

What we’ve done here is define a function that divides the series by the largest absolute value in the series. We then apply that function to every column in our dataframe.

The benefit here is that we can choose what columns to apply the function to, rather than immediately applying it to an entire dataframe, every single time.
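
As a sketch of that flexibility, the example below scales only two of the three columns. It assumes you want to leave Age untouched; the columns_to_scale list is a hypothetical choice made for illustration.

```python
import pandas as pd

def absolute_maximum_scale(series):
    # Divide each value by the largest absolute value in the column
    return series / series.abs().max()

df = pd.DataFrame({
    'Age': [10, 35, 34, 23, 70, 55, 89],
    'Height': [130, 178, 155, 133, 195, 150, 205],
    'Weight': [80, 200, 220, 150, 140, 95, 180]
})

# Hypothetical choice: scale Height and Weight, leave Age as it is
columns_to_scale = ['Height', 'Weight']

# Work on a copy so the original values stay available
scaled_df = df.copy()
for col in columns_to_scale:
    scaled_df[col] = absolute_maximum_scale(scaled_df[col])

print(scaled_df.head())
```

Working on a copy is a design choice rather than a requirement. The loop in the tutorial overwrites the original dataframe, which is fine when you no longer need the raw values.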

In the next section, you’ll learn how to use scikit-learn to apply maximum absolute scaling to a Pandas Dataframe.

Normalize a Pandas Column with Maximum Absolute Scaling using scikit-learn

In many cases involving machine learning, you’ll already be importing scikit-learn, the popular machine learning library. Because of this, you can choose to use the library to apply maximum absolute scaling to your Pandas Dataframe.

For this, we’ll use the MaxAbsScaler class to create a scaler object. We can then apply the .fit() method to allow scikit-learn to learn the parameter required for this (the maximum absolute value of each column). We then use that parameter to transform our data and normalize our Pandas Dataframe column using scikit-learn.

Let’s see how we can use Pandas and scikit-learn to accomplish this:

# Use scikit-learn to transform with maximum absolute scaling
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
scaler.fit(df)
scaled = scaler.transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)

print(scaled_df)

# Returns:
#         Age    Height    Weight
# 0  0.112360  0.634146  0.363636
# 1  0.393258  0.868293  0.909091
# 2  0.382022  0.756098  1.000000
# 3  0.258427  0.648780  0.681818
# 4  0.786517  0.951220  0.636364
# 5  0.617978  0.731707  0.431818
# 6  1.000000  1.000000  0.818182

Let’s break down what we’ve done here:

  1. We create a scaler object using the MaxAbsScaler() class
  2. We pass the dataframe into the .fit() method
  3. We then create a scaled matrix of data using the .transform() method
  4. Finally, we recreate a Pandas Dataframe using the DataFrame class
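
One reason to keep .fit() and .transform() as separate steps is that a fitted scaler can be reused on data it hasn’t seen. The sketch below inspects the learned max_abs_ attribute and then applies the same scaler to one new row. The values in new_rows are hypothetical and only there for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

df = pd.DataFrame({
    'Age': [10, 35, 34, 23, 70, 55, 89],
    'Height': [130, 178, 155, 133, 195, 150, 205],
    'Weight': [80, 200, 220, 150, 140, 95, 180]
})

scaler = MaxAbsScaler()
scaler.fit(df)

# The parameter learned by fit(): the largest absolute value per column
# (89 for Age, 205 for Height, 220 for Weight)
print(scaler.max_abs_)

# Hypothetical new observation, scaled with the parameters learned above
new_rows = pd.DataFrame({'Age': [45], 'Height': [164], 'Weight': [110]})
scaled_new = scaler.transform(new_rows)

print(scaled_new)
```

Here the new Weight of 110 is divided by 220, the maximum learned from the original data, rather than by a maximum recomputed from the new row.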

In the next section, you’ll learn about the min-max feature scaling method.

What is Min-Max Feature Scaling?

Min-max feature scaling, often simply referred to as normalization, rescales each feature to a range of 0 to 1. It’s calculated by subtracting the feature’s minimum value from each value and then dividing the result by the difference between the maximum and minimum values.

The formula looks like this:

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
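
To make the formula concrete, here is a small worked example in plain Python using the Age values from the sample dataframe. The variable names are chosen for illustration only.

```python
ages = [10, 35, 34, 23, 70, 55, 89]

x_min = min(ages)   # 10
x_max = max(ages)   # 89

# Subtract the minimum, then divide by the range (max - min)
scaled_ages = [(x - x_min) / (x_max - x_min) for x in ages]

# For the age of 35: (35 - 10) / (89 - 10) = 25 / 79
print(round(scaled_ages[1], 6))   # 0.316456
```

The smallest value (10) maps to 0 and the largest (89) maps to 1, with every other value falling in between.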

Pandas makes it quite easy to apply the normalization via the min-max feature scaling method.

In the next section, you’ll learn how to use Pandas to normalize a column.

Normalize a Pandas Column with Min-Max Feature Scaling using Pandas

To use Pandas to apply min-max scaling, or normalization, we can make use of the .max() and .min() methods. We can then apply a function using a vectorized format to significantly increase the efficiency of our operation.

Let’s see what this looks like in Pandas:

def min_max_scaling(series):
    return (series - series.min()) / (series.max() - series.min())

for col in df.columns:
    df[col] = min_max_scaling(df[col])

print(df.head())

# Returns:
#         Age    Height    Weight
# 0  0.000000  0.000000  0.000000
# 1  0.316456  0.640000  0.857143
# 2  0.303797  0.333333  1.000000
# 3  0.164557  0.040000  0.500000
# 4  0.759494  0.866667  0.428571

Let’s break down what we’ve done here:

  1. We defined our function to accept a series
  2. The function returns the formula defined above: the difference between the value and the minimum value, divided by the difference between the maximum and minimum values

In the example above, we loop over each column. While we could have defined our function to normalize the entire dataframe at once, we instead chose to normalize it column by column. This allows us to skip over columns that are not numerical and can’t use the same normalization method.
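
As a sketch of how skipping non-numerical columns might look, the example below uses select_dtypes() to pick out the numeric columns before looping. The Name column and the three-row dataframe are hypothetical and were added only to show a column that gets skipped.

```python
import pandas as pd

def min_max_scaling(series):
    return (series - series.min()) / (series.max() - series.min())

# Hypothetical dataframe with one non-numeric column
df = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cat'],
    'Age': [10, 35, 89],
    'Weight': [80, 200, 180]
})

# Keep only the names of columns that hold numbers
numeric_cols = df.select_dtypes(include='number').columns

scaled_df = df.copy()
for col in numeric_cols:
    scaled_df[col] = min_max_scaling(scaled_df[col])

print(scaled_df)
```

One caveat with this function: a column where every value is identical has a maximum equal to its minimum, so the division is by zero and the result is filled with missing values.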

In the next section, you’ll learn how to use sklearn to normalize a column using the min-max method.

Normalize a Pandas Column with Min-Max Feature Scaling using scikit-learn

The Python sklearn module also provides an easy way to normalize a column using the min-max scaling method. The sklearn library comes with a class, MinMaxScaler, which we can use to fit the data.

Let’s see how we can use the library to apply min-max normalization to a Pandas Dataframe:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# fit_transform() learns each column's min and max, then scales the data
scaled = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)

print(scaled_df)

# Returns:
#         Age    Height    Weight
# 0  0.000000  0.000000  0.000000
# 1  0.316456  0.640000  0.857143
# 2  0.303797  0.333333  1.000000
# 3  0.164557  0.040000  0.500000
# 4  0.759494  0.866667  0.428571
# 5  0.569620  0.266667  0.107143
# 6  1.000000  1.000000  0.714286

Similar to the maximum absolute scaling example, let’s explore what we’ve done here:

  1. We imported the MinMaxScaler class from sklearn.preprocessing
  2. We then created an instance of the class
  3. We then used the .fit_transform() method to fit the scaler to our data and transform it in a single step, so no separate .fit() call is needed
  4. Finally, we created a new dataframe from the data, passing in the original columns to recreate it
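
A fitted MinMaxScaler also remembers enough to undo the scaling. The sketch below is a minimal illustration, assuming you want to recover the original values later. It passes feature_range=(0, 1) explicitly (this is the default) and uses inverse_transform() to map the scaled values back.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'Age': [10, 35, 34, 23, 70, 55, 89],
    'Height': [130, 178, 155, 133, 195, 150, 205],
    'Weight': [80, 200, 220, 150, 140, 95, 180]
})

# feature_range sets the target interval; (0, 1) is the default
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(df)

# Map the scaled values back onto the original scale
restored = scaler.inverse_transform(scaled)
restored_df = pd.DataFrame(restored, columns=df.columns)

# Check that the round trip recovers the original data
print(np.allclose(restored_df, df))
```

Passing a different tuple, such as feature_range=(-1, 1), rescales the data to that interval instead.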

In the next section, you’ll learn what z-score scaling is and how to use it.

What is Z-Score Scaling?

The z-score method, often referred to as standardization, transforms the data into a distribution of values with a mean of 0 and a standard deviation of 1. Unlike the other two methods, the resulting values aren’t bounded to a range of 0 to 1 or -1 to 1.

Instead, each value is expressed as a number of standard deviations away from the mean. If the data is approximately normally distributed, about 99.7% of values will fall into the range of -3 through 3. Of course, you’ll have values that can extend beyond that, but they’ll be uncommon.

The way that this standardization is calculated is to use the following formula:

$$x_{std} = \frac{x - \mu}{\sigma}$$
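
As a worked example of the formula, the sketch below standardizes the Age values from the sample dataframe using only Python’s statistics module. It assumes the sample standard deviation (dividing by n - 1), which is also what Pandas uses by default.

```python
from statistics import mean, stdev

ages = [10, 35, 34, 23, 70, 55, 89]

mu = mean(ages)       # mean of the values
sigma = stdev(ages)   # sample standard deviation (divides by n - 1)

# Subtract the mean, then divide by the standard deviation
z_scores = [(x - mu) / sigma for x in ages]

print(round(z_scores[0], 6))   # -1.270474
```

The age of 10 sits about 1.27 standard deviations below the mean, and the standardized values as a whole average out to 0.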

In the next section, you’ll learn how to standardize a Pandas column using z-score scaling.

Standardize a Pandas Column with Z-Score Scaling using Pandas

In order to standardize a column in a Pandas Dataframe, we can make good use of the Pandas .mean() and .std() methods.

To learn more about calculating a mean of a Pandas Dataframe column, check out this tutorial here. To learn more about calculating a standard deviation in Python, check out my tutorial here, which includes everything from calculating it from scratch to using Pandas.

Let’s see how we can use Pandas to calculate a standardized dataframe with z-score scaling:

def z_score_standardization(series):
    return (series - series.mean()) / series.std()

for col in df.columns:
    df[col] = z_score_standardization(df[col])

print(df)

# Returns:
#         Age    Height    Weight
# 0 -1.270474 -1.141772 -1.384428
# 1 -0.366682  0.483802  0.918383
# 2 -0.402833 -0.295119  1.302185
# 3 -0.800502 -1.040174 -0.041122
# 4  0.898628  1.059526 -0.233023
# 5  0.356352 -0.464450 -1.096577
# 6  1.585510  1.398187  0.534581

Let’s explore what we’ve done here:

  1. We define a new function that accepts a series as its input
  2. We then return each value in the series minus the series’s mean, divided by the series’s standard deviation

Finally, we loop over every column in the dataframe and re-assign it to itself.
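
When every column is numeric, the loop isn’t strictly needed: Pandas aligns df.mean() and df.std() with the column names, so the same arithmetic can be applied to the whole dataframe in one expression. The sketch below shows this on the sample data, with standardized as an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [10, 35, 34, 23, 70, 55, 89],
    'Height': [130, 178, 155, 133, 195, 150, 205],
    'Weight': [80, 200, 220, 150, 140, 95, 180]
})

# df.mean() and df.std() return one value per column, and Pandas
# matches them to the columns of df by name
standardized = (df - df.mean()) / df.std()

print(standardized.round(6))
```

The column-by-column loop remains the better fit when the dataframe mixes numeric and non-numeric columns.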

Standardize a Pandas Column with Z-Score Scaling using scikit-learn

In this final section, you’ll learn how to use sklearn to standardize a Pandas column using z-score scaling. In order to do this, we use the StandardScaler class from the sklearn module.

Let’s see how we can use the library to apply z-score scaling to a Pandas Dataframe:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform() learns each column's mean and standard deviation, then scales the data
scaled = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled, columns=df.columns)

print(scaled_df)

# Returns:
#         Age    Height    Weight
# 0 -1.372269 -1.233255 -1.495353
# 1 -0.396061  0.522566  0.991967
# 2 -0.435110 -0.318765  1.406520
# 3 -0.864641 -1.123516 -0.044416
# 4  0.970629  1.144419 -0.251693
# 5  0.384905 -0.501663 -1.184438
# 6  1.712547  1.510215  0.577414

Let’s break down what we’ve done above:

  1. We instantiated the StandardScaler class
  2. We then used the .fit_transform() method to fit the scaler to the dataframe and scale it in a single step
  3. Finally, we recreated a dataframe out of the data, with the data z-score standardized
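
You may have noticed that these values differ slightly from the ones produced with Pandas in the previous section. The reason is the standard deviation: the Pandas .std() method defaults to the sample standard deviation (ddof=1), while StandardScaler uses the population standard deviation (ddof=0). The sketch below checks this by passing ddof=0 to Pandas and comparing the two results.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Age': [10, 35, 34, 23, 70, 55, 89],
    'Height': [130, 178, 155, 133, 195, 150, 205],
    'Weight': [80, 200, 220, 150, 140, 95, 180]
})

# scikit-learn: population standard deviation (divides by n)
sklearn_scaled = StandardScaler().fit_transform(df)

# Pandas with ddof=0: also divides by n
pandas_population = (df - df.mean()) / df.std(ddof=0)

# Pandas default (ddof=1): divides by n - 1
pandas_sample = (df - df.mean()) / df.std()

print(np.allclose(sklearn_scaled, pandas_population.to_numpy()))   # True
print(np.allclose(sklearn_scaled, pandas_sample.to_numpy()))       # False
```

With only seven rows the gap between the two is visible, but it shrinks as the number of rows grows.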

Conclusion

In this tutorial, you learned three methods of standardizing or normalizing data in Pandas, using either Pandas or sklearn. You learned how to apply the maximum absolute scaling method, the min-max feature scaling method, and the z-score standardization method.

To learn more about sklearn’s min-max normalization method, check out the official documentation found here.