In this tutorial, you’ll learn how normalize NumPy arrays, including multi-dimensional arrays. Normalization is an important skill for any data analyst or data scientist.
Normalization refers to the process of scaling data within a specific range or distribution to make it more suitable for analysis and model training. This is an important and common preprocessing step that is used commonly in machine learning. This can be especially helpful when working with distance-based machine learning models, such as the K-Nearest Neighbor algorithm.
By the end of this tutorial, you’ll have learned:
- How to use NumPy functions to normalize an array, including mix-max scaling, z-score normalization, and L2 normalization
- How to normalize multi-dimensional arrays in NumPy
- How to use different normalization techniques in NumPy
Table of Contents
Understanding Why Normalization Matters
Normalization is an important step in preprocessing data for data analysis, machine learning, and deep learning. By normalizing your data, you’re converting it into a standardized format to ensure that it is more suitable for analysis and model training.
In this tutorial, we’ll explore three main normalization techniques:
- Min-Max Scaling, which scales data between a range of 0 to 1,
- Z-Score Normalization, which converts a normal distribution to a mean of 0 and a standard deviation of 1, and
- L2 Normalization, which converts our data into unit vectors with magnitudes equal to 1
Normalization allows you to preprocess your data in meaningful ways and is essential for many different machine-learning algorithms. When dealing with data on different scales, distance-based algorithms will have significantly better performance when you normalize and scale your data.
For example, when predicting how prices, the number of rooms, and the square footage of a house will be on very different scales, which could lead to performance issues if data aren’t normalized.
How to Use Min-Max Scaling to Normalize a Vector in NumPy
Min-max scaling is one of the simplest and most commonly used normalization techniques. This technique scales data to a specific range, generally from [0, 1) (meaning that data include 0 and go up to, but don’t include, 1).
The min-max scaling method is useful when you want to preserve the relationship between data points while ensuring that all features are within a consistent range.
Let’s take a look at the formula for the min-max scaling technique:
X_normalized = (X - X_min) / (X_max - X_min)
We can implement this easily in NumPy, especially given that NumPy allows array-wise transformations. Let’s see how to implement the min-max scaling technique in NumPy:
# Implementing Min-Max Scaling in NumPy
import numpy as np
# Sample data
data = np.array([1, 2, 3, 4, 5], dtype=float)
# Min-Max scaling
min_val = np.min(data)
max_val = np.max(data)
scaled_data = (data - min_val) / (max_val - min_val)
print("Original Data:", data)
print("Min-Max Scaled Data:", scaled_data)
# Returns:
# Original Data: [1. 2. 3. 4. 5.]
# Min-Max Scaled Data: [0. 0.25 0.5 0.75 1. ]
In the example above, we first defined an array of data. We then calculated the minimum and maximum values using the np.min()
and np.max()
functions, respectively. We then created a new array of data by applying the formula for min-max scaling. Because NumPy arrays can be modified array-wise, we didn’t have to loop over each value.
In the following section, let’s see how we can use NumPy to apply z-score normalization.
How to Use Z-Score Normalization in NumPy
Z-score standardization is used to transform data to have a mean of 0 and a standard deviation of 1. This technique is also known zero-mean normalization. This technique is most useful when dealing with data or algorithms that assume a normal (or Gaussian) distribution of data.
This method is particularly helpful when you want to center the data around zero and scale it by ensuring that it has unit variance.
Let’s take a look at the formula for z-score normalization:
X_normalized = (X - X_mean) / X_std
We can see that each data point has the mean subtracted from it and is divided by the standard deviation of the data. Because NumPy allows us to apply array-wise transformations on our data, we can replicate this function easily on our arrays. Let’s see how we can accomplish this in NumPy:
# Implementing Z-score Normalization in NumPy
import numpy as np
# Sample data
data = np.array([1, 2, 3, 4, 5], dtype=float)
# Z-score standardization
mean = np.mean(data)
std_dev = np.std(data)
standardized_data = (data - mean) / std_dev
print("Original Data:", data)
print("Z-Score Standardized Data:", standardized_data)
# Returns:
# Original Data: [1. 2. 3. 4. 5.]
# Z-Score Standardized Data: [-1.41421356 -0.70710678 0. 0.70710678 1.41421356]
In the code block above, we calculated our necessary values, meaning the mean and the standard deviation. We were then able to normalize our data using z-score normalization by applying the formula to the entire array. We can see that the mean of the data has been replaced with 0 (formerly 3) and that we have a standard deviation of 1.
Let’s now take a look at our final method of normalization: L2 normalization in NumPy.
How to Use L2 Normalization in NumPy
L2 normalization is a technique that converts each data point into a unit vector (that is, a vector with a magnitude of 1). For L2 normalization, we first calculate the L2 norm of the data and divide each element of the data by this norm to transform it into a unit vector.
This technique is particularly useful for text analysis and other machine-learning algorithms that rely on vector representations of data.
In order to use L2 normalization in NumPy, we can first calculate the L2 norm of the data and then divide each data point by this norm. NumPy comes bundled with a function to calculate the L2 norm, the np.linalg.norm()
function. The function takes an array of data and calculates the norm.
Let’s see how we can apply L2 normalization on our array of data in NumPy:
# Normalizing Data Using L2 Normalization in NumPy
import numpy as np
# Sample data (a row vector)
data = np.array([1, 2, 3, 4, 5], dtype=float)
# L2 (unit vector) normalization
norm = np.linalg.norm(data) # Calculate the L2 norm of the data
normalized_data = data / norm
print("Original Data:", data)
print("L2 Normalized Data:", normalized_data)
# Returns:
# Original Data: [1. 2. 3. 4. 5.]
# L2 Normalized Data: [0.13483997 0.26967994 0.40451992 0.53935989 0.67419986]
In the code block above, we first calculated the L2 norm of our data. Then, we divided each data point in the array by the norm. This normalized our array using L2 normalization.
L2 normalization is a powerful technique to ensure that data vectors have the same length while preserving their direction. This is an important normalization technique when consistency in vector magnitudes is essential, while preserving their directions.
Conclusion
In this tutorial, we’ve explored the importance of data normalization and delved into three essential normalization techniques in NumPy: Min-Max scaling, Z-score standardization, and L2 normalization. As a data analyst or data scientist, understanding and applying these techniques is vital for preprocessing your data, making it more amenable to analysis and machine learning tasks.
Additional Resources
To learn more about related topics, check out the tutorials below: