In this tutorial, you’ll learn how to calculate the Hamming distance in Python, using step-by-step examples. In machine learning, the Hamming distance represents the number of corresponding positions at which two vectors differ.
By the end of this tutorial, you’ll have learned:
- Common applications of the Hamming Distance in machine learning
- How to calculate the Hamming Distance with scipy
- How to calculate the Hamming Distance between binary and numerical arrays
- How to calculate the Hamming distance between string arrays
What is the Hamming Distance?
The Hamming distance counts the corresponding positions at which two vectors of equal length differ. Practically speaking, the greater the Hamming distance is, the more the two vectors differ. Conversely, the smaller the Hamming distance, the more similar the two vectors are.
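Mathematically, the Hamming distance between two vectors x and y of length n can be written as:
$$d(x, y) = \sum_{i=1}^{n} \mathbb{1}[x_i \neq y_i]$$
Here, the indicator term is 1 when the elements at position i differ and 0 when they match. If every pair of corresponding elements matches, then the distance is 0, which means that the arrays are exactly the same.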
Practically, the Hamming distance is often used to calculate the difference between two strings. In this case, the Hamming distance between two strings of the same length measures the number of positions at which the corresponding characters differ.
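To make the definition concrete, here is a minimal pure-Python sketch of the count described above. The function name hamming_distance is a hypothetical helper written for illustration, not part of any library:

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Sequences must have the same length.")
    # zip pairs up corresponding elements; each mismatch adds 1 to the sum
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("karolin", "kathrin"))        # 3
print(hamming_distance([1, 0, 1, 1], [1, 0, 1, 1]))  # 0
```

The same function works for strings and lists, because both are sequences that can be compared element by element.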
How is the Hamming Distance Used in Machine Learning?
The Hamming distance is often used in machine learning when comparing different strings or binary vectors. For example, the Hamming distance can be used to determine the degree to which two strings differ.
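As a rough illustration of this kind of comparison, the sketch below finds which stored binary vector is closest to a query vector. The data is made up for the example, and the counting is done with plain NumPy comparisons rather than a dedicated distance function:

```python
import numpy as np

# Hypothetical stored bit vectors, one per row
candidates = np.array([
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
])
query = np.array([1, 0, 1, 0, 0])

# Broadcasting compares the query against every row at once;
# summing the mismatches per row gives one Hamming count per candidate
counts = (candidates != query).sum(axis=1)
nearest = int(np.argmin(counts))

print(counts)   # [1 2 3]
print(nearest)  # 0
```

The row with the smallest count is the most similar to the query, which is the basic idea behind nearest-neighbor lookups on binary data.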
Similarly, the Hamming distance is often used with one-hot encoded data. One-hot encoded data is often represented using binary strings (or bit strings). Because one-hot encoded vectors for the same feature always have equal length, they make perfect candidates for using the Hamming distance to calculate differences between two points.
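Here is a small sketch of what that looks like for a single one-hot encoded feature. The category names are invented for the example, and np.eye() is used only as a quick way to build the encodings:

```python
import numpy as np

categories = ["red", "green", "blue"]         # hypothetical feature values
one_hot = np.eye(len(categories), dtype=int)  # row i encodes categories[i]

red, green, blue = one_hot

d_red_green = int(np.count_nonzero(red != green))
d_red_blue = int(np.count_nonzero(red != blue))
d_red_red = int(np.count_nonzero(red != red))

print(d_red_green)  # 2
print(d_red_blue)   # 2
print(d_red_red)    # 0
```

For a single feature, two different categories always differ in exactly two positions: the 1 moves from one slot to another. When several one-hot encoded features are concatenated, the count grows by 2 for each feature whose category differs.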
How to Calculate the Hamming Distance in Python with scipy
The Python scipy library comes with a function, hamming(), to calculate the Hamming distance between two vectors. This function is part of the scipy.spatial.distance module, which includes other helpful functions used to calculate distances.
Let’s start by looking at two lists of values to calculate the Hamming distance between them.
# Using scipy to Calculate the Hamming Distance
from scipy.spatial.distance import hamming
values1 = [10, 20, 30, 40]
values2 = [10, 20, 30, 50]
hamming_distance = hamming(values1, values2)
print(hamming_distance)
# Returns: 0.25
In this case, the function returned the value of 0.25. But how can we interpret this value? The function returns the proportion of values that are different: here, one of the four positions differs. In order to turn this into the number of items in the array that are different, you can simply multiply that value by the length of the list:
# Calculating the Hamming Distance not as a proportion
from scipy.spatial.distance import hamming
values1 = [10, 20, 30, 40]
values2 = [10, 20, 30, 50]
hamming_distance = hamming(values1, values2) * len(values1)
print(hamming_distance)
# Returns: 1.0
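If you want the count as an integer rather than a float, one alternative is to count the mismatches directly with NumPy. This is a sketch of a different route to the same number, not what hamming() does internally:

```python
import numpy as np

values1 = [10, 20, 30, 40]
values2 = [10, 20, 30, 50]

# Element-wise comparison gives a boolean array of mismatches
mismatches = np.asarray(values1) != np.asarray(values2)

count = int(np.count_nonzero(mismatches))  # number of differing positions
proportion = count / len(values1)          # same proportion scipy reports

print(count)       # 1
print(proportion)  # 0.25
```

Counting first and dividing afterwards avoids any floating-point rounding when you only need the whole-number count.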
In the next section, you’ll learn how to calculate the Hamming distance between two binary arrays.
How to Calculate the Hamming Distance between Binary Arrays in Python
Calculating the Hamming distance between two binary arrays works exactly the same as calculating the Hamming distance between numerical arrays.
An interesting note is that the Hamming distance doesn’t take into account how far apart items are, only whether they differ. Let’s take a look at an example that uses Python to calculate the Hamming distance between two binary arrays:
# Using scipy to calculate the Hamming distance
from scipy.spatial.distance import hamming
values1 = [1, 1, 0, 0, 1]
values2 = [0, 1, 0, 0, 0]
hamming_distance = hamming(values1, values2) * len(values1)
print(hamming_distance)
# Returns: 2.0
In this example, because the first and last elements differ, the Hamming distance is 2.
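The point about magnitude can be checked with a short sketch. The arrays below are made up for illustration, and the sum of absolute differences is included only as a contrast:

```python
from scipy.spatial.distance import hamming

base = [1, 1, 0, 0, 1]
near = [1, 1, 0, 0, 2]     # last element off by 1
far = [1, 1, 0, 0, 1000]   # last element off by 999

# Hamming counts only whether each position differs
hamming_near = int(round(hamming(base, near) * len(base)))
hamming_far = int(round(hamming(base, far) * len(base)))

# Sum of absolute differences, shown for contrast
absolute_near = sum(abs(a - b) for a, b in zip(base, near))
absolute_far = sum(abs(a - b) for a, b in zip(base, far))

print(hamming_near, hamming_far)    # 1 1
print(absolute_near, absolute_far)  # 1 999
```

Both comparisons have a Hamming distance of 1, even though the size of the difference in the last position is very different.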
How to Calculate the Hamming Distance between String Arrays in Python
A common use case for the Hamming distance is to calculate the difference between strings. The function expects two array-like structures, which means that any strings that we want to compare need to be converted into arrays.
This can be done using the list() function, which will convert the string passed in into a list of characters. Let’s compare two strings to see how different they are:
# Calculating the Hamming Distance between 2 Strings
from scipy.spatial.distance import hamming
string1 = 'datagy'
string2 = 'tatagi'
hamming_distance = hamming(list(string1), list(string2)) * len(string1)
print(hamming_distance)
# Returns: 2.0
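To see where that result of 2.0 comes from, you can list the positions that differ. This is a small pure-Python sketch using the same two strings:

```python
string1 = 'datagy'
string2 = 'tatagi'

# Collect (index, char1, char2) for every position where the characters differ
differing = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(string1, string2))
    if a != b
]

print(differing)       # [(0, 'd', 't'), (5, 'y', 'i')]
print(len(differing))  # 2
```

The first and last characters differ, which matches the count of 2.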
One thing to keep in mind is that the arrays must be of equal length, because elements can only be compared position by position. If we try to compare the strings datagy and tadag, the function raises a ValueError:
from scipy.spatial.distance import hamming
string1 = 'datagy'
string2 = 'tadag'
hamming_distance = hamming(list(string1), list(string2)) * len(string1)
print(hamming_distance)
# Raises: ValueError: The 1d arrays must have equal lengths.
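If your inputs may have different lengths, one option is to catch the error and handle it yourself. The helper below, hamming_count_or_none, is a hypothetical name used for illustration. It assumes non-empty inputs and treats any ValueError from hamming() as a length mismatch:

```python
from scipy.spatial.distance import hamming


def hamming_count_or_none(a, b):
    """Return the number of differing positions, or None if the comparison fails."""
    try:
        proportion = hamming(list(a), list(b))
    except ValueError:
        # Raised when the two inputs do not have the same length
        return None
    # Convert the proportion back into a whole-number count
    return int(round(proportion * len(a)))


print(hamming_count_or_none('datagy', 'tatagi'))  # 2
print(hamming_count_or_none('datagy', 'tadag'))   # None
```

Whether to return None, pad the shorter string, or let the error propagate depends on what a length mismatch means for your data.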
Conclusion
In this tutorial, you learned how to calculate the Hamming distance in Python. The Hamming distance compares two strings or arrays of equal length to see how many elements differ pair-wise. You learned that the Hamming distance is often used in machine learning to compare strings, as well as to compare one-hot encoded arrays.
Finally, you learned how to use the scipy library to calculate the Hamming distance. You learned how to do this using numerical arrays, binary arrays, and strings.