In this post, you’ll learn how to use the Levenshtein Distance to calculate the similarity between two different sequences of text. The Levenshtein Distance is a robust measure that can be used for many different applications, including natural language processing and spell-checking, to data cleaning, to even bioinformatics.
By the end of this tutorial, you’ll have learned the following:
- How to understand what the Levenshtein Distance does
- How to calculate the Levenshtein Distance in Python
- How to calculate the normalized Levenshtein Distance in Python
- How to apply what you have learned to fuzzy match strings in Python
This tutorial is action-focused. Rather than focusing on a from-scratch implementation, you’ll dive right into how to use Python libraries to calculate this important metric. Let’s get started!
Table of Contents
Understanding the Levenshtein Distance
The Levenshtein Distance, also known as the edit distance, is a fundamental measure in string comparison. It allows you to quantify the dissimilarity between two sequences. It does this by allowing you to quantify the number of single-character edits that are required to turn one string into another.
At its core, the Levenshtein distance algorithm transforms one string into another through a series of one-character edit operations. These operations include:
- Insertion: adding a character into the string
- Deletion: removing a character from a string
- Substitution: replacing a character with another
The edit distance between two strings is then calculated by the minimum number of operations required to convert one string into another. What is great about the edit distance is that it doesn’t rely on contextual understanding, allowing it to be applied to diverse fields, such as natural language processing and DNA sequencing.
The metric has some unique advantages over other distance measures, such as the Hamming distance, since the strings don’t need to be the same length.
An Example of Levenshtein Distance
Let’s calculate the Levenshtein Distance between two strings “kitten” and “sitting”.
- kitten (length: 6)
- sitting (length: 7)
To calculate the distance:
- Replace ‘k’ in “kitten” with ‘s’: “sitten” (1 operation)
- Replace ‘e’ in “sitten” with ‘i’: “sittin” (1 operation)
- Insert ‘g’ at the end of “sittin”: “sitting” (1 operation)
Therefore, the Levenshtein Distance between “kitten” and “sitting” is 3, representing the minimum number of single-character edits required to transform one string into the other.
Let’s now dive into how we can calculate the Levenshtein distance in Python.
How to Calculate the Levenshtein Distance in Python
The easiest way to calculate the Levenshtein distance in Python is to use the Levenshtein
library. Because this library isn’t part of the standard library, it first needs to be installed. This can be done using:
pip install Levenshtein
Note: the library was previously called python-Levenshtein
. The original is being maintained in parallel, but the correct one to install is just Levenshtein
.
Once the library is installed, we can import the distance
function using the code below:
from Levenshtein import distance
Let’s now take a look at how we can replicate our earlier example. We can take a look at our two strings and pass them into the distance()
function as arguments.
# Calculating the Levenshtein Distance in Python
str1 = 'kitten'
str2 = 'sitting'
dist = distance(str1, str2)
print(f'The Levenshtein distance is {dist}.')
# Returns: The Levenshtein distance is 3.
In the code block above, we declared two variables to hold our strings. We then passed these strings into our distance()
function, which, as expected, returned 3.
In this case, we were able to find the absolute edit distance. However, when you’re working with different strings, it can often be helpful to normalize these values. Let’s dive into this in the next section.
How to Calculate the Normalized Levenshtein Distance in Python
When dealing with the Levenshtein distance, it can often be helpful to normalize the distances. This is especially true when comparing strings of different lengths or when you want to compare the similarity across datasets.
The normalized Levenshtein distance scales the distance by the maximum possible edit distance, resulting in a value between 0 and 1. If the return value is 1, then the strings are identical. The value returned is actually 1 - normalized distance
, meaning the higher the value, the closer the strings are.
In order to normalize our distances, we can use the ratio()
function from the Levenshtein
library. Let’s take a look at how we can calculate our normalized edit distance between the two strings:
# Calculating the Normalized Levenshtein Distance
from Levenshtein import ratio
str1 = 'kitten'
str2 = 'sitting'
dist = ratio(str1, str2)
print(f'The normalized Levenshtein distance is {dist}.')
# Returns: The normalized Levenshtein distance is 0.6153846153846154.
We can see that the normalized edit distance is 0.61. This gives us an indication of how much of the original word needed to be modified in order to get to the new word.
Normalizing the distances is particularly helpful when you need to create a standardized measure of similarity or dissimilarity that is fair and consistent across different datsets or varying strength lengths. This is because the normalized values simplify the interpretation of string similarity, enabling efficient decision-making in various applications, such as spell-checking or fuzzy matching.
Applying the Levenshtein Distance: Fuzzy Matching
Let’s now take a look at an example of how we can apply the Levenshtein distance to a practical use case. Recently, I had to perform fuzzy matching for a set of names across databases. This allowed me to use the normalized Levenshtein distance to find the closest match.
For this tutorial, let’s take a look at a simpler example. We’ll create a list of fruit and find give a misspelled version for one of the fruits to see how we can apply the distance.
# Fuzzy Matching Strings in Python
word_list = ["apple", "banana", "orange", "grape", "kiwi", "strawberry", "blueberry", "pineapple"]
target_word = "kiwii"
distances = []
for word in word_list:
distances.append((word, ratio(word, target_word)))
distances.sort(key=lambda x: x[1], reverse=True)
print(distances[:3])
# Returns:
# [('kiwi', 0.8888888888888888), ('pineapple', 0.1428571428571429), ('strawberry', 0.1333333333333333)]
In the example above, we first loaded a list of strings and then defined a misspelled string to fuzzy-match against. In order to do this, we first defined a list to hold our matched values. We then looped over each fruit in our list, appending both the word and the ratio to that list.
From there, we sorted our list based on the ratio score in reverse order. Finally, we printed the first three values, showing the words with the highest score.
Unsurprisingly, we can see that 'kiwi'
is ranked very high, with all other words much lower in score. This allows us to be confident that it’s the closest match.
It can be helpful to set a cut-off value, or a minimum ratio. That way, you can have some level of certainty that the fuzzy matches are somewhat relevant to the word you’re looking for.
Conclusion
In this guide, you’ve delved into the Levenshtein Distance, a powerful tool for measuring string similarity in Python. You first learned how the Levenshtein Distance quantifies the difference between two sequences by counting single-character edits.
By leveraging Python’s Levenshtein
library, you learned how to calcualte the distance between strings, determining the minimum edits needed for transformation. You then learned how normalizing the Levenshtein Distance allows for standardized similarity comparisons across datasets with varying string lengths, aiding decision-making in spell-checking and other applications.
Finally, you explored a practical example of fuzzy matching showcased how the Levenshtein Distance can find close matches, even with misspelled words, demonstrating its real-world utility.
To learn more about the Levenshtein library in Python, check out the official documentation here.
To learn about related topics, check out these tutorials: