Fuzzy Search

n-grams by the book with a twist

22.04.2023

Introduction

Fuzzy search, or approximate string matching, is a topic that is still not properly addressed in many online services. For example, open Twitter and search for your username with a typo. As of the date of this blog post, you will not get the desired result.

Information retrieval, and fuzzy search in particular, is of deep interest to me, and I was lucky to work on it for quite some time. There are straightforward concepts and algorithms that provide a framework for solving this problem efficiently. In this blog post I introduce you to the n-gram method by the book before I share my street knowledge and personal experience.

Fuzzy search by the book

Given a query string we want to match and retrieve strings that are approximately equal, i.e. similar. First of all, what does 'similar' mean?

Edit distances

Consider the following strings:

Alice, Alcie, Bob

As human readers, we immediately recognize that the first two strings are similar, while the third is rather different.

Fortunately, there is a straightforward way to formalize this similarity. The Levenshtein distance (\(d_{\textrm{lev}}\)) is the minimal number of character transformations required to transform one string into the other. The allowed transformations are:

  • insertion
  • deletion
  • substitution

For our example we get:

\[\begin{aligned}&d_{\textrm{lev}}(\textrm{Alice}, \textrm{Alcie}) = 2 \\ &d_{\textrm{lev}}(\textrm{Alice}, \textrm{Bob}) = 5\end{aligned}\]

The first distance is two because the minimal number of operations is achieved by:

  • substitute (replace) 'i' in Alice by 'c'
  • substitute the second 'c' in Alcce by 'i'

The distance between Alice and Bob is five, e.g.:

  • substitute 'A' by 'B'
  • substitute 'l' by 'o'
  • substitute 'i' by 'b'
  • delete 'c'
  • delete 'e'

The discerning reader may suggest that there is a simpler way to go from Alice to Alcie, which is a transposition of the characters 'i' and 'c'. This is indeed a good idea and leads us to the Damerau-Levenshtein distance (\(d_{\textrm{dam}}\)) with the following allowed transformations:

  • insertion
  • deletion
  • substitution
  • transposition of adjacent characters

With this definition we get:

\[d_{\textrm{dam}}(\textrm{Alice}, \textrm{Alcie}) = 1\]

An inspection of those items rejected because of spelling errors showed that over 80 percent fell into one of four classes of single error - one letter was wrong, or one letter was missing, or an extra letter had been inserted, or two adjacent characters had been transposed. These are the errors one would expect as a result of misreading, hitting a key twice, or letting the eye move faster than the hand. - Fred Damerau in "A Technique for Computer Detection and Correction of Spelling Errors", 1964

The Levenshtein distance and Damerau-Levenshtein distance allow us to compare two strings \(x\) and \(y\). The time complexity for computing these distances is \(\mathcal{O}(|x||y|)\), where \(|\cdot|\) denotes the string length. For small datasets we can compute the distances between a given query string and all other strings in the dataset. For large datasets this approach is not feasible performance-wise, which leads us to the next concept.
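To make the computation concrete, here is a minimal Python sketch of the textbook dynamic-programming algorithm; the function name is my own and the code favors clarity over speed:

```python
def levenshtein(x: str, y: str) -> int:
    # Textbook dynamic programming in O(|x| * |y|) time and O(|y|) space.
    # prev[j] holds the distance between the first i-1 characters of x
    # and the first j characters of y.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i] + [0] * len(y)
        for j, cy in enumerate(y, start=1):
            curr[j] = min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cx != cy),  # substitution (free if equal)
            )
        prev = curr
    return prev[len(y)]

assert levenshtein("Alice", "Alcie") == 2
assert levenshtein("Alice", "Bob") == 5
```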

n-gram index

Given many points \(P_i(x_i, y_i)\) in 2D space, how can we retrieve all points that are close to a given point \(P_0(x_0, y_0)\) in a fast way? There is a simple solution: we can insert all points into a 2D grid. For the query point \(P_0\) we can then compute its grid cell and fetch all points from this cell and all neighboring cells. This gives us a small set of candidates and we obtain the final matches by checking \(\sqrt{(x_j-x_0)^2+(y_j-y_0)^2}<d\) for all candidates \(j\) and a desired distance \(d\). This approach allows us to find all nearby points efficiently, without having to compare \(P_0\) to every point in the dataset.

Note that \(d\) has to be smaller than the length of a grid cell in order to get all the points that could match. But this is a detail; I guess you got the idea.

The data structure (in the example above the 2D grid), which allows for a fast query of candidates, is called an index. The general idea of an index is to discretize the feature space of your data objects in a clever way.
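As a toy illustration of the grid idea (the names and the cell size are made up for this sketch):

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+

CELL = 1.0                # side length of a grid cell
grid = defaultdict(list)  # (cell_x, cell_y) -> points in that cell

def insert(p):
    grid[(int(p[0] // CELL), int(p[1] // CELL))].append(p)

def near(p0, d):
    # Requires d <= CELL so that checking the 3x3 neighborhood is enough.
    cx, cy = int(p0[0] // CELL), int(p0[1] // CELL)
    candidates = (p for i in (-1, 0, 1) for j in (-1, 0, 1)
                  for p in grid[(cx + i, cy + j)])
    return [p for p in candidates if dist(p, p0) < d]
```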

Let's return to our example string Alice, which we normalize to lower case for the following discussion. An obvious approach to consider is to separate the string into individual characters:

a, l, i, c, e

By doing so we express which characters the string contains. The problem with this simple approach is that there are many words containing the character 'a', and the same holds for 'l', 'i', 'c' and 'e'. Consequently, this approach results in a very broad index that does not perform well.

What else can we do? We can separate the string into character sequences instead of single characters:

al, li, ic, ce

These are the so-called 2-grams of the string alice. They are obtained by sliding a window with a size of two characters across the string. If we choose a window size of three characters, we get the 3-grams of the string:

ali, lic, ice

The general concept is called n-grams, where n is a natural number. It turns out that for constructing an index the choice of \(n=3\), i.e. 3-grams, is the commonly adopted sweet spot. This seems reasonable, because on the one hand 3-grams are small enough such that strings are divided into many features, on the other hand 3-grams preserve significant information, as every character is connected to its preceding and succeeding character.

Before I present how the index looks, there is one additional trick that we need to explore. If we pad the string alice at the front and at the end with a special character, e.g. '$', we get $$alice$$ and the 3-grams are:

$$a, $al, ali, lic, ice, ce$, e$$

The first two 3-grams indicate that the string starts with "a" and "al", respectively. The last two 3-grams indicate that the string ends with "ce" and "e", respectively. The purpose of padding is two-fold. Firstly, the additional 3-grams contain valuable information and each string yields more 3-grams. Secondly, it's possible to compute 3-grams for single character strings as well.
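In Python, the padded 3-grams can be computed roughly like this (a sketch; the helper name is mine):

```python
def ngrams(s: str, n: int = 3, pad: str = "$") -> list[str]:
    # Lowercase, pad with n-1 special characters on each side, then slide
    # a window of size n across the padded string.
    padded = pad * (n - 1) + s.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngrams("Alice"))
# ['$$a', '$al', 'ali', 'lic', 'ice', 'ce$', 'e$$']
```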

Finally, I can present to you the shape of an n-gram index. It consists of a dictionary from n-grams to inverted lists. For every n-gram, all the strings that contain the given n-gram are stored. The lists are referred to as 'inverted' because the data entities occur in the values of the n-gram dictionary. To illustrate, consider a dataset of first names, where a 3-gram index may appear as follows:

$$a: Andrew, Ashley, Alice, ...
⋮ 
$$b: Barbara, Betty, Bob, ...
⋮
$al: Albert, Alexander, Alice, ...
⋮
ali: Alice, Natalie, ...
⋮
lic: Alice, ...
⋮
ce$: Alice, Lawrence, Joyce, ...
⋮
e$$: Alice, George, Michelle, ...

How this index can be used in the fuzzy search algorithm is discussed in the next section.

In practice, a list of ids is stored for every n-gram instead of a list of strings. This reduces the memory consumption and simplifies the processing of a query.
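A minimal sketch of such an index in Python, reusing the ngrams helper from above and storing ids instead of strings (names and structure are illustrative, not the actual implementation):

```python
from collections import defaultdict

def build_index(strings: list[str], n: int = 3) -> dict[str, list[int]]:
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        for gram in set(ngrams(s, n)):  # set(): index each n-gram once per string
            index[gram].append(string_id)
    return index

names = ["Alice", "Albert", "Natalie", "Bob"]
index = build_index(names)
print(index["ali"])  # -> [0, 2], the ids of Alice and Natalie
```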

I have briefly mentioned that the strings should be lowercased before indexing. In practice, additional normalization is required, such as replacing individual special characters and taking care of surrogate pairs. When you plan to work on string search, it is a good idea to familiarize yourself with character encoding in general, as well as with the specific encoding used in your programming language.

Fuzzy search algorithm

Given a query string \(x\) and an n-gram index, we want to find all matches \(y\) such that \(d_{\textrm{lev}}(x, y)\leq\delta\). First of all, we have to normalize the query string and split it into n-grams. Then, we use these n-grams to retrieve the corresponding inverted lists from the n-gram index. Once we have the lists, we combine them and calculate the frequency of each string by counting its occurrences across these lists. The frequency of a string corresponds to the number of n-grams it has in common with the query string.

This procedure gives us the number of common n-grams for all the strings that have at least one n-gram in common with the query string. The number of common n-grams is a proxy for the distance according to the following equation:

\[\textrm{comm}(x, y) \geq \max(|x|, |y|) - n + 1 - n \cdot \delta\]

This equation is easy to understand if you consider the following: A string \(s\) has \(|s|-n+1\) n-grams, and hence the longer string of \(x\) and \(y\) has \(\max(|x|, |y|) - n + 1\) n-grams. Moreover, any Levenshtein transformation (insertion, deletion or substitution) changes at most \(n\) n-grams. Consequently, \(\delta\) transformations affect at most \(n\cdot \delta\) n-grams. If you now imagine that the longer string is transformed into the shorter string with \(\delta\) transformations, you get the lower bound for the number of common n-grams expressed by the equation.

We can use the equation as a predicate and filter our candidate matches, which will greatly reduce their number. However, it's important to note that the remaining candidates may still have a distance larger than the desired \(\delta\), as the equation only considers the number of common n-grams. It is therefore necessary to compute the Levenshtein distance explicitly for the remaining candidates in order to obtain the final matches.
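Putting the pieces together, a rough Python sketch of the query side could look as follows; it reuses ngrams, build_index and levenshtein from the earlier snippets and, for simplicity, counts duplicate n-grams only once:

```python
from collections import Counter

def fuzzy_search(query: str, strings: list[str], index: dict[str, list[int]],
                 n: int = 3, delta: int = 1) -> list[str]:
    query_grams = set(ngrams(query, n))
    # Count the common n-grams per candidate by merging the inverted lists.
    common = Counter()
    for gram in query_grams:
        for string_id in index.get(gram, []):
            common[string_id] += 1
    matches = []
    for string_id, comm in common.items():
        candidate = strings[string_id]
        # Filter: comm(x, y) >= max(|x|, |y|) - n + 1 - n * delta
        if comm < max(len(query), len(candidate)) - n + 1 - n * delta:
            continue
        # Final check with the exact distance.
        if levenshtein(query.lower(), candidate.lower()) <= delta:
            matches.append(candidate)
    return matches

print(fuzzy_search("Alcie", names, index, delta=2))  # -> ['Alice']
```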

If you would like to learn more about fuzzy search and Information Retrieval in general I highly recommend checking out the excellent lectures by Prof. Hannah Bast.

Sorted n-grams

In this section I share with you personal improvement ideas and findings. I have put them into practice with great success but please note that they have not been reviewed and published scientifically.

The algorithm I have presented in the previous section is highly efficient at computing all strings that have a Levenshtein distance of \(\delta\) or less to the query string.

The Damerau-Levenshtein distance improves upon the Levenshtein distance by adding transpositions to the set of allowed transformations. Ideally, we would like to compute this distance instead of the Levenshtein distance.

Unfortunately, a transposition changes not just \(n\) n-grams, but \(n+1\). Consider the 3-grams of $$alice$$ and $$alcie$$:

$$a, $al, ali, lic, ice, ce$, e$$
$$a, $al, alc, lci, cie, ie$, e$$

Four 3-grams differ and there are only three 3-grams in common, which is less than fifty percent.

The algorithm from the previous section could, in principle, be used for the Damerau-Levenshtein distance by replacing \(n\cdot \delta\) with \((n+1)\cdot \delta\) in the filter equation. However, this will not be very efficient since the filter gets coarser.

Instead, is there something else we can tweak in order to account for transpositions? In particular, can we save some of the 3-grams of the Alice-Alcie example above?

The twist I came up with after staring at the 3-grams long enough is to sort the 3-grams. And by that I mean sort the characters within the 3-grams alphabetically. This gives us the following 3-grams for $$alice$$ and $$alcie$$:

$$a, $al, ail, cil, cei, $ce, $$e
$$a, $al, acl, cil, cei, $ei, $$e

By construction there are more 3-grams in common. In particular, the transposition in our example changed only two 3-grams and five of seven are unaffected. In the general case, a transposition combined with sorted n-grams changes at most two n-grams, since every n-gram that contains both transposed characters keeps its character multiset and hence its sorted form.
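In code, the twist is a one-liner on top of the ngrams helper from above (again a sketch, not the actual implementation):

```python
def sorted_ngrams(s: str, n: int = 3) -> list[str]:
    # Same n-grams as before, but with the characters inside each
    # n-gram sorted alphabetically.
    return ["".join(sorted(gram)) for gram in ngrams(s, n)]

print(sorted_ngrams("Alice"))  # ['$$a', '$al', 'ail', 'cil', 'cei', '$ce', '$$e']
print(sorted_ngrams("Alcie"))  # ['$$a', '$al', 'acl', 'cil', 'cei', '$ei', '$$e']
```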

With this approach, padding the strings with the same characters at the front and at the end is not ideal, as information is lost after sorting. This can be easily fixed if we use another special character for the padding at the end, e.g. '!'. Using this modification, the padded strings are $$alice!! and $$alcie!!, resulting in the following sorted 3-grams:

$$a, $al, ail, cil, cei, !ce, !!e
$$a, $al, acl, cil, cei, !ei, !!e

In the implementation linked in the beginning of this blog post, I use a slightly more sophisticated padding scheme. To put more weight on the correct beginning of a term, a single exclamation mark '!' is used for the padding at the end. Additionally, word boundaries in the form of spaces and equivalent characters are replaced by '!$$'. With this approach, the term Alice King becomes $$alice!$$king! after padding. Subsequently, n-grams that end with '$' are discarded and n-grams that don't contain '$' are sorted. The latter rule is based on the assumption that a transposition typo is unlikely to occur at the beginning of a word.
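A rough Python sketch of this scheme as described (not the actual implementation; the function name is mine and edge cases are ignored):

```python
def padded_sorted_3grams(term: str) -> list[str]:
    # '$$' in front, a single '!' at the end, word boundaries replaced by '!$$'.
    padded = "$$" + term.lower().replace(" ", "!$$") + "!"
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    # Discard 3-grams ending with '$'; sort only the 3-grams without '$'.
    grams = [g for g in grams if not g.endswith("$")]
    return ["".join(sorted(g)) if "$" not in g else g for g in grams]

print(padded_sorted_3grams("Alice King"))
# ['$$a', '$al', 'ail', 'cil', 'cei', '!ce', '$$k', '$ki', 'ikn', 'gin', '!gn']
```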

How does the fuzzy search algorithm change with sorted n-grams? Not much: the filter equation from the previous section still holds. The n-grams of the query string have to be sorted before processing and the n-grams of the data strings have to be sorted before indexing. Finally, we can use the Damerau-Levenshtein distance instead of the Levenshtein distance for computing the matches.

The downside of sorted n-grams is that the inverted lists are fewer and longer, which results in a coarser index. However, in practice, I have not observed this to cause performance issues.

Another adaptation I can recommend in combination with the sorted n-gram approach is to skip the filtering as well as the explicit distance calculation. Instead, directly calculate a quality for each candidate from the index, using the number of n-grams the candidate has in common with the query string as well as the n-gram counts of both strings. An obvious choice for this calculation is the Jaccard index. However, I found that the following formula is more effective:

\[q=\frac{\textrm{comm}(x, y)}{\max(\textrm{ngrams}(x),\textrm{ngrams}(y))}\]

In this approach, the intersection count is divided by the count of the larger set.
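A sketch of this score-only variant, reusing sorted_ngrams and an index built from sorted n-grams, together with a precomputed n-gram count per indexed string (all names are illustrative):

```python
from collections import Counter

def rank_candidates(query: str, index: dict[str, list[int]],
                    gram_counts: list[int], n: int = 3) -> list[tuple[int, float]]:
    query_grams = set(sorted_ngrams(query, n))
    common = Counter()
    for gram in query_grams:
        for string_id in index.get(gram, []):
            common[string_id] += 1
    # q = comm(x, y) / max(ngrams(x), ngrams(y)), best candidates first.
    scores = [(string_id, comm / max(len(query_grams), gram_counts[string_id]))
              for string_id, comm in common.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```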

Summary

In this blog post I have introduced you to fuzzy search by the book utilizing an n-gram index and the Levenshtein distance. The index provides candidate matches and the number of common n-grams between the candidates and the query string. The candidates can be filtered with an equation that connects the number of common n-grams with the distance, before the final matches are computed with the Levenshtein distance.

Additionally, I have provided insights into my own experiences and shared my adaptations for the classical algorithm. By sorting the characters within the n-grams, the impact of transpositions on the n-grams can be mitigated. This allows the usage of the Damerau-Levenshtein distance instead of the Levenshtein distance in the algorithm, yielding more accurate results.

Lastly, I have suggested that the sorted n-gram approach enables us to skip the filtering and the explicit distance calculation for the candidate matches. Instead, reasonable qualities can be computed by comparing the number of common n-grams of the query string and candidate strings from the index with their n-gram counts.

Join the discussion on GitHub.

Join the discussion on Hacker News.

Join the discussion on Reddit.

Thank you for reading. If you have any questions, thoughts, or feedback you'd like to share, please feel free to connect with me via mail. For more updates on upcoming blog posts and library releases, you may follow me on Twitter.

Happy Coding!
- Kevin