Check out our new migration tool - Move your Postgres data to Neon by pasting your connection URL, no manual steps

The fuzzystrmatch extension

new

Perform fuzzy string matching for names, typos, and similar-sounding words in Postgres

The fuzzystrmatch extension for Postgres provides a suite of functions to determine similarities and distances between strings. This is useful for applications that need to handle misspellings, phonetic variations, or simply find "close enough" matches in text data. Whether you're building a search engine, cleaning customer data, or trying to identify duplicate records, fuzzystrmatch offers powerful tools to compare strings beyond exact equality.

Imagine a user searching for "John Doe" but typing "Jon Dow", or needing to match "Smith" with "Smythe". fuzzystrmatch provides algorithms like Soundex, Levenshtein distance, Metaphone, and Daitch-Mokotoff Soundex to tackle these challenges.

Try it on Neon!

Neon is Serverless Postgres built for the cloud. Explore Postgres features and functions in our user-friendly SQL editor. Sign up for a free account to get started.

Sign Up

Enable the fuzzystrmatch extension

You can enable the extension by running the following CREATE EXTENSION statement in the Neon SQL Editor or from a client such as psql that is connected to your Neon database.

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

Version availability:

Please refer to the list of all extensions available in Neon for up-to-date extension version information.

Core functions and usage

fuzzystrmatch offers several algorithms, each with its strengths for different types of string comparisons.

1. Soundex

The Soundex system is a phonetic algorithm for indexing names by sound, as pronounced in English. It converts a string into a four-character code, where the first character is the first letter of the string, and the remaining three digits encode the consonants.

Functions:

  • soundex(text) returns text: Computes the Soundex code of a string.
  • difference(text, text) returns int: Computes the difference between the Soundex codes of two strings. The result ranges from 0 (no match) to 4 (exact match on Soundex codes).

Examples:

The soundex function generates the phonetic code, and difference measures how similar these codes are. For instance, names that sound similar often share Soundex codes or have very similar ones:

  • Pairs like ("Smith"/"Smythe") and ("John"/"Jon") yield the same Soundex codes (S530 and J500 respectively), indicating they sound very similar. The difference function confirms this with a score of 4 (an exact match on the Soundex code).
  • Similarly, ("Robert"/"Rupert") both produce the Soundex code R163 and thus also have a difference score of 4.
  • In contrast, a pair like ("Anne"/"Andrew") yields different Soundex codes (A500 vs A536) and a difference score of 2, reflecting a lesser degree of phonetic similarity according to Soundex.

Let's see these in action with SQL:

SELECT soundex('Smith'), soundex('Smythe');
-- s530, s530

SELECT difference('Smith', 'Smythe');
-- 4

SELECT soundex('John'), soundex('Jon');
-- J500, J500

SELECT difference('John', 'Jon');
-- 4

SELECT soundex('Robert'), soundex('Rupert');
-- R163, R163

SELECT difference('Anne', 'Andrew');
-- 2 (A500 vs A536)

Use case: Useful for matching English names that sound similar but are spelled differently. Note that Soundex is not very effective for non-English names.

2. Levenshtein distance

The Levenshtein distance measures the similarity between two strings by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. A lower distance indicates greater similarity.

Functions:

  • levenshtein(source text, target text) returns int: Calculates Levenshtein distance with a cost of 1 for each insertion, deletion, or substitution.
  • levenshtein(source text, target text, ins_cost int, del_cost int, sub_cost int) returns int: Calculates Levenshtein distance with specified costs for operations.
  • levenshtein_less_equal(source text, target text, max_d int) returns int: An accelerated version. If the actual distance is less than or equal to max_d, it returns the correct distance; otherwise, it returns a value greater than max_d. This is faster if you only care about small distances.
  • levenshtein_less_equal(source text, target text, ins_cost int, del_cost int, sub_cost int, max_d int) returns int: Accelerated version with custom costs.

Both source and target strings can be any non-null strings and be up to 255 characters long.

Examples:

The Levenshtein distance quantifies the "edit effort" between strings. Consider these transformations:

  1. To change "kitten" to "sitting":

    • Substitute 'k' with 's' (kitten -> sitten)
    • Substitute 'e' with 'i' (sitten -> sittin)
    • Insert 'g' at the end (sittin -> sitting)

    This requires 3 edits, so the Levenshtein distance is 3.

  2. To change "apple" to "apply":

    • Substitute 'e' with 'y' (apple -> apply)

    This is 1 edit, giving a distance of 1.

  3. If comparing "book" and "back":

    • Substitute 'o' with 'a' (book -> baok)
    • Substitute 'o' with 'c' (baok -> back) This requires 2 edits, resulting in a distance of 2.

The function can also take custom costs for insertion, deletion, and substitution, which can be useful for domain-specific needs.

Let's see these in action with SQL:

-- kitten to sitting (default costs: 1 for ins, del, sub)
SELECT levenshtein('kitten', 'sitting');
-- 3

-- apple to apply
SELECT levenshtein('apple', 'apply');
-- 1

-- book to back
SELECT levenshtein('book', 'back');
-- 2

-- Levenshtein distance is case-sensitive
SELECT levenshtein('book', 'Book');
-- 1

-- Example with custom costs: 1 for ins, 2 for del, 3 for sub
SELECT levenshtein('book', 'back', 1, 2, 3);
-- 6
-- Few possible paths of minimum cost:
-- 2 substitutions(o -> a, o -> c): cost = 2 x 3 (for substitution) = 6)
-- delete 2 o's and insert a c: cost = 2 x 2 (for deletion) + 2 x 1 (for insertion) = 6

-- Using levenshtein_less_equal for efficiency when only small distances matter
SELECT levenshtein_less_equal('banana', 'bandana', 1);
-- Returns 1 (correct, as only one insertion 'd' is needed)

SELECT levenshtein_less_equal('longstringexample', 'short', 2);
-- Returns a value > 2 (actual distance is much higher, so it stops early)

Use case: Excellent for general typo correction, finding strings with minor differences, and when character-level edit distance is important. Works well with various languages, including those with multibyte encodings. Remember that Levenshtein is case-sensitive.

3. Metaphone and double metaphone

Metaphone algorithms, like Soundex, generate phonetic codes for strings. They are generally more accurate than Soundex for English words. Double metaphone provides primary and alternate encodings, offering better support for non-English words.

Functions:

  • metaphone(text, max_output_length int) returns text: Computes the metaphone code for a string, up to a specified maximum length.
  • dmetaphone(text) returns text: Computes the primary Double metaphone code.
  • dmetaphone_alt(text) returns text: Computes the alternate Double metaphone code (returns NULL if no alternate exists).

Examples:

SELECT metaphone('Michael', 8);
-- MXL

SELECT metaphone('algorithm', 10);
-- ALKR0M

SELECT dmetaphone('Smith'), dmetaphone_alt('Smith');
-- SM0, XMT

SELECT dmetaphone('Schmidt');
-- XMT

SELECT dmetaphone_alt('Schmidt');
-- SMT

-- Primary and alternate for a name with multiple pronunciations
SELECT dmetaphone('Joan'), dmetaphone_alt('Joan'); -- Spanish 'Joan Miró'
-- JN, AN

Use case: Good for matching English words phonetically. Double Metaphone is an improvement, especially with its alternate codes for handling variations in pronunciation and non-English names.

4. Daitch-Mokotoff Soundex

Daitch-Mokotoff (DM) Soundex is another phonetic algorithm, significantly more useful for non-English names than the original Soundex.

Key improvements over original Soundex:

  • Codes are based on the first six meaningful letters (not four).
  • Maps letters/combinations to ten possible codes (not seven).
  • Multiple codes can be emitted if a letter/combination has different sounds.

Function:

daitch_mokotoff(source text) returns text[]: Generates an array of Daitch-Mokotoff Soundex codes for the input string. The result is an array because a name can have multiple plausible pronunciations.

DM codes are 6 digits long. source should preferably be a single word or name.

Examples:

SELECT daitch_mokotoff('George');
-- {595000}

SELECT daitch_mokotoff('John');
-- {160000,460000}  (Reflects 'J' vs 'Y' sound possibilities)

SELECT daitch_mokotoff('Bierschbach');
-- {794575,794574,794750,794740,745750,745740,747500,747400}

Matching Daitch-Mokotoff codes: Since daitch_mokotoff returns an array, you can use the array overlap operator && for matching:

CREATE TABLE surnames (name TEXT);
INSERT INTO surnames VALUES ('Peterson'), ('Petersen'), ('Pietersen');

SELECT name FROM surnames
WHERE daitch_mokotoff(name) && daitch_mokotoff('Petterson');
name
-----------
 Peterson
 Petersen
 Pietersen
(3 rows)

Use case: Best for phonetic matching of European names, particularly when Soundex is insufficient. Works with multibyte encodings.

Practical usage scenarios

Let's see how these functions can be applied in common scenarios.

Finding misspelled names in a customer database

Suppose you have a customers table and want to find customers whose names are similar to "Jon Smithe".

CREATE TABLE customers (id INT, name TEXT);
INSERT INTO customers VALUES
  (1, 'John Smith'),
  (2, 'Jon Smythe'),
  (3, 'Jane Doe'),
  (4, 'Jonathan Smithson');

-- Using Levenshtein distance
SELECT * FROM customers
WHERE levenshtein(lower(name), lower('Jon Smithe')) <= 3;
id |      name
----+-----------------
  1 | John Smith
  2 | Jon Smythe
(2 rows)
-- Using Soundex difference
SELECT * FROM customers
WHERE difference(name, 'Jon Smithe') >= 3;
id |    name
----+------------
  1 | John Smith
  2 | Jon Smythe
  4 | Jonathan Smithson
(3 rows)

Typo correction in search input

It can be useful to suggest corrections for user input. For example, if a user types "Portgeasql", you can suggest "PostgreSQL":

WITH potential_matches AS (
  SELECT 'PostgreSQL' AS term
  UNION ALL SELECT 'MySQL'
  UNION ALL SELECT 'SQLite'
)
SELECT term, levenshtein(lower(term), 'portgeasql') AS distance
FROM potential_matches
ORDER BY distance
LIMIT 1;
term       | distance
-----------+----------
PostgreSQL |    3
(1 row)

Let's say the user types "sequelite" instead of "SQLite":

WITH potential_matches AS (
  SELECT 'PostgreSQL' AS term
  UNION ALL SELECT 'MySQL'
  UNION ALL SELECT 'SQLite'
)
SELECT term, levenshtein(lower(term), 'sequelite') AS distance
FROM potential_matches
ORDER BY distance
LIMIT 1;
term       | distance
-----------+----------
 SQLite    |        3
(1 row)

The values can then be used to suggest corrections or alternatives to the user.

Limitations and considerations

  • Multibyte encodings: soundex, metaphone, dmetaphone, and dmetaphone_alt are not reliable for UTF-8 or other multibyte encodings. Use daitch_mokotoff or levenshtein for such cases.
  • Phonetic nuances: Phonetic algorithms simplify pronunciation. They might not always align perfectly with intended pronunciations or capture all linguistic subtleties, potentially leading to false positives or negatives.
  • Computational cost: Levenshtein distance can be resource-intensive on large datasets without levenshtein_less_equal or proper indexing strategies.

Conclusion

The fuzzystrmatch extension is useful for tackling the common problem of inexact string matching. By understanding the strengths and weaknesses of functions like soundex, levenshtein, metaphone, dmetaphone, and daitch_mokotoff, you can enhance your application's ability to handle typos, phonetic variations, and find similar text data effectively. Always consider the nature of your data (especially language and encoding) and your specific matching requirements when choosing the right function.

Resources

Need help?

Join our Discord Server to ask questions or see what others are doing with Neon. Users on paid plans can open a support ticket from the console. For more details, see Getting Support.

Last updated on

Was this page helpful?