This rapidfuzz
extension adds high-performance fuzzy string matching and string edit distance functions to DuckDB, powered by the RapidFuzz C++ library.
rapidfuzz
is a DuckDB Community Extension.
You can use it in DuckDB SQL:
install rapidfuzz from community;
load rapidfuzz;
Fuzzy string matching allows you to compare strings and measure their similarity, even when they are not exactly the same. This is useful for:
- Data cleaning and deduplication
- Record linkage
- Search and autocomplete
- Spell checking
RapidFuzz provides fast, high-quality algorithms for string similarity and matching.
This extension exposes several core RapidFuzz algorithms as DuckDB scalar functions:
- Returns:
DOUBLE
(similarity score between 0 and 100) - Description: Computes the similarity ratio between two strings.
SELECT rapidfuzz_ratio('hello world', 'helo wrld');
┌─────────────────────────────────────────────┐
│ rapidfuzz_ratio('hello world', 'helo wrld') │
│ double │
├─────────────────────────────────────────────┤
│ 90.0 │
└─────────────────────────────────────────────┘
- Returns:
DOUBLE
- Description: Computes the best partial similarity score between substrings of the two inputs.
SELECT rapidfuzz_partial_ratio('hello world', 'world');
┌─────────────────────────────────────────────────┐
│ rapidfuzz_partial_ratio('hello world', 'world') │
│ double │
├─────────────────────────────────────────────────┤
│ 100.0 │
└─────────────────────────────────────────────────┘
- Returns:
DOUBLE
- Description: Compares strings after sorting their tokens (words), useful for matching strings with reordered words.
SELECT rapidfuzz_token_sort_ratio('world hello', 'hello world');
┌──────────────────────────────────────────────────────────┐
│ rapidfuzz_token_sort_ratio('world hello', 'hello world') │
│ double │
├──────────────────────────────────────────────────────────┤
│ 100.0 │
└──────────────────────────────────────────────────────────┘
- Returns:
DOUBLE
- Description: A similarity metric that compares sets of tokens between two strings, ignoring duplicated words and word order.
SELECT rapidfuzz_token_set_ratio('new york new york city', 'new york city');
┌──────────────────────────────────────────────────────────────────────┐
│ rapidfuzz_token_set_ratio('new york new york city', 'new york city') │
│ double │
├──────────────────────────────────────────────────────────────────────┤
│ 100.0 │
└──────────────────────────────────────────────────────────────────────┘
- Returns:
DOUBLE
(similarity score between 0 and 100) - Description: Computes the similarity ratio between two strings.
SELECT rapidfuzz_ratio('hello world', 'helo wrld');
┌─────────────────────────────────────────────┐
│ rapidfuzz_ratio('hello world', 'helo wrld') │
│ double │
├─────────────────────────────────────────────┤
│ 90.0 │
└─────────────────────────────────────────────┘
- Returns:
DOUBLE
- Description: Computes the best partial similarity score between substrings of the two inputs.
SELECT rapidfuzz_partial_ratio('hello world', 'world');
┌─────────────────────────────────────────────────┐
│ rapidfuzz_partial_ratio('hello world', 'world') │
│ double │
├─────────────────────────────────────────────────┤
│ 100.0 │
└─────────────────────────────────────────────────┘
- Returns:
DOUBLE
- Description: Compares strings after sorting their tokens (words), useful for matching strings with reordered words.
SELECT rapidfuzz_token_sort_ratio('world hello', 'hello world');
┌──────────────────────────────────────────────────────────┐
│ rapidfuzz_token_sort_ratio('world hello', 'hello world') │
│ double │
├──────────────────────────────────────────────────────────┤
│ 100.0 │
└──────────────────────────────────────────────────────────┘
- Returns:
DOUBLE
- Description: A similarity metric that compares sets of tokens between two strings, ignoring duplicated words and word order.
SELECT rapidfuzz_token_set_ratio('new york new york city', 'new york city');
┌──────────────────────────────────────────────────────────────────────┐
│ rapidfuzz_token_set_ratio('new york new york city', 'new york city') │
│ double │
├──────────────────────────────────────────────────────────────────────┤
│ 100.0 │
└──────────────────────────────────────────────────────────────────────┘
In addition to the main functions above, the extension provides a wide range of distance, similarity, and normalized functions for various algorithms. For each algorithm, the following function variants are available:
<algorithm>_distance(a, b)
<algorithm>_similarity(a, b)
<algorithm>_normalized_distance(a, b)
<algorithm>_normalized_similarity(a, b)
All functions take two VARCHAR
arguments and return a DOUBLE
..
- Jaro: Measures similarity based on the number and order of matching characters. Good for short strings and typos.
- Jaro-Winkler: Extension of Jaro that gives more weight to common prefixes. Useful for short strings, names, and typos.
- Hamming: Counts the number of differing characters at the same positions. Only defined for strings of equal length.
- Indel: Measures the minimum number of insertions and deletions to transform one string into another (no substitutions).
- Prefix: Measures the edit distance/similarity considering only prefixes of the strings.
- Postfix: Measures the edit distance/similarity considering only postfixes (suffixes) of the strings.
- OSA (Optimal String Alignment): Like Levenshtein, but allows for transpositions of adjacent characters (each substring can be edited only once).
- LCS Sequence (Longest Common Subsequence): Measures similarity based on the length of the longest common subsequence (not necessarily contiguous).
For each algorithm below, the following functions are available:
rapidfuzz_<algorithm>_distance(a, b)
rapidfuzz_<algorithm>_similarity(a, b)
rapidfuzz_<algorithm>_normalized_distance(a, b)
rapidfuzz_<algorithm>_normalized_similarity(a, b)
SELECT rapidfuzz_jaro_distance('duck', 'duke');
SELECT rapidfuzz_jaro_similarity('duck', 'duke');
SELECT rapidfuzz_jaro_normalized_distance('duck', 'duke');
SELECT rapidfuzz_jaro_normalized_similarity('duck', 'duke');
SELECT rapidfuzz_jaro_winkler_distance('duck', 'duke');
SELECT rapidfuzz_jaro_winkler_similarity('duck', 'duke');
SELECT rapidfuzz_jaro_winkler_normalized_distance('duck', 'duke');
SELECT rapidfuzz_jaro_winkler_normalized_similarity('duck', 'duke');
SELECT rapidfuzz_hamming_distance('karolin', 'kathrin');
SELECT rapidfuzz_hamming_similarity('karolin', 'kathrin');
SELECT rapidfuzz_hamming_normalized_distance('karolin', 'kathrin');
SELECT rapidfuzz_hamming_normalized_similarity('karolin', 'kathrin');
SELECT rapidfuzz_indel_distance('kitten', 'sitting');
SELECT rapidfuzz_indel_similarity('kitten', 'sitting');
SELECT rapidfuzz_indel_normalized_distance('kitten', 'sitting');
SELECT rapidfuzz_indel_normalized_similarity('kitten', 'sitting');
SELECT rapidfuzz_prefix_distance('prefix', 'pretext');
SELECT rapidfuzz_prefix_similarity('prefix', 'pretext');
SELECT rapidfuzz_prefix_normalized_distance('prefix', 'pretext');
SELECT rapidfuzz_prefix_normalized_similarity('prefix', 'pretext');
SELECT rapidfuzz_postfix_distance('postfix', 'pretext');
SELECT rapidfuzz_postfix_similarity('postfix', 'pretext');
SELECT rapidfuzz_postfix_normalized_distance('postfix', 'pretext');
SELECT rapidfuzz_postfix_normalized_similarity('postfix', 'pretext');
SELECT rapidfuzz_osa_distance('abcdef', 'azced');
SELECT rapidfuzz_osa_similarity('abcdef', 'azced');
SELECT rapidfuzz_osa_normalized_distance('abcdef', 'azced');
SELECT rapidfuzz_osa_normalized_similarity('abcdef', 'azced');
SELECT rapidfuzz_lcs_seq_distance('abcdef', 'acbcf');
SELECT rapidfuzz_lcs_seq_similarity('abcdef', 'acbcf');
SELECT rapidfuzz_lcs_seq_normalized_distance('abcdef', 'acbcf');
SELECT rapidfuzz_lcs_seq_normalized_similarity('abcdef', 'acbcf');
All functions support DuckDB VARCHAR
type. For best results, use with textual data.
SELECT rapidfuzz_ratio('database', 'databse');
SELECT rapidfuzz_partial_ratio('duckdb extension', 'extension');
SELECT rapidfuzz_token_sort_ratio('fuzzy string match', 'string fuzzy match');
SELECT rapidfuzz_token_set_ratio('fuzzy string match', 'string fuzzy match');
SELECT name, rapidfuzz_ratio(name, 'Jon Smith') AS similarity
FROM users
WHERE rapidfuzz_ratio(name, 'Jon Smith') > 80;
SELECT a.id, b.id, rapidfuzz_ratio(a.name, b.name) AS score
FROM table_a a
JOIN table_b b ON rapidfuzz_ratio(a.name, b.name) > 85;
SELECT query, candidate, rapidfuzz_partial_ratio(query, candidate) AS score
FROM search_candidates
ORDER BY score DESC
LIMIT 10;
- General similarity: Use
rapidfuzz_ratio
for overall similarity. - Partial matches: Use
rapidfuzz_partial_ratio
for substring matches. - Reordered words: Use
rapidfuzz_token_sort_ratio
for strings with the same words in different orders.
- RapidFuzz algorithms are highly optimized for speed and accuracy.
- For large datasets, use WHERE clauses to filter by similarity threshold.
- Preprocess your data (e.g., lowercase, trim) for best results.
MIT Licensed