RapidFuzz Extension for DuckDB

This rapidfuzz extension adds high-performance fuzzy string matching and string edit distance functions to DuckDB, powered by the RapidFuzz C++ library.

Installation

rapidfuzz is a DuckDB Community Extension.

You can use it in DuckDB SQL:

install rapidfuzz from community;
load rapidfuzz;

What is Fuzzy String Matching?

Fuzzy string matching allows you to compare strings and measure their similarity, even when they are not exactly the same. This is useful for:

Data cleaning and deduplication
Record linkage
Search and autocomplete
Spell checking

RapidFuzz provides fast, high-quality algorithms for string similarity and matching.

Available Functions

This extension exposes several core RapidFuzz algorithms as DuckDB scalar functions:

`rapidfuzz_ratio(a, b)`

Returns: DOUBLE (similarity score between 0 and 100)
Description: Computes the similarity ratio between two strings.

SELECT rapidfuzz_ratio('hello world', 'helo wrld');
┌─────────────────────────────────────────────┐
│ rapidfuzz_ratio('hello world', 'helo wrld') │
│                   double                    │
├─────────────────────────────────────────────┤
│                    90.0                     │
└─────────────────────────────────────────────┘

`rapidfuzz_partial_ratio(a, b)`

Returns: DOUBLE
Description: Computes the best partial similarity score between substrings of the two inputs.

SELECT rapidfuzz_partial_ratio('hello world', 'world');
┌─────────────────────────────────────────────────┐
│ rapidfuzz_partial_ratio('hello world', 'world') │
│                     double                      │
├─────────────────────────────────────────────────┤
│                      100.0                      │
└─────────────────────────────────────────────────┘

`rapidfuzz_token_sort_ratio(a, b)`

Returns: DOUBLE
Description: Compares strings after sorting their tokens (words), useful for matching strings with reordered words.

SELECT rapidfuzz_token_sort_ratio('world hello', 'hello world');
┌──────────────────────────────────────────────────────────┐
│ rapidfuzz_token_sort_ratio('world hello', 'hello world') │
│                          double                          │
├──────────────────────────────────────────────────────────┤
│                          100.0                           │
└──────────────────────────────────────────────────────────┘

`rapidfuzz_token_set_ratio(a, b)`

Returns: DOUBLE
Description: A similarity metric that compares sets of tokens between two strings, ignoring duplicated words and word order.

SELECT rapidfuzz_token_set_ratio('new york new york city', 'new york city');
┌──────────────────────────────────────────────────────────────────────┐
│ rapidfuzz_token_set_ratio('new york new york city', 'new york city') │
│                                double                                │
├──────────────────────────────────────────────────────────────────────┤
│                                100.0                                 │
└──────────────────────────────────────────────────────────────────────┘

`rapidfuzz_ratio(a, b)`

Returns: DOUBLE (similarity score between 0 and 100)
Description: Computes the similarity ratio between two strings.

SELECT rapidfuzz_ratio('hello world', 'helo wrld');
┌─────────────────────────────────────────────┐
│ rapidfuzz_ratio('hello world', 'helo wrld') │
│                   double                    │
├─────────────────────────────────────────────┤
│                    90.0                     │
└─────────────────────────────────────────────┘

`rapidfuzz_partial_ratio(a, b)`

Returns: DOUBLE
Description: Computes the best partial similarity score between substrings of the two inputs.

SELECT rapidfuzz_partial_ratio('hello world', 'world');
┌─────────────────────────────────────────────────┐
│ rapidfuzz_partial_ratio('hello world', 'world') │
│                     double                      │
├─────────────────────────────────────────────────┤
│                      100.0                      │
└─────────────────────────────────────────────────┘

`rapidfuzz_token_sort_ratio(a, b)`

Returns: DOUBLE
Description: Compares strings after sorting their tokens (words), useful for matching strings with reordered words.

SELECT rapidfuzz_token_sort_ratio('world hello', 'hello world');
┌──────────────────────────────────────────────────────────┐
│ rapidfuzz_token_sort_ratio('world hello', 'hello world') │
│                          double                          │
├──────────────────────────────────────────────────────────┤
│                          100.0                           │
└──────────────────────────────────────────────────────────┘

`rapidfuzz_token_set_ratio(a, b)`

Returns: DOUBLE
Description: A similarity metric that compares sets of tokens between two strings, ignoring duplicated words and word order.

SELECT rapidfuzz_token_set_ratio('new york new york city', 'new york city');
┌──────────────────────────────────────────────────────────────────────┐
│ rapidfuzz_token_set_ratio('new york new york city', 'new york city') │
│                                double                                │
├──────────────────────────────────────────────────────────────────────┤
│                                100.0                                 │
└──────────────────────────────────────────────────────────────────────┘

Distance and Similarity Functions

In addition to the main functions above, the extension provides a wide range of distance, similarity, and normalized functions for various algorithms. For each algorithm, the following function variants are available:

<algorithm>_distance(a, b)
<algorithm>_similarity(a, b)
<algorithm>_normalized_distance(a, b)
<algorithm>_normalized_similarity(a, b)

All functions take two VARCHAR arguments and return a DOUBLE..

Algorithm Descriptions

Jaro: Measures similarity based on the number and order of matching characters. Good for short strings and typos.
Jaro-Winkler: Extension of Jaro that gives more weight to common prefixes. Useful for short strings, names, and typos.
Hamming: Counts the number of differing characters at the same positions. Only defined for strings of equal length.
Indel: Measures the minimum number of insertions and deletions to transform one string into another (no substitutions).
Prefix: Measures the edit distance/similarity considering only prefixes of the strings.
Postfix: Measures the edit distance/similarity considering only postfixes (suffixes) of the strings.
OSA (Optimal String Alignment): Like Levenshtein, but allows for transpositions of adjacent characters (each substring can be edited only once).
LCS Sequence (Longest Common Subsequence): Measures similarity based on the length of the longest common subsequence (not necessarily contiguous).

Example Function List

For each algorithm below, the following functions are available:

rapidfuzz_<algorithm>_distance(a, b)
rapidfuzz_<algorithm>_similarity(a, b)
rapidfuzz_<algorithm>_normalized_distance(a, b)
rapidfuzz_<algorithm>_normalized_similarity(a, b)

Jaro

SELECT rapidfuzz_jaro_distance('duck', 'duke');
SELECT rapidfuzz_jaro_similarity('duck', 'duke');
SELECT rapidfuzz_jaro_normalized_distance('duck', 'duke');
SELECT rapidfuzz_jaro_normalized_similarity('duck', 'duke');

Jaro-Winkler

SELECT rapidfuzz_jaro_winkler_distance('duck', 'duke');
SELECT rapidfuzz_jaro_winkler_similarity('duck', 'duke');
SELECT rapidfuzz_jaro_winkler_normalized_distance('duck', 'duke');
SELECT rapidfuzz_jaro_winkler_normalized_similarity('duck', 'duke');

Hamming

SELECT rapidfuzz_hamming_distance('karolin', 'kathrin');
SELECT rapidfuzz_hamming_similarity('karolin', 'kathrin');
SELECT rapidfuzz_hamming_normalized_distance('karolin', 'kathrin');
SELECT rapidfuzz_hamming_normalized_similarity('karolin', 'kathrin');

Indel

SELECT rapidfuzz_indel_distance('kitten', 'sitting');
SELECT rapidfuzz_indel_similarity('kitten', 'sitting');
SELECT rapidfuzz_indel_normalized_distance('kitten', 'sitting');
SELECT rapidfuzz_indel_normalized_similarity('kitten', 'sitting');

Prefix

SELECT rapidfuzz_prefix_distance('prefix', 'pretext');
SELECT rapidfuzz_prefix_similarity('prefix', 'pretext');
SELECT rapidfuzz_prefix_normalized_distance('prefix', 'pretext');
SELECT rapidfuzz_prefix_normalized_similarity('prefix', 'pretext');

Postfix

SELECT rapidfuzz_postfix_distance('postfix', 'pretext');
SELECT rapidfuzz_postfix_similarity('postfix', 'pretext');
SELECT rapidfuzz_postfix_normalized_distance('postfix', 'pretext');
SELECT rapidfuzz_postfix_normalized_similarity('postfix', 'pretext');

OSA (Optimal String Alignment)

SELECT rapidfuzz_osa_distance('abcdef', 'azced');
SELECT rapidfuzz_osa_similarity('abcdef', 'azced');
SELECT rapidfuzz_osa_normalized_distance('abcdef', 'azced');
SELECT rapidfuzz_osa_normalized_similarity('abcdef', 'azced');

LCS Sequence

SELECT rapidfuzz_lcs_seq_distance('abcdef', 'acbcf');
SELECT rapidfuzz_lcs_seq_similarity('abcdef', 'acbcf');
SELECT rapidfuzz_lcs_seq_normalized_distance('abcdef', 'acbcf');
SELECT rapidfuzz_lcs_seq_normalized_similarity('abcdef', 'acbcf');

Supported Data Types

All functions support DuckDB VARCHAR type. For best results, use with textual data.

Usage Examples

Basic Similarity

SELECT rapidfuzz_ratio('database', 'databse');
SELECT rapidfuzz_partial_ratio('duckdb extension', 'extension');
SELECT rapidfuzz_token_sort_ratio('fuzzy string match', 'string fuzzy match');
SELECT rapidfuzz_token_set_ratio('fuzzy string match', 'string fuzzy match');

Data Deduplication

SELECT name, rapidfuzz_ratio(name, 'Jon Smith') AS similarity
FROM users
WHERE rapidfuzz_ratio(name, 'Jon Smith') > 80;

Record Linkage

SELECT a.id, b.id, rapidfuzz_ratio(a.name, b.name) AS score
FROM table_a a
JOIN table_b b ON rapidfuzz_ratio(a.name, b.name) > 85;

Search and Autocomplete

SELECT query, candidate, rapidfuzz_partial_ratio(query, candidate) AS score
FROM search_candidates
ORDER BY score DESC
LIMIT 10;

Algorithm Selection Guide

General similarity: Use rapidfuzz_ratio for overall similarity.
Partial matches: Use rapidfuzz_partial_ratio for substring matches.
Reordered words: Use rapidfuzz_token_sort_ratio for strings with the same words in different orders.

Performance Tips

RapidFuzz algorithms are highly optimized for speed and accuracy.
For large datasets, use WHERE clauses to filter by similarity threshold.
Preprocess your data (e.g., lowercase, trim) for best results.

License

MIT Licensed

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.github/workflows		.github/workflows
.vscode		.vscode
docs		docs
duckdb @ e502515		duckdb @ e502515
extension-ci-tools @ b808e51		extension-ci-tools @ b808e51
scripts		scripts
src		src
test		test
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.editorconfig		.editorconfig
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
extension_config.cmake		extension_config.cmake
vcpkg.json		vcpkg.json

License

Query-farm/rapidfuzz

Folders and files

Latest commit

History

Repository files navigation

RapidFuzz Extension for DuckDB

Installation

What is Fuzzy String Matching?

Available Functions

rapidfuzz_ratio(a, b)

rapidfuzz_partial_ratio(a, b)

rapidfuzz_token_sort_ratio(a, b)

rapidfuzz_token_set_ratio(a, b)

rapidfuzz_ratio(a, b)

rapidfuzz_partial_ratio(a, b)

rapidfuzz_token_sort_ratio(a, b)

rapidfuzz_token_set_ratio(a, b)

Distance and Similarity Functions

Algorithm Descriptions

Example Function List

Jaro

Jaro-Winkler

Hamming

Indel

Prefix

Postfix

OSA (Optimal String Alignment)

LCS Sequence

Supported Data Types

Usage Examples

Basic Similarity

Data Deduplication

Record Linkage

Search and Autocomplete

Algorithm Selection Guide

Performance Tips

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`rapidfuzz_ratio(a, b)`

`rapidfuzz_partial_ratio(a, b)`

`rapidfuzz_token_sort_ratio(a, b)`

`rapidfuzz_token_set_ratio(a, b)`

`rapidfuzz_ratio(a, b)`

`rapidfuzz_partial_ratio(a, b)`

`rapidfuzz_token_sort_ratio(a, b)`

`rapidfuzz_token_set_ratio(a, b)`

Packages