SQLite Fuzzy Search

Ever wanted to learn how Google-like fuzzy search works?

Firstly, make the table using the fts5¹ extension:

CREATE VIRTUAL TABLE notes_fts USING fts5(contents);

Now you can insert some dummy data:

INSERT INTO notes_fts(contents) VALUES
    ('quick brown fox jumped over the lazy dog'),
    ('i am drawing a nice orange dog'),
    ('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.');

To do a basic search:

SELECT
    contents
FROM notes_fts('brown dog');
👉quick brown fox jumped over the lazy dog

You can also highlight the words:

SELECT
-   contents
+   highlight(notes_fts, 0, '[', ']')
FROM notes_fts('brown dog');
👉quick [brown] fox jumped over the lazy [dog]

or send back a snippet if the text is looong:

SELECT
-   contents
+   snippet(notes_fts, 0, '[', ']', '...', 6)
FROM notes_fts('lorem sit');
👉[Lorem] ipsum dolor [sit] amet, consectetur...

Ok, can we have partial matches though?

SELECT contents FROM notes_fts('draw');
👉

Not for free.

Partial Words

Partial matches can be hacked by adding a wildcard *:

SELECT contents FROM notes_fts('draw*');
👉i am drawing a nice orange dog

But this is a hack, any good search engine should be able to match draw <=> drawing. Can we do any better?

Yes. We can use a trigram tokenizer which is good at generalized substring matching. First, setup the index again:

DROP TABLE notes_fts;
CREATE VIRTUAL TABLE notes_fts USING fts5(contents, tokenize="trigram");
-- NOW re-apply the inserts, need to populate the data.

And search:

SELECT highlight(notes_fts, 0, '[', ']')
FROM notes_fts('draw');
👉i am [draw]ing a nice orange dog

Great, but what if I misspell drawing as draving?

Spelling mistake tolerance

To fix spelling mistakes (query = 'quick brovn fox'), we: 1. split the query into tokens - tokens = ['quick', ' ', 'brovn', ' ', 'fox'] 2. for each token, find closest word used in our vocabulary - tokens.map(findBestMatch) - ['quick', ' ', 'brown', ' ', 'fox'] 3. join tokens ('quick brown fox')

How does this happen in practice? We use spellfix1² to build a table containing all words used in the notes:

CREATE VIRTUAL TABLE notes_vocab USING spellfix1;
INSERT INTO notes_vocab(word)
VALUES
    ('quick'),
    ('brown'),
    ('fox'),
    ('jumped');
-- and more...

To implement findBestMatch(word), just query the table using the word column:

SELECT word FROM notes_vocab
WHERE word MATCH 'quicc' AND top=1;
👉quick

Yay! But you might be left with a problem of weak matches:

SELECT word
FROM notes_vocab
WHERE
    word MATCH 'qic' AND
    top=1;
👉quick

To fix these weak matches, just play with the score column:

SELECT word
FROM notes_vocab
WHERE
    word MATCH 'qic' AND
-   top=1
+   top=1 AND
+   score < 50;
👉

Longer words and variants

We previously established how to match substrings:

SELECT highlight(notes_fts, 0, '[', ']')
FROM notes_fts('draw');
👉i am [draw]ing a nice orange dog

A similar problem that needs to be solved is what if the query is a longer word, like jumps vs jumped. Currently no match will be returned:

SELECT highlight(notes_fts, 0, '[', ']')
FROM notes_fts('jumps');
👉

The solution is simple: stemming. It means removing prefixes and postfixes of words that are unnecessary.

This is super language-dependent, here are some examples in English:

stemmer('considerations') // => 'consider'
stemmer('detestable')     // => 'detest'
stemmer('vileness')       // => 'vile'
stemmer('jumps')          // => 'jump'

Feeding the words through the stemmer gives us what we were looking for:

SELECT highlight(notes_fts, 0, '[', ']')
FROM notes_fts('jump');
👉quick brown fox [jump]ed over the lazy dog

Summary

use fts for fuzzy searching
- use trigrams for subset matching
apply pre-processing
- use spellfix for spelling mistake tolerance
- stem your words to match variants of the words

Alternatively to fuzzy searching, we can also try matching based on the meaning of the text. Look up “semantic search” for more info.

If you like this article format - check out Diataxis’ How-To Guides.

https://www.sqlite.org/fts5.html↩︎
https://sqlite.org/spellfix1.html↩︎