Ever wanted to learn how Google-like fuzzy search works?
Firstly, make the table using the fts51 extension:
CREATE VIRTUAL TABLE notes_fts USING fts5(contents);
Now you can insert some dummy data:
INSERT INTO notes_fts(contents) VALUES
'quick brown fox jumped over the lazy dog'),
('i am drawing a nice orange dog'),
('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.'); (
To do a basic search:
SELECT
contents
FROM notes_fts('brown dog');
over the lazy dog 👉quick brown fox jumped
You can also highlight the words:
SELECT- contents
+ highlight(notes_fts, 0, '[', ']')
FROM notes_fts('brown dog'); 👉quick [brown] fox jumped over the lazy [dog]
or send back a snippet if the text is looong:
SELECT- contents
+ snippet(notes_fts, 0, '[', ']', '...', 6)
FROM notes_fts('lorem sit'); 👉[Lorem] ipsum dolor [sit] amet, consectetur...
Ok, can we have partial matches though?
SELECT contents FROM notes_fts('draw');
👉
Not for free.
Partial Words
Partial matches can be hacked by adding a wildcard *
:
SELECT contents FROM notes_fts('draw*');
👉i am drawing a nice orange dog
But this is a hack, any good search engine should be able to match
draw <=> drawing
. Can we do any better?
Yes. We can use a trigram
tokenizer which is good at generalized
substring matching. First, setup the index again:
DROP TABLE notes_fts;
CREATE VIRTUAL TABLE notes_fts USING fts5(contents, tokenize="trigram");
-- NOW re-apply the inserts, need to populate the data.
And search:
SELECT highlight(notes_fts, 0, '[', ']')
FROM notes_fts('draw');
👉i am [draw]ing a nice orange dog
Great, but what if I misspell drawing
as draving
?
Spelling mistake tolerance
To fix spelling mistakes (query = 'quick brovn fox'
), we: 1.
split the query into tokens -
tokens = ['quick', ' ', 'brovn', ' ', 'fox']
2. for each token,
find closest word used in our vocabulary -
tokens.map(findBestMatch)
-
['quick', ' ', 'brown', ' ', 'fox']
3. join tokens ('quick brown fox'
)
How does this happen in practice? We use spellfix12 to build a table containing all words used in the notes:
CREATE VIRTUAL TABLE notes_vocab USING spellfix1;
INSERT INTO notes_vocab(word)
VALUES
'quick'),
('brown'),
('fox'),
('jumped');
(-- and more...
To implement findBestMatch(word)
, just query the table using the
word
column:
SELECT word FROM notes_vocab
WHERE word MATCH 'quicc' AND top=1;
👉quick
Yay! But you might be left with a problem of weak matches:
SELECT word
FROM notes_vocab
WHERE
'qic' AND
word MATCH =1;
top 👉quick
To fix these weak matches, just play with the score
column:
SELECT word
FROM notes_vocab
WHERE
word MATCH 'qic' AND- top=1
+ top=1 AND
+ score < 50;
👉
Longer words and variants
We previously established how to match substrings:
SELECT highlight(notes_fts, 0, '[', ']')
FROM notes_fts('draw');
👉i am [draw]ing a nice orange dog
A similar problem that needs to be solved is what if the query is a longer
word, like jumps
vs jumped
. Currently no match will
be returned:
SELECT highlight(notes_fts, 0, '[', ']')
FROM notes_fts('jumps');
👉
The solution is simple: stemming. It means removing prefixes and postfixes of words that are unnecessary.
This is super language-dependent, here are some examples in English:
stemmer('considerations') // => 'consider'
stemmer('detestable') // => 'detest'
stemmer('vileness') // => 'vile'
stemmer('jumps') // => 'jump'
Feeding the words through the stemmer gives us what we were looking for:
SELECT highlight(notes_fts, 0, '[', ']')
FROM notes_fts('jump');
over the lazy dog 👉quick brown fox [jump]ed
Summary
-
use
fts
for fuzzy searching- use
trigrams
for subset matching
- use
-
apply pre-processing
- use
spellfix
for spelling mistake tolerance - stem your words to match variants of the words
- use
Alternatively to fuzzy searching, we can also try matching based on the meaning of the text. Look up “semantic search” for more info.
If you like this article format - check out Diataxis’ How-To Guides.