...

/

Trigrams, Similarity, and Searches

Trigrams, Similarity, and Searches

Learn about the similarity and search functions offered by the pg_trgm extension.

The idea behind trigrams is simple and very effective. Split your text into a consecutive series of three letters. That’s it. Then you can compare two texts based on how many consecutive three-letter series (trigrams) are common. That’s the notion of similarity. It works surprisingly well and doesn’t depend on the language used.

Similarity function in trigrams

In the following query, we show trigrams extracted from several attempts at spelling the name “Tommy” and then the similarity value obtained when comparing tomy and dim to tom.

Press + to interact
select show_trgm('tomy') as tomy,
show_trgm('Tomy') as "Tomy",
show_trgm('tom torn') as "tom torn",
similarity('tomy', 'tom'),
similarity('dim', 'tom');

Note that when using small units of text, the similarity might look more like a guess than anything. Also, before we read the result of the query, here’s what the pg_trgm documentation says about the similarity function:

“Returns a number that indicates how similar the two arguments are. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical).”

-[ RECORD 1 ]-------------------------------------
tomy       | {"  t"," to","my ",omy,tom}
Tomy       | {"  t"," to","my ",omy,tom}
tom torn   | {"  t"," to","om ",orn,"rn ",tom,tor}
similarity | 0.5
similarity | 0

As you can read in the PostgreSQL trigram extension documentation, the default similarity threshold is 0.3, and you can tweak it by using the GUC setting pg_trgm.similarity_threshold.

Use case: Searching for songs

Now we can search for songs about love, in our ...