Trigrams, Similarity, and Searches
Learn about the similarity and search functions offered by the pg_trgm extension.
The idea behind trigrams is simple and very effective. Split your text into a consecutive series of three letters. That’s it. Then you can compare two texts based on how many consecutive three-letter series (trigrams) are common. That’s the notion of similarity. It works surprisingly well and doesn’t depend on the language used.
Similarity function in trigrams
In the following query, we show trigrams extracted from several attempts at spelling the name
“Tommy” and then the similarity value obtained when comparing tomy
and dim
to tom
.
select show_trgm('tomy') as tomy,show_trgm('Tomy') as "Tomy",show_trgm('tom torn') as "tom torn",similarity('tomy', 'tom'),similarity('dim', 'tom');
Note that when using small units of text, the similarity might look more like a guess than anything. Also, before we read the result of the query, here’s what the pg_trgm
documentation says about the similarity
function:
“Returns a number that indicates how similar the two arguments are. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical).”
-[ RECORD 1 ]-------------------------------------
tomy | {" t"," to","my ",omy,tom}
Tomy | {" t"," to","my ",omy,tom}
tom torn | {" t"," to","om ",orn,"rn ",tom,tor}
similarity | 0.5
similarity | 0
As you can read in the PostgreSQL trigram extension documentation, the default similarity threshold is 0.3
, and you can tweak it by using the GUC setting pg_trgm.similarity_threshold
.
Use case: Searching for songs
Now we can search for songs about love, in our ...