Character and Text
Learn about text processing, pattern matching, and regular expressions in PostgreSQL.
PostgreSQL knows how to deal with characters and text, and it implements several data types for that, all documented in the character types chapter of the documentation.
About the data type itself, it must be noted that text
and varchar
are the same things as far as PostgreSQL is concerned, and character varying is an alias for varchar
. When using varchar(15)
, we’re basically telling PostgreSQL to manage a text
column with a check
constraint of 15 characters.
Yes, PostgreSQL knows how to count characters even with Unicode encoding.
Text processing in PostgreSQL
There’s a very rich set of PostgreSQL functions to process the text—we can find them all in the String Functions and
Operators documentation chapter—with functions such as overlay()
, substring()
, position()
, or trim()
. Or aggregates such as string_agg()
. There are also regular expression functions, including the very powerful regexp_split_to_table()
.
For more about PostgreSQL regular expressions, read the main documentation about them in the chapter on pattern matching.
Pattern matching
Additionally, to the classic like
and ilike
patterns and to the SQL standard similar to
operators, PostgreSQL embeds support for a full-blown regular expression matching engine. The main operator implementing REGEXP is ~
, and then we find the derivatives for not matching and match either case. In total, we have four operators: ~
, !~
, ~*
, and !~*
.
Regular expressions
The regular expression split functions are powerful in many use cases. In particular, they’re very helpful when we have to work with a messy schema in which a single column represents several bits of information in a pseudo-specified way. An example of such a dataset is available in open data: the Archives de la Planète or “planet archives.” The data is available as CSV and, once loaded, looks like the following, all in French (but it doesn’t matter very much for our purposes here):
─[ RECORD 1 ]──────────────────────────────────────────────
id │ IF39599
inventory │ A 2 037
orig_legend │ Serbie, Monastir Bitolj, Un Turc
legend │ Un Turc
location │ Monastir (actuelle Bitola), Macédoine
date │ mai 1913
operator │ Auguste Léon
...
themes │ Habillement > Habillement traditionnel,Etres …
│…humains > Homme,Etres humains > Portrait,Rela…
│…tions internationales > Présence étrangère
...
collection │ Archives de la Planète
...
Note: PostgreSQL also supports indexing for regular expressions thanks to its trigram extension:
pg_trgm
.
You can see that the themes
column contains several categories for a single entry, separated by a comma. Within that comma-separated list, we find another classification, this time separated with a greater than sign, which looks like a hierarchical categorization of the themes.
The value IF39599
from the id
column is relevant to that series of themes:
id │ cat1 │ cat2
═════════╪═══════════════════════════╪══════════════════════════
IF39599 │ Habillement │ Habillement traditionnel
IF39599 │ Etres humains │ Homme
IF39599 │ Etres humains │ Portrait
IF39599 │ Relations internationales │ Présence étrangère
(4 rows)
The question is, how do we get that information? Also, is it possible to have an idea of the distribution of the whole dataset in relation to the categories embedded in the themes
column?
With PostgreSQL, this is easy enough to achieve.
Pattern matching and regular expressions
Let’s use the techniques available in PostgreSQL to achieve the required output.
Using split in PostgreSQL
We’re going to split the themes
column using a regular expression:
Get hands-on with 1400+ tech skills courses.