Overview

Parsing strings to match arbitrary patterns is really hard to do with object-oriented principles alone. A fair number of academic papers use object-oriented design to set up string parsing, but the results tend to be verbose and hard to read, and they are not widely used in practice.

In the real world, string-parsing in most programming languages is handled by regular expressions. These are not verbose, but, wow, are they ever hard to read, at least until we learn the syntax. Even though regular expressions are not object-oriented, the Python regular expression library provides a few classes and objects that we can use to construct and run regular expressions.
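As a small sketch of those classes and objects: `re.compile()` builds a `re.Pattern` object, and its matching methods return a `re.Match` object (or `None` when nothing matches). The pattern and sample text here are illustrative choices only.

```python
import re

# Compile the pattern once; the result is a reusable re.Pattern object.
pattern = re.compile(r"world")

# search() scans the string and returns a re.Match object, or None.
match = pattern.search("hello world")

print(type(pattern).__name__)  # Pattern
print(type(match).__name__)    # Match
print(match.group(0), match.start(), match.end())  # world 6 11
```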

Regular expressions as mathematical rules

While we use regular expressions to match a string, this is only a partial description of what a regular expression really is. It can help to think of a regular expression as a mathematical rule that could generate a (potentially infinite) collection of strings. When we match a regular expression, it’s similar to asking if a given string is in the set generated by the expression. What’s tricky is rewriting some fancy math using the paltry collection of punctuation marks available in the original ASCII character set. To help explain the syntax of regular expressions, we’ll take a little side-tour through some of these typographic problems that make regular expressions a challenge to read.

Visualizing regular expressions

Here’s an idealized mathematical regular expression for a small set of strings: world. We want to match these five characters. The set has one string, "world", that matches. This doesn’t seem too complex; the expression amounts to w AND o AND r AND l AND d with “AND” being implied. This parallels the way d = rt means d = r times t; the multiplication is implied.

Here’s a regular expression for a pattern with repeats: hel²o. We want to match five characters, but one of them must occur twice. This set has one string, "hello", that matches. This emphasizes the parallel between regular expressions, multiplication, and exponents. It also points out the use of exponents to distinguish between matching the 2 character and matching the previous regular expression two times.
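In Python's `re` syntax, the two idealized expressions above might be written as follows. This is a minimal sketch: the raised 2 becomes `{2}`, and `re.fullmatch()` stands in for the question "is this string in the set?"

```python
import re

# Literal characters match themselves; fullmatch() asks whether the
# whole string belongs to the set the pattern describes.
world = re.fullmatch(r"world", "world")

# {2} is the ASCII spelling of the raised 2: repeat the previous item twice.
hello = re.fullmatch(r"hel{2}o", "hello")

# Without the braces, 2 is just a literal digit character.
literal_two = re.fullmatch(r"hel2o", "hel2o")
not_hello = re.fullmatch(r"hel2o", "hello")

print(bool(world), bool(hello), bool(literal_two), bool(not_hello))
# True True True False
```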

Flexibility with fonts and digits

Sometimes, we want some flexibility, and we want to match any digit. Mathematical typesetting lets us use a new font for 𝔻⁴. This fancy-looking D means any digit, or 𝔻 = {0,1,2,3,4,5,6,7,8,9}, and the raised 4 means four copies. This describes a set that has 10,000 possible matching strings from “0000” to “9999.” Why use the fancy math typesetting? We can use different fonts and letter arrangements to distinguish the concept of “any digit” and “four copies” from the letter D and the digit 4. Code lacks the fancy fonts, forcing designers to work around the distinction between letters meaning themselves, like D, and letters having other useful meanings, like 𝔻.
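One way to check the count of 10,000 is to write the "any digit, four copies" idea in Python's syntax and test every candidate. The sketch below uses `[0-9]` for the digit set; Python also offers `\d`, but for `str` patterns `\d` matches any Unicode decimal digit unless the `re.ASCII` flag is given, so `[0-9]` is the closer match to the ten-element set above.

```python
import re

# [0-9] plays the role of the fancy D; {4} plays the role of the raised 4.
four_digits = re.compile(r"[0-9]{4}")

# Every zero-padded string from "0000" to "9999".
candidates = [f"{n:04d}" for n in range(10_000)]
matching = [s for s in candidates if four_digits.fullmatch(s)]

print(len(matching), matching[0], matching[-1])  # 10000 0000 9999

# The letter D and the digit 4 are not in the set.
print(four_digits.fullmatch("D4"))  # None
```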

And yes, a regular expression looks a lot like a long multiplication. There’s a very strong parallel with “must have these” and multiplication. Is there a parallel with addition? Yes, it’s the idea of optional or alternative constructs; in effect an “or” instead of the default “and.”

Handling variable-length patterns

What if we want to describe years in a date where there could be two digits or four digits? Mathematically, we might say 𝔻² | 𝔻⁴. What if we’re not sure how many digits? We have a special “to any power,” the Kleene star. We can say 𝔻* to mean any number of repeats of a character in the 𝔻 set.
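In Python's syntax, the "or" is written with `|` and the Kleene star with `*`. A minimal sketch, again using `[0-9]` for the digit set; the sample strings are illustrative:

```python
import re

# Two digits OR four digits.
year = re.compile(r"[0-9]{2}|[0-9]{4}")

# Any number of digits, including none at all.
any_digits = re.compile(r"[0-9]*")

results = {text: bool(year.fullmatch(text)) for text in ["24", "2024", "202"]}
print(results)  # {'24': True, '2024': True, '202': False}

print(bool(any_digits.fullmatch("")))         # True: zero repeats is allowed
print(bool(any_digits.fullmatch("1234567")))  # True
print(bool(any_digits.fullmatch("12a")))      # False
```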

All of this math typesetting has to be implemented in the regular expression language. This can make it difficult to sort out precisely what a regular expression means.

Use cases

Regular expressions are used to solve a common problem: given a string, determine whether that string matches a given pattern and, optionally, collect substrings that contain relevant information. They can be used to answer questions such as the following:

  • Is this string a valid URL?
  • What is the date and time of all warning messages in a log file?
  • Which users in /etc/passwd are in a given group?
  • What username and document were requested by the URL a visitor typed?

There are many similar scenarios where regular expressions are the correct answer. In this section, we’ll gain enough knowledge of regular expressions to compare strings against relatively common patterns.
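As a preview of the log-file question, here is a rough sketch. The log format (date, time, level, message) and the sample lines are assumptions made up for illustration; real log files vary. The parentheses are groups, which collect the substrings we care about.

```python
import re

# Hypothetical log format, assumed for illustration:
#   YYYY-MM-DD HH:MM:SS LEVEL message
log_lines = [
    "2024-03-01 09:15:02 INFO service started",
    "2024-03-01 09:17:45 WARNING disk usage at 91%",
    "2024-03-01 09:20:10 ERROR write failed",
    "2024-03-01 09:21:33 WARNING retrying write",
]

# Group 1 captures the date, group 2 the time; only WARNING lines match.
warning = re.compile(
    r"([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) WARNING "
)

warning_times = []
for line in log_lines:
    if m := warning.match(line):
        warning_times.append((m.group(1), m.group(2)))

print(warning_times)
# [('2024-03-01', '09:17:45'), ('2024-03-01', '09:21:33')]
```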

Limitations of regular expressions

There are important limitations here. Regular expressions don’t describe languages with recursive structures. When we look at XML or HTML, for example, a <p> tag can contain inline <span> tags, like this: <p><span>hello</span><span>world</span></p>. This recursive nesting of tag-within-tag is generally not a great thing to try and process with a regular expression. We can recognize the individual elements of the XML language, but higher-level constructs like a paragraph tag with other tags inside it require more powerful tools than regular expressions. The XML parsers in the Python standard library can handle these more complex constructs.
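The division of labor can be sketched like this: a regular expression picks out the individual tags, while `xml.etree.ElementTree` from the standard library recovers the nesting. The tag pattern is a deliberately simple assumption (lowercase tag names, no attributes), not a general XML tokenizer.

```python
import re
import xml.etree.ElementTree as ET

document = "<p><span>hello</span><span>world</span></p>"

# A regular expression can recognize the individual tags...
tags = re.findall(r"</?[a-z]+>", document)
print(tags)
# ['<p>', '<span>', '</span>', '<span>', '</span>', '</p>']

# ...but working out which tag is inside which is a parser's job.
root = ET.fromstring(document)
children = [(child.tag, child.text) for child in root]
print(root.tag, children)
# p [('span', 'hello'), ('span', 'world')]
```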

Matching patterns

Regular expressions are a complicated mini-language. We need to be able to describe individual characters, classes of characters, and operators that group and combine characters, all using a few ASCII-compatible characters. Let’s start with literal characters, such as letters, numbers, and the space character, which always match themselves.

Example

Let’s see a basic example:
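A minimal sketch of literal matching, assuming the search string and pattern below (both chosen for illustration). Every character in the pattern, including the space, matches itself:

```python
import re

search_string = "hello world"
pattern = r"hello world"

# re.match() tries to match the pattern at the start of the string
# and returns a Match object on success, or None on failure.
if match := re.match(pattern, search_string):
    print("regex matches")
    print(match.span())  # (0, 11)
else:
    print("regex does not match")
```

Changing a single character in either string, for example searching "hello worl", makes `re.match()` return `None`, because the literal d has nothing to match.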
