Regular Expressions

Learn how to use regular expressions in Python.

Overview

It’s really hard to use object-oriented principles to parse strings that match arbitrary patterns. A fair number of academic papers use object-oriented design to set up string parsing, but the results tend to be verbose and hard to read, and they are not widely used in practice.

In the real world, string-parsing in most programming languages is handled by regular expressions. These are not verbose, but, wow, are they ever hard to read, at least until we learn the syntax. Even though regular expressions are not object-oriented, the Python regular expression library provides a few classes and objects that we can use to construct and run regular expressions.
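As a minimal sketch of the classes and objects mentioned above, the standard re module can compile pattern text into a pattern object, and a successful search returns a match object. The pattern text and the sample sentences here are invented for illustration.

```python
import re

# Compile the pattern text into a reusable pattern object (an re.Pattern).
pattern = re.compile(r"hello")

# A successful search returns a match object (an re.Match); a failed one returns None.
match = pattern.search("say hello to regular expressions")
miss = pattern.search("goodbye")

print(type(pattern).__name__)
print(type(match).__name__)
print(match.group(0), match.span())  # the matched text and its (start, end) position
print(miss)
```

Compiling is optional. Module-level functions such as re.search() accept the pattern text directly, but a compiled pattern object is convenient when the same expression is applied to many strings.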

Regular expressions as mathematical rules

While we use regular expressions to match a string, this is only a partial description of what a regular expression really is. It can help to think of a regular expression as a mathematical rule that could generate a (potentially infinite) collection of strings. When we match a regular expression, it’s similar to asking if a given string is in the set generated by the expression. What’s tricky is rewriting some fancy math using the paltry collection of punctuation marks available in the original ASCII character set. To help explain the syntax of regular expressions, we’ll take a little side-tour through some of these typographic problems that make regular expressions a challenge to read.
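One way to make the “set of strings” view concrete is to treat matching as a membership test. The sketch below assumes a hypothetical helper, in_generated_set(), and an invented literal pattern. It uses re.fullmatch() so that the whole candidate string has to be described by the expression, not just part of it.

```python
import re

def in_generated_set(pattern: str, candidate: str) -> bool:
    """Return True if the entire candidate is one of the strings the pattern describes."""
    return re.fullmatch(pattern, candidate) is not None

# Hypothetical pattern and candidates, chosen only to illustrate membership.
candidates = ["cat", "cats", "Cat", "ca"]
results = {text: in_generated_set(r"cat", text) for text in candidates}

for text, is_member in results.items():
    print(f"{text!r}: {is_member}")
```

Only "cat" is in the set. The other candidates are close, but a single extra, missing, or differently cased character puts them outside it.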

Visualizing regular expressions

Here’s an idealized mathematical regular expression for a small set of strings: world. We want to match these five characters. The set has one string, "world", that matches. This doesn’t seem too complex; the expression amounts to w AND o AND r AND l AND d with “AND” being implied. This parallels the way d = rt means d = r times t; the multiplication is implied.
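The implied “AND” can be sketched in Python by building the pattern from its single-character pieces. The variable names are illustrative. Placing the pieces side by side is all that is needed to say “w, then o, then r, then l, then d”.

```python
import re

# Each piece matches exactly one character; adjacency supplies the implied "AND".
pieces = ["w", "o", "r", "l", "d"]
pattern_text = "".join(pieces)

whole = re.fullmatch(pattern_text, "world")    # all five characters present, in order
partial = re.fullmatch(pattern_text, "wor")    # stops short, so it is not in the set
shuffled = re.fullmatch(pattern_text, "wrold") # same characters, wrong order

print(pattern_text)
print(whole is not None, partial is not None, shuffled is not None)
```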

Here’s a regular expression for a pattern with repeats: hel²o. We want to match five characters, but one of them must occur twice. This set has one string, "hello", that matches. This emphasizes the parallel between regular ...