Regular Expressions for Text Preprocessing
Learn about metacharacters, quantifiers, shorthand character classes, escape sequences, and character classes.
Introduction
Regular expressions are built from many elements. This lesson covers the basic and most common ones; the rarer, more advanced ones come later.
Character classes
A character class defines a set of characters and matches any single character from that set. We use character classes in regex when we want to match one character that can be any of several options, instead of writing out each alternative separately. To define a character class, we enclose the set of characters within square brackets [].
Here are some character classes and what they match:
[a-z]: Matches any lowercase letter
[A-Z]: Matches any uppercase letter
[0-9]: Matches any digit
[a-zA-Z]: Matches any letter (both lowercase and uppercase)
[a-zA-Z0-9]: Matches any letter or digit
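As a minimal sketch of the classes listed above, the following uses Python's built-in re module. The sample strings and variable names are made up for illustration; each class matches exactly one character at a time, so re.findall returns the matching characters individually.

```python
import re

# Hypothetical sample text, chosen only to illustrate the classes.
text = "Order 66 shipped to Area 51B"

# [0-9] matches one digit per match.
digits = re.findall(r"[0-9]", text)

# [A-Z] matches one uppercase letter per match.
uppercase = re.findall(r"[A-Z]", text)

# [a-zA-Z0-9] matches one letter or digit; the hyphen and the space are skipped.
alnum = re.findall(r"[a-zA-Z0-9]", "a-1 B")

print(digits)     # ['6', '6', '5', '1']
print(uppercase)  # ['O', 'A', 'B']
print(alnum)      # ['a', '1', 'B']
```

Note that 66 comes back as two separate matches, because a character class on its own consumes a single character.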
Metacharacters
Metacharacters are characters with special meanings, and we use them when we want to search, extract, or manipulate text data based on specific patterns or rules. For example, let’s say we have a text document containing a list of email addresses, and we want to extract all the addresses that end with @hello.com. We can use the .* metacharacter sequence to represent any number of characters before @hello.com. Here’s an example regular expression pattern using metacharacters: .*@hello\.com. In this pattern, .* matches any number of characters (including zero characters) before the @ symbol. The \. represents a literal dot character: the backslash is needed because an unescaped dot is itself a metacharacter that matches any character. Together, \.com matches the .com characters at the end of the email addresses.
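The pattern can be tried out with a short sketch in Python's re module. The addresses below are invented for illustration, and matching one address per line with fullmatch is an assumption about how the document is laid out, not something the lesson specifies.

```python
import re

# Hypothetical document: one address per line (an assumption for this sketch).
document = """alice@hello.com
bob@example.org
carol.smith@hello.com
dave@helloXcom"""

# .* matches any run of characters; \. matches only a literal dot.
pattern = re.compile(r".*@hello\.com")

# Keep the lines that match the pattern from start to end.
matches = [line for line in document.splitlines() if pattern.fullmatch(line)]
print(matches)  # ['alice@hello.com', 'carol.smith@hello.com']

# Without the backslash, the dot matches any character, so a malformed
# address slips through.
loose = re.compile(r".*@hello.com")
print(loose.fullmatch("dave@helloXcom") is not None)  # True
```

Because .* also matches zero characters, this pattern accepts the bare string @hello.com as well. It works as a filter on the domain, not as an email validator.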
Here are some common metacharacters:
Dot (.): This metacharacter matches exactly one ...