Advanced Regular Expressions for Text Preprocessing
Learn about groupings, anchors, lookarounds, modifiers, and backreferences.
Groupings
We use groupings to create subpatterns within a larger pattern to match, capture, and backreference text. There are two types of groupings in regex. Capturing groups capture the matched subpatterns and save them for referencing later in the pattern or in a replacement string. Non-capturing groups group subpatterns without storing them for later use. We represent capturing groups using parentheses, (...), and non-capturing groups using (?:...).
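The difference between the two group types can be sketched as follows. This is a minimal illustration, not part of the lesson's dataset: the sample sentence, the date pattern, and all variable names are assumptions chosen for the example.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sentence = "The order shipped on 2023-07-15 and arrived on 2023-07-19."

# Capturing groups: year, month, and day are each stored by position.
capturing = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
first = capturing.search(sentence)
print(first.group(0))   # the whole match
print(first.groups())   # the three captured subpatterns as a tuple

# Non-capturing groups: (?:...) groups the year and month without storing
# them, so the day is the only captured group.
non_capturing = re.compile(r"(?:\d{4})-(?:\d{2})-(\d{2})")
day_only = non_capturing.search(sentence).groups()
print(day_only)

# Backreference in a replacement string: \3, \2, and \1 refer to the
# captured day, month, and year.
reordered = capturing.sub(r"\3/\2/\1", sentence)
print(reordered)

# Backreference inside the pattern: \1 must repeat whatever group 1 matched,
# which finds an accidentally doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was was late")
print(doubled.group(1))
```

Only the capturing version makes the year and month available afterward, which is why the replacement string is built from the capturing pattern.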
In the following code example, we’ll define regular expressions to search for a cat name and a dog name, find matches for colors in the text, and print the results of the capturing and non-capturing groups.
import re
import pandas as pd

df = pd.read_csv('reviews.csv')
text = "\n".join(df['review_text'])
cat_regex = re.compile(r"cat named (\w+)")
cat_match = cat_regex.search(text)
dog_regex = re.compile(r"dog named (\w+)")
dog_match = dog_regex.search(text)
color_regex = re.compile(r"(?:black|brown and white)")
color_matches = color_regex.findall(text)
print("Capturing group result 1:", cat_match)
print(f"John's cat's name is {cat_match.group(1)}\n")
print("Capturing group result 2:", dog_match)
print(f"John's dog's name is {dog_match.group(1)}\n")
print("Non-capturing group result:", color_matches)
print(f"The colors of John's pets are: {', '.join(color_matches)}")
Let’s review the code line by line:
Lines 1–2: We import the re and pandas libraries.
Lines 4–5: We read the reviews.csv file into a DataFrame named df and combine all the review_text values into a single string called text, separated by newlines.
Lines 6–7: We define the cat_regex regular expression pattern to match the phrase "cat named" followed by a word, which the capturing group (\w+) stores. We use the re.compile() function to compile the pattern into a regular expression object for later use in matching operations. We then use cat_regex.search(text) to search for the first match in text and store the result in cat_match.
Lines 8–9: We define and compile another regular expression pattern called dog_regex to match the phrase "dog named" followed by a word captured by (\w+), and use dog_regex.search(text) to search for the first match in text and store the result in dog_match.
Lines 10–11: We define and compile a regular expression pattern called color_regex to match either the word "black" or the phrase "brown and white", and use color_regex.findall(text) to find all non-overlapping matches in text and store the results in color_matches.
Lines 12–17: Lastly, we print the match object for the cat pattern and the cat's name extracted using cat_match.group(1), then the match object for the dog pattern and the dog's name extracted using dog_match.group(1), and finally the list of non-capturing group matches for the color pattern along with the colors of John's pets, formed by joining color_matches with commas.
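The walkthrough above can also be tried without the CSV file. The sketch below is a hedged variant, not the lesson's own code: the sample text stands in for the joined review_text column, and the first_group helper is a hypothetical name introduced here. It guards against search() returning None when nothing matches, in which case calling .group(1) directly would raise an AttributeError.

```python
import re

# Hypothetical stand-in for the joined review_text column.
text = (
    "John has a cat named Whiskers and a dog named Rex.\n"
    "The cat is black, and the dog is brown and white."
)

def first_group(pattern, source):
    """Return group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, source)
    return match.group(1) if match else None

cat_name = first_group(r"cat named (\w+)", text)
dog_name = first_group(r"dog named (\w+)", text)
bird_name = first_group(r"bird named (\w+)", text)  # no match, so None

# With only a non-capturing group, findall returns the full matched text.
colors = re.findall(r"(?:black|brown and white)", text)

# With one capturing group, findall returns only what that group captured.
names = re.findall(r"(?:cat|dog) named (\w+)", text)

print(cat_name, dog_name, bird_name)
print(colors)
print(names)
```

The last two calls show why the color pattern uses a non-capturing group: it keeps findall returning whole matches while still allowing alternation.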
Lookarounds
We use lookarounds (also known as assertions) to assert that a pattern is preceded or followed by another pattern without including the preceding or following pattern in the match. There are two types, lookaheads and lookbehinds, and each can be either positive or negative.
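As a rough preview, the four combinations can be sketched with Python's re module, which writes them as (?=...), (?!...), (?<=...), and (?<!...). The sample string and variable names are assumptions made for this illustration. Note that Python requires the pattern inside a lookbehind to have a fixed width.

```python
import re

# Hypothetical sample string, used only for illustration.
sample = "price: 100USD, 200EUR, 300USD"

# Positive lookahead: digits that are followed by "USD".
followed_by_usd = re.findall(r"\d+(?=USD)", sample)

# Negative lookahead: digits that are NOT followed by "USD". The extra \d
# alternative stops the engine from backtracking to a shorter run of digits
# such as "10", which would otherwise satisfy the assertion.
not_followed_by_usd = re.findall(r"\d+(?!USD|\d)", sample)

# Positive lookbehind: digits that are preceded by ", ".
after_comma = re.findall(r"(?<=, )\d+", sample)

# Negative lookbehind: whole numbers that are NOT preceded by ", ". The \b
# keeps the match from starting in the middle of a number.
not_after_comma = re.findall(r"(?<!, )\b\d+", sample)

print(followed_by_usd)
print(not_followed_by_usd)
print(after_comma)
print(not_after_comma)
```

In every case the asserted text ("USD" or ", ") is checked but left out of the returned match.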
Positive lookahead (?=...): This lookaround asserts that the succeeding characters must match the pattern inside the lookahead but doesn't include them in the match result. For example, foo(?=bar) would match foo
...