Advanced Regular Expressions for Text Preprocessing

Learn about groupings, anchors, lookarounds, modifiers, and backreferences.

Groupings

We use groupings to create subpatterns within a larger pattern so that we can match, capture, and backreference text. There are two types of groupings in regex: capturing groups, which save the matched subpattern for later reference in the pattern or in a replacement string, and non-capturing groups, which group subpatterns without storing them. We write capturing groups with parentheses (...) and non-capturing groups with (?:...).
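The difference between the two group types is easiest to see side by side. The following is a minimal sketch, assuming a made-up sample sentence (it is not taken from reviews.csv) and illustrative variable names:

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = "John has a cat named Whiskers and a dog named Rex."

# Two capturing groups: the animal and the name are both stored.
capturing = re.compile(r"(cat|dog) named (\w+)")
# (?:...) groups the alternatives without storing them,
# so only the name is captured.
non_capturing = re.compile(r"(?:cat|dog) named (\w+)")

pairs = capturing.findall(sample)      # one (animal, name) tuple per match
names = non_capturing.findall(sample)  # one name per match

# Backreference: \1 must match the same text that group 1 captured,
# which lets us collapse an accidentally doubled word.
doubled_word = re.compile(r"\b(\w+) \1\b")
cleaned = doubled_word.sub(r"\1", "the the cat sat")

print(pairs)
print(names)
print(cleaned)
```

With capturing groups, findall returns one tuple per match; switching the alternation to a non-capturing group leaves a single captured name per match, which is often the more convenient shape.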

In the following code example, we’ll define regular expressions to search for a cat name and a dog name, find matches for colors in the text, and print the results of the capturing and non-capturing groups.

main.py
import re
import pandas as pd

df = pd.read_csv('reviews.csv')
text = "\n".join(df['review_text'])
cat_regex = re.compile(r"cat named (\w+)")
cat_match = cat_regex.search(text)
dog_regex = re.compile(r"dog named (\w+)")
dog_match = dog_regex.search(text)
color_regex = re.compile(r"(?:black|brown and white)")
color_matches = color_regex.findall(text)
print("Capturing group result 1:", cat_match)
print(f"John's cat's name is {cat_match.group(1)}\n")
print("Capturing group result 2:", dog_match)
print(f"John's dog's name is {dog_match.group(1)}\n")
print("Non-capturing group result:", color_matches)
print(f"The colors of John's pets are: {', '.join(color_matches)}")

Let’s review the code line by line:

  • Lines 1–2: We import the re and pandas libraries.

  • Lines 4–5: We read the reviews.csv file into a DataFrame named df and combine all the review_text values into a single string called text, separated by newlines.

  • Lines 6–7: We define the cat_regex regular expression pattern to match the phrase cat named followed by a word (\w+). We use the re.compile() function to compile the regular expression pattern into a regular expression object for later use in matching operations. We then use cat_regex.search(text) to search for a match in text and store the result in cat_match.

  • Lines 8–9: We define and compile another regular expression pattern called dog_regex to match the phrase dog named followed by a word (\w+), and use dog_regex.search(text) to search for a match in text and store the result in dog_match.

  • Lines 10–11: We define and compile a regular expression pattern color_regex that uses a non-capturing group to match either the word black or the phrase brown and white, and use color_regex.findall(text) to find all non-overlapping matches in text and store the results in color_matches. Because the pattern contains no capturing group, findall returns the full text of each match.

  • Lines 12–17: Lastly, we print the results. For the cat pattern, we print the match object and the captured name, which we extract with cat_match.group(1). We do the same for the dog pattern using dog_match.group(1). Finally, we print the list of non-capturing group matches for the color pattern and the colors of John’s pets, which we produce by joining color_matches with commas.
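One detail the walkthrough relies on is that search() finds a match. When it doesn't, search() returns None, and calling .group(1) on None raises an AttributeError. The sketch below shows one way to guard against that; the helper first_group and the sample text are hypothetical and not part of the lesson's code:

```python
import re

def first_group(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

cat_regex = re.compile(r"cat named (\w+)")
dog_regex = re.compile(r"dog named (\w+)")

# Hypothetical text that mentions a dog but no cat.
sample_text = "John has a dog named Rex."

cat_name = first_group(cat_regex, sample_text)  # no match, so None
dog_name = first_group(dog_regex, sample_text)  # captured name

print("Cat:", cat_name)
print("Dog:", dog_name)
```

Checking the match object before reading its groups keeps the script from failing on input that doesn't contain the expected phrase.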

Lookarounds

We use lookarounds (also known as zero-width assertions) to assert that a pattern is preceded or followed by another pattern without including the preceding or following pattern in the match. There are two types, lookaheads and lookbehinds, and each can be either positive or negative.

  • Positive lookahead (?=...): This lookaround asserts that the succeeding characters must match the pattern inside the lookahead but doesn’t include them in the match result. For example, foo(?=bar) would match foo ...
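As a rough illustration of how the four lookaround forms behave in Python's re module, here is a minimal sketch. The sample strings are made up for this example and don't come from reviews.csv:

```python
import re

words = "foobar foobaz"
prices = "cost $42 and 17 items"

# Positive lookahead: "foo" only when "bar" follows; "bar" is not consumed.
m = re.search(r"foo(?=bar)", words)
lookahead_text, lookahead_end = m.group(), m.end()

# Negative lookahead: "foo" only when "bar" does NOT follow.
neg_lookahead_start = re.search(r"foo(?!bar)", words).start()

# Positive lookbehind: digits only when preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
# \b stops the pattern from matching the "2" inside "$42".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(lookahead_text, lookahead_end)
print(neg_lookahead_start)
print(after_dollar)
print(not_after_dollar)
```

In each case the asserted text ($ or bar) constrains where the match may occur but never appears in the match itself.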
