Groupings

We use groupings to create subpatterns within a larger pattern so that we can match, capture, and backreference text. There are two types of groupings in regex. Capturing groups, written with parentheses as (...), save the text matched by the subpattern so that we can reference it later in the pattern or in a replacement string. Non-capturing groups, written as (?:...), group a subpattern without storing the matched text for later use.
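Before turning to the course dataset, here is a minimal sketch of the difference between the two group types, along with a backreference and a replacement string. The sample sentences are hypothetical and aren't taken from reviews.csv.

```python
import re

# Hypothetical sample sentence (an assumption, not data from reviews.csv).
sample = "John has a cat named Milo and a dog named Rex."

# Capturing group: the name is stored and can be read with group(1).
capturing = re.search(r"cat named (\w+)", sample)
cat_name = capturing.group(1)

# Non-capturing group: (?:cat|dog) only groups the alternation,
# so the capturing group (\w+) is still group 1.
non_capturing = re.search(r"(?:cat|dog) named (\w+)", sample)
first_pet_name = non_capturing.group(1)
group_count = non_capturing.re.groups  # number of capturing groups in the pattern

# Backreference inside the pattern: \1 must repeat what group 1 matched,
# which lets us find an accidentally doubled word.
doubled = re.search(r"\b(\w+) \1\b", "the the cat sat")
repeated_word = doubled.group(1)

# Backreference in a replacement string: \1 reuses the captured name.
swapped = re.sub(r"cat named (\w+)", r"\1 the cat", sample)

print(cat_name, first_pet_name, group_count, repeated_word)
print(swapped)
```

Only the parenthesized (\w+) counts as a group here. The (?:cat|dog) part affects what the pattern matches but leaves the group numbering unchanged.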

In the following code example, we’ll define regular expressions to search for a cat name and a dog name, find matches for colors in the text, and print the results of the capturing and non-capturing groups.

import re
import pandas as pd

df = pd.read_csv('reviews.csv')
text = "\n".join(df['review_text'])
cat_regex = re.compile(r"cat named (\w+)")
cat_match = cat_regex.search(text)
dog_regex = re.compile(r"dog named (\w+)")
dog_match = dog_regex.search(text)
color_regex = re.compile(r"(?:black|brown and white)")
color_matches = color_regex.findall(text)
print("Capturing group result 1:", cat_match)
print(f"John's cat's name is {cat_match.group(1)}\n")
print("Capturing group result 2:", dog_match)
print(f"John's dog's name is {dog_match.group(1)}\n")
print("Non-capturing group result:", color_matches)
print(f"The colors of John's pets are: {', '.join(color_matches)}")

Let’s review the code line by line:

Lines 1–2: We import the re and pandas libraries.
Lines 4–5: We read the reviews.csv file into a DataFrame named df and combine all the review_text values into a single string called text, separated by newlines.
Lines 6–7: We define the cat_regex regular expression pattern to match the phrase cat named followed by one or more word characters, which the capturing group (\w+) stores as group 1. We use the re.compile() function to compile the pattern into a regular expression object for later use in matching operations. We then use cat_regex.search(text) to find the first match in text and store the resulting match object in cat_match.
Lines 8–9: We define and compile another regular expression pattern called dog_regex to match the phrase dog named followed by a word (\w+), and use dog_regex.search(text) to search for a match in text and store the result in dog_match.
Lines 10–11: We define and compile a regular expression pattern color_regex to match either the word black or the phrase brown and white, and use color_regex.findall(text) to find all non-overlapping matches in text and store the results in color_matches. Because the group is non-capturing, findall() returns the full matched text for each match.
Lines 12–17: Lastly, we print the match object returned for the cat pattern, the cat’s name extracted with cat_match.group(1), the match object returned for the dog pattern, the dog’s name extracted with dog_match.group(1), the list of color matches from the non-capturing group, and the colors of John’s pets, which we get by joining color_matches with commas.
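The walkthrough above relies on two behaviors of Python's re module that are easy to miss: findall() returns different things depending on whether the pattern has a capturing group, and search() returns None when nothing matches. The sketch below illustrates both on a hypothetical stand-in string, not the actual contents of reviews.csv.

```python
import re

# Hypothetical stand-in for the joined review text (an assumption).
text = "John's cat named Tom is black. His dog named Rex is brown and white."

# No capturing group: findall() returns the full text of each match.
whole_matches = re.findall(r"(?:black|brown and white)", text)

# One capturing group: findall() returns only what the group captured.
pet_names = re.findall(r"(?:cat|dog) named (\w+)", text)

# search() returns None when there is no match, so check before calling group().
bird_match = re.search(r"bird named (\w+)", text)
bird_name = bird_match.group(1) if bird_match else None

print(whole_matches)
print(pet_names)
print(bird_name)
```

If the real dataset might lack a cat or dog mention, a guard like the one around bird_match avoids an AttributeError from calling group() on None.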

Lookarounds

We use lookarounds (also known as assertions) to assert that a pattern is preceded or followed by another pattern without including the preceding or following pattern in the match. There are two types, lookaheads and lookbehinds, and each can be either positive or negative.

Positive lookahead (?=...): This lookaround asserts that the succeeding characters must match the pattern inside the lookahead but doesn’t include them in the match result. For example, foo(?=bar) would match foo ...
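As a minimal sketch of the positive lookahead, the snippet below applies foo(?=bar) with Python's re module. The input strings are illustrative assumptions.

```python
import re

lookahead = re.compile(r"foo(?=bar)")

# "foo" is followed by "bar", so the assertion succeeds.
hit = lookahead.search("foobar")
matched_text = hit.group()   # only the text before the lookahead
matched_span = hit.span()    # the match ends where "bar" begins

# "foo" is followed by "baz", so the assertion fails and there is no match.
miss = lookahead.search("foobaz")

# The lookahead consumes no characters, so "bar" is still available
# to the rest of the pattern.
consumed_after = re.findall(r"foo(?=bar)bar", "foobar")

print(matched_text, matched_span, miss, consumed_after)
```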
