Basics of Regular Expressions

Learn about regular expressions, their applications, and how to use them in text preprocessing.

Introduction

Regular expressions, commonly known as regex, are a handy tool for finding patterns in text data. More specifically, a regular expression is a sequence of characters that defines a search pattern, which can be used to match or replace specific parts of a string. We use regular expressions in text preprocessing tasks such as cleaning, tokenization, and pattern recognition. The ability to quickly and accurately process large volumes of text makes regular expressions an essential tool for NLP and data science applications.

A regular expression that matches cat or hat
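
For instance, one way to express “cat or hat” in Python’s re module is the pattern [ch]at, which can drive both matching and replacement:

import re

text = 'The cat knocked my hat off the mat.'

# '[ch]at' matches a 'c' or an 'h' followed by 'at', i.e., "cat" or "hat"
print(re.findall(r'[ch]at', text))     # ['cat', 'hat']

# The same pattern works for replacement
print(re.sub(r'[ch]at', '***', text))  # 'The *** knocked my *** off the mat.'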

Learning regular expressions can be very helpful for data scientists, especially those who frequently work with text data. Therefore, it’s recommended that data science learners start exploring regular expressions early in their learning journey, as it will be a valuable skill.

Applications of regex

Here are two simple use cases of regex in data science:

  • We can use regex to carry out various NLP tasks. For example, in tokenization, we use regular expressions to identify delimiters between words, sentences, or paragraphs. In text cleaning, we use regex to remove unwanted characters, punctuation, or whitespace.

  • Regular expressions are also instrumental in web scraping and data mining tasks, where we need to extract specific information from web pages or large datasets. Regex can identify patterns in web pages’ HTML or XML source code and extract relevant information such as email addresses, phone numbers, URLs, or other structured data. A great example is using regex to collect data on job openings from LinkedIn, Indeed, and other job sites; a small illustration follows this list.
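
As a small illustration, the following sketch pulls email addresses and URLs out of a made-up HTML snippet. The snippet and the simplified patterns are illustrative assumptions; real-world scraping usually needs more robust patterns and an HTML parser.

import re

html = '<p>Contact: jobs@example.com</p><a href="https://example.com/careers">Careers</a>'

# A deliberately simplified email pattern: local part, '@', domain
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', html)

# Pull URLs out of href attributes
urls = re.findall(r'href="(https?://[^"]+)"', html)

print(emails)  # ['jobs@example.com']
print(urls)    # ['https://example.com/careers']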

Applications of regex in data science

Regex in text preprocessing

Let’s now see how to use regular expressions in text preprocessing using Python.

Tokenization

In the following example, we use the \w+ regular expression to match (or search for) one or more consecutive word characters (letters, digits, and underscores), which we extract as tokens from the input text using the re.findall() function.

main.py

import re
import pandas as pd

df = pd.read_csv('reviews.csv')
def tokenize_text(text):
    return re.findall(r'\w+', text)
df['tokens'] = df['review_text'].apply(tokenize_text)
print(df['tokens'])

Let’s review the code line by line:

  • Lines 1–2: We import the re module and the pandas library.

  • Line 4: We read the reviews.csv file and store its contents in a pandas DataFrame called df.

  • Lines 5–6: We define a function called tokenize_text that takes a text string as input. Inside the function, we use re.findall() to find all sequences of word characters (letters, digits, and underscores) in the given string and return them as a list of tokens, effectively splitting the text into individual words.

Note: The \w+ regular expression matches one or more consecutive word characters. Word characters typically include letters (a–z, A–Z), digits (0–9), and underscores (_). We’ll learn more about these regular expressions later.

  • Line 7: We then apply the tokenize_text function to each element in the review_text column of the df DataFrame and store the resulting tokens in a new column called tokens.

  • Line 8: Lastly, we print the tokens column, which contains the tokenized versions of the review texts.

We can also observe from the output that, in the last review, “We’re” is tokenized into “We” and “re.” This approach breaks contractions apart; for proper tokenization, we need the text in its full form (e.g., “we are” for “we’re”). Similar cases apply to words like “I’ve,” “it’s,” etc.
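
One workaround is to expand contractions to their full forms before tokenizing. Here’s a minimal sketch; the contractions mapping below is a made-up, far-from-exhaustive example:

import re

# A tiny, illustrative mapping; a real one would cover many more cases
contractions = {"we're": "we are", "i've": "i have", "it's": "it is"}

def expand_contractions(text):
    # Replace each contraction with its full form before tokenizing
    for contraction, full_form in contractions.items():
        text = re.sub(contraction, full_form, text, flags=re.IGNORECASE)
    return text

print(re.findall(r'\w+', "We're happy"))                       # ['We', 're', 'happy']
print(re.findall(r'\w+', expand_contractions("We're happy")))  # ['we', 'are', 'happy']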

Text cleaning and normalization

We can also use regex to clean up a given text by removing extra spaces, stripping non-alphanumeric characters, and converting all the text to lowercase, as shown in the following example.

main.py
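
import re
import pandas as pd

df = pd.read_csv('reviews.csv')
def clean_text(text):
    cleaned_text = re.sub('[^0-9a-zA-Z]+', ' ', re.sub(' +', ' ', text)).lower()
    return cleaned_text
df['cleaned_text'] = df['review_text'].apply(clean_text)
print(df['cleaned_text'])
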
reviews.csv
review_id,review_text,rating
1,"Great product! I highly recommend it.",5
2,"The quality of this item is excellent.",4
3,"Not satisfied with the purchase. The product arrived damaged.",2
4,"Amazing service! Prompt delivery and great customer support.",5
5,"This product is a complete waste of money.",1
6,"Barack Obama was an American President.",5
7,"I met John Doe yesterday. He was very friendly.",4
8,"Jane Smith lives in New York.",3
9,"Hawaii is a beautiful place for vacation.",5
10,"United States is my dream destination.",4

Let’s review the code line by line:

  • Lines 1–2: We import the re module and the pandas library.

  • Line 4: We read the reviews.csv file and store its contents in a DataFrame called df.

  • Lines 5–7: We define the clean_text function that removes multiple spaces and non-alphanumeric characters and converts the text to lowercase. We use the re.sub() function to perform a regex-based substitution on the given text string. The first re.sub() function replaces multiple consecutive spaces with a single space using the ' +' pattern. The second re.sub() function replaces any non-alphanumeric characters with a space using the [^0-9a-zA-Z]+ pattern. We convert the resulting text to lowercase using text.lower() and return the cleaned text from the clean_text function using return cleaned_text.

  • Line 8: We apply the clean_text function to the review_text column of the DataFrame, storing the cleaned text in a new column called cleaned_text.

  • Line 9: We print the cleaned_text column that contains the cleaned versions of the review texts.
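
To see what this cleaning does to a single string, here is the same two-step substitution applied standalone:

import re

text = 'Great   product! I highly recommend it.'
text = re.sub(' +', ' ', text)                             # collapse runs of spaces
cleaned_text = re.sub('[^0-9a-zA-Z]+', ' ', text).lower()  # replace punctuation, lowercase
print(cleaned_text)  # 'great product i highly recommend it ' (note the trailing space left by the final period)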

Named entity recognition

We can also extract named entities (such as persons and locations) from a given text using regular expressions and store them in a dictionary, as shown below.

main.py

import re
import pandas as pd

df = pd.read_csv('reviews.csv')
patterns = {
    'PERSON': r'(Barack Obama|John Doe|Jane Smith)',
    'LOCATION': r'(Hawaii|United States|New York)'
}
named_entities = {}
for entity, pattern in patterns.items():
    def find_entities(text):
        return re.findall(pattern, text)
    df[entity] = df['review_text'].apply(find_entities)
    named_entities[entity] = df[entity].tolist()
print(named_entities)

Let’s review the code line by line:

  • Lines 1–2: We import the re module and the pandas library.

  • Line 4: We read the reviews.csv file into a pandas DataFrame called df.

  • Lines 5–8: We create the patterns dictionary, which contains two key-value pairs. Each key represents a named entity type (PERSON and LOCATION), and each value is a regular expression pattern corresponding to the named entity type.

  • Lines 9–14: We initialize an empty named_entities dictionary to store the extracted entities. Later, we use a for loop to iterate over each item in patterns, where entity represents the entity name and pattern represents the regular expression pattern. Inside the loop, we define the find_entities function that takes a text input and uses re.findall() to find all matches of the pattern in the text. We apply the find_entities function to the review_text column using df['review_text'].apply(find_entities) and assign the result to a new column in df with the same name as the entity. Then, we convert the values in the entity column to a list and assign it to the corresponding key in the named_entities dictionary.

  • Line 15: Finally, we print the named_entities dictionary to display the extracted entities.
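
As a side note, redefining find_entities on every pass through the loop works, but it isn’t necessary; an alternative sketch binds the current pattern to a lambda’s default argument instead:

for entity, pattern in patterns.items():
    # p=pattern captures the current pattern at definition time
    df[entity] = df['review_text'].apply(lambda text, p=pattern: re.findall(p, text))
    named_entities[entity] = df[entity].tolist()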