Basics of Regular Expressions
Learn about regular expressions, their applications, and how to use them in text preprocessing.
Introduction
Regular expressions, commonly known as regex, are a handy tool for finding patterns in text data. More specifically, they’re a sequence of characters that define a search pattern, which can be used to match or replace specific parts of a string. We use regular expressions in text preprocessing tasks such as cleaning, tokenization, and pattern recognition. The ability to use regular expressions to quickly and accurately process large volumes of text makes them an essential tool for NLP and data science applications.
Learning regular expressions is especially helpful for data scientists who frequently work with text data, so it's worth exploring them early in your learning journey.
Applications of regex
Here are two simple use cases of regex in data science:
We can use regex to carry out various NLP tasks. For example, in tokenization, we use regular expressions to identify delimiters between words, sentences, or paragraphs. In text cleaning, we use regex to remove unwanted characters, punctuation, or extra whitespace.
Regular expressions are also instrumental in web scraping and data mining tasks, where we need to extract specific information from web pages or large datasets. Regex can identify patterns in web pages’ HTML or XML source code and extract relevant information such as email addresses, phone numbers, URLs, or other structured data. A great example is using regex to collect data on job openings from LinkedIn, Indeed, and other job sites.
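As a small illustration of the scraping use case, the snippet below pulls email addresses out of an HTML fragment. The pattern is a deliberately simplified sketch for illustration, not a full RFC 5322 address matcher, and the HTML fragment is made-up sample data:

```python
import re

# A hypothetical HTML fragment, standing in for a scraped page.
html = """
<p>Contact us at support@example.com or visit https://example.com/jobs.</p>
<p>Recruiter: jane.doe@jobs-site.org</p>
"""

# Simplified email pattern: local part, '@', then a domain with at least one dot.
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', html)
print(emails)  # ['support@example.com', 'jane.doe@jobs-site.org']
```

Note that the URL is not matched because it contains no @ character; only the two email addresses survive the pattern.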
Regex in text preprocessing
Let’s now see how to use regular expressions in text preprocessing using Python.
Tokenization
In the following example, we use the \w+ regular expression to match (or search for) one or more consecutive word characters (letters, digits, and underscores), which we extract as tokens from the input text using the re.findall() function.
import re
import pandas as pd

df = pd.read_csv('reviews.csv')
def tokenize_text(text):
    return re.findall(r'\w+', text)
df['tokens'] = df['review_text'].apply(tokenize_text)
print(df['tokens'])
Let’s review the code line by line:
Lines 1–2: We import the re module and the pandas library.
Line 4: We read the reviews.csv file and store its contents in a pandas DataFrame called df.
Lines 5–6: We define a function called tokenize_text that takes a text string as input using the def tokenize_text(text) line. Inside the function, we use the re.findall() function to find all sequences of word characters (letters, digits, and underscores) in the given text string. This effectively tokenizes the text by splitting it into individual words. We return a list of the tokens using return re.findall(r'\w+', text).
Note: The \w+ regular expression matches one or more consecutive word characters. Word characters typically include letters (a–z, A–Z), digits (0–9), and underscores (_). We'll learn more about these regular expressions in detail later.
Line 7: We then apply the tokenize_text function to each element in the review_text column of the df DataFrame. The resulting tokens are stored in the tokens column.
Line 8: Lastly, we print the tokens column, which contains the tokenized versions of the review texts.
We can also observe from the output that in the last review, “We're” is tokenized into “We” and “re”. This approach breaks contractions apart; for proper tokenization, we need the text in its full form (e.g., “we are” for “we're”). Similar cases apply to words like “I've,” “it's,” etc.
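One way around this is to allow an apostrophe inside a token so that contractions stay whole. The pattern below is a sketch of that idea (it assumes straight apostrophes, not curly typographic quotes):

```python
import re

text = "We're sure you'll love it. I've used it daily."

# \w+(?:'\w+)* keeps apostrophe-joined suffixes attached to the word.
tokens = re.findall(r"\w+(?:'\w+)*", text)
print(tokens)
# ["We're", 'sure', "you'll", 'love', 'it', "I've", 'used', 'it', 'daily']
```

This keeps the contraction as one token, though expanding contractions (e.g., “we're” to “we are”) is still needed if downstream steps expect full words.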
Text cleaning and normalization
We can also use regex to clean up a given text by removing extra spaces and converting all the text to lowercase, as shown in the following example.
review_id,review_text,rating
1,"Great product! I highly recommend it.",5
2,"The quality of this item is excellent.",4
3,"Not satisfied with the purchase. The product arrived damaged.",2
4,"Amazing service! Prompt delivery and great customer support.",5
5,"This product is a complete waste of money.",1
6,"Barack Obama was an American President.",5
7,"I met John Doe yesterday. He was very friendly.",4
8,"Jane Smith lives in New York.",3
9,"Hawaii is a beautiful place for vacation.",5
10,"United States is my dream destination.",4
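Based on the line-by-line walkthrough that follows, the cleaning code would look roughly like the sketch below. The exact layout is an assumption, and the DataFrame is built inline here instead of being read from reviews.csv so that the snippet runs on its own:

```python
import re
import pandas as pd

# In the lesson this data comes from pd.read_csv('reviews.csv'); built inline here.
df = pd.DataFrame({'review_text': [
    "Great  product! I highly recommend it.",
    "Not satisfied with the purchase. The product arrived damaged.",
]})

def clean_text(text):
    text = re.sub(' +', ' ', text)             # collapse runs of spaces
    text = re.sub('[^0-9a-zA-Z]+', ' ', text)  # replace non-alphanumerics with a space
    cleaned_text = text.lower()                # convert to lowercase
    return cleaned_text

df['cleaned_text'] = df['review_text'].apply(clean_text)
print(df['cleaned_text'])
```
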
Let’s review the code line by line:
Lines 1–2: We import the re module and the pandas library.
Line 4: We read the reviews.csv file and store its contents in a DataFrame called df.
Lines 5–7: We define the clean_text function that removes multiple spaces and non-alphanumeric characters and converts the text to lowercase. We use the re.sub() function to perform a regex-based substitution on the given text string. The first re.sub() call replaces multiple consecutive spaces with a single space using the ' +' pattern. The second re.sub() call replaces any non-alphanumeric characters with a space using the [^0-9a-zA-Z]+ pattern. We convert the resulting text to lowercase using text.lower() and return the cleaned text from the clean_text function using return cleaned_text.
Line 8: We apply the clean_text function to the review_text column of the DataFrame, storing the cleaned text in a new column called cleaned_text.
Line 9: We print the cleaned_text column that contains the cleaned versions of the review texts.
Named entity recognition
We can also extract named entities (such as persons and locations) from the given text using regular expressions and store them in a dictionary, as shown below.
import re
import pandas as pd

df = pd.read_csv('reviews.csv')
patterns = {
    'PERSON': r'(Barack Obama|John Doe|Jane Smith)',
    'LOCATION': r'(Hawaii|United States|New York)'}

named_entities = {}
for entity, pattern in patterns.items():
    def find_entities(text):
        return re.findall(pattern, text)
    df[entity] = df['review_text'].apply(find_entities)
    named_entities[entity] = df[entity].tolist()
print(named_entities)
Let’s review the code line by line:
Lines 1–2: We import the re module and the pandas library.
Line 4: We read the reviews.csv file into a pandas DataFrame called df.
Lines 5–7: We create the patterns dictionary, which contains two key-value pairs. Each key represents a named entity type (PERSON and LOCATION), and each value is a regular expression pattern corresponding to that named entity type.
Lines 9–14: We initialize an empty named_entities dictionary to store the extracted entities. We then use a for loop to iterate over each item in patterns, where entity represents the entity name and pattern represents the regular expression pattern. Inside the loop, we define the find_entities function that takes a text input and uses re.findall() to find all matches of the pattern in the text. We apply the find_entities function to the review_text column using df['review_text'].apply(find_entities) and assign the result to a new column in df with the same name as the entity. Then, we convert the values in the entity column to a list and assign it to the corresponding key in the named_entities dictionary.
Line 15: Finally, we print the named_entities dictionary to display the extracted entities.