Basics of Regular Expressions
Learn about regular expressions, their applications, and how to use them in text preprocessing.
Introduction
Regular expressions, commonly known as regex, are a handy tool for finding patterns in text data. More specifically, they’re a sequence of characters that define a search pattern, which can be used to match or replace specific parts of a string. We use regular expressions in text preprocessing tasks such as cleaning, tokenization, and pattern recognition. The ability to use regular expressions to quickly and accurately process large volumes of text makes them an essential tool for NLP and data science applications.
Learning regular expressions is especially helpful for data scientists who frequently work with text data, so it's worth exploring them early in your learning journey.
Applications of regex
Here are two simple use cases of regex in data science:
We can use regex to carry out various NLP tasks. For example, in tokenization, we use regular expressions to identify delimiters between words, sentences, or paragraphs. In text cleaning, we use regex to remove unwanted characters, punctuation, or extra whitespace.
Regular expressions are also instrumental in web scraping and data mining tasks, where we need to extract specific information from web pages or large datasets. Regex can identify patterns in web pages’ HTML or XML source code and extract relevant information such as email addresses, phone numbers, URLs, or other structured data. A great example is using regex to collect data on job openings from LinkedIn, Indeed, and other job sites.
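As a small illustration of the scraping use case, the snippet below pulls email addresses out of an HTML fragment. The pattern is a deliberately simplified sketch for illustration, not a full RFC 5322 address matcher, and the HTML fragment is made-up sample data:

```python
import re

# A hypothetical HTML fragment, standing in for a scraped page.
html = """
<p>Contact us at support@example.com or visit https://example.com/jobs.</p>
<p>Recruiter: jane.doe@jobs-site.org</p>
"""

# Simplified email pattern: local part, '@', then a domain with at least one dot.
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', html)
print(emails)  # ['support@example.com', 'jane.doe@jobs-site.org']
```

Note that the URL is not matched because it contains no @ character; only the two email addresses survive the pattern.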
Regex in text preprocessing
Let’s now see how to use regular expressions in text preprocessing using Python.
Tokenization
In the following example, we use the \w+ regular expression to match (or search for) one or more consecutive word characters (letters, digits, and underscores), which we extract as tokens from the input text using the re.findall() function.
import re
import pandas as pd

df = pd.read_csv('reviews.csv')
def tokenize_text(text):
    return re.findall(r'\w+', text)
df['tokens'] = df['review_text'].apply(tokenize_text)
print(df['tokens'])
Let’s review the code line by line:
Lines 1–2: We import the re module and the pandas library.
Line 4: We read the reviews.csv file and store its contents in a pandas DataFrame called df.
Lines 5–6: We define a function called tokenize_text that takes a text string as input using the def tokenize_text(text) line. Inside the function, we use the re.findall() function to find all sequences of word characters (letters, digits, and underscores) in the given text string. This effectively tokenizes the text by splitting it into individual words. We return a list of the tokens using return re.findall(r'\w+', text).
Note: The \w+ regular expression matches one or more consecutive word characters. Word characters typically include letters (a–z, A–Z), digits (0–9), and underscores (_). We'll learn more about these regular expressions in detail later.
Line 7: We then apply the tokenize_text function to each element in the review_text column of the df DataFrame. The resulting tokens are stored in the tokens column.
Line 8: Lastly, we print the tokens column, which contains the tokenized versions of the review texts.
We can also observe from the output that in the last review, “We're” is tokenized into “We” and “re”. This approach breaks contractions apart; for proper tokenization, we need the text in its full form (e.g., “we are” for “we're”). Similar cases apply to words like “I've,” “it's,” etc.
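One way around this is to allow an apostrophe inside a token so that contractions stay whole. The pattern below is a sketch of that idea (it assumes straight apostrophes, not curly typographic quotes):

```python
import re

text = "We're sure you'll love it. I've used it daily."

# \w+(?:'\w+)* keeps apostrophe-joined suffixes attached to the word.
tokens = re.findall(r"\w+(?:'\w+)*", text)
print(tokens)
# ["We're", 'sure', "you'll", 'love', 'it', "I've", 'used', 'it', 'daily']
```

This keeps the contraction as one token, though expanding contractions (e.g., “we're” to “we are”) is still needed if downstream steps expect full words.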
Text cleaning and normalization
We can also use regex to clean up a given text by removing extra spaces and converting all the text to lowercase, as shown in the following example.
review_id,review_text,rating
1,"Great product! I highly recommend it.",5
2,"The quality of this item is excellent.",4
3,"Not satisfied with the purchase. The product arrived damaged.",2
4,"Amazing service! Prompt delivery and great customer support.",5
5,"This product is a complete waste of money.",1
6,"Barack Obama was an American President.",5
7,"I met John Doe yesterday. He was very friendly.",4
8,"Jane Smith lives in New York.",3
9,"Hawaii is a beautiful place for vacation.",5
10,"United States is my dream destination.",4
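Based on the line-by-line walkthrough that follows, the cleaning code would look roughly like the sketch below. The exact layout is an assumption, and the DataFrame is built inline here instead of being read from reviews.csv so that the snippet runs on its own:

```python
import re
import pandas as pd

# In the lesson this data comes from pd.read_csv('reviews.csv'); built inline here.
df = pd.DataFrame({'review_text': [
    "Great  product! I highly recommend it.",
    "Not satisfied with the purchase. The product arrived damaged.",
]})

def clean_text(text):
    text = re.sub(' +', ' ', text)             # collapse runs of spaces
    text = re.sub('[^0-9a-zA-Z]+', ' ', text)  # replace non-alphanumerics with a space
    cleaned_text = text.lower()                # convert to lowercase
    return cleaned_text

df['cleaned_text'] = df['review_text'].apply(clean_text)
print(df['cleaned_text'])
```
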
Let’s review the code line by line:
Lines 1–2: We import the re module and the pandas library.
Line 4: We read the reviews.csv file and store its contents in a DataFrame called df.
Lines 5–7: We define the clean_text function that removes multiple spaces and non-alphanumeric characters and converts the text to lowercase. We use the re.sub() function to perform a regex-based substitution on the given text string. The first re.sub() call replaces multiple consecutive spaces with a single space using the ' +' pattern. The second re.sub() call replaces any non-alphanumeric characters with a space using the [^0-9a-zA-Z]+ pattern. We convert the resulting text to lowercase using text.lower() and return the cleaned text from the clean_text function using return cleaned_text.
Line 8: We apply the clean_text function to the review_text column of the DataFrame, storing the cleaned text in a new column called cleaned_text.
Line 9: We print the cleaned_text column that contains the cleaned versions of the review texts.
Named entity recognition
We can also extract named entities (such as persons and locations) from the given text using regular expressions and store them in a dictionary, as shown below.
import re
import pandas as pd

df = pd.read_csv('reviews.csv')
patterns = {
    'PERSON': r'(Barack Obama|John Doe|Jane Smith)',
    'LOCATION': r'(Hawaii|United States|New York)'}

named_entities = {}
for entity, pattern in patterns.items():
    def find_entities(text):
        return re.findall(pattern, text)
    df[entity] = df['review_text'].apply(find_entities)
    named_entities[entity] = df[entity].tolist()
print(named_entities)
Let’s review the code line by line:
Lines 1–2: We import the re module and the pandas library.
Line 4: We read the reviews.csv file into a pandas DataFrame called df.
Lines 5–7: We create the patterns dictionary, which contains two key-value pairs. Each key represents a named entity type (PERSON and LOCATION), and each value is a regular expression pattern corresponding to that named entity type.
Lines 9–14: We initialize an empty named_entities dictionary to store the extracted entities. We then use a for loop to iterate over each item in patterns, where entity represents the entity name and pattern represents the regular expression pattern. Inside the loop, we define the find_entities function that takes a text input and uses re.findall() to find all matches of the pattern in the text. We apply the find_entities function to the review_text column using df['review_text'].apply(find_entities) and assign the result to a new column in df with the same name as the entity. Then, we convert the values in the entity column to a list and assign it to the corresponding key in the named_entities dictionary.
Line 15: Finally, we print the named_entities dictionary to display the extracted entities.