Introduction to regular expressions

A regular expression, also known as regex or regexp, is a pattern of characters we want to match or search for in a text.

Below is an example of a regular expression:

  ^(regexp?|regular expression)$

We can use the above regular expression to search a text and verify if it matches any of the following strings:

  • "regex"
  • "regexp"
  • "regular expression"

The figure below visually represents this regular expression example.

The above regular expression matches the string as per the pattern below:

  • ^ asserts the start of the line.
  • ( starts capturing group #1.
  • Inside the group, match either the character sequence regex followed by an optional character p (which occurs zero or one time), or the character sequence regular expression.
  • ) ends capturing group #1.
  • $ asserts the end of the line.

Note: We’ll explore many examples throughout the course to ensure we master this concept by the end.

As the example shows, regular expressions can scan the string from left to right to look for matches with a given pattern.
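
As a rough sketch of the example above, the following C# snippet checks a few strings against the pattern ^(regexp?|regular expression)$, which is reconstructed here from the description in this lesson. The class name AnchoredPatternDemo and the helper IsWholeMatch are illustrative names, not part of .NET.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchoredPatternDemo
{
    // Pattern reconstructed from the lesson's description. The ^ and $ anchors
    // require the match to span the input from start to end.
    // (In .NET, $ also matches just before a final newline.)
    const string Pattern = @"^(regexp?|regular expression)$";

    // Hypothetical helper: true when the input as a whole matches the pattern.
    public static bool IsWholeMatch(string input) => Regex.IsMatch(input, Pattern);

    static void Main()
    {
        foreach (string s in new[] { "regex", "regexp", "regular expression", "regexes" })
        {
            // The first three inputs print True; "regexes" prints False.
            Console.WriteLine($"{s}: {IsWholeMatch(s)}");
        }
    }
}
```

Because of the anchors, a string such as "regexes" is rejected even though it contains regex as a substring.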

What are regular expressions?

We use regular expressions to represent a set of strings or a pattern to match. The pattern lets us match input text against it and perform operations based on the results. These operations include finding and replacing, splitting, extracting, and so on.
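
To make these operations concrete, here is a minimal sketch of finding and replacing, splitting, and extracting with the Regex class. The class RegexOperationsDemo, its helper names, and the sample inputs are assumptions made for illustration only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexOperationsDemo
{
    // Find and replace: collapse each run of whitespace into a single space.
    public static string CollapseSpaces(string text) => Regex.Replace(text, @"\s+", " ");

    // Split: break a line on commas or semicolons, dropping spaces that follow them.
    public static string[] SplitFields(string line) => Regex.Split(line, @"[,;]\s*");

    // Extract: collect every run of digits found in the text.
    public static string[] ExtractNumbers(string text) =>
        Regex.Matches(text, @"\d+").Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        Console.WriteLine(CollapseSpaces("too   many    spaces"));                     // too many spaces
        Console.WriteLine(string.Join("|", SplitFields("a, b;c")));                    // a|b|c
        Console.WriteLine(string.Join("|", ExtractNumbers("3 apples and 12 pears"))); // 3|12
    }
}
```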

The term “regular expression” comes from formal language theory. These expressions describe regular languages, which are the sets of strings that a finite state machine can recognize.

Regular expressions are widely used in the following:

  • Unix tools like Sed, a stream editor that reads input, applies a set of commands to it, and writes the results to standard output, and AWK, a programming language that lets us process text files and is often used for data extraction and reporting

  • Text editors like Vim and Emacs, an extensible, customizable, and open-source text editor

  • Programming languages like Perl, Python, and Ruby

  • Domains like bioinformatics and lexical analysis

One of the oldest syntaxes is Basic Regular Expressions (BRE), which originated in Unix tools during the 1970s and was later standardized by POSIX. Most modern programming languages, including C#, use a richer syntax derived from Perl.

How regular expressions work

Regular expressions work by comparing the provided regex pattern with the target string, character by character, from left to right.

If part of the string fits the pattern, the regex engine reports a “match”. Otherwise, it reports “no match.”

Regular expressions are an essential part of many developers' toolboxes. We can use them on platforms and in languages like .NET, JavaScript, Java, PHP, and Python to perform various tasks. These tasks range from validating data to searching for patterns and extracting information. We can validate text input, such as a username or password, extract data from HTML or XML files, and find all instances of a pattern in large log files.
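
As an illustration of input validation, the sketch below checks a username against a made-up rule. The rule itself (a letter first, then 2 to 15 letters, digits, or underscores) and the names UsernameValidationDemo and IsValidUsername are assumptions for this example, not requirements from the lesson.

```csharp
using System;
using System.Text.RegularExpressions;

static class UsernameValidationDemo
{
    // Hypothetical rule: one ASCII letter, then 2 to 15 letters, digits, or underscores.
    // Note: in .NET, $ also matches just before a trailing newline.
    const string UsernamePattern = @"^[A-Za-z][A-Za-z0-9_]{2,15}$";

    // Returns true when the input satisfies the hypothetical rule above.
    public static bool IsValidUsername(string input) => Regex.IsMatch(input, UsernamePattern);

    static void Main()
    {
        Console.WriteLine(IsValidUsername("alice_01")); // True
        Console.WriteLine(IsValidUsername("1alice"));   // False: starts with a digit
        Console.WriteLine(IsValidUsername("ab"));       // False: too short
    }
}
```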

The need for regular expressions

String patterns in programming are very common. Whether we write a program to find and replace text, extract data from a website, build a UI, or create XML files, there are some patterns we must match and use as part of our code.

Without regular expressions, we're largely limited to searching a string for a fixed word, scanning from its left-most character. This makes it hard to perform complex operations like “find all email addresses” or “get all numbers from this site.”

Regular expressions change all of that. They allow us to write patterns that find what we're looking for, and ignore everything else.

Regular expressions help us parse useful information, such as dates, phone numbers, and zip codes, from important text files such as code, log files, spreadsheets, and documents.
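
For instance, pulling dates out of a log line might look like the sketch below. It assumes dates are written as yyyy-MM-dd. The log text and the names LogDateDemo and FindDates are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LogDateDemo
{
    // Assumed format: four digits, a hyphen, two digits, a hyphen, two digits.
    // The pattern checks the shape only; it does not verify that the date is valid.
    static readonly Regex DatePattern = new Regex(@"\b\d{4}-\d{2}-\d{2}\b");

    // Hypothetical helper: returns every date-like token found in the text.
    public static List<string> FindDates(string text)
    {
        var dates = new List<string>();
        foreach (Match m in DatePattern.Matches(text))
        {
            dates.Add(m.Value);
        }
        return dates;
    }

    static void Main()
    {
        string log = "2023-04-01 start; job 77 failed; retry on 2023-04-02";
        Console.WriteLine(string.Join(", ", FindDates(log))); // 2023-04-01, 2023-04-02
    }
}
```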

Regular expressions in C#

Regular expressions operate on character sequences. The simplest case matches a string against itself, but complex operations are possible with combinations of characters and quantifiers.

Quantifiers define how often a character or group of characters occurs in a pattern. The most common quantifiers are * (zero or more times), + (one or more times), and ? (zero or one time). We'll learn about quantifiers in detail later in this course.
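
A small sketch of how the three quantifiers differ, using the character b between a and c. The class QuantifierDemo and the helper FullMatch are illustrative names. The helper wraps the pattern in ^(?: and )$ so that the whole input must match.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Hypothetical helper: true when the entire input matches the given pattern.
    public static bool FullMatch(string pattern, string input) =>
        Regex.IsMatch(input, "^(?:" + pattern + ")$");

    static void Main()
    {
        Console.WriteLine(FullMatch("ab*c", "ac"));    // True:  * allows zero b's
        Console.WriteLine(FullMatch("ab*c", "abbbc")); // True:  * allows many b's
        Console.WriteLine(FullMatch("ab+c", "ac"));    // False: + needs at least one b
        Console.WriteLine(FullMatch("ab+c", "abbc"));  // True
        Console.WriteLine(FullMatch("ab?c", "abc"));   // True:  ? allows one b
        Console.WriteLine(FullMatch("ab?c", "abbc"));  // False: ? allows at most one b
    }
}
```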

The syntax of advanced regular expressions is more complex than its basic counterpart because it has more features. This complexity comes with the ability to concisely express most common text processing tasks.

The .NET Framework provides the System.Text.RegularExpressions namespace, a set of types and methods that we can use to create, match, and manipulate regular expressions.

Regular expressions in C# are represented by the Regex class in this namespace. We can create a Regex instance and call its methods, or we can use the static convenience methods of the class, which take the pattern as an argument. For example, Regex.IsMatch() reports whether a pattern matches an input string. With either approach, we can use both simple and complex patterns.
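
The two ways of using the Regex class can be sketched as follows. The class RegexApiDemo and its member names are illustrative. Regex.IsMatch and RegexOptions.Compiled are part of System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexApiDemo
{
    const string Pattern = @"regexp?";

    // Instance approach: build the Regex once and reuse it for many inputs.
    // RegexOptions.Compiled trades a slower start-up for faster repeated matching.
    static readonly Regex Reusable = new Regex(Pattern, RegexOptions.Compiled);

    // Static approach: pass the pattern with every call.
    public static bool ViaStatic(string input) => Regex.IsMatch(input, Pattern);

    public static bool ViaInstance(string input) => Reusable.IsMatch(input);

    static void Main()
    {
        Console.WriteLine(ViaStatic("Fun with regexp in C#"));   // True
        Console.WriteLine(ViaInstance("Fun with regexp in C#")); // True
        Console.WriteLine(ViaInstance("No pattern here"));       // False
    }
}
```

Both calls give the same answer for the same input. The instance form tends to be the natural choice when one pattern is applied to many strings.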

The System.Text.RegularExpressions namespace provides full support for regular expressions. Its methods accept patterns written as strings and let us perform the following operations:

  • Creating and managing compiled patterns
  • Matching strings against patterns
  • Splitting strings according to patterns
  • Replacing parts of strings according to patterns

This support brings significant advantages, including options for improving performance, Unicode character support, and a more advanced pattern syntax. We discuss this in more detail later in this course.

Note: The Regex classes in C# cover most common textual pattern matching needs, with the exception of POSIX bracket expressions such as [:alpha:].

Program to demonstrate simple pattern matching

The following code shows how to compile and use a simple pattern in C#:

using System.Text.RegularExpressions;

class HelloWorld
{
    static void Main()
    {
        Regex r = new Regex(@"(regexp?|regular expression)"); // compile the pattern
        System.Console.WriteLine(r.Match("Mastering regex in C#"));               // prints: regex
        System.Console.WriteLine(r.Match("Fun with regexp in C#"));               // prints: regexp
        System.Console.WriteLine(r.Match("Mastering regular expressions in C#")); // prints: regular expression
        System.Console.WriteLine(r.Match("Regular Expressions"));                 // prints an empty line
    }
}

Explanation

  • Line 7: We first create an instance of the Regex class with the pattern, as Regex r = new Regex(@"(regexp?|regular expression)"). Here, the @ symbol marks a verbatim string literal, in which backslashes are not treated as escape characters.

  • Lines 8–11: We print the first match found in each string using the r.Match() method.

  • Lines 8–10: We print regex, regexp, and regular expression because these substrings match the pattern.

  • Line 11: Here, we print an empty string because we find no match.

Note: By default, string matching is case sensitive in C#.
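
If case-insensitive matching is wanted, one option is to pass RegexOptions.IgnoreCase. The sketch below reuses the lesson's pattern. The class CaseDemo and the helper FirstMatch are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaseDemo
{
    const string Pattern = @"(regexp?|regular expression)";

    // Hypothetical helper: returns the first match, or an empty string if there is none.
    public static string FirstMatch(string input, RegexOptions options) =>
        Regex.Match(input, Pattern, options).Value;

    static void Main()
    {
        // Default options: matching is case sensitive, so nothing is found.
        Console.WriteLine(FirstMatch("Regular Expressions", RegexOptions.None) == ""); // True

        // IgnoreCase: letter case is ignored, so the phrase is found.
        Console.WriteLine(FirstMatch("Regular Expressions", RegexOptions.IgnoreCase)); // Regular Expression
    }
}
```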