Introduction to regular expressions

  • A regular expression, also known as regex or regexp, is a pattern of characters we want to match or search for in a text.
  • Regular expressions are widely used in Unix operating systems, text editors, programming languages, and various domains like bioinformatics and lexical analysis.
  • Regular expressions scan the string from left to right to look for matches with a given pattern.
  • Matching an input text against a regular expression pattern lets us perform operations based on the results, such as parsing useful information, finding and replacing text, splitting a string, and extracting data.

The String class and its methods

  • A string is a sequence of Unicode characters enclosed by double quotes.

  • A string literal prefixed with @ in C# denotes a verbatim string literal, in which backslash escape sequences are not processed (a double quote is written as ""). For example, the string "C:\\temp" can also be represented as @"C:\temp".

  • The string keyword in C# is an alias for the System.String class in the .NET Framework. It provides various methods and properties to work with strings.

  • A string is an immutable sequence of System.Char objects, so methods that appear to modify a string return a new string.

  • The Concat() method of the String class concatenates two or more strings.

  • The String class has no Match() method. To match a regular expression against a string, we use the Match() method of the Regex class.

  • The Replace() method of the String class replaces a given substring with another substring in a string.

  • The Split() method of the String class splits a string into multiple substrings based on the characters in an array.

  • The Substring() method of the String class extracts a part of a string.

  • The Contains() method of the String class checks whether a string contains a given substring.

  • The StartsWith() and EndsWith() methods of the String class check whether a string starts with or ends with a given substring.

  • The IndexOf() method of the String class finds the zero-based index of a given character or substring in a string. It returns -1 if the value is not found.
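
As a quick illustration of the String methods listed above, here is a minimal sketch. The sample values and variable names are invented for this example.

```csharp
using System;

// A verbatim literal: the backslash is not treated as an escape character.
string path = @"C:\temp";
Console.WriteLine(path == "C:\\temp");          // True

// Strings are immutable, so each call below returns a new string.
string joined = string.Concat("Hello", ", ", "World");
string replaced = joined.Replace("World", "C#");
string[] parts = "a,b;c".Split(new[] { ',', ';' });
string middle = joined.Substring(7, 5);

Console.WriteLine(joined);                      // Hello, World
Console.WriteLine(replaced);                    // Hello, C#
Console.WriteLine(string.Join("|", parts));     // a|b|c
Console.WriteLine(middle);                      // World

bool hasComma = joined.Contains(",");
bool starts = joined.StartsWith("Hello", StringComparison.Ordinal);
bool ends = joined.EndsWith("World", StringComparison.Ordinal);
int index = joined.IndexOf("World", StringComparison.Ordinal);

Console.WriteLine($"{hasComma} {starts} {ends} {index}");   // True True True 7
```

Passing StringComparison.Ordinal is a choice made for this sketch so that the comparisons do not depend on the current culture.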

Regular expressions APIs in C#

  • The System.Text.RegularExpressions namespace of the .NET Framework provides a set of classes and methods to create, match, and manipulate regular expressions.

  • The Regex class is the primary type of the Regular Expressions API. It provides both static and instance methods, along with properties, to work with regular expressions.

  • The Match class represents the results of a single regular expression match. It contains information about the match, such as the value of the matched string, and its start index and length within the input string.

  • The MatchCollection class contains a collection of Match objects.

  • The Group class represents a matching subexpression within a regular expression match.

  • The GroupCollection class contains a collection of Group objects that represent all the captured groups within a single regular expression match.

  • The Match() method of the Regex class matches a regular expression against a string. This method returns a Match object that contains information about the match.

  • The Matches() method of the Regex class finds all the matches of a regular expression in a string. This method returns a collection of Match objects that contain information about all the matches.

  • The Replace() method of the Regex class replaces the substrings that match a regular expression with a replacement string.

  • The Split() method of the Regex class splits a string into an array of substrings. This method splits the input string at the positions that match the regular expression.

  • The IsMatch() method of the Regex class checks whether a regular expression finds a match in a given string. This method returns true if a match is found. Otherwise, it returns false.
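
The Regex methods above can be sketched together as follows. The input string and the pattern \d+ are illustrative assumptions, not taken from the course material.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Order 66 shipped, order 1024 pending";
var digits = new Regex(@"\d+");          // one or more digits

// Match() returns the first match and its position in the input.
Match first = digits.Match(input);
Console.WriteLine($"{first.Value} at {first.Index}, length {first.Length}");  // 66 at 6, length 2

// Matches() returns every match as a MatchCollection.
MatchCollection all = digits.Matches(input);
Console.WriteLine(all.Count);            // 2

// Replace() substitutes each matched substring.
string masked = digits.Replace(input, "#");
Console.WriteLine(masked);               // Order # shipped, order # pending

// Split() cuts the input at each match; IsMatch() only reports success.
string[] pieces = Regex.Split("a1b22c", @"\d+");
bool found = Regex.IsMatch(input, "shipped");
Console.WriteLine(string.Join("|", pieces));   // a|b|c
Console.WriteLine(found);                      // True
```

The sketch mixes instance calls (digits.Match) and static calls (Regex.Split) only to show that both forms exist.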

Special characters in regular expressions

  • The dot (.) character matches any single character except the newline character (\n).

  • The caret (^) character matches the start of the input string.

  • The dollar sign ($) character matches the end of the input string.

  • A pair of square brackets [] represents a character class.
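
A small sketch of how the dot, caret, and dollar sign behave with the default options. The sample strings are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The dot matches any single character except \n.
bool dotAny = Regex.IsMatch("cat", "c.t");
bool dotNewline = Regex.IsMatch("c\nt", "c.t");

// The caret anchors the pattern to the start of the input.
bool startOk = Regex.IsMatch("hello world", "^hello");
bool startNo = Regex.IsMatch("say hello", "^hello");

// The dollar sign anchors the pattern to the end of the input.
bool endOk = Regex.IsMatch("hello world", "world$");
bool endNo = Regex.IsMatch("world hello", "world$");

Console.WriteLine($"{dotAny} {dotNewline}");   // True False
Console.WriteLine($"{startOk} {startNo}");     // True False
Console.WriteLine($"{endOk} {endNo}");         // True False
```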

Character classes

  • A character class matches any single character from the set enclosed within the square brackets. For example, [abc] matches a, b, or c.

  • A character class can also include a range of characters, represented by two characters separated by a hyphen (-). For example, [A-Z] matches any uppercase letter from A to Z.
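
To make the behavior of character classes concrete, here is a minimal sketch. The patterns gr[ae]y and [A-Z][0-9] are examples chosen for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// [ae] matches exactly one character: either 'a' or 'e'.
bool gray = Regex.IsMatch("gray", "gr[ae]y");
bool grey = Regex.IsMatch("grey", "gr[ae]y");
bool groy = Regex.IsMatch("groy", "gr[ae]y");
Console.WriteLine($"{gray} {grey} {groy}");   // True True False

// Ranges: one uppercase letter followed by one digit.
Match code = Regex.Match("id: X7", "[A-Z][0-9]");
Console.WriteLine(code.Value);                // X7
```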

Meta characters

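  • The backslash (\) symbol is a meta character that escapes special characters (for example, \. matches a literal dot) and introduces predefined character classes.

  • \s matches any whitespace character (such as space, tab, carriage return, newline, and form feed).

  • \d matches any decimal digit (0 to 9). In .NET, it also matches other Unicode decimal digits unless RegexOptions.ECMAScript is specified.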
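
The following sketch shows \s, \d, and an escaped dot in action. The inputs are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// \s+ treats any run of whitespace as a separator.
string[] words = Regex.Split("one two\tthree\nfour", @"\s+");
Console.WriteLine(words.Length);                       // 4

// \s matches a single whitespace character at a time.
string underscored = Regex.Replace("a b\tc", @"\s", "_");
Console.WriteLine(underscored);                        // a_b_c

// \d matches one digit at a time.
int digitCount = Regex.Matches("a1b22", @"\d").Count;
Console.WriteLine(digitCount);                         // 3

// An escaped dot matches only a literal '.'.
bool literalDot = Regex.IsMatch("3.14", @"3\.14");
bool notLiteral = Regex.IsMatch("3x14", @"3\.14");
Console.WriteLine($"{literalDot} {notLiteral}");       // True False
```

Verbatim strings (the @ prefix) are used for the patterns so that each backslash reaches the regular expression engine unchanged.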

Quantifiers

  • The symbols ?, *, and + are used as quantifiers. Each applies to the element that immediately precedes it.

  • X? matches zero or one occurrence of X.

  • X* matches zero or more occurrences of X.

  • X+ matches one or more occurrences of X.
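
A brief sketch of the three quantifiers. The patterns colou?r, ab*c, and ab+c are illustrative choices.

```csharp
using System;
using System.Text.RegularExpressions;

// ? : the 'u' may appear zero times or once.
bool us = Regex.IsMatch("color", "^colou?r$");
bool uk = Regex.IsMatch("colour", "^colou?r$");
bool twice = Regex.IsMatch("colouur", "^colou?r$");
Console.WriteLine($"{us} {uk} {twice}");       // True True False

// * : zero or more 'b' characters are accepted.
bool starZero = Regex.IsMatch("ac", "^ab*c$");
bool starMany = Regex.IsMatch("abbbc", "^ab*c$");
Console.WriteLine($"{starZero} {starMany}");   // True True

// + : at least one 'b' is required.
bool plusZero = Regex.IsMatch("ac", "^ab+c$");
bool plusMany = Regex.IsMatch("abbbc", "^ab+c$");
Console.WriteLine($"{plusZero} {plusMany}");   // False True
```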

RegexOptions

  • The RegexOptions enumeration controls how regular expression operations are performed.

  • We can include one or more values from the RegexOptions enumeration in a bitwise combination by using the OR (|) operator. For example, if we want to perform case-insensitive and culture-insensitive matches, we use the value RegexOptions.IgnoreCase | RegexOptions.CultureInvariant.

  • We can pass the value of the RegexOptions enumeration as an argument to methods that expect options. For example, we can specify the options for constructing a Regex object.

  • RegexOptions.CultureInvariant specifies that cultural differences in language are ignored.

  • RegexOptions.IgnoreCase specifies that the regular expression is case-insensitive.

  • RegexOptions.Multiline specifies that ^ and $ match at the start and end of each line, rather than only at the start and end of the entire input string.

  • RegexOptions.Singleline specifies that the “.” character matches all characters, including newline characters.

  • RegexOptions.IgnorePatternWhitespace specifies that unescaped white space in the regular expression pattern is ignored, and enables comments marked with #.
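
The options above can be sketched as follows. The patterns and inputs are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Options are combined with the bitwise OR operator and passed to the constructor.
var options = RegexOptions.IgnoreCase | RegexOptions.CultureInvariant;
var hello = new Regex("hello", options);
Console.WriteLine(hello.IsMatch("HELLO"));   // True

// Multiline: ^ matches at the start of every line, not just the start of the input.
string text = "first\nsecond";
int defaultStarts = Regex.Matches(text, "^[a-z]+").Count;
int multilineStarts = Regex.Matches(text, "^[a-z]+", RegexOptions.Multiline).Count;
Console.WriteLine($"{defaultStarts} {multilineStarts}");   // 1 2

// Singleline: the dot also matches \n.
bool dotDefault = Regex.IsMatch("a\nb", "a.b");
bool dotSingleline = Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline);
Console.WriteLine($"{dotDefault} {dotSingleline}");        // False True

// IgnorePatternWhitespace: spaces in the pattern are skipped and # starts a comment.
bool spaced = Regex.IsMatch("42", @"^ \d \d $  # two digits",
                            RegexOptions.IgnorePatternWhitespace);
Console.WriteLine(spaced);                                 // True
```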

Working with capture groups

  • Groups, specified by parentheses (), subdivide the match found by a regular expression into subexpressions.

  • We can access groups using the Groups property of the Match object.

  • The Value property of the Group object contains the value of the group that is matched.

  • The Success property indicates whether the group matches the input string.

  • The Index property of the Group object contains the index of the matched group.

  • The Length property of the Group object contains the length of the matched group.

  • The captured groups are numbered, starting from 1. Group 0 always represents the entire match.

  • In a replacement pattern, $n denotes the nth captured group, where n is the number of the captured group.
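
As a minimal sketch of the group properties and the $n substitution described above, consider the following. The date format and input strings are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"(\d+)-(\d+)-(\d+)";
Match m = Regex.Match("Date: 2024-05-17", pattern);

// Group 0 is the whole match; groups 1..3 are the parenthesized parts.
Console.WriteLine(m.Groups[0].Value);     // 2024-05-17
Console.WriteLine(m.Groups[1].Value);     // 2024
Console.WriteLine(m.Groups[1].Index);     // 6
Console.WriteLine(m.Groups[1].Length);    // 4
Console.WriteLine(m.Groups.Count);        // 4

// $n in a replacement pattern refers to the nth captured group.
string reordered = Regex.Replace("2024-05-17", pattern, "$3/$2/$1");
Console.WriteLine(reordered);             // 17/05/2024

// An optional group that takes no part in the match reports Success == false.
Match optional = Regex.Match("ab", "(x)?ab");
Console.WriteLine(optional.Groups[1].Success);   // False
```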

Working with backreferences

  • Backreferences let us reuse previously matched sub-strings within a regular expression pattern.

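  • Within a pattern, \n denotes the nth captured group, where n is the number of the captured group. For example, \1 refers to the first group; in a regular C# string literal it is written as "\\1", and in a verbatim string as @"\1".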

  • $& denotes the entire match.

  • $` (dollar sign followed by a backtick) denotes the part of the string before the match.

  • ${name} denotes the value of the named captured group name.
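
Here is a small sketch that combines a backreference inside a pattern with the substitutions used in replacement strings, including $` (dollar sign plus backtick) for the text before the match. All inputs, and the group names first and last, are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// \1 inside the pattern must repeat whatever group 1 captured.
Match repeated = Regex.Match("it was was fine", @"([a-z]+) \1");
Console.WriteLine(repeated.Value);           // was was
Console.WriteLine(repeated.Groups[1].Value); // was

// $& inserts the entire match into the replacement.
string bracketed = Regex.Replace("cat", "a", "[$&]");
Console.WriteLine(bracketed);                // c[a]t

// $` inserts the text that precedes the match.
string before = Regex.Replace("abc", "b", "$`");
Console.WriteLine(before);                   // aac

// ${name} inserts the value of a named group, declared with (?<name>...).
string swapped = Regex.Replace(
    "John Smith",
    @"(?<first>[A-Za-z]+) (?<last>[A-Za-z]+)",
    "${last}, ${first}");
Console.WriteLine(swapped);                  // Smith, John
```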

Advanced topics

  • Regex patterns are often used to search for sensitive data, such as credit card numbers or social security numbers. We must make sure to not accidentally store or log this sensitive data.

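  • A malicious user might submit a string that causes our regular expression to take a very long time to process, typically through excessive backtracking. This is called a regular expression denial of service (ReDoS) attack.

  • We can mitigate such attacks by specifying a match timeout, either through the matchTimeout parameter of the Regex constructor or through the static method overloads that accept a timeout. If an operation exceeds the timeout, a RegexMatchTimeoutException is thrown.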

  • We should always use the simplest regular expressions that match the patterns we look for to ensure good performance.
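
One way to guard against slow patterns is to give every Regex a match timeout, as in the sketch below. The nested-quantifier pattern, the 200 ms limit, and the hostile input are assumptions chosen for illustration; whether this particular input actually hits the timeout depends on the runtime version and the machine.

```csharp
using System;
using System.Text.RegularExpressions;

// A pattern with nested quantifiers, a classic source of heavy backtracking.
var risky = new Regex(@"^(a+)+$", RegexOptions.None, TimeSpan.FromMilliseconds(200));
Console.WriteLine(risky.MatchTimeout.TotalMilliseconds);   // 200

// A well-formed input matches quickly.
bool normal = risky.IsMatch("aaaa");
Console.WriteLine(normal);                                 // True

// A long run of 'a' followed by a non-matching character may trigger the timeout.
string hostile = new string('a', 40) + "!";
bool timedOut = false;
bool hostileMatched = false;
try
{
    hostileMatched = risky.IsMatch(hostile);
}
catch (RegexMatchTimeoutException)
{
    // Treat a timeout as a rejected input instead of letting the request hang.
    timedOut = true;
}
Console.WriteLine($"matched: {hostileMatched}, timed out: {timedOut}");
```

Either way, the hostile input is never reported as a match, and the call returns within a bounded time.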

Congratulations

Congratulations on finishing this course! The lessons you’ve learned here will be invaluable as you continue to make more complex and practical C# applications.

Thanks for enrolling in this course, and good luck with your next steps as a programmer!