Regular Expressions

Learn how to use regular expressions to describe regular languages.

Regular expressions are a convenient notation for representing strings that match simple text patterns. The expression aacba^*ac^*b^* illustrates two of the four regular expression operations, namely concatenation (via juxtaposition) and Kleene star. The others appear in the following recursive definition.

Formal definition

  1. These are regular expressions:

    a) ϕ\phi, representing the empty language/set, {}\{ \}

    b) λ\lambda, representing the one-string language/set, {λ}\{\lambda\}

    c) cc, for each character, cc in the alphabet Σ\Sigma, representing the language/set {c}\{c\}

  2. For regular expressions r,r1,r2r, r_1, r_2, the following are also regular expressions:

    a) r1+r2r_1 + r_2 (union)

    b) r1r2r_1r_2 (concatenation)

    c) rr^* (Kleene star)

    d) (r)(r) (grouping)

Note: Some authors (and most programming environments) use the notation r1r2r_1 \mid r_2 for union.

Each regular expression represents a language, which is the set of all strings matching its pattern. Formally, we denote the language associated with a regular expression, rr, as L(r)L(r), so L(ϕ)={}L(\phi)=\{ \}, L(λ)={λ}L(\lambda)=\{\lambda\}, and L(c)={c}L(c)=\{c\} for each symbol cc from the alphabet in use. The following expressions also hold:

  • L(r1+r2)=L(r1)L(r2)L(r_1+r_2)=L(r_1)\cup L(r_2)
  • L(r1r2)=L(r1)L(r2)L(r_1r_2)=L(r_1)L(r_2)
  • L(r)=(L(r))L(r^*)=(L(r))^*
  • L((r))=L(r)L((r))=L(r)

The above definition is a constructive definition. For example, using Σ={a,b,c}\Sigma =\{a,b,c\}, we can construct the regular expression aacbaa^*c^*b^* as:

  • aa, bb, and cc are regular expressions by rule 1–c.
  • aa^*, bb^*, and cc^* are regular expressions by rule 2–c.
  • aacbaa^*c^*b^* is a regular expression by rule 2–b (applied three times).

We now take Σ={a,b}\Sigma =\{a,b\} and construct a regular expression representing the language of aa’s and bb’s that contain the substring abaaba. We don’t care what precedes or follows the substring abaaba, so a regular expression for this language is (a+b)aba(a+b)(a+b)^*aba(a+b)^*.

The formal construction satisfying the definition is:

  • aa and bb are regular expressions by rule 1–c.
  • abaaba is a regular expression by rule 2–b (applied twice).
  • a+ba+b is a regular expression by rule 2–a.
  • (a+b)(a+b) is a regular expression by rule 2–d.
  • (a+b)(a+b)^* is a regular expression by rule 2–c.
  • (a+b)aba(a+b)(a+b)^*aba(a+b)^* is a regular expression by rule 2–b (applied twice).

Here are some more examples of regular expressions over the alphabet Σ={a,b}\Sigma=\{a,b\}.

Regular Language Regular Expression
Strings ending in abab (a+b)ab(a+b)^*ab
Strings that begin and end with different symbols a(a+b)b+b(a+b)aa(a+b)^*b+b(a+b)^*a
String of even length ((a+b)(a+b))((a+b)(a+b))^*
Strings with an even number of aa’s (aba+b)(ab^*a+b)^*

The precedence of regular expression operators parallels that of algebra:

  1. Grouping is the highest-precedence operator.
  2. Followed by Kleene star (like exponentiation).
  3. Followed by concatenation (like multiplication).
  4. Followed by union (like addition).

Note the subexpression abaab^*a in the last example in the above table. We know that two aa’s must be introduced at a time to have an even number, but they don’t have to be consecutive, so we insert a bb^* between the two aa’s.

Remember: Regular expressions constitute a pattern language for text processing.

While we are accustomed to using exponents as a shorthand for repetition, using variables as exponents is not allowed in regular expressions. For example, a3a^3 as a shorthand for aaaaaa is a regular expression because aaaaaa is a regular expression, but ana^n is not a regular expression.

Remember: The expression aka^k, where kk is a constant, is a regular expression; but ana^n, where nn is a variable, is not a regular expression!

Regular expressions also observe certain algebraic rules, such as a distributive law, wherein aa+ab=a(a+b)aa+ab=a(a+b) and b+ba=b(λ+a)b+ba=b(\lambda+a). Order matters here, because what looks like multiplication is actually concatenation, so a(a+b)(a+b)aa(a+b) \neq (a+b)a. The following table summarizes common rules for rewriting regular expressions.

Note: The quality of regular expressions is used to represent the equality of the corresponding languages. When we write r1=r2r_1 = r_2, instead of r1r2r_1\equiv r_2, we mean that L(r1)=L(r2)L(r_1) = L(r_2).

Rule Description
r+s=s+rr+s=s+r Union is commutative
(r+s)+t=r+(s+t)(r+s)+t=r+(s+t) Union is associative
r+r=rr+r=r Union is idempotent
r+ϕ=rr+\phi=r Union’s identity element is phiphi
(rs)t=r(st)(rs)t=r(st) Concatenation is associative
rλ=λr=rr\lambda=\lambda r=r Concatenation’s identity element is λ\lambda
rϕ=ϕr=ϕr\phi=\phi r=\phi Concatenation’s nullity element is ϕ\phi
r(s+t)=rs+rtr(s+t)=rs+rt Concatenation over union is distributive
(r+s)t=rt+rs(r+s)t=rt+rs Concatenation over union is distributive
(r)=r(r^*)^*=r^* Kleene star is idempotent

Apart from the above-mentioned rules, there are some additional rules for the regular expressions mentioned below:

Get hands-on with 1300+ tech skills courses.