Formal Languages
Get an introduction to the concept of formal languages, grammars, and fundamental operations.
Introduction to formal languages
A formal language is a set of strings formed with symbols from some finite alphabet. When studying formal languages, we are interested in the many ways symbols can be validly combined to form valid strings for a given context. We talk about tokens (or words), sentences (or phrases), which are sequences of tokens, and language structure, which defines how strings may be properly classified. We do not concern ourselves here with the semantics of the words—only the structure of the strings.
Remember: A formal language is a set of strings formed with symbols from some finite alphabet.
To illustrate this, let’s consider the simple alphabet . Strings such as and are among the infinite number of strings that can be formed with these two symbols. To represent an empty string (which has length 0), we can choose none of the symbols. It can also be represented by the string literal ""
in programming languages, and in the theory of computation, we denote it by the lowercase Greek letter lambda: .
The infinite set of all possible strings using this alphabet is depicted like this:
Observe the natural way to enumerate this set, which we call canonical order. Strings are grouped according to length, the groups are arranged in increasing length, and the strings within each group are alphabetized, i.e., they appear in lexicographical order based on the alphabet used ( comes before , etc.).
Operations in formal languages
The set of all possible strings is formed by considering all possible ways of concatenating any number of ’s and ’s together. The set of all possible concatenations of elements of an arbitrary set of symbols, , is called the Kleene star of that set and is denoted by . Concatenation of strings is a fundamental operation in formal languages. Since a formal language is a set of strings over some alphabet, a language is, therefore, a subset of its alphabet’s Kleene star.
Note: A formal language over the alphabet is a subset of .
Suppose the variable represents the string and stands for . The table below uses these variables to illustrate common operations on strings in a formal language.
Name | Operation | Example |
---|---|---|
Length | ||
Concatenation | ||
Repetition | , | |
Kleene star | ||
Reversal |
Since languages are sets, the usual set operations, such as union, intersection, difference, complement, etc., also apply.
Suppose , and we want to form , which is the Kleene star of the strings denoted by and above. As we always have the option to choose none, lambda is the first element of a Kleene star. This is followed by all possible concatenations of the original strings in canonical order:
The string precedes in because of its length, not because it appears first in the original set (sets, generally, are unordered, but a canonical ordering is by nature an ordered set). We use the alphabet to determine the lexicographical ordering within each length-group.
Generating strings in a language using formal grammar
Consider the language, , consisting of all even-length strings over the alphabet :
Another way to describe this language is the Kleene star of the strings from the set . In other words, . It is obvious that any string in has an even length. Take a moment to verify that any even-length string of ’s and ’s comes from repeated concatenations of elements of .
It is possible to generate all the strings in a formal language by using a set of “substitution rules” (aka “rewrite rules”) known as a formal grammar. A formal grammar consists of substitution rules that stand for substrings that form part of a complete string in a language. A grammar for the language is:
Since the variable appears first, it is the start variable. The vertical bar is an alternation symbol that separates the various ways to substitute for the variable on the left-hand side, so has three substitution rules and has two. To generate a string, begin by choosing a right-hand side to replace the start variable, , and then continue substituting for variables at will, finally ending by choosing a rule containing no variables (in this case, the rule ). We can, for example, use this grammar to derive the string as
Since lambda is the empty string, it disappears in concatenation. When all we have left are symbols of the alphabet, we are done generating a string.
Another grammar for comes from using the elements of the set from the first example, :
Using this grammar, we can derive the string as:
Quiz on formal languages
Test your knowledge of formal languages.
Take .
Take two strings and from such that and . Which option is the string ?