Regular Expressions
Let's learn how we can use regular expressions in Ruby.
Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems.
That’s a pretty famous joke, and it refers to the fact that regular expressions can be difficult to read and write.
However, once we know some basics about them, they’re also extremely powerful, and we can do amazing things with them, not only in Ruby but also, for example, in our editor and command-line tools.
Regular expressions are sort of a Swiss Army knife for finding things in strings (text), extracting parts of them, or mass replacing certain bits with something else.
For example, we could do any of these tasks:
- Extract area codes from phone numbers.
- Validate the format of an email address.
- For a list of files, a-01.mpeg, b-02.mpeg, and c-03.mpeg, change their names to 01-a.mpeg, 02-b.mpeg, and 03-c.mpeg.
Remember: Regular expressions are a language to describe patterns of text. Wikipedia calls them “a sequence of characters that define a search pattern.”
For example, the pattern [0-9]+! means that there needs to be at least one digit, and it needs to be followed by an exclamation mark. Does the pattern ([\w]+)-([\d]+)\.mpeg look intimidating and cryptic? It does, and that’s why regular expressions have a kind of strange reputation in programming. They’re super powerful, but they’re also kind of a pain.
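To make the two patterns above a bit more concrete, here is a minimal sketch that checks a few sample strings against them. The sample strings and variable names are made up for this illustration, and match? (available since Ruby 2.4) simply answers true or false.

```ruby
# Pattern 1: at least one digit, followed by an exclamation mark.
digits_bang = /[0-9]+!/

p "42!".match?(digits_bang)    # => true
p "Hello!".match?(digits_bang) # => false (no digit directly before the "!")

# Pattern 2: word characters, a hyphen, digits, then the literal ".mpeg".
mpeg_name = /([\w]+)-([\d]+)\.mpeg/

p "a-01.mpeg".match?(mpeg_name) # => true
p "a-01.jpeg".match?(mpeg_name) # => false (wrong file extension)
```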
A little bit of history
The main reason regular expressions are so hard to read is that they date as far back as 1956, and their first implementations in programming came in the late 1960s. Back then, every single character of code was worth a lot. Memory was extremely limited, and code had to be as terse as possible.
Now, the most commonly used features of this language are the following:
- String literals: Find a particular piece of text.
- Anchors: The beginning and the end of a string, or a word.
- Character classes: Define a set of allowed characters.
- Quantifiers: Define how often a character is expected to occur.
- Captures: Once found, capture a particular part of the text so that we can use it.
String literals
Let’s walk through some examples to make this more practical. Let’s say we have the following text string:
text = "A regular expression is a sequence of characters that define a search pattern."
Suppose we want to know if it contains the words character and sentence. In Ruby, we could use a regular expression, like so:
matches = text.match(/character/)
puts matches
Note that in Ruby, we can define a regular expression by enclosing it with slashes (/). There are other ways to define regular expressions too, but this is the most common one.
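For reference, two of those other ways are the %r{...} literal and Regexp.new. The sketch below builds the same pattern in all three forms; the variable names are only for illustration.

```ruby
# Three ways to write the same regular expression in Ruby.
with_slashes = /character/
with_percent = %r{character}
with_new     = Regexp.new("character")

text = "A regular expression is a sequence of characters that define a search pattern."

# All three behave identically when matching, so this prints "character" three times.
[with_slashes, with_percent, with_new].each do |pattern|
  puts text.match(pattern)
end
```

The %r{...} form is handy when the pattern itself contains slashes, because they don't need to be escaped there.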
If we run the code above, match returns an instance of the MatchData class, and puts prints the matched text, character. When we look for sentence instead, we get nil:
matches = text.match(/sentence/)
p matches
We could just use the include? method for strings, which lets us determine the same thing. Let’s spice this up a little.
Anchors (boundaries)
The most commonly used anchors are the beginning or end of the string, the beginning or end of a line, and the beginning or end of a word.
For example:
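text = "A regular expression is a sequence of characters that define a search pattern."
puts 'Found "A" at the beginning of the string.' if text.match(/^A/)
puts 'Found "O" at the beginning of the string.' if text.match(/^O/)
The ^ sign followed by a character indicates that the line must begin with that character. In Ruby, ^ anchors to the beginning of a line, while \A anchors to the beginning of the whole string. For a single-line string like ours, the two coincide.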
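In Ruby, ^ matches at the beginning of any line, whereas the \A anchor matches only at the very beginning of the string. The difference only shows up with multi-line strings, as this small sketch with a made-up two-line string illustrates:

```ruby
two_lines = "First line\nA second line"

# ^ matches at the start of any line, including the second one.
p two_lines.match?(/^A/)  # => true

# \A only matches at the very start of the string, which is "F" here.
p two_lines.match?(/\AA/) # => false
p two_lines.match?(/\AF/) # => true
```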
text = "A regular expression is a sequence of characters that define a search pattern."
puts 'Found the string "character".' if text.match(/character/)
puts 'Found the word "character".' if text.match(/character\b/)
It finds the string “character” but not the word “character”. This is because the regular expression /character\b/ requires a word boundary to be found after the string literal (the literal piece of text) “character”. Because in our example the text “character” is followed by another “s”, the regular expression won’t match.
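A word boundary can be required on both sides of a literal. The following sketch, using the same text, shows that only the complete word “characters” satisfies both boundaries:

```ruby
text = "A regular expression is a sequence of characters that define a search pattern."

# "character" is followed by "s", so there is no boundary after it.
p text.match?(/\bcharacter\b/)  # => false

# The complete word has a boundary on both sides.
p text.match?(/\bcharacters\b/) # => true

# There is no boundary between "c" and "h", so a leading \b fails here.
p text.match?(/\bharacters\b/)  # => false
```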
Character classes
Let’s say we want to find all words that start with a vowel. For that, we can use a character class, or a set of allowed characters. A character class is specified by enclosing the allowed characters in square brackets, like [aeiou]. Again, we use the word boundary anchor \b before this character class to express that the vowel needs to be at the beginning of the word.
Instead of the match method, which returns an object (something truthy) when the pattern matches and nil when it doesn’t, we use the scan method. This returns an array with all occurrences of text that match the pattern.
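To see the difference between the two return values, here is a minimal sketch that looks for the literal “re” (chosen only because it occurs twice in our text):

```ruby
text = "A regular expression is a sequence of characters that define a search pattern."

# match stops at the first occurrence and returns a MatchData object (or nil).
first = text.match(/re/)
p first.class # => MatchData
p first[0]    # => "re"

# scan collects every occurrence into an array.
all = text.scan(/re/)
p all         # => ["re", "re"]

# When nothing matches, match returns nil and scan returns an empty array.
p text.match(/xyz/) # => nil
p text.scan(/xyz/)  # => []
```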
text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b[aeiou][a-z]*\b/)
Our regular expression defines that we’re looking for a piece of text that:
- Starts with a word boundary.
- Is followed by a character that’s either “a,” “e,” “i,” “o,” or “u.”
- Is potentially (*) followed by any number of characters between “a” and “z.”
- Ends with a word boundary.
We’ll explain the star (*) quantifier soon.
Notice that our piece of code doesn’t match the word A at the beginning of the string! This is because regular expressions are case sensitive by default.
To fix that, we need to allow uppercase letters as well:
text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b[AEIOUaeiou][a-z]*\b/)
Our output includes the capital A at the beginning of the string as well.
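As an aside, Ruby regular expressions also accept an i flag after the closing slash that makes the whole pattern case insensitive. It isn't what this lesson uses, but for our text it gives the same result as listing the uppercase vowels explicitly:

```ruby
text = "A regular expression is a sequence of characters that define a search pattern."

# Uppercase vowels listed explicitly in the character class.
explicit = text.scan(/\b[AEIOUaeiou][a-z]*\b/)

# The i flag makes the whole pattern ignore case.
with_flag = text.scan(/\b[aeiou][a-z]*\b/i)

p explicit  # => ["A", "expression", "is", "a", "of", "a"]
p with_flag # => ["A", "expression", "is", "a", "of", "a"]
```

Note that the i flag also lets [a-z] match uppercase letters, so on other texts the two patterns can differ, for example on a word like “AT”.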
This example also highlights the difference between a word boundary and whitespace. A single space counts as whitespace, and we could use it to match our words too. However, this would not be a good substitute for \b. For one thing, this wouldn’t match the words at the beginning and end of a string. It also wouldn’t match a word when the word is followed by punctuation, such as a comma or a full stop. The word boundary \b allows all of these too.
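The sketch below compares the two approaches on a short, made-up sentence. Besides the problems just described, the space-based pattern also consumes the spaces it matches, so a word that directly follows a match can be skipped as well.

```ruby
sample = "apple, orange and egg."

# Using literal spaces as delimiters: misses "apple" (start of string, comma),
# "egg" (followed by a full stop), and "and" (its leading space was consumed).
with_spaces = sample.scan(/ [aeiou][a-z]* /)

# Using word boundaries: \b matches a position, not a character.
with_boundaries = sample.scan(/\b[aeiou][a-z]*\b/)

p with_spaces     # => [" orange "]
p with_boundaries # => ["apple", "orange", "and", "egg"]
```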
So, what about the star (*) in the expression above?
Quantifiers
This * symbol is what we call a quantifier. A quantifier specifies how many times whatever appears directly before it may occur. The star means zero or more times.
In the case of our example above, it means that we’re looking for a single vowel, followed by zero or more characters between “a” and “z”.
This is why we match the single-letter words “A” and “a”, as well as the words “is”, “of”, and “expression”, in which the vowel is followed by one or more characters.
In addition, the word boundaries on both sides of the pattern require each match to be a whole word. This is why “equence” (contained in the word “sequence”) isn’t matched.
Let’s say we want to change the example above to omit single-character words, but we do want to allow all words longer than one character. For that, we could change the zero-or-more quantifier, *, to another quantifier, +, meaning one or more:
text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b[AEIOUaeiou][a-z]+\b/)
Notice that this won’t match the words “A” and “a”.
What if we’re looking for words that start with a vowel and are no more than two characters long? We could use the quantifier ?, which means zero or exactly one:
text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b[AEIOUaeiou][a-z]?\b/)
If we remove the quantifier entirely, then the regular expression looks for a word that starts with a vowel, followed by exactly one lowercase letter, so it only matches two-letter words:
text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b[AEIOUaeiou][a-z]\b/)
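To compare the four variants side by side, this sketch runs each of them against the same text. The hash and its labels are just a convenience for this illustration.

```ruby
text = "A regular expression is a sequence of characters that define a search pattern."

# The same pattern with each quantifier applied to [a-z].
variants = {
  "* (zero or more)" => /\b[AEIOUaeiou][a-z]*\b/,
  "+ (one or more)"  => /\b[AEIOUaeiou][a-z]+\b/,
  "? (zero or one)"  => /\b[AEIOUaeiou][a-z]?\b/,
  "no quantifier"    => /\b[AEIOUaeiou][a-z]\b/
}

results = variants.transform_values { |pattern| text.scan(pattern) }

results.each { |label, words| puts "#{label}: #{words.inspect}" }
# * (zero or more): ["A", "expression", "is", "a", "of", "a"]
# + (one or more): ["expression", "is", "of"]
# ? (zero or one): ["A", "is", "a", "of", "a"]
# no quantifier: ["is", "of"]
```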
Captures
Using the scan method with regular expressions like this is quite useful in many situations. Sometimes, though, we need something more powerful.
Imagine we need to find all words that are followed by a word that starts with a vowel.
Let’s try using scan for that:
text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b[A-Za-z]+\b +\b[AEIOUaeiou][a-z]*\b/)
The second part of this regular expression is the same as above: it matches any word that starts with a vowel and is one or more characters long.
The first part matches something that starts at a word boundary, then has one or more characters between “A” and “Z” or “a” and “z” (that bit is new; we can combine ranges in a character class), ends at a word boundary, and is followed by at least one space.
If we run this, we’ll get the following output:
["regular expression", "is a", "sequence of", "define a"]
Note that our strings contain two words. What if we were only interested in the first of these two words?
We might have to work on these strings more (use the split method to split off the second word), but there’s a smarter way of doing the same: captures.
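For comparison, the split-based approach could look roughly like this. It starts from the array that scan returned above and keeps only the first word of each pair:

```ruby
pairs = ["regular expression", "is a", "sequence of", "define a"]

# split without arguments splits on whitespace; first picks the leading word.
first_words = pairs.map { |pair| pair.split.first }

p first_words # => ["regular", "is", "sequence", "define"]
```

This works, but it needs a second pass over the results, which captures let us avoid.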
In regular expressions, we can mark certain parts of a pattern, requesting the parts that match. To mark a part of the pattern to be captured, we enclose it with parentheses, like so:
/\b([A-Za-z]+)\b +\b[AEIOUaeiou][a-z]*\b/
Note how we’ve enclosed the first part of the pattern with parentheses. This matches the full pattern, but only captures the parts that we’ve marked as interesting.
text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b([A-Za-z]+)\b +\b[AEIOUaeiou][a-z]*\b/)
This returns a nested array like this:
[["regular"], ["is"], ["sequence"], ["define"]]
Awesome! We get all the words we were interested in.
Why is this a nested array? The scan method looks for each bit of text that matches the given pattern (regular expression). It then extracts all the marked (captured) parts from it and keeps these as an array. Because there can be many occurrences that match the pattern, and each of them can have many captures, we get back a nested array.
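When a pattern has only a single capture, as ours does at this point, the inner arrays hold one element each. If a flat list of words is more convenient, one option is to call flatten on the result, as in this small sketch:

```ruby
text = "A regular expression is a sequence of characters that define a search pattern."

nested = text.scan(/\b([A-Za-z]+)\b +\b[AEIOUaeiou][a-z]*\b/)
p nested         # => [["regular"], ["is"], ["sequence"], ["define"]]

# flatten turns the array of one-element arrays into a plain array of strings.
p nested.flatten # => ["regular", "is", "sequence", "define"]
```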
Let’s capture the second word starting with a vowel as well to demonstrate this:
text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b([A-Za-z]+)\b +\b([AEIOUaeiou][a-z]*)\b/)
This returns:
[["regular", "expression"], ["is", "a"], ["sequence", "of"], ["define", "a"]]
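Captures are also available outside of scan. The sketch below first reads them from the MatchData object that match returns, and then revisits the file-renaming task from the introduction by referring to the captures as \1 and \2 in the replacement string of sub. The variable names are only for illustration.

```ruby
text = "A regular expression is a sequence of characters that define a search pattern."
pattern = /\b([A-Za-z]+)\b +\b([AEIOUaeiou][a-z]*)\b/

# match returns only the first occurrence; index 0 is the whole match,
# and the captures are numbered from 1.
first_match = text.match(pattern)
p first_match[0] # => "regular expression"
p first_match[1] # => "regular"
p first_match[2] # => "expression"

# Renaming: swap the two captured parts of each file name.
# Single quotes keep \2 and \1 intact as references to the captures.
files = ["a-01.mpeg", "b-02.mpeg", "c-03.mpeg"]
renamed = files.map { |name| name.sub(/([\w]+)-([\d]+)\.mpeg/, '\2-\1.mpeg') }

p renamed # => ["01-a.mpeg", "02-b.mpeg", "03-c.mpeg"]
```

This only computes the new names as strings. Actually renaming files on disk would be a separate step.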
Interesting, right? Let’s explore character classes a bit more in the next lesson.