Regular Expressions

Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems.

That’s a pretty famous joke, and it refers to the fact that regular expressions can be difficult to read and write.

However, once we know some basics about them, they’re also extremely powerful, and we can do amazing things with them, not only in Ruby but also, for example, in our editor and command-line tools.

Regular expressions are sort of a Swiss Army knife for finding things in strings (text), extracting parts of them, or mass replacing certain bits with something else.

For example, we could do any of these tasks:

Extract area codes from phone numbers.
Validate the format of an email address.
For a list of files, a-01.mpeg, b-02.mpeg, and c-03.mpeg, change their names to 01-a.mpeg, 02-b.mpeg, and 03-c.mpeg.

Remember: Regular expressions are a language to describe patterns of text. Wikipedia calls them “a sequence of characters that define a search pattern.”

For example, the pattern [0-9]+! means that there needs to be at least one digit, and it needs to be followed by an exclamation mark. Does the pattern ([\w]+)-([\d]+)\.mpeg look intimidating and cryptic? It does, and that’s why regular expressions have a kind of strange reputation in programming. They’re super powerful, but they’re also kind of a pain.
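To make these two patterns a little less abstract, here is a small Ruby sketch. The sample strings and variable names are made up for illustration, and using sub with a '\2-\1' replacement string is just one possible way to approach the renaming task from the list above.

```ruby
# [0-9]+! : at least one digit, followed by an exclamation mark
digits_bang = /[0-9]+!/
has_digits = "Route 66!".match?(digits_bang)  # true: "66!" fits the pattern
no_digits  = "Route !".match?(digits_bang)    # false: no digit before the "!"
p has_digits
p no_digits

# ([\w]+)-([\d]+)\.mpeg : word characters, a dash, digits, then ".mpeg".
# The parentheses remember both parts so that we can swap them.
mpeg_name = /([\w]+)-([\d]+)\.mpeg/
files = ["a-01.mpeg", "b-02.mpeg", "c-03.mpeg"]  # hypothetical file names
renamed = files.map { |name| name.sub(mpeg_name, '\2-\1.mpeg') }
p renamed  # ["01-a.mpeg", "02-b.mpeg", "03-c.mpeg"]
```

This only builds the new names as strings. It doesn't touch any files on disk.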

A little bit of history

The main reason regular expressions are so hard to read is that they date as far back as 1956, and their first implementations in programming came in the late 1960s. Back then, every single character of code was worth a lot. Memory was extremely limited, and code had to be as terse as possible.

It finds the string “character” but not the word “character”. This is because the regular expression /character\b/ requires a word boundary to be found after the string literal (the literal piece of text) “character”. Because in our example the text “character” is followed by another “s”, the regular expression won’t match.
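The effect of the trailing \b can be sketched like this. The sample text is an assumption chosen so that “character” is followed by an “s”, as described above.

```ruby
text = "a sequence of characters"  # assumed sample text

# Without an anchor, the string literal is found inside "characters".
plain = text.match(/character/)
p plain[0]  # "character"

# With \b after the literal, a word has to end right there.
# In "characters" an "s" follows, so there is no match.
bounded = text.match(/character\b/)
p bounded   # nil
```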

Character classes

Let’s say we want to find all words that start with a vowel. For that, we can use a character class, or a set of allowed characters. A character class is specified by enclosing the allowed characters in square brackets, like [aeiou]. Again, we use the word boundary anchor \b before this character class to express that the vowel needs to be at the beginning of the word.

Instead of the match method, which returns an object (something truthy) when the pattern matches and nil when it doesn’t, we use the scan method. This returns an array with all occurrences of text that match the pattern.
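As a rough sketch of what such a scan call could look like (the sample sentence and the variable names are assumptions for illustration, not necessarily the ones the original example used):

```ruby
# Assumed sample sentence, based on the Wikipedia definition quoted earlier.
str = "A regular expression is a sequence of characters that define a search pattern."

# \b       word boundary
# [aeiou]  one lowercase vowel
# [a-z]*   any number of lowercase letters
# \b       word boundary
words = str.scan(/\b[aeiou][a-z]*\b/)
p words  # ["expression", "is", "a", "of", "a"]
```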

Our regular expression defines that we’re looking for a piece of text that:

Starts with a word boundary.
Is followed by a character that’s either “a,” “e,” “i,” “o,” or “u.”
Is potentially (*) followed by any number of characters between “a” and “z.”
Ends with a word boundary.

We’ll explain the star (*) quantifier soon.

Notice that our piece of code doesn’t match the word A at the beginning of the string! This is because regular expressions are case sensitive.

To fix that, we need to allow uppercase letters as well:
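One way to do this, sketched here with an assumed sample sentence, is to add the uppercase vowels to the character class:

```ruby
# Assumed sample sentence that starts with a capital "A".
str = "A regular expression is a sequence of characters that define a search pattern."

# The character class now lists uppercase vowels too.
words = str.scan(/\b[aeiouAEIOU][a-z]*\b/)
p words  # ["A", "expression", "is", "a", "of", "a"]
```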

Our output includes the capital A at the beginning of the string as well.

This example also highlights the difference between a word boundary and whitespace. A single space counts as whitespace, and we could use it to match our words too. However, this would not be a good substitute for \b. For one thing, this wouldn’t match the words at the beginning and end of a string. It also wouldn’t match a word when the word is followed by punctuation, such as a comma or a full stop. The word boundary \b allows all of these too.
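The difference can be seen in a small comparison. The sample text below is made up so that it contains a word at the start, a word before a comma, and a word before a full stop.

```ruby
text = "is an apple, or an egg."  # hypothetical sample text

# Using literal spaces instead of \b: the spaces become part of each match,
# and words at the edges or next to punctuation are missed.
with_spaces = text.scan(/ [aeiou][a-z]* /)
p with_spaces      # [" an ", " or "]

# Using word boundaries: \b matches a position, not a character.
with_boundaries = text.scan(/\b[aeiou][a-z]*\b/)
p with_boundaries  # ["is", "an", "apple", "or", "an", "egg"]
```

Note that the second “an” is also missing from the first result: the space in front of it was already used up as the end of the match “ or ”.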

So, what about the star (*) in the expression above?

Quantifiers

This * symbol is what we call a quantifier. A quantifier specifies how many times whatever appears directly before it may occur. The star means zero or more times.
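A tiny sketch of the star on its own, using a made-up pattern and sample string: in /go*l/ the star applies only to the “o” directly before it.

```ruby
# "o*" allows zero, one, or many "o" characters between "g" and "l".
matches = "gl gol goool".scan(/go*l/)
p matches  # ["gl", "gol", "goool"]
```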

In the case of our example above, it means that we’re looking for a single vowel, followed by either nothing or one or many characters between “a” and “z”.

This is why we match the single-letter words “A” and “a”, where the vowel is followed by nothing, as well as the words “is”, “of”, and “expression”, where the vowel is followed by one or many characters.

In addition, the word boundaries on both sides mean that a match has to start at the beginning of a word and end at the end of one. This is why “equence” (contained in the word “sequence”) isn’t matched.

Let’s say we want to change the example above to omit single-character words, but we do want to allow all words longer than one character. For that, we could change the none, one, or many quantifier, *, to another quantifier, +, meaning one or many:
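A minimal sketch of that change, again using an assumed sample sentence, might look like this:

```ruby
# Assumed sample sentence, not necessarily the one from the original example.
str = "A regular expression is a sequence of characters that define a search pattern."

# With *, the vowel may stand alone, so "A" and "a" are included.
with_star = str.scan(/\b[aeiouAEIOU][a-z]*\b/)
p with_star  # ["A", "expression", "is", "a", "of", "a"]

# With +, at least one more letter has to follow the vowel.
with_plus = str.scan(/\b[aeiouAEIOU][a-z]+\b/)
p with_plus  # ["expression", "is", "of"]
```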
