More on Character Classes (Regex)

This lesson discusses the character classes of regular expressions in detail.

Complement of a character class

So far, we’ve used character classes like [aeiou] (listing all allowed characters literally) and [a-z] (specifying a range of characters).

There’s more to these.

We can negate a class by placing a caret (^) directly after the opening square bracket. For example, [^AEIOUaeiou] allows every character that’s not a vowel. So, we can find all words that don’t start with a vowel:

text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b[^AEIOUaeiou ][^ ]*\b/)

This starts at a word boundary and allows any character that’s not a vowel or a space as the first character, optionally followed by any number of characters that aren’t a space, and ends at a word boundary.
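To see negation in isolation, here is a minimal sketch. The variable names and the sample word are chosen for illustration and aren't part of the lesson's example; the last line reruns the expression from above so its result is visible.

```ruby
# A class and its complement split a string's characters into two groups.
word = "expression"

vowels     = word.scan(/[aeiou]/)   # characters listed in the class
non_vowels = word.scan(/[^aeiou]/)  # every character NOT listed in the class

p vowels      # => ["e", "e", "i", "o"]
p non_vowels  # => ["x", "p", "r", "s", "s", "n"]

# The lesson's expression: words that don't start with a vowel.
text = "A regular expression is a sequence of characters that define a search pattern."
consonant_words = text.scan(/\b[^AEIOUaeiou ][^ ]*\b/)
p consonant_words
# => ["regular", "sequence", "characters", "that", "define", "search", "pattern"]
```

Note that the trailing period isn't part of "pattern" because the match has to end at a word boundary.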

Predefined classes

Regular expressions also come with predefined classes. For example, \d means any digit. Here’s a list of common classes:

  • \d is the same as [0-9] (any digit).

  • \D is the same as [^0-9] (everything except digits).

  • \w is the same as [A-Za-z0-9_], called a word character (this allows all lowercase and uppercase Latin letters, digits, and underscores, but not dashes).

  • \W is the same as [^A-Za-z0-9_] (everything that’s not a word character).

  • \s means any whitespace, including spaces, tabs, and line breaks.

  • \S means everything that’s not whitespace.
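As a quick check of the list above, the following sketch runs each predefined class against one made-up sample string. It assumes Ruby's default behavior for these classes; the string and variable names are illustrative only.

```ruby
sample = "Room 42_b, floor-3"

digits      = sample.scan(/\d/)    # single digits
non_digits  = sample.scan(/\D+/)   # runs of anything except digits
words       = sample.scan(/\w+/)   # runs of word characters
non_words   = sample.scan(/\W/)    # single non-word characters
spaces      = sample.scan(/\s/)    # single whitespace characters
non_spaces  = sample.scan(/\S+/)   # runs of non-whitespace characters

p digits      # => ["4", "2", "3"]
p non_digits  # => ["Room ", "_b, floor-"]
p words       # => ["Room", "42_b", "floor", "3"]
p non_words   # => [" ", ",", " ", "-"]
p spaces      # => [" ", " "]
p non_spaces  # => ["Room", "42_b,", "floor-3"]
```

The digits and the underscore in "42_b" count as word characters, while the dash in "floor-3" does not, which is why \w+ splits it into "floor" and "3".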

That means we could refine our expression from above:

/\b[A-Za-z]+\b +\b[AEIOUaeiou][a-z]*\b/

To this:

text = "A regular expression is a sequence of characters that define a search pattern."
p text.scan(/\b\w+\b +\b[AEIOUaeiou]\w*\b/)

This might yield slightly different results if we had words containing digits or underscores, but it’s the same in our case:

["regular expression", "is a", "sequence of", "define a"]

We can also combine these predefined classes with each other and with literal characters. For example, [\w!?]+ finds a sequence of one or more characters, each of which is a word character, an exclamation mark, or a question mark.
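A small sketch of such a combined class, using a made-up sentence, might look like this. Comparing it with plain \w+ shows what the extra literal characters add.

```ruby
text = "Really?! Yes, really!"

# Word characters only: punctuation ends each match.
plain = text.scan(/\w+/)

# Word characters plus "!" and "?": those marks stay attached to the word.
combined = text.scan(/[\w!?]+/)

p plain     # => ["Really", "Yes", "really"]
p combined  # => ["Really?!", "Yes", "really!"]
```

The comma after "Yes" isn't in the class, so it still ends the match.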

Anything

Finally, there’s one special character that matches any character (except, by default, a line break): the dot (.).

The regular expression .* matches any character, zero or more times. This may be useful if we’re looking, for example, for whatever text is enclosed in parentheses:

text = "Regular expressions are powerful (and sometimes confusing, even to experienced developers)."
p text.scan(/\(.*\)/)

Notice the backslashes before the opening and closing parentheses? Here, we want to match the literal characters rather than use their special meaning of capturing content. Therefore, we need to escape them to tell Ruby that we mean actual parentheses.

If we run the code above, we get the following output:

["(and sometimes confusing, even to experienced developers)"]

Perhaps we also want to capture a part of this and omit the actual parentheses from the result. We can do that by placing an extra pair of unescaped (capturing) parentheses inside the escaped (literal) ones:

text = "Regular expressions are powerful (and sometimes confusing, even to experienced developers)."
p text.scan(/\((.*)\)/)

And now we’ll get this result, which has the parentheses stripped off:

[["and sometimes confusing, even to experienced developers"]]
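The result is nested because scan returns one inner array per match, with one entry per capturing group. If a flat array or a single string is more convenient, one possible approach (a sketch, not the only way) is to use Array#flatten or String#[] with a group number:

```ruby
text = "Regular expressions are powerful (and sometimes confusing, even to experienced developers)."

# scan with a capturing group: an array of arrays.
nested = text.scan(/\((.*)\)/)

# flatten collapses the nesting into a plain array of strings.
flat = nested.flatten

# String#[] with a regex and a group number returns just that capture,
# or nil if the expression doesn't match.
first = text[/\((.*)\)/, 1]

p nested  # => [["and sometimes confusing, even to experienced developers"]]
p flat    # => ["and sometimes confusing, even to experienced developers"]
p first   # => "and sometimes confusing, even to experienced developers"
```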

It’s an admittedly confusing topic.

Try to remember some of the most basic, simple stuff. Then try using it, maybe in your text editor, when you search for a certain phrase. Over time, you’ll remember a few more things, bit by bit, and things will become a little less confusing. Writing a long, complicated regular expression that works, without having to think hard about it, is something that only a few developers can actually do.

If you can’t figure out a certain regular expression, or if you want to experiment with something, then Rubular is a great tool for that. Enter some text into the “Your test string” text area, and start writing a regular expression, one bit after another. The app displays the parts that match, as well as your captures if you define any.