Special Characters: Character Classes
Let's discuss one of the main components of a regular expression, i.e., character classes.
As the name indicates, special characters are characters with extra meaning. You probably noticed a lot of characters in the previous lesson’s example that, although not strange, don’t really make much sense in the way they’re arranged. For example, there are a few + and even some ? characters that modify the behavior of the rest of the expression.
So, in order to start understanding how to read and write regular expressions, the first thing you need to understand is how to interpret these characters.
Interpretation of character classes
Character classes are everything you put inside brackets. Essentially, you’re letting the parser know which characters you want to match.
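For example, /[abc]/ would match only the first ‘b’ inside the string bbbabcdebbb, because ‘b’ is one of the characters in the class and it is the first one the parser encounters. Why just the first one? Because that’s how the default behavior of the parser works: it stops at the first match. If you want to change that, you’ll need to use flags (more on this in a moment).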
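The first-match behavior can be checked directly. This is a minimal sketch that assumes JavaScript (the /.../ literal syntax used in this lesson suggests it); the variable names are made up for illustration, and the g (global) flag is shown here only as a preview of the flags discussion.

```javascript
// The string from the example above.
const text = "bbbabcdebbb";

// Without flags, match() stops at the first character that belongs to the class.
const firstMatch = text.match(/[abc]/);
console.log(firstMatch[0], firstMatch.index); // b 0

// With the global flag, every character in the class is returned.
const everyMatch = text.match(/[abc]/g);
console.log(everyMatch.join("")); // bbbabcbbb
console.log(everyMatch.length);   // 9
```

Note that ‘d’ and ‘e’ are skipped in the global result because they are not listed inside the brackets.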
Examples of character classes
Listing characters one by one in your character class is perfectly fine. However, if you’re trying to match things like all digits or all letters, writing something like /[0123456789]/ (not to mention adding every letter as well) might be a bit cumbersome. Instead, there are some abbreviations you can use:
Character Class | Description | Example |
---|---|---|
Lists of valid characters | Using a list of valid characters, as we mentioned above, is perfectly fine, and you should be using it if your char class is small enough. | /[abc01234_]/ |
Range of characters | This is great for abbreviating your expression. It makes no sense to write all alphanumeric characters if you want to match them all. Instead, use a range. | /[a-z]/ or /[0-5]/ |
All characters (except newlines and line endings) | What if you wanted to match every possible character except the newline character and the other line-ending characters? This is done using the dot character. That’s right, only a single dot. Note that the dot has to be written outside of brackets: inside a character class, as in /[.]/, it only matches a literal dot. | /./ |
Any word character | This refers to any character that can be used to create a word: the lowercase letters a to z, the uppercase letters A to Z, the digits 0 to 9, and the underscore. | /[\w]/ |
Any non-word character | What about the opposite of the previous character class? Something like ¬, @, {, or anything similar. | /[\W]/ |
Any digit character | Matching any single digit, from 0 to 9, can be done using this shortcut as well. | /[\d]/ |
Any non-digit character | The opposite of the previous character class is also available. This would match anything but digits. | /[\D]/ |
Whitespace characters | We’ve covered letters, symbols, and digits. What about whitespace (meaning blank spaces, tabs, end of lines, and so on)? This class matches any single whitespace character. | /[\s]/ |
Non-whitespace characters | What if you want to match anything but whitespace? This comes in handy if you’re trying to mix all of the previous classes into a single expression. | /[\S]/ |
Matching something at the start or end of a word | This one is a bit more complicated. What if you wanted to match the LO part of LOOK but not the one in HELLO? Or the other way around? The \b marker matches a position at a word boundary rather than a character. | /\bLO/ : This would match LOOK but not HELLO. /LO\b/ : This would match HELLO but not LOOK. |
Matching something NOT at the start or end of a word | This is the opposite of the previous entry, but let’s analyze it a bit further. Matching something that is NOT at the beginning of a word doesn’t mean it has to be at the end of it. It can be at the end or inside of the word, as long as it’s not at the very start. | /\BLO/ : This would match inside LOLOP and LOLO but would not match the first “LO” in those words. /LO\B/ : This would match inside LOLOP and LOLO (notice how LOLOP has two matches, since neither “LO” is at the end of the word, while LOLO has only one). |
Match end-of-line characters | Specifically matching the end-of-line character (otherwise known as \n). | /\n/ |
Match the form feed character | In case you didn’t know, the form feed character is the “new page” character. So, instead of simply signaling a new line, it signals a whole new page. | /\f/ |
Match the carriage return character | This one can usually be found next to the newline character and signals the return of the cursor to the beginning of the line. | /\r/ |
Match the tab character | Another individual whitespace character. | /\t/ |
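To make the table more concrete, here is a small sketch that runs several of these classes. It assumes JavaScript; the helper findAll, the sample string, and the variable names are invented for this illustration, and the g flag is used only so that every match is returned.

```javascript
// Hypothetical helper: return every match of a global regex, or [] if none.
function findAll(regex, text) {
  return text.match(regex) || [];
}

// Made-up sample string containing letters, digits, symbols, spaces, and a tab.
const sample = "Room_42 opens at 9:30\tsharp.";

const digits = findAll(/\d/g, sample);             // ["4", "2", "9", "3", "0"]
const lowDigits = findAll(/[0-5]/g, sample);       // ["4", "2", "3", "0"]
const wordChars = findAll(/\w/g, sample).join(""); // "Room_42opensat930sharp"
const blanks = findAll(/\s/g, sample);             // three spaces and one tab

// The dot matches everything except the newline; inside brackets it is literal.
const anyChar = findAll(/./g, "a.b\nc");           // ["a", ".", "b", "c"]
const literalDot = findAll(/[.]/g, "a.b\nc");      // ["."]

// Word boundaries: \b asserts a boundary, \B asserts the absence of one.
const startsWord = [/\bLO/.test("LOOK"), /\bLO/.test("HELLO")]; // [true, false]
const endsWord = [/LO\b/.test("HELLO"), /LO\b/.test("LOOK")];   // [true, false]
const notAtEnd = [
  findAll(/LO\B/g, "LOLOP").length, // 2
  findAll(/LO\B/g, "LOLO").length,  // 1
];
const notAtStart = findAll(/\BLO/g, "LOLOP").length; // 1

console.log(digits, lowDigits, wordChars, blanks.length);
console.log(anyChar, literalDot);
console.log(startsWord, endsWord, notAtEnd, notAtStart);
```

Notice that \w picks up the underscore and the digits in Room_42, and that the colon and the final period are left out because they are non-word characters.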
We have left out some obscure classes, as they are not essential at this point in the course. You can see that there is definite overlap in some of the cases listed above, but most of these are meant to help you simplify the structure of your regular expression. This is why in some cases you have one class followed by its complete opposite.
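The overlap and the "opposite" pairs mentioned above can be verified with a short sketch. It assumes JavaScript, and the variable names and the probe string are made up for the example.

```javascript
// The individual whitespace characters from the table, plus a blank space.
const singleWhitespace = ["\n", "\f", "\r", "\t", " "];

// Overlap: each of them is also matched by the general \s class.
const coveredByS = singleWhitespace.every((ch) => /\s/.test(ch));
console.log(coveredByS); // true

// Opposites: every character matches exactly one of \d and \D.
const probe = "a1 _-9";
const exactlyOne = [...probe].every((ch) => /\d/.test(ch) !== /\D/.test(ch));
console.log(exactlyOne); // true

// Shortcut vs. range: \d and [0-9] agree on every character of the probe.
const sameAsRange = [...probe].every((ch) => /\d/.test(ch) === /[0-9]/.test(ch));
console.log(sameAsRange); // true
```

In practice, this means you can pick whichever form reads best in your expression: the specific character, the range, or the shortcut.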
Now that we’ve covered this, you can start making some sense out of the regular expressions you see online. We’re still missing other strange symbols, but it’ll all make sense in a bit!