Backtracking

Learn about backtracking.

Overview of backtracking

To match a RegEx, a text must match the whole regular expression, from the 0th0^{th} character to the nthn^{th} character. If the initial parts of the RegEx succeed in matching the text but cause the latter parts of the pattern to fail, the RegEx engine backs up and recalculates the beginning section. This is why it is termed backtracking.

When RegEx takes all the characters due to its greedy nature, it must release them to match the other parts of the expression. Due to the lazy nature of RegEx, the previously matched characters by the expression’s preceding portions must be recalculated, i.e., backtracked. So, why is backtracking effective. For example, you are provided the text "123456789 " and your RegEx is /^(\d+)*$/. This RegEx will look for any length of numbers in the text from the beginning to the end. But in this case, the last character is a non-word character. While it might be obvious to a coder that there is no match, RegEx must undergo an observable computation to determine that there is no match in the given text.

You can re-write the RegEx pattern as /^$/ or /^(\d+)$/ or /^(\d+)(\d+)$/ or /^(\d+)(\d+)(\d+)...$/ Notice how the greedy nature is applied to the digit character set.

Let us see how this will work. When the RegEx engine parses the RegEx, it will come across the ^, followed by (\d+). The text should start with a digit only. Because the given text begins with a digit, the ^ will generate a green signal. When the greedy component starts working, it will swallow all of the characters until the second to last character. The $ is the following RegEx element, which means the ending should also be a number, but it is a space character in the text. Here, the RegEx will backtrack and the first (\d+) will release its characters. But the moment the first (\d+) releases its character, it will be consumed by the second (\d+). The second expression will match the second to last character. When the \d does not match the space character, it will hand over to the $. Again the same process will occur, the end of the text should be a digit and not a non-word character. Again (\d+) will release its character, and it will be consumed by the second (\d+).

Next the second (\d+) will release its characters which will be consumed by the third group. This will continue until each group has a single character and no more backtracking can occur. The RegEx will not stop matching until all the possible combinations are tried out.

Combinations due to backtracking

Let’s look at the combination for the text 1234 and the same RegEx as mentioned above.

  • (\d+)[1] (1234)
  • (\d+)[1] (123) (\d+)[2] (4)
  • (\d+)[1] (12) (\d+)[2] (34)
  • (\d+)[1] (12) (\d+)[2] (3) (\d+)[3] (4)
  • (\d+)[1] (1) (\d+)[2] (2) (\d+)[3] (34)
  • (\d+)[1] (1) (\d+)[2] (2) (\d+)[3] (3) (\d+)[4] (4) ...

(\d+)[1] represents the characters held by the first group. (\d+)[2] represents the character held by the second group.

Similarly, before reporting the null outcome, all potential combinations will be tested. In total, the possible ways to break the text into parts is (2n2^{n} - 1) ways. Imagine your text contains 100 digits. This process would come in handy as checking manually would take an enormous amount of time and effort. Word characters will undergo the same procedure of backtracking, creating combination, and matching texts.

To prevent this kind of behavior, make the RegEx specific, avoid nested hierarchy, and a highly complex RegEx because this causes multiplication between the different parts of the expression. If you write a very generalized RegEx, many combinations will be formed that need to be backtracked.

Get hands-on with 1400+ tech skills courses.