...

Syntax Support and Regex

Let's learn about regex.

Extended syntax support

Matcher lets patterns be more expressive by accepting operators inside the curly brackets. These operators perform extended comparisons and resemble Python's in, not in, and comparison operators. Here's the list of operators:

| Operator | Value Type | Description |
| --- | --- | --- |
| IN | any | Attribute value is a member of a list |
| NOT_IN | any | Attribute value is not a member of a list |
| ==, >=, <=, >, < | int, float | Attribute value is equal, greater or equal, smaller or equal, greater, or smaller |

Previously, we matched good evening and good morning with two different patterns. Now, we can match good morning/evening with one pattern with the help of IN as follows:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)
doc = nlp("Good morning, I'm here. I'll say good evening!!")

# "good", then "morning" or "evening", then a punctuation token
pattern = [
    {"LOWER": "good"},
    {"LOWER": {"IN": ["morning", "evening"]}},
    {"IS_PUNCT": True},
]
matcher.add("greetings", [pattern])

# Each match is a (match_id, start, end) tuple of token indices
matches = matcher(doc)
for mid, start, end in matches:
    print(start, end, doc[start:end])
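The table also lists NOT_IN, which works the same way with the membership test inverted. The following is a minimal sketch rather than part of the original lesson: the sentence and the pattern name otherGood are made up for illustration, and it assumes a blank English pipeline (spacy.blank("en")) is enough, since LOWER is a tokenizer-level attribute that doesn't need a trained model.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline suffices because only LOWER is used
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "good" followed by any token other than "morning" or "evening"
pattern = [
    {"LOWER": "good"},
    {"LOWER": {"NOT_IN": ["morning", "evening"]}},
]
matcher.add("otherGood", [pattern])  # hypothetical pattern name

doc = nlp("Good morning! Have a good day and a good evening.")
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```

Only the occurrence of "good" that is not followed by one of the listed words should be reported here.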

Comparison operators usually go together with the LENGTH attribute. Here's an example of finding long tokens:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)
doc = nlp("I suffered from Trichotillomania when I was in college. The doctor prescribed me Psychosomatic medicine.")

# Any single token that is at least 10 characters long
pattern = [{"LENGTH": {">=": 10}}]
matcher.add("longWords", [pattern])

matches = matcher(doc)
for mid, start, end in matches:
    print(start, end, doc[start:end])

They were fun words to process! Now, we'll move on to another very practical feature of Matcher patterns.

Regex-like operators

At the beginning of the chapter, we pointed out that spaCy's Matcher class offers a cleaner and more readable equivalent to regex operations. The most common regex operations are optional match (?), match at least once (+), and match zero or more times (*). spaCy's Matcher offers these operators through the OP key, using the following syntax:

| OP | Description |
| --- | --- |
| ! | Negate the pattern by requiring it to match exactly 0 times |
| ? | Make the pattern optional by allowing it to match 0 or 1 times |
| + | Require the pattern to match 1 or more times |
| * | Allow the pattern to match 0 or more times |
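To make the table concrete, here is a small sketch of the ? operator, where OP is placed in the same dictionary as the token attribute it modifies. The sentence and the pattern name helloWorld are illustrative assumptions, and a blank English pipeline is assumed to be sufficient because only tokenizer-level attributes (LOWER, IS_PUNCT) are used.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline suffices for LOWER and IS_PUNCT
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "hello", then an optional punctuation token, then "world"
pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "world"},
]
matcher.add("helloWorld", [pattern])  # hypothetical pattern name

doc = nlp("Hello, world! Hello world!")
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```

The single pattern should cover both the greeting with the comma and the one without it, which would otherwise take two separate patterns.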

...