Syntax Support and Regex
Let's learn about Matcher's extended pattern syntax and its regex-like operators.
Extended syntax support
Matcher lets patterns be more expressive by allowing some operators inside the curly brackets. These operators perform extended comparisons and look similar to Python's in, not in, and comparison operators. Here's the list of operators:
Operator | Value Type | Description |
IN | any | Attribute value is a member of a list |
NOT_IN | any | Attribute value is not a member of a list |
==, >=, <=, >, < | int, float | Attribute value is equal to, greater than or equal to, less than or equal to, greater than, or less than the given value |
Previously, we matched good evening and good morning with two different patterns. Now, we can match both with a single pattern with the help of IN, as follows:
    import spacy
    from spacy.matcher import Matcher
    nlp = spacy.load("en_core_web_md")
    matcher = Matcher(nlp.vocab)
    doc = nlp("Good morning, I'm here. I'll say good evening!!")
    pattern = [{"LOWER": "good"},
               {"LOWER": {"IN": ["morning", "evening"]}},
               {"IS_PUNCT": True}]
    matcher.add("greetings", [pattern])
    matches = matcher(doc)
    for mid, start, end in matches:
        print(start, end, doc[start:end])
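NOT_IN from the operator list works the same way in reverse. The following is a minimal sketch rather than part of the original lesson. It assumes a tokenizer-only pipeline created with spacy.blank("en"), which should be enough here because LOWER depends only on the token text. The sample sentence and the otherGood pattern name are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) instead of
# en_core_web_md, since LOWER needs no trained components.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical sample sentence for illustration.
doc = nlp("Good morning, good evening and good luck!")

# "good" followed by any token whose lowercase form is NOT in the list.
pattern = [{"LOWER": "good"},
           {"LOWER": {"NOT_IN": ["morning", "evening"]}}]
matcher.add("otherGood", [pattern])

not_in_spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(not_in_spans)
```

Only the greeting whose second word falls outside the list should be reported, while good morning and good evening are skipped.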
Comparison operators usually go together with the LENGTH attribute. Here's an example of finding long tokens:
    import spacy
    from spacy.matcher import Matcher
    nlp = spacy.load("en_core_web_md")
    matcher = Matcher(nlp.vocab)
    doc = nlp("I suffered from Trichotillomania when I was in college. The doctor prescribed me Psychosomatic medicine.")
    pattern = [{"LENGTH": {">=": 10}}]
    matcher.add("longWords", [pattern])
    matches = matcher(doc)
    for mid, start, end in matches:
        print(start, end, doc[start:end])
They were fun words to process! Now, we'll move on to another very practical feature of Matcher patterns.
Regex-like operators
At the beginning of the chapter, we pointed out that spaCy's Matcher class offers a cleaner and more readable equivalent to regex operations. The most common regex operations are optional match (?), match at least once (+), and match zero or more times (*). spaCy's Matcher also offers these operators, set through the OP key of a token pattern, with the following syntax:
OP | Description |
! | Negate the pattern, by requiring it to match exactly 0 times |
? | Make the pattern optional, by allowing it to match 0 or 1 times |
+ | Require the pattern to match 1 or more times |
* | Allow the pattern to match zero or more times |
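The operators in the table can be tried out with a short sketch like the one below. It is an illustration under stated assumptions, not the lesson's own code: it uses a tokenizer-only spacy.blank("en") pipeline, a hypothetical helper named find, made-up sample sentences, and the greedy="LONGEST" option of Matcher.add so that overlapping matches collapse to the longest one.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough, because the patterns
# below use only LOWER and IS_PUNCT, which depend on the token text.
nlp = spacy.blank("en")

def find(pattern, text):
    """Hypothetical helper: return matched span texts for one pattern."""
    matcher = Matcher(nlp.vocab)
    # greedy="LONGEST" keeps only the longest of overlapping matches.
    matcher.add("demo", [pattern], greedy="LONGEST")
    doc = nlp(text)
    # Sort by start offset so results follow document order.
    matches = sorted(matcher(doc), key=lambda m: m[1])
    return [doc[start:end].text for _, start, end in matches]

# "?": the punctuation token between the two words is optional.
optional = find(
    [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}],
    "Hello world. Hello, world.",
)

# "+": one or more "very" tokens before "good".
one_or_more = find(
    [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}],
    "That was very very good",
)

# "*": zero or more "very" tokens before "good".
zero_or_more = find(
    [{"LOWER": "very", "OP": "*"}, {"LOWER": "good"}],
    "good and very good",
)

# "!": "good" followed by a token that is not "morning".
negated = find(
    [{"LOWER": "good"}, {"LOWER": "morning", "OP": "!"}],
    "good morning and good evening",
)

for name, spans in [("?", optional), ("+", one_or_more),
                    ("*", zero_or_more), ("!", negated)]:
    print(name, spans)
```

Without the greedy option, Matcher reports every possible match, so a + or * pattern would typically also return the shorter overlapping spans, such as very good inside very very good.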