Combining spaCy Models and Matchers
In this lesson, we'll go through recipes for the entity extraction tasks you'll encounter in your NLP career. All the examples are ready-to-use, real-world recipes. Let's start with number-formatted entities.
Extracting IBAN and account numbers
IBAN and account numbers are two important entity types that occur in finance and banking frequently. We'll learn how to parse them out.
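An IBAN is an international format for bank account numbers. It starts with a two-letter country code and two check digits, followed by the account identifier, which consists only of digits in many countries but may also contain letters. Here are some IBANs from different countries: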
Country | IBAN formatting example |
Belgium | BE71 0961 2345 6769 |
Brazil | BR15 0000 0000 0000 1093 2840 814 P3 |
France | FR76 3000 6000 0112 3456 7890 189 |
Germany | DE91 1000 0000 0123 4567 89 |
Greece | GR96 0810 0010 0000 0123 4567 890 |
Mauritius | MU43 BOMM 0101 1234 5678 9101 000 MUR |
Pakistan | PK70 BANK 0000 1234 5678 9000 |
Poland | PL10 1050 0099 7603 1234 5678 9123 |
Romania | RO09 BCYP 0000 0012 3456 7890 |
Saint Lucia | LC14 BOSL 1234 5678 9012 3456 7890 1234 |
Saudi Arabia | SA44 2000 0001 2345 6789 1234 |
Spain | ES79 2100 0813 6101 2345 6789 |
Switzerland | CH56 0483 5012 3456 7800 9 |
United Kingdom | GB98 MIDL 0700 9312 3456 78 |
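As a quick sanity check on this layout, here is a minimal sketch in plain Python (no spaCy involved) that splits a few of the examples from the table into blocks. The names `examples`, `first_block`, and `report` are made up for this illustration. It tests whether the first block is two capital letters plus two digits, and whether the remaining blocks are all digits.

```python
import re

# A few rows copied from the table above
examples = {
    "Belgium": "BE71 0961 2345 6769",
    "Germany": "DE91 1000 0000 0123 4567 89",
    "Mauritius": "MU43 BOMM 0101 1234 5678 9101 000 MUR",
    "United Kingdom": "GB98 MIDL 0700 9312 3456 78",
}

# Two capital letters followed by two digits
first_block = re.compile(r"[A-Z]{2}\d{2}")

report = {}
for country, iban in examples.items():
    blocks = iban.split()
    head_ok = first_block.fullmatch(blocks[0]) is not None
    rest_all_digits = all(block.isdigit() for block in blocks[1:])
    report[country] = (head_ok, rest_all_digits)
    print(country, head_ok, rest_all_digits)
```

Every first block passes the check, while the Mauritius and United Kingdom rows contain letter blocks, so a digits-only pattern covers the all-numeric formats but not those.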
How can we create a pattern for an IBAN? In every case, we start with two capital letters followed by two digits. In the all-numeric formats, any number of digit blocks can then follow. We can express the country code and the next two digits as follows:
{"SHAPE": "XXdd"}
Here, XX corresponds to two capital letters and dd is two digits. The XXdd pattern then matches the first block of the IBAN perfectly. How about the rest of the digit blocks? For the rest of the blocks, we need to match a block of 1–4 digits. The regex \d{1,4} means a token consisting of 1–4 digits. This pattern will match a digit block:
{"TEXT": {"REGEX": r"\d{1,4}"}}
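To see what these two token-level checks look at, the following sketch prints the shape of each IBAN token and whether it consists of 1–4 digits. It makes two assumptions: a blank English pipeline (`spacy.blank("en")`) stands in for whichever model is loaded, since only the tokenizer is needed, and the digit check uses Python's `re.fullmatch` directly rather than going through the Matcher.

```python
import re
import spacy

# Assumption: a blank pipeline is enough, because shape_ is a lexical
# attribute that does not depend on a trained model.
nlp = spacy.blank("en")
doc = nlp("My IBAN number is BE71 0961 2345 6769, please send the money there.")

digit_block = re.compile(r"\d{1,4}")

# Tokens 4..7 are the four blocks of the IBAN
for token in doc[4:8]:
    is_digit_block = digit_block.fullmatch(token.text) is not None
    print(token.text, token.shape_, is_digit_block)
```

As far as I can tell, spaCy's REGEX operator searches within the token text rather than requiring a full match, so anchoring the expression as `^\d{1,4}$` is one way to insist that the whole token is a digit block.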
We have a number of these blocks, so the pattern to match the digit blocks of an IBAN is as follows:
{"TEXT": {"REGEX": r"\d{1,4}"}, "OP": "+"}
Then, we combine the first block with the rest of the blocks. Let's see the code and the matches:
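import spacy
from spacy.matcher import Matcher

# Only the tokenizer is needed for this pattern, so a blank English pipeline
# is used here; the nlp object of a loaded model works the same way.
nlp = spacy.blank("en")

doc = nlp("My IBAN number is BE71 0961 2345 6769, please send the money there.")
doc1 = nlp("My IBAN number is FR76 3000 6000 0112 3456 7890 189, please send the money there.")

# First block: two capital letters and two digits, then one or more digit blocks
pattern = [{"SHAPE": "XXdd"}, {"TEXT": {"REGEX": r"\d{1,4}"}, "OP": "+"}]
matcher = Matcher(nlp.vocab)
matcher.add("ibanNum", [pattern])

for mid, start, end in matcher(doc):
    print(start, end, doc[start:end])
for mid, start, end in matcher(doc1):
    print(start, end, doc1[start:end])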
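Because the digit-block token carries the "+" operator, the Matcher reports every partial run of blocks as well as the full IBAN. If only the complete number is wanted, one option is the `greedy="LONGEST"` argument of `Matcher.add`. The sketch below is a variant under stated assumptions, not the recipe above: it uses a blank English pipeline, anchors the regex with `^` and `$`, and collects results in a list named `ibans` chosen for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; a loaded model can be used instead
nlp = spacy.blank("en")

pattern = [
    {"SHAPE": "XXdd"},                              # country code + check digits
    {"TEXT": {"REGEX": r"^\d{1,4}$"}, "OP": "+"},   # one or more digit blocks
]

matcher = Matcher(nlp.vocab)
# greedy="LONGEST" keeps the longest of the overlapping matches
matcher.add("ibanNum", [pattern], greedy="LONGEST")

texts = [
    "My IBAN number is BE71 0961 2345 6769, please send the money there.",
    "My IBAN number is FR76 3000 6000 0112 3456 7890 189, please send the money there.",
]

ibans = []
for text in texts:
    doc = nlp(text)
    # as_spans=True returns Span objects instead of (id, start, end) tuples
    for span in matcher(doc, as_spans=True):
        ibans.append(span.text)

print(ibans)
```

The trade-off is that the intermediate matches are dropped, which is usually what an extraction pipeline wants but hides how the "+" operator expands.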
We can always follow a similar strategy when parsing numeric entities: first, divide the entity into some meaningful parts/blocks, then try to determine the shape or the length of the individual blocks.
We successfully parsed IBANs; ...