...

Combining spaCy Models and Matchers

Let's go over some recipes that will guide us through entity extraction.

In this lesson, we'll work through recipes for the entity extraction types you'll encounter in your NLP career. All the examples are ready-to-use, real-world recipes. Let's start with number-formatted entities.

Extracting IBAN and account numbers

IBANs and account numbers are two important entity types that occur frequently in finance and banking. We'll learn how to extract them.

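An IBAN is an international format for bank account numbers. It consists of a two-letter country code followed by two check digits and then a country-specific account number, which is mostly numeric but may contain letters in some countries. Here are some IBANs from different countries: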

Country           IBAN formatting example
Belgium           BE71 0961 2345 6769
Brazil            BR15 0000 0000 0000 1093 2840 814 P3
France            FR76 3000 6000 0112 3456 7890 189
Germany           DE91 1000 0000 0123 4567 89
Greece            GR96 0810 0010 0000 0123 4567 890
Mauritius         MU43 BOMM 0101 1234 5678 9101 000 MUR
Pakistan          PK70 BANK 0000 1234 5678 9000
Poland            PL10 1050 0099 7603 1234 5678 9123
Romania           RO09 BCYP 0000 0012 3456 7890
Saint Lucia       LC14 BOSL 1234 5678 9012 3456 7890 1234
Saudi Arabia      SA44 2000 0001 2345 6789 1234
Spain             ES79 2100 0813 6101 2345 6789
Switzerland       CH56 0483 5012 3456 7800 9
United Kingdom    GB98 MIDL 0700 9312 3456 78

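How can we create a pattern for an IBAN? In all cases, an IBAN starts with two capital letters followed by two digits. In the all-numeric formats (Belgium, France, Germany, and so on), any number of digit blocks can follow; formats that also contain letter blocks, such as the United Kingdom's, are not covered by the pattern we build here. We can express the country code and the next two digits as follows: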

{"SHAPE": "XXdd"}
Expressing the country code
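
To check that a token really has the shape we expect, we can inspect its shape_ attribute directly. The following is a minimal sketch, assuming a blank English pipeline created with spacy.blank("en"); that pipeline is not part of the original lesson, but it is enough for tokenization and lexical attributes such as the shape.

```python
import spacy

# Assumption: a blank English pipeline; only the tokenizer is needed here.
nlp = spacy.blank("en")
doc = nlp("BE71 0961 IBAN")

# shape_ maps uppercase letters to X, lowercase letters to x, and digits to d.
shapes = {token.text: token.shape_ for token in doc}
for text, shape in shapes.items():
    print(text, shape)
```

Only the first token has the shape XXdd, so the SHAPE pattern singles out the opening block of the IBAN and ignores both the digit blocks and an ordinary uppercase word.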

Here, XX corresponds to two capital letters and dd to two digits, so the XXdd pattern matches the first block of the IBAN exactly. How about the rest of the blocks? Each of them is a block of 1–4 digits, which the regex \d{1,4} expresses. Note that spaCy looks for the regex anywhere in the token text, so add the anchors ^ and $ if the whole token must consist of 1–4 digits. This pattern will match a digit block:

{"TEXT": {"REGEX": r"\d{1,4}"}}
Pattern to match the digit block
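
A quick way to see which tokens this regex accepts is to try it on a few strings with Python's re module. The sketch below assumes that spaCy applies REGEX with search semantics, as re.search does; the token list and the anchored variant are illustrative and not part of the original pattern.

```python
import re

# Hypothetical token texts, as the tokenizer might produce them.
tokens = ["0961", "89", "12345", "BE71", "money"]

unanchored = re.compile(r"\d{1,4}")   # the regex used in the lesson
anchored = re.compile(r"^\d{1,4}$")   # stricter variant, shown for comparison

# search() succeeds if the pattern occurs anywhere in the string.
loose = [t for t in tokens if unanchored.search(t)]
strict = [t for t in tokens if anchored.search(t)]

print(loose)
print(strict)
```

The unanchored regex accepts every token that contains a digit, while the anchored one keeps only tokens made up entirely of 1–4 digits. For the example sentences in this lesson both variants find the same IBANs, so the choice only matters on noisier text.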

We have a number of these blocks, so the pattern to match the digit blocks of an IBAN is as follows:

{"TEXT": {"REGEX": r"\d{1,4}"}, "OP": "+"}
Pattern to match multiple digit blocks of an IBAN

Then, we combine the first block with the rest of the blocks. Let's see the code and the matches:

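import spacy
from spacy.matcher import Matcher

# Assumption: en_core_web_sm is installed; any English pipeline works here.
nlp = spacy.load("en_core_web_sm")
doc = nlp("My IBAN number is BE71 0961 2345 6769, please send the money there.")
doc1 = nlp("My IBAN number is FR76 3000 6000 0112 3456 7890 189, please send the money there.")
# First block (country code + two digits), then one or more digit blocks.
pattern = [{"SHAPE": "XXdd"}, {"TEXT": {"REGEX": r"\d{1,4}"}, "OP": "+"}]
matcher = Matcher(nlp.vocab)
matcher.add("ibanNum", [pattern])
# Each match is a (match_id, start, end) triple of token indices.
for mid, start, end in matcher(doc):
    print(start, end, doc[start:end])
for mid, start, end in matcher(doc1):
    print(start, end, doc1[start:end])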
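
Because of the + operator, the matcher can report shorter partial spans in addition to the full IBAN. If only the longest span is wanted, one option is spaCy's filter_spans utility. The following is a sketch, assuming a blank English pipeline; longest_ibans is a hypothetical helper name introduced only for this illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")  # assumption: only the tokenizer is needed
matcher = Matcher(nlp.vocab)
pattern = [{"SHAPE": "XXdd"}, {"TEXT": {"REGEX": r"\d{1,4}"}, "OP": "+"}]
matcher.add("ibanNum", [pattern])


def longest_ibans(text):
    """Return the text of the longest non-overlapping matches (hypothetical helper)."""
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # filter_spans drops overlapping spans, preferring the longest one.
    return [span.text for span in filter_spans(spans)]


ibans = longest_ibans("My IBAN number is BE71 0961 2345 6769, please send the money there.")
print(ibans)
```

Filtering after matching keeps the pattern itself unchanged, which is convenient when the same matcher holds several patterns.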

We can follow a similar strategy when parsing any numeric entity: first, divide the entity into meaningful blocks, then determine the shape or the length of each individual block.

We successfully parsed IBANs; ...