Entity Extraction

Let's see how we will extract the entities that our chatbot will use.

We'll now implement the first step of our chatbot NLU pipeline and extract entities from the dataset utterances. The following are the entities marked in our dataset:

city
date
time
phone_number
cuisine
restaurant_name
street_address

To extract the entities, we'll use the spaCy NER model and the spaCy Matcher class. Let's get started by extracting the city entities.

Extracting city entities

We'll first extract the city entities. We'll get started by recalling some information about the spaCy NER model and entity labels:

  • First, recall that the spaCy named entity label for cities and countries is GPE. Let's ask spaCy to explain what the GPE label corresponds to once again:

import spacy
nlp = spacy.load("en_core_web_md")
print(spacy.explain("GPE"))
  • Secondly, we also recall that we can access entities of a Doc object via the ents property. We can find all entities in an utterance that are labeled by the spaCy NER model as follows:

import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("Can you please confirm that you want to book a table for 2 at 11:30 am at the Bird restaurant in Palo Alto for today")
# Print all entity spans found by the NER model
print(doc.ents)
# Print each entity's text alongside its label
for ent in doc.ents:
    print(ent.text, ent.label_)

In this code segment, we list all named entities of the utterance by calling doc.ents and then examine each entity's label via ent.label_. Examining the output, we see that this utterance contains five entities: one CARDINAL entity (2), one TIME entity (11:30 am), one PRODUCT entity (Bird, which is not an ideal label for a restaurant), one GPE entity (Palo Alto), and one DATE entity (today). The GPE entity is what we're looking for; Palo Alto is a city in the US and hence is labeled by the spaCy NER model as GPE.
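To keep only the city mentions, we can filter doc.ents on the GPE label. The following is a minimal sketch rather than a definitive implementation: the names extract_cities and demo are our own and not part of spaCy, and the helper only assumes its argument exposes ents whose items have text and label_ attributes, as a spaCy Doc does.

```python
def extract_cities(doc):
    """Return the text of every entity labeled GPE in a spaCy Doc.

    Hypothetical helper (not a spaCy API): it only reads doc.ents,
    ent.text, and ent.label_.
    """
    return [ent.text for ent in doc.ents if ent.label_ == "GPE"]


def demo():
    # Imported here so the helper above can be reused without loading a model.
    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp(
        "Can you please confirm that you want to book a table for 2 "
        "at 11:30 am at the Bird restaurant in Palo Alto for today"
    )
    # Only the GPE entities survive the filter.
    print(extract_cities(doc))
```

Calling demo() runs the filter on the same utterance as above. Given the labels reported for that utterance, Palo Alto should be the only entity that passes the filter.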

The code below outputs all the utterances that include a city entity together with the city entities. From the output of this script, we can see that the spaCy NER model performs very well on this corpus ...
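A script along these lines might look like the following sketch. It assumes the utterances are available as a plain list of strings; the two sample utterances and the function names are illustrative and are not taken from the dataset.

```python
def utterances_with_cities(docs):
    """Yield (utterance text, list of GPE entity texts) for each Doc
    that contains at least one GPE entity.

    Hypothetical helper: docs is any iterable of spaCy Doc objects.
    """
    for doc in docs:
        cities = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
        if cities:
            yield doc.text, cities


def demo():
    # Imported here so the helper above can be used without loading a model.
    import spacy

    nlp = spacy.load("en_core_web_md")
    # Illustrative utterances only; replace with the dataset utterances.
    utterances = [
        "I am looking for a restaurant in Palo Alto",
        "What time would you like the reservation?",
    ]
    # nlp.pipe processes the texts as a stream, which is more efficient
    # than calling nlp on each utterance separately.
    for text, cities in utterances_with_cities(nlp.pipe(utterances)):
        print(text, cities)
```

Utterances without a GPE entity are skipped, so the printed lines show only the utterances where the NER model found a city.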