Traditional NER and Relationship Extraction Techniques

Traditional methods for entity and relationship extraction include:

  • Heuristic-based methods: These methods rely on predefined rules and patterns to identify entities and relationships within text.

  • Statistical and machine learning methods: These methods identify entities and relationships based on patterns learned from labeled data.

Named entity recognition (NER)

Named entity recognition involves identifying and categorizing entities such as names, locations, and dates from raw text.

Extracting entities from plain text

Heuristic-based methods

Two common heuristic-based methods for entity recognition are pattern matching and dictionary lookup.

Pattern matching uses regular expressions to identify entities based on predefined patterns. This method is useful when we already have a clear idea of the patterns that entities follow. For example, if we want to extract the date entities from the text, we can use this method, as we know dates often follow specific patterns like “MM/DD/YYYY” or “Month Day, Year.”

import re
# Sample text
text = "John's birthday is on 08/15/1990 and his anniversary is on August 15, 1990."
# Define a regular expression pattern for dates in MM/DD/YYYY format
pattern_mmddyyyy = r'\b\d{2}/\d{2}/\d{4}\b'
# Define a regular expression pattern for dates in Month Day, Year format
# (\d{1,2} also accepts single-digit days such as "August 5, 1990")
pattern_monthdayyear = r'\b[A-Za-z]+ \d{1,2}, \d{4}\b'
# Find all matches for both patterns
dates_mmddyyyy = re.findall(pattern_mmddyyyy, text)
dates_monthdayyear = re.findall(pattern_monthdayyear, text)
print("Date in MM/DD/YYYY format:", dates_mmddyyyy)
print("Date in Month Day, Year format:", dates_monthdayyear)

While pattern matching is effective for entities with known, consistent formats (like dates), it becomes impractical across a wide variety of entity types and diverse text. Entity types such as names, locations, or organizations rarely follow a single, predictable pattern, so it is difficult to write regular expressions that cover every variation.

Consider defining a regular expression for person names. We would first have to decide which formats names follow. Does a person’s name always start with a capital letter? If a name is written in lowercase, should we stop treating it as a name? Is a name limited to a single word, or can it span several? A name may also include special characters or be accompanied by titles such as “Dr.” or “Eng.” And how do we distinguish a person’s name from an address or a location name, when both can start with capital letters and have similar lengths? Finding a universal pattern for such entities is very challenging, which makes accurate entity extraction with regular expressions alone difficult.
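To make these questions concrete, here is a minimal sketch of one naive rule for person names. The pattern and the sample sentence are assumptions made up for illustration, not a recommended name matcher. The rule treats any run of capitalized words, optionally preceded by “Dr.” or “Eng.”, as a name.

```python
import re

# Hypothetical naive rule: a person name is one or more capitalized words,
# optionally preceded by a title such as "Dr." or "Eng."
name_pattern = re.compile(
    r'\b(?:(?:Dr|Eng)\.\s)?[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b'
)

# Made-up sample text with one capitalized name and one lowercase name
text = "Dr. Alice Brown met john smith in New York on Monday."

candidates = name_pattern.findall(text)
print(candidates)
# ['Dr. Alice Brown', 'New York', 'Monday']
```

The rule finds “Dr. Alice Brown” but misses “john smith” because it is written in lowercase, and it wrongly returns “New York” and “Monday” because they share the same surface form as a name.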

Dictionary lookup involves using a predefined list of entities (a dictionary) to identify entities in the text. This method is useful when we have a specific list of terms we want to extract, such as names of cities or companies.

# Sample text
text = "John has visited Paris, New York, and London this year."
# List of known city names
city_dictionary = ['Paris', 'New York', 'London', 'Berlin', 'Tokyo']
# Find and extract city names from the text
extracted_cities = [city for city in city_dictionary if city in text]
print("Extracted cities:", extracted_cities)

While dictionary lookup works well when the entities come from a predefined list, it struggles with variations, misspellings, and newly introduced entities that have not yet been cataloged. Moreover, a dictionary lookup ignores the context in which a term appears, which can lead to false positives or missed entities when a word is ambiguous or used in a different sense. For example, suppose the dictionary includes “Apple” as a known entity referring to the technology company. In the sentence “She ate an apple after lunch,” the word “apple” refers to the fruit, not the company. A case-insensitive lookup would nevertheless match it as the company, resulting in a false positive. A case-sensitive lookup (like the one above) avoids this particular match, but it misses the company whenever its name is typed in lowercase.
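The trade-off can be sketched in a few lines. The helper name `lookup`, the dictionary contents, and the second sample sentence are assumptions introduced for illustration; the sketch only shows that neither case setting resolves the ambiguity.

```python
import re

# Hypothetical dictionary of known company names
company_dictionary = ["Apple", "Google"]

def lookup(text, dictionary, ignore_case):
    """Return dictionary entries that appear in text as whole words."""
    flags = re.IGNORECASE if ignore_case else 0
    found = []
    for entry in dictionary:
        # \b keeps "Apple" from matching inside longer words such as "Pineapple"
        if re.search(r'\b' + re.escape(entry) + r'\b', text, flags):
            found.append(entry)
    return found

fruit_sentence = "She ate an apple after lunch."
company_sentence = "apple released a new phone."  # company name typed in lowercase

# Case-insensitive: the fruit is reported as the company (false positive)
false_positive = lookup(fruit_sentence, company_dictionary, ignore_case=True)
# Case-sensitive: the fruit is skipped, but so is the lowercase company name
no_match = lookup(fruit_sentence, company_dictionary, ignore_case=False)
missed = lookup(company_sentence, company_dictionary, ignore_case=False)

print(false_positive, no_match, missed)
# ['Apple'] [] []
```

In both settings the lookup decides from the surface form alone, so telling the fruit from the company requires information about the surrounding words.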

CRFs: An improvement over heuristic-based methods

Conditional random fields (CRFs) are probabilistic graphical models trained on labeled data to learn the conditional dependencies between words and their corresponding labels. They work by:

  • Sequence labeling: CRFs treat NER as a sequence labeling problem. Instead of labeling each word in isolation, a CRF considers the entire sequence of words in a sentence when predicting the label of each word. For example, in the sentence “John lives in Colombia,” if “John” is labeled as PERSON, the model takes into account that a person’s name is typically followed by a word that isn’t a named entity (like “lives”). This reinforces labeling “lives” as OTHER (O). Using context in this way is important for accurately identifying entities.

  • Feature engineering: CRFs use features such as word capitalization, part-of-speech tags, and neighboring words to make predictions. These features are crafted by experts based on linguistic knowledge. For example, in the sentence “Barack Obama was born in Hawaii,” the CRF might use the capitalization of “Barack” and “Obama,” the fact that “Barack” is followed by “Obama,” and the presence of the verb “was” to identify “Barack Obama” as a PERSON.

  • Probabilistic model: CRFs model the probability of a sequence of labels given the input text. They consider not only the likelihood of each individual label but also how labels relate to each other. For example, in the sentence “John lives in New York,” “New” might have a 70% chance of being part of a LOCATION, and “York” might have a 90% chance of being part of a LOCATION given that “New” is one.

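The feature engineering step can be sketched as a function that turns each token into a dictionary of hand-crafted features. The function name `token_features` and the feature names are assumptions chosen for illustration; CRF libraries such as sklearn-crfsuite accept per-token feature dictionaries of this general kind, but this sketch does not train a model.

```python
# A minimal sketch of hand-crafted token features for a CRF-style tagger.
# The function and feature names are hypothetical, chosen for illustration.

def token_features(tokens, i):
    """Describe token i using its own shape and its immediate neighbors."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization cue
        "word.isdigit": word.isdigit(),   # useful for years and dates
        # Neighboring words give the model context; markers pad the edges
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", "."]
features = [token_features(tokens, i) for i in range(len(tokens))]

print(features[0])
# {'word.lower': 'barack', 'word.istitle': True, 'word.isdigit': False,
#  'prev.lower': '<BOS>', 'next.lower': 'obama'}
```

A CRF would learn a weight for each feature and label combination, along with weights for transitions between neighboring labels. Note that the capitalization feature is false for a lowercase token such as “barack,” so the model loses that cue on lowercase text.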

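Below is an example that uses part-of-speech tags as features to identify entities. NLTK’s built-in chunker is a maximum entropy model rather than a CRF, but it depends on the same kind of hand-crafted features, so it illustrates the feature-based approach: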

import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')

# Function to extract named entities from text
def extract_entities(text):
  tokens = nltk.word_tokenize(text)
  pos_tags = nltk.pos_tag(tokens)
  named_entities = nltk.ne_chunk(pos_tags)
  return named_entities
  
# Sample raw text
raw_text = '''Sarah is an avid traveler who recently visited New 
York City. During her trip, she saw the Statue of Liberty, which was
designed by Frédéric Auguste Bartholdi and completed in 1886.
Sarah also visited the Empire State Building, which was completed
in 1931 and was designed by Shreve, Lamb & Harmon. 
Sarah took a memorable photo in front of the Brooklyn Bridge, 
which was designed by John A. Roebling and completed in 1883. 
She also visited Central Park, a large public park in New York City.'''
named_entities = extract_entities(raw_text)
print("\n******* Extracted entities and their types ********\n")
for chunk in named_entities:
  if hasattr(chunk, 'label'):
    entity_name = ' '.join(c[0] for c in chunk)
    entity_type = chunk.label()
    print(f'({entity_name}, {entity_type})')
Entity extraction with NLTK’s feature-based named entity chunker

Note: GPE in the output stands for geopolitical entity.

  • Lines 1–4: We import the nltk library, a natural language processing toolkit, for tokenizing, tagging, parsing, and other text processing functions.

    • The punkt_tab tokenizer data is downloaded and used to split the text into words.

    • averaged_perceptron_tagger_eng assigns part-of-speech (POS) tags to each word in a text. For example, it can label words as nouns, verbs, adjectives, etc.

    • maxent_ne_chunker_tab is a maximum entropy named entity chunker. It breaks down text into “chunks” and recognizes named entities such as people, organizations, and locations (e.g., “New York City”). This model is essential for extracting named entities from text, like “Barack Obama” (PERSON) or “Google” (ORGANIZATION).

  • Line 8: The text is tokenized into words using word_tokenize.

  • Line 9: Each token is assigned a POS tag using pos_tag.

  • Line 10: The ne_chunk function is used to detect named entities based on the POS tags. The output is a tree of chunks, where each named entity is a subtree labeled with its entity type.

While CRFs are more powerful than heuristic-based methods, they have some limitations, which include the following:

  • Manual feature engineering: We need to manually design features like capitalization or word position to help the CRF model understand the text.

  • Struggle with complex patterns: CRFs might miss out on complex language patterns if the features aren’t comprehensive.

Deep learning models: An improvement over CRFs

Deep learning models improve further on CRF-based models for NER. Unlike CRFs, they learn features automatically from raw text, without manual feature engineering. With spaCy, a natural language processing library, we can load pretrained models for common natural language processing tasks. Below, we load en_core_web_sm, a small pretrained English pipeline, and use it to classify entities in the text, such as people, organizations, locations, and dates.

import spacy
nlp = spacy.load("en_core_web_sm")

# Function to extract named entities from text
def extract_entities(text):
  doc = nlp(text)
  entities = [(ent.text, ent.label_) for ent in doc.ents]
  return entities

# Sample raw text
raw_text = '''Sarah is an avid traveler who recently visited New 
York City. During her trip, she saw the Statue of Liberty, which was
designed by Frédéric Auguste Bartholdi and completed in 1886.
Sarah also visited the Empire State Building, which was completed
in 1931 and was designed by Shreve, Lamb & Harmon. 
Sarah took a memorable photo in front of the Brooklyn Bridge, 
which was designed by John A. Roebling and completed in 1883. 
She also visited Central Park, a large public park in New York City.'''
entities = extract_entities(raw_text)

print("\n******* Extracted entities and their types ********\n")
for ent_obj in entities:
  print(ent_obj)
Entity extraction with deep learning model using spaCy
  • Line 2: We load the en_core_web_sm model with spaCy. It returns a Language object that encapsulates the entire natural language processing pipeline.

  • Line 6: Using the nlp Language object, we convert the input text into a doc object. This object contains the parsed text with various annotations, such as tokenization, part-of-speech tags, and named entities.

  • Line 7: We iterate through the named entities identified in the doc object (doc.ents). For each entity, we extract the text of the entity (ent.text) and its corresponding label (ent.label_). We store the extracted entities and their labels as tuples in the entities list.

We can see the difference between the CRF-style, feature-based method and the deep learning-based method through an example. If the text contains a geopolitical entity written in lowercase, such as “new york city,” the feature-based method doesn’t identify it as an entity, while the deep learning method does. Even if we put effort into manually designing features for feature-based entity extraction, we can only identify entities that conform to the features we designed. In comparison, deep learning models learn to recognize entities from large datasets, which helps them generalize better and identify entities even when capitalization or structure varies. This allows deep learning models to achieve higher accuracy in entity recognition, especially in diverse and unstructured text.

The sample text contains “New York City” twice. To see the difference, change both occurrences to lowercase and rerun the two examples.
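One way to make that change in code is sketched below, using a shortened copy of the sample text. Because the first occurrence is split across a line break, a plain string replacement would miss it, so the sketch uses a regular expression that allows any whitespace between the words. The variable names are assumptions for illustration.

```python
import re

# Shortened copy of the sample text; the first "New York City" spans two lines
raw_text = '''Sarah is an avid traveler who recently visited New
York City. She also visited Central Park, a large public park in New York City.'''

# \s+ matches spaces and line breaks, so both occurrences are replaced
lowercased_text, replacements = re.subn(
    r'New\s+York\s+City', 'new york city', raw_text
)

print(replacements)
# 2
print(lowercased_text)
```

Passing `lowercased_text` to each `extract_entities` function in place of `raw_text` then lets us compare how the two approaches handle the lowercase name.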

Relationship extraction: Heuristic-based method

Relationship extraction involves identifying relationships between entities.

Identifying relationships between entities

Heuristic-based methods for relationship extraction use predefined rules and patterns to identify relationships between entities in text. They rely on grammatical structures and linguistic cues to understand how entities are connected. In the example below, we infer relationship tuples from the grammatical structure of each sentence. We split the text into sentences using spaCy, then inspect the dependency labels of the words in each sentence to build (subject, verb, object) tuples.

import spacy
nlp = spacy.load("en_core_web_sm")

# Function to extract relationships from text
def extract_relationships(text):
  doc = nlp(text)
  relationships = []
  for sent in doc.sents:
    # Find subject and object in the sentence
    subjects = [token.text for token in sent if token.dep_ in ("nsubj", "nsubjpass")]
    objects = [token.text for token in sent if token.dep_ in ("dobj", "attr")]
    # Match relationships based on dependency parsing
    for subject in subjects:
      for obj in objects:
        if subject and obj:
          verb = [token.lemma_ for token in sent if token.dep_ == "ROOT"]
          if verb:
            relationship = (subject, verb[0], obj)
            relationships.append(relationship)
  return relationships

# Sample raw text
raw_text = '''Sarah is an avid traveler who recently visited New 
York City. During her trip, she saw the Statue of Liberty, which was
designed by Frédéric Auguste Bartholdi and completed in 1886.
Sarah also visited the Empire State Building, which was completed
in 1931 and was designed by Shreve, Lamb & Harmon. 
Sarah took a memorable photo in front of the Brooklyn Bridge, 
which was designed by John A. Roebling and completed in 1883. 
She also visited Central Park, a large public park in New York City.'''
relationships = extract_relationships(raw_text)
print("Relationships:")
for relationship in relationships:
    print(relationship)
Relationship extraction using dependency parsing and pattern matching
  • Lines 10–11: For each sentence in the text, we identify subjects (who or what is doing the action) and objects (who or what is receiving the action). This is done by checking specific grammatical roles (tags) that words can have.

  • Lines 13–19: We look for relationships by pairing each subject with each object and finding the main verb of the sentence (the action that connects the subject and object). We capture the relationship as a tuple (subject, verb, object) and append it to the relationships list.
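The pairing logic can be traced without a parser. The sketch below applies the same rule to a single sentence whose dependency labels are assigned by hand. These labels are an assumption for illustration and may differ from what spaCy produces for the same sentence.

```python
# Hand-assigned (token, dependency label, lemma) triples for one sentence.
# These labels are illustrative assumptions, not actual spaCy output.
sentence = [
    ("Sarah", "nsubj", "Sarah"),
    ("took", "ROOT", "take"),
    ("a", "det", "a"),
    ("photo", "dobj", "photo"),
    ("of", "prep", "of"),
    ("the", "det", "the"),
    ("bridge", "pobj", "bridge"),
    (",", "punct", ","),
    ("which", "nsubjpass", "which"),
    ("was", "auxpass", "be"),
    ("designed", "relcl", "design"),
    ("by", "agent", "by"),
    ("Roebling", "pobj", "Roebling"),
    (".", "punct", "."),
]

# Same selection rule as extract_relationships
subjects = [tok for tok, dep, _ in sentence if dep in ("nsubj", "nsubjpass")]
objects = [tok for tok, dep, _ in sentence if dep in ("dobj", "attr")]
root_verbs = [lemma for _, dep, lemma in sentence if dep == "ROOT"]

# Every subject is paired with every object through the single ROOT verb
relationships = [
    (s, root_verbs[0], o) for s in subjects for o in objects if root_verbs
]
print(relationships)
# [('Sarah', 'take', 'photo'), ('which', 'take', 'photo')]
```

Under these labels, the rule produces a spurious tuple for “which” and never connects “Roebling” to “designed,” because only the ROOT verb is used and objects of prepositions are ignored. It also keeps single tokens only, so a multiword entity such as “Brooklyn Bridge” would be reduced to one word.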

As expected, the results are poor. Many important relationships are missing, and the ones that are identified are also inaccurate. In the next lesson, we'll learn how to leverage large language models to find entities and relationships better than all these methods.

Quiz: Traditional NER and Relationship Extraction Techniques

1. Which method uses regular expressions to identify entities in a text?

  A) Pattern matching
  B) Dictionary lookup
  C) Conditional random fields (CRFs)
  D) Deep learning models