Training a Pipeline Component From Scratch

Let's see how we create a brand-new NER component.

Previously, we saw how to update the existing NER component with our own data. In this lesson, we will create a brand-new NER component for the medicine domain.

Let's start with a small dataset to understand the training procedure. Then we'll experiment with a real medical NLP dataset. The following sentences belong to the medicine domain and include medical entities such as drug and disease names:

Methylphenidate/DRUG is effectively used in treating children
with epilepsy/DISEASE and ADHD/DISEASE.
Patients were followed up for 6 months.
Antichlamydial/DRUG antibiotics/DRUG may be useful for curing
coronary-artery/DISEASE disease/DISEASE.
Dataset belonging to the medical domain

The following code block shows how to train an NER component from scratch. As we mentioned before, it's better to create our own NER component than to update spaCy's default NER model, because spaCy's NER component does not recognize medical entities at all. Let's walk through the code step by step and compare it with the code from the previous lesson:

  1. In the first three lines, we made the necessary imports. We imported spacy and spacy.training.Example. We also imported random to shuffle our dataset:

import random
import spacy
from spacy.training import Example
  2. We defined our training set of three examples. For each example, we included a sentence and its annotated entities:

# Each example is (text, {"entities": [(start_char, end_char, label), ...]}).
# Disease spans use the "DIS" label so they match the entity list declared below.
train_set = [
    (
        "Methylphenidate is effectively used in treating children with epilepsy and ADHD.",
        {"entities": [(0, 15, "DRUG"), (62, 70, "DIS"), (75, 79, "DIS")]},
    ),
    (
        "Patients were followed up for 6 months.",
        {"entities": []},
    ),
    (
        "Antichlamydial antibiotics may be useful for curing coronary-artery disease.",
        {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]},
    ),
]
A sentence and its annotated entities
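Character offsets are easy to get wrong by one, so it can help to slice a sentence with its annotated offsets and read the result back. The following is a minimal sketch in plain Python. `span_texts` is a hypothetical helper name of our own, not a spaCy function, and the end offset is treated as exclusive, as in Python slicing:

```python
# Hypothetical helper (not a spaCy function): return the text each annotation covers.
def span_texts(text, annotations):
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


sample_text = "Antichlamydial antibiotics may be useful for curing coronary-artery disease."
sample_annotations = {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]}

covered = span_texts(sample_text, sample_annotations)
for covered_text, label in covered:
    # Each line should show a complete entity, with no clipped or extra characters.
    print(f"{label}: {covered_text!r}")
```

If a printed span starts or ends in the middle of a word, the offsets for that example likely need adjusting before training.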
  3. We also listed the entity labels we want to recognize, DIS for disease names and DRUG for drug names:

entities = ["DIS", "DRUG"]
Set of entities we want to recognize
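Because the labels are typed by hand in two places (the annotations and the label list), they can drift apart without any error message. As a minimal sketch, we could compare the two before training. This is plain Python, `undeclared_labels` is a hypothetical helper of our own, and the toy data below is made up for illustration with one deliberately mislabeled span:

```python
# Hypothetical helper (not part of spaCy): labels used in annotations
# but missing from the declared label list.
def undeclared_labels(train_set, entities):
    used = {
        label
        for _, annotations in train_set
        for _, _, label in annotations["entities"]
    }
    return sorted(used - set(entities))


entities = ["DIS", "DRUG"]

# Toy data for illustration; the second example deliberately uses "DISEASE" instead of "DIS".
toy_set = [
    ("Antichlamydial antibiotics may be useful for curing coronary-artery disease.",
     {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]}),
    ("Methylphenidate is used in treating epilepsy.",
     {"entities": [(0, 15, "DRUG"), (36, 44, "DISEASE")]}),
]

missing = undeclared_labels(toy_set, entities)
print(missing)
```

An empty result would mean every label in the annotations is also in the declared list.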
  4. We created a blank model. This differs from the previous section, where we used spaCy's pre-trained English language pipeline:

nlp = spacy.blank("en")
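To see what "blank" means in practice, we can inspect the new pipeline. This is a small sketch that assumes spaCy v3 is installed. No pre-trained model download should be needed, since a blank pipeline only sets up the language's tokenizer and defaults:

```python
import spacy

nlp = spacy.blank("en")

# A blank pipeline has a tokenizer but no trained components yet.
print(nlp.pipe_names)

doc = nlp("Patients were followed up for 6 months.")
tokens = [token.text for token in doc]
print(tokens)

# With no NER component in the pipeline, no entities are predicted.
print(len(doc.ents))
```

The component list and the entity list should both come back empty, which is why an NER component still has to be added and trained on our data.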
    ...