Training a Pipeline Component From Scratch
Let's see how we create a brand-new NER component.
Previously, we saw how to update the existing NER component according to our data. In this lesson, we will create a brand-new NER component for the medicine domain.
Let's start with a small dataset to understand the training procedure. Then we'll experiment with a real medical NLP dataset. The following sentences belong to the medicine domain and include medical entities such as drug and disease names:
Methylphenidate/DRUG is effectively used in treating children with epilepsy/DISEASE and ADHD/DISEASE.
Patients were followed up for 6 months.
Antichlamydial/DRUG antibiotics/DRUG may be useful for curing coronary-artery/DISEASE disease/DISEASE.
The following code block shows how to train an NER component from scratch. As we mentioned before, it's better to create our own NER component than to update spaCy's default NER model, because spaCy's NER component does not recognize medical entities at all. Let's look at the code and compare it with the code we wrote previously. We'll go step by step:
In the first three lines, we made the necessary imports. We imported spacy and spacy.training.Example. We also imported random to shuffle our dataset:
import random
import spacy
from spacy.training import Example
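The random import is there so the examples can be reshuffled between passes over the data. A minimal sketch of that idea follows, using placeholder strings instead of real training examples. The seed and the number of passes are assumptions chosen only for illustration:

```python
import random

random.seed(0)  # assumption: a fixed seed, only to make runs repeatable
examples = ["example A", "example B", "example C"]  # placeholder data

orders = []
for epoch in range(3):  # assumption: three passes over the data
    random.shuffle(examples)  # reorders the list in place
    orders.append(list(examples))  # keep a copy of this pass's order
    print(epoch, examples)
```

Each pass sees the same examples, just in a different order.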
We defined our training set of three examples. For each example, we included a sentence and its annotated entities:
# Each entity is (start_char, end_char, label); end_char is exclusive.
train_set = [
    ("Methylphenidate is effectively used in treating children with epilepsy and ADHD.",
     {"entities": [(0, 15, "DRUG"), (62, 70, "DIS"), (75, 79, "DIS")]}),
    ("Patients were followed up for 6 months.",
     {"entities": []}),
    ("Antichlamydial antibiotics may be useful for curing coronary-artery disease.",
     {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]}),
]
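Each annotation is a triple of start offset, end offset, and label, where the offsets count characters and the end is exclusive. A quick way to sanity-check hand-written offsets is to slice the sentence and look at what comes back. The following is a minimal sketch in plain Python; the helper name show_spans is hypothetical and not part of spaCy:

```python
# Hypothetical helper (not a spaCy API): slice a sentence by its
# annotated character offsets to see the text each span covers.
def show_spans(text, annotations):
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]

text = "Antichlamydial antibiotics may be useful for curing coronary-artery disease."
annotations = {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]}

for span_text, label in show_spans(text, annotations):
    print(f"{label}: {span_text!r}")
# DRUG: 'Antichlamydial antibiotics'
# DIS: 'coronary-artery disease'
```

If a slice starts or ends in the middle of a word, the offsets are off and should be corrected before training.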
We also listed the entity labels we want to recognize, DIS for disease names and DRUG for drug names:
entities = ["DIS", "DRUG"]
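Because the labels in the annotations and the labels in this list are typed by hand, they can drift apart, for example DISEASE in one place and DIS in another. A minimal sketch of a consistency check in plain Python follows; unknown_labels is a hypothetical helper name, and the two-sentence sample is a shortened stand-in for the training set:

```python
def unknown_labels(train_set, entities):
    """Return labels used in the annotations but missing from `entities`."""
    used = {label
            for _text, annotations in train_set
            for _start, _end, label in annotations["entities"]}
    return sorted(used - set(entities))

entities = ["DIS", "DRUG"]
sample = [  # shortened stand-in for the lesson's training set
    ("Patients were followed up for 6 months.", {"entities": []}),
    ("Antichlamydial antibiotics may be useful for curing coronary-artery disease.",
     {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]}),
]

print(unknown_labels(sample, entities))  # prints []
```

An empty result means every annotated label also appears in the entity list.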
We created a blank model, that is, an English pipeline with a tokenizer but no trained components. This differs from the previous section, where we used spaCy's pre-trained English language pipeline:
nlp = spacy.blank("en")