Named Entity Recognition with RNNs: Preparing Data

Learn how to use RNNs to identify various entities mentioned in a text corpus.

Now, let’s look at our first task: using an RNN to identify named entities in a text corpus. This task is known as named entity recognition (NER). We’ll be using a modified version of the well-known Conference on Computational Natural Language Learning 2003 (CoNLL 2003) dataset for NER.

CoNLL 2003 is available for multiple languages, and the English data was generated from a Reuters corpus that contains news stories published between August 1996 and August 1997. The dataset we’ll be using is available online and is called CoNLLPP. It’s a more carefully curated version of the original CoNLL 2003 data, which contains annotation errors caused by misreading the context of a word. For example, in the phrase “Chicago won ...”, Chicago was labeled as a location, whereas it’s actually an organization.

Understanding the data

We have defined a function called download_data(), which can be used to download the data. We won’t go into the details of it because it simply downloads several files and places them in a data folder. Once the download finishes, we’ll have three files:

  • data\conllpp_train.txt: A training set that contains 14,041 sentences.

  • data\conllpp_dev.txt: A validation set that contains 3,250 sentences.

  • data\conllpp_test.txt: A test set that contains 3,452 sentences.

Next up, we’ll read the data and convert it into a specific format that suits our model. But before that, we need to see what our data looks like originally:

-DOCSTART- -X- -X- O
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
said VBD B-VP O
...
to TO B-PP O
sheep NN B-NP O
. . O O
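
To make the layout above concrete, here is a minimal sketch of how such lines could be grouped into sentences. It is not the lesson's own loading code: the function name read_conll is hypothetical, and the sketch assumes that the first column of each line is the word, that the last column is the entity label (as the sample suggests), and that the raw file separates sentences with blank lines, as is usual for CoNLL-style files.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, NER tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences (hypothetical helper)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Assumption: a blank line or a -DOCSTART- marker closes the sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # Assumption: first column is the word, last column is the NER tag.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# A few lines copied from the sample, with blank lines between sentences.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
said VBD B-VP O
"""

sentences = read_conll(sample.splitlines())
print(len(sentences))
print(sentences[0][:3])
```

Passing an open file object instead of sample.splitlines() should work the same way, since the function only iterates over lines. The number of sentences it returns can then be compared with the counts listed for each file as a quick sanity check.
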
Conversion of data into specific format

As we can see, the document has a single word in each line, along with the associated tags of that word. These tags are in the following order:

  1. The ...