Named Entity Recognition with RNNs: Preparing Data

Discover how to prepare data for named entity recognition tasks using recurrent neural networks. This lesson guides you through reading and understanding the CoNLLPP dataset, handling class imbalance, analyzing sequence lengths, and processing labels with padding and masks to optimize training.

Now, let’s look at our first task: using an RNN to identify named entities in a text corpus. This task is known as named entity recognition (NER). We’ll be using a modified version of the well-known Conference on Computational Natural Language Learning 2003 (CoNLL 2003) dataset for NER.

CoNLL 2003 is available for multiple languages, and the English data was generated from a Reuters corpus that contains news stories published between August 1996 and August 1997. The dataset we’ll be using is available online and is called CoNLLPP. It’s a more carefully curated version of the original CoNLL 2003 data, which contains labeling errors caused by misreading the context of a word. For example, in the phrase “Chicago won ...”, Chicago was labeled as a location, whereas it’s actually an organization.

Understanding the data

We have defined a function called download_data(), which can be used to download the data. We won’t go into the details of it because it simply downloads several files and places them in a data folder. Once the download finishes, we’ll have three files:

  • data\conllpp_train.txt: A training set that contains 14,041 sentences.

  • data\conllpp_dev.txt: A validation set that contains 3,250 sentences.

  • data\conllpp_test.txt: A test set that contains 3,452 sentences.

Next, we’ll read the data and convert it into a format that suits our model. Before that, let’s see what the raw data looks like:

-DOCSTART- -X- -X- O
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
said VBD B-VP O
...
to TO B-PP O
sheep NN B-NP O
. . O O
Conversion of data into specific format
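
To make the raw layout concrete, here is a minimal sketch of how lines like these could be grouped into sentences. It is not the lesson's own code: the function name parse_conll is hypothetical, and the sketch assumes that columns are separated by whitespace, that the first column is the word and the last column is the entity label (as the sample suggests), and that sentences are separated by blank lines or -DOCSTART- markers.

```python
from typing import List, Tuple


def parse_conll(text: str) -> Tuple[List[List[str]], List[List[str]]]:
    """Split CoNLL-style text into per-sentence word lists and label lists.

    Hypothetical helper for illustration only.
    """
    sentences, labels = [], []
    tokens, tags = [], []
    for line in text.splitlines():
        parts = line.split()
        # Assumption: a blank line or a -DOCSTART- marker ends the
        # sentence collected so far.
        if not parts or parts[0] == "-DOCSTART-":
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        tokens.append(parts[0])  # first column: the word
        tags.append(parts[-1])   # last column: the entity label
    if tokens:  # flush the last sentence if there is no trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

The DT B-NP O
European NNP I-NP B-ORG
"""

sentences, labels = parse_conll(sample)
print(sentences)
print(labels)
```

Keeping the words and labels in two parallel lists means each sentence can later be padded and masked together with its label sequence.
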

As we can see, the document has a single word on each line, followed by the tags associated with that word. The tags appear in the following order:

  1. The ...