Named Entity Recognition with RNNs: Preparing Data

Learn how to use RNNs to identify various entities mentioned in a text corpus.

Now, let’s look at our first task: using an RNN to identify named entities in a text corpus. This task is known as named entity recognition (NER). We’ll be using a modified version of the well-known Conference on Computational Natural Language Learning 2003 (CoNLL 2003) dataset for NER.

CoNLL 2003 is available for multiple languages, and the English data was generated from a Reuters corpus that contains news stories published between August 1996 and August 1997. The dataset we’ll be using is available online and is called CoNLLPP. It’s a more carefully curated version of the original CoNLL 2003 data, which contains labeling errors caused by misreading the context of a word. For example, in the phrase “Chicago won ...”, Chicago was labeled as a location, whereas it actually refers to an organization.

Understanding the data

We have defined a function called download_data(), which can be used to download the data. We won’t go into the details of it because it simply downloads several files and places them in a data folder. Once the download finishes, we’ll have three files:

  • data\conllpp_train.txt: A training set that contains 14,041 sentences.

  • data\conllpp_dev.txt: A validation set that contains 3,250 sentences.

  • data\conllpp_test.txt: A test set that contains 3,452 sentences.
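As a quick sanity check on the downloaded files, the sentence counts can be verified with a short script. The sketch below is not part of the lesson's code: count_sentences() is a hypothetical helper, and it assumes the usual CoNLL layout, where each token sits on its own line, sentences are separated by blank lines, and lines starting with -DOCSTART- mark document boundaries rather than sentences.

```python
import os


def count_sentences(path):
    """Count sentences in a CoNLL-style file.

    Assumes blank lines separate sentences and that lines starting
    with -DOCSTART- are document markers, not sentence content.
    """
    n_sentences = 0
    in_sentence = False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                # A separator closes the sentence we were reading, if any.
                if in_sentence:
                    n_sentences += 1
                in_sentence = False
            else:
                in_sentence = True
    # The last sentence may not be followed by a blank line.
    if in_sentence:
        n_sentences += 1
    return n_sentences


if __name__ == "__main__":
    for name in ("conllpp_train.txt", "conllpp_dev.txt", "conllpp_test.txt"):
        path = os.path.join("data", name)
        if os.path.exists(path):
            print(name, count_sentences(path))
        else:
            print(name, "not found; run download_data() first")
```

If the layout assumptions hold, the printed counts should agree with the figures listed above; a mismatch would suggest an incomplete download or a different file layout.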

Next up, we’ll read the data and convert it into a specific format that suits our model. But before that, we need to see what our data looks like originally:
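A minimal way to take that first look is to print the first few raw lines of the training file. This is only a sketch: peek_file() is a hypothetical helper rather than a function from the lesson, and the path assumes the files were placed in the data folder as described above.

```python
import os
from itertools import islice


def peek_file(path, n_lines=10):
    """Return the first n_lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n_lines, so the whole file is never loaded.
        return [line.rstrip("\n") for line in islice(f, n_lines)]


if __name__ == "__main__":
    train_path = os.path.join("data", "conllpp_train.txt")
    if os.path.exists(train_path):
        for line in peek_file(train_path, n_lines=15):
            print(line)
    else:
        print("Training file not found; run download_data() first")
```

Reading only the first few lines keeps the check fast and is enough to reveal how tokens and labels are laid out before writing a parser.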
