Named Entity Recognition with RNNs: Preparing Data

Discover how to prepare data for named entity recognition tasks using recurrent neural networks. This lesson guides you through reading and understanding the CoNLLPP dataset, handling class imbalance, analyzing sequence lengths, and processing labels with padding and masks to optimize training.

We'll cover the following...

Understanding the data
Processing data

Now, let’s look at our first task: using an RNN to identify named entities in a text corpus. This task is known as named entity recognition (NER). We’ll be using a modified version of the well-known Conference on Computational Natural Language Learning 2003 (CoNLL 2003) dataset for NER.

CoNLL 2003 is available for multiple languages, and the English data was generated from a Reuters corpus that contains news stories published between August 1996 and August 1997. The dataset we’ll be using is found on the website and is called CoNLLPP. It’s a more carefully curated version of the original CoNLL dataset, which contains labeling errors caused by misreading the context of a word. For example, in the phrase “Chicago won ...”, Chicago was labeled as a location, whereas it’s actually an organization.

Understanding the data

We have defined a function called download_data(), which can be used to download the data. We won’t go into the details of it because it simply downloads several files and places them in a data folder. Once the download finishes, we’ll have three files:

data\conllpp_train.txt: A training set that contains 14,041 sentences.
data\conllpp_dev.txt: A validation set that contains 3,250 sentences.
data\conllpp_test.txt: A test set that contains 3,452 sentences.
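
As a quick sanity check on the download, the sentence counts above can be verified with a short script. The following is a minimal sketch, not part of the lesson's own code: the helper name count_sentences() is hypothetical, and it assumes the files follow the usual CoNLL 2003 layout, with one token per line, a blank line between sentences, and -DOCSTART- lines marking document boundaries.

```python
import os


def count_sentences(path):
    """Count sentences in a CoNLL-style file.

    Assumed layout (not confirmed by the lesson): one token per line,
    blank lines between sentences, and '-DOCSTART-' lines that mark
    document boundaries and are not sentences themselves.
    """
    count = 0
    in_sentence = False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                # A separator closes the sentence we were reading, if any.
                if in_sentence:
                    count += 1
                in_sentence = False
            else:
                in_sentence = True
    # The last sentence may not be followed by a blank line.
    if in_sentence:
        count += 1
    return count


# Report counts only for files that are actually present.
for name in ("conllpp_train.txt", "conllpp_dev.txt", "conllpp_test.txt"):
    file_path = os.path.join("data", name)
    if os.path.exists(file_path):
        print(name, count_sentences(file_path))
```

If the printed numbers differ from the counts listed above, the layout assumption is the first thing to check.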

Next up, we’ll read the data and convert it into a specific format that suits our model. But before that, we need to see what our data looks like originally:
