Preparing Data for the NMT System

Learn to prepare data for the NMT system.

We'll cover the following...

The dataset
Adding special tokens
Splitting training, validation, and testing datasets
Try it yourself

In this lesson, we’ll look at the data and how to prepare it, both for training the NMT system and for predicting with it. First, we’ll discuss how to prepare the training data (that is, the source and target sentence pairs). Then we’ll see how to feed a given source sentence to the system to produce its translation.

The dataset

The dataset we’ll be using is the WMT-14 English-German translation data. There are about 4.5 million sentence pairs available. However, we’ll use only 250,000 sentence pairs to keep the computation feasible. The vocabulary consists of the 50,000 most common English words and the 50,000 most common German words. Any word not found in the vocabulary is replaced with a special token, <unk>. We’ll need to download the following files:

train.de: File containing German sentences
train.en: File containing English sentences
vocab.50K.de: File containing German vocabulary
vocab.50K.en: File containing English vocabulary

train.de and train.en contain parallel sentences in German and English, respectively. Once we download these, we’ll load the sentences as follows:


import os

# Number of sentence pairs to read from each file
n_sentences = 250000

# Loading English sentences (one sentence per line, split on spaces)
original_en_sentences = []
with open(os.path.join('data', 'train.en'), 'r', encoding='utf-8') as en_file:
    for i, row in enumerate(en_file):
        if i >= n_sentences:
            break
        original_en_sentences.append(row.strip().split(" "))

# Loading German sentences (line i pairs with English line i)
original_de_sentences = []
with open(os.path.join('data', 'train.de'), 'r', encoding='utf-8') as de_file:
    for i, row in enumerate(de_file):
        if i >= n_sentences:
            break
        original_de_sentences.append(row.strip().split(" "))
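
To make the vocabulary rule described earlier concrete, here is a minimal sketch of how a size-limited vocabulary and the <unk> replacement could work on tokenized sentences like the ones loaded above. In this lesson the vocabulary comes from the downloaded vocab.50K.* files, so deriving it from word counts here is only an illustration. The helper names build_vocab and replace_oov, and the toy sentences, are assumptions made for this example and are not part of the course code.

```python
from collections import Counter

UNK = "<unk>"  # special token for out-of-vocabulary words


def build_vocab(sentences, max_size):
    """Return the set of the `max_size` most frequent tokens (hypothetical helper)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}


def replace_oov(sentences, vocab, unk_token=UNK):
    """Replace every token that is not in `vocab` with `unk_token` (hypothetical helper)."""
    return [[tok if tok in vocab else unk_token for tok in sent]
            for sent in sentences]


# Toy data standing in for the tokenized sentences
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

# Keep only the 2 most frequent words: "the" (3 times) and "sat" (2 times)
toy_vocab = build_vocab(toy_sentences, max_size=2)
toy_clean = replace_oov(toy_sentences, toy_vocab)
print(toy_clean)
```

With a real vocabulary of 50,000 words, the same replace_oov step could be applied to original_en_sentences and original_de_sentences, each with its own language's vocabulary.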
