Preparing Data for the NMT System

Learn to prepare data for the NMT system.

In this lesson, we’ll look at the data and how to prepare it for training and for prediction with the NMT system. First, we’ll prepare the training data (that is, the pairs of source and target sentences) used to train the NMT system. Then we’ll see how a given source sentence is fed to the system to produce its translation.

The dataset

The dataset we’ll be using is the WMT-14 English-German translation data. About 4.5 million sentence pairs are available, but we’ll use only 250,000 of them to keep the computation feasible. The vocabulary consists of the 50,000 most common English words and the 50,000 most common German words. Any word not found in the vocabulary is replaced with a special token, <unk>. We’ll need to download the following files:

train.de and train.en contain parallel sentences in German and English, respectively. Once we download these, we’ll load the sentences as follows:

import os

# Number of sentence pairs to read from each file
n_sentences = 250000

# Loading English sentences
original_en_sentences = []
with open(os.path.join('data', 'train.en'), 'r', encoding='utf-8') as en_file:
    for i, row in enumerate(en_file):
        if i >= n_sentences:
            break
        # Split each line into tokens on single spaces
        original_en_sentences.append(row.strip().split(" "))

# Loading German sentences
original_de_sentences = []
with open(os.path.join('data', 'train.de'), 'r', encoding='utf-8') as de_file:
    for i, row in enumerate(de_file):
        if i >= n_sentences:
            break
        original_de_sentences.append(row.strip().split(" "))

If we print the data we just loaded for the two languages, we would have sentences like the following:

English: a fire restant repair cement for fire places , ovens , open fireplaces etc .
German: feuerfester Reparaturkitt für Feuerungsanlagen , Öfen , offene Feuerstellen etc.
English: Construction and repair of highways and ...
German: Der Bau und die Reparatur der Autostraßen ...
English: An announcement must be commercial character .
German: die Mitteilungen sollen den geschäftlichen kommerziellen Charakter tragen .
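
The vocabulary restriction described earlier (keep the most common words and map everything else to <unk>) can be illustrated with a minimal sketch. This is not necessarily how the lesson obtains its vocabulary, which may come from downloaded vocabulary files. The helper names `build_vocabulary` and `replace_rare_words` and the toy sentences are assumptions made only for illustration, and the vocabulary size is set to 2 instead of 50,000 so the effect is visible.

```python
from collections import Counter

UNK = "<unk>"  # special token for out-of-vocabulary words


def build_vocabulary(sentences, vocab_size):
    """Return the set of the `vocab_size` most frequent tokens (hypothetical helper)."""
    counts = Counter(token for sent in sentences for token in sent)
    return {token for token, _ in counts.most_common(vocab_size)}


def replace_rare_words(sentences, vocabulary):
    """Replace every token that is not in `vocabulary` with <unk> (hypothetical helper)."""
    return [[tok if tok in vocabulary else UNK for tok in sent] for sent in sentences]


# Toy data standing in for the tokenized sentences loaded above
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

toy_vocab = build_vocabulary(toy_sentences, vocab_size=2)
toy_replaced = replace_rare_words(toy_sentences, toy_vocab)
print(toy_replaced)
```

With the real data, the same idea would be applied separately to the English and German sentences, each with its own 50,000-word vocabulary.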

Adding special tokens

The next step is to add special tokens to the start and end of each sentence. We’ll add <s> to mark the start of a sentence and </s> to mark the end of a sentence. We can easily achieve this using the following list comprehensions:

en_sentences = [["<s>"]+sent+["</s>"] for sent in original_en_sentences]
de_sentences = [["<s>"]+sent+["</s>"] for sent in original_de_sentences]

This will give us:

English: <s> a fire restant repair cement for fire places , ovens , open fireplaces etc . </s>
German: <s> feuerfester Reparaturkitt für Feuerungsanlagen , Öfen , offene Feuerstellen etc. </s>
English: <s> Construction and repair of highways and ... </s>
German: <s> Der Bau und die Reparatur der Autostraßen ... </s>
English: <s> An announcement must be commercial character . </s>
German: <s> die Mitteilungen sollen den geschäftlichen kommerziellen Charakter tragen . </s>

This step is essential for Seq2Seq models because the <s> and </s> tokens play a key role during model inference. As we’ll see, at inference time the decoder predicts one word at a time, using the output of the previous time step as its input. This way, we can predict for an arbitrary number of time steps. ...
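
To make the one-word-at-a-time idea concrete, here is a minimal sketch of a greedy decoding loop. It assumes the common convention that <s> is fed as the first decoder input and that generation stops once </s> is predicted. The `toy_decoder_step` function and its lookup table are hypothetical stand-ins for a trained decoder, which would also use its internal state and the encoded source sentence.

```python
START, END = "<s>", "</s>"

# Hypothetical stand-in for a trained decoder: maps the previous token to the next one
TOY_NEXT_WORD = {"<s>": "Der", "Der": "Bau", "Bau": "</s>"}


def toy_decoder_step(previous_token):
    """Return the next token given the previous one (toy lookup, not a real model)."""
    return TOY_NEXT_WORD.get(previous_token, END)


def greedy_decode(step_fn, max_steps=20):
    """Feed each prediction back in as the next input until </s> or max_steps."""
    token = START  # the start token is the first input to the decoder
    output = []
    for _ in range(max_steps):
        token = step_fn(token)
        if token == END:  # the end token signals that the translation is complete
            break
        output.append(token)
    return output


decoded = greedy_decode(toy_decoder_step)
print(decoded)  # ['Der', 'Bau']
```

The `max_steps` limit is a safety cap for this sketch, so that decoding still terminates if </s> is never predicted.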