Preparing Data for the NMT System
Learn to prepare data for the NMT system.
In this lesson, we’ll look at the dataset and at how to prepare data for both training the NMT system and generating predictions from it. First, we’ll cover how to prepare the training data (that is, pairs of source and target sentences). Then we’ll cover how to feed a given source sentence to the system to produce its translation.
The dataset
The dataset we’ll be using is the WMT-14 English-German translation data. About 4.5 million sentence pairs are available, but we’ll use only 250,000 of them to keep the computation feasible. The vocabulary consists of the 50,000 most common English words and the 50,000 most common German words; words not found in the vocabulary are replaced with a special token, <unk>. We’ll need to download the following files:
train.de: File containing German sentences
train.en: File containing English sentences
vocab.50K.de: File containing German vocabulary
vocab.50K.en: File containing English vocabulary
train.de and train.en contain parallel sentences in German and English, respectively. Once we download these, we’ll load the sentences as follows:
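A minimal sketch of this loading step is shown below. It assumes the four files sit in the working directory, are UTF-8 encoded, hold one whitespace-tokenized sentence (or one vocabulary token) per line, and that line i of train.en corresponds to line i of train.de. The helper names read_lines, load_vocab, load_parallel, and replace_oov are hypothetical and are used here only for illustration.

```python
UNK = "<unk>"  # special token for out-of-vocabulary words


def read_lines(path, limit=None):
    """Read up to `limit` lines from a UTF-8 text file, stripping whitespace."""
    lines = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            lines.append(line.strip())
    return lines


def load_vocab(path):
    """Load a vocabulary file (assumed: one token per line) into a set."""
    return set(read_lines(path))


def load_parallel(en_path, de_path, limit=250_000):
    """Load the first `limit` (English, German) sentence pairs."""
    en_sentences = read_lines(en_path, limit)
    de_sentences = read_lines(de_path, limit)
    if len(en_sentences) != len(de_sentences):
        raise ValueError("Parallel files must have the same number of lines")
    return list(zip(en_sentences, de_sentences))


def replace_oov(sentence, vocab):
    """Replace every token that is not in `vocab` with the <unk> token."""
    return " ".join(tok if tok in vocab else UNK for tok in sentence.split())


# Example usage (assumes the files have already been downloaded):
# pairs = load_parallel("train.en", "train.de")
# en_vocab = load_vocab("vocab.50K.en")
# de_vocab = load_vocab("vocab.50K.de")
# en_clean = [replace_oov(en, en_vocab) for en, _ in pairs]
# de_clean = [replace_oov(de, de_vocab) for _, de in pairs]
```

Reading both files with the same limit keeps the pairs aligned, and the length check guards against a truncated or mismatched download. Holding the vocabulary in a set makes each membership test constant time, which matters when scanning 250,000 sentences.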