Preprocessing Datasets

Learn about the steps involved in preprocessing a WMT translation dataset.

Preprocessing a WMT dataset

Vaswani et al. (2017) present the transformer’s achievements on the Workshop on Machine Translation (WMT) 2014 English-to-German translation task and the WMT 2014 English-to-French translation task. On both tasks, the transformer achieved a state-of-the-art BLEU score at the time of publication.

WMT 2014 included several European-language datasets, one of which was drawn from version 7 of the Europarl corpus. We will be using the French-English dataset from the European Parliament Proceedings Parallel Corpus, 1996-2011.

We will preprocess the two parallel files in the extracted dataset:

  • europarl-v7.fr-en.en

  • europarl-v7.fr-en.fr

We will load the corpus, clean it, and reduce its size.

Let’s start the preprocessing.

Preprocessing the raw data

In this section, we will preprocess the files in the extracted dataset.

We will have to ensure that the two europarl files are in the same directory as read_code.ipynb (under the “Code playground” section).
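
One quick way to confirm the files are in place is a small check like the one below. It is only a sketch: the helper name missing_files is hypothetical, the file names are assumed to match the Europarl v7 release, and the directory is assumed to be the notebook's working directory.

```python
from pathlib import Path

# Assumed file names (Europarl v7, French-English pair).
EUROPARL_FILES = ("europarl-v7.fr-en.en", "europarl-v7.fr-en.fr")


def missing_files(directory=".", names=EUROPARL_FILES):
    """Return the names from `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


# Check the current working directory (assumed to be the notebook's folder).
missing = missing_files()
if missing:
    print("Missing dataset files:", ", ".join(missing))
else:
    print("Both Europarl files are in place.")
```

If the check reports a missing file, move the extracted file next to the notebook before running the preprocessing cells.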

The program begins by using standard Python functions to read the raw files and pickle to dump the serialized output files.
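
As a rough sketch of what such an opening step might look like, the code below reads a text file, splits it into one sentence per line, and serializes the result with pickle. The helper names (load_doc, to_sentences, save_clean_sentences, load_clean_sentences) are illustrative assumptions and may differ from those used in the notebook.

```python
import pickle


def load_doc(filename):
    """Read a whole UTF-8 text file into memory and return it as one string."""
    with open(filename, mode="rt", encoding="utf-8") as file:
        return file.read()


def to_sentences(doc):
    """Split a loaded document into sentences, assuming one sentence per line."""
    return doc.strip().split("\n")


def save_clean_sentences(sentences, filename):
    """Serialize a list of sentences to disk with pickle."""
    with open(filename, "wb") as file:
        pickle.dump(sentences, file)


def load_clean_sentences(filename):
    """Load a list of sentences previously saved with pickle."""
    with open(filename, "rb") as file:
        return pickle.load(file)
```

With these helpers, a call such as save_clean_sentences(to_sentences(load_doc("europarl-v7.fr-en.en")), "English.pkl") would write the English side of the corpus to a pickle file; the output name English.pkl is only a placeholder.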
