Challenge: Preprocess and Clean a Czech-English Dataset

Show off what you've learned so far by preprocessing and cleaning data.

Task 1

Your first task in this challenge is to preprocess a sample Czech-English dataset. The files to preprocess are:

  • europarl-v7.cs-en.en

  • europarl-v7.cs-en.cs

Steps

While preprocessing, make sure to implement these steps:

  • Find the maximum and minimum sentence lengths.

  • Create a translation table for removing punctuation.

  • Remove punctuation marks.

  • Tokenize on whitespace.

  • Convert to lowercase text.

  • Remove words that contain digits.
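
The steps above can be sketched roughly as follows. This is only one possible approach, not the notebook's reference solution: the helper names `sentence_lengths` and `clean_sentence` are hypothetical, the sample sentences are made up for illustration, and sentence length is assumed to mean the number of whitespace-separated tokens.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def sentence_lengths(sentences):
    """Return (min, max) sentence length, counted in whitespace-separated tokens."""
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)


def clean_sentence(sentence):
    """Apply the cleaning steps to one sentence and return it as a string."""
    # Remove punctuation marks using the translation table.
    sentence = sentence.translate(PUNCT_TABLE)
    # Tokenize on whitespace.
    tokens = sentence.split()
    # Convert every token to lowercase.
    tokens = [t.lower() for t in tokens]
    # Drop tokens that contain at least one digit.
    tokens = [t for t in tokens if not any(ch.isdigit() for ch in t)]
    return " ".join(tokens)


# Made-up sample sentences (not taken from the dataset files).
sample = ["Resumption of the session, 1999!", "Děkuji vám, pane Segni."]
shortest, longest = sentence_lengths(sample)
cleaned = [clean_sentence(s) for s in sample]
```

Note that `string.punctuation` covers only ASCII punctuation, so typographic marks such as the Czech quotation marks „ and “ would survive this table. If those need to go as well, they can be appended to the third argument of `str.maketrans`.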

Code playground

Implement your solution in the Jupyter notebook below. Feel free to edit the helper function names already present in the notebook. The code for loading the dataset is already provided to you. Happy building!
