Understanding the Data
Learn about the data and datasets.
About the dataset
First, we need to understand what the dataset looks like so that when we see the generated text, we can assess whether it makes sense given the training data. We’ll use stories from “Grimms’ Fairy Tales,” a collection by the Grimm brothers translated from German to English.
We’ll start by downloading all 209 stories from the website with the following automated script:
import os
from urllib.request import urlretrieve

url = 'https://www.cs.cmu.edu/~spok/grimmtmp/'
dir_name = 'data'

def download_data(url, filename, download_dir):
    """Download a file if it is not already present and return its local path."""
    # Create the download directory if it doesn't exist
    os.makedirs(download_dir, exist_ok=True)

    # Download the file only if it isn't already on disk
    if not os.path.exists(os.path.join(download_dir, filename)):
        filepath, _ = urlretrieve(url + filename, os.path.join(download_dir, filename))
    else:
        filepath = os.path.join(download_dir, filename)
    return filepath

# Number of files and their names to download
num_files = 209
filenames = [format(i, '03d') + '.txt' for i in range(1, num_files + 1)]

# Download each file
for fn in filenames:
    download_data(url, fn, dir_name)

# Check that every file was downloaded
for fn in filenames:
    assert os.path.isfile(os.path.join(dir_name, fn))

print('{} files found.'.format(len(filenames)))
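To make the naming scheme in the script concrete, the short sketch below rebuilds the list of filenames on its own, without touching the network. It only illustrates how `format(i, '03d')` zero-pads each index to three digits; the value 209 is taken from the script above.

```python
# Rebuild the filename list used by the download script (no network access needed)
num_files = 209  # same value as in the download script
filenames = [format(i, '03d') + '.txt' for i in range(1, num_files + 1)]

print(filenames[:3])   # ['001.txt', '002.txt', '003.txt']
print(filenames[-1])   # 209.txt
print(len(filenames))  # 209
```

An f-string such as `f'{i:03d}.txt'` would produce the same names.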
We’ll now show example text snippets extracted from two randomly picked stories. The following is the first snippet:
Then she said, my dearest benjamin, your father has had these coffins made for you and for your eleven brothers, for if I bring a little girl into the world, you are all to be killed and buried in them. And as she wept while she was saying this, the son comforted her and said, weep not, dear mother, we will save ourselves, and go hence. But she said, go forth into the forest with your eleven brothers, and let one sit constantly on the highest tree which can be found, and keep watch, looking towards the tower here in the castle. If I give birth to a little son, I will put up a white flag, and then you may venture to come back. But if I bear a daughter, I will hoist a red flag, and then fly hence as quickly as you are able, and may the good God protect you.
The second text snippet is as follows:
Red-cap did not know what a wicked creature he was, and was not at all afraid of him.
“Good-day, little red-cap,” said he.
“Thank you kindly, wolf.”
“Whither away so early, little red-cap?”
“To my grandmother’s.”
“What have you got in your apron?”
“Cake and wine. Yesterday was baking-day, so poor sick grandmother is to have something good, to make her stronger.”
“Where does your grandmother live, little red-cap?”
“A good quarter of a league farther on in the wood. Her house stands under the three large oak-trees, the nut-trees are just below. You surely must know it,” replied little red-cap.
The wolf thought to himself, what a tender young creature. What a nice plump mouthful, she will be better to eat than the old woman.
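Snippets like the two above could be pulled from the downloaded files with a small helper along the following lines. This is a minimal sketch, not the code used to produce the excerpts: the function names `read_snippet` and `sample_snippets`, the 500-character default, and the UTF-8 decoding with replacement of undecodable bytes are all assumptions made for illustration.

```python
import os
import random


def read_snippet(path, n_chars=500):
    """Return the first n_chars characters of a text file, with whitespace collapsed."""
    # Assumption: the files can be read as UTF-8; undecodable bytes are replaced
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        text = f.read()
    # Collapse runs of whitespace (including newlines) into single spaces
    return ' '.join(text.split())[:n_chars]


def sample_snippets(dir_name, k=2, n_chars=500, seed=None):
    """Pick up to k random .txt files from dir_name and return {filename: snippet}."""
    rng = random.Random(seed)  # pass a seed to make the selection reproducible
    candidates = sorted(fn for fn in os.listdir(dir_name) if fn.endswith('.txt'))
    chosen = rng.sample(candidates, min(k, len(candidates)))
    return {fn: read_snippet(os.path.join(dir_name, fn), n_chars) for fn in chosen}
```

Calling `sample_snippets('data')` after the download has finished would return the opening text of two randomly chosen stories, keyed by filename.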
We now understand what our data looks like. With that understanding, let’s move on to processing our data further.
Generating training, validation, and test sets
We’ll separate the stories we downloaded into three sets: training, validation, and test files. We’ll use the content in each set of files as the training, validation, and test data. We’ll use ...