Project Creation: Part One
In this lesson, we will start with our project on text generation and perform some pre-processing steps.
Introduction to the project
In this chapter, we are going to build a text generator using Markov chains, specifically a character-based model. Let's suppose we have the string `the monke`. Based on our training corpus, we need to find the character that is best suited to follow the final character `e` of `monke`. In other words, we are going to generate the next character for a given string, and by repeating this step we generate text one character at a time.
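To make this idea concrete before we build anything, here is a minimal sketch. It assumes we already have a small frequency table recording which characters were observed after a given ending in some corpus; the `lookup` contents and the `next_char` helper below are made up purely for illustration.

```python
import random

# Hypothetical frequency table (illustration only): the last 3 characters of the
# input map to the characters seen after them in a corpus, with their counts.
lookup = {"nke": {"y": 5, "e": 1}}

def next_char(text, k=3):
    """Sample the next character based on what followed the last k characters."""
    counts = lookup[text[-k:]]
    chars, weights = zip(*counts.items())
    return random.choices(chars, weights=weights, k=1)[0]

print("the monke" + next_char("the monke"))  # usually prints "the monkey"
```

With a real table learned from a training corpus, this is the step we will repeat over and over to grow the generated text.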
We will take every sequence of the last K characters together with the (K+1)th character from the training corpus and save these pairs in a lookup table. To understand it better, let's say we have a corpus that contains only `the man was .... they ... then .... the ... the` (of course, we will use a much bigger corpus). So, the words occur the following number of times:
- `the`: 3
- `then`: 1
- `they`: 1
- `man`: 1
The table that will be generated looks something like this:
Lookup Table
| X   | Y   | Frequency |
|-----|-----|-----------|
| the | " " | 3         |
| the | n   | 2         |
| the | y   | 1         |
| the | i   | 1         |
| man | " " | 1         |
| ... | ... | ...       |
In the example above, we have taken K = 3, so we consider 3 characters at a time and take the next character as our output character. In the first row of the lookup table, X (the input sequence) is `the` and Y (the output) is a single space (" "). We have also counted how many times this sequence occurs in our dataset, which came out to be 3. Similarly, we will generate all possible (X, Y) pairs
...
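To make the lookup-table construction concrete, here is a minimal sketch (not the final project code) that slides a window of K = 3 characters over a corpus and counts which character follows each window; the toy `corpus` string is a made-up stand-in for the real training corpus.

```python
from collections import defaultdict, Counter

K = 3  # number of context characters, as in the example above

def build_lookup_table(corpus, k=K):
    """Map each k-character sequence X to the counts of the characters Y that follow it."""
    table = defaultdict(Counter)
    for i in range(len(corpus) - k):
        x = corpus[i:i + k]   # the last k characters (X)
        y = corpus[i + k]     # the (k+1)th character (Y)
        table[x][y] += 1
    return table

# Toy corpus, just for illustration
corpus = "the man was there. they said then the end."
table = build_lookup_table(corpus)
print(table["the"])  # counts of every character seen right after "the"
```

Each row of the lookup table above corresponds to one (X, Y) entry in this nested mapping, with the inner count playing the role of the Frequency column.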