Project Creation: Part One

In this lesson, we will start with our project on text generation and perform some pre-processing steps.

Introduction to the project

In this chapter, we are going to build a text generator using Markov chains. The model will be character-based. Suppose we have the string "the monke". We need to find the character that best follows the final e in "monke", based on our training corpus. In other words, we predict the next character for a given string, and by repeating this step we generate text.

For each position in the training corpus, we will take the last K characters together with the character that follows them (the (K+1)th character) and store the pair in a lookup table. To understand it better, let’s say we have a corpus that contains only "the man was .... they ... then .... the ... the" (of course, we will use a much bigger corpus). The words in it occur the following number of times:

  • the - 3
  • then - 1
  • they - 1
  • man - 1

The table that will be generated looks something like this:

Lookup Table

X      Y      Frequency
the    " "    3
the    n      1
the    y      1
the    i      1
man    " "    1
:      :      :

In the example above, we have taken K = 3. We consider 3 characters at a time and take the character that follows them as our output character. In the first row of the lookup table, X (the input sequence) is "the" and Y (the output) is a single space (" "). We have also counted how many times this pair occurs in our dataset, which came out to be 3. Similarly, we will be generating all possible pairs of (X,Y) ...
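
The counting described above can be sketched in a few lines of Python. This is only a minimal illustration, not the final implementation used in the project: the function name build_lookup_table, the variable names, and the short sample sentence are assumptions made for this example, and the sample is a different corpus from the one in the table above.

```python
from collections import Counter, defaultdict


def build_lookup_table(text, k=3):
    """Count how often each character follows each k-character sequence.

    Hypothetical helper for illustration only.
    """
    table = defaultdict(Counter)
    # Stop k characters before the end so every sequence has a next character.
    for i in range(len(text) - k):
        x = text[i:i + k]   # X: the last k characters
        y = text[i + k]     # Y: the (k+1)th character
        table[x][y] += 1
    return table


# A made-up sample corpus (an assumption, not the lesson's dataset).
sample = "the man saw them then they left the hut"
table = build_lookup_table(sample, k=3)

print(dict(table["the"]))  # {' ': 2, 'm': 1, 'n': 1, 'y': 1}
print(dict(table["man"]))  # {' ': 1}
```

Each key of the outer dictionary is an X, and the inner counter maps every Y seen after that X to its frequency, which is the same information as one group of rows in the lookup table.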