Project Creation: Part Two

In this lesson, we will perform some preprocessing on our dataset.

Padding

In the previous lesson, we preprocessed our data and created a numeric representation of the test sentences. We will be using the same function to work with our original dataset.

First, we will create the padding functionality.

import numpy as np
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

def pad(x, length=None):
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post')

test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Explanation:

  • First, we imported the required packages.

  • From line 4 to line 7, we defined a function that pads our data. If no length is passed in, we first find the length of the longest sequence in the batch. We then used the pad_sequences() function with the padding='post' parameter so that extra 0’s are appended at the end of each sequence, and passed that length as maxlen so every sequence is padded to the same size (see the sketch after the sample output below).

  • On line 9, we called the pad() function on the sequences that we created in the previous lesson.

  • Finally, we printed each sequence before padding and again after padding. Take a look at the output for one of the sequences below.

    Sequence 1 in x
    Input:  [ 4  7  2  1 16 10  5 11 17  1 18  8  3 19 12  1 20  3 21  1 22 10 23 14
    6  1  3 24  2  8  1  4  7  2  1 25 13 26  9  1 27  3 28  1 15]
    Output: [ 4  7  2  1 16 10  5 11 17  1 18  8  3 19 12  1 20  3 21  1 22 10 23 14
    6  1  3 24  2  8  1  4  7  2  1 25 13 26  9  1 27  3 28  1 15  0  0  0
    0  0  0
...
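
To see exactly what pad() does, here is a minimal, self-contained sketch. The toy_tokenized lists below are made up for illustration and are not part of the course dataset; the first call pads every sequence to the length of the longest one, while the second passes an explicit length.

from tensorflow.python.keras.preprocessing.sequence import pad_sequences

def pad(x, length=None):
    # If no length is given, pad to the longest sequence in the batch.
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post')

# Hypothetical tokenized sentences, used only for illustration.
toy_tokenized = [[4, 7, 2], [1, 16], [10, 5, 11, 17]]

print(pad(toy_tokenized))
# [[ 4  7  2  0]
#  [ 1 16  0  0]
#  [10  5 11 17]]

print(pad(toy_tokenized, length=6))
# [[ 4  7  2  0  0  0]
#  [ 1 16  0  0  0  0]
#  [10  5 11 17  0  0]]

In the lesson itself, pad() is called without the length argument, so every sequence in text_tokenized ends up padded to the length of the longest tokenized sentence.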