Project Creation: Part Two
In this lesson, we will perform some preprocessing on our dataset.
Padding
In the previous lesson, we preprocessed our data and created a numeric representation of the test sentences. We will be using the same function to work with our original dataset.
First, we will create the padding functionality.
import numpy as np
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

def pad(x, length=None):
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post')

test_pad = pad(text_tokenized)

for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))
Explanation:
- First, we imported the required packages.
- From line 4 to line 7, we defined a pad() function. It first finds the length of the longest sequence in the data, then calls pad_sequences() with padding='post', which appends 0's to the end of every shorter sequence, and with maxlen set to that longest length, so every sequence comes out the same length and nothing is truncated. (A standalone sketch of this behavior on a toy input appears after the sample output below.)
- On line 9, we called the pad() function on the tokenized sequences that we created in the previous lesson.
- Finally, we printed each sequence before and after padding. Take a look at the output for one of the sequences below.
Sequence 1 in x
  Input:  [ 4 7 2 1 16 10 5 11 17 1 18 8 3 19 12 1 20 3 21 1 22 10 23 14 6 1 3 24 2 8 1 4 7 2 1 25 13 26 9 1 27 3 28 1 15]
  Output: [ 4 7 2 1 16 10 5 11 17 1 18 8 3 19 12 1 20 3 21 1 22 10 23 14 6 1 3 24 2 8 1 4 7 2 1 25 13 26 9 1 27 3 28 1 15 0 0 0 0 0 0
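
To see what pad_sequences() with padding='post' does on its own, here is a minimal, self-contained sketch. The toy sequences below are made up purely for illustration and are not part of the project's dataset; the import path is the same one used in the code above.

from tensorflow.python.keras.preprocessing.sequence import pad_sequences

# Three toy sequences of different lengths (hypothetical data for illustration only).
toy_sequences = [[4, 7, 2], [1, 16], [10, 5, 11, 17]]

# padding='post' appends zeros at the end of each sequence; maxlen is the length of
# the longest sequence, so nothing is truncated and every row has the same length.
longest = max(len(seq) for seq in toy_sequences)
padded = pad_sequences(toy_sequences, maxlen=longest, padding='post')

print(padded)
# Shorter rows are filled with trailing zeros:
# [[ 4  7  2  0]
#  [ 1 16  0  0]
#  [10  5 11 17]]

The explicit length parameter of the pad() function above serves the same purpose: when it is provided, the data is padded to that length instead of the length of the longest sequence.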