...


Processing and Tokenizing Data

Learn to process and tokenize the data.

With the data downloaded and placed in the correct folders, let’s define the directories containing the required data:

import os
# Directories containing the training/validation and test images
trainval_image_dir = os.path.join('data', 'train2014', 'train2014')
trainval_captions_dir = os.path.join('data', 'annotations_trainval2014', 'annotations')
test_image_dir = os.path.join('data', 'val2017', 'val2017')
test_captions_dir = os.path.join('data', 'annotations_trainval2017', 'annotations')
# JSON files containing the captions for each split
trainval_captions_filepath = os.path.join(trainval_captions_dir, 'captions_train2014.json')
test_captions_filepath = os.path.join(test_captions_dir, 'captions_val2017.json')
Directories for data

Here, we have defined the directories containing the training and testing images, as well as the file paths of the JSON files that hold their captions.
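Before moving on, it's worth confirming that these paths actually point at the downloaded data. The check below is a minimal sketch that uses only the variables defined above:

# Fail early with a clear error if the data wasn't extracted to the expected locations
for path in [trainval_image_dir, trainval_captions_filepath, test_image_dir, test_captions_filepath]:
    assert os.path.exists(path), f"Expected path not found: {path}"
Sanity-checking the data paths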

Preprocessing data

As the next step, let’s split the training set into training and validation sets. We’ll use 80% of the original set as training data and the remaining 20% (randomly chosen) as validation data:

import numpy as np
# Collect the full paths of all training/validation images
all_filepaths = np.array([os.path.join(trainval_image_dir, f) for f in os.listdir(trainval_image_dir)])
# Shuffle the indices and use the first 80% for training, the rest for validation
rand_indices = np.arange(len(all_filepaths))
np.random.shuffle(rand_indices)
split = int(len(all_filepaths)*0.8)
train_filepaths, valid_filepaths = all_filepaths[rand_indices[:split]], all_filepaths[rand_indices[split:]]

We can print the dataset sizes and see what we ended up with:

print(f"Train dataset size: {len(train_filepaths)}")
print(f"Valid dataset size: {len(valid_filepaths)}")

This will print:

Train dataset size: 66226
Valid dataset size: 16557
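
Note that because the split is random, the exact files in each set will differ from run to run. If you want a reproducible split, one option is to fix NumPy's random seed before shuffling (the seed value below is arbitrary):

np.random.seed(4321)  # any fixed value makes the shuffle, and therefore the split, reproducible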

Now let’s read the captions and create a pandas DataFrame from them; a rough sketch of this step is shown after the list. Our DataFrame will have four important columns:

  • image_id: Identifies an image (used to generate the file path)

  • image_filepath: File location of the image identified by image_id

  • caption: Original caption

  • preprocessed_caption: Caption after some simple preprocessing ... ...
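
Although the exact implementation follows later, here is a rough sketch of how such a DataFrame could be built from the captions JSON. The COCO annotations file stores a list of records under the annotations key, each with an image_id and a caption, and the 2014 training images follow the naming pattern COCO_train2014_<12-digit id>.jpg. The preprocessing shown here (lowercasing and stripping punctuation) is only an illustrative assumption:

import json
import pandas as pd
# Load the training captions JSON and keep the annotation records
with open(trainval_captions_filepath, 'r') as f:
    trainval_captions = json.load(f)
captions_df = pd.DataFrame(trainval_captions['annotations'])[['image_id', 'caption']]
# Derive each image's file path from its image_id
captions_df['image_filepath'] = captions_df['image_id'].apply(
    lambda image_id: os.path.join(trainval_image_dir, f'COCO_train2014_{image_id:012d}.jpg')
)
# Illustrative preprocessing: lowercase and keep only letters, digits, and spaces
captions_df['preprocessed_caption'] = captions_df['caption'].str.lower().str.replace(
    r'[^a-z0-9 ]', '', regex=True
)
Building the captions DataFrame (sketch)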