Processing and Tokenizing Data
Learn to process and tokenize the data.
With the data downloaded and placed in the correct folders, let’s define the directories containing the required data:
import os
import numpy as np

trainval_image_dir = os.path.join('data', 'train2014', 'train2014')
trainval_captions_dir = os.path.join('data', 'annotations_trainval2014', 'annotations')
test_image_dir = os.path.join('data', 'val2017', 'val2017')
test_captions_dir = os.path.join('data', 'annotations_trainval2017', 'annotations')

trainval_captions_filepath = os.path.join(trainval_captions_dir, 'captions_train2014.json')
test_captions_filepath = os.path.join(test_captions_dir, 'captions_val2017.json')
Here, we have defined the directories containing training and testing images as well as the file paths of the JSON files that contain the captions of the training and testing images.
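Before moving on, it can help to confirm that these paths actually point to the downloaded data. The snippet below is an optional, minimal sanity check (not part of the original code) that uses only the variables defined above and Python's standard library:

# Optional sanity check: verify that the expected directories and caption
# files exist before doing any further processing.
for path in [
    trainval_image_dir, test_image_dir,
    trainval_captions_filepath, test_captions_filepath
]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Expected data at {path}, but it was not found")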
Preprocessing data
As the next step, let’s split the training set into training and validation sets. We’ll use 80% of the original set as training data and 20% as the validation data (randomly chosen):
all_filepaths = np.array([os.path.join(trainval_image_dir, f) for f in os.listdir(trainval_image_dir)])

rand_indices = np.arange(len(all_filepaths))
np.random.shuffle(rand_indices)

split = int(len(all_filepaths) * 0.8)
train_filepaths, valid_filepaths = all_filepaths[rand_indices[:split]], all_filepaths[rand_indices[split:]]
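Because this split relies on NumPy's global random state, the exact train/validation membership will change from run to run. If you need a reproducible split, one option (a minimal sketch, not part of the original code) is to shuffle with a seeded generator instead:

# Optional: a seeded generator makes the train/validation split reproducible.
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
rand_indices = np.arange(len(all_filepaths))
rng.shuffle(rand_indices)

split = int(len(all_filepaths) * 0.8)
train_filepaths = all_filepaths[rand_indices[:split]]
valid_filepaths = all_filepaths[rand_indices[split:]]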
We can print the dataset sizes and see what we ended up with:
print(f"Train dataset size: {len(train_filepaths)}")print(f"Valid dataset size: {len(valid_filepaths)}")
This will print:
Train dataset size: 66226
Valid dataset size: 16557
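The two numbers should add up to the size of the full train2014 image set, with no file appearing in both splits. A quick check along these lines (not part of the original code) confirms this:

# Consistency check: the two splits should cover every file exactly once.
assert len(train_filepaths) + len(valid_filepaths) == len(all_filepaths)
assert len(set(train_filepaths) & set(valid_filepaths)) == 0  # no overlap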
Now let's read the captions and create a pandas DataFrame using them. Our DataFrame will have four important columns (a construction sketch follows the list below):

- image_id: Identifies an image (used to generate the file path)
- image_filepath: File location of the image identified by image_id
- caption: Original caption
- preprocessed_caption: Caption after some simple preprocessing
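To make the structure of this DataFrame concrete, here is a minimal sketch of how it might be assembled. It assumes the standard COCO captions JSON layout (an 'annotations' list whose records carry image_id and caption fields) and the COCO 2014 file-naming pattern, and it uses a simple lowercase-and-strip step as a placeholder for the actual preprocessing, which is described later:

import json
import pandas as pd

# Load the training captions JSON (assumes the standard COCO captions format)
with open(trainval_captions_filepath, 'r') as f:
    captions_data = json.load(f)

# Each annotation record carries an image_id and a caption
df = pd.DataFrame(captions_data['annotations'])[['image_id', 'caption']]

# COCO 2014 training images follow the pattern COCO_train2014_<12-digit id>.jpg
df['image_filepath'] = df['image_id'].apply(
    lambda i: os.path.join(trainval_image_dir, f"COCO_train2014_{i:012d}.jpg")
)

# Placeholder preprocessing: lowercase the caption and strip surrounding whitespace
df['preprocessed_caption'] = df['caption'].str.lower().str.strip()

df = df[['image_id', 'image_filepath', 'caption', 'preprocessed_caption']]
print(df.head())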