...

Getting to Know the Data

Learn about the image datasets that we'll use to train the model.

Let’s first look at the data we’ll be working with, both directly and indirectly. There are two datasets we’ll rely on:

We won’t work with the first dataset, ILSVRC, directly, but it’s essential for caption learning. It contains images and their respective class labels (for example, cat, dog, and car). We’ll use a CNN that has already been trained on this dataset, so we don’t have to download it or train a model on it from scratch. Next, we’ll use the MS-COCO dataset, which contains images and their respective captions. We’ll learn from this dataset directly: the vision transformer maps each image to a fixed-size feature vector, and a text-based transformer then maps this vector to the corresponding caption.
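
To make the image-encoding step concrete, here’s a minimal sketch of turning a single image into a fixed-size feature vector with a vision transformer pretrained on ILSVRC (ImageNet-1k). It assumes PyTorch and torchvision (0.13 or later) are available; `example.jpg` is a placeholder path, and the exact model and preprocessing used later in the course may differ.

```python
import torch
from torchvision import models
from PIL import Image

# Load a ViT-B/16 pretrained on ILSVRC (ImageNet-1k). We reuse only its encoder,
# so we replace the 1000-class classification head with an identity mapping and
# keep the pooled feature vector it would otherwise classify.
weights = models.ViT_B_16_Weights.IMAGENET1K_V1
vit = models.vit_b_16(weights=weights)
vit.heads = torch.nn.Identity()  # drop the classifier, keep the features
vit.eval()

# Preprocess the image exactly as during pretraining (resize, crop, normalize).
preprocess = weights.transforms()
image = Image.open("example.jpg").convert("RGB")  # placeholder image path
batch = preprocess(image).unsqueeze(0)            # shape: (1, 3, 224, 224)

with torch.no_grad():
    features = vit(batch)  # fixed-size feature vector, shape: (1, 768)

print(features.shape)  # torch.Size([1, 768])
```

During training on MS-COCO, a text-based transformer would then learn to generate the caption tokens conditioned on a feature vector like this one.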

ILSVRC

...