VideoBERT Model

Learn about pre-training the VideoBERT model using the cloze and linguistic-visual alignment tasks.

Now we'll learn about yet another interesting variant of BERT called VideoBERT. As the name suggests, along with learning the representation of language, VideoBERT also learns the representation of video. It is one of the first models to learn the representations of video and language jointly.

Just as we used a pre-trained BERT model and fine-tuned it for downstream tasks, we can also use a pre-trained VideoBERT model and fine-tune it for many interesting downstream tasks. VideoBERT is used for tasks such as image caption generation, video captioning, predicting the next frames of a video, and more.

But how exactly is VideoBERT pre-trained to learn video and language representations? Let's find out.

Pre-training a VideoBERT model

We know that the BERT model is pre-trained using two tasks, called masked language modeling (cloze task) and next sentence prediction.
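To make the cloze task concrete, here is a minimal sketch in plain Python of how a masked training example can be built from a sentence. The function name `mask_tokens`, the example sentence, and the masking probability are illustrative assumptions, not part of BERT's or VideoBERT's actual code. The sketch also simplifies BERT's real procedure, which masks about 15% of tokens and sometimes substitutes a random token or keeps the original instead of always inserting `[MASK]`.

```python
import random

MASK = "[MASK]"  # placeholder token that marks a blank to be filled in


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a cloze-style training example.

    labels maps each masked position to the original token that the
    model would be asked to predict. (Hypothetical helper for illustration.)
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            labels[i] = tok
    # Guarantee at least one blank so the example always has a target.
    if not labels and tokens:
        i = rng.randrange(len(tokens))
        masked[i] = MASK
        labels[i] = tokens[i]
    return masked, labels


tokens = "cut the lemon into thin slices".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=42)
print(masked)  # the sentence with some tokens replaced by [MASK]
print(labels)  # position -> original token the model must recover
```

During pre-training, the model sees only the masked sequence and is trained to predict the tokens stored in `labels`.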

Can we also pre-train VideoBERT using masked language modeling and next sentence prediction?

Let's explore how exactly the VideoBERT model is pre-trained using the cloze task and linguistic-visual alignment.

Cloze task

First, let's see how VideoBERT is pre-trained using the cloze task. In order to pre-train VideoBERT, we use instructional videos such as cooking videos. But why instructional videos? Why can't we use any random videos? Let's explain with an example.

Example: Cooking video

Consider a video where someone is teaching us how to cook. Say the speaker says, 'Cut lemon into slices.' As we hear the speaker say this, they also show us visually how they cut the lemon into slices, right? This is shown in the following figure:

Sample cooking video

These sorts of instructional videos, where the speaker's statement and the corresponding visuals tend to match each other, are very useful for pre-training VideoBERT. This alignment helps us to learn the representations of the language and ...