VideoBERT Model

Learn about pre-training the VideoBERT model using the cloze and linguistic-visual alignment tasks.

We'll cover the following...

Pre-training a VideoBERT model
Cloze task
- Example: Cooking video
- Video for Training the model
Linguistic-visual alignment
The final pre-training objective

Next, we'll look at another interesting variant of BERT called VideoBERT. As the name suggests, VideoBERT learns the representation of video in addition to the representation of language. It is the first model that learns the representations of video and language jointly.

Just as we used a pre-trained BERT model and fine-tuned it for downstream tasks, we can also use a pre-trained VideoBERT model and fine-tune it for many interesting downstream tasks. VideoBERT is used for tasks such as image caption generation, video captioning, predicting the next frames of a video, and more.

Let's explore how exactly the VideoBERT model is pre-trained using the cloze task and linguistic-visual alignment.

Cloze task

First, let's see how VideoBERT is pre-trained using the cloze task. To pre-train VideoBERT, we use instructional videos such as cooking videos. But why instructional videos? Why can't we use just any random videos? Let's look at an example.

Example: Cooking video

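Consider a video where someone is teaching us how to cook. Say the speaker says, 'Cut lemon into slices.' As we hear the speaker say 'Cut lemon into slices,' they also show us visually how they are cutting the lemon into slices. In an instructional video, the spoken words and the visuals line up with each other, which is what we need in order to learn language and video representations together. This is shown in the following figure: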
