Spatio-Temporal Transformers
Learn to simulate and recognize actions in video frames using spatio-temporal transformers with a concise Python code example.
Let's explore the integration of transformers in tasks involving temporal relations, such as video processing and multiple frames of images. Transformers, initially designed for natural language processing (NLP), naturally extend to model temporal sequences, making them suitable for video analysis applications.
Spatial and temporal relations in video analysis
Building on spatial relations using self-attention mechanisms, transformers now address temporal aspects in video analysis. The dimensions shift from
Video transformer network architecture
The Video Transformer Network (VTN) architecture was introduced in 2021. The main goal of this architecture is to perform video classification tasks.
This architecture processes each video frame with a convolutional neural network (CNN) backbone, extracting 2D embeddings,
Get hands-on with 1400+ tech skills courses.