Shifted Window (Swin) Transformers
Discover computer vision architectures and tackle quadratic complexity with Swin's variable patches and window division. Simulate Swin-like progression with Python.
Let's explore an essential architecture that modifies the vision transformer (ViT) concept to overcome the quadratic complexity issue that arises as image sizes, and with them the number of image patches, grow.
Quadratic complexity challenge
The quadratic complexity challenge emerges when employing a fixed 16 by 16 pixel patch in the vision transformer (ViT) architecture. This design choice works well for small images but becomes a computational bottleneck as the image size expands: the number of patches grows with the image area, and self-attention compares every patch with every other patch, so its cost grows quadratically with the number of patches. Consequently, this quadratic growth becomes impractical for large images, straining processing resources and hindering the efficiency of the model.
In essence, the fixed 16 by 16 pixel patch, while providing a structured way to analyze image data, becomes a limiting factor when the model must scale. This challenge prompts the exploration of alternative architectures, leading to the introduction of the Swin transformer.
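To make the growth concrete, here is a minimal sketch that counts patches and pairwise attention scores for square images split into 16 by 16 patches. The function name `vit_attention_cost` and the image sizes are illustrative assumptions, and the count ignores details such as attention heads, embedding dimension, and any extra class token.

```python
from typing import Tuple


def vit_attention_cost(image_size: int, patch_size: int = 16) -> Tuple[int, int]:
    """Return (number of patches, number of pairwise attention scores).

    Hypothetical helper for illustration: assumes a square image whose side
    is an exact multiple of the patch size.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Global self-attention scores every patch against every patch.
    return num_patches, num_patches ** 2


for size in (224, 448, 896):
    patches, pairs = vit_attention_cost(size)
    print(f"{size}x{size} image: {patches} patches, {pairs} attention scores")
```

Doubling the image side quadruples the number of patches (196, 784, 3136) and multiplies the number of attention scores by 16 (38,416, then 614,656, then 9,834,496).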
Swin architecture overview
Swin introduces two clever tricks to address these challenges: variable patches and window division.
Variable patches
The Swin architecture introduces variable patches, a notable departure from the fixed 16 by 16 pixel patch paradigm. Unlike the rigid approach of the ViT architecture, Swin lets each layer work with a different number of image patches.
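A Swin-like progression of patch counts can be simulated in a few lines. This is a sketch under stated assumptions rather than the lesson's own code: the starting patch size of 4 pixels, the four stages, and the 2 by 2 merging of neighboring patches between stages follow the commonly published Swin configuration, and the function name `swin_like_progression` is hypothetical.

```python
from typing import Dict, List


def swin_like_progression(
    image_size: int = 224,
    patch_size: int = 4,    # assumed starting patch size (not from this lesson)
    num_stages: int = 4,    # assumed number of stages
    merge_factor: int = 2,  # assumed: merge 2x2 neighboring patches per stage
) -> List[Dict[str, int]]:
    """Simulate how the patch grid shrinks from stage to stage."""
    final_patch = patch_size * merge_factor ** (num_stages - 1)
    if image_size % final_patch != 0:
        raise ValueError("image_size must be divisible by the final patch size")

    stages = []
    grid = image_size // patch_size  # patches per side in the first stage
    for stage in range(1, num_stages + 1):
        stages.append(
            {
                "stage": stage,
                "effective_patch": image_size // grid,  # pixels covered per side
                "grid": grid,
                "num_patches": grid * grid,
            }
        )
        grid //= merge_factor  # merging neighbors shrinks the grid per side
    return stages


for info in swin_like_progression():
    print(
        f"stage {info['stage']}: {info['effective_patch']}x{info['effective_patch']} "
        f"pixel patches, {info['grid']}x{info['grid']} grid, "
        f"{info['num_patches']} patches"
    )
```

Under these assumptions, a 224 by 224 image goes from 3136 small patches in the first stage to 784, 196, and finally 49 larger patches, so later layers see fewer, coarser patches instead of one fixed grid.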