Image Segmentation Using Transformers
Discover transformer applications in semantic segmentation and explore the SETR and Segmenter architectures.
Let's explore the application of transformers to semantic segmentation. Traditional encoder-decoder architectures pose computational challenges when attention is introduced, so we'll examine how transformer-based designs make image segmentation practical.
Encoder-decoder architecture with self-attention
In an encoder-decoder setup, replacing the encoder block with a self-attention mechanism is a viable option. However, the computational cost of attending over every pixel is a concern. Two remedies were discussed: parallelizing the computation with multi-head attention, and operating on image patches (visual "words"), as in the vision transformer (ViT). A sketch of the patch-based idea follows.
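Below is a minimal PyTorch sketch of this idea, assuming a ViT-style setup (the class name `PatchEmbedding`, the 16x16 patch size, and the 768-dimensional tokens are illustrative choices, not a specific model's code). Attending over patch tokens instead of raw pixels shrinks the attention sequence dramatically:

```python
import torch
import torch.nn as nn

# Sketch only: 16x16 patches reduce the attention sequence for a 224x224 image
# from 224*224 = 50,176 pixels to (224/16)^2 = 196 tokens.
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        # A strided conv splits the image into non-overlapping patches
        # and projects each patch to a dim-dimensional token.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))  # positional embeddings

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, 196, 768)
        return tokens + self.pos

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
out, _ = attn(patches, patches, patches)  # self-attention over 196 tokens, not 50,176 pixels
print(out.shape)                          # torch.Size([1, 196, 768])
```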
Architectures combining approaches
Several architectures seamlessly integrate both approaches. Let's examine two notable models: the SEgmentation TRansformer (SETR) and Segmenter.
SETR architecture
The SETR model is a semantic segmentation transformer that divides the image into patches. While it isn't a pure transformer model, its encoder operates on image patch embeddings with positional embeddings, using self-attention to build the image's encoded representation. The decoder, by contrast, is a conventional convolutional one; the "SETR-Naive" variant, for example, applies direct upsampling.
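As a rough illustration of the SETR-Naive idea, here is a hedged PyTorch sketch (the head's structure and names such as `NaiveUpsampleHead` are assumptions for illustration, not the paper's implementation): the encoder's patch tokens are reshaped back into a 2D feature map, projected to class logits, and bilinearly upsampled straight to the input resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative SETR-Naive-style head: conv projection + direct upsampling.
class NaiveUpsampleHead(nn.Module):
    def __init__(self, dim=768, num_classes=19, grid=14):
        super().__init__()
        self.grid = grid  # 14x14 patch grid for a 224x224 image with 16x16 patches
        self.classify = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1), nn.BatchNorm2d(dim), nn.ReLU(),
            nn.Conv2d(dim, num_classes, kernel_size=1),
        )

    def forward(self, tokens, out_size=(224, 224)):        # tokens: (B, 196, 768)
        B, N, D = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        logits = self.classify(fmap)                       # (B, classes, 14, 14)
        return F.interpolate(logits, size=out_size,        # direct upsampling
                             mode="bilinear", align_corners=False)

head = NaiveUpsampleHead()
masks = head(torch.randn(2, 196, 768))
print(masks.shape)  # torch.Size([2, 19, 224, 224])
```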
Segmenter architecture
Another architecture, the Segmenter, adopts a similar segmentation transformer approach but integrates a transformer decoder.
The encoder processes image patches with a transformer, while the decoder employs a transformer mechanism with object queries, resembling the detection transformer (DETR). This enables pixel-level class predictions, which are then upsampled to the final output resolution.
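The following is a simplified, hedged sketch of a Segmenter-style mask decoder in PyTorch (the class and variable names, and the use of a plain transformer over the concatenated sequence, are illustrative assumptions): learnable class queries are processed jointly with the patch tokens, and each patch token is scored against each class embedding to form per-class masks, which are then upsampled.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified mask decoder: class queries attend jointly with patch tokens,
# then a scaled dot product between patches and queries yields class masks.
class MaskTransformerDecoder(nn.Module):
    def __init__(self, dim=768, num_classes=19, depth=2):
        super().__init__()
        self.cls_queries = nn.Parameter(torch.randn(1, num_classes, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens, grid=14, out_size=(224, 224)):
        B, N, D = patch_tokens.shape
        queries = self.cls_queries.expand(B, -1, -1)
        # Process patch tokens and class queries together in one sequence.
        x = self.blocks(torch.cat([patch_tokens, queries], dim=1))
        patches, classes = x[:, :N], x[:, N:]
        # Per-patch class scores via scaled dot product with class embeddings.
        logits = patches @ classes.transpose(1, 2) / D ** 0.5  # (B, N, classes)
        logits = logits.transpose(1, 2).reshape(B, -1, grid, grid)
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)

dec = MaskTransformerDecoder()
out = dec(torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 19, 224, 224])
```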
Panoptic segmentation
Detection transformers extend naturally to panoptic segmentation. Object queries provide ...