Let's explore how transformers can be applied to semantic segmentation. Traditional encoder-decoder architectures can be computationally demanding, so we'll look at how incorporating transformers changes the approach to image segmentation.

Encoder-decoder architecture with self-attention

In an encoder-decoder setup, one viable option is to replace the encoder block with a self-attention mechanism. However, the computational cost is a concern because self-attention scales quadratically with the sequence length. Two solutions were discussed: parallelizing the computation with multihead attention, and operating on image patches ("visual words") rather than individual pixels, similar to vision transformers (ViT).
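The sketch below illustrates the patch-based idea under assumed, illustrative settings (a 256x256 input, 16x16 patches, an embedding size of 128, and 8 heads, none of which come from a specific paper): each patch becomes one token, so multihead self-attention runs over 256 tokens instead of 65,536 pixels.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not values from any specific model).
batch, channels, height, width = 2, 3, 256, 256
patch_size, embed_dim, num_heads = 16, 128, 8

images = torch.randn(batch, channels, height, width)

# Patch embedding: a strided convolution maps each 16x16 patch to one token,
# reducing the sequence from 256*256 pixels to 16*16 = 256 patch tokens.
patch_embed = nn.Conv2d(channels, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = patch_embed(images)                      # (batch, embed_dim, 16, 16)
tokens = tokens.flatten(2).transpose(1, 2)        # (batch, 256, embed_dim)

# Multihead self-attention over the patch tokens.
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
attended, _ = attention(tokens, tokens, tokens)   # (batch, 256, embed_dim)
print(attended.shape)                             # torch.Size([2, 256, 128])
```

Because attention cost grows with the square of the token count, working on 256 patch tokens rather than 65,536 pixel tokens is what makes self-attention practical for images of this size.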

Architectures combining approaches

Several architectures integrate both ideas. Let's examine two notable models: the SEgmentation TRansformer (SETR) and Segmenter.

SETR architecture

The SETR model divides the image into patches, and its encoder applies self-attention to the patch embeddings combined with positional embeddings.
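Below is a minimal sketch of this pipeline: patch embeddings plus learnable positional embeddings, a stack of transformer encoder layers, and a simple head that reshapes the tokens back into a feature map and upsamples it to per-pixel class logits. The layer sizes, depth, and the simple decoder head are illustrative assumptions, not the exact configuration used in the SETR paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SETRStyleSegmenter(nn.Module):
    """Sketch of a SETR-like pipeline (assumed hyperparameters):
    patch embed -> add positional embeddings -> transformer encoder ->
    reshape tokens to a 2D feature map -> upsample to per-pixel logits."""

    def __init__(self, image_size=256, patch_size=16, embed_dim=256,
                 depth=4, num_heads=8, num_classes=21):
        super().__init__()
        self.grid = image_size // patch_size              # patches per side
        num_patches = self.grid ** 2

        self.patch_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.shape
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (b, N, D)
        tokens = self.encoder(tokens + self.pos_embed)            # self-attention
        feats = tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        logits = self.classifier(feats)                           # (b, C, 16, 16)
        return F.interpolate(logits, size=(h, w), mode="bilinear",
                             align_corners=False)                 # (b, C, h, w)

model = SETRStyleSegmenter()
out = model(torch.randn(1, 3, 256, 256))
print(out.shape)   # torch.Size([1, 21, 256, 256])
```

The key point this sketch captures is that the encoder is a pure transformer operating on patch tokens; the decoder's job is only to recover spatial resolution from those tokens.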
