...

Understanding Vision Models: How AI Learns to See

Explore what vision models are, how they are trained, and what new experiences they enable.

We’ve already explored how incredible AI has become at language—writing essays, answering questions, and even chatting naturally. Pretty amazing, right? But pause for a moment and think about your daily life. How much of your experience of the world relies solely on words?

Probably not that much! Most of how you understand and interact with your surroundings is through vision. Recognizing your friend’s face, navigating your home, or appreciating a gorgeous sunset are all visual experiences. What if we could allow AI to see and understand the world similarly?

That’s exactly what vision models do! Like the language models we studied earlier, vision models allow AI to see, interpret, and even generate visual information—much like the eyes and brain do. But why exactly is vision such a big deal in the AI world? Let’s dive deeper.

Why is vision so important in AI?

Let’s first quickly clarify: What is an image?

An image is simply a grid of pixels—tiny dots, each holding color information. Imagine a huge mosaic made up of thousands of tiny, colored tiles. Humans instantly recognize what the mosaic represents (say, a cat or a sunset). But to a computer, these are just numbers.
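
To make this concrete, here is a minimal sketch (in Python with NumPy, purely illustrative) of how a computer stores a tiny grayscale image: just a grid of numbers, with nothing inherently "cat-like" or "sunset-like" about them.

```python
import numpy as np

# A tiny 4x4 grayscale "image": each number is one pixel's brightness
# (0 = black, 255 = white). A color image would store three numbers
# (red, green, blue) per pixel instead of one.
image = np.array([
    [  0,  50, 120, 255],
    [ 10,  80, 200, 240],
    [ 30, 100, 180, 220],
    [  5,  60, 140, 230],
], dtype=np.uint8)

print(image.shape)   # (4, 4) -- a grid of pixels
print(image[0, 3])   # 255 -- the top-right pixel is just a number to the computer
```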

Teaching AI to interpret these numerical patterns visually is a game changer. Here’s why:

  • Vision dominates how we navigate and understand our world. If AI is to assist us effectively, it needs vision, too. Imagine a robot that sees obstacles or a medical AI that spots tumors on scans better than humans can.

  • The world is filled with visual data—from your smartphone gallery full of pet photos to medical X-rays and satellite images. Processing this huge amount of visual data efficiently can help AI solve important real-world problems.

Vision-based foundation models, trained on billions of images, are at the forefront of this visual revolution. Let’s uncover exactly how these models do their magic.

What is the vision transformer?

We’ve talked about transformers—the powerful models behind language AI that can chat, write essays, and even create stories. Transformers are fantastic at understanding sequences—words in a sentence, for example. But could transformers also learn to understand images? At first glance, it seems tricky. Images aren’t words, after all—they’re pictures!

But imagine for a moment that we could turn images into something that transformers can naturally process—something like visual sentences. This creative idea led to a groundbreaking AI architecture called the Vision Transformer (ViT).

Figure: Vision models architecture

Let’s dive into how they actually work, step by step:

  1. Image patching: Imagine you have a picture of a cute cat. To us, it’s one whole image, but to a Vision Transformer, this picture is like a paragraph made up of visual words. How do we get these visual words? Easy—we slice the image into smaller squares called patches, similar to cutting a printed photo into neat, smaller tiles. Suppose our picture is a small square of 28x28 pixels. If we cut it into smaller squares (say 7x7 pixels each), we end up with a grid of 4 tiles by 4 tiles, totaling 16 patches. Each patch now becomes a visual “word” in our image sentence. Instead of seeing the whole image at once, the Vision Transformer sees a sequence of these visual words, similar to how you read words one by one in a sentence. (A short code sketch after these steps shows this slicing concretely.)

  2. Linear embedding of patches: Now, we have our image neatly sliced into visual words (patches). But here’s the catch: computers don’t really see pictures—they only understand numbers. Each of these patches is a tiny grid of pixels. To a computer, this is like an unreadable language. We need to translate these pixels into something computers understand. So, Vision Transformers first flatten each patch—imagine unrolling each tile into ...
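
Below is a minimal NumPy sketch of steps 1 and 2 above: slicing a 28x28 image into 7x7 patches and flattening each patch into a vector. The random image, the loop-based slicing, and the embedding size of 64 are illustrative choices for this sketch, not details of any particular model.

```python
import numpy as np

# Step 1: slice a 28x28 image into 7x7 patches (4 x 4 = 16 patches).
image = np.random.rand(28, 28)   # a stand-in for a 28x28 grayscale picture
patch_size = 7

patches = []
for row in range(0, 28, patch_size):
    for col in range(0, 28, patch_size):
        patch = image[row:row + patch_size, col:col + patch_size]
        patches.append(patch.flatten())   # step 2: unroll the 7x7 tile into 49 numbers

patches = np.stack(patches)
print(patches.shape)   # (16, 49): 16 patches, each a 49-number "visual word"

# A linear embedding then maps each flattened patch to a fixed-size vector.
embed_dim = 64                              # hypothetical embedding size
projection = np.random.rand(49, embed_dim)  # random here; learned in a real model
embeddings = patches @ projection
print(embeddings.shape)  # (16, 64): one embedding vector per patch
```

In an actual Vision Transformer, the projection matrix is learned during training rather than chosen at random, so the model discovers embeddings that capture useful visual features.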