
Multimodal Models in Generative AI

Explore the exciting world of multimodal AI and discover how combining different types of data makes AI smarter, more robust, and more like us in understanding the world.

The world is multimodal

Take a moment to consider how you experience the world around you. Do you rely solely on your eyes? Probably not! You’re likely engaging multiple senses—seeing, hearing, and perhaps even feeling things at this very moment. You might be reading this text while hearing sounds in your environment. Maybe you catch the aroma of coffee brewing or notice the texture of your chair.

Humans are amazing at using their senses—sight, sound, touch, smell, and taste—together to understand what’s happening. We don’t just rely on one sense at a time; we blend information from all of them to get a much richer and more complete picture of the world.

Now, think about AI. For a long time, AI systems were often designed to understand just one type of information at a time. An AI might be great at understanding text or amazing at recognizing images, but it usually focuses on just one of these things. We call this unimodal AI—AI that uses just one data mode.

But the real world isn’t unimodal, is it? It’s full of different kinds of information happening all at once! That’s why we’re now moving toward multimodal AI.

Multimodal AI is like teaching AI to be more like us—to understand the world by simultaneously processing information from multiple data types. Just like we use all our senses, multimodal AI draws on different types of data to build a more complete and intelligent understanding.

Why is this important? Imagine an AI trying to understand a video. If it can only see the video (visual data), it might miss important information in the soundtrack (audio data), like spoken words or music that sets the mood. To truly understand the video, the AI must be able to see and hear—to be multimodal!

What are modalities?

Okay, we’ve been using the word modality a lot. But what exactly is a modality in the world of AI?

In simple terms, a modality refers to a distinct type of data or sensory input. Think of it as a different way of experiencing or representing information. For us, modalities are our senses: sight, sound, touch, smell, and taste. For AI, modalities are different types of data it can process.

Here are some of the most common modalities that multimodal AI deals with:

  • Visual modality: This includes data related to sight:

    • Images: Still pictures, photographs, drawings

    • Videos: Sequences of images that create motion

  • Auditory modality: This is data related to hearing:

    • Speech: Spoken words, human language in audio form

    • Sounds: Environmental sounds, music, and noises

  • Textual modality: This is data in written language form:

    • Written language: Documents, books, articles, web pages, social media posts, code

You might also encounter other modalities depending on the specific application, such as:

  • Sensor data: Readings from various sensors like temperature, pressure, GPS, lidar, and radar (especially in robotics and ...
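To make the taxonomy above concrete, here is a minimal sketch of how one multimodal sample might look once each modality is turned into numbers. The shapes, sample rate, and token IDs are illustrative assumptions, not a fixed standard—real systems pick these based on their models and tokenizers.

```python
import numpy as np

# A toy multimodal sample: each modality becomes a numeric array.

# Visual modality: a small RGB image, height x width x color channels
image = np.zeros((64, 64, 3), dtype=np.uint8)

# Auditory modality: one second of mono audio at an assumed 16 kHz sample rate
audio = np.zeros(16000, dtype=np.float32)

# Textual modality: token IDs from some tokenizer (hypothetical values)
text_tokens = np.array([101, 2023, 2003, 1037, 2678, 102])

# Sensor modality: e.g., GPS latitude, longitude, and a temperature reading
sensors = np.array([37.77, -122.42, 21.5])

# A multimodal model would consume all of these together for one example
sample = {"image": image, "audio": audio, "text": text_tokens, "sensors": sensors}
for name, data in sample.items():
    print(name, data.shape, data.dtype)
```

The key point is simply that every modality, however different it feels to us, ends up as arrays of numbers that a model can process jointly.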
