This Is Just the Beginning

Discover breakthroughs in Generative AI with MLLMs such as Kosmos-1, which handles both text and visuals, and GPT-4, which showcases superior intelligence, enhanced safety, and increased usability across various tasks.

Throughout this course, we have seen how Generative AI and, more specifically, GPT models are revolutionizing the way both individuals and large enterprises work.

Nevertheless, we have embarked on a journey where ChatGPT and GPT models represent only the first steps toward an era of unprecedented technological advancements. As we have seen throughout the course, these models have already demonstrated exceptional capabilities in language understanding and generation. However, the true potential of Generative AI has yet to be fully realized.

The first releases of multimodal large language models (MLLMs) and Microsoft’s introduction of the Copilot system have already revealed a glimpse of what to expect.

The advent of multimodal large language models (MLLMs)

So far, we’ve mainly focused on large language models (LLMs), as they are the architecture behind the GPT-x family and ChatGPT. These models are trained on massive amounts of text data, such as books, articles, and websites, and use a neural network to learn the patterns and structure of human language. If we want to combine additional Generative AI capabilities, such as image understanding and generation, with LLMs, we need the support of separate models, such as DALL-E. That was the case until the introduction of MLLMs.

MLLMs are AI systems that combine NLP with computer vision to understand and generate both textual and visual content. These models are trained on massive amounts of data, such as images and text, and are capable of generating human-like responses to queries that include both text and visual inputs.

In recent months, there have been great developments in the field of MLLMs, and in the next sections, we are going to focus on two main models: Kosmos-1 and GPT-4.
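Before turning to them, it may help to see what a multimodal query looks like in practice. The snippet below is a minimal sketch, assuming the OpenAI Python SDK (version 1.x or later) and access to a vision-capable chat model; the model name, image URL, and question are illustrative placeholders rather than part of the course material:

```python
# Minimal sketch of a multimodal (text + image) query.
# Assumptions: the `openai` package (v1.x) is installed, the
# OPENAI_API_KEY environment variable is set, and a vision-capable
# model is available under the name used below. The image URL and
# the question are placeholders.
from openai import OpenAI

client = OpenAI()  # picks up the API key from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is unusual about this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/street-scene.jpg"},  # placeholder
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The key point is that a single request interleaves text and image content, and the model reasons over both jointly instead of delegating the visual part to a separate system.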

Kosmos-1

In the paper “Language Is Not All You Need: Aligning Perception with Language Models,” Microsoft researchers Shaohan Huang et al. introduced Kosmos-1, an MLLM that can respond to both language and visual cues. This enables it to perform tasks such as image captioning and visual question answering.

While LLMs such as OpenAI’s ChatGPT have gained popularity, they struggle with multimodal inputs such as images and audio. Microsoft’s research paper highlights the need for multimodal perception and real-world grounding to advance toward artificial general intelligence (AGI).

Kosmos-1 can perceive various modalities, follow instructions through zero-shot learning, and learn from the provided context using few-shot learning. Demonstrations of the model show its potential to automate tasks in various situations involving visual prompts.
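To make the idea of image captioning with a vision-language model concrete, here is a minimal sketch using an openly available model (BLIP) from the Hugging Face transformers library as a stand-in; the checkpoint name and sample image URL are illustrative assumptions, not details from the Kosmos-1 paper:

```python
# Illustrative image-captioning sketch with an open vision-language model
# (BLIP) from Hugging Face. This is a stand-in for the kind of visual task
# Kosmos-1 handles, not Kosmos-1 itself; the checkpoint and image URL are
# assumptions chosen for demonstration.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

# Sample image (a commonly used COCO validation picture); any local or
# remote image could be substituted here.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Unconditional captioning: no text prompt, the model simply describes the image.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Passing a short text prefix to the processor turns this into conditional captioning; a fully fledged visual question answering setup would require a model head trained for that task.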

The following figure provides an example of how it functions:
