What Are Generative AI Models?
Learn about a few basic concepts of generative AI and get an overview of LLMs and multimodal models.
A historical look at generative AI
Generative AI models are a type of artificial intelligence that can create entirely new content, like text, images, or even code. These models learn patterns from large datasets and use this knowledge to generate new, original data.
The roots of generative AI go back to the 1950s and 1960s. The perceptron was implemented in 1957, and one of the first chatbots, ELIZA, was created by German-American computer scientist Joseph Weizenbaum at MIT in 1966. Back then, these innovations were a small but crucial step toward generative AI. The usefulness of generative models rose rapidly with new deep-learning techniques such as Recurrent Neural Networks (RNNs), developed in the 1980s, and Long Short-Term Memory (LSTM) networks, introduced in 1997. Variational Autoencoders (VAEs), introduced in 2013, and Generative Adversarial Networks (GANs), introduced in 2014, made it possible not only to label images but also to generate them.
The most recent breakthroughs in both deep learning models and computational power have led to some truly remarkable results. The models that have garnered the most attention in the AI community are based on the transformer architecture. Introduced in the 2017 paper “Attention Is All You Need,” the transformer architecture has evolved rapidly over the past few years and underpins models such as ChatGPT, Llama, and Gemini.
These breakthroughs also paved the way for image generation models such as DALL·E, Midjourney, and Stable Diffusion. The pace of development in generative AI has made it difficult to keep up with everything that is happening. However, a handful of key terms have remained constant.
Buzzwords galore
After a brief blockchain and Web 3.0 boom in the early 2020s, the tech industry is now back on the AI bandwagon. This has led to a few buzzwords such as “artificial intelligence,” “machine learning,” “deep learning,” “generative AI,” “large language models,” and “multimodal models” being used more frequently. Let’s break down what they mean:
Artificial intelligence: This is a broad field of computer science focused on creating intelligent machines that can think and act like humans. AI encompasses various approaches, such as rule-based systems and evolutionary algorithms, but machine learning is a powerful technique that has fueled many advancements.
Machine learning: This is a subfield of AI in which computers learn from data without explicit programming. Imagine showing a computer thousands of cat pictures—machine learning lets it identify new cats on its own. This is the foundation for many intelligent systems we see today.
Deep learning: This is a specific type of machine learning inspired by the structure and function of the human brain. It uses artificial neural networks with multiple layers to process complex data like images, text, and speech.
Generative AI: This leverages deep learning to analyze existing data, capture its underlying patterns, and use those patterns to create new content, such as text, images, or music. GANs (Generative Adversarial Networks) are one type of GenAI model that pits two neural networks against each other: a generator creates new data, while a discriminator tries to identify whether that data is real or generated. This competition pushes the generator to improve its creations over time (see the sketch after this list).
Large language models: These are specialized AI models trained on massive amounts of text data. They can generate human-like text in response to a wide range of prompts and questions. By learning language patterns, they can perform tasks such as translation, summarization, and text generation.
Multimodal models: Multimodal models integrate and process multiple types of data, such as text, images, and audio, within a single framework. This enables them to understand and generate content across different modalities. Imagine an AI system that can not only describe an image in words but also generate a new image based on a textual query.
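To make the generator-versus-discriminator idea from the GAN definition above concrete, here is a minimal PyTorch sketch that trains a toy GAN to mimic samples from a 1-D Gaussian distribution. The network sizes, learning rate, and target distribution are illustrative choices for this example, not taken from any production model.

```python
# Minimal GAN sketch: the generator learns to mimic samples drawn from a
# Gaussian with mean 4.0, while the discriminator learns to tell real
# samples from generated ones.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Generator: maps 8-dimensional random noise to a single "data" value.
generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
# Discriminator: outputs the probability that its input is a real sample.
discriminator = nn.Sequential(
    nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid()
)

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.25 + 4.0   # samples from the target distribution
    fake = generator(torch.randn(64, 8))     # the generator's current attempt

    # Train the discriminator: label real samples 1 and generated samples 0.
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    d_opt.step()

    # Train the generator: try to make the discriminator label fakes as real.
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_loss.backward()
    g_opt.step()

# The mean of generated samples should drift toward the target mean of 4.0.
print(generator(torch.randn(1000, 8)).mean().item())
```

After a couple of thousand steps, the generated samples cluster around the target mean, which is the GAN dynamic in miniature: the discriminator’s feedback steadily pushes the generator toward more realistic output.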
The word “AI” was mentioned 121 times at Google’s I/O developer conference in 2024.
The two terms most closely related to Google Gemini are large language models and multimodal models.
Large language models
Before we dive into large language models, let’s first look at what language models are. A language model is a probability distribution over sequences of words in a language. Language modeling, in turn, is the task of predicting which word comes next, given the sequence of words that has already occurred. These models can predict the next word in a sequence, translate languages, summarize text, and generate code. Because these models can process human language, they can also aid in market research and analysis, or even assist in materials discovery by analyzing research papers.
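To illustrate the idea of predicting the next word from the words that came before, here is a toy bigram language model in Python. The three-sentence corpus and the one-word context are deliberate simplifications for this sketch; real language models are trained on far more text and condition on much longer contexts.

```python
# Toy bigram language model: estimate P(next_word | current_word) by counting
# word pairs in a tiny corpus, then report the distribution over continuations.
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        bigram_counts[current_word][next_word] += 1

def next_word_distribution(word):
    """Return P(next_word | word) estimated from the bigram counts."""
    counts = bigram_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("the"))
# e.g. {'cat': 0.333..., 'mat': 0.166..., 'dog': 0.333..., 'rug': 0.166...}
```

Picking the highest-probability word (or sampling from the distribution) at each step, over and over, is the same basic loop that lets much larger models generate whole paragraphs.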
So, what is all the hype around large language models? What makes them large? There are two reasons why these language models can be considered large:
Massive data consumption: LLMs are trained on truly enormous datasets of text and code, mostly taken from the internet. This data can include books, articles, code repositories, web pages, and even conversational transcripts.
Complex model architecture: LLMs use intricate neural network architectures, often based on transformers. These neural networks have many parameters, ranging from a few billion to hundreds of billions. Training these models requires significant computational power, often involving hundreds or thousands of GPUs over weeks or months. The sheer number of parameters allows LLMs to identify complex patterns within the massive datasets they are trained on.
OpenAI’s GPT-3 model was trained on approximately 570 GB of text data and has 175 billion parameters.
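To put a number like 175 billion parameters into perspective, the following back-of-the-envelope sketch estimates the memory needed just to store the weights at common numeric precisions. It ignores optimizer state, activations, and other training overhead, which push the real requirements much higher.

```python
# Back-of-the-envelope memory estimate for storing model weights only.
params = 175e9  # GPT-3-scale parameter count

bytes_per_param = {"float32": 4, "float16/bfloat16": 2, "int8": 1}

for precision, nbytes in bytes_per_param.items():
    gigabytes = params * nbytes / 1e9
    print(f"{precision:>17}: ~{gigabytes:,.0f} GB just for the weights")

# float32:           ~700 GB
# float16/bfloat16:  ~350 GB
# int8:              ~175 GB
```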
The advent of multimodal models
Imagine a world where AI interacts with information in silos—text is text, images are just pictures, and videos are silent streams. This is unimodal data. Traditional AI models often function within this limitation, struggling to connect the dots between different data types. Most traditional language models and text-analysis services, such as OpenAI’s GPT-3, Amazon Comprehend, and Google’s BERT, work only with text. Multimodal models, by contrast, can handle more than just text: they process images, audio, and text within a single system.
This is where the need for multimodal solutions arises. The real world is a blend of experiences, and multimodal data reflects this complexity. For example, consider a web page with a cookie recipe. This would not just be text; it could have an image of the finished dish and a step-by-step video. Multimodal AI models, such as Gemini, can ingest and process information from different modalities—text, code, audio, images, and video—all at once.
Let’s build on the recipe example. With a unimodal model, the most we can do is provide a textual description or an image of the food we want to prepare and ask for its recipe. Because the model is trained on only one type of data, it can respond with only a text-based recipe or an image. However, let’s say you had a picture of your grandma’s cookies, a handwritten note with a partial recipe, and a voice memo describing the taste. A multimodal model can piece this data together: it could fill in the missing ingredients and steps, suggest a few similar recipes, or even generate a shopping list!
The exact response will depend on the question we ask the model.
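As a rough sketch of how the cookie example could look in code, the snippet below sends an image and a text prompt to a Gemini model in a single request using the google-generativeai Python SDK. The model name, file name, and prompt here are placeholder assumptions; check the current Gemini API documentation for up-to-date model identifiers and for attaching audio or video as well.

```python
# Sketch: sending an image and text to a multimodal Gemini model in one request.
# Assumes the google-generativeai package, a valid API key, and a local photo;
# the model name and file name below are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder API key
model = genai.GenerativeModel("gemini-1.5-flash")  # example multimodal model name

cookie_photo = Image.open("grandmas_cookies.jpg")  # hypothetical image file

partial_recipe = """Grandma's cookies (partial note):
- 2 cups flour, 1 cup butter, ... (rest of the note is illegible)"""

response = model.generate_content([
    cookie_photo,
    "This is a photo of the cookies and a partial recipe:\n" + partial_recipe +
    "\nFill in the likely missing ingredients and steps, and suggest a shopping list.",
])
print(response.text)
```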