
How does generative AI work? A step-by-step guide

Zuwayr Wajid
Sep 06, 2024
14 min read

Generative AI is making big waves in how we create things, from images that look real to writing that sounds like a person wrote it. But what is GenAI really, and how does it work?

We’re going to take a high-level view of this technology and break it into bite-sized pieces. It doesn’t matter if you’re new to this or already know a thing or two — we’ll walk through the key ideas that got us to the advanced generative AI models we have today. 

What is generative AI?#

Generative AI is a revolutionary technology that allows users to generate new content across various modalities, such as text, images, audio, and video. These modalities can be used as inputs to, or produced as outputs by, generative AI models.

The term generative AI can be broken down into two parts:

  • Generative, which refers to the ability to generate new things that have not been seen before.

  • Artificial intelligence (AI), which refers to computer programs that can handle tasks that would typically require human-level intelligence. These can include reasoning, learning, problem-solving, and understanding language.

Fast fact: The “G” in GPT (the model you’re probably hearing a lot about) stands for “Generative.” So if you’ve ever wondered what GenAI is, or what’s special about it, now you know what separates GenAI from AI in general!

How did we get here?#

To understand where generative AI is today, we need to step back and see how it all started.

Let’s take a simple example: ChatGPT, a chatbot that takes in text and spits out a response. Like many GenAI tools, ChatGPT uses a technique from natural language processing (NLP) called language modeling.

So, what’s language modeling? It’s basically the process of guessing what word is most likely to come next in a sentence. Think about it: if I say, “Educative is the best,” a language model will try to predict the next word. The model makes its best guess based on the words it has already seen. It picks one, then does the same thing for the next word, and so on.

This is how a lot of GenAI use cases work today — predicting and generating text, images, or even code, by learning from tons of examples.

A language model predicting the next word in a sentence, one word at a time

In the past, traditional language models relied on statistics and probabilities. They made predictions based on how many times the model had seen a phrase during training: given a large amount of text beforehand, the model would know how often a phrase appeared and what its completions were. This method tends to lose efficacy when longer or more context-specific completions are required. These harder problems pushed the field toward a more powerful solution: neural networks.
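To make the counting idea concrete, here’s a minimal sketch of a count-based language model in Python. The tiny corpus is invented for illustration; real statistical models were trained on far more text, but the principle is the same.

```python
# A minimal sketch of a count-based (statistical) language model.
# It predicts the next word purely from how often word pairs
# (bigrams) appeared in the training text -- no neural network involved.
from collections import Counter, defaultdict

corpus = "educative is the best . educative is the best platform . the best is yet to come ."
tokens = corpus.split()

# Count how often each word follows each preceding word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follow_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word` and its probability."""
    counts = follow_counts[word]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

print(predict_next("the"))  # ('best', 1.0): 'best' always follows 'the' here
print(predict_next("is"))   # ('the', 0.666...): 'the' follows 'is' 2 times out of 3
```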

Neural networks were designed to mimic the structure and function of the human brain. They can be trained to perform language completions by predicting the next most likely word. A key difference here is that they learn the complex patterns and relationships between words directly from vast amounts of data. Neural networks can also become more complex themselves, allowing them to become more adept at capturing more intricate patterns.

Language models built on neural networks can achieve the same results as traditional language models. The text the model receives as context is commonly referred to as a prompt. Given that context, the model predicts what comes next.

A neural network completing a partial sentence

The percentages shown in the image correspond to the model’s predicted probability of each word being the next word in the sequence.

Both types of models use probabilities. In traditional models, probabilities are calculated based on parameters, often using mathematical formulas. Their probabilities have a clear statistical interpretation as well. On the other hand, while the output values for neural networks can be interpreted as probabilities, they are not directly calculated like in traditional models. Neural networks learn probabilistic relationships from data through their internal structure and parameters.
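To illustrate that last point, here’s how a network’s raw output scores (called logits) are commonly converted into the probabilities shown in the image above, using the softmax function. The candidate words and scores below are invented for the example.

```python
# Softmax: turn a neural network's raw output scores into probabilities.
import math

def softmax(logits):
    """Exponentiate each score, then normalize so they sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for four candidate next words.
candidates = ["platform", "best", "way", "course"]
logits = [2.1, 0.4, 0.2, -1.0]

for word, p in zip(candidates, softmax(logits)):
    print(f"{word}: {p:.1%}")
```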

Currently, it might appear that the switch from traditional statistical models to more complex neural network-based models has had little impact. To appreciate and understand their value, let’s explore how they are developed.

How are generative models made?#

Building a generative model, especially a language model, can be broken down into a few key steps at a high level:

  1. Collect a large amount of text data. This data becomes the foundation for the model to learn from.

  2. Pick a sentence, remove part of it, and ask the model to predict the missing word. Essentially, you're testing the model by giving it incomplete sentences and seeing how well it fills in the gaps.

  3. Provide feedback and repeat. If the model’s guess is off, you let it know, and over time, it learns to make better predictions.

To get a clearer idea, let’s look at how this works on a smaller scale using a simple neural network. The neural network learns patterns in the data by repeatedly making predictions and adjusting based on feedback. No need to worry about the math right now — we’re just focusing on the core idea.
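Here is that loop in miniature: a single-layer network that predicts the next word from the previous one and nudges its weights and biases whenever its guess is off. The corpus, learning rate, and sizes are arbitrary toy values, not a real training setup.

```python
# A toy next-word predictor trained with the predict-and-adjust loop:
# one linear layer plus softmax, updated by gradient descent.
import numpy as np

corpus = "educative is the best educative is the best".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (V, V))  # weights: one row of scores per input word
b = np.zeros(V)                 # biases

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Steps 2 and 3: show the model (previous word -> next word) pairs
# and nudge the parameters whenever its prediction is off.
for epoch in range(200):
    for prev, nxt in zip(corpus, corpus[1:]):
        x, y = idx[prev], idx[nxt]
        probs = softmax(W[x] + b)
        grad = probs.copy()
        grad[y] -= 1.0      # cross-entropy gradient: predicted minus actual
        W[x] -= 0.1 * grad  # feedback: adjust the weights...
        b -= 0.1 * grad     # ...and biases toward the right answer

probs = softmax(W[idx["the"]] + b)
print("after 'the':", vocab[int(probs.argmax())])  # prints 'best'
```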

A simple neural network with an input layer, two hidden layers, and an output layer

The model shown above has 54 parameters: (3×5+5) + (5×4+4) + (4×2+2). This total is the sum of the parameters in every layer. The number of parameters in a single layer can be calculated using the following formula:

Parameters = input_features × output_features + output_features

The parameters can be calculated for each layer, and they can then be summed up to get the total number of parameters.

For the input layer, we have 3 input features and 5 output features. Plugging these into the formula yields 20, as shown below.

20 = 3 × 5 + 5

The table below calculates the sum of parameters in a neural network with an input layer, two hidden layers, and an output layer (3, 5, 4, and 2 neurons, respectively).

Connection           Input features    Output features    Parameters
Layer 0 → layer 1    3                 5                  20
Layer 1 → layer 2    5                 4                  24
Layer 2 → layer 3    4                 2                  10
Total parameters                                          54
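The same arithmetic in a few lines of Python, if you’d like to try other layer sizes:

```python
# Parameters between adjacent layers = inputs * outputs + outputs
# (one weight per connection, plus one bias per output neuron).
layer_sizes = [3, 5, 4, 2]  # neurons in layers 0 through 3

total = 0
for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
    params = inputs * outputs + outputs
    print(f"{inputs} -> {outputs}: {params} parameters")
    total += params

print("Total:", total)  # 20 + 24 + 10 = 54
```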

Parameters include the weights and biases that the model learns during training. Weights are numerical values associated with connections between neurons; they determine how much one neuron’s activation (the value a neuron outputs after applying its activation function to its input) affects the neurons in the next layer. During training, weights are adjusted to optimize the network’s performance. A bias acts like a constant nudge added to the sum of inputs reaching a neuron, allowing the neuron to activate even if the weighted input alone isn’t strong enough.
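Here’s what that looks like for a single neuron, with made-up numbers:

```python
# One neuron: multiply each input by its weight, sum them, add the
# bias, then apply an activation function (sigmoid here).
import math

inputs  = [0.5, -1.2, 0.8]   # values arriving from the previous layer
weights = [0.9,  0.3, -0.5]  # learned during training
bias    = 0.1                # the constant nudge

weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(weighted_sum))  # the neuron's activation, between 0 and 1
```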

Try to imagine the scale of the generative AI models that have billions of parameters!

Increasing the hidden layers also increases the number of parameters the network can learn. This improves the network’s ability to generalize and enables it to see patterns that might not be clearly evident.

Neural networks typically work with numerical values. This means that content sent as input must first be converted into a numerical representation, typically as a vector. Numbers are much easier to work with and have allowed us to reach better efficiency levels.
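A minimal sketch of one such conversion, mapping words to integer indexes and one-hot vectors. Production models use learned embeddings rather than one-hot vectors, but the idea of text in, numbers out is the same.

```python
# Turn words into numbers: first an integer index, then a one-hot vector.
vocab = {"educative": 0, "is": 1, "the": 2, "best": 3}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print([vocab[w] for w in "educative is the best".split()])  # [0, 1, 2, 3]
print(one_hot("best"))                                      # [0, 0, 0, 1]
```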

Now that we have understood the most basic form of a neural network-based language model, let’s take it further with transformers.

If you’re curious, you can check out this blog on wrapping your head around neural networks in Python.

Transformers#

So, we have a neural network that can predict the next word for us. Where do we go from here?

Similar to how a collection of interconnected neurons makes up a neural network, various blocks of neural networks make up a transformer. The transformer architecture has been a revolutionary development in the domain of generative AI. Most generative models nowadays use this architecture as well.

Fast fact: The “T” in GPT stands for transformers.

A simplified overview of the transformer architecture

The diagram above shows a simplified view of the transformer architecture. Each block contains smaller neural networks. The number of blocks and their arrangements have changed with time as more optimizations are made, given that this is still an active area of research. A key takeaway here is that the function of the transformer stays the same. In fact, we can abstract away the inner workings of the purple block for now and focus on the input.

If you would like to explore this further, check out the blog on attention mechanisms and transformers.

Our input prompt is first tokenized. Tokenization breaks the text into smaller units (tokens), such as words or sub-words, each of which typically represents a meaningful unit of the input. Each token is then converted into a high-dimensional vector (a list of numbers arranged in a specific order, where each number represents a different feature or dimension) called an embedding. Finally, positional encoding, a technique used in transformer-based models like BERT and GPT, incorporates positional information into the embeddings so the model can capture the relative positions of words within the input sequence. Together, these make up the input that is sent to the transformer.
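Here’s a compact sketch of that input pipeline, using a toy word-level tokenizer, random stand-in embeddings, and the sinusoidal positional encoding scheme from the original transformer paper. All the values are illustrative, not those of a real model.

```python
# Tokenize, embed, add positional information -- the transformer's input.
import numpy as np

tokens = "educative is the best".split()  # toy word-level tokenizer
vocab = {w: i for i, w in enumerate(tokens)}
d_model = 8                               # tiny embedding size, for illustration

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), d_model))  # stand-in for learned embeddings

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: each position gets a unique wave pattern."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return pe

x = embeddings[[vocab[t] for t in tokens]]         # token embeddings, shape (4, 8)
x = x + positional_encoding(len(tokens), d_model)  # inject word-position info
print(x.shape)  # this matrix is what the transformer blocks receive
```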

A transformer completing a partial sentence

The transformer’s output and purpose remain the same for us: a complex and powerful architecture that can take some context as input and generate an output. So far, we have mentioned the word “training” a few times. Let’s explore what that means for these transformer-based models.

Gathering and using data#

All the models we have discussed so far rely on learning some information (weights and biases) during the training process. For most neural networks and transformer-based models, this is usually self-supervised learning. Self-supervised learning (SSL) is a technique in which a model learns from unlabeled data by creating its own supervisory signals. Unlike supervised learning, which relies on pre-labeled data (like the label “cat” for an image of a cat), SSL finds ways to extract meaningful information from unlabeled data itself.
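Here’s the trick in miniature: for next-word prediction, the labels come straight out of the raw text itself, so no human annotation is needed. The sentence below is just an example.

```python
# Self-supervised learning: the text supplies its own labels.
text = "generative ai models learn patterns from unlabeled text".split()

# Each example pairs a context (everything so far) with the true next word.
examples = [(text[:i], text[i]) for i in range(1, len(text))]
for context, label in examples[:3]:
    print("input:", context, "-> label:", label)
```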

This is a game-changer. Since most data available on the internet is unlabeled, SSL allows us to tap into this vast resource. This is usually publicly available data, such as Wikipedia, textbooks, public posts, articles, and code repositories. Around 45 TB of text data was used to train GPT-3, a model with around 175 billion parameters. The sheer size of these models is what classifies them as large language models.

Once a model is trained on this enormous amount of data, we have what is called a foundation model. A foundation model is essentially a large language model trained on a massive dataset of text and code. It is good for general-purpose applications and can adapt to various use cases.

Fast fact: The “P” in GPT stands for pre-trained.

Since these models have been trained on a vast amount of diverse data, they might not excel at a particular task. To achieve better results in a specific domain, we need to tune the model further.

Fine-tuning #

Foundation models can be fine-tuned for special-purpose applications. Fine-tuning typically involves taking a pre-trained model and adapting it to a specific task by providing domain- or task-specific data.

A typical fine-tuning workflow

The pre-trained network already has a set of weights that it learned. These weights are used as a starting point, and during fine-tuning, they are updated to better suit a particular use case. Say we have some financial data and would like our model to generate reports from it. We specialize the model on this data, and can then prompt it to generate something specific to our task, for example, calculating the profits for the last quarter. Fine-tuning is very important because it enables the creation of models for special-purpose applications.
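As a concrete illustration, here’s roughly what this workflow looks like with the Hugging Face transformers and datasets libraries, one common tooling choice (the blog doesn’t prescribe any particular one). The base model, data file name, and hyperparameters are placeholders, not recommendations.

```python
# A minimal fine-tuning sketch: start from pre-trained weights and
# continue training on domain-specific text.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token             # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")  # pre-trained starting point

# Hypothetical domain-specific data, e.g. our financial reports.
data = load_dataset("text", data_files={"train": "financial_reports.txt"})
train_data = data["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1),
    train_dataset=train_data,
    # mlm=False selects plain next-token (causal) language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates the pre-trained weights to suit the new domain
```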

Most chatbots are tuned to be safe general-purpose chatbots.

How do generative AI models improve?#

We have seen what is possible with transformer-based models. The current state-of-the-art models excel across domains, often topping benchmarks and exceeding expectations of what was previously possible. The transformer architecture was a definite breakthrough, and most newer models use it at their core. Models, however, are also improving at a rapid pace.

Generative AI models improve in several ways, driven by advancements in various areas:

  • More data: Access to larger, higher-quality datasets allows models to learn more intricate patterns and generate more realistic outputs.

  • More computational power: Advancements in hardware and cloud computing technologies allow for the training of more complex models on massive datasets.

  • Newer architectures: New architectures and fine-tuning techniques enable models to process information more efficiently, handle complex relationships, and adapt to specific tasks.

The key takeaway is that most of the improvement is driven by scale, either by using more data or simply by creating bigger models. In fact, the most powerful models currently have upwards of half a trillion parameters!

Can you trust the outputs?

With all this power, it might be fair to think that these models can generate some truly impressive outputs. This is true to some extent, but the underlying problem of hallucinations (incorrect or misleading outputs produced by the model) still exists. While errors have been reduced, it’s always best to double-check the responses.

Every advancement comes with its own set of challenges.

The impact of generative AI models#

Creating and working with generative AI models comes with huge costs. Training a foundation model can take months and cost millions of dollars; it reportedly cost OpenAI around $100 million to train GPT-4. This cost can be a huge deterrent for new players entering the market.

In terms of human capital, generative AI has made some tasks very easy to automate, narrowing certain job markets. While AI has not replaced most jobs outright, it has empowered its users to greatly scale up their productivity. The environmental impact is often overlooked as well: the carbon footprint of training these massive models is far from insignificant.

Ever wondered how AI is transforming the online world? Dive deeper into the fascinating applications with this blog post exploring 4 ways generative AI is changing the internet.

One of the most popular generative AI chatbots, OpenAI’s ChatGPT, set the record for the fastest-growing user base, amassing 100 million users just 64 days after launch. Whether it's "generative artificial intelligence" or "GenAI," these phrases have been thrown around a lot in a short amount of time.

Generative AI models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) have been around for almost a decade, so why is the generative AI hype only happening now?

It's because the technologies have rapidly improved:

  • Improved algorithms: Deep learning models, particularly Generative Adversarial Networks (GANs) and Transformers, have made significant leaps in their ability to understand and generate complex data like images, text, and code.

  • Increased computing power: Access to powerful and affordable computing resources (like cloud computing) has made it feasible to train and run these complex AI models.

  • Data explosion: The sheer volume of digital data available for training these models has skyrocketed, further fueling their capabilities.

And for fun, a short quiz to see if you were paying attention:

Match the answer:

  • The “G” in GPT stands for: Generative

  • The “P” in GPT stands for: Pre-trained

  • The “T” in GPT stands for: Transformer

  • GPT stands for: Generative Pre-trained Transformer


Moving ahead into the GenAI future#

Generative AI has enabled computers to learn, think, and communicate in entirely new ways. What’s more, multi-modal generative AI (models that can process and generate different types of data, such as text, images, and audio) has truly unlocked a new realm of possibilities, allowing us to combine the knowledge of countless minds into a single, infinitely patient entity. We’ve reached a point where providing a model with a photo and simply asking, “What am I looking at?” is enough to get an accurate answer within seconds.

As we continue to develop these models, we unlock even greater potential for creativity, problem-solving, and understanding our world. Generative AI is like the modern-day equivalent of Aladdin’s lamp, but with a much smarter genie. However, like the genie, AI's response is only as effective as the user’s prompt. As such, it's essential that we hone our prompt engineering skills to make the most of them.

What will you create with generative AI?

To build your Generative AI skills, check out our Gen-AI courses covering everything from prompt engineering to building apps by leveraging LLMs.

Here are some beginner-friendly courses you may want to check out:


  
