Getting Started with Google Gemini

Get an overview of the evolution of Google's AI models. Explore Gemini along with its architecture and different variants.

Now that we’ve familiarized ourselves with the course’s structure, the requirements, and what to expect moving forward, let’s look at the evolution of Google Gemini and its predecessor language models.

Evolution of Google’s AI models

Google has developed a wide range of AI models for conversational AI and assistants, natural language processing and understanding, and image and video processing. Google’s advanced language models, such as LaMDA, PaLM, and Gemini, have significantly innovated the AI field.

LaMDA

LaMDA (Language Model for Dialogue Applications) is a conversational AI model and a family of transformer-based neural language models. It is the successor of Google’s Meena, an end-to-end conversational model, and it has two generations: the first was announced in 2021 and the second in 2022. The model has up to 137B parameters and is pretrained on 1.56T words of human dialogue data and stories to support human-like, open-ended conversations.

It is built on a seq2seq (sequence-to-sequence) architecture, which transforms sequences using an encoder and a decoder. The encoder understands the input sequence and passes its context to the decoder, which generates the response. Bard, a conversational AI chatbot by Google, was powered by LaMDA.
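To make the encoder-decoder idea concrete, here is a minimal toy seq2seq sketch in PyTorch. It uses small GRU networks rather than transformer blocks purely to show the encoder → context → decoder flow; all names and sizes are illustrative assumptions, not LaMDA’s actual implementation.

```python
# A minimal GRU-based encoder-decoder (seq2seq) toy -- illustrative only.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 100, 32, 64  # toy sizes, chosen arbitrarily

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):
        # Compress the input sequence into a context (final hidden state).
        _, hidden = self.rnn(self.embed(src))
        return hidden

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tgt, context):
        # Generate the response conditioned on the encoder's context.
        output, _ = self.rnn(self.embed(tgt), context)
        return self.out(output)

src = torch.randint(0, VOCAB, (1, 10))   # fake input token IDs
tgt = torch.randint(0, VOCAB, (1, 8))    # fake partial response
logits = Decoder()(tgt, Encoder()(src))  # (1, 8, VOCAB) next-token scores
```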

PaLM

PaLM (Pathways Language Model) is a large language model proficient in natural language processing and understanding. It is the successor of the LaMDA model and was announced in 2022. It was trained on a high-quality dataset of 780 billion tokens covering natural language tasks in over 100 languages. PaLM 2 was announced in 2023; it has 340 billion parameters and was trained on 3.6 trillion tokens.


It is based on a transformer model with a decoder-only architecture, which contains a stack of decoder-only transformer blocks; the depth of the stack depends on the size of the model. The model focuses on predicting the next word in a sequence by looking at the previous words. The input passes through the stack of decoder-only transformer blocks, each using multi-layer self-attention to ensure that each word only considers the words that came before it. This way, each timestep (a specific position or index of a word in the sequence) attends to itself and past timesteps, resulting in responses with richer context.
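To illustrate the key mechanism, here is a toy causal (masked) self-attention step in PyTorch. The sizes are arbitrary and the query/key/value projections are omitted for brevity; it only demonstrates how the mask prevents each position from attending to future positions, as decoder-only models like PaLM do.

```python
# Toy causal self-attention: each position attends only to itself
# and earlier positions. Shapes are illustrative, not PaLM's.
import torch
import torch.nn.functional as F

T, D = 5, 16                       # sequence length, embedding size
x = torch.randn(1, T, D)           # fake token embeddings
q, k, v = x, x, x                  # single head, no projections, for brevity

scores = q @ k.transpose(-2, -1) / D ** 0.5       # (1, T, T) attention scores
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
weights = F.softmax(scores, dim=-1)               # each row sums to 1 over the past
context = weights @ v                             # each timestep mixes only past info
```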

Gemini 1.0

Google Gemini 1.0 is a significant milestone in the development of AI technology. It is the successor of LaMDA and PaLM 2. It offers advanced multimodal reasoning, planning, understanding, and more.

Gemini, a Multimodal LLM

Gemini 1.0 is trained on large datasets and supports a 32K-token context length. It can generate text, describe images, translate languages, and answer queries efficiently and descriptively. It has become an integral tool for developers building advanced AI applications.
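As a preview of what building on Gemini looks like, here is a minimal, hypothetical sketch using Google’s google-generativeai Python SDK (pip install google-generativeai). The model name and placeholder API key are illustrative.

```python
# Hypothetical minimal example of calling a Gemini model from Python.
# Assumes the google-generativeai package and a valid API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")      # placeholder key
model = genai.GenerativeModel("gemini-pro")  # illustrative model name
response = model.generate_content("Summarize the Gemini model family.")
print(response.text)                         # generated text
```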

As Google CEO Sundar Pichai put it, “It’s the beginning of a new era of AI at Google: the Gemini era.”

Gemini 1.0 is built on a transformer decoder architecture jointly trained across text, image, video, and audio. It not only provides generalist capabilities across different data modalities and a wide range of NLP tasks but also shows state-of-the-art reasoning capabilities in each respective domain.

Gemini 1.0 architecture

Let’s break down the workflow to understand the model (a toy sketch follows the list):

  • Input data: The user passes the raw input data, which can be in any supported form.

  • Encoder: This is a multimodal encoder that takes different input formats, such as text, images, audio, video, 3D models, and graphs. It uses techniques like embedding to convert each input type into a numerical representation that is then passed to the model.

  • Model: It receives the input representation from the encoder. The model type depends on the specific task, such as translation, generation, etc. The model performs the task and passes the result to the decoder.

  • Decoder: It receives the task results from the model and converts the output representation to a human-readable form.
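Here is a deliberately oversimplified, hypothetical sketch of that encoder → model → decoder flow. The three functions are stand-ins invented for illustration; they are not Gemini’s real components.

```python
# Toy stand-ins for the three workflow stages described above.
def encode(raw_input: str) -> list[float]:
    # Multimodal encoder stand-in: convert input to a numerical representation.
    return [float(ord(ch)) for ch in raw_input]

def model(representation: list[float]) -> list[float]:
    # Model stand-in: perform some task on the numerical representation.
    return [x + 1.0 for x in representation]

def decode(output_repr: list[float]) -> str:
    # Decoder stand-in: convert the output representation back to readable text.
    return "".join(chr(int(x)) for x in output_repr)

print(decode(model(encode("HAL"))))  # -> "IBM"
```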

Gemini 1.5

Google released a new version, Gemini 1.5, to overcome the limitations of Gemini 1.0. The 1.0 version was restricted to 32K tokens and did not perform efficiently on long context windows. Gemini 1.5 was initially announced in two variants, 1.5 Pro and 1.5 Flash; both have a context window of up to one million tokens by default. We can also sign up for a context window of two million tokens for 1.5 Pro. Gemini 1.5 achieved near-perfect performance even with context windows of up to 10M tokens during testing.

Gemini 1.5 is built upon an MoE (Mixture of Experts) and transformer-based model architecture. A traditional transformer model operates as one large neural network, while an MoE model is divided into smaller “expert” neural networks. Each expert handles a specific kind of work; for example, one expert might specialize in text processing while another specializes in image processing. Depending on the type of input given, MoE models activate only the most relevant experts in the network. A component called the gating network decides which experts the input data is passed to by assigning different weights to each expert.

Gemini 1.5 architecture

Let’s break down the workflow to understand the model (a toy sketch follows below):

  • Input data: The user passes the raw input data, which can be in any supported form.

  • Gating network: The gating network processes the input and then assigns weights to each expert according to the input data.

  • Experts processing: The selected experts process the input data routed to them.

  • Combining outputs: The outputs from the experts are combined according to the weights assigned by the gating network.

  • Response: The combined result is passed through the output layer to generate the response.

This mixture of experts significantly enhances the model’s efficiency.
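The sketch below is a toy Mixture-of-Experts layer in PyTorch: a gating network scores every expert, only the top-scoring experts run, and their outputs are combined using the gate’s weights. All sizes and the top-k routing choice are illustrative assumptions, not Gemini 1.5’s actual configuration.

```python
# Toy MoE forward pass: gate -> pick top-k experts -> weighted combination.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, dim=8, num_experts=4, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)        # score every expert
        top_w, top_i = weights.topk(self.top_k, dim=-1)  # keep only the best k
        out = torch.zeros_like(x)
        for w, i in zip(top_w[0], top_i[0]):
            out += w * self.experts[int(i)](x)           # weighted expert outputs
        return out

print(ToyMoE()(torch.randn(1, 8)).shape)  # torch.Size([1, 8])
```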

Gemini models

Google Gemini is currently offered in the following model variants:

Nano

It is designed for on-device tasks (with or without a data connection). There are two versions of Nano:

  • Nano-1 with 1.8B parameters

  • Nano-2 with 3.25B parameters

These versions target low-memory and high-memory devices, respectively. Nano is efficient for summarization, comprehension, factuality, and reasoning tasks.

Flash

It is a lightweight Gemini 1.5 model optimized for speed and efficiency. It performs better than 1.0 Pro and at the same level as 1.0 Ultra on some benchmarks. It has a context window of one million tokens, which means it can process one hour of video, 11 hours of audio, more than 30,000 lines of code, or over 700,000 words.
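As a rough sanity check on the word count, here is a back-of-the-envelope calculation assuming the common heuristic of about 0.75 English words per token (an approximation, not an exact ratio):

```python
# Rough capacity estimate for a one-million-token context window.
WORDS_PER_TOKEN = 0.75                        # common heuristic, not exact
context_tokens = 1_000_000
print(int(context_tokens * WORDS_PER_TOKEN))  # ~750,000 words, i.e., "over 700,000"
```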

Pro

This model is built for scaling across a wide range of tasks. It is well-optimized in terms of cost and latency. It exhibits strong reasoning performance and broad multimodal capabilities. It is useful for tasks where computational power and memory are the main constraints, such as large-scale data analysis and AI model training.

Note: Gemini 1.5 Pro has a larger context window of up to two million tokens and a faster response time. It has surpassed the 1.0 Pro and 1.0 Ultra models on a wide range of benchmarks.

Ultra

It is the most powerful variant and excels at highly complex tasks. It supports multiple languages and is optimized for generating high-quality output on demanding tasks like coding and multimodal reasoning. This makes it useful in the educational domain, for example, for solving complex mathematical problems and for personalized learning.

Gemini Chatbot

Besides the Gemini large language models, Google released a chatbot, Bard, powered by the LaMDA model, in 2023. To keep advancing its features and efficiency, Bard later switched from LaMDA to the PaLM model. It then took a further step by adopting the Gemini model as its conversational engine and was rebranded from Bard to Gemini.

Gemini chatbot

Gemini is freely available with no daily usage limitations. It uses a tuned version of the Gemini 1.0 Pro model. It can understand more than 100 languages and write code in multiple programming languages. Users can pass text, images, and audio as input to the free version. The Gemini chatbot also has a paid version, called Gemini Advanced, that uses Gemini 1.5 Pro and has a one-million-token context window. Gemini Advanced enables us to upload files and analyze data.

Gemini Advanced is more capable at complex reasoning, following instructions, generation tasks, and code generation, and it is more efficient at creative tasks. Gemini Advanced also has a better understanding of conversational context.

Note: Try different tasks on the free Gemini chatbot by passing it different prompts.

Gemini vs. GPT-4V

Let’s compare Gemini with GPT-4V, a multimodal variant of GPT, on multimodal benchmarks covering image, video, and audio:

| Capability | Task | Gemini | GPT-4V |
|------------|------|--------|--------|
| Image | Multi-discipline college-level reasoning problems | 59.4% | 56.8% |
| Image | OCR on natural images | 82.3% | 78.0% |
| Video | English video captioning (CIDEr) | 62.7 | 56.0 |
| Video | Video question answering | 54.7% | 46.3% |
| Audio | Automatic speech translation (BLEU score) | 40.1 | 29.1 |
| Audio | Automatic speech recognition (word error rate; lower is better) | 7.6% | 17.6% |
The above stats show that Google Gemini outperforms GPT-4V across these image, video, and audio benchmarks.