...
Adding Image-to-Text Capabilities with Gemini
Learn how to process images with Gemini in our Gradio chatbot.
Gemini is a family of multimodal models built by Google. It accepts input in several modalities, including text, images, charts, PDFs, videos, and audio. For our purposes, we are particularly interested in Gemini’s image-processing capabilities. A simple example is generating HTML code from an image of a web page. Adding this ability extends our educational chatbot beyond text-only input. Let’s begin!
Google AI Studio is a web-based tool for prototyping and experimenting with the Gemini models. It is a good place to get started with Gemini and, most importantly, it lets us generate an API key that we can use to access Gemini from code.
Creating a Gemini API key
Let’s quickly walk through the API key creation process. Head over to the AI Studio and log in. Then, follow the slides below:
Now that the API key is created, we can start using Gemini. For Python, we also need to install the google-generativeai library. This can be done with the command below:
pip install google-generativeai
Once again, the library has already been set up for the widgets in this course. Installations are not needed.
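When running the code outside the course widgets, the API key has to be available in the environment before the client is configured. The snippet below is a minimal sketch of one way to read it with a readable error message. The helper name load_api_key is hypothetical and not part of the google-generativeai library; only the variable name GEMINI_API_KEY comes from the lesson.

```python
import os


def load_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the API key stored under `name`, or raise a readable error.

    `load_api_key` is a hypothetical helper for illustration only.
    """
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Create a key in AI Studio and export it "
            "before running the script."
        )
    return key


# Usage (assumes google-generativeai is installed and the key is exported):
#   import google.generativeai as genai
#   genai.configure(api_key=load_api_key())
```

Reading the key from the environment keeps it out of the source code, so the script can be shared without exposing the key.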
The AI Studio also provides a “Get code” button that can be used to get the Python code to send a request to the model. We have copied the code from the AI Studio into the widget below.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Create the model
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}

model = genai.GenerativeModel(
  model_name="gemini-1.5-pro",
  generation_config=generation_config,
)

chat_session = model.start_chat(
  history=[
  ]
)

response = chat_session.send_message("Hello!")

print(response.text)
Let’s review the code:
Lines 1–2: We import os and the google.generativeai library, which we use to interact with Google’s Generative AI API.
Line 4: We configure the generative AI client using an API key stored in the environment variable GEMINI_API_KEY. This grants access to the generative AI models.
Lines 7–13: We define a dictionary named generation_config that specifies optional parameters for generating the response. These parameters control aspects like:
Temperature: Controls the randomness of sampling. Lower values make the output more predictable, and higher values make it more varied (1 is a balanced setting).
Top P: Restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches this value (0.95 keeps almost all of the probability mass).
Top K: Restricts sampling to the K most likely next tokens (64 leaves room for some diversity).
Max output tokens: Limits the length of the generated text (8192 sets a maximum of 8192 tokens, where a token is a word or sub-word).
Response MIME type: Sets the output format (text/plain indicates plain text).
Lines 15–18: We create a GenerativeModel object named model by specifying the model name gemini-1.5-pro and the generation configuration we defined earlier.
Lines 20–23: We initiate a chat session with the model using the ...