Understanding Hand-Drawn Images with Image-to-Text Processing
Learn how our pictionary bot understands hand-drawn images and evaluates them using the image-to-text models in Gemini. Also, understand how images can be sent as prompts to Google Gemini.
Behind the scenes
Multimodal models such as Gemini can work with images as input. This enables them to analyze an image and generate a textual description of its content. Here’s a brief overview of how most image captioning models work:
The image is first processed into a format that is easily digestible for the model.
A Convolutional Neural Network (CNN) encoder analyzes the image, extracting features like edges, objects, and their spatial relationships. This creates a compressed representation of the image’s visual content. A CNN encoder specializes in processing images and videos: it acts like a data summarizer, transforming raw visuals into a compact representation that captures the essence of the content.
The decoder receives the encoded image representation from the CNN. It starts generating words one at a time. At each step, it considers the previous words it generated, the encoded image representation, and its internal knowledge of language. Decoders are like translators: they take a condensed version of information, created by an encoder, and turn it back into something we can understand.
The decoder predicts the next word in the caption based on the accumulated information. This process continues until a stopping sequence is generated or a maximum length is reached.
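To make the encode-then-decode loop described above concrete, here is a minimal toy sketch in Python. It is not how Gemini or any real captioning model is implemented: the hypothetical toy_encode() function stands in for a trained CNN encoder, and the hypothetical toy_next_word() function uses hand-written rules in place of a trained decoder. Only the control flow mirrors the description: encode the image once, then generate one word at a time until a stop token appears or a maximum length is reached.

```python
# Toy illustration only: hand-written rules stand in for a trained CNN
# encoder and a trained language decoder. All names here are hypothetical.

START, STOP = "<start>", "<end>"


def toy_encode(image):
    """Compress a 2D grid of 0/1 pixels into two numbers (a stand-in for CNN features)."""
    height = len(image)
    width = len(image[0])
    filled = sum(sum(row) for row in image)
    return {"fill_ratio": filled / (height * width), "aspect": width / height}


def toy_next_word(features, previous_words):
    """Pick the next word from the previous word and the encoded image."""
    last = previous_words[-1]
    if last == START:
        return "a"
    if last == "a":
        return "dense" if features["fill_ratio"] > 0.5 else "sparse"
    if last in ("dense", "sparse"):
        return "wide" if features["aspect"] > 1 else "tall"
    if last in ("wide", "tall"):
        return "drawing"
    return STOP


def generate_caption(image, max_length=10):
    """Generate words one at a time until the stop token or the length limit."""
    features = toy_encode(image)  # the image is encoded once, up front
    words = [START]
    while len(words) - 1 < max_length:
        word = toy_next_word(features, words)
        if word == STOP:
            break
        words.append(word)
    return " ".join(words[1:])  # drop the start token


sketch = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
]
print(generate_caption(sketch))                # a dense wide drawing
print(generate_caption(sketch, max_length=2))  # a dense
```

The second call shows the maximum-length cutoff: generation stops after two words even though the decoder has not yet produced its stop token.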
Prompts with images
Earlier, in the “Multimodal Prompting with Google Gemini” lesson, we used the generate_content() method to send an image as a prompt. Here’s a quick refresher on how to use an image in a prompt. The playground below has been set up to allow you to upload an image:
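Outside the playground, the same call can be sketched in plain Python. This is a minimal sketch, not the lesson's own code. It assumes the google-generativeai SDK and Pillow are installed and that an API key is available in a GOOGLE_API_KEY environment variable. The model name, the prompt text, and the describe_drawing() helper are illustrative assumptions.

```python
import os

# Assumed prompt wording; adjust it to suit the game.
PROMPT = "Describe what this hand-drawn sketch shows in one or two words."


def describe_drawing(image_path, model_name="gemini-1.5-flash"):
    """Send a text prompt plus an image to Gemini and return the text reply.

    The model name is an assumption; use whichever Gemini model your
    account has access to.
    """
    # Imported here so the file still loads when the SDK is not installed.
    import google.generativeai as genai
    from PIL import Image

    # Assumes the API key is stored in the GOOGLE_API_KEY environment variable.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name)

    with Image.open(image_path) as drawing:
        # Text and image are passed together as a list of prompt parts.
        response = model.generate_content([PROMPT, drawing])
    return response.text
```

Calling describe_drawing("drawing.png"), where drawing.png is a placeholder path to a saved sketch, should return the model's short description as a string.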