Enhancing AI with Audio/Video-to-Text Generation

Learn about the Files API and how to use it to send audio and video files in prompts.

The Files API

The Gemini Files API lets us store and access media files (text, images, audio, and video) for use with the model’s generation capabilities. It is particularly useful when the prompt data exceeds the 20 MB size limit of the standard prompt input, or when we want to provide multimedia content for multimodal prompting. The Files API stores up to 20 GB of files per project, with each file capped at 2 GB. Files are kept for 48 hours and can be accessed with the API key that was used to upload them. The service is free in all regions where the Gemini API is available.
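To make these limits concrete, here is a minimal sketch that encodes them as constants and checks a file against them before upload. The helper names (`needs_files_api`, `fits_project_quota`, `expires_at`) are hypothetical and not part of any Gemini SDK, and treating MB and GB as 1024-based units is an assumption, since the limits above do not specify.

```python
from datetime import datetime, timedelta, timezone

# Limits quoted in the lesson text. Binary (1024-based) units are an
# assumption; the text does not say whether MB/GB are decimal or binary.
MB = 1024 ** 2
GB = 1024 ** 3
INLINE_PROMPT_LIMIT = 20 * MB    # standard prompt input limit
MAX_FILE_SIZE = 2 * GB           # per-file cap in the Files API
PROJECT_QUOTA = 20 * GB          # total storage per project
RETENTION = timedelta(hours=48)  # how long uploaded files are kept


def needs_files_api(size_bytes: int) -> bool:
    """Return True if the payload is too large for the standard prompt input."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError("file exceeds the 2 GB per-file cap")
    return size_bytes > INLINE_PROMPT_LIMIT


def fits_project_quota(stored_bytes: int, new_file_bytes: int) -> bool:
    """Return True if adding the new file keeps the project within 20 GB."""
    return stored_bytes + new_file_bytes <= PROJECT_QUOTA


def expires_at(uploaded_at: datetime) -> datetime:
    """Return the time at which a file uploaded at `uploaded_at` is removed."""
    return uploaded_at + RETENTION
```

The size check only covers the first reason for using the Files API. Multimedia content may still be worth uploading when it is smaller than 20 MB.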

Supported audio formats

Gemini supports the following MIME types for audio files:

  • WAV: audio/wav
  • MP3: audio/mp3
  • AIFF: audio/aiff
  • AAC: audio/aac
  • OGG Vorbis: audio/ogg
  • FLAC: audio/flac

Supported video formats

Gemini supports the following MIME types for video files:

  • video/mp4
  • video/mpeg
  • video/mov
  • video/avi
  • video/x-flv
  • video/mpg
  • video/webm
  • video/wmv
  • video/3gpp
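The audio and video format lists can be turned into a small lookup that picks a MIME type for a local file before it is uploaded. This is only a sketch: the MIME types come from the lists above, but the file extensions mapped to them and the function name `gemini_mime_type` are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# MIME types are taken from the lists above. The extension on the left of
# each entry is an assumed, conventional mapping, not something the lists state.
SUPPORTED_MIME_TYPES = {
    # audio
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
    # video
    ".mp4": "video/mp4",
    ".mpeg": "video/mpeg",
    ".mov": "video/mov",
    ".avi": "video/avi",
    ".flv": "video/x-flv",
    ".mpg": "video/mpg",
    ".webm": "video/webm",
    ".wmv": "video/wmv",
    ".3gp": "video/3gpp",
}


def gemini_mime_type(filename: str) -> Optional[str]:
    """Return the listed MIME type for a file name, or None if unsupported."""
    suffix = Path(filename).suffix.lower()  # case-insensitive extension match
    return SUPPORTED_MIME_TYPES.get(suffix)
```

Returning `None` instead of raising lets the caller decide whether to skip the file or convert it to a supported format first.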

Audio-to-text

Audio-to-text capability is a key component of multimodal LLMs such as Gemini. It allows them to understand and process spoken language alongside other modalities like text and images. Gemini can analyze speech files for summarization, transcription, and answering ...