How to implement the Whisper ASR API in Python

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. Its primary purpose is to transcribe spoken language into written text, a capability with a wide range of uses, from transcription services to voice-controlled assistants. This Answer will help you understand how to use the Whisper ASR API in Python, giving you practical experience with this tool.

Setting up the environment

Before you get to the coding part, it's important to configure an appropriate environment. Ensure that your system is equipped with Python, as well as the OpenAI Python client library. The latter can be installed with pip:

pip install openai
Install openai

Additionally, you'll need to secure an API key from OpenAI, which serves to validate your requests to the Whisper ASR system.
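
For example, here is a minimal sketch of reading the key from an environment variable and failing fast if it is missing (the variable name SECRET_KEY is simply the one used in the snippets later in this Answer; you can pick any name):

import os
import openai

# Read the API key from an environment variable instead of hard-coding it in source
api_key = os.environ.get("SECRET_KEY")
if api_key is None:
    raise RuntimeError("Set the SECRET_KEY environment variable to your OpenAI API key.")
openai.api_key = api_key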

Implementing the API

With your environment ready, you can begin utilizing the Whisper ASR API. The API provides two main functionalities: transcription and translation.

Transcription

Below is a straightforward example demonstrating how the API can be used to transcribe an audio file:

import openai
import os
# Authenticate with the API key stored in the SECRET_KEY environment variable
openai.api_key = os.environ["SECRET_KEY"]
# Open the audio file in binary mode and request an SRT-formatted transcription
audio_file = open("/assets/sample.mp3", "rb")
transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file, response_format="srt")
print(transcript)
Code explanation

In the provided code snippet, we first import the openai and os libraries and set our API key from the SECRET_KEY environment variable. Next, we invoke the openai.Audio.transcribe method, passing in the audio file we aim to transcribe. The audio file must be in a format supported by Whisper, such as WAV, FLAC, or MP3.

The input audio used in the code snippet above is the sample file at /assets/sample.mp3.

The method yields a response containing the transcription of the audio file, which is subsequently printed.
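
If you prefer the output in another format, the response_format parameter can be changed. The sketch below requests plain text instead of SRT subtitles; it assumes the same sample file and SECRET_KEY environment variable, and the exact set of accepted formats may change over time:

import openai
import os
openai.api_key = os.environ["SECRET_KEY"]
# Request a plain-text transcription instead of SRT subtitles
audio_file = open("/assets/sample.mp3", "rb")
transcript = openai.Audio.transcribe(
    model="whisper-1",
    file=audio_file,
    response_format="text"  # "json", "srt", and "vtt" are other commonly used options
)
print(transcript)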

Example output

If you can't run the code because you don't have an API key, below is the output from a previous successful run.

[00:00.000 --> 00:06.000] ¿Dónde está la parada del autobús?
Transcription using Whisper

Translation

The Whisper ASR API also supports translating spoken language into English. The process is similar to transcription. Here is an example that uses the same audio file as input:

import openai
import os
# Authenticate with the API key stored in the SECRET_KEY environment variable
openai.api_key = os.environ["SECRET_KEY"]
# Open the audio file in binary mode and request an SRT-formatted English translation
audio_file = open("/assets/sample.mp3", "rb")
transcript = openai.Audio.translate(model="whisper-1", file=audio_file, response_format="srt")
print(transcript)
Example output

Below is the result obtained from a previous successful code execution.

[00:00.000 --> 00:06.000] Where is the bus stop?
Translation using Whisper

Managing large audio files

If you're dealing with large audio files, it might be necessary to split them into smaller pieces before sending them to the Whisper ASR API. This is because the API limits the size of the audio file it can process in a single request, currently 25 MB. Audio-processing libraries such as PyDub or SoX can split your audio files.
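
As an illustration, the sketch below uses PyDub (which relies on FFmpeg for MP3 handling) to cut a long recording into ten-minute chunks and transcribe each chunk separately. The file name long_recording.mp3 and the chunk length are placeholders chosen for this example, not part of the Whisper API:

import os
import openai
from pydub import AudioSegment

openai.api_key = os.environ["SECRET_KEY"]

# Load the recording and slice it into 10-minute chunks (PyDub works in milliseconds)
audio = AudioSegment.from_mp3("long_recording.mp3")
chunk_length_ms = 10 * 60 * 1000

for i, start in enumerate(range(0, len(audio), chunk_length_ms)):
    chunk = audio[start:start + chunk_length_ms]
    chunk_path = f"chunk_{i}.mp3"
    chunk.export(chunk_path, format="mp3")  # write the chunk to disk so it can be uploaded

    with open(chunk_path, "rb") as chunk_file:
        transcript = openai.Audio.transcribe(model="whisper-1", file=chunk_file, response_format="text")
    print(f"--- chunk {i} ---")
    print(transcript)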

Conclusion

Whisper ASR is a tool for converting speech into text, and with its Python API, integrating it into your applications is straightforward. Whether you're developing a transcription service, a voice-controlled assistant, or any other application that requires speech recognition, Whisper ASR can be an invaluable resource. Be sure to handle large audio files appropriately and always keep your API key secure.
