Transcribing Speech¶
In this recipe, we go over how to transcribe the speech from an audio file using Gemini 1.5 Flash’s audio capabilities.
Background¶
LLMs have significantly advanced speech transcription beyond traditional machine learning techniques through improved handling of diverse accents and languages and the ability to incorporate context for more precise transcriptions. Additionally, LLMs can leverage feedback loops to continuously improve their performance and correct errors through simple prompting.
Setup¶
Let's start by installing Mirascope and its dependencies:
!pip install "mirascope[gemini]"
import os
os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"
# Set the appropriate API key for the provider you're using
Transcribing Speech using Gemini¶
With Gemini’s multimodal capabilities, audio input is treated just like text input, which means we can use it as context to ask questions. We will use an audio clip provided by Google of the countdown of the Apollo 11 launch. Note that if you use your own URL, Gemini currently has a limit of 20971520 bytes (20 MiB) for inline audio when not using its File API.
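Since this limit applies to inline data, it can help to verify the file size before making the call. Below is a minimal sketch that issues a HEAD request and compares the reported Content-Length against the documented limit; the check_audio_size helper is hypothetical, not part of Mirascope or the Gemini SDK:

import urllib.request

GEMINI_INLINE_BYTE_LIMIT = 20971520  # documented limit for inline (non-File API) audio


def check_audio_size(url: str) -> None:
    # Hypothetical helper: read the Content-Length header via a HEAD request
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        size = int(response.headers.get("Content-Length", 0))
    if size > GEMINI_INLINE_BYTE_LIMIT:
        raise ValueError(f"Audio is {size} bytes, over the {GEMINI_INLINE_BYTE_LIMIT}-byte limit")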
Since we can treat the audio like any other text context, we can create a transcription simply by inserting the audio into the prompt and asking for a transcription:
import os

from google.generativeai import configure

from mirascope.core import gemini, prompt_template

configure(api_key=os.environ["GOOGLE_API_KEY"])

apollo_url = "https://storage.googleapis.com/generativeai-downloads/data/Apollo-11_Day-01-Highlights-10s.mp3"


@gemini.call(model="gemini-1.5-flash")
@prompt_template(
    """
    Transcribe the content of this speech:
    {url:audio}
    """
)
def transcribe_speech_from_url(url: str): ...


response = transcribe_speech_from_url(apollo_url)
print(response)
10 9 8 We have a go for main engine start. We have a main engine start.
Tagging audio¶
We can start by creating a Pydantic model describing the attributes we want to analyze:
from typing import Literal

from pydantic import BaseModel, Field


class AudioTag(BaseModel):
    audio_quality: Literal["Low", "Medium", "High"] = Field(
        ...,
        description="""The quality of the audio file.
        Low - unlistenable due to severe static, distortion, or other imperfections
        Medium - audible but with noticeable imperfections
        High - crystal clear sound""",
    )
    imperfections: list[str] = Field(
        ...,
        description="""A list of the imperfections affecting audio quality, if any.
        Common imperfections are static, distortion, background noise, and echo, but
        include all that apply, even if not listed here""",
    )
    description: str = Field(
        ..., description="A one sentence description of the audio content"
    )
    primary_sound: str = Field(
        ...,
        description="""A quick description of the main sound in the audio,
        e.g. `Male Voice`, `Cymbals`, `Rainfall`""",
    )
Now we make our call, passing our AudioTag into the response_model field:
@gemini.call(model="gemini-1.5-flash", response_model=AudioTag, json_mode=True)
@prompt_template(
    """
    Analyze this audio file
    {url:audio}

    Give me its audio quality (low, medium, high), a list of its audio flaws (if any),
    a quick description of the content of the audio, and the primary sound in the audio.
    Use the tool call passed into the API call to fill it out.
    """
)
def analyze_audio(url: str): ...
response = analyze_audio(apollo_url)
print(response)
audio_quality='Medium' imperfections=['Background noise'] description='A countdown from ten with a male voice announcing "We have a go for main engine start"' primary_sound='Male Voice'
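Because we passed response_model=AudioTag, the call returns a validated AudioTag instance rather than a raw string, so each tagged attribute is a regular Pydantic field:

assert isinstance(response, AudioTag)
print(response.audio_quality)  # "Medium" for the Apollo clip above
print(response.imperfections)  # ["Background noise"]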
Speaker Diarization¶
Now let's look at an audio file with multiple speakers. For the purposes of this recipe, we grabbed a snippet from a Creative Commons video (https://www.youtube.com/watch?v=v0l-u0ZUOSI), around 1:15 into the video, and give Gemini the audio file.
with open("YOUR_MP3_HERE", "rb") as file:
    data = file.read()


@gemini.call(model="gemini-1.5-flash")
@prompt_template(
    """
    Transcribe the content of this speech adding speaker tags
    for example:
    Person 1: hello
    Person 2: good morning

    {data:audio}
    """
)
def transcribe_speech_from_file(data: bytes): ...


response = transcribe_speech_from_file(data)
print(response)
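If you want structured turns rather than a raw tagged transcript, you can post-process the response. Below is a minimal sketch; the parse_speaker_turns helper is hypothetical and assumes the model followed the Person N: format requested in the prompt:

import re


def parse_speaker_turns(transcript: str) -> list[tuple[str, str]]:
    # Hypothetical helper: split "Person 1: hello" lines into (speaker, text) pairs
    turns = []
    for line in transcript.splitlines():
        match = re.match(r"(Person \d+):\s*(.+)", line.strip())
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


turns = parse_speaker_turns(response.content)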
Additional Real-World Examples¶
- Subtitles and Closed Captions: Automatically generate subtitles in the same or a different language for accessibility.
- Meetings: Transcribe meetings for future reference or summarization.
- Voice Assistant: Transcription is the first step to answering voice requests.
When adapting this recipe to your specific use-case, consider the following:
- Split your audio file into multiple chunks and run the transcription in parallel (see the sketch below).
- Compare results with traditional machine learning techniques.
- Experiment with the prompt by giving it some context before asking to transcribe the audio.
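For the first point, Mirascope calls also work with async functions, so chunked transcription can run concurrently. Below is a minimal sketch; it assumes you have already split the audio into a list of byte chunks (e.g., with an audio library such as pydub, not shown):

import asyncio


@gemini.call(model="gemini-1.5-flash")
@prompt_template(
    """
    Transcribe the content of this speech:
    {chunk:audio}
    """
)
async def transcribe_chunk(chunk: bytes): ...


async def transcribe_chunks(chunks: list[bytes]) -> str:
    # Transcribe every chunk concurrently, then stitch the results back in order
    responses = await asyncio.gather(*(transcribe_chunk(chunk) for chunk in chunks))
    return " ".join(response.content for response in responses)


# transcript = asyncio.run(transcribe_chunks(chunks))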