September 23, 2025
Source Code: Get_started_LiveAPI_NativeAudio.py at the end!
Summary
With this Live API update:
- Improved reliability of function calls
- A more human-like conversation experience
- "Thinking" support for handling complex questions
Voice agent development has entered a new stage. Google commented that "this opens up new possibilities for intuitive and powerful voice experiences," and further updates are planned for the future.
Source Code: Get_started_LiveAPI_NativeAudio.py
# -*- coding: utf-8 -*-
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This script uses Gemini's native audio model and Live API to
send microphone audio in real time and play back the response audio as is.
Important: Be sure to use headphones. No echo cancellation is performed.
If the output from the speakers is fed back into the microphone,
the model will pick up its own voice, causing interruptions and feedback.
[Setup]
1) macOS: brew install portaudio
2) Python dependencies: pip install -U google-genai pyaudio
3) If Python is older than 3.11: pip install taskgroup exceptiongroup
[API Key]
Set the API key in the environment variables.
- Recommended: export GOOGLE_API_KEY=your key
- Compatible: export GEMINI_API_KEY=your key (this script will automatically pick it up)
[Execution]
python Get_started_LiveAPI_NativeAudio.py
After starting, usage instructions and cautions are printed. Press Ctrl+C to exit.
"""
from __future__ import annotations
import os
import sys
import asyncio
import traceback
from typing import Optional
try:
    import pyaudio  # type: ignore
except Exception:  # pragma: no cover
    print(
        "[Error] PyAudio could not be imported.\n"
        "- macOS: brew install portaudio\n"
        "- Afterwards: pip install -U pyaudio\n"
        "- Or use a prebuilt wheel for Python.",
        file=sys.stderr,
    )
    raise
from google import genai
from google.genai import types
# Polyfill TaskGroup / ExceptionGroup if Python is older than 3.11
if sys.version_info < (3, 11, 0):
    try:
        import taskgroup, exceptiongroup  # type: ignore
    except Exception:  # pragma: no cover
        print(
            "[Error] Python is older than 3.11 and 'taskgroup' / 'exceptiongroup' could not be found.\n"
            "Run pip install taskgroup exceptiongroup.",
            file=sys.stderr,
        )
        raise
    asyncio.TaskGroup = taskgroup.TaskGroup  # type: ignore[attr-defined]
    asyncio.ExceptionGroup = exceptiongroup.ExceptionGroup  # type: ignore[attr-defined]
elif not hasattr(asyncio, "ExceptionGroup"):
    # On 3.11+, ExceptionGroup is a builtin; expose it on asyncio so the
    # `except asyncio.ExceptionGroup` clause below works on every version
    asyncio.ExceptionGroup = ExceptionGroup  # type: ignore[attr-defined]
FORMAT = pyaudio.paInt16
CHANNELS = 1
SEND_SAMPLE_RATE = 16000 # 16kHz PCM for sending
RECEIVE_SAMPLE_RATE = 24000 # 24kHz PCM for response
# Frame length in 20ms units, which is stable in actual operation
# 20ms at 16kHz = 320 frames, 20ms at 24kHz = 480 frames
SEND_CHUNK_FRAMES = 320
RECV_CHUNK_FRAMES = 480
# Jitter buffer (in milliseconds) to accumulate before playback.
# Absorbs network latency and intermittent reception.
PREBUFFER_MS = 120
pya = pyaudio.PyAudio()
def _build_client() -> genai.Client:
    # Supports both environment variables (GOOGLE_API_KEY takes precedence)
    api_key = os.environ.get("GOOGLE_API_KEY") or os.environ.get("GEMINI_API_KEY")
    if api_key:
        # Example of using preview features in v1alpha
        return genai.Client(api_key=api_key, http_options={"api_version": "v1alpha"})
    # Fall back to ambient credentials if no key is set (e.g. ADC)
    return genai.Client(http_options={"api_version": "v1alpha"})
SYSTEM_INSTRUCTION = """
You are a helpful and friendly AI assistant.
Your default tone is helpful, engaging, and clear, with a touch of optimistic wit.
Anticipate user needs by clarifying ambiguous questions and always conclude your responses
with an engaging follow-up question to keep the conversation flowing.
""".strip()
MODEL = "gemini-2.5-flash-native-audio-preview-09-2025"
BASE_CONFIG = {
"system_instruction": SYSTEM_INSTRUCTION,
"response_modalities": ["AUDIO"],
# Enable proactive audio for preview (may be ignored in some environments)
"proactivity": {"proactive_audio": True},
}
class AudioLoop:
    def __init__(self):
        self.audio_in_queue: Optional[asyncio.Queue[bytes]] = None
        self.out_queue: Optional[asyncio.Queue[types.Blob]] = None
        self.session = None
        self.audio_stream = None

    async def listen_audio(self, input_device_index: Optional[int] = None):
        """Captures PCM from the microphone at 16kHz/16-bit/mono and puts it into the sending queue"""
        mic_info = (
            pya.get_device_info_by_index(input_device_index)
            if input_device_index is not None
            else pya.get_default_input_device_info()
        )
        stream = await asyncio.to_thread(
            pya.open,
            format=FORMAT,
            channels=CHANNELS,
            rate=SEND_SAMPLE_RATE,
            input=True,
            input_device_index=mic_info["index"],
            frames_per_buffer=SEND_CHUNK_FRAMES,
        )
        self.audio_stream = stream
        print(
            f"[Recording Started] Device: {mic_info.get('name')} / {SEND_SAMPLE_RATE}Hz / mono / 16-bit",
            flush=True,
        )
        # Avoid overflow exceptions and continue during debugging
        kwargs = {"exception_on_overflow": False} if __debug__ else {}
        while True:
            data: bytes = await asyncio.to_thread(stream.read, SEND_CHUNK_FRAMES, **kwargs)
            # Send as the Blob type recommended by the library (MIME type carries the sample rate)
            blob = types.Blob(data=data, mime_type="audio/pcm;rate=16000")
            assert self.out_queue is not None
            await self.out_queue.put(blob)

    async def send_realtime(self):
        assert self.out_queue is not None
        while True:
            blob = await self.out_queue.get()
            await self.session.send_realtime_input(audio=blob)

    async def receive_audio(self):
        """Receives responses (PCM chunks/text) from the server and puts them into the playback queue"""
        assert self.audio_in_queue is not None
        while True:
            # Receive events for one turn
            turn = self.session.receive()
            async for response in turn:
                if response.data is not None:
                    self.audio_in_queue.put_nowait(response.data)
                    continue
                # Safely retrieve text in case it arrives
                text = getattr(response, "text", None)
                if text:
                    print(text, end="", flush=True)
            # Previously, unplayed audio was discarded at the end of the turn,
            # but this can cause unnecessary sound interruptions, so it is not discarded by default.
            # (Add an implementation here if immediate stopping on interruption is required)

    async def play_audio(self, output_device_index: Optional[int] = None):
        """Plays back responses (24kHz/16-bit/mono PCM) through the speakers"""
        out_info = (
            pya.get_device_info_by_index(output_device_index)
            if output_device_index is not None
            else pya.get_default_output_device_info()
        )
        stream = await asyncio.to_thread(
            pya.open,
            format=FORMAT,
            channels=CHANNELS,
            rate=RECEIVE_SAMPLE_RATE,
            output=True,
            frames_per_buffer=RECV_CHUNK_FRAMES,
            output_device_index=out_info["index"],
        )
        print(
            f"[Playback Started] Device: {out_info.get('name')} / {RECEIVE_SAMPLE_RATE}Hz / mono / 16-bit",
            flush=True,
        )
        # Accumulate a little before starting playback (jitter absorption)
        assert self.audio_in_queue is not None
        bytes_per_sec = RECEIVE_SAMPLE_RATE * 2  # 16-bit mono
        prebuffer_target = max(0, int(bytes_per_sec * (PREBUFFER_MS / 1000.0)))
        prebuf = bytearray()
        while len(prebuf) < prebuffer_target:
            prebuf.extend(await self.audio_in_queue.get())
        if prebuf:
            await asyncio.to_thread(stream.write, bytes(prebuf))
        # Normal playback loop
        while True:
            bytestream = await self.audio_in_queue.get()
            await asyncio.to_thread(stream.write, bytestream)

    async def run(self, model: str, config: dict, *, in_dev: Optional[int], out_dev: Optional[int]):
        client = _build_client()
        print("\n=== Execution Guide ===")
        print("1) Please use headphones (to prevent echo)")
        print("2) Speak into the microphone")
        print("3) You will hear the model's response (interruptible)")
        print("4) Ctrl+C to exit\n")
        print(f"Model: {model}")
        print("Proactive audio:", config.get("proactivity"))
        if "speech_config" in config:
            print("Speech config:", config.get("speech_config"))
        print("")
        try:
            async with (
                client.aio.live.connect(model=model, config=config) as session,
                asyncio.TaskGroup() as tg,
            ):
                self.session = session
                # Give the receiving queue some headroom to ease sender congestion
                self.audio_in_queue = asyncio.Queue(maxsize=200)
                self.out_queue = asyncio.Queue(maxsize=20)
                tg.create_task(self.send_realtime())
                tg.create_task(self.listen_audio(input_device_index=in_dev))
                tg.create_task(self.receive_audio())
                tg.create_task(self.play_audio(output_device_index=out_dev))
        except KeyboardInterrupt:
            print("\n[Exit] Interrupted by user.")
        except asyncio.CancelledError:
            pass
        except asyncio.ExceptionGroup as eg:  # type: ignore[attr-defined]
            traceback.print_exception(eg)
        finally:
            # Ensure streams and PyAudio are released
            try:
                if self.audio_stream is not None:
                    self.audio_stream.close()
            finally:
                pya.terminate()
def _list_devices() -> None:
    print("\n=== Audio Device List (index / name) ===")
    for i in range(pya.get_device_count()):
        info = pya.get_device_info_by_index(i)
        name = info.get("name")
        max_in = int(info.get("maxInputChannels", 0))
        max_out = int(info.get("maxOutputChannels", 0))
        default_sr = int(info.get("defaultSampleRate", 0))
        print(f"[{i:02d}] {name} (in:{max_in} / out:{max_out} / {default_sr}Hz)")
    print("")
def main(argv: list[str] | None = None) -> int:
    import argparse

    parser = argparse.ArgumentParser(description="Gemini Live API (Native Audio) Simple Execution Script")
    parser.add_argument("--model", default=MODEL, help="Model name to use")
    parser.add_argument("--no-proactivity", action="store_true", help="Disable proactive audio")
    parser.add_argument("--list-devices", action="store_true", help="List available audio devices and exit")
    parser.add_argument("--in-dev", type=int, help="Input device index (defaults to the system default)")
    parser.add_argument("--out-dev", type=int, help="Output device index (defaults to the system default)")
    # Voice selection options, effective only for semi-cascade (TTS) models
    parser.add_argument("--tts-voice", help="TTS voice_name (semi-cascade models only), e.g. ja-JP-Standard-A")
    parser.add_argument("--tts-rate", type=float, help="TTS speaking_rate (e.g. 0.9-1.1)")
    parser.add_argument("--tts-pitch", type=float, help="TTS pitch (in semitones, e.g. -2.0 to +2.0)")
    args = parser.parse_args(argv)
    if args.list_devices:
        _list_devices()
        return 0
    cfg = dict(BASE_CONFIG)
    if args.no_proactivity:
        cfg.pop("proactivity", None)
    # Specify TTS voice settings for semi-cascade models (e.g. gemini-live-2.5-flash-preview, gemini-2.0-flash-live-001)
    model_lower = (args.model or "").lower()
    is_half_cascade = ("live-" in model_lower) or model_lower.endswith("-live-001")
    if is_half_cascade and (args.tts_voice or args.tts_rate is not None or args.tts_pitch is not None):
        speech_cfg = {}
        if args.tts_voice:
            speech_cfg["voice_name"] = args.tts_voice
        if args.tts_rate is not None:
            speech_cfg["speaking_rate"] = args.tts_rate
        if args.tts_pitch is not None:
            speech_cfg["pitch"] = args.tts_pitch
        if speech_cfg:
            cfg["speech_config"] = speech_cfg
    elif not is_half_cascade and (args.tts_voice or args.tts_rate is not None or args.tts_pitch is not None):
        print(
            "[Note] The specified --tts-* options may be ignored by native audio models.\n"
            "       To select a voice, specify a semi-cascade model (e.g. gemini-live-2.5-flash-preview) with --model.",
            file=sys.stderr,
        )
    loop = AudioLoop()
    asyncio.run(loop.run(args.model, cfg, in_dev=args.in_dev, out_dev=args.out_dev))
    return 0
if __name__ == "__main__":
    raise SystemExit(main())
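As a sanity check, the 20 ms chunk sizes hard-coded in the script (SEND_CHUNK_FRAMES = 320, RECV_CHUNK_FRAMES = 480) follow directly from the sample rates. A minimal sketch of that arithmetic, including the 40 ms variants mentioned in the tuning notes:

```python
def frames_for_chunk(sample_rate_hz: int, chunk_ms: int) -> int:
    """Number of PCM frames in a chunk of the given duration (mono)."""
    return sample_rate_hz * chunk_ms // 1000

# 20 ms chunks, matching SEND_CHUNK_FRAMES / RECV_CHUNK_FRAMES in the script
print(frames_for_chunk(16000, 20))  # 320
print(frames_for_chunk(24000, 20))  # 480

# 40 ms variants suggested under "Additional tuning" below
print(frames_for_chunk(16000, 40))  # 640
print(frames_for_chunk(24000, 40))  # 960
```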
Recommended Python 3.12 environment
brew install portaudio
python3 -m pip install --upgrade pip setuptools wheel
- Create a new 3.12 venv: /opt/homebrew/bin/python3.12 -m venv .venv
- Activate it: source .venv/bin/activate
- Verify: python -V (should display 3.12.x)
- Update tooling: python -m pip install -U pip setuptools wheel
- Install dependencies: python -m pip install -U google-genai pyaudio
- A Gemini API key is required
- Verify the imports: python -c "from google import genai; import pyaudio; print('OK')"
- Run: python Get_started_LiveAPI_NativeAudio.py --list-devices
python Get_started_LiveAPI_NativeAudio.py --list-devices
=== Audio Device List (index / name) ===
[00] AirPods Pro A3047 (in:1 / out:0 / 24000Hz)
[01] AirPods Pro A3047 (in:0 / out:2 / 48000Hz)
[02] MacBook Air microphone (in:1 / out:0 / 48000Hz)
[03] MacBook Air speakers (in:0 / out:2 / 48000Hz)
[04] iP16PM microphone (in:1 / out:0 / 48000Hz)
python Get_started_LiveAPI_NativeAudio.py --in-dev 2 --out-dev 1
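Rather than reading indices off the list by hand, a small helper can resolve a device by name substring. This is an illustrative sketch, not part of the script: `pick_device_index` is a hypothetical helper, and the device entries below are modeled on the `--list-devices` output above (in a real run they would come from PyAudio's `get_device_info_by_index`).

```python
from typing import Optional

def pick_device_index(devices: list, name_part: str, need_input: bool = True) -> Optional[int]:
    """Return the index of the first device whose name contains name_part
    and that has the required input (or output) channels, else None."""
    key = "maxInputChannels" if need_input else "maxOutputChannels"
    for info in devices:
        if name_part.lower() in info["name"].lower() and info.get(key, 0) > 0:
            return info["index"]
    return None

# Sample entries mirroring the --list-devices output above
devices = [
    {"index": 1, "name": "AirPods Pro A3047", "maxInputChannels": 0, "maxOutputChannels": 2},
    {"index": 2, "name": "MacBook Air microphone", "maxInputChannels": 1, "maxOutputChannels": 0},
]
print(pick_device_index(devices, "microphone"))                 # 2 (input device)
print(pick_device_index(devices, "AirPods", need_input=False))  # 1 (output device)
```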
Additional tuning if the effect is insufficient
- Increase the pre-buffer: raise PREBUFFER_MS to 200-300 (latency increases, but so does stability)
- Increase the frame length: SEND_CHUNK_FRAMES = 640 (40ms), RECV_CHUNK_FRAMES = 960, etc.
- Disable proactive audio with --no-proactivity (suppresses conversational interruption behavior)
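The pre-buffer target in bytes scales linearly with PREBUFFER_MS. A quick sketch of the same arithmetic the script's play_audio performs (24 kHz, 16-bit, mono), showing the default and the suggested upper bound:

```python
def prebuffer_bytes(sample_rate_hz: int, prebuffer_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes of mono PCM to accumulate before playback starts."""
    bytes_per_sec = sample_rate_hz * bytes_per_sample
    return int(bytes_per_sec * prebuffer_ms / 1000)

print(prebuffer_bytes(24000, 120))  # 5760  (the script's default PREBUFFER_MS)
print(prebuffer_bytes(24000, 300))  # 14400 (upper end of the suggested 200-300 ms)
```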