Gemini Live API Update: Build Powerful Conversational AI Easily!

AICU media

Building More Powerful Voice Agents - Gemini Live API Latest Update

Google DeepMind product managers Ivan Solovyev and Valeria Wu and engineer Mingqiu Wang announced a significant update to the Gemini API's Live API. A new native audio model is now available in preview, letting you build more reliable, responsive, and natural conversational voice agents.

This release focuses on the following two main points:

  • More Robust Function Calling: Ensuring reliable connection to external data and services.

  • More Natural Conversation Experience: Intuitively responding to interruptions, pauses, and side conversations (small talk).

Significant Improvement in Reliability

Voice agents provide the most powerful and engaging experiences when they can reliably connect to external data and services. Users can retrieve information in real time, book appointments, and complete transactions. The accuracy of function calls becomes crucial here. Since voice interactions happen in real time, there's no room to retry failed requests.

The new model has significantly improved the following:

  • Accuracy in recognizing the correct function to call

  • Judgment in avoiding unnecessary function calls

  • Consistent adherence to the provided tool schema

According to internal benchmarks, function-calling accuracy has improved significantly even in complex scenarios (10 or more tools active at once). Compared with the previous version, the success rate for single calls has doubled, and the success rate in tests involving 5-10 calls has improved by 1.5x.

This is a major step forward for voice applications, and we plan to further improve reliability, especially in multi-turn scenarios, based on feedback from developers.

You can try out this improved function call accuracy on Google AI Studio.
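As a rough sketch of what tool use looks like in a Live API session config: the function name `get_weather`, its schema, and the exact config keys below are illustrative assumptions modeled on the google-genai Python SDK's dict-style config (the same style the script at the end of this article uses), not a verbatim API reference.

```python
# Illustrative only: a hypothetical get_weather tool declared for a
# Live API session. The schema follows the JSON-style function
# declaration format; verify key names against the current SDK docs.
GET_WEATHER = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Tokyo"},
        },
        "required": ["city"],
    },
}

LIVE_CONFIG = {
    "response_modalities": ["AUDIO"],
    # Tools are grouped under function_declarations, mirroring the REST schema.
    "tools": [{"function_declarations": [GET_WEATHER]}],
}
```

With a config like this, the model's job during a live turn is to pick the right declaration, fill its arguments from speech, and stick to the schema — exactly the three behaviors the update targets.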

"Google's Live API is incredibly high quality!" https://t.co/Osgy990R9n pic.twitter.com/aWVqpg6Te2

— Dr. (Shirai) Hakase (@o_ob)

Towards More Natural Conversations

In this update, proactive audio features have been added to make voice interactions more human-like.

The new model enables the following:

  • Ignoring irrelevant small talk

  • Understanding and smoothly resuming from user's natural silences and interruptions

For example, if someone else enters the room and asks the user a question mid-conversation, the voice agent pauses and ignores the side conversation; when the user returns, the conversation resumes naturally.

Additionally, it better understands the rhythm and context of speech, appropriately handling silences when users are thinking about complex content and during casual conversations. As a result, the number of cases where the conversation is interrupted incorrectly has been significantly reduced.

Furthermore, the accuracy of interruption detection has also been improved, significantly reducing the frequency of missing user interruptions.
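In the sample script later in this article, proactive audio is enabled with a single config entry. A minimal sketch of toggling it, mirroring the script's `--no-proactivity` flag, looks like this:

```python
def build_live_config(proactive: bool = True) -> dict:
    """Minimal Live API config sketch.

    Pops the 'proactivity' key to opt out, mirroring the
    --no-proactivity flag in the script at the end of this article.
    """
    cfg = {
        "response_modalities": ["AUDIO"],
        "proactivity": {"proactive_audio": True},
    }
    if not proactive:
        cfg.pop("proactivity", None)
    return cfg
```

This is a preview feature, so the behavior (and whether the key is honored at all) may vary by environment.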

"Thinking" Function for Smarter Responses

Following this release, the "Thinking" function introduced in Gemini 2.5 Flash and Pro will also be available in the Live API starting next week.

Not all questions are suitable for immediate answers. You can set a few seconds of "thinking time" for complex questions to allow for deeper reasoning. This process also introduces a mechanism for the model to return a summary of its thinking in text.
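The exact Live API keys for the "Thinking" function were not shown at announcement time. As an assumption, the sketch below borrows the ThinkingConfig shape (`thinking_budget`, `include_thoughts`) from the regular Gemini 2.5 generate_content API; whether the Live API accepts the same shape should be checked against the current documentation.

```python
# Hypothetical sketch: the "thinking_config" key and its fields are
# borrowed from the non-live Gemini 2.5 API and may differ in the Live API.
THINKING_LIVE_CONFIG = {
    "response_modalities": ["AUDIO"],
    "thinking_config": {
        "thinking_budget": 1024,   # rough cap on internal reasoning tokens
        "include_thoughts": True,  # request a text summary of the thinking
    },
}
```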

Real-World Use Case: Ava Becoming the "COO" of the Household

We have been testing the latest API features through collaboration with early access partners. Many have reported positive results.

For example, AI-powered family OS Ava uses the Live API to function as the "COO" of the household, processing complex inputs such as emails, PDFs, and voice memos from schools and converting them into calendars and tasks.

Co-founder and CTO Joe Alicata says:

"Natural two-way voice chat was essential. The improved function call accuracy of the new model was a decisive turning point. Achieving high primary accuracy even with noisy input and not having to rely on fragile prompt workarounds has allowed our small team to rapidly develop reliable multimodal products."

Experimentation by AICU AIDX Lab

You can start using the Live API immediately from Google AI Studio. End-to-end code samples are also available in the Cookbook.

What follows is AICU AIDX Lab's commentary on, and improvements to, the official Python sample code, adapted for a Mac Python environment.

It worked in Python! pic.twitter.com/Xm5u8OeME4

— Dr. (Shirai) Hakase (@o_ob)

Source code: the full Get_started_LiveAPI_NativeAudio.py listing is at the end of this article!

Summary

With this Live API update:

  • More reliable function calling

  • A more human-like conversation experience

  • A "Thinking" function for handling complex questions

Voice agent development has entered a new stage. Google commented that "this opens up new possibilities for intuitive and powerful voice experiences," and further updates are planned for the future.

Source Code: Get_started_LiveAPI_NativeAudio.py

# -*- coding: utf-8 -*-
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This script uses Gemini's native audio model and Live API to
send microphone audio in real time and play back the response audio as is.

Important: Be sure to use headphones. No echo cancellation is performed.
If the output from the speakers is fed back into the microphone,
the model will pick up its own voice, causing interruptions and feedback.

[Setup]
  1) macOS: brew install portaudio
  2) Python dependencies: pip install -U google-genai pyaudio
  3) If Python is older than 3.11: pip install taskgroup exceptiongroup

[API Key]
  Set the API key in the environment variables.
  - Recommended: export GOOGLE_API_KEY=your key
  - Compatible: export GEMINI_API_KEY=your key (this script will automatically pick it up)

[Execution]
  python Get_started_LiveAPI_NativeAudio.py

After starting, usage instructions and precautions are printed to the console. Ctrl+C to exit.
"""

from __future__ import annotations

import os
import sys
import asyncio
import traceback
from typing import Optional

try:
    import pyaudio  # type: ignore
except Exception:  # pragma: no cover
    print(
        "[Error] PyAudio could not be imported.\n"
        "- macOS: brew install portaudio\n"
        "- Afterwards: pip install -U pyaudio\n"
        "- Or use a prebuilt wheel for Python.",
        file=sys.stderr,
    )
    raise

from google import genai
from google.genai import types

# Polyfill TaskGroup / ExceptionGroup if Python is older than 3.11
if sys.version_info < (3, 11, 0):
    try:
        import taskgroup, exceptiongroup  # type: ignore
    except Exception:  # pragma: no cover
        print(
            "[Error] Python is older than 3.11 and 'taskgroup' / 'exceptiongroup' could not be found.\n"
            "Run pip install taskgroup exceptiongroup.",
            file=sys.stderr,
        )
        raise
    asyncio.TaskGroup = taskgroup.TaskGroup  # type: ignore[attr-defined]
    asyncio.ExceptionGroup = exceptiongroup.ExceptionGroup  # type: ignore[attr-defined]


FORMAT = pyaudio.paInt16
CHANNELS = 1
SEND_SAMPLE_RATE = 16000  # 16kHz PCM for sending
RECEIVE_SAMPLE_RATE = 24000  # 24kHz PCM for response

# Frame length in 20ms units, which is stable in actual operation
# 20ms at 16kHz = 320 frames, 20ms at 24kHz = 480 frames
SEND_CHUNK_FRAMES = 320
RECV_CHUNK_FRAMES = 480

# Jitter buffer (in milliseconds) to accumulate before playback.
# Absorbs network latency and intermittent reception.
PREBUFFER_MS = 120

pya = pyaudio.PyAudio()


def _build_client() -> genai.Client:
    # Supports both environment variables (GOOGLE_API_KEY takes precedence)
    api_key = os.environ.get("GOOGLE_API_KEY") or os.environ.get("GEMINI_API_KEY")
    if api_key:
        # Example of using preview features in v1alpha
        return genai.Client(api_key=api_key, http_options={"api_version": "v1alpha"})
    # Use environment even if not set (e.g. ADC)
    return genai.Client(http_options={"api_version": "v1alpha"})


SYSTEM_INSTRUCTION = """
You are a helpful and friendly AI assistant. Your default tone is helpful,
engaging, and clear, with a touch of optimistic wit. Anticipate user needs by
clarifying ambiguous questions and always conclude your responses with an
engaging follow-up question to keep the conversation flowing.
""".strip()

MODEL = "gemini-2.5-flash-native-audio-preview-09-2025"
BASE_CONFIG = {
    "system_instruction": SYSTEM_INSTRUCTION,
    "response_modalities": ["AUDIO"],
    # Enable proactive audio for preview (may be ignored in some environments)
    "proactivity": {"proactive_audio": True},
}


class AudioLoop:
    def __init__(self):
        self.audio_in_queue: Optional[asyncio.Queue[bytes]] = None
        self.out_queue: Optional[asyncio.Queue[types.Blob]] = None
        self.session = None
        self.audio_stream = None

    async def listen_audio(self, input_device_index: Optional[int] = None):
        """Captures PCM from microphone at 16kHz/16bit/mono and puts it into the sending queue"""
        mic_info = (
            pya.get_device_info_by_index(input_device_index)
            if input_device_index is not None
            else pya.get_default_input_device_info()
        )

        stream = await asyncio.to_thread(
            pya.open,
            format=FORMAT,
            channels=CHANNELS,
            rate=SEND_SAMPLE_RATE,
            input=True,
            input_device_index=mic_info["index"],
            frames_per_buffer=SEND_CHUNK_FRAMES,
        )
        self.audio_stream = stream

        print(
            f"[Recording Started] Device: {mic_info.get('name')} / {SEND_SAMPLE_RATE}Hz / mono / 16-bit",
            flush=True,
        )

        # Avoid overflow exceptions and continue during debugging
        kwargs = {"exception_on_overflow": False} if __debug__ else {}
        while True:
            data: bytes = await asyncio.to_thread(stream.read, SEND_CHUNK_FRAMES, **kwargs)
            # Send in Blob type recommended by the library (MIME with sample rate)
            blob = types.Blob(data=data, mime_type="audio/pcm;rate=16000")
            assert self.out_queue is not None
            await self.out_queue.put(blob)

    async def send_realtime(self):
        assert self.out_queue is not None
        while True:
            blob = await self.out_queue.get()
            await self.session.send_realtime_input(audio=blob)

    async def receive_audio(self):
        """Receives responses (PCM chunks/text) from the server and puts them into the playback queue"""
        assert self.audio_in_queue is not None
        while True:
            # Receive events for one turn
            turn = self.session.receive()
            async for response in turn:
                if response.data is not None:
                    self.audio_in_queue.put_nowait(response.data)
                    continue

                # Safely retrieve text in case it arrives
                text = getattr(response, "text", None)
                if text:
                    print(text, end="", flush=True)

            # Previously, unplayed audio was discarded at the end of the turn,
            # but this can cause unnecessary sound interruptions, so it is not discarded by default.
            # (Add implementation if immediate stopping on interruption is required)

    async def play_audio(self, output_device_index: Optional[int] = None):
        """Plays back responses (24kHz/16bit/mono PCM) to the speakers"""
        out_info = (
            pya.get_device_info_by_index(output_device_index)
            if output_device_index is not None
            else pya.get_default_output_device_info()
        )
        stream = await asyncio.to_thread(
            pya.open,
            format=FORMAT,
            channels=CHANNELS,
            rate=RECEIVE_SAMPLE_RATE,
            output=True,
            frames_per_buffer=RECV_CHUNK_FRAMES,
            output_device_index=out_info["index"],
        )

        print(
            f"[Playback Started] Device: {out_info.get('name')} / {RECEIVE_SAMPLE_RATE}Hz / mono / 16-bit",
            flush=True,
        )

        # Accumulate a little before starting playback (jitter absorption)
        assert self.audio_in_queue is not None
        bytes_per_sec = RECEIVE_SAMPLE_RATE * 2  # 16-bit mono
        prebuffer_target = max(0, int(bytes_per_sec * (PREBUFFER_MS / 1000.0)))
        prebuf = bytearray()
        while len(prebuf) < prebuffer_target:
            prebuf.extend(await self.audio_in_queue.get())
        if prebuf:
            await asyncio.to_thread(stream.write, bytes(prebuf))

        # Normal playback loop
        while True:
            bytestream = await self.audio_in_queue.get()
            await asyncio.to_thread(stream.write, bytestream)

    async def run(self, model: str, config: dict, *, in_dev: Optional[int], out_dev: Optional[int]):
        client = _build_client()

        print("\n=== Execution Guide ===")
        print("1) Please use headphones (to prevent echo)")
        print("2) Speak into the microphone")
        print("3) You will hear the model's response (interruptible)")
        print("4) Ctrl+C to exit\n")
        print(f"Model: {model}")
        print("Proactive audio:", config.get("proactivity"))
        if "speech_config" in config:
            print("Speech config:", config.get("speech_config"))
        print("")

        try:
            async with (
                client.aio.live.connect(model=model, config=config) as session,
                asyncio.TaskGroup() as tg,
            ):
                self.session = session

                # Provide some leeway in the receiving queue to ease sender congestion
                self.audio_in_queue = asyncio.Queue(maxsize=200)
                self.out_queue = asyncio.Queue(maxsize=20)

                tg.create_task(self.send_realtime())
                tg.create_task(self.listen_audio(input_device_index=in_dev))
                tg.create_task(self.receive_audio())
                tg.create_task(self.play_audio(output_device_index=out_dev))
        except KeyboardInterrupt:
            print("\n[Exit] Interrupted by user.")
        except asyncio.CancelledError:
            pass
        except Exception as eg:  # an ExceptionGroup from the TaskGroup lands here (asyncio has no ExceptionGroup attribute on 3.11+)
            traceback.print_exception(eg)
        finally:
            # Ensure streams and PyAudio are released
            try:
                if self.audio_stream is not None:
                    self.audio_stream.close()
            finally:
                pya.terminate()


def _list_devices() -> None:
    print("\n=== Audio Device List (index / name) ===")
    for i in range(pya.get_device_count()):
        info = pya.get_device_info_by_index(i)
        name = info.get("name")
        max_in = int(info.get("maxInputChannels", 0))
        max_out = int(info.get("maxOutputChannels", 0))
        default_sr = int(info.get("defaultSampleRate", 0))
        print(f"[{i:02d}] {name}  (in:{max_in} / out:{max_out} / {default_sr}Hz)")
    print("")


def main(argv: list[str] | None = None) -> int:
    import argparse

    parser = argparse.ArgumentParser(description="Gemini Live API (Native Audio) Simple Execution Script")
    parser.add_argument("--model", default=MODEL, help="Model name to use")
    parser.add_argument("--no-proactivity", action="store_true", help="Disable proactive audio")
    parser.add_argument("--list-devices", action="store_true", help="List available audio devices and exit")
    parser.add_argument("--in-dev", type=int, help="Input device index (defaults to default)")
    parser.add_argument("--out-dev", type=int, help="Output device index (defaults to default)")
    # Audio selection options effective in semi-cascade (TTS)
    parser.add_argument("--tts-voice", help="TTS voice_name (only effective for semi-cascade models) e.g. ja-JP-Standard-A")
    parser.add_argument("--tts-rate", type=float, help="TTS speaking_rate (e.g. 0.9~1.1)")
    parser.add_argument("--tts-pitch", type=float, help="TTS pitch (in semitones. e.g. -2.0 ~ +2.0)")
    args = parser.parse_args(argv)

    if args.list_devices:
        _list_devices()
        return 0

    cfg = dict(BASE_CONFIG)
    if args.no_proactivity:
        cfg.pop("proactivity", None)

    # Specify TTS audio in semi-cascade (e.g. gemini-live-2.5-flash-preview, gemini-2.0-flash-live-001)
    model_lower = (args.model or "").lower()
    is_half_cascade = ("live-" in model_lower) or model_lower.endswith("-live-001")
    if is_half_cascade and (args.tts_voice or args.tts_rate is not None or args.tts_pitch is not None):
        speech_cfg = {}
        if args.tts_voice:
            speech_cfg["voice_name"] = args.tts_voice
        if args.tts_rate is not None:
            speech_cfg["speaking_rate"] = args.tts_rate
        if args.tts_pitch is not None:
            speech_cfg["pitch"] = args.tts_pitch
        if speech_cfg:
            cfg["speech_config"] = speech_cfg
    elif not is_half_cascade and (args.tts_voice or args.tts_rate is not None or args.tts_pitch is not None):
        print(
            "[Note] The specified --tts-* may be ignored in native audio models.\n"
            "        If you want to select the type of voice, specify a semi-cascade model (e.g. gemini-live-2.5-flash-preview) with --model.",
            file=sys.stderr,
        )

    loop = AudioLoop()
    asyncio.run(loop.run(args.model, cfg, in_dev=args.in_dev, out_dev=args.out_dev))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Recommended Python 3.12 environment:

  • Install PortAudio: brew install portaudio

  • Create a 3.12 venv: /opt/homebrew/bin/python3.12 -m venv .venv

  • Activate it: source .venv/bin/activate

  • Verify the version: python -V (displays 3.12.x)

  • Update packaging tools: python -m pip install -U pip setuptools wheel

  • Install dependencies: python -m pip install -U google-genai pyaudio

  • Set the Gemini API key: export GOOGLE_API_KEY=…

  • Verify the imports: python -c "from google import genai; import pyaudio; print('OK')"

  • Execute: python Get_started_LiveAPI_NativeAudio.py --list-devices

python Get_started_LiveAPI_NativeAudio.py --list-devices

=== Audio Device List (index / name) ===
[00] AirPods Pro A3047  (in:1 / out:0 / 24000Hz)
[01] AirPods Pro A3047  (in:0 / out:2 / 48000Hz)
[02] MacBook Air microphone  (in:1 / out:0 / 48000Hz)
[03] MacBook Air speakers  (in:0 / out:2 / 48000Hz)
[04] iP16PM microphone  (in:1 / out:0 / 48000Hz)

python Get_started_LiveAPI_NativeAudio.py --in-dev 2 --out-dev 1

Additional tuning if stability is still insufficient

  • Increase the pre-buffer: raise PREBUFFER_MS to 200-300 (latency increases, but playback becomes more stable)

  • Increase the frame length: e.g. SEND_CHUNK_FRAMES = 640 (40ms), RECV_CHUNK_FRAMES = 960 (40ms)

  • Disable proactive audio (suppresses the conversation-pausing behavior): --no-proactivity
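The frame counts above follow directly from frames = sample_rate × duration_ms / 1000; a tiny helper makes the arithmetic explicit:

```python
def frames_for_ms(sample_rate_hz: int, duration_ms: int) -> int:
    """Number of PCM frames covering duration_ms at the given sample rate."""
    return sample_rate_hz * duration_ms // 1000

# 20 ms chunks used by the script:
assert frames_for_ms(16000, 20) == 320   # SEND_CHUNK_FRAMES
assert frames_for_ms(24000, 20) == 480   # RECV_CHUNK_FRAMES
# 40 ms variants suggested above:
assert frames_for_ms(16000, 40) == 640
assert frames_for_ms(24000, 40) == 960
```

The same formula gives the pre-buffer size in bytes: at 24kHz, 16-bit mono, 120ms works out to 24000 × 2 × 0.12 = 5760 bytes, which is what the script's PREBUFFER_MS logic accumulates before playback.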