Building Real-Time Multimodal Applications with Gemini Live API on Vertex AI
For Chief Technology Officers and Lead Architects, the shift from "request-response" LLM interactions to stateful, bidirectional streaming represents the next frontier in AI engineering. The Gemini Multimodal Live API (powered by Gemini 2.0 Flash) enables low-latency, real-time voice and video interactions that feel genuinely conversational.
Unlike traditional pipelines that chain Speech-to-Text (STT), LLM inference, and Text-to-Speech (TTS)—accruing latency at every hop—Gemini Live handles modality bridging natively. This unifies the context, allowing the model to "see" a live video feed and "hear" interruptions instantly.
This article details the architectural patterns for implementing Gemini Live on Vertex AI, focusing on the WebSocket protocol, audio chunking strategies, and session management required for enterprise-grade applications.
The Architecture: Bidirectional Streaming over WebSockets
The core of the Live API is the BidiGenerateContent method. Unlike standard REST endpoints, this establishes a persistent WebSocket connection. This stateful channel allows:
- Real-time Input: The client streams audio (PCM) and video (JPEG frames) continuously.
- Server Events: The server pushes audio chunks (response), text transcripts, and tool calls asynchronously.
- Barge-in: If the user speaks while the model is outputting audio, the server detects this (Voice Activity Detection) and sends an interrupted signal, allowing the client to halt playback immediately.
Protocol Flow:
- Handshake: Authenticate via OAuth 2.0.
- Setup: Send initial configuration (Model version, System Instructions, Voice settings).
- Session Loop: Asynchronously send realtime_input (media chunks) and receive server_content. A raw-WebSocket sketch of this handshake and setup follows below.
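If you do manage the socket yourself, the flow above maps onto a short script like the one below. The endpoint path, header handling, and message field names follow the public BidiGenerateContent documentation at the time of writing and should be treated as an illustrative sketch rather than a guaranteed contract; the SDK example later in this article is the recommended path.

import asyncio
import json

import websockets  # pip install websockets
from google.auth import default
from google.auth.transport.requests import Request

PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
ENDPOINT = (
    f"wss://{LOCATION}-aiplatform.googleapis.com/ws/"
    "google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent"
)

async def open_session():
    # 1. Handshake: obtain an OAuth 2.0 bearer token via Application Default Credentials.
    creds, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
    creds.refresh(Request())
    headers = {"Authorization": f"Bearer {creds.token}"}

    # 'additional_headers' is the websockets>=14 keyword; older releases call it 'extra_headers'.
    async with websockets.connect(ENDPOINT, additional_headers=headers) as ws:
        # 2. Setup: the first message pins the model, modalities, and system instructions.
        await ws.send(json.dumps({
            "setup": {
                "model": f"projects/{PROJECT_ID}/locations/{LOCATION}/"
                         "publishers/google/models/gemini-2.0-flash-exp",
                "generation_config": {"response_modalities": ["AUDIO"]},
            }
        }))
        print(await ws.recv())  # expect a setupComplete acknowledgement

        # 3. Session loop: from here the client streams realtime_input messages and
        #    reads server_content; the SDK example below handles that part.

if __name__ == "__main__":
    asyncio.run(open_session())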
Prerequisites
To implement this, you need a Google Cloud Project with the Vertex AI API enabled.
- Model: gemini-2.0-flash-exp (or the latest enterprise equivalent).
- Region: us-central1 (Live API availability is often region-bound during preview).
- IAM: Ensure your service account has the Vertex AI User role.
Technical Implementation: Python Async Client
While you can manage raw WebSockets, the google-genai SDK abstracts the framing complexity while exposing the necessary control.
Below is a production-ready pattern for a console-based voice agent. This implementation handles the async nature of sending microphone input while simultaneously processing audio output from the model.
Dependencies:
pip install google-genai pyaudio
The Core Logic:
import asyncio
import pyaudio
from google import genai

# Configuration
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
MODEL_ID = "gemini-2.0-flash-exp"

# Audio Settings (Gemini expects 16kHz, 1 channel, 16-bit PCM input and returns 24kHz PCM output)
FORMAT = pyaudio.paInt16
CHANNELS = 1
SEND_RATE = 16000      # microphone capture rate expected by the model
RECEIVE_RATE = 24000   # playback rate of the audio the model returns
CHUNK_SIZE = 1024

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

async def audio_stream_generator(input_stream):
    """Yields audio chunks from the microphone."""
    while True:
        data = await asyncio.to_thread(input_stream.read, CHUNK_SIZE, exception_on_overflow=False)
        yield data

async def run_live_session():
    # Initialize PyAudio
    p = pyaudio.PyAudio()
    mic_stream = p.open(format=FORMAT, channels=CHANNELS, rate=SEND_RATE, input=True, frames_per_buffer=CHUNK_SIZE)
    speaker_stream = p.open(format=FORMAT, channels=CHANNELS, rate=RECEIVE_RATE, output=True)

    print("--- Connecting to Gemini Live ---")

    config = {
        "response_modalities": ["AUDIO"],
        "system_instruction": "You are a senior technical assistant. Be concise and precise."
    }

    try:
        async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
            print("--- Session Active. Speak now. ---")

            # Task 1: Send Audio
            async def send_audio():
                async for chunk in audio_stream_generator(mic_stream):
                    # Tag the raw PCM with its mime type so the server decodes it as audio.
                    await session.send(input={"data": chunk, "mime_type": "audio/pcm"}, end_of_turn=False)

            # Task 2: Receive Audio
            async def receive_audio():
                async for response in session.receive():
                    # Handle Text/Audio parts
                    if response.server_content:
                        model_turn = response.server_content.model_turn
                        if model_turn:
                            for part in model_turn.parts:
                                if part.inline_data:  # Audio data
                                    await asyncio.to_thread(speaker_stream.write, part.inline_data.data)

                    # Handle Interruption (Barge-in)
                    if response.server_content and response.server_content.interrupted:
                        print("\n[Interrupted] Stopping playback...")
                        # In a real GUI app, you would clear the audio buffer here.

            # Run both tasks concurrently
            await asyncio.gather(send_audio(), receive_audio())
    except Exception as e:
        print(f"Session Error: {e}")
    finally:
        mic_stream.stop_stream()
        mic_stream.close()
        speaker_stream.stop_stream()
        speaker_stream.close()
        p.terminate()

if __name__ == "__main__":
    asyncio.run(run_live_session())
Key Architectural Decisions in Code
- Concurrency (asyncio.gather): The input and output streams must be handled independently. Blocking on microphone input would prevent the client from processing the model's audio response, destroying the real-time effect.
- Modality Configuration: We explicitly set response_modalities=["AUDIO"]. This tells Gemini to generate raw PCM audio directly rather than text that the client must then synthesize.
- Barge-in Handling: The interrupted flag in server_content is critical. When the model detects user speech during its own output, it stops generating. Your client must listen for this flag and purge its local audio buffer immediately, or the user will hear "ghost" audio finishing a sentence after they interrupted. A minimal purge sketch follows this list.
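To make that purge concrete, here is a minimal sketch of an interruptible playback path: decoded audio parts are queued rather than written to the speaker directly, a dedicated task drains the queue into the speaker, and the interrupted signal simply empties the queue. The queue-based structure is an illustrative pattern layered on the earlier example, not something the SDK requires.

import asyncio

playback_queue: asyncio.Queue[bytes] = asyncio.Queue()

async def speaker_task(speaker_stream):
    """Continuously pulls audio chunks off the queue and writes them to the speaker."""
    while True:
        chunk = await playback_queue.get()
        await asyncio.to_thread(speaker_stream.write, chunk)

def purge_playback() -> None:
    """Drop any audio the model generated before it was interrupted."""
    while not playback_queue.empty():
        playback_queue.get_nowait()

# Inside receive_audio() from the earlier example:
#   if part.inline_data:
#       playback_queue.put_nowait(part.inline_data.data)   # enqueue instead of writing directly
#   if response.server_content and response.server_content.interrupted:
#       purge_playback()                                    # user barged in: discard stale audio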
Optimizing for Enterprise Contexts
When scaling AI engineering services for enterprises on top of this API, consider these three factors:
1. Tool Calling (Function Invocation)
Gemini Live supports real-time tool calling. You can define tools (e.g., check_inventory, query_crm) in the setup configuration.
- Flow: User speaks a command -> model pauses audio -> server sends a tool_call event -> client executes the code -> client sends a tool_response -> model resumes audio with the answer (see the sketch after this list).
- Latency Tip: Ensure your backend tool executions are highly optimized (e.g., Redis lookups rather than slow SQL queries) to maintain the conversational illusion.
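As a rough illustration of that loop with the google-genai SDK, the sketch below registers a hypothetical check_inventory function and answers the resulting tool_call events. The lookup_inventory helper is invented for the example, and the LiveClientToolResponse / FunctionResponse type names reflect the SDK version current when this was written, so verify them against the release you pin.

from google.genai import types

# Declare the tool in the session config (alongside response_modalities, etc.).
CONFIG_WITH_TOOLS = {
    "response_modalities": ["AUDIO"],
    "tools": [{
        "function_declarations": [{
            "name": "check_inventory",  # hypothetical backend function
            "description": "Returns the stock level for a SKU.",
            "parameters": {
                "type": "OBJECT",
                "properties": {"sku": {"type": "STRING"}},
                "required": ["sku"],
            },
        }]
    }],
}

async def lookup_inventory(sku: str) -> int:
    """Hypothetical fast backend lookup (e.g., a Redis GET)."""
    return 42  # placeholder value for the sketch

async def handle_tool_calls(session, response):
    """Execute requested functions and stream the results back to the model."""
    if not response.tool_call:
        return
    results = []
    for call in response.tool_call.function_calls:
        if call.name == "check_inventory":
            stock = await lookup_inventory(call.args["sku"])
            results.append(types.FunctionResponse(id=call.id, name=call.name,
                                                  response={"stock": stock}))
    await session.send(input=types.LiveClientToolResponse(function_responses=results))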
2. Network Stability & Reconnection
WebSockets are fragile in mobile environments. Implementing a "Session Resumption" strategy is vital. Although the current API does not support resuming a session context exactly where it dropped (state is ephemeral), your client should cache the conversation history (turns) and re-inject them as client_content context upon reconnection.
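A hedged sketch of that strategy is shown below: it wraps the session in a reconnect loop with exponential backoff and, after each reconnect, replays the cached turns as client_content before live audio resumes. It reuses client, MODEL_ID, config, and the send/receive coroutines from the earlier example (assumed to be in scope here), and the LiveClientContent usage should be checked against your SDK version.

import asyncio
from google.genai import types

conversation_history: list[types.Content] = []  # populated from transcripts as turns complete

async def run_with_reconnect(max_backoff: float = 30.0):
    """Reconnect loop that re-injects cached history after every drop."""
    backoff = 1.0
    while True:
        try:
            async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
                backoff = 1.0  # reset once a connection succeeds
                if conversation_history:
                    # Prime the fresh session with the prior turns as client_content.
                    await session.send(input=types.LiveClientContent(
                        turns=conversation_history, turn_complete=True))
                # Hand off to the send/receive tasks from the earlier example.
                await asyncio.gather(send_audio(), receive_audio())
        except Exception as exc:
            print(f"Connection dropped ({exc}); retrying in {backoff:.0f}s")
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)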
3. Video Integration
For field service or technical support apps, you can send video frames.
- Rate Limit: Do not send 60fps video. Sending 1-2 frames per second (FPS) is usually sufficient for the model to understand visual context (e.g., identifying a broken cable) without blowing up token costs and bandwidth.
- Format: Convert frames to base64-encoded JPEG chunks before sending via realtime_input; a capture-loop sketch follows below.
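A minimal capture loop under those constraints might look like the sketch below: it samples roughly one frame per second from OpenCV, JPEG-encodes it, and ships it as a base64 payload through the same session used for audio. The mime_type/data message shape mirrors the audio pattern from the earlier example and should be checked against the SDK version you use.

import asyncio
import base64

import cv2  # pip install opencv-python

FRAME_INTERVAL_SECONDS = 1.0  # ~1 FPS is enough for visual context

async def send_video_frames(session, camera_index: int = 0):
    """Captures webcam frames at ~1 FPS and streams them as base64 JPEG chunks."""
    capture = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = await asyncio.to_thread(capture.read)
            if not ok:
                break
            # Encode the raw frame as JPEG, then base64 for the realtime_input payload.
            ok, jpeg = cv2.imencode(".jpg", frame)
            if ok:
                payload = {
                    "mime_type": "image/jpeg",
                    "data": base64.b64encode(jpeg.tobytes()).decode("ascii"),
                }
                await session.send(input=payload, end_of_turn=False)
            await asyncio.sleep(FRAME_INTERVAL_SECONDS)
    finally:
        capture.release()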
Conclusion
The Gemini Live API moves us away from brittle, high-latency chains of isolated models toward unified, multimodal reasoning engines. For engineering teams, the challenge shifts from managing pipeline latency to managing WebSocket state and concurrency.
If your organization is looking to build custom agents that leverage these real-time capabilities for complex workflows, partnering with an expert in AI engineering services for enterprises can accelerate your path to production.
4Geeks offers specialized expertise in product, growth, and AI engineering, helping global teams deploy robust, scalable AI solutions.
FAQs
How does the Gemini Live API improve performance compared to traditional AI pipelines?
The Gemini Live API significantly reduces latency by replacing the traditional "request-response" chain—which typically links Speech-to-Text (STT), LLM inference, and Text-to-Speech (TTS)—with a single, stateful bidirectional streaming connection. By bridging modalities natively, the model can process live audio and video inputs simultaneously without the delays caused by passing data between separate models.
What is "barge-in" and how does it work in real-time voice interactions?
"Barge-in" is a feature that allows a user to interrupt the model while it is speaking, creating a natural conversational flow. The system uses Voice Activity Detection to identify when the user starts speaking and immediately sends an interrupted signal from the server. This triggers the client to halt audio playback and clear the audio buffer instantly to prevent "ghost" audio from playing.
What are the best practices for streaming video to the Gemini Live API?
To optimize for bandwidth and token costs without sacrificing performance, it is recommended to limit video input to 1-2 frames per second (FPS). High frame rates (like 60fps) are generally unnecessary for the model to understand visual context. Additionally, video frames should be converted to base64-encoded JPEG chunks before being sent through the real-time input stream.