Architecting Real-Time Multimodal Agents with Gemini and WebSockets
The era of "text-in, text-out" chatbots is rapidly fading. Modern enterprise applications demand "Live" agents—intelligent systems capable of perceiving and responding to audio, video, and text in real-time. For a CTO or Senior Software Engineer, the challenge isn't just prompting an LLM; it is architecting a low-latency, stateful pipeline that handles multimodal streams effectively.
In this article, we will break down the architecture required to build a real-time multimodal conversational agent using Google’s Gemini models and Python. We will focus specifically on the Bidirectional (Bidi) Streaming capabilities of the Live API via WebSockets, which allow for interruptible, human-like voice interactions.
As a partner in AI engineering services for enterprises, 4Geeks frequently assists organizations in migrating from static request-response models to these dynamic, session-based architectures.
The Architectural Shift: From REST to WebSockets
Traditional LLM integration relies on stateless HTTP requests. However, true multimodal conversations (like voice assistants or video analysis agents) require a persistent connection to handle continuous data streams.
The architecture generally follows this pattern:
- Client Layer: Captures audio (PCM) or video frames and streams them over a WebSocket (a minimal framing sketch follows this list).
- Orchestration Layer (Backend): A Python service (e.g., FastAPI) that validates the session, manages state, and proxies the stream to the Gemini API.
- Model Layer: A low-latency Gemini Flash model processing the multimodal stream and generating tokens on the fly.
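To make the Client Layer concrete, here is a minimal framing sketch showing how a client might chunk and ship PCM audio to the proxy we build below. It assumes raw 16 kHz, 16-bit mono PCM is already available (here, read from a pre-recorded file whose path is a placeholder) and uses the websockets package; the /ws/chat path and the audio_chunk JSON envelope are conventions defined later in this article, not part of the Gemini API.
import asyncio
import base64
import json

import websockets  # pip install websockets

SAMPLE_RATE = 16000       # 16 kHz mono PCM
BYTES_PER_SAMPLE = 2      # 16-bit little-endian
CHUNK_MS = 20             # small chunks keep buffer latency low
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 640 bytes

async def stream_pcm(path: str, url: str = "ws://localhost:8000/ws/chat"):
    async with websockets.connect(url) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_BYTES):
                # Base64-encode so the raw bytes can travel inside a JSON frame
                await ws.send(json.dumps(
                    {"audio_chunk": base64.b64encode(chunk).decode("ascii")}
                ))
                await asyncio.sleep(CHUNK_MS / 1000)  # pace roughly in real time

asyncio.run(stream_pcm("sample_16k.pcm"))  # placeholder file name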
Technical Implementation
We will use the modern google-genai SDK (the V2 Python SDK) and FastAPI to build a secure backend proxy. Direct client-to-API connections are possible but discouraged for enterprise use cases due to security risks (exposing API keys) and lack of control over the session context.
1. Prerequisites and Setup
Ensure you have the latest SDK installed. In a production environment, you would manage dependencies via poetry or requirements.txt.
pip install -q -U google-genai fastapi uvicorn websockets
2. Initializing the Gemini Client
We initialize the client and define a session configuration. Flash-class models are preferred for conversational agents due to their significantly lower Time to First Token (TTFT) compared to Pro; the live session we open later uses the Live-capable gemini-2.0-flash-exp variant.
import os
from google import genai
from google.genai import types
# Initialize the client
# Ensure GEMINI_API_KEY is set in your environment variables
client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY"))
# Configuration for the session
config = types.GenerateContentConfig(
    temperature=0.7,
    max_output_tokens=2048,
    response_modalities=["TEXT"]  # or ["AUDIO"] for voice-to-voice
)
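Note that the exact config type can depend on the installed SDK version: recent google-genai releases expose a dedicated types.LiveConnectConfig for live sessions, while GenerateContentConfig targets standard (non-live) calls. A minimal sketch of the live-specific variant, assuming a current SDK release:
# Live-session configuration (recent google-genai releases)
live_config = types.LiveConnectConfig(
    response_modalities=["TEXT"],  # or ["AUDIO"] for voice-to-voice
)
If your installed version rejects a GenerateContentConfig when opening a live connection, switching to LiveConnectConfig is usually the fix.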
3. Building the WebSocket Proxy
This is the core of the engineering challenge. We must create a WebSocket endpoint in FastAPI that accepts client audio streams, forwards them to Gemini, and streams the response back to the client.
We utilize Python's asyncio to handle the bidirectional flow without blocking.
import asyncio
import base64

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from google.genai import types

app = FastAPI()

@app.websocket("/ws/chat")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    # Establish a live session with Gemini
    async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
        try:
            # Create tasks to handle sending and receiving concurrently
            receive_task = asyncio.create_task(handle_client_input(websocket, session))
            send_task = asyncio.create_task(handle_model_output(websocket, session))
            # Wait for either to finish (or error)
            done, pending = await asyncio.wait(
                [receive_task, send_task],
                return_when=asyncio.FIRST_COMPLETED,
            )
            for task in pending:
                task.cancel()
        except Exception as e:
            print(f"Session error: {e}")
        finally:
            await websocket.close()

async def handle_client_input(websocket: WebSocket, session):
    """Receives audio/text from the client and pushes it to Gemini."""
    # Note: newer google-genai releases supersede session.send with
    # send_client_content / send_realtime_input; adjust to your SDK version.
    try:
        while True:
            # Expecting JSON with base64-encoded audio chunks or text
            data = await websocket.receive_json()
            if "text" in data:
                await session.send(input=data["text"], end_of_turn=True)
            elif "audio_chunk" in data:
                # data['audio_chunk'] is base64-encoded PCM data;
                # Gemini expects raw bytes, so decode before forwarding
                pcm_bytes = base64.b64decode(data["audio_chunk"])
                await session.send(
                    input={"mime_type": "audio/pcm", "data": pcm_bytes},
                    end_of_turn=False,
                )
    except WebSocketDisconnect:
        print("Client disconnected")

async def handle_model_output(websocket: WebSocket, session):
    """Receives the stream from Gemini and pushes it to the client."""
    async for response in session.receive():
        # Gemini sends text chunks or audio bytes depending on config
        if response.text:
            await websocket.send_json({"text": response.text})
        elif response.data:
            # Handle audio bytes output if voice-to-voice is enabled
            await websocket.send_bytes(response.data)
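To exercise the proxy locally, run the FastAPI app with uvicorn and connect with a small test client. The sketch below assumes the server module is named main.py and only sends a single text turn; the JSON shapes simply match the endpoint defined above.
# Run the server first:  uvicorn main:app --port 8000
import asyncio
import json

import websockets

async def smoke_test():
    async with websockets.connect("ws://localhost:8000/ws/chat") as ws:
        # Send one text turn through the proxy
        await ws.send(json.dumps({"text": "What can you help me with today?"}))
        # Print streamed text chunks until the stream goes quiet or closes
        try:
            while True:
                message = await asyncio.wait_for(ws.recv(), timeout=10)
                print(json.loads(message).get("text", ""), end="", flush=True)
        except (asyncio.TimeoutError, websockets.ConnectionClosed):
            pass

asyncio.run(smoke_test())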
4. Handling Audio Formats and Latency
One common engineering pitfall in multimodal integration is audio formatting. Gemini Live API generally expects:
- Format: Linear PCM (16-bit, little-endian)
- Sample Rate: 16kHz or 24kHz (consistency is key)
- Chunk Size: Send audio in small chunks (e.g., 20ms - 40ms duration) to minimize buffer latency. Sending 1-second chunks will result in noticeable lag for the user.
If your client application (React Native, Flutter, or Swift) captures audio in a different format (such as AAC or Opus), you must transcode it to PCM before sending it to the backend, or transcode on the server side with a tool like ffmpeg, though the server-side route adds latency. A server-side sketch follows.
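For reference, the sketch below shells out to the ffmpeg CLI to convert an arbitrary input file (AAC, Opus, MP3, etc.) into raw 16-bit, 16 kHz mono PCM. It assumes ffmpeg is installed on the host; the file paths are placeholders, and in a real pipeline you would transcode on the client or stream through ffmpeg rather than converting whole files.
import subprocess

def to_pcm_16k(input_path: str, output_path: str = "audio_16k.pcm") -> str:
    """Transcode any ffmpeg-readable audio file to raw 16-bit / 16 kHz mono PCM."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", input_path,        # source file (AAC, Opus, MP3, ...)
            "-f", "s16le",           # raw signed 16-bit little-endian output
            "-acodec", "pcm_s16le",  # 16-bit PCM samples
            "-ar", "16000",          # resample to 16 kHz
            "-ac", "1",              # downmix to mono
            output_path,
        ],
        check=True,
    )
    return output_path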
Advanced Pattern: Tool Use and Function Calling
To make the agent truly useful for "enterprise workflows," it needs to interact with your internal APIs (e.g., checking inventory, booking appointments). Gemini supports dynamic function calling within the streaming session.
You define tools in the initial configuration:
def get_inventory_status(item_id: str):
    # Mock database lookup
    return {"item_id": item_id, "status": "in_stock", "quantity": 150}

tools = [get_inventory_status]

# Update config to include tools
config = types.GenerateContentConfig(
    tools=tools,
    response_modalities=["TEXT"]
)
When the model decides to call a function, it pauses generation and sends a function_call signal. Your backend must execute the Python function and return the function_response to the session context, allowing the model to generate the final natural language response.
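A sketch of that loop, extending the handle_model_output coroutine from earlier: it assumes a recent google-genai release in which the live session surfaces tool calls on the response object and exposes send_tool_response (older releases routed tool results back through session.send); the dispatch table is our own convention, not part of the SDK.
# Hypothetical tool-aware version of handle_model_output
TOOL_REGISTRY = {"get_inventory_status": get_inventory_status}

async def handle_model_output(websocket: WebSocket, session):
    async for response in session.receive():
        if response.tool_call:
            # Execute each requested function and return the results to the session
            results = []
            for call in response.tool_call.function_calls:
                output = TOOL_REGISTRY[call.name](**(call.args or {}))
                results.append(types.FunctionResponse(
                    id=call.id, name=call.name, response=output,
                ))
            await session.send_tool_response(function_responses=results)
        elif response.text:
            await websocket.send_json({"text": response.text})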
Performance Considerations for CTOs
When deploying these AI engineering services for enterprises, consider the following metrics:
- Time to First Token (TTFT): For voice interactions, TTFT needs to be under 500ms to feel natural. Use Gemini Flash variants and ensure your WebSocket infrastructure (e.g., AWS API Gateway + Lambda or Kubernetes) is optimized for persistent connections.
- Context Window Management: While Gemini 1.5 supports up to 1-2 million tokens, keeping the context full of raw audio data increases cost and latency. Implement a sliding window strategy or summarize older turns in the conversation history.
- Safety Guardrails: Real-time agents can hallucinate or produce unsafe content. Always configure safety_settings in the GenerateContentConfig with BLOCK_LOW_AND_ABOVE thresholds for enterprise-facing applications; a configuration sketch follows this list.
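A minimal sketch of that safety configuration, using two harm categories as examples; tune the categories and thresholds to your own compliance requirements:
config = types.GenerateContentConfig(
    response_modalities=["TEXT"],
    safety_settings=[
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_HARASSMENT,
            threshold=types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
            threshold=types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
    ],
)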
Conclusion
Integrating Gemini for real-time multimodal conversations requires a shift from stateless REST paradigms to event-driven, streaming architectures. By leveraging WebSockets and the low-latency capabilities of Gemini Flash, engineering teams can build agents that don't just "read" text, but "listen" and "watch" alongside the user.
At 4Geeks, we specialize in building these high-performance architectures. Whether you need to deploy custom AI agents or scale your cloud infrastructure to handle real-time streaming, we are your partner in engineering excellence.
FAQs
Why are WebSockets preferred over REST for building real-time multimodal agents?
Traditional REST architectures rely on stateless HTTP requests, which are inefficient for continuous, live interactions. In contrast, WebSockets establish a persistent, bidirectional connection that allows for the streaming of audio and video data in real-time. This architecture supports "live" sessions where the agent can perceive and respond to streams immediately, enabling interruptible, human-like voice interactions that static request-response models cannot achieve. For organizations looking to implement these dynamic architectures, 4Geeks AI Engineering services provide the expertise to migrate from legacy models to stateful, session-based systems.
How can developers minimize latency in voice-enabled Gemini applications?
To ensure natural voice interactions, keeping the Time to First Token (TTFT) under 500ms is critical. Developers should utilize a Gemini Flash model, which is optimized for speed, rather than larger Pro-class models. Additionally, audio should be formatted as Linear PCM (16-bit, 16kHz or 24kHz) and transmitted in small chunks (e.g., 20ms to 40ms) via the WebSocket pipeline. Sending large audio buffers introduces noticeable lag, disrupting the user experience in enterprise applications.
How does Gemini's function calling capability enhance enterprise AI workflows?
Function calling transforms a conversational agent from a passive chatbot into an active tool capable of executing business logic. By defining tools—such as Python functions for checking inventory or booking appointments—within the session configuration, the model can pause generation to request specific data processing. The backend orchestration layer then executes these functions and returns the results to the model, allowing the AI to generate accurate, data-driven responses. This feature is essential for integrating 4Geeks AI Engineering solutions into complex internal APIs and operational workflows.