
The first time I tested interruption on the Gemini Live API — spoke mid-sentence while the model was responding — it stopped immediately and listened.

I made my teammate try it before I said anything. She spoke over the model. It stopped. She said: "Oh. That's different." That's the reaction you're trying to produce in your users. Not "this is impressive for AI" — just "oh, it works like talking to a person."

Every other voice AI implementation I'd built felt like a chatbot with a voice skin. Record → Whisper → LLM → TTS → play. Each step adds latency. By the time the user hears a response, 3–5 seconds have passed. The Gemini Live API is architecturally different: audio goes in, audio comes out, over a persistent WebSocket, with the model handling transcription, reasoning, and speech synthesis internally. The result: 320–800ms end-to-end — fast enough that interruption works, which changes the entire conversational dynamic.

Here's everything I learned building production voice interfaces with this API.


The One Thing That Breaks Everything If You Miss It

Heads Up

Send the setup message immediately after the WebSocket open event — before you send any audio. If audio arrives before setup, the connection state is undefined and you get silence or errors with no useful diagnostic. This is the most common first-day mistake.


Connection Model

This is not a request/response API. One WebSocket, open for the entire conversation, up to 10 minutes per session.


Audio specs — the numbers that matter:

| Direction | Format | Sample Rate | Bit Depth | Channels |
| --- | --- | --- | --- | --- |
| Input (you → Gemini) | PCM | 16kHz | 16-bit | Mono |
| Output (Gemini → you) | PCM | 24kHz | 16-bit | Mono |

Input chunk size: ~100ms (1600 samples at 16kHz).

Wrong format = silent failure. The API accepts the connection and drops the audio without an error. Match these exactly.
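Browsers hand you Float32 samples in [-1, 1], so you have to convert to 16-bit PCM yourself. A sketch of that conversion — it assumes the capture is already at 16kHz mono (e.g. via an AudioContext created with `sampleRate: 16000`), so only the bit-depth step is shown:

```typescript
// Sketch: convert Web Audio Float32 samples ([-1, 1]) to the 16-bit
// little-endian PCM the API expects.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to avoid overflow
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // scale to int16 range
  }
  return out;
}
```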


The Setup Message

Everything about the session — model, voice, tools, system prompt, VAD configuration — is set in the first message you send. There's no way to change it mid-session.

private sendSetup() {
  this.ws.send(JSON.stringify({
    setup: {
      model: 'models/gemini-2.0-flash-live-001',
      generationConfig: {
        responseModalities: ['AUDIO'],
        speechConfig: {
          voiceConfig: { prebuiltVoiceConfig: { voiceName: 'Aoede' } },
        },
      },
      systemInstruction: { parts: [{ text: this.config.systemInstruction }] },
      tools: this.config.tools ?? [],
      realtimeInputConfig: {
        automaticActivityDetection: {
          disabled: false,
          startOfSpeechSensitivity: 'START_SENSITIVITY_HIGH',
          endOfSpeechSensitivity: 'END_SENSITIVITY_MEDIUM',
          silenceDurationMs: 800,
        },
      },
    },
  }));
}

Available voices: Puck, Charon, Kore, Fenrir, Aoede (English). Aoede is the most neutral for assistant use cases. Fenrir sounds more authoritative. Test them with your specific system prompt — the voice changes how the content lands.


Voice Activity Detection (VAD)

VAD is what makes the conversation feel natural — the model knows when you're done speaking without you pressing a button.


VAD sensitivity tradeoffs:

| Parameter | Set too low | Set too high |
| --- | --- | --- |
| startOfSpeechSensitivity | Misses soft speech | May trigger on noise |
| endOfSpeechSensitivity | Waits too long after you stop | Cuts off mid-thought |
| silenceDurationMs | 500ms — rapid Q&A | 1200ms — users who pause |

For noisy environments, add client-side RMS gating: only send audio chunks when amplitude is above a noise floor threshold. This prevents background noise from burning tokens on the Gemini side.
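A minimal sketch of that gate — the 0.01 noise floor is an assumption you'd tune per device and environment, not a recommended constant:

```typescript
// Sketch: client-side RMS gate. Chunks below the noise floor are not sent,
// so background hiss never reaches the API.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return Math.sqrt(sum / samples.length);
}

function shouldSend(chunk: Float32Array, noiseFloor = 0.01): boolean {
  return rms(chunk) >= noiseFloor;
}
```

In practice you would check `shouldSend(chunk)` on each ~100ms capture buffer before encoding and forwarding it.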

Heads Up

Echo cancellation is essential when audio plays through speakers. Without it, the model hears its own voice, tries to respond to itself, and you get a feedback loop. Set echoCancellation: true in getUserMedia. For speaker setups without headphones, you may need acoustic echo cancellation at the server level.
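The capture constraints look roughly like this — `echoCancellation` is the critical flag, and the companions shown are reasonable defaults rather than requirements (browsers treat `sampleRate` as a hint and may resample):

```typescript
// Sketch: getUserMedia constraints for Live API capture.
const constraints = {
  audio: {
    echoCancellation: true, // stops the model hearing its own voice
    noiseSuppression: true,
    autoGainControl: true,
    channelCount: 1,        // the API expects mono
    sampleRate: 16000,      // hint only; verify and resample if ignored
  },
};
// In the browser:
// const stream = await navigator.mediaDevices.getUserMedia(constraints);
```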


Interruption Handling

This is the feature that changes the entire feel of the product, and it deserves more than a paragraph.

When VAD detects user speech while audio is still playing, Gemini sends an interrupted signal. The model has already discarded its remaining response and switched to listening mode. Your client must match that immediately: stop queued audio, clear the buffer, signal ready.

The implementation is straightforward. The experience is not. What you're shipping is the feeling of being able to say "wait, actually—" to an AI and have it stop talking. Nobody who has used conventional voice AI has experienced this. The first time users encounter it, they usually test it on purpose — speak over the model, watch it stop, speak over it again. It feels like a feature worth discovering.

Get this right and users will show the product to other people. Get it wrong (queued audio plays out, model ignores the interruption) and the whole interaction feels broken.
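The client-side half can be as small as this sketch. `Playback`, its queue, and the tracked source objects are illustrative — the only real contract is that when the server's `interrupted` signal arrives, everything scheduled or queued gets dropped immediately:

```typescript
// Sketch: clear all pending audio the moment an interruption arrives.
class Playback {
  private queue: ArrayBuffer[] = [];
  private sources: { stop: () => void }[] = [];

  enqueue(chunk: ArrayBuffer): void { this.queue.push(chunk); }

  // Track each scheduled source (e.g. an AudioBufferSourceNode) so it can be stopped.
  track(source: { stop: () => void }): void { this.sources.push(source); }

  get pending(): number { return this.queue.length; }

  // Call when a server message arrives with the interrupted signal.
  handleInterrupted(): void {
    for (const s of this.sources) s.stop(); // silence anything already scheduled
    this.sources = [];
    this.queue = [];                        // drop audio that hasn't played yet
  }
}
```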


Tool Calling in Voice

Tool calling pauses audio output while waiting for your function result. For slow tools, this creates audible dead silence.


For slow tools: respond immediately with { status: "pending", message: "Looking that up..." }. The model speaks this interim message while your tool runs. When the real result arrives, send a second function_response. The model picks up with the actual data. Without this pattern, slow tools produce 2–5 seconds of silence — in a voice conversation, that feels like the model crashed.
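A sketch of the two-response pattern. The `toolResponse` / `functionResponses` shape follows the Live API's JSON; `send` and `runTool` are illustrative helpers you'd supply:

```typescript
// Sketch: interim "pending" response followed by the real result.
async function handleToolCall(
  call: { id: string; name: string; args: unknown },
  send: (msg: object) => void,
  runTool: (name: string, args: unknown) => Promise<object>,
): Promise<void> {
  // 1. Respond immediately so the model has something to say during the wait.
  send({
    toolResponse: {
      functionResponses: [{
        id: call.id,
        name: call.name,
        response: { status: 'pending', message: 'Looking that up...' },
      }],
    },
  });
  // 2. Send the real result when the slow tool finishes.
  const result = await runTool(call.name, call.args);
  send({
    toolResponse: {
      functionResponses: [{ id: call.id, name: call.name, response: result }],
    },
  });
}
```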


Latency Breakdown

| Component | Typical |
| --- | --- |
| Network: browser → your server | 10–30ms |
| Audio buffering (100ms chunks) | 0–100ms |
| Network: server → Gemini | 20–80ms |
| Gemini: time to first audio token | 200–400ms |
| Network: Gemini → server → browser | 20–80ms |
| Browser: audio decode + playback | 10–20ms |
| Total TTFA | ~320–800ms |

TTFA = Time To First Audio. Gemini starts streaming audio before generation is complete — you hear the first words while the rest is still being generated.

Optimization levers:

  • Deploy your server in us-central1 (closest to Gemini endpoints) — saves 20–50ms
  • Forward audio chunks to the client immediately as they arrive — don't buffer
  • Pre-warm connections on server start — first connection after cold start is slower

Session Renewal

Sessions have a hard 10-minute limit.


Schedule renewal at 9 minutes (not 10) to allow overlap. Give the old session 5 seconds to finish any in-progress audio before closing. Handle unexpected closes with auto-restart — WebSocket connections drop. Your session management code will be tested on this more than you expect.
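The timing above can be sketched as follows. The durations are the article's numbers; `connect`, `closeOld`, and the injectable timer are illustrative:

```typescript
// Sketch: renew at 9 minutes (the API's hard limit is 10), overlap sessions,
// and give the old one 5 seconds to drain in-progress audio.
const RENEW_AT_MS = 9 * 60 * 1000; // one-minute margin before the hard limit
const DRAIN_MS = 5 * 1000;         // let in-progress audio finish

function scheduleRenewal(
  connect: () => Promise<void>,   // opens the replacement session
  closeOld: () => void,           // closes the current session
  setTimer: (fn: () => void, ms: number) => void = setTimeout,
): void {
  setTimer(async () => {
    await connect();              // open the new session first, so they overlap
    setTimer(closeOld, DRAIN_MS); // then drain and close the old one
  }, RENEW_AT_MS);
}
```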




Building on the Gemini Live API is 80% WebSocket plumbing, 15% audio format debugging, and 5% AI configuration. That ratio surprises people expecting the AI to be the hard part. It's not. The hard parts are: managing WebSocket reconnects gracefully, handling the edge cases in VAD (background noise, self-interruption during fast tool calls, the feedback loop if you miss echo cancellation). Get the infrastructure right and the AI does something genuinely impressive. Skip the infrastructure details and you get a demo that works once under ideal conditions.
