Designing barge-in that actually works
VAD is the load-bearing component. Most VADs are wrong for the job. A field guide to interruption that doesn't apologize, doesn't cough-trigger, and doesn't fall apart on a real phone call.
A user coughs and your agent stops mid-sentence. Worse: it apologizes, loses the thread, and asks them to repeat the question it was already answering. That is barge-in done wrong, and it is the single most common reason a “fast” voice agent feels broken in production.
Barge-in is the ability for a caller to interrupt the assistant while it is speaking, have the TTS stop immediately, and have the next user utterance handled with full conversational context. It sounds simple. It is not. You are asking a pipeline to decide, within ~100–300ms, whether an audio frame coming from the microphone is speech intended for the agent, a breath, a doorbell, the agent’s own voice echoing back, or the user’s seven-year-old shouting in the next room. Every one of those categories has to route differently, and getting any of them wrong damages the conversation.
VAD is the load-bearing component, and most VADs are wrong for this job
Voice Activity Detection is the gate. The classic open-source option, WebRTC VAD, is a Gaussian Mixture Model built for telephony — fast, tiny, and calibrated for silence detection rather than speech discrimination. At a 5% false-positive rate it catches roughly 50% of true speech frames; a learned DNN VAD like Silero gets to ~88% at the same false-positive rate with comparable CPU cost. For barge-in, that gap is the difference between “interrupts on every cough” and “actually listens.”
But raw VAD scores are not enough. You need hysteresis. A practical pattern: require ~80ms of continuous voiced frames above a speech-probability threshold (e.g., Silero score > 0.7) to declare user_speaking = true, and ~200–500ms of sub-threshold frames to declare user_silent. Those two debounce windows are the single most impactful tuning knob in the whole stack. Too short on the onset and a cough triggers a false interrupt; too long and the agent talks over the user for a full syllable before yielding. OpenAI’s Realtime server VAD exposes exactly these as prefix_padding_ms (default 300ms) and silence_duration_ms (default 500ms), with an interrupt_response: true flag that decides whether a VAD-start event cancels the in-flight TTS (OpenAI docs). Developers on the forum have been loud about the defaults being too aggressive — which is the right critique, because the defaults optimize for a quiet studio mic, not a real phone call.
The double-talk problem is why naive systems fail in the field
When the agent is speaking through a speaker and the user’s mic picks that audio back up, the VAD will happily flag the echo as user speech and interrupt the agent every few seconds. This is the double-talk problem, and the fix is an acoustic echo canceller with a double-talk detector: a linear adaptive filter subtracting the known reference (what you sent to the speaker) from the mic signal, followed by a neural residual suppressor, gated so the AEC does not adapt during simultaneous user+agent audio. Telephony SDKs (Twilio, LiveKit, Daily) handle this at the media layer. If you are rolling your own WebRTC path, you must enable browser AEC explicitly and feed the TTS back as the reference stream, or every barge-in will be a self-triggered hallucination. This also determines whether your system is truly full-duplex (mic and speaker active simultaneously with cross-cancellation) or half-duplex (mic muted while agent speaks — easier, but eliminates real barge-in entirely).
How the providers actually behave
The cascaded-stack platforms converge on similar designs with different defaults. Retell ships a proprietary turn-taking model tuned to distinguish pauses from turn ends, with configurable interruption sensitivity and an explicit interruption-recovery path (cancel TTS, flush buffer, process full utterance, respond with context intact). Vapi layers Krisp-based denoising with an adaptive filter that learns the acoustic floor over a 3-second rolling window and suppresses audio falling 15dB below the 85th-percentile speech level — effectively per-call auto-calibration for noisy rooms. Gemini Live does server-side VAD by default, emits an explicit interrupted signal the client must honor by stopping playback immediately, and requires 20–40ms audio chunks upstream or the interrupt signal never fires.
Kyutai’s Moshi rejects this entire architecture. Instead of bolting a VAD onto a half-duplex pipeline, Moshi models two parallel audio streams — user and agent — as a single joint autoregressive sequence, removing the concept of “turns” entirely. Overlaps, backchannels, and interruptions emerge natively from the token stream at ~200ms theoretical latency. This is the full-duplex endgame, and it is the reason end-to-end S2S models are a structurally different product from cascaded ones, not just a faster one.
You cannot eyeball barge-in in production
The evaluation problem is ugly. A “barge-in worked” log line does not tell you whether it fired on a cough, cut the user off mid-word, or missed a genuine interruption until the user repeated themselves. A serious eval needs a held-out set of labeled audio clips — confederate-interrupts, throat-clears, laughter, TV-in-background, partner-in-next-room — replayed through the live agent with four metrics tracked: time-to-yield (first voiced frame to TTS-stop, target <200ms), false-trigger rate per minute of agent speech, missed-interrupt rate, and context-preservation rate (did the agent pick up where it left off, or did it reset?). Without that fixture, every “improvement” is vibes.
The minimum viable stack
For a production agent where you own the plumbing: Silero VAD at 16kHz with 80ms onset hysteresis and 400ms offset hysteresis, AEC via the media transport with double-talk detection enabled (WebRTC’s built-in AEC3 if browser-side, LiveKit/Daily’s server-side AEC otherwise), adaptive noise suppression trained on the actual call distribution (Krisp or equivalent), an interrupt_response path that cancels TTS within one audio frame and preserves the unspoken tail in conversation state, and a labeled replay fixture you run on every deploy. Skip any of these and you will ship a demo, not a product.
We benchmark these behaviors across the major providers — latency, false-trigger rates, and recovery quality by use case. If you are shipping voice and getting burned by false interrupts, that is where to start.