Building voice AI for noisy real-world audio

A call-center agent’s headset mic picks up two colleagues arguing three desks over and a coffee grinder two meters behind her. Your STT returns eight words right out of twenty. The customer is already angry. You ship the fix on Monday or you lose the account. This article is the shape of that fix.

Studio-clean audio is a fiction that benchmarks keep alive. FLEURS and LibriSpeech are recorded in conditions no production user will ever reproduce — close-talking mics, quiet rooms, read speech, no cross-talk. When your provider tops a leaderboard at 5% WER on LibriSpeech test-clean and then chokes at 22% WER on a real call-center stream, nobody lied; you just evaluated on the wrong distribution. The question is never “which model is most accurate.” It is “which model degrades gracefully as SNR drops, and what do I put in front of it.”

Noise is not one thing

Treat noise as four separate problems and the stack starts to make sense.

Stationary additive noise is the easy case: HVAC hum, fan whine, constant traffic rumble. The spectral profile is steady, so classical spectral subtraction and decades-old Wiener filtering handle it well.

Non-stationary additive noise is the hard case: a door slam, a keyboard, a baby crying, a colleague laughing. The spectrum shifts frame to frame and competes with the speech envelope itself.

Babble, the specific noise of other human voices, is the worst of all because it occupies the same formant space as the target speaker, so any suppressor that preserves speech also preserves the interference.

Convolutional noise is reverberation: the room smears each phoneme across 200–800 ms of its own reflections, which destroys the temporal cues phonetic classifiers rely on. A glass-walled conference room with an RT60 (the time reverberant energy takes to decay by 60 dB) of 600 ms will wreck a model that was fine in an anechoic booth.
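To make the additive versus convolutional distinction concrete, the sketch below degrades a clean signal both ways. It is a toy, standard-library-only illustration rather than a production augmentation pipeline: `mix_at_snr`, `synthetic_rir`, and `convolve` are hypothetical helper names, and the impulse response is assumed to be exponentially decaying random noise, which only approximates a real room.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech, noise, snr_db):
    """Additive noise: scale `noise` so the mix has the requested SNR in dB."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def synthetic_rir(rt60_s, sample_rate, seed=0):
    """Toy room impulse response whose envelope falls 60 dB over rt60_s seconds."""
    rng = random.Random(seed)
    decay = math.log(1000.0) / rt60_s  # a factor of 1000 in amplitude is 60 dB
    n_taps = int(rt60_s * sample_rate)
    rir = [rng.gauss(0.0, 0.3) * math.exp(-decay * i / sample_rate)
           for i in range(n_taps)]
    rir[0] = 1.0  # direct path arrives first, at full level
    return rir


def convolve(x, h):
    """Convolutional noise: every input sample is smeared across the whole RIR."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


if __name__ == "__main__":
    sr = 1000  # deliberately low rate so the pure-Python convolution stays fast
    rng = random.Random(1)
    speech = [math.sin(2 * math.pi * 120 * t / sr) for t in range(sr // 2)]
    noise = [rng.gauss(0.0, 1.0) for _ in speech]

    noisy = mix_at_snr(speech, noise, snr_db=10.0)
    reverberant = convolve(speech, synthetic_rir(0.6, sr))
    # The reverberant output is longer than the input: the extra tail is the room.
    print(len(speech), len(noisy), len(reverberant))
```

The additive mix leaves the speech samples intact underneath the noise. The convolved signal has no clean copy of the speech left in it, which is why a suppressor built for additive noise does little for reverberation.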

Your provider choice matters differently for each category. Whisper was trained on 680,000 hours of wildly heterogeneous internet audio, and the original paper shows it degrading more gracefully than LibriSpeech-trained systems as SNR falls: NVIDIA’s NeMo models win at SNR > 15 dB and lose to Whisper below 10 dB. Alibaba’s qwen3-asr-flash, which wins six of eight languages on Speko’s internal noise-matrix benchmark, behaves similarly: it is strong on clean audio, but its real advantage shows up in the non-stationary noise conditions where weaker models collapse. Deepgram Nova-3, by contrast, was explicitly trained on telephony-grade audio and is unusually stable in the 5–10 dB SNR band that dominates call-center traffic.

SNR, in the units your ears don’t measure in

A senior engineer should carry these numbers in working memory. A quiet office is roughly 30 dB SNR, effectively clean. A typical home-office call sits at 20 dB, still comfortable for any modern ASR. A coffee shop is 10–15 dB. A busy open-plan office with cross-talk is 5–10 dB. A moving car cabin on a highway ranges from -5 to 10 dB SNR, with transient events pushing it lower. Once you drop below 10 dB, WER roughly doubles for every 10 dB lost; that is the single most important empirical rule in this entire article. If your production median is 8 dB and your benchmark was run at 25 dB, your published WER can be off by a factor of three or more.
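Because decibels are logarithmic, the gaps between those environments are larger than the numbers suggest. Here is a minimal sketch of the arithmetic, assuming you can isolate a noise-only stretch of the recording (for example, a pause flagged by a voice activity detector). The function names are illustrative and not taken from any library.

```python
import math


def power(samples):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def snr_db(signal_power, noise_power):
    """SNR in decibels from signal and noise power."""
    return 10.0 * math.log10(signal_power / noise_power)


def power_ratio(db):
    """Inverse mapping: how many times stronger the speech is than the noise."""
    return 10.0 ** (db / 10.0)


def estimate_snr_db(speech_plus_noise, noise_only):
    """Estimate SNR by subtracting noise power measured in a speech-free stretch.

    Assumes the noise is roughly stationary and uncorrelated with the speech,
    so the estimate is least reliable in the non-stationary cases.
    """
    p_noise = power(noise_only)
    if p_noise == 0.0:
        return math.inf  # digitally silent noise floor
    p_speech = max(power(speech_plus_noise) - p_noise, 1e-12)  # guard log(0)
    return snr_db(p_speech, p_noise)


if __name__ == "__main__":
    for label, db in [("quiet office", 30), ("home office", 20),
                      ("coffee shop", 10), ("car cabin, worst case", -5)]:
        print(f"{label}: {db} dB -> speech power is "
              f"{power_ratio(db):.2f}x noise power")
```

At 30 dB the speech carries 1000 times the power of the noise. At -5 dB the noise carries roughly three times the power of the speech.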

The suppression tradeoff

The instinct is to put a noise suppressor in front of every STT call. This is sometimes wrong.

Real-time suppressors like RNNoise run at ~10 ms algorithmic latency on a single CPU core and handle stationary noise cleanly. Krisp and NVIDIA Maxine use larger deep models and are more aggressive — they reduce perceived noise further but introduce what practitioners call “spectral holes”: brief deletions of the speech signal itself when the network misclassifies a voiced frame as noise. Modern ASR models hate spectral holes more than they hate the original noise, because noise is in their training distribution and surgical gaps are not. The asymmetric result: aggressive suppression can improve human listening experience while making WER worse.

The rule I ship with: suppress stationary noise aggressively, suppress non-stationary noise conservatively, and never suppress babble — route it to a model that was trained on babble instead. If you must use Krisp or Maxine in a production STT pipeline, A/B the suppressed and raw streams against ground truth on your actual audio, because the gain is not universal.
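The A/B comparison itself is small. The sketch below scores raw and suppressed transcripts of the same clips against ground truth using corpus-level WER. It assumes you already have the three transcripts per clip. The only text normalization is lowercasing and whitespace splitting, which is a simplification: a real evaluation would also normalize punctuation and numerals. `ab_suppression` is a hypothetical name.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus-level WER: total edits divided by total reference words."""
    edits = words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        edits += edit_distance(ref, hypothesis.lower().split())
        words += len(ref)
    return edits / words


def ab_suppression(samples):
    """samples: iterable of (reference, raw_hypothesis, suppressed_hypothesis)."""
    samples = list(samples)
    raw = corpus_wer((ref, raw_hyp) for ref, raw_hyp, _ in samples)
    suppressed = corpus_wer((ref, sup_hyp) for ref, _, sup_hyp in samples)
    return {"raw_wer": raw, "suppressed_wer": suppressed,
            "suppression_helps": suppressed < raw}
```

Summing edits across the corpus before dividing keeps short clips from dominating the result, which matters when call-center utterances vary from two words to two hundred.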

A minimum viable noise-handling stack

For a team shipping in the next quarter, this is the opinionated stack:

An 80 Hz high-pass filter to kill DC offset and HVAC rumble. A soft AGC with a long attack (300 ms+) to normalize levels without pumping. RNNoise or a comparable ~10 ms suppressor for stationary noise only, bypassed when estimated SNR exceeds 20 dB. A noise-aware ASR (Whisper-large-v3, qwen3-asr-flash, or Deepgram Nova-3, depending on language and latency tolerance) chosen on noise-matrix data, not on LibriSpeech. For multi-mic hardware, delay-and-sum beamforming before the suppressor, not after. And a confidence threshold that falls back to a second provider when the primary returns low-confidence tokens on a low-SNR segment.
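The two decision points in that stack, the suppressor bypass and the provider fallback, can be sketched as plain routing logic. Everything provider-specific here is a stand-in: `suppress`, `primary`, and `fallback` are hypothetical callables you would wrap around your own suppressor and STT clients, and `Transcript` is an illustrative type. The 20 dB bypass comes from the stack above. The 10 dB low-SNR cutoff and the 0.6 confidence floor are assumed placeholders to tune on your own data.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Audio = Sequence[float]


@dataclass
class Transcript:
    text: str
    confidence: float  # assumed to be normalized to the range 0..1
    provider: str


def transcribe_segment(
    audio: Audio,
    snr_db: float,
    suppress: Callable[[Audio], Audio],
    primary: Callable[[Audio], Transcript],
    fallback: Callable[[Audio], Transcript],
    bypass_snr_db: float = 20.0,   # from the stack above
    low_snr_db: float = 10.0,      # assumption: what counts as a low-SNR segment
    min_confidence: float = 0.6,   # assumption: tune per provider
) -> Transcript:
    # Clean-enough audio skips the suppressor so it cannot introduce artifacts.
    cleaned = audio if snr_db > bypass_snr_db else suppress(audio)

    first = primary(cleaned)
    if first.confidence < min_confidence and snr_db < low_snr_db:
        second = fallback(cleaned)
        # Design choice in this sketch: keep whichever result is more confident.
        return second if second.confidence > first.confidence else first
    return first
```

Comparing confidence scores across providers is itself an assumption, since vendors calibrate them differently. Check the comparison against labeled audio before relying on it.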

What you do not need in v1: a custom-trained acoustic model, server-side GPU suppression, or a dereverberation network. You will want those in v2, and you will only know which one to build after you have measured WER across an SNR-stratified sample of your actual production audio.
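An SNR-stratified measurement can be as simple as bucketing clips by estimated SNR and computing WER per bucket. The sketch below assumes each clip already has an SNR estimate, an edit count against ground truth, and a reference word count. The band edges are illustrative choices loosely following the environments described earlier, not a standard.

```python
from bisect import bisect_right
from collections import defaultdict

# Assumed band edges in dB; adjust to the distribution of your own traffic.
BAND_EDGES_DB = [5.0, 10.0, 20.0]
BAND_LABELS = ["<5 dB", "5-10 dB", "10-20 dB", ">=20 dB"]


def band_for(snr_db):
    """Map an SNR estimate to its band label (lower edge inclusive)."""
    return BAND_LABELS[bisect_right(BAND_EDGES_DB, snr_db)]


def stratified_wer(clips):
    """clips: iterable of (snr_db, edit_count, reference_word_count).

    Returns corpus-level WER per SNR band, omitting bands with no data.
    """
    edits = defaultdict(int)
    words = defaultdict(int)
    for snr_db, edit_count, ref_words in clips:
        band = band_for(snr_db)
        edits[band] += edit_count
        words[band] += ref_words
    return {band: edits[band] / words[band]
            for band in BAND_LABELS if words[band] > 0}
```

A per-band breakdown shows where the errors concentrate, which is the information needed to decide which v2 component to build first.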

Speko’s noise-matrix benchmark covers thousands of clips across multiple languages and six noise conditions — stationary, non-stationary, babble, reverberant, mixed, and clean control — precisely because choosing a provider on a single clean WER number is how production voice AI ships broken.