A developer's framework for picking an STT provider
Six axes that decide whether your product ships — accuracy, latency, language coverage, cost, API ergonomics, vocabulary tolerance — with the tolerance thresholds we use to route traffic.
Deepgram publishes Nova-3 at 5.26% batch WER and 6.84% streaming WER. The same model scores 18.3% WER on Artificial Analysis’s third-party benchmark. Same model. Same week. Same English. The gap is not a lie — it is a reminder that any single number you read about an STT provider was produced by someone who had a reason to produce it.
That is why you do not pick an STT provider from a leaderboard. You pick one by scoring the providers you can realistically deploy against the six axes that actually determine whether your product ships. This post is the framework for doing that, in priority order, with the tolerance thresholds we use internally at Speko to route traffic across providers.
Axis 1: Accuracy on audio that looks like yours
Not generic accuracy. Your audio. A model tuned on broadcast news rarely wins on telephony, and an engine claiming 95% accuracy on podcasts can drop to 70% on call-center audio. You need a 30-minute eval set that matches your codec (8 kHz µ-law vs 16 kHz PCM), your noise floor, your accents, and your domain vocabulary.
Score the top three candidates on that set. If the spread between them is less than 2pp absolute WER, accuracy is not the deciding axis — move on. If it is 5pp or more, you have a winner and the other axes become tiebreakers.
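A minimal sketch of that decision rule, assuming WER is computed as word-level edit distance over lowercased, whitespace-split text (real evals usually add a text normalizer for numbers and punctuation). The function names and the label for the 2 to 5pp band are illustrative assumptions, not part of any provider SDK.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_verdict(wer_pct_by_provider: dict) -> str:
    """Apply the 2pp / 5pp rule to WERs given in percent (e.g. 8.4).

    Spread is taken as worst minus best across the candidates,
    which is an assumption of this sketch.
    """
    values = list(wer_pct_by_provider.values())
    spread_pp = round(max(values) - min(values), 6)
    if spread_pp < 2:
        return "not_deciding"
    if spread_pp >= 5:
        return "clear_winner"
    # Assumed label for the band between the two thresholds.
    return "weigh_with_other_axes"
```

Usage: compute `word_error_rate` per utterance pair on your eval set, aggregate per provider, then pass the percentages to `accuracy_verdict`, for example `accuracy_verdict({"a": 8.0, "b": 9.0, "c": 9.5})`.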
Axis 2: Latency, but the right kind
Most developers optimize the wrong latency. For a voice agent, what matters is the time from end-of-speech to final-word-emitted, not the time from first-byte-sent to first-partial. AssemblyAI’s Universal-Streaming emits words at ~300ms P50 and ~1,012ms P99; Deepgram Nova-3 comes in at 516ms P50 and 1,907ms P99 on the same measurement. A 500ms P50 feels instant in a meeting-notes app and catastrophic in a barge-in voice agent.
Tolerance guidance: voice agents need sub-400ms P50 and sub-1,200ms P99. Live captioning tolerates 800ms P50. Batch transcription does not care — stop measuring it.
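To make those tolerances checkable, here is a rough sketch that computes P50 and P99 from measured end-of-speech-to-final-word latencies and compares them with the limits above. The use-case keys, the nearest-rank percentile method, and the function names are assumptions of this example.

```python
import math

# Limits in milliseconds, taken from the guidance above.
# None means the guidance gives no number for that percentile.
TOLERANCES_MS = {
    "voice_agent": {"p50": 400, "p99": 1200},
    "live_captioning": {"p50": 800, "p99": None},
}


def percentile(samples_ms, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_tolerance(samples_ms, use_case):
    """True if measured P50 and P99 are both under the use-case limits.

    Uses strict less-than to match "sub-400ms"; applying the same
    comparison to the captioning limit is a simplification.
    """
    limits = TOLERANCES_MS[use_case]
    p50_ok = percentile(samples_ms, 50) < limits["p50"]
    p99_ok = limits["p99"] is None or percentile(samples_ms, 99) < limits["p99"]
    return p50_ok and p99_ok
```

Collect a few hundred samples per provider at minimum; a P99 computed from a few dozen utterances is mostly noise.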
Axis 3: Language coverage that is real, not claimed
“Supports 100+ languages” is a marketing sentence. What you need to know is: which of your target languages are first-class (trained on, evaluated on, supported in streaming) versus listed (Whisper-finetune behind a flag, English-only streaming, no diarization). Alibaba’s qwen3-asr-flash beats every Western provider on six Asian languages in our testing. Google leads on low-resource African languages. Speechmatics has the best European-accent coverage. None of them are interchangeable.
Axis 4: Cost at your scale
Per-minute pricing looks simple until you hit volume. OpenAI’s gpt-4o-transcribe is $0.006/min. Deepgram Nova-3 streaming is $0.0077/min, batch $0.0043/min. Gemini 2.0 Flash Lite is $0.19 per 1,000 minutes ($0.00019/min), more than an order of magnitude cheaper.
But the headline rate is rarely the real rate. Amazon Transcribe bills a 60-second minimum per request. Azure rounds up to the hour. OpenAI’s Realtime API is a separate product at roughly $0.06/min audio in, $0.24/min audio out. At 100k hours/month, a $0.002/min delta is $12k/month — enough to fund a second provider as a fallback.
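The arithmetic is worth scripting against your own request-length distribution. The sketch below is illustrative only: `min_billable_s` and `increment_s` are hypothetical parameters standing in for whatever minimum charge and rounding increment a provider's current pricing page states, and the function names are made up for this example.

```python
import math


def billed_minutes(durations_s, min_billable_s=0.0, increment_s=1.0):
    """Total billed minutes after a per-request minimum and rounding.

    durations_s: audio length of each request, in seconds.
    min_billable_s: assumed per-request minimum charge, in seconds.
    increment_s: assumed rounding increment, in seconds.
    """
    total_s = 0.0
    for duration in durations_s:
        billable = max(duration, min_billable_s)
        total_s += math.ceil(billable / increment_s) * increment_s
    return total_s / 60


def effective_rate_per_min(durations_s, list_rate_per_min, **billing):
    """Real cost per minute of audio actually sent."""
    audio_minutes = sum(durations_s) / 60
    cost = billed_minutes(durations_s, **billing) * list_rate_per_min
    return cost / audio_minutes


def monthly_delta(hours_per_month, rate_delta_per_min):
    """Monthly cost difference for a given per-minute price gap."""
    return hours_per_month * 60 * rate_delta_per_min
```

With three requests of 10, 20, and 90 seconds, a 60-second minimum bills 3.5 minutes for 2 minutes of audio, so the effective rate is 1.75 times the list rate. `monthly_delta(100_000, 0.002)` reproduces the $12k/month figure above.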
Axis 5: API ergonomics and streaming stability
This is the axis every engineer under-weights during procurement and over-weights at 3am on launch night. What to actually test: does the WebSocket reconnect cleanly when your client network blips? Does it surface end-of-utterance as an event or make you infer it from silence? Does it give you word-level timestamps, or approximate them from segments? Can you stream 16kHz linear16 without a re-encoding step?
Run a 24-hour soak test on a single connection with injected network jitter before you sign. The provider that looks prettiest in the docs is rarely the one whose client lib handles a dropped TLS frame without dropping the transcript.
Axis 6: Domain vocabulary tolerance
Generic STT will mis-transcribe “Clindamycin” as “Clyde Myosin” and “OAuth” as “oh off.” Deepgram’s keyterm prompting claims up to 90% keyword recall with up to 100 terms fed at inference. AssemblyAI offers word boost with numeric weights. Whisper accepts only a free-text prompt as a soft hint, with no weighted term list, so reliable domain terms usually mean fine-tuning.
If your product has a bounded vocabulary — drug names, legal terms, product SKUs, command words — this axis moves from nice-to-have to table stakes. Test with the hardest 50 terms in your domain. If recall is below 80%, disqualify the provider regardless of its WER.
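One way to score that test, sketched under a simple assumption: recall is counted per (term, utterance) pair, where a term that appears in the reference transcript counts as a hit if it also appears in the provider's output for the same utterance. The normalization and function names are illustrative choices, not a standard metric definition.

```python
import re


def _normalize(text):
    """Lowercase and collapse anything non-alphanumeric to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def keyword_recall(pairs, terms):
    """Fraction of reference term occurrences the provider got right.

    pairs: list of (reference_transcript, hypothesis_transcript).
    terms: the domain terms under test, e.g. your hardest 50.
    """
    norm_terms = [_normalize(term) for term in terms]
    hits = total = 0
    for reference, hypothesis in pairs:
        # Pad with spaces so terms match whole words only.
        ref = f" {_normalize(reference)} "
        hyp = f" {_normalize(hypothesis)} "
        for term in norm_terms:
            if f" {term} " in ref:
                total += 1
                hits += f" {term} " in hyp
    if total == 0:
        raise ValueError("no term occurs in any reference transcript")
    return hits / total


def passes_vocab_gate(recall, threshold=0.80):
    """Disqualify below 80% recall, per the rule above."""
    return recall >= threshold
```

Run it per provider with and without that provider's vocabulary feature enabled; the difference tells you how much the feature is actually buying you.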
The weighting matrix
There is no universal weight vector. The honest framework is: weight each axis by how much of your product’s value proposition depends on it.
| Use case | Accuracy | Latency | Languages | Cost | API | Vocab |
|---|---|---|---|---|---|---|
| Voice agent | 25% | 30% | 10% | 10% | 15% | 10% |
| Meeting notes | 35% | 5% | 15% | 15% | 10% | 20% |
| Live captioning | 20% | 25% | 25% | 10% | 15% | 5% |
| Call-center QA | 30% | 5% | 10% | 25% | 10% | 20% |
| Medical/legal batch | 30% | 0% | 5% | 15% | 10% | 40% |
These are starting points, not laws. The exercise of arguing about them forces you to name the thing your product actually sells.
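The matrix turns into a ranking with a few lines of code. In this sketch the weights are copied from the table, while the 0 to 10 per-axis scoring scale, the dictionary keys, and the function names are assumptions; how you map a WER or a P50 onto that scale is your call.

```python
AXES = ("accuracy", "latency", "languages", "cost", "api", "vocab")

# Percent weights per use case, in AXES order, from the table above.
WEIGHTS = {
    "voice_agent":         (25, 30, 10, 10, 15, 10),
    "meeting_notes":       (35, 5, 15, 15, 10, 20),
    "live_captioning":     (20, 25, 25, 10, 15, 5),
    "call_center_qa":      (30, 5, 10, 25, 10, 20),
    "medical_legal_batch": (30, 0, 5, 15, 10, 40),
}


def weighted_score(axis_scores, use_case):
    """Weighted average of per-axis scores (assumed 0-10 scale)."""
    weights = WEIGHTS[use_case]
    if sum(weights) != 100:
        raise ValueError(f"weights for {use_case} must sum to 100")
    missing = set(AXES) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    return sum(w * axis_scores[axis] for w, axis in zip(weights, AXES)) / 100


def rank_providers(scores_by_provider, use_case):
    """Provider names sorted best-first for the given use case."""
    return sorted(
        scores_by_provider,
        key=lambda name: weighted_score(scores_by_provider[name], use_case),
        reverse=True,
    )
```

The same two providers can swap places between `voice_agent` and `medical_legal_batch`, which is the point of arguing about the weights before you look at the scores.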
What to do with this
Pick two providers. Run your own eval. Measure all six axes on your audio. Expect to be surprised — we regularly are. The provider that wins on paper loses on your telephony codec; the one you wrote off on cost has the streaming API that actually reconnects.
We publish per-language, per-axis benchmark data — real audio, reproducible methodology, no single “Speko score.” Use it as a starting set, then run your own. That is the entire point.