Eight Seconds of Silence
AllysAI is a voice-based AI agent we built for customer support. The pitch was compelling: call in, describe your problem in natural language, get an accurate answer without navigating phone trees. The demo killed. Investors loved it. Early testers said the answers were "surprisingly good."
Then we launched the beta and watched the analytics in horror. Average call duration: 11 seconds. Not because users got their answers fast — because they hung up before getting any answer at all. The first response was taking 8.2 seconds.
Eight seconds doesn't sound catastrophic on paper. It's an eternity on a phone call. Try it right now: ask someone a question, then stare at them silently for eight seconds. You'll feel the awkwardness at three. By five, you'll assume they didn't hear you. By eight, you've already moved on.
The PM's response was predictable: "But the answers are better with GPT-4! Can't we just tell users to wait?" No. You cannot. That's not how humans work.
A 2-second mediocre answer beats a 12-second perfect answer every time. Users don't grade your AI on accuracy. They grade it on whether it felt like it was working.
Anatomy of 8.2 Seconds
Before you can fix latency, you have to know where it goes. I instrumented every stage of the AllysAI pipeline. Here's where those 8.2 seconds were hiding:
| Stage | What It Does | Time |
|---|---|---|
| Speech-to-Text (STT) | Whisper API — transcribe user audio | ~400ms |
| Intent Classification | Route to correct knowledge domain | ~120ms |
| RAG Retrieval | Vector search + reranking | ~150ms |
| LLM Generation | GPT-4 with retrieved context | ~6,200ms |
| Text-to-Speech (TTS) | Synthesize full response audio | ~400ms |
| Network / Overhead | API calls, serialization, buffering | ~930ms |
| Total | | ~8,200ms |
The villain is obvious: LLM generation at 6.2 seconds. But here's the trap — if you only focus on the biggest number, you'll miss the systemic fix. Every stage in this pipeline runs sequentially. The full response is generated before TTS even starts. We weren't just slow; we were architecturally incapable of being fast.
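The per-stage numbers above came from instrumentation along these lines. This is a minimal sketch, not the real AllysAI code: the stage names are illustrative and the sleeps stand in for the actual API calls.

```python
import asyncio
import time
from contextlib import asynccontextmanager

timings: dict[str, float] = {}

@asynccontextmanager
async def timed(stage: str):
    """Record the wall-clock duration of one pipeline stage, in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

async def pipeline():
    async with timed("stt"):
        await asyncio.sleep(0.01)      # stand-in for the Whisper call
    async with timed("llm_generation"):
        await asyncio.sleep(0.02)      # stand-in for GPT-4 generation

asyncio.run(pipeline())
# timings now holds per-stage durations in milliseconds
```

Wrapping every stage this way is tedious, but it is the only way to find the ~930ms of "overhead" hiding between the obvious stages.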
The Latency Budget: Think Like a CFO
I stole this concept from game development, where rendering engineers allocate a per-frame budget in milliseconds — 4ms for physics, 6ms for lighting, 2ms for UI, and so on. The total must hit 16.6ms for 60fps. No exceptions. No "but the shadows look better."
Same principle for AI products. Start with what the user will tolerate, then work backwards.
For a voice agent, research and our own testing pointed to 2 seconds as the threshold. Beyond 2 seconds, users start to feel the silence. Beyond 4, they assume something broke. We set our target at perceived latency under 2 seconds — meaning the user hears something within 2 seconds, even if the full response takes longer.
Here's the budget I drew up:
| Stage | Budget | Strategy |
|---|---|---|
| STT | 300ms | Stream audio, don't wait for silence detection |
| Intent + Retrieval | 200ms | Run in parallel, not sequential |
| LLM First Token | 800ms | Switch to GPT-4o-mini, stream output |
| TTS First Chunk | 300ms | Stream — start speaking the first sentence immediately |
| Overhead | 200ms | Connection pooling, edge deployment |
| Perceived Total | ~1,800ms | |
That's a 78% reduction in perceived latency. Not by making any single component dramatically faster, but by restructuring how the components interact.
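The "run in parallel, not sequential" row is the cheapest win in the table. A sketch of the idea, with stand-in functions whose sleeps mimic the ~120ms and ~150ms stage times measured above:

```python
import asyncio
import time

async def classify_intent(text: str) -> str:
    await asyncio.sleep(0.12)                  # intent model call (~120ms)
    return "billing"

async def retrieve_context(text: str) -> list[str]:
    await asyncio.sleep(0.15)                  # vector search + rerank (~150ms)
    return ["doc_1", "doc_2"]

async def sequential(text: str):
    intent = await classify_intent(text)       # 120ms, then...
    docs = await retrieve_context(text)        # ...another 150ms on top
    return intent, docs

async def parallel(text: str):
    # Both coroutines start immediately; wall time is max(120, 150)ms
    return await asyncio.gather(classify_intent(text), retrieve_context(text))

def timed_ms(coro) -> float:
    t0 = time.perf_counter()
    asyncio.run(coro)
    return (time.perf_counter() - t0) * 1000

seq_ms = timed_ms(sequential("why was I charged twice"))
par_ms = timed_ms(parallel("why was I charged twice"))
```

Sequential costs the sum of the two stages; parallel costs only the slower one, which is how intent plus retrieval fits inside a single 200ms budget line.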
Streaming Changes Everything
The single biggest win was converting the pipeline from batch to streaming. In the old architecture, each stage waited for the previous stage to fully complete. In the new architecture, we stream at every boundary:
- STT streams partial transcriptions. We start retrieval the moment we have enough tokens for intent classification — usually after the first 3-4 words.
- The LLM streams tokens. We buffer the first sentence (usually 12-20 tokens), then immediately send it to TTS.
- TTS streams audio chunks. The user hears the first sentence while the LLM is still generating sentence two.
The result: the user hears a response start in under 2 seconds. The full response might take 5-6 seconds, but it arrives as natural speech — pausing between sentences feels conversational, not broken.
```python
# Simplified streaming pipeline (pseudocode; stt, llm, tts, and
# retrieve_context are client stubs, not real libraries)

import asyncio

async def handle_voice_query(audio_stream):
    # Stage 1: stream STT, start retrieval early
    transcript_stream = stt.stream_transcribe(audio_stream)
    partial_text = ""
    context_task = None
    async for chunk in transcript_stream:
        partial_text += chunk.text
        if context_task is None and len(partial_text.split()) >= 4:
            # Fire retrieval on the partial transcript
            context_task = asyncio.create_task(
                retrieve_context(partial_text)
            )
    if context_task is None:
        # Utterance was shorter than 4 words; retrieve on the full text
        context_task = asyncio.create_task(retrieve_context(partial_text))
    context = await context_task

    # Stage 2: stream LLM tokens, flushing to TTS at sentence boundaries
    token_buffer = []
    async for token in llm.stream(partial_text, context=context):
        token_buffer.append(token)
        sentence = "".join(token_buffer)
        if sentence.rstrip().endswith((".", "!", "?")):
            # Stage 3: the user hears sentence one while the LLM
            # is still generating sentence two
            await tts.stream_speak(sentence)
            token_buffer = []

    # Flush any trailing tokens that didn't end on punctuation
    if token_buffer:
        await tts.stream_speak("".join(token_buffer))
```
The Model Downgrade Nobody Wanted
Switching from GPT-4 to GPT-4o-mini was the most contentious decision. The PM hated it. The demo answers were noticeably less polished. But here's what the data showed after two weeks:
- User satisfaction (post-call survey): up from 3.1 to 4.2 out of 5
- Call completion rate: up from 62% to 89%
- Answer accuracy (human eval): down from 91% to 84%
Read that again. Accuracy went down and satisfaction went up. Because users were actually sticking around to hear the answers. A 91% accurate answer that nobody hears has exactly 0% practical accuracy.
This is the core mistake I see PMs make with AI products: optimizing for capability in isolation instead of capability as experienced by the user. Your benchmark scores don't matter if the user closed the tab.
SIMO's Tighter Constraints
If AllysAI's 2-second budget felt tight, SIMO Avatar made it look generous. SIMO is a 3D avatar system — think digital human you can have a face-to-face conversation with. The avatar needs to start responding with lip movements, facial expressions, and voice simultaneously. Any delay and the uncanny valley gets worse, not better.
Our latency budget for SIMO was 800ms to first visible reaction. Not first word — first reaction. The avatar needed to show it was "thinking" (a subtle head tilt, eye movement) within 400ms, then start speaking within 800ms.
We solved this by pre-generating filler animations and transition states. The avatar starts a "listening acknowledgment" animation the instant the user stops speaking, buying us time while the LLM processes. It's a UX trick, not an engineering optimization, and it works beautifully. Perceived latency dropped to under a second even though actual generation time didn't change.
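The filler trick reduces to one structural rule: start the acknowledgment the moment the user stops speaking, and only then await the slow generation. A minimal sketch of that ordering; `play_animation` and `generate_reply` are hypothetical stand-ins, not SIMO's real API:

```python
import asyncio

events: list[str] = []

async def play_animation(name: str):
    events.append(f"anim:{name}")      # would drive the avatar rig

async def generate_reply(query: str) -> str:
    await asyncio.sleep(0.05)          # stand-in for LLM + TTS latency
    return "Here's what I found..."

async def on_user_finished(query: str) -> str:
    # Kick off the filler immediately; do NOT await the LLM first.
    filler = asyncio.create_task(play_animation("listening_ack"))
    reply = await generate_reply(query)
    await filler
    events.append("speak")
    return reply

asyncio.run(on_user_finished("cancel my subscription"))
```

The user sees the acknowledgment animation within one frame of finishing their sentence, while generation runs in the background; swapping the two awaits would reintroduce the dead silence.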
The lesson: latency optimization isn't just about making things faster. It's about making things feel faster. Animation, progressive loading, streaming, filler responses — these are legitimate engineering tools, not hacks.
The Framework
After going through this exercise on AllysAI, SIMO, and Sarathi's real-time transit routing, I settled on a repeatable process:
- Define perceived latency target. What's the user's tolerance? Voice: 2s. Chat: 3-5s. Background processing: minutes. Start here, not with your architecture.
- Instrument everything. You can't budget what you can't measure. Add timing to every stage, every API call, every serialization step. You'll be surprised where time hides.
- Draw the budget. Allocate milliseconds top-down from your target. Every stage gets a cap. If a stage can't hit its cap, you need a different approach for that stage — not a bigger overall budget.
- Identify the critical path. What's sequential that could be parallel? What's batch that could stream? The biggest wins are almost always architectural, not algorithmic.
- Separate perceived from actual. Streaming, progressive UI, filler states, and optimistic updates all reduce perceived latency without touching actual processing time.
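Steps 2 and 3 above compose naturally: once every stage is instrumented, the budget table can live in code and fail loudly in CI or monitoring when a stage blows its cap. A sketch, with the caps mirroring the budget table earlier in the post:

```python
BUDGET_MS = {
    "stt": 300,
    "intent_retrieval": 200,
    "llm_first_token": 800,
    "tts_first_chunk": 300,
    "overhead": 200,
}
PERCEIVED_TARGET_MS = 2000

def check_budget(measured_ms: dict[str, float]) -> list[str]:
    """Return the names of stages (or the total) that exceeded their cap."""
    over = [s for s, cap in BUDGET_MS.items() if measured_ms.get(s, 0) > cap]
    if sum(measured_ms.values()) > PERCEIVED_TARGET_MS:
        over.append("perceived_total")
    return over

# Example: LLM first token came in at 1.1s, over its 800ms cap
violations = check_budget({"stt": 280, "intent_retrieval": 180,
                           "llm_first_token": 1100, "tts_first_chunk": 250,
                           "overhead": 150})
```

Treating the budget as data rather than a slide makes regressions visible the day they land, not the day users start hanging up.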
Stop Arguing About Models
I've sat through too many meetings where the debate is "GPT-4 vs Claude vs Gemini" when the actual problem is a pipeline that serializes seven API calls. The model is one row in your latency budget. If you haven't drawn the budget, you're optimizing in the dark.
Draw the table. Instrument the pipeline. Set the target from the user's perspective, not yours. Then — and only then — decide if you need a faster model, a streaming architecture, or just a loading animation that doesn't make people feel abandoned.
Latency is a product decision, not an engineering constraint. Treat it like a budget. Spend it wisely.