The demo that lied to me
Klavy's AI agent was my proudest piece of engineering. Fourteen nodes in a LangGraph state machine, handling everything from French rental law Q&A over 3,000 Legifrance law chunks to document OCR with risk scoring. Intent classification fed into specialized subgraphs. The routing was clean. The tests were green. The demo made investors nod.
Then a real user typed: "Can you check my lease document and also tell me if the deposit amount is legal?"
Two intents. One message. The classifier routed the request to "legal Q&A." The legal node saw an attached document and bounced it to "document processing." Document processing extracted the deposit clause and routed it back to "legal Q&A" for validation. Legal Q&A saw the document reference and sent it back to document processing.
Infinite loop. The agent burned through tokens like a furnace. The user saw a spinning indicator for 45 seconds before the request timed out. And this wasn't an edge case — it was the third most common query pattern in our analytics.
I spent the next two weeks rearchitecting the entire agent graph. What I learned changed how I think about multi-agent systems entirely.
Lesson 1: Every agent needs a kill switch
This sounds obvious. It isn't. When you're designing agent flows, you're thinking about the happy path. You're thinking about how Node A elegantly hands off to Node B. You are not thinking about what happens when Node B hands back to Node A, which hands back to Node B, forever.
The fix is dead simple and should be the first thing you implement, before any business logic:
from typing import List, TypedDict

from langchain_core.messages import BaseMessage


class AgentState(TypedDict):
    messages: List[BaseMessage]
    intent: str
    iteration_count: int  # This saved my sanity
    max_iterations: int
    fallback_triggered: bool


def route_with_guardrail(state: AgentState) -> str:
    # Hard stop before any routing logic runs
    if state["iteration_count"] >= state["max_iterations"]:
        return "fallback_handler"
    state["iteration_count"] += 1
    return classify_intent(state)
I set max_iterations to 5 for Klavy. In practice, legitimate requests never need more than 3 hops. If you're hitting 5, something is wrong, and it's better to gracefully degrade than to let the agent thrash.
The fallback handler doesn't just say "sorry, I can't help." It captures the full state — the original query, the classification attempts, the nodes visited — and passes that to a simpler, single-purpose LLM call that at least gives the user a coherent response. Not perfect. But infinitely better than a timeout.
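A minimal sketch of such a handler, assuming an injected `call_llm` client and illustrative state keys (`query`, `nodes_visited`, `classification_attempts`) rather than Klavy's actual ones:

```python
from typing import Callable


def fallback_handler(state: dict, call_llm: Callable[[str], str]) -> str:
    """Degrade gracefully: summarize what the graph tried, then answer directly."""
    attempts = " -> ".join(state.get("nodes_visited", [])) or "none"
    # Package the accumulated routing context into one plain prompt
    prompt = (
        "The routing graph could not settle on an intent.\n"
        f"Original query: {state['query']}\n"
        f"Classification attempts: {state.get('classification_attempts', [])}\n"
        f"Nodes visited: {attempts}\n"
        "Answer the user's question directly and coherently."
    )
    # One simple, single-purpose LLM call instead of more routing
    return call_llm(prompt)
```

The point is that the degraded path carries the full failure context forward instead of discarding it.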
Lesson 2: Shared state is the enemy
Klavy v1 had a single global state dict that every node could read and write. The intent classifier would set state["current_intent"], the legal node would set state["legal_context"], the document node would set state["doc_results"]. Clean, right?
Wrong. The problem is that when you have 14 nodes all mutating a shared state object, you lose the ability to reason about what happened. When the document node sets state["needs_legal_review"] = True, and the legal node sees that flag and re-routes, who set it? When? Based on what data? Your observability is gone. Your debugging is a nightmare.
I ripped out the shared state and replaced it with explicit message passing between nodes:
# Instead of mutating shared state...
state["legal_context"] = result  # BAD: who reads this? when?

# ...pass explicit messages between nodes
return {
    "messages": [AIMessage(
        content=result,
        additional_kwargs={
            "source_node": "legal_qa",
            "confidence": 0.87,
            "requires_followup": False,
        },
    )],
    "next_node": "response_synthesizer",
}
Every node now receives only what it needs and declares where it's sending the result. The graph becomes auditable. You can trace exactly which node produced which piece of data, and why the routing decision was made.
Is it more verbose? Yes. Do I care? Not even a little. Debuggability at 3 AM when production is broken is worth more than elegant code.
Lesson 3: Fallback hierarchies matter more than primary paths
Here's a stat that surprised me: in Klavy's production traffic, 23% of user queries don't cleanly map to a single intent. Almost a quarter. These are the queries that break your carefully designed routing. "Check my documents and calculate if my rent is too high and also what are my rights if the landlord refuses the deposit return" — that's three intents jammed into one message.
My original design spent 90% of the effort on the primary routing path and 10% on error handling. That ratio should be inverted. Here's the fallback hierarchy I eventually built:
1. Primary classification — LLM-based intent routing with confidence scores
2. Multi-intent decomposition — if confidence is below 0.75, split the query into sub-queries and process each independently
3. Keyword fallback — regex-based intent matching as a safety net (fast, no API calls)
4. General assistant — single direct LLM call with full context, no routing at all
5. Human handoff — flag for support team with full conversation state
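As an illustration of the keyword safety net, a handful of compiled patterns is enough; the patterns and intent names below are examples, not Klavy's actual rules:

```python
import re

# Illustrative keyword fallback: fast regex matching, no API calls.
KEYWORD_PATTERNS = {
    "document_processing": re.compile(r"\b(lease|document|pdf|scan|contract)\b", re.I),
    "calculator": re.compile(r"\b(rent|deposit|amount|calculate)\b", re.I),
    "legal_qa": re.compile(r"\b(legal|law|rights|allowed|landlord)\b", re.I),
}


def keyword_fallback(query: str) -> str:
    """Return the first matching intent, or hand off to the general assistant."""
    for intent, pattern in KEYWORD_PATTERNS.items():
        if pattern.search(query):
            return intent
    return "general_assistant"
```

Pattern order doubles as a crude priority: the first match wins, so put the most specific intents first.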
Level 2 was the game-changer. When the classifier isn't confident, instead of picking the "best guess" and hoping, decompose the query:
async def decompose_multi_intent(query: str) -> str:
    """Split ambiguous queries into discrete sub-queries."""
    decomposition = await llm.ainvoke(
        f"Split this into independent questions: {query}"
    )
    sub_queries = parse_sub_queries(decomposition)

    # Process each sub-query through the graph independently
    results = await asyncio.gather(*[
        process_single_intent(sq) for sq in sub_queries
    ])

    # Synthesize into one coherent response
    return await synthesize_results(results, original_query=query)
This alone cut our fallback-to-human rate by 60%. Most "ambiguous" queries aren't ambiguous — they're just compound.
Lesson 4: You need observability or you're flying blind
I cannot stress this enough. Without observability, debugging a multi-agent system is like debugging a distributed microservices architecture with print() statements. Which is to say: technically possible, but you will lose your mind.
For Klavy, I built a tracing layer that captures every node transition:
from datetime import datetime, timezone


class AgentTracer:
    def __init__(self):
        self.spans = []

    def trace_node(self, node_name: str, state: AgentState):
        span = {
            "node": node_name,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "iteration": state["iteration_count"],
            "intent": state.get("intent", "unclassified"),
            "token_usage": self._get_token_count(),
            "confidence": state.get("confidence", 0),
            "parent_trace_id": state["trace_id"],
        }
        self.spans.append(span)

        # Alert if approaching iteration limit
        if state["iteration_count"] >= state["max_iterations"] - 1:
            self.emit_warning(
                f"Agent approaching max iterations on trace {state['trace_id']}"
            )
Every trace gets a unique ID. Every node hop is a span. I can pull up any user conversation and see: user sent message, classifier routed to legal_qa with 0.82 confidence, legal_qa processed in 1.2s with 340 tokens, response synthesizer combined results.
When something goes wrong — and it will — you can see exactly where the graph diverged from the expected path. No guessing. No reproducing. Just look at the trace.
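Turning the accumulated spans into that readable timeline is a few lines; the formatting below is illustrative, using the same span keys the tracer records:

```python
def summarize_trace(spans: list) -> str:
    """Render one trace as a one-line-per-hop timeline (sketch)."""
    lines = []
    for span in spans:
        lines.append(
            f"[iter {span['iteration']}] {span['node']} "
            f"intent={span['intent']} "
            f"confidence={span['confidence']} "
            f"tokens={span['token_usage']}"
        )
    return "\n".join(lines)
```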
The SIMO Avatar problem: when agents need to be real-time
Klavy was hard, but at least it was request-response. SIMO Avatar threw a different wrench into things: multiple AI modalities running concurrently. Vision processing, voice recognition, and NLP all operating in parallel on the same input stream.
The challenge isn't the individual agents. Each one works fine in isolation. The challenge is coordination. When a user speaks to the avatar while making a hand gesture, the voice agent and vision agent both fire simultaneously. Who gets priority? What happens when they produce conflicting interpretations?
The answer I landed on: don't coordinate at the agent level. Coordinate at the response level.
Each modality agent runs independently and publishes its results to a shared event bus with timestamps and confidence scores. A lightweight arbitrator — not an LLM, just deterministic logic — consumes these events and decides what to surface to the user based on a priority matrix:
- Voice + high confidence NLP intent = execute the voice command
- Vision + gesture recognized + no voice = execute the gesture
- Conflicting signals = ask for clarification (but make it natural, not robotic)
The key insight: the arbitrator is dumb on purpose. It doesn't try to "understand" anything. It just applies rules to structured outputs from smart agents. This separation of concerns is what keeps the system from collapsing.
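A deliberately dumb arbitrator can be a few lines of deterministic logic applied to the structured events on the bus. The event fields and the 0.8 threshold below are illustrative assumptions, not SIMO's actual values:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModalityEvent:
    modality: str      # e.g. "voice" or "vision"
    confidence: float  # reported by the modality agent
    payload: str       # recognized command or gesture
    timestamp: float   # event-bus timestamp


def arbitrate(voice: Optional[ModalityEvent],
              vision: Optional[ModalityEvent],
              threshold: float = 0.8) -> str:
    """Apply the priority matrix: no understanding, just rules."""
    # Voice with high-confidence intent wins outright
    if voice and voice.confidence >= threshold:
        return f"execute_voice:{voice.payload}"
    # Gesture alone, confidently recognized
    if vision and not voice and vision.confidence >= threshold:
        return f"execute_gesture:{vision.payload}"
    # Both fired but neither clearly wins: ask, don't guess
    if voice and vision:
        return "ask_clarification"
    return "ignore"
```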
The whiteboard test
If you can't draw your agent graph on a whiteboard in under 2 minutes, it's too complex.
I'm dead serious about this. Klavy v1 had 14 nodes, and I could barely explain the routing logic to a colleague in 10 minutes. That was a code smell I ignored because the tests passed. Don't make that mistake.
Klavy v2 still has 14 nodes, but the routing is simpler. Every node has exactly one primary output and one fallback output. The graph is a DAG with clearly marked fallback edges. I can draw it in 90 seconds.
Here's the actual structure:
# Klavy v2 Agent Graph (simplified)

User Input
    |
Intent Classifier (confidence threshold: 0.75)
    |--- high confidence --> Route to specialized node
    |       |--- legal_qa --> Legal RAG Pipeline --> Response Synthesizer
    |       |--- document_processing --> OCR + Extraction --> Response Synthesizer
    |       |--- calculator --> Rent/Deposit Calculator --> Response Synthesizer
    |       |--- search --> Listing Search --> Response Synthesizer
    |
    |--- low confidence --> Multi-Intent Decomposer
    |       |--- split into sub-queries --> Process each independently --> Merge
    |
    |--- failure at any point --> General Fallback (single LLM call)
    |
Response Synthesizer --> User
Every path terminates. Every node has a timeout. Every transition is logged. Boring? Maybe. Reliable? Extremely.
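The "every node has a timeout" rule can be enforced with one wrapper around each node coroutine; this is a sketch using asyncio.wait_for, with illustrative names:

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_node_with_timeout(
    node: Callable[[dict], Awaitable[dict]],
    state: dict,
    timeout_s: float = 10.0,
) -> dict:
    """Run one node; on expiry, route to the fallback instead of hanging."""
    try:
        return await asyncio.wait_for(node(state), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Never let one slow node stall the whole graph
        return {**state, "fallback_triggered": True, "next_node": "fallback_handler"}
```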
The real cost of complexity
Let me give you some numbers from Klavy's production metrics. Before the rearchitecture:
- Average tokens per conversation: 4,200
- Timeout rate: 8.3%
- Fallback-to-human rate: 19%
- Average response time: 6.2 seconds
After implementing kill switches, message passing, fallback hierarchies, and observability:
- Average tokens per conversation: 2,100 (50% reduction)
- Timeout rate: 0.4%
- Fallback-to-human rate: 7%
- Average response time: 3.1 seconds
The simpler architecture isn't just more reliable. It's cheaper. It's faster. It's easier to maintain. There is no tradeoff here.
What I'd tell you before you build your first multi-agent system
Start with one agent. Seriously. A single LLM call with good prompt engineering and structured output will get you further than you think. I've seen teams build elaborate multi-agent systems for problems that a well-crafted system prompt could solve.
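What "one LLM call with structured output" can look like, as a sketch: `call_llm` stands in for whatever client you use, and the JSON schema is an illustrative contract, not a prescription:

```python
import json
from typing import Callable

SYSTEM_PROMPT = (
    "Answer the user's housing question. Respond with JSON: "
    '{"answer": str, "confidence": float, "needs_human": bool}'
)


def single_agent(query: str, call_llm: Callable[[str, str], str]) -> dict:
    """One call, one structured answer, no graph at all."""
    raw = call_llm(SYSTEM_PROMPT, query)
    result = json.loads(raw)
    # Escalate rather than guess when the model flags low confidence
    if result["confidence"] < 0.5:
        result["needs_human"] = True
    return result
```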
If you genuinely need multiple agents — because you have distinct modalities like SIMO, or complex domain routing like Klavy — then build them incrementally. Start with two nodes: classifier and handler. Add nodes only when you can point to a specific failure mode that the current architecture can't handle.
And every time you add a node, ask yourself: "Can I draw this on a whiteboard in under 2 minutes?"
If the answer is no, you've gone too far. Simplify. Your 3 AM self will thank you.