The Question That Broke Everything

Here's the query, verbatim from our logs: "Can my landlord raise rent if the building was built before 1948 and I've been there 5 years?"

Simple enough, right? A human paralegal would knock this out in minutes. But our vanilla RAG pipeline — the one I'd painstakingly optimized with top-k retrieval against our Pinecone index of Nepal's housing legislation — returned a confidently wrong paragraph about commercial lease termination clauses. Not even close.

I stared at the retrieved chunks. Individually, they made sense: one about rent stabilization, one about building age classifications, one about tenant tenure protections. But the system had no idea how to compose them. It grabbed the chunk with the highest cosine similarity to the overall query embedding and called it a day. That's the dirty secret nobody tells you in those "Build RAG in 20 Lines of Python" tutorials.

Most RAG tutorials are dangerously oversimplified. They work great for single-hop factual lookups. The moment a user asks anything compositional, the whole thing collapses.

Why Single-Shot Retrieval Falls Apart

Let's dissect what the user actually needed. Their question has three distinct information requirements:

  1. What are the rent increase rules for buildings constructed before 1948?
  2. What protections exist for tenants with 5+ years of tenure?
  3. How do these two conditions interact — does one override the other?

Vanilla RAG treats the entire query as a single embedding vector and retrieves the top-k nearest chunks. But the embedding of the full question doesn't necessarily land near any of the three specific answers. It lands in some averaged-out region of the vector space that's kinda-sorta related to all three topics but nails none of them.
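You can see the averaging problem in a toy example. Suppose the three topics sit in roughly orthogonal directions of the embedding space; the compound question's embedding splits the difference and sits far from all of them (illustrative vectors, not real embeddings):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Three "topic" directions: rent rules, building age, tenant tenure
rent, age, tenure = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

# The full compound question embeds near the average of all three
full_query = [(r + a + t) / 3 for r, a, t in zip(rent, age, tenure)]

print(round(cosine(full_query, rent), 3))  # 0.577: close to nothing
print(round(cosine(rent, rent), 3))        # 1.0: a focused sub-question nails its topic
```

Real embedding spaces are messier than orthogonal axes, but the failure mode is the same: the compound query is moderately similar to everything and maximally similar to nothing.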

I see this pattern constantly. The harder the question, the worse single-shot retrieval performs — and hard questions are exactly the ones your users actually care about. Nobody's paying for an AI legal assistant to answer "what is a lease."

Enter the Planning Agent

The fix wasn't more chunks, better embeddings, or fancier reranking. It was fundamentally rethinking the retrieval architecture. Instead of retrieve-then-generate, we needed plan-retrieve-synthesize.

I built a query decomposition agent that sits between the user's question and the retrieval layer. Before touching the vector store, it breaks the query into atomic sub-questions, retrieves evidence for each independently, then synthesizes a unified answer. Here's the core logic:

from openai import OpenAI
from pydantic import BaseModel

class SubQuery(BaseModel):
    question: str
    reasoning: str
    depends_on: list[int] = []

class QueryPlan(BaseModel):
    sub_queries: list[SubQuery]
    synthesis_strategy: str

def decompose_query(user_query: str) -> QueryPlan:
    """Break a complex query into atomic retrieval steps."""
    client = OpenAI()

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        response_format=QueryPlan,
        messages=[
            {"role": "system", "content": (
                "You are a query planner for a legal knowledge base. "
                "Decompose the user's question into independent sub-questions "
                "that can each be answered by a single document chunk. "
                "Mark dependencies between sub-queries. "
                "Specify how to synthesize the final answer."
            )},
            {"role": "user", "content": user_query}
        ]
    )
    return response.choices[0].message.parsed


async def agentic_retrieve(user_query: str, retriever) -> str:
    plan = decompose_query(user_query)

    evidence = {}
    for i, sq in enumerate(plan.sub_queries):
        # Fold in answers from dependencies (the loop runs in plan
        # order, so any declared dependency is already in `evidence`)
        dep_context = "\n".join(evidence[d] for d in sq.depends_on if d in evidence)

        # Retrieve with sub-question (not the original query!)
        enriched_query = f"{sq.question}\nContext: {dep_context}" if dep_context else sq.question
        chunks = await retriever.search(enriched_query, top_k=3)
        evidence[i] = "\n".join(c.text for c in chunks)

    # Synthesize with all evidence
    return synthesize(user_query, evidence, plan.synthesis_strategy)
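The synthesize call above is left undefined. Here's a minimal sketch of what it might look like, with the LLM call injected as a plain callable so the prompt assembly stands alone (the prompt wording and the llm_call parameter are my placeholders, not Klavy's actual synthesis step):

```python
from typing import Callable

def synthesize(user_query: str, evidence: dict[int, str],
               strategy: str, llm_call: Callable[[str], str]) -> str:
    """Stuff all retrieved evidence into one prompt and let the model
    follow the plan's synthesis_strategy. `llm_call` is any callable
    mapping a prompt to a completion (e.g. a thin OpenAI wrapper)."""
    evidence_block = "\n\n".join(
        f"[Evidence {i}]\n{text}" for i, text in sorted(evidence.items()))
    prompt = (
        f"Question: {user_query}\n\n"
        f"Synthesis strategy: {strategy}\n\n"
        f"{evidence_block}\n\n"
        "Answer using only the evidence above, noting which evidence "
        "block supports each claim.")
    return llm_call(prompt)
```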

The key insight: each sub-question gets its own retrieval pass. The embedding for "rent increase rules for pre-1948 buildings" lands much closer to the right chunks than the embedding of the original monster question. It's almost embarrassingly obvious in hindsight.

The Dependency Graph Matters More Than You Think

See that depends_on field? That's the part I almost skipped — and it would've been a huge mistake. Some sub-questions can be answered independently. Others need the answer from a prior step to even make sense.

For our landlord question, the decomposition looked like this:

  1. Sub-query 1: "What rent regulations apply to buildings built before 1948?" → Independent
  2. Sub-query 2: "What tenant protections exist for 5+ year tenancies?" → Independent
  3. Sub-query 3: "When both pre-1948 building status and 5+ year tenure apply, which regulation takes precedence?" → Depends on 1 and 2

Sub-queries 1 and 2 run in parallel (speed win). Sub-query 3 uses their answers as additional context for retrieval. The agent isn't just decomposing — it's building a retrieval DAG.
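The loop in agentic_retrieve above runs sub-queries strictly one at a time, so it leaves that parallelism on the table. A sketch of a DAG-aware executor, assuming sub-queries are plain dicts mirroring the SubQuery model and retrieve is any async callable that answers one question:

```python
import asyncio

async def run_plan(sub_queries: list[dict], retrieve) -> dict[int, str]:
    """Execute the retrieval DAG in waves: every sub-query whose
    dependencies are already answered runs concurrently."""
    evidence: dict[int, str] = {}
    remaining = dict(enumerate(sub_queries))
    while remaining:
        # A sub-query is ready once all its dependencies are answered
        ready = [i for i, sq in remaining.items()
                 if all(d in evidence for d in sq["depends_on"])]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies in plan")
        answers = await asyncio.gather(
            *(retrieve(remaining[i]["question"]) for i in ready))
        for i, answer in zip(ready, answers):
            evidence[i] = answer
            remaining.pop(i)
    return evidence
```

For the landlord plan, wave one fires sub-queries 1 and 2 together; wave two runs sub-query 3 with both answers in hand.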

With Klavy's setup — Pinecone for the primary vector index, plus pgvector on our Postgres instance for metadata-heavy queries — this architecture lets us route sub-queries to the right store. Building-age queries hit the structured metadata in pgvector. Tenant rights queries go to the dense vector index. The agent decides.
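The routing decision itself can start as a cheap heuristic over the sub-question text. A hypothetical sketch (the patterns and store names are illustrative, not Klavy's actual router):

```python
import re

# Illustrative signals that a sub-question filters on structured
# metadata (years, durations) rather than open-ended legal language
_STRUCTURED = [
    re.compile(r"\b(before|after|since)\s+\d{4}\b", re.IGNORECASE),
    re.compile(r"\b\d+\+?\s*(year|month)s?\b", re.IGNORECASE),
]

def route_sub_query(question: str) -> str:
    """Send metadata-style filters to pgvector, everything else to
    the dense Pinecone index."""
    if any(p.search(question) for p in _STRUCTURED):
        return "pgvector"
    return "pinecone"
```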

The Results Were Not Subtle

We ran a before/after evaluation on 200 multi-hop legal questions. I'm not going to pretend we had a perfect benchmark — we had a paralegal grade a random sample on a 1-5 accuracy scale. Messy but honest.

The latency increase is real and I won't sugarcoat it. You're making 3-4 retrieval calls instead of 1, plus an LLM call for decomposition. For Klavy, where users expect thorough answers to serious legal questions, 3.8 seconds is fine. For a real-time voice agent like AllysAI or SIMO Avatar? You'd need a different approach. Context matters. I'll get into latency budgets in another post.

When You Don't Need This

I want to be clear: I'm not saying every RAG system needs an agent layer. That would be its own kind of over-engineering. Here's my framework for deciding:

Vanilla RAG is fine when:

  1. Queries are single-hop factual lookups, where one chunk answers the whole question.
  2. Your latency budget is tight, as with a real-time voice agent.
  3. Users mostly ask definitional questions like "what is a lease."

You need agentic retrieval when:

  1. Questions are compositional, stacking multiple conditions in one ask.
  2. The answer depends on how separate rules interact or which takes precedence.
  3. Users are paying you for the hard questions, and a few extra seconds of latency is acceptable.

The Bigger Pattern

What I've described here is a specific instance of a broader shift I keep seeing: the move from monolithic LLM calls to orchestrated multi-step workflows. The same principle applies to code generation (plan the approach before writing), data analysis (decompose the question before querying), and content creation (outline before drafting).

Working on SALAMA's safety monitoring system reinforced this for me. We needed the AI to assess construction site risks — "Is this scaffold safe given the current wind conditions and the load being carried?" Same pattern. Single-shot analysis missed interactions between risk factors. A planning step that decomposed the assessment into independent checks, then synthesized a holistic risk score, performed dramatically better.

The LLM isn't the bottleneck anymore. The retrieval architecture is. And the fix isn't a better model — it's a smarter system around the model.

What I'd Do Differently

If I were rebuilding Klavy's retrieval layer from scratch today, three things I'd change:

First, I'd add a complexity classifier before the planning agent. Not every query needs decomposition. A lightweight model (or even a rule-based check for conjunctions and conditional clauses) can route simple questions directly to vanilla retrieval and save the agent overhead for queries that actually need it.
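That rule-based check can be a couple of regexes. A minimal sketch, counting conjunctions and conditional markers (the marker list is illustrative, not exhaustive, and the threshold would need tuning on real traffic):

```python
import re

# Words that signal compositional structure: stacked conditions,
# conditionals, comparisons
_CLAUSE_MARKERS = re.compile(
    r"\b(and|or|if|given|when|while|whereas|versus|compared)\b",
    re.IGNORECASE)

def needs_decomposition(query: str) -> bool:
    """Route to the planning agent only when the question looks
    compositional; single-clause questions go straight to vanilla RAG."""
    return len(_CLAUSE_MARKERS.findall(query)) >= 2
```

The landlord question trips on "if" and "and"; "what is a lease" skips the agent entirely.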

Second, I'd cache decomposition plans. Legal questions cluster. If ten users ask variations of "can my landlord do X given Y," the decomposition structure is nearly identical. Cache the plan template, swap in the specifics.
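A sketch of that plan cache, assuming that blanking out numbers is enough to make variants share a key (real normalization would also need to handle entity synonyms):

```python
import re

_plan_cache: dict[str, object] = {}

def plan_cache_key(query: str) -> str:
    """Lowercase, collapse whitespace, and replace numbers so
    'built before 1948' and 'built before 1960' map to one template."""
    key = re.sub(r"\d+", "<NUM>", query.lower())
    return re.sub(r"\s+", " ", key).strip()

def cached_decompose(query: str, decompose):
    """Reuse a decomposition plan across structurally identical
    queries. `decompose` is the planner call (or any stand-in)."""
    key = plan_cache_key(query)
    if key not in _plan_cache:
        _plan_cache[key] = decompose(query)
    return _plan_cache[key]
```

Note the cached plan still needs the specifics swapped back in before retrieval; this only skips the planning LLM call, not the retrieval passes.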

Third, I'd invest more in the synthesis step. Right now it's a single LLM call with all the evidence stuffed in. For truly complex questions with conflicting evidence across sub-queries, you need the synthesizer to explicitly reason about conflicts and cite which sources support which conclusions. That's the difference between "useful" and "trustworthy."

The gap between a demo and a product is the gap between single-shot retrieval and a system that actually thinks about what it needs to know before it starts looking.

Stop optimizing chunk sizes. Start building retrieval agents.