The $2,400 Lesson
When we started building Sarathi, our voice agent for underserved communities in Northeast India, we hit a wall immediately. The agent needed to handle Assamese, Bodo, and Hindi — sometimes in the same sentence. Code-switching is incredibly common in that region, and no off-the-shelf model handled it gracefully.
My first instinct? Fine-tune. Obviously. We had transcribed audio data from community health workers. We had the budget (barely). The team was excited about training "our own model." It felt like the real engineering move.
So we spun up a fine-tuning job on GPT-3.5 Turbo with about 800 curated examples of Assamese-to-English translation and intent classification. Three training runs, hyperparameter sweeps, validation set curation — the whole ritual. Cost: roughly $2,400 in compute and two weeks of engineering time.
The model was... fine. 73% accuracy on intent classification. But here's what killed us: every time we added a new intent or changed the response format, we had to retrain. The model was frozen in time, and our product was not.
Meanwhile, I'd been prototyping on the side with Claude. A carefully structured system prompt with 8 few-shot examples of Assamese code-switching hit 81% accuracy on the same test set. No training. No data pipeline. Just a well-written prompt.
That was my wake-up call. Fine-tuning isn't always the answer. But prompting isn't always the answer either. The real skill is knowing which tool fits which problem.
The Framework: Four Questions Before You Fine-Tune
After Sarathi, I started applying a simple decision framework to every AI feature we build. Four questions, in order. If any of the first three fails to make a clear case for fine-tuning, prompting is probably your move.
1. Do You Have 1,000+ Quality Examples?
Not 1,000 examples. 1,000 quality examples. There's a massive difference.
For Sarathi's speech understanding, we had about 800 transcribed examples. That sounds like a lot until you break it down: 200 were near-duplicates from the same health worker, 150 had inconsistent labeling, and maybe 50 were outright wrong. Our effective dataset was closer to 400 clean examples. That's not enough.
Quality means: consistent formatting, verified labels, representative of your actual distribution, and diverse enough to generalize. If you have to spend more time cleaning data than training, you don't have enough data — you have a data problem.
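To make that concrete, here's a rough sketch of the kind of audit that surfaces those numbers (hypothetical data and helper names; standard library only). It flags exact-text label conflicts and near-duplicates before anything gets counted as a "quality example":

```python
from difflib import SequenceMatcher

def audit_examples(examples, near_dup_threshold=0.9):
    """Rough data-quality audit: split examples into clean,
    near-duplicate, and label-conflicting buckets."""
    clean, dupes, conflicts = [], [], []
    label_by_text = {}
    for ex in examples:
        text, label = ex["text"], ex["label"]
        # Same text, different label: an annotation conflict.
        if text in label_by_text and label_by_text[text] != label:
            conflicts.append(ex)
            continue
        # Fuzzy match against everything we've already kept.
        if any(SequenceMatcher(None, text, kept["text"]).ratio() > near_dup_threshold
               for kept in clean):
            dupes.append(ex)
            continue
        label_by_text[text] = label
        clean.append(ex)
    return clean, dupes, conflicts

# Toy examples (invented for illustration):
examples = [
    {"text": "book a checkup for my child", "label": "schedule_visit"},
    {"text": "book a checkup for my child.", "label": "schedule_visit"},   # near-dup
    {"text": "book a checkup for my child", "label": "medication_query"},  # conflict
    {"text": "what dose of ORS for a toddler", "label": "medication_query"},
]
clean, dupes, conflicts = audit_examples(examples)
```

Run this on your "800 examples" and the effective count drops fast. A real pipeline would also check label balance and formatting consistency, but even this crude pass is sobering.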
When we built Klavy, our legal document assistant for French law, we initially thought fine-tuning was mandatory. French legal language is arcane, domain-specific, and full of terms that trip up general models. But curating a fine-tuning dataset from confidential legal documents? Nightmare. Privacy concerns, annotation costs, and the simple fact that legal language evolves with every new ruling.
Instead, we built a RAG pipeline with carefully engineered retrieval prompts and few-shot examples. The system prompt looked something like this:
```
You are a legal research assistant specializing in French law.
When answering, always:
1. Cite the specific article or arrêt
2. Distinguish between settled law and evolving jurisprudence
3. Flag if a provision has been modified by recent reform

Example query: "Quelle est la responsabilité du vendeur en cas de vice caché?"
Example response: "Selon l'article 1641 du Code civil, le vendeur est tenu
de la garantie à raison des défauts cachés de la chose vendue...

Note: La réforme du droit des obligations (Ordonnance n° 2016-131)
n'a pas modifié les dispositions relatives aux vices cachés,
mais la jurisprudence récente de la Cour de cassation
(Civ. 3e, 14 février 2024) a précisé..."
```
This prompt, combined with a retrieval layer pulling from a curated legal database, outperformed our fine-tuning experiments by 12 percentage points on answer accuracy. And it cost us nothing to update when new rulings came in.
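The retrieval layer itself is not exotic. Here's a deliberately minimal sketch of the assembly step (hypothetical names and toy keyword scoring; the real Klavy system uses an embedding index over a curated legal database):

```python
# Toy in-memory "database" of legal provisions (illustrative content only).
LEGAL_DB = [
    {"ref": "Code civil, art. 1641", "text": "le vendeur est tenu de la garantie ..."},
    {"ref": "Code civil, art. 1103", "text": "les contrats legalement formes ..."},
]

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Score provisions by keyword overlap with the query.
    A real system would use embeddings, not word overlap."""
    q_words = set(query.lower().split())
    scored = sorted(
        LEGAL_DB,
        key=lambda d: len(q_words & set(d["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(system: str, query: str) -> str:
    """Splice retrieved sources into the system prompt before the model call."""
    context = "\n".join(f"[{d['ref']}] {d['text']}" for d in retrieve(query))
    return f"{system}\n\nSources:\n{context}\n\nQuestion: {query}"
```

The point of the sketch: all the "knowledge" lives in the database and the prompt template, both of which you can update the day a new ruling lands, with no retraining run.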
2. What Are Your Latency Requirements?
This is where fine-tuning genuinely shines, and where I see people under-weight it in their analysis.
A fine-tuned smaller model (say, a distilled 7B parameter model) can run inference in 50-100ms. A prompted large model with a complex system prompt, few-shot examples, and chain-of-thought reasoning? You're looking at 800ms-2s easily, more if you're doing RAG with retrieval latency.
For Sarathi, this mattered enormously. It's a voice agent. Users are speaking into their phones, often on 3G connections in rural areas. Every 100ms of model latency compounds with network latency. We needed responses in under 500ms to feel conversational.
So here's what we actually did — and this is the part nobody tells you — we used both. We fine-tuned a small model for intent classification (the part that needed to be fast) and used prompted Claude for response generation (the part that needed to be smart). Two models, two jobs, each optimized for what it does best.
```python
# Sarathi's hybrid architecture (simplified)
async def handle_utterance(audio: bytes) -> str:
    # Step 1: Fast intent classification (fine-tuned, ~80ms)
    transcript = await whisper_transcribe(audio)
    intent = await fine_tuned_classifier.predict(transcript)

    # Step 2: Smart response generation (prompted, ~1.2s)
    context = await retrieve_context(intent, transcript)
    response = await claude.generate(
        system=SARATHI_SYSTEM_PROMPT,
        context=context,
        user_message=transcript,
        intent=intent,
    )
    return response
```
The fine-tuned classifier runs in 80ms. The prompted response generation takes about 1.2 seconds but streams back, so the user starts hearing the response almost immediately. Best of both worlds.
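The streaming trick is what makes the 1.2s tolerable. A minimal sketch of the consumer side (the token stream here is a stand-in; the real system streams from the Claude API and hands chunks to TTS):

```python
import asyncio

async def fake_token_stream():
    # Stand-in for a streaming model response; tokens are illustrative.
    for token in ["Namaskar! ", "Apuni ", "kene ", "ase?"]:
        await asyncio.sleep(0.01)  # simulated network/model delay per chunk
        yield token

async def speak_as_it_streams() -> str:
    spoken = []
    async for token in fake_token_stream():
        # Hand each chunk to TTS immediately instead of waiting
        # for the full generation to finish.
        spoken.append(token)
    return "".join(spoken)

result = asyncio.run(speak_as_it_streams())
```

Perceived latency is the time to the *first* chunk, not the last, which is why a 1.2s generation can still feel conversational.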
3. What Does This Cost at Scale?
Here's a table I wish someone had shown me two years ago. These are real numbers from our production systems, averaged over a month of traffic.
| Factor | Fine-Tuned (GPT-3.5) | Prompted (Claude Sonnet) | Hybrid (Our Approach) |
|---|---|---|---|
| Cost per request | ~$0.002 | ~$0.04 | ~$0.018 |
| Monthly cost (50K req) | $100 | $2,000 | $900 |
| Latency (p95) | 120ms | 1.8s | 1.4s (streams) |
| Update turnaround | 2-3 days (retrain) | 5 minutes (edit prompt) | Mixed |
| Accuracy (our tasks) | 73% | 81% | 84% |
| Upfront training cost | $800-2,400 | $0 | $400 |
At 50K requests per month, the cost difference is meaningful but not devastating. At 500K requests? The fine-tuned model saves you $19,000/month. At 5 million requests? You're talking about the difference between a viable business and burning cash.
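The scale math is easy to sanity-check from the per-request column of the table:

```python
# Per-request costs in USD, taken from the table above.
COST_PER_REQUEST = {"fine_tuned": 0.002, "prompted": 0.04, "hybrid": 0.018}

def monthly_costs(requests: int) -> dict:
    """Monthly spend per approach at a given request volume."""
    return {name: cost * requests for name, cost in COST_PER_REQUEST.items()}

at_50k = monthly_costs(50_000)     # roughly $100 / $2,000 / $900
at_500k = monthly_costs(500_000)
savings_at_500k = at_500k["prompted"] - at_500k["fine_tuned"]  # ~$19,000/month
```

At 5M requests, the same spread is ~$190,000/month, which is where "prompt the big model for everything" stops being a viable default.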
But — and this is the part people miss — the update turnaround cost is invisible in these tables. Every time you retrain, you're paying in engineering time, QA cycles, and deployment risk. For a fast-moving product where requirements change weekly, that "cheap per-request" fine-tuned model becomes expensive in human hours.
4. Can You Afford the Maintenance Burden?
Fine-tuned models freeze in time. The model you trained in November doesn't know about the API change in December or the new product feature in January. You need a data pipeline, a retraining schedule, evaluation benchmarks, and someone who actually monitors for drift.
At a startup, that "someone" is you. And you have twelve other things on fire.
I've seen teams at AllysAI fall into this trap. They fine-tuned a model for accessibility analysis, shipped it, and moved on to the next feature. Six months later, WCAG 2.2 guidelines got updated, and the model was confidently giving outdated recommendations. Nobody had set up a retraining trigger. The model didn't know what it didn't know.
A prompted system, by contrast, can be updated in a PR. Change the system prompt, update the few-shot examples, merge, deploy. The model itself stays current because you're using the latest version from the provider.
The Prompt That Replaced a Fine-Tuned Model
Let me show you something concrete. For Klavy, we initially fine-tuned a model to extract key clauses from French legal contracts. It took two weeks and about $1,200 in compute. Here's the prompt that replaced it:
```
Extract the following clauses from this French legal document.
For each clause, provide:
- The exact text (verbatim quote)
- The article/section number
- A severity rating: CRITICAL / IMPORTANT / STANDARD

Clauses to extract:
1. Limitation of liability (limitation de responsabilité)
2. Termination conditions (conditions de résiliation)
3. Non-compete (clause de non-concurrence)
4. Governing law (loi applicable)
5. Force majeure

If a clause is absent, state "NOT FOUND" and flag as CRITICAL.

Here are two examples of correct extractions:
[Example 1: A service agreement with clear clause boundaries]
...
[Example 2: A complex multi-party contract with nested clauses]
...
```
This prompt, fed into Claude with the full contract in context, hit 89% extraction accuracy vs. the fine-tuned model's 82%. More importantly, when a client asked us to add "data processing agreement" as a sixth clause type, I added one line to the prompt and one example. Done in 10 minutes. The fine-tuned approach would have needed new training data, a retraining run, and a full regression test.
When Fine-Tuning Actually Wins
I don't want to sound anti-fine-tuning. There are cases where it's clearly the right call:
- You need a specific output format consistently — fine-tuning is unbeatable for teaching a model your exact JSON schema, tone of voice, or structured output format
- You're operating at massive scale — above 1M requests/month, the per-request savings dominate everything
- Latency is a hard constraint — real-time applications where every millisecond matters
- You have proprietary knowledge that can't be in a prompt — domain knowledge that's too large for context windows and too sensitive for third-party APIs
- The task is narrow and stable — classification into fixed categories that won't change next quarter
For Sarathi's intent classifier, all five of those were true. Narrow task, latency-critical, stable categories, sensitive health data, and high volume. Fine-tuning was the obvious choice for that specific component.
My Decision Flowchart
Here's how I think about it now, boiled down to a quick mental checklist:
- Can you write a prompt that works? → Try it first. Always.
- Does the prompt work but cost too much at scale? → Fine-tune a smaller model
- Does the prompt work but too slowly? → Fine-tune for the latency-critical path only
- Does the prompt fail because the task is too specialized? → Check if you have 1K+ quality examples. If yes, fine-tune. If no, fix your data first.
- Is the task stable or evolving? → Evolving = prompt. Stable = fine-tune is safe.
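The checklist above can be sketched as a tiny function. This is a deliberate simplification (real decisions weigh these factors together rather than in strict order), but it captures the priority:

```python
def choose_approach(prompt_works: bool, too_expensive_at_scale: bool,
                    too_slow: bool, quality_examples: int,
                    task_is_stable: bool) -> str:
    """Toy encoding of the decision checklist; not a substitute for measuring."""
    if prompt_works:
        if too_slow:
            return "fine-tune the latency-critical path only"
        if too_expensive_at_scale and task_is_stable:
            return "fine-tune a smaller model"
        return "prompt"
    # Prompt fails outright: fine-tuning is only on the table with enough
    # clean data AND a task that won't shift under you next quarter.
    if quality_examples >= 1000 and task_is_stable:
        return "fine-tune"
    return "fix your data first"
```

Note the default: "prompt" wins every tie, and "fix your data first" is a real branch, not a failure state.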
Fine-tuning is the hammer everyone reaches for when a screwdriver would do. Learn to use both, and more importantly, learn to recognize which fastener you're looking at.
The Uncomfortable Truth
Most teams fine-tune because it feels more "real" than prompt engineering. Writing prompts feels like cheating. Training a model feels like engineering. But that's ego talking, not pragmatism.
The best AI systems I've built — Sarathi, Klavy, SALAMA — all use a mix of both approaches, carefully matched to the requirements of each component. The worst systems I've seen are the ones where someone decided on the approach before understanding the problem.
Start with a prompt. Measure it honestly. If it falls short, understand exactly why it falls short. Then, and only then, consider whether fine-tuning solves that specific shortcoming. Your users don't care how sophisticated your pipeline is. They care that it works, fast, and keeps working tomorrow.