Fine-tuning is the LLM technique clients ask about most often and apply correctly least often. The typical pattern: a team hits a quality issue, its first instinct is "we should fine-tune a model," it spends six weeks and significant money, and the result underperforms what a better prompt plus better retrieval would have delivered.
Fine-tuning is a real, powerful tool. It's also the wrong tool for ~80% of the cases we see it considered for. Here's the decision framework.
The three levers
When an LLM isn't doing what you want, you have three levers:
- Better prompting — few-shot examples, chain-of-thought, structured output
- Better retrieval (RAG) — giving the model better context
- Fine-tuning — actually updating the model's weights
They're not equally expensive, and they're not interchangeable. Before reaching for fine-tuning, exhaust the first two.
When prompting works
Prompting fixes most quality issues. You'd be surprised how often "hallucination" is fixed by:
- Adding 2-3 few-shot examples of correct behavior
- Specifying the output format explicitly (JSON schema, Pydantic model)
- Asking the model to reason before answering (CoT)
- Clarifying edge-case handling in the system prompt
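As a concrete illustration of the first two fixes, here is a minimal sketch of assembling a prompt with few-shot examples and an explicit output format. The ticket-extraction task, the schema, and the examples are all hypothetical:

```python
import json

# Hypothetical task: extract support-ticket fields. The schema and
# few-shot examples below are invented for illustration.
SYSTEM = (
    "Extract ticket fields. Reply with JSON matching "
    '{"product": str, "severity": "low"|"high", "summary": str}. '
    "If a field is missing from the ticket, use null - do not guess."
)

FEW_SHOT = [
    ("Checkout crashes when I pay with PayPal, losing us sales!",
     {"product": "checkout", "severity": "high",
      "summary": "PayPal payment crashes checkout"}),
    ("Minor typo on the pricing page.",
     {"product": "pricing page", "severity": "low",
      "summary": "Typo on pricing page"}),
]

def build_messages(ticket: str) -> list[dict]:
    """Assemble a chat-style message list: system prompt, few-shot
    examples as prior turns, then the real input."""
    messages = [{"role": "system", "content": SYSTEM}]
    for user_text, answer in FEW_SHOT:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": json.dumps(answer)})
    messages.append({"role": "user", "content": ticket})
    return messages

msgs = build_messages("App freezes on login every time.")
```

Iterating on this costs minutes per attempt: change an example or a schema constraint, re-run, and compare.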
Iteration cycle: minutes. Cost: near zero.
When RAG works
RAG fixes issues related to knowledge:
- Model doesn't know your domain
- Model is out-of-date
- Model needs access to private data
- Model needs to cite sources
Iteration cycle: days to weeks. Cost: moderate (embeddings, vector DB, re-ranking).
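The core retrieval loop can be sketched in a few lines. This toy version scores documents by token overlap; a real pipeline would use embeddings, a vector DB, and re-ranking, and the corpus here is made up:

```python
# Toy retrieval step: rank documents against the query, then stuff
# the best ones into the prompt as context the model must stick to.
DOCS = [
    "Refunds are processed within 14 days of the return request.",
    "Enterprise plans include SSO and a dedicated support channel.",
    "The API rate limit is 600 requests per minute per key.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Score by shared lowercase tokens - a stand-in for embedding
    similarity search."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(f"- {d}" for d in retrieve(query, DOCS))
    return ("Answer using only the context below; if the answer is "
            f"not in it, say so.\n\nContext:\n{context}\n\n"
            f"Question: {query}")

prompt = build_prompt("How fast are refunds processed?")
```

The shape is the point: the model's knowledge problem becomes a context-quality problem, which you can measure and improve independently of the model.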
When fine-tuning works
Fine-tuning shines in a narrower set of cases:
- Consistent output style/format across varied inputs (where few-shot doesn't scale)
- Cost reduction by distilling a big model into a smaller one
- Latency reduction, for the same reason (smaller models respond faster)
- Classification / narrow task that RAG can't solve (no factual retrieval needed)
- Tone/persona that's hard to specify in words
Iteration cycle: weeks. Cost: data collection + training + evaluation.
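Most of that cost is in the data. As a sketch, fine-tuning providers commonly accept training examples as chat-format JSONL; the classification task and examples below are invented, and field names should be checked against your provider's spec:

```python
import json

# Sketch of fine-tuning training data in the common chat JSONL
# format. Two examples shown; a real set needs hundreds to thousands
# of vetted pairs covering the edge cases.
examples = [
    {"messages": [
        {"role": "system", "content": "Classify the ticket: billing, bug, or feature."},
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
    ]},
    {"messages": [
        {"role": "system", "content": "Classify the ticket: billing, bug, or feature."},
        {"role": "user", "content": "Dark mode would be great."},
        {"role": "assistant", "content": "feature"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Note what the assistant turns contain: only the target label. In fine-tuning, the data *is* the specification, which is why weak data dominates the failure modes below.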
The decision flowchart
Use this sequence before committing to training:
1. Can prompting solve it with acceptable reliability?
2. If not, is the issue knowledge or context quality?
3. If not, do you need consistency at scale that prompting cannot maintain?
4. Do you have enough high-quality labeled examples?
5. Will the expected gains justify ongoing maintenance?
If steps 1-2 are still open, fine-tuning is usually premature.
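The sequence above can be sketched as a function. The inputs would come from actual experiments, not guesses, and the recommendation labels are ours:

```python
# The decision sequence as code: each gate must be answered with
# evidence before moving to the next.
def recommend(prompting_ok: bool, knowledge_issue: bool,
              needs_consistency: bool, has_labeled_data: bool,
              gains_justify_cost: bool) -> str:
    if prompting_ok:
        return "better prompting"
    if knowledge_issue:
        return "RAG"
    if not needs_consistency:
        return "revisit prompting / task framing"
    if not (has_labeled_data and gains_justify_cost):
        return "collect data and evals first"
    return "fine-tune"
```

Notice how many branches exit before "fine-tune": that is the ~80% from the opening.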
When fine-tuning is justified
Fine-tuning tends to pay off for:
- narrow, high-volume classification tasks
- strict structured-output requirements with low error tolerance
- cost/latency reduction via distillation to smaller models
- domain-specific language consistency where prompts drift
When it usually fails
- weak or noisy training data
- unclear task boundaries
- retrieval problems disguised as "model quality" problems
- no evaluation harness or rollback strategy
In these cases, improve prompting, retrieval, and data quality first.
Data and evaluation requirements
Before training, establish:
- representative labeled set across common and edge cases
- explicit failure taxonomy (what errors matter most)
- held-out validation and regression sets
- business metrics tied to model quality (not only offline scores)
Without this, teams cannot tell whether tuning helped or just shifted error types.
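A minimal harness that makes "did tuning help, or just shift error types" answerable might look like this. The held-out cases and the stand-in model are invented; the pattern that matters is bucketing failures by taxonomy rather than reporting one aggregate score:

```python
from collections import Counter

# Minimal eval sketch: run a model over a held-out set and bucket
# failures, so error-type shifts between models are visible.
HELD_OUT = [
    {"input": "refund status?", "expected": "billing"},
    {"input": "app crashes on start", "expected": "bug"},
    {"input": "please add CSV export", "expected": "feature"},
]

def evaluate(model, held_out):
    """Return (accuracy, failure counts keyed by confusion type)."""
    failures = Counter()
    correct = 0
    for case in held_out:
        got = model(case["input"])
        if got == case["expected"]:
            correct += 1
        else:
            failures[f"{case['expected']} -> {got}"] += 1
    return correct / len(held_out), failures

# Stand-in "model" that always answers "bug".
acc, failures = evaluate(lambda x: "bug", HELD_OUT)
```

Run the same harness on the baseline and the tuned model; comparing the two failure counters tells you whether errors were removed or merely relocated.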
Practical rollout plan
- Baseline current system quality and cost.
- Train on a narrow, high-value slice first.
- A/B test tuned vs baseline model under real traffic.
- Keep hard fallback routing and kill switch.
- Set retraining cadence only after drift is observed.
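The A/B-test and kill-switch steps can be sketched as a small router. The 10% split and the flag mechanism are illustrative; in production the flag would live in a config service:

```python
import hashlib

# Deterministic traffic split between baseline and tuned model,
# with a hard kill switch back to baseline.
KILL_SWITCH = False      # flip to route 100% to baseline instantly
TUNED_TRAFFIC_PCT = 10   # start narrow, widen only on good metrics

def route(user_id: str) -> str:
    if KILL_SWITCH:
        return "baseline"
    # Stable hash so a given user always sees the same arm.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "tuned" if bucket < TUNED_TRAFFIC_PCT else "baseline"

arm = route("user-42")
```

Hashing the user ID (rather than random assignment per request) keeps each user's experience consistent and makes the experiment's metrics attributable per arm.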
Closing
Fine-tuning is a leverage tool, not a default step. Teams that treat it as the third lever after strong prompting and retrieval generally ship faster and with less risk.
Related resources
- Capabilities: AI-Native Products and Data Platform
- Case study: LegalTech RAG overhaul
- Deep dive: RAG evaluation tests before shipping
Anthra AI Team
Engineering Team
Collective posts from the engineers at Anthra AI. We write about what we build.