You Might Not Need RAG
We built a life insurance bot that matches clients with carriers based on medical conditions. When developing this with a friend, I planned to "do the dumb thing first": we wanted the simplest thing we could ship fast, so we could test whether customers actually wanted it. Our initial hack was to just let the LLM pick and read complete files. But after going back and building a "proper" RAG system, we discovered the hack was both easier to maintain and more accurate.
Our corpus was carrier documentation about medical conditions: diabetes guidelines, heart disease rules, cancer policies. The initial approach was dead simple: show the LLM a file list, let it pick relevant documents, load them completely if they were small enough. For larger documents, we'd hand another LLM call the whole document and a summary of what we were trying to do, and ask it to pull out all the relevant information for the main process. One API endpoint (called a "function" then, "tool" now), one for loop, all in one afternoon. It was embarrassingly simple for a team that had worked on retrieval and ranking for billions at Google, so we figured we'd improve it later.
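For concreteness, here's roughly the shape of that afternoon's worth of code. This is a minimal sketch assuming an OpenAI-style tool-calling API; the model names, directory layout, size cutoff, and the read_document helper are illustrative stand-ins, not our actual implementation.

```python
# Minimal sketch of the "pick a file and read it" loop, assuming an
# OpenAI-style tool-calling API. Model names, paths, the size cutoff,
# and the read_document helper are illustrative, not production code.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
DOCS = Path("carrier_docs")
MAX_CHARS = 40_000  # small enough to paste a document in whole

def read_document(filename: str, question: str) -> str:
    """Return the whole file if it's small; otherwise ask a second
    LLM call to extract everything relevant to the question."""
    text = (DOCS / filename).read_text()
    if len(text) <= MAX_CHARS:
        return text
    extraction = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Extract everything relevant to: {question}\n\n{text}"}],
    )
    return extraction.choices[0].message.content

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_document",
        "description": "Read one carrier document (full text, or an extract for long ones).",
        "parameters": {
            "type": "object",
            "properties": {"filename": {"type": "string"},
                           "question": {"type": "string"}},
            "required": ["filename", "question"],
        },
    },
}]

def answer(question: str) -> str:
    file_list = "\n".join(sorted(p.name for p in DOCS.iterdir()))
    messages = [{"role": "user",
                 "content": f"Available documents:\n{file_list}\n\nQuestion: {question}"}]
    while True:  # the "one for loop"
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS)
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content
        messages.append(message)
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": read_document(**args)})
```

The entire "retrieval system" is one tool definition and one loop.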
Given our backgrounds, RAG seemed obvious. We knew retrieval systems. Surely we could beat our hack.
We built the works: state-of-the-art embeddings from Voyage.ai stored in PGVector, and reciprocal rank fusion with BM25 powered by Elasticsearch. Our chunking got gradually more sophisticated, with semantic splits by heading, smart merging of small chunks, and LLM-generated contextualizing annotations. All the best practices.
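The fusion step itself is tiny; the work is in everything feeding it. Here's a sketch of reciprocal rank fusion over the two ranked hit lists, assuming string document IDs and the conventional k = 60 constant (an assumption, not necessarily what we tuned):

```python
# Sketch of reciprocal rank fusion (RRF) over two ranked hit lists:
# one from BM25 (Elasticsearch), one from vector similarity (PGVector).
# The doc IDs and k = 60 are illustrative; k = 60 is just the common default.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Combine several ranked lists of doc IDs; higher fused score ranks first."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25 and vector hits, then hand the top chunks to the reranker.
bm25_hits = ["diabetes_table_p3", "a1c_guidelines", "bmi_rules"]
vector_hits = ["a1c_guidelines", "retinopathy_policy", "diabetes_table_p3"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

Everything around that function, indexing, chunking, embedding, reranking, was where the effort actually went.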
But initial tests showed some retrieval failures. Every tweak meant hours of reindexing. Medical terms broke the embeddings: search "diabetes" and get random A1C ranges. And tables were the killer. The life insurance industry seems to love complex tables: merged cells, multi-page conditions, rules like "Accept Type 2 diabetes if A1C under 7 AND no retinopathy AND BMI under 35" scattered across chunks. We added more chunking logic and increased the chunk size for tables specifically. Finally, the ranking step of the RAG approach needed some kind of pagination, since the agent could ask for more information about the same query later in the loop but no longer had direct control over where to look for it.
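The table handling mostly boiled down to letting chunks grow whenever a block looked like a table. A sketch under assumed thresholds; the pipe-counting heuristic and both limits are illustrative, not our exact rules:

```python
# Sketch of table-aware chunk sizing: blocks that look like (markdown-style)
# tables get a much larger budget so multi-condition rules stay in one chunk.
# The heuristic and both token limits are assumptions for illustration.
DEFAULT_MAX_TOKENS = 512
TABLE_MAX_TOKENS = 2048

def looks_like_table(block: str) -> bool:
    """Crude check: several lines with multiple column separators."""
    rows = [line for line in block.splitlines() if line.count("|") >= 2]
    return len(rows) >= 3

def max_chunk_tokens(block: str) -> int:
    return TABLE_MAX_TOKENS if looks_like_table(block) else DEFAULT_MAX_TOKENS
```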
The complexity multiplied. Simple approach: LLM → Files.
RAG approach: LLM → Query Processor → Elasticsearch + PGVector → Embeddings → Reranker → Results Cache → Context Assembly. Many moving parts, many failure modes.
After a ton of work and four new services to maintain, we finally matched our baseline accuracy. The win? Faster responses. That's it.
Later I learned our initial hack had a name: agentic retrieval. The LLM acts as its own retrieval agent, picking documents to examine. We'd built it by accident.
Why it worked: Complete documents meant complete context. Those complex multi-condition tables stayed intact. No chunking errors, no missing pieces. Simpler prompts too: no explaining why information was fragmented.
RAG's benefits (massive scale, faster responses) didn't apply. We had a couple thousand documents, not millions. Modern context windows handled multiple complete docs easily.
My best guess about when to use what:
Agentic retrieval works when you have hundreds to low thousands (not millions) of documents with interdependent information, care about accuracy over speed, need quick iteration, or have complex document structures.
RAG makes sense for truly massive corpora, independent documents, sub-second requirements, or when context windows genuinely can't fit your needs.
Cost-wise, yes, we burn more tokens loading full documents. But tuning all of RAG's infrastructure (Elasticsearch, PGVector, compute, reindexing) cost more than our "wasteful" token usage.
What started as a hack became our architecture. As context windows expand and costs drop, letting LLMs read complete documents stays competitive with complex RAG pipelines.
The lesson I should've known: simple approaches scale well with LLMs! We spent a lot of engineering effort solving non-problems. Our corpus wasn't big enough to require RAG, and our latency needs weren't that strict. RAG was never likely to improve accuracy either, since every extra stage in the pipeline is another place to lose information.
My advice: you should usually try the dumb solution first. Let the LLM read whole documents. If it's dumb and it works well, it isn't dumb.
Building knowledge-intensive AI systems? Let's discuss your experiences with RAG vs. alternative approaches.