You Might Not Need RAG
We built a life insurance bot that matches clients with carriers based on medical conditions. When developing this with a friend, I planned to "do the dumb thing first": we wanted the simplest thing we could ship fast, so we could test whether customers actually wanted it. Our initial hack was to just let the LLM pick and read complete files. But after going back and building a "proper" RAG system, we discovered the hack was both easier to maintain and more accurate.
Our corpus was carrier documentation about medical conditions: diabetes guidelines, heart disease rules, cancer policies. The initial approach was dead simple: show the LLM a file list, let it pick relevant documents, load them completely if they were small enough. For larger documents, we'd hand another LLM call the whole document and a summary of what we were trying to do, and ask it to pull out all the relevant information for the main process. One API endpoint (called a "function" then, "tool" now), one for loop, all in one afternoon. It was embarrassingly simple for a team that had worked on retrieval and ranking for billions at Google, so we figured we'd improve it later.
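For concreteness, here's roughly the shape of that afternoon's worth of code. This is a minimal sketch assuming an OpenAI-style tool-calling API; the model names, directory layout, size cutoff, and the read_document helper are illustrative stand-ins, not our actual implementation.

```python
# Minimal sketch of the "pick a file and read it" loop, assuming an
# OpenAI-style tool-calling API. Model names, paths, the size cutoff,
# and the read_document helper are illustrative, not production code.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
DOCS = Path("carrier_docs")
MAX_CHARS = 40_000  # small enough to paste a document in whole

def read_document(filename: str, question: str) -> str:
    """Return the whole file if it's small; otherwise ask a second
    LLM call to extract everything relevant to the question."""
    text = (DOCS / filename).read_text()
    if len(text) <= MAX_CHARS:
        return text
    extraction = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Extract everything relevant to: {question}\n\n{text}"}],
    )
    return extraction.choices[0].message.content

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_document",
        "description": "Read one carrier document (full text, or an extract for long ones).",
        "parameters": {
            "type": "object",
            "properties": {"filename": {"type": "string"},
                           "question": {"type": "string"}},
            "required": ["filename", "question"],
        },
    },
}]

def answer(question: str) -> str:
    file_list = "\n".join(sorted(p.name for p in DOCS.iterdir()))
    messages = [{"role": "user",
                 "content": f"Available documents:\n{file_list}\n\nQuestion: {question}"}]
    while True:  # the "one for loop"
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS)
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content
        messages.append(message)
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": read_document(**args)})
```

The entire "retrieval system" is one tool definition and one loop.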
Given our backgrounds, RAG seemed obvious. We knew retrieval systems. Surely we could beat our hack.
We built the works: state-of-the-art embeddings from Voyage.ai stored in PGVector, and reciprocal rank fusion with BM25 powered by Elasticsearch. Our chunking got gradually more sophisticated, with semantic splits by heading, smart merging of small chunks, and LLM-generated contextualizing annotations. All the best practices.
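The fusion step itself is tiny; the work is in everything feeding it. Here's a sketch of reciprocal rank fusion over the two ranked hit lists, assuming string document IDs and the conventional k = 60 constant (an assumption, not necessarily what we tuned):

```python
# Sketch of reciprocal rank fusion (RRF) over two ranked hit lists:
# one from BM25 (Elasticsearch), one from vector similarity (PGVector).
# The doc IDs and k = 60 are illustrative; k = 60 is just the common default.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Combine several ranked lists of doc IDs; higher fused score ranks first."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25 and vector hits, then hand the top chunks to the reranker.
bm25_hits = ["diabetes_table_p3", "a1c_guidelines", "bmi_rules"]
vector_hits = ["a1c_guidelines", "retinopathy_policy", "diabetes_table_p3"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

Everything around that function, indexing, chunking, embedding, reranking, was where the effort actually went.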
But initial tests showed some retrieval failures. Every tweak meant hours of reindexing. Medical terms broke the embeddings: search "diabetes" and get random A1C ranges. And tables were the killer. The life insurance industry seems to love complex tables: merged cells, multi-page conditions, rules like "Accept Type 2 diabetes if A1C under 7 AND no retinopathy AND BMI under 35" scattered across chunks. We added more chunking logic and increased the chunk size for tables specifically. Finally, the ranking step of the RAG approach needed some kind of pagination, since the agent could ask for more information about the same query later in the loop but no longer had direct control over where to look for it.
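The table handling mostly boiled down to letting chunks grow whenever a block looked like a table. A sketch under assumed thresholds; the pipe-counting heuristic and both limits are illustrative, not our exact rules:

```python
# Sketch of table-aware chunk sizing: blocks that look like (markdown-style)
# tables get a much larger budget so multi-condition rules stay in one chunk.
# The heuristic and both token limits are assumptions for illustration.
DEFAULT_MAX_TOKENS = 512
TABLE_MAX_TOKENS = 2048

def looks_like_table(block: str) -> bool:
    """Crude check: several lines with multiple column separators."""
    rows = [line for line in block.splitlines() if line.count("|") >= 2]
    return len(rows) >= 3

def max_chunk_tokens(block: str) -> int:
    return TABLE_MAX_TOKENS if looks_like_table(block) else DEFAULT_MAX_TOKENS
```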
The complexity multiplied. Simple approach: LLM → Files.
RAG approach: LLM → Query Processor → Elasticsearch + PGVector → Embeddings → Reranker → Results Cache → Context Assembly. Many moving parts, many failure modes.
After a ton of work and four new services to maintain, we finally matched our baseline accuracy. The win? Faster responses. That's it.
Later I learned our initial hack had a name: agentic retrieval. The LLM acts as its own retrieval agent, picking documents to examine. We'd built it by accident.
Why it worked: Complete documents meant complete context. Those complex multi-condition tables stayed intact. No chunking errors, no missing pieces. Simpler prompts too: no explaining why information was fragmented.
RAG's benefits (massive scale, faster responses) didn't apply. We had a couple thousand documents, not millions. Modern context windows handled multiple complete docs easily.
My best guess about when to use what:
Agentic retrieval works when you have hundreds to low thousands (not millions) of documents with interdependent information, care about accuracy over speed, need quick iteration, or have complex document structures.
RAG makes sense for truly massive corpora, independent documents, sub-second requirements, or when context windows genuinely can't fit your needs.
Cost-wise, yes, we burn more tokens loading full documents. But tuning all of RAG's infrastructure (Elasticsearch, PGVector, compute, reindexing) cost more than our "wasteful" token usage.
What started as a hack became our architecture. As context windows expand and costs drop, letting LLMs read complete documents stays competitive with complex RAG pipelines.
The lesson I should've known: simple approaches scale well with LLMs! We spent a lot of engineering effort solving non-problems. Our corpus wasn't big enough to require RAG, and our latency needs weren't that strict. RAG was never likely to improve accuracy either, since every extra stage in the pipeline is another place to lose information.
My advice: you should usually try the dumb solution first. Let the LLM read whole documents. If it's dumb and it works well, it isn't dumb.
Building knowledge-intensive AI systems? Let's discuss your experiences with RAG vs. alternative approaches.