A practical writeup on Towards Data Science this week by Priyansh Bhardwaj lays out why most production RAG failures are chunking failures, not retrieval or generation failures. The headline insight is easy to quote: "The LLM is not the bottleneck. The bottleneck is the decision about where one chunk ends and the next begins." The piece gets past the abstract argument with four concrete failure modes and measured improvements from a document-type-aware chunking strategy. For anyone who has shipped a RAG system and watched it return technically plausible but subtly wrong answers, the failure modes will look familiar.

The four patterns Bhardwaj calls out are all common and all expensive to debug. Logical boundary splitting is the worst of them: a chunk ends with "contractors follow the standard onboarding process as described in Section 4" and the next begins "unless engaged on a project classified under Annex B," producing two fragments, each wrong in isolation, whose combination the retriever may never reassemble. Fixed-size token windows (the default 512 tokens with 50-token overlap that still lives in too many stacks) happily split three-paragraph exceptions and numbered lists down the middle because they are structure-blind. Table flattening turns grids into long sequences of orphan values divorced from their headers. Layout-unaware PDF extraction compounds all of the above, because the underlying text stream no longer matches the visual structure the writer relied on.

The recommended remedy is unexciting and exactly right. Route by document type. Structured documentation with clear hierarchy (specs, runbooks) wants hierarchical chunking and AutoMergingRetriever. Narrative content (policies, guides, explanatory prose) wants SentenceWindowNodeParser with a 3-sentence context window. Mixed and unstructured content wants semantic chunking, with the honest caveat about its latency cost. PDFs and slides need layout-aware preprocessing via PyMuPDF or pdfplumber before any chunking happens. Bhardwaj's benchmark on the narrative-content switch is unglamorous but real: context recall moved from 0.72 to 0.88 and context precision from 0.71 to 0.83 by matching the chunker to the document shape. Those are the kinds of numbers you get when you stop treating RAG as a model problem and start treating it as a content-engineering problem.
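The routing itself can be a thin dispatch layer. A hypothetical shell, stripped to stdlib so it runs anywhere: the chunker functions below are placeholders for the real strategies the article names (hierarchical chunking with AutoMergingRetriever, sentence windows, semantic chunking), and the document-type labels are assumptions, not the article's taxonomy:

```python
# Hypothetical routing shell: one chunking strategy per document type
# instead of one global chunker. The placeholder functions stand in for
# the real parsers (e.g. LlamaIndex's SentenceWindowNodeParser); only
# the dispatch logic is shown.

def hierarchical_chunker(doc):     # specs, runbooks: clear hierarchy
    return f"hierarchical({doc['id']})"

def sentence_window_chunker(doc):  # policies, guides: 3-sentence window
    return f"sentence_window({doc['id']}, window=3)"

def semantic_chunker(doc):         # mixed/unstructured: slower fallback
    return f"semantic({doc['id']})"

ROUTES = {
    "structured": hierarchical_chunker,
    "narrative": sentence_window_chunker,
    "mixed": semantic_chunker,
}

def chunk(doc):
    # PDFs and slides would get layout-aware preprocessing (PyMuPDF or
    # pdfplumber) before reaching this point; elided here.
    strategy = ROUTES.get(doc["type"], semantic_chunker)
    return strategy(doc)

print(chunk({"id": "runbook-7", "type": "structured"}))
```

The point of making the route explicit is that it becomes inspectable: when a retrieval failure traces back to a bad chunk, the first question is whether the document was routed to the right chunker at all.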

If your RAG system is in production and you have never instrumented chunking as a failure source, that is the place to look first. Concrete steps: sample 50 recent queries where the answer was wrong or thin, read the retrieved chunks by hand, and count how many are doomed from the chunk boundaries alone regardless of which model generates from them. That number is almost always larger than teams expect. The second step is to stop treating your corpus as homogeneous; route by document type, because the chunker that works for a 200-page policy PDF will not work for a table-heavy SKU catalog.

This pairs with the memweave piece from yesterday. Both are making the same broader point from different angles. The infrastructure choices that look most exciting (bigger vector databases, newer models, longer contexts) are usually not where your quality problems live. The boring choices about how your data gets chopped up and indexed are.
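The hand audit in the steps above can be pre-screened mechanically before anyone reads 50 query traces. A crude heuristic of my own, not from the article: flag retrieved chunks whose boundaries look mid-sentence. A flag is not proof of a chunking failure, just a candidate worth reading first:

```python
# Crude pre-screen for a manual chunk audit: flag chunks that start
# mid-sentence (lowercase or conjunction first word) or end without
# terminal punctuation. Heuristic only; it surfaces candidates, it does
# not diagnose.

def boundary_flags(chunk):
    flags = []
    head, tail = chunk.lstrip(), chunk.rstrip()
    first_word = head.split()[0] if head else ""
    if head and (head[0].islower() or first_word in {"And", "But", "Or", "Unless"}):
        flags.append("starts mid-sentence")
    if tail and tail[-1] not in ".!?\"')":
        flags.append("ends mid-sentence")
    return flags

retrieved = [
    "Contractors follow the standard onboarding process as described in Section 4",
    "unless engaged on a project classified under Annex B.",
    "All employees must complete training annually.",
]

for c in retrieved:
    if boundary_flags(c):
        print(boundary_flags(c), "->", c[:50])
```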