A RAG pipeline has three stages: indexing, retrieval, and generation. During indexing, you take your documents — PDFs, web pages, database records, whatever — split them into chunks, run each chunk through an embedding model to get a vector, and store those vectors in a vector database like Qdrant, Pinecone, or Weaviate. During retrieval, the user's query is embedded with the same model, and the vector database returns the top-k most similar chunks (typically 3–10). During generation, those chunks are stuffed into the model's prompt as context, and the model generates a response grounded in that material. The entire round trip — embed the query, search, assemble the prompt, generate — typically takes 1–3 seconds.
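The three stages can be sketched in a few lines. This is a toy, not an implementation: a bag-of-words counter stands in for the embedding model, an in-memory list stands in for the vector database, and the generation call is omitted, but the shape of the round trip is the same.

```python
# Toy sketch of the three RAG stages: index, retrieve, assemble prompt.
# A bag-of-words Counter stands in for a real embedding model.
from collections import Counter
import math
import re

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: word-count vectors.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stage 1: indexing — embed each chunk and store the vectors.
chunks = [
    "Qdrant is an open-source vector database.",
    "BM25 is a keyword ranking function.",
    "Cross-encoders re-rank retrieved passages.",
]
index = [(c, embed(c)) for c in chunks]

# Stage 2: retrieval — embed the query with the same model, return top-k.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Stage 3: generation — stuff the retrieved chunks into the prompt
# as context (the actual model call is omitted here).
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Which vector database is open source?"))
```

Swapping the toy pieces for a real embedding API and a vector store changes the plumbing but not the structure.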
Chunking is where most RAG systems succeed or fail, and it's more subtle than it looks. Make chunks too small and you lose context; too large and you waste precious context window space on irrelevant text. A common starting point is 500–1000 tokens per chunk with 10–20% overlap between adjacent chunks (so you don't lose information that spans a boundary). But naive fixed-size chunking often cuts sentences in half or separates a heading from its content. More sophisticated approaches use document structure — splitting on headings, paragraph breaks, or semantic shifts — to create chunks that are self-contained and meaningful. LangChain's RecursiveCharacterTextSplitter and LlamaIndex's SentenceSplitter both try to handle this, but the best results usually come from understanding your specific documents and writing custom splitting logic.
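The naive fixed-size approach with overlap looks roughly like this sketch, which uses whitespace words as a rough stand-in for tokens (real systems count tokens with the embedding model's tokenizer); the sizes are scaled down for readability:

```python
# Fixed-size chunking with overlap. Words approximate tokens here;
# in practice you'd count tokens with the model's own tokenizer.
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # each chunk starts `step` words after the last
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks
```

This is exactly the splitter that cuts sentences in half; the structure-aware alternatives mentioned above would split on paragraph or heading boundaries first and only fall back to fixed sizes inside oversized sections.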
The retrieval step has more options than people realize. Pure vector similarity search (nearest-neighbor lookup in embedding space) is the default, but it struggles with exact keyword matches, proper nouns, and code identifiers. This is why hybrid search has become the standard in production systems: you run both a vector search and a traditional keyword search (BM25), then combine the results using reciprocal rank fusion or a learned re-ranker. Qdrant, Weaviate, and Elasticsearch all support hybrid search natively. Re-ranking — taking the top 20–50 results from retrieval and scoring them with a cross-encoder model — adds latency but dramatically improves relevance. Cohere Rerank and cross-encoder models from Hugging Face are the common choices here.
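Reciprocal rank fusion itself is only a few lines: each ranked list contributes 1/(k + rank) per document, and the fused score is the sum across lists. The sketch below uses k=60, the constant from the original RRF paper; the document IDs are made up for illustration.

```python
# Reciprocal rank fusion: combine a vector-search ranking and a
# BM25 ranking into one list. Each list contributes 1/(k + rank).
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc3", "doc1", "doc7"]   # nearest-neighbor results
keyword_hits = ["doc1", "doc9", "doc3"]   # BM25 results
fused = rrf_fuse([vector_hits, keyword_hits])
```

Note how a document that appears high in both lists (doc1 here) outranks one that tops only a single list; that agreement bonus is the point of the fusion, and it needs no score calibration between the two retrievers.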
A common misconception is that RAG eliminates hallucination. It reduces it significantly, but a model can still hallucinate details that aren't in the retrieved chunks, especially if the chunks are tangentially related to the query rather than directly answering it. Good RAG systems mitigate this by including source citations in the prompt instructions ("only answer based on the provided context, cite your sources"), by filtering out low-relevance chunks (setting a minimum similarity threshold rather than always returning top-k), and by letting the model say "I don't have enough information" when the retrieved context genuinely doesn't answer the question. Some teams add a verification step where a second model call checks whether the response is actually supported by the sources.
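The threshold-filtering idea can be sketched as follows. The 0.75 cutoff is purely illustrative (usable thresholds depend on the embedding model and must be tuned against your own retrieval scores), and the prompt wording is one example of the grounding instructions described above:

```python
# Filter retrieved chunks by similarity score instead of always
# passing top-k, and steer the model toward refusal when nothing
# sufficiently relevant was found. The 0.75 cutoff is illustrative.
def filter_chunks(scored_chunks: list[tuple[str, float]],
                  min_score: float = 0.75) -> list[str]:
    return [chunk for chunk, score in scored_chunks if score >= min_score]

def guarded_prompt(query: str,
                   scored_chunks: list[tuple[str, float]]) -> str:
    kept = filter_chunks(scored_chunks)
    if not kept:
        # Nothing relevant retrieved: instruct the model to say so
        # rather than improvise from tangential context.
        return ('Reply "I don\'t have enough information to answer."\n'
                f"Question: {query}")
    context = "\n".join(kept)
    return ("Answer only from the context below and cite your sources. "
            "If the context does not answer the question, say so.\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

The verification step some teams add would sit after generation: a second call that receives the response plus the kept chunks and judges whether each claim is supported.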
RAG was introduced in a 2020 paper by Facebook AI Research (now Meta AI), but it didn't become a mainstream production pattern until 2023 when vector databases and embedding APIs matured enough to make it practical. The pattern has since evolved in several directions: GraphRAG uses knowledge graphs instead of (or alongside) vector search for better handling of relational questions. Agentic RAG gives the model the ability to reformulate queries, search multiple sources, and iterate on retrieval rather than doing a single search. And context-window expansion — models now handle 100k+ tokens — has led some teams to skip RAG entirely for smaller knowledge bases, just stuffing everything into the prompt. That approach works for a few hundred pages of docs but breaks down at scale, which is where RAG remains essential.