MervCodes

Tech Reviews From A Programmer

RAG Tutorial: Build an AI Knowledge Base

1 min read

RAG Tutorial: Build an AI Knowledge Base

Large language models are remarkable generalists, but they have two stubborn limits: they don't know your private data, and they confidently invent facts when they're unsure. Retrieval-Augmented Generation (RAG) fixes both. Instead of relying on what a model memorized during training, RAG fetches relevant snippets from your documents at query time and hands them to the model as context. The result is an AI that answers questions about your knowledge base accurately, with citations, and without expensive fine-tuning.

This tutorial walks through building a RAG-powered knowledge base end to end. We'll cover the architecture, the practical decisions that actually matter, and a working code skeleton you can adapt.

How RAG Works in One Picture

A RAG system has two phases: ingestion (done once, or whenever documents change) and querying (done on every user question).

During ingestion you:

  1. Load raw documents (PDFs, Markdown, HTML, database rows).
  2. Split them into smaller chunks.
  3. Convert each chunk into an embedding — a numeric vector capturing meaning.
  4. Store those vectors in a vector database.

At query time you:

  1. Embed the user's question with the same model.
  2. Search the vector store for the most similar chunks.
  3. Stuff those chunks into a prompt as context.
  4. Ask the LLM to answer using only that context.

That's the whole trick. The model never "learns" your data — it reads the right pages just in time.

Step 1: Chunking Your Documents

Chunking is the single most underrated step. If chunks are too large, retrieval pulls in noise and you waste context window. Too small, and a chunk loses the surrounding meaning needed to answer.

Sensible defaults to start with:

  • Chunk size: 500–1,000 tokens.
  • Overlap: 10–15% between adjacent chunks so a sentence split across a boundary still survives in one piece.
  • Respect structure: split on headings, paragraphs, or sentences rather than blindly every N characters. A chunk that ends mid-sentence retrieves poorly.
def chunk_text(text, size=800, overlap=100):
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = start + size
        chunks.append(" ".join(words[start:end]))
        start += size - overlap
    return chunks

Always attach metadata to each chunk — source filename, page number, section title, last-updated date. You'll need it for citations and for filtering.

Step 2: Generating Embeddings

An embedding model turns text into a vector (often 768 to 3,072 dimensions). Similar meanings land close together in vector space, which is what makes semantic search possible.

When choosing an embedding model, weigh:

  • Quality — how well it separates relevant from irrelevant text on your domain.
  • Dimension size — bigger isn't always better; it costs more storage and compute.
  • Cost and latency — you embed every chunk once, but every query too.
  • Hosting — managed API vs. a local open-source model for privacy.

The critical rule: use the same embedding model for ingestion and for queries. Mixing models produces vectors that aren't comparable, and retrieval quietly degrades.

Step 3: Choosing and Loading a Vector Database

The vector store indexes your embeddings and answers nearest-neighbor searches fast. Popular options:

  • pgvector — a Postgres extension. Great if you already run Postgres and want one less moving part.
  • Qdrant / Weaviate / Milvus — purpose-built, scale to millions of vectors, rich filtering.
  • Chroma / FAISS — lightweight, perfect for prototyping locally.
  • Pinecone — fully managed, minimal ops.

For a first project, start with Chroma or pgvector. Don't over-engineer the infrastructure before you've validated that retrieval quality is good.

# Pseudocode for ingestion
for doc in documents:
    for chunk in chunk_text(doc.text):
        vector = embed(chunk)
        store.add(vector=vector, text=chunk, metadata=doc.meta)

Step 4: Retrieval

When a question comes in, embed it and ask the store for the top k most similar chunks (k of 4–8 is a good starting range). Two techniques sharply improve results:

  • Hybrid search: combine semantic (vector) search with keyword (BM25) search. Vectors catch paraphrases; keywords catch exact terms, names, and IDs that embeddings sometimes blur.
  • Re-ranking: retrieve a wider net (say top 20), then run a cross-encoder re-ranker to reorder by true relevance and keep the best 5. This is one of the highest-leverage upgrades you can make.

Also apply metadata filters — e.g., only search documents the user is allowed to see, or only the latest version. Retrieval quality determines everything downstream; a perfect LLM can't answer from the wrong pages.

Step 5: Prompt Construction and Generation

Now assemble the prompt. A reliable template looks like this:

You are a helpful assistant. Answer the question using ONLY the
context below. If the answer is not in the context, say you don't
know. Cite sources by their [number].

Context:
[1] {chunk_1}  (source: handbook.pdf, p.12)
[2] {chunk_2}  (source: policy.md)
...

Question: {user_question}

Key prompting guidelines:

  • Ground the model explicitly. Telling it to answer only from context and to admit ignorance dramatically cuts hallucination.
  • Ask for citations. Numbered context blocks let the model point back to sources, and let users verify.
  • Mind the context window. If retrieved chunks exceed the budget, prioritize the highest-ranked ones rather than truncating arbitrarily.

When building on the Claude API, you can place retrieved context in the system prompt or as a leading user turn, and use prompt caching on the static instruction portion to cut cost and latency when the same system prompt is reused across many queries. Check the current model IDs and pricing in the official docs before wiring it up — don't hardcode assumptions about model names.

Step 6: Evaluation and Iteration

You cannot improve what you don't measure. Build a small evaluation set of real questions with known good answers, then track:

  • Retrieval hit rate — did the correct chunk appear in the top k?
  • Answer faithfulness — is the response actually supported by the retrieved context?
  • Answer relevance — does it address the question asked?

Most RAG quality problems are retrieval problems, not generation problems. If answers are wrong, inspect what got retrieved first. Iterate on chunk size, overlap, hybrid weighting, and re-ranking before blaming the LLM.

Production Considerations

  • Keep the index fresh. Set up incremental re-ingestion when source documents change, and store an updated_at so stale chunks can be purged.
  • Handle access control. Filter retrieval by user permissions; never let the model see chunks the user shouldn't.
  • Log queries and retrieved chunks. This is your debugging goldmine and your eval data source.
  • Show citations in the UI. Trust comes from letting users click through to the source.
  • Watch cost. Embedding and re-ranking add up at scale; cache aggressively and batch where you can.

FAQ

How is RAG different from fine-tuning? Fine-tuning bakes new behavior or style into the model's weights and is expensive to update. RAG injects fresh, factual knowledge at query time and updates instantly when you re-index documents. Use fine-tuning to change how a model responds; use RAG to change what it knows. They can be combined.

How much data do I need before RAG is worth it? Even a few dozen documents benefit. RAG shines anytime answers live in text the base model never saw — internal wikis, product docs, contracts, support tickets.

Why does my RAG system still hallucinate? Usually the right chunk wasn't retrieved, so the model improvised. Check retrieval first: tune chunking, add hybrid search and a re-ranker, and increase k. Also strengthen the prompt instruction to answer only from context and to say "I don't know" otherwise.

What chunk size should I use? Start at 500–1,000 tokens with ~10% overlap, then test against your eval set. Dense reference material often does better with smaller chunks; narrative content with larger ones.

Do I need a dedicated vector database? Not initially. pgvector or Chroma is plenty for prototypes and even modest production loads. Move to a specialized store like Qdrant or Pinecone when you outgrow it on scale or filtering needs.

Can RAG cite its sources? Yes — that's a core strength. Carry metadata through ingestion, number your context blocks, and instruct the model to cite them. Then surface those citations in your interface.

Wrapping Up

A working RAG knowledge base comes down to a disciplined pipeline: chunk thoughtfully, embed consistently, retrieve well (hybrid + re-rank), ground the prompt, and evaluate relentlessly. Start simple — a local vector store and a handful of documents — prove that retrieval surfaces the right passages, then scale the infrastructure. Get retrieval right and the rest of the system falls into place.

Related Articles