Building a RAG Pipeline from Scratch

Nov 20, 2025

Retrieval-augmented generation (RAG) is one of the most practical patterns in LLM applications. Instead of fine-tuning the model on your data, you retrieve relevant context at query time and include it in the prompt. Here is how to build a pipeline from scratch.

The Architecture

User Query
    │
    ▼
┌─────────────┐     ┌──────────────┐
│ Embedding   │────▶│ Vector Store │
│ Model       │     │ (search)     │
└─────────────┘     └──────┬───────┘
                           │ Top-K chunks
                           ▼
                    ┌───────────────┐
                    │  LLM Prompt   │
                    │ Context+Query │
                    └──────┬────────┘
                           │
                           ▼
                       Response

Step 1: Chunking Documents

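The quality of your chunks determines the quality of your RAG. Naive splitting by character count cuts sentences and paragraphs mid-thought, so chunks lose context. Splitting on paragraph boundaries is a better starting point: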

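def chunk_by_paragraphs(text, max_tokens=500, overlap=50):
    # Token counts are approximated by whitespace-separated words.
    # overlap acts as an on/off switch: any non-zero value carries the
    # last paragraph of a chunk into the next one.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    current_len = 0

    for para in paragraphs:
        para_len = len(para.split())
        if current_len + para_len > max_tokens and current:
            chunks.append("\n\n".join(current))
            # Keep the last paragraph for overlap, but only when it still
            # leaves room for the next paragraph within max_tokens
            last_len = len(current[-1].split())
            if overlap and last_len + para_len <= max_tokens:
                current, current_len = current[-1:], last_len
            else:
                current, current_len = [], 0
        # A single paragraph longer than max_tokens still becomes its own chunk
        current.append(para)
        current_len += para_len

    if current:
        chunks.append("\n\n".join(current))
    return chunks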

Better approaches:

Recursive splitting: Split by headers, then paragraphs, then sentences
Semantic chunking: Use embeddings to find natural break points
Document-aware: Respect markdown headers, code blocks, tables
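
The recursive splitting idea from the list above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the name `recursive_split`, the separator order, and the use of word counts in place of real token counts are all assumptions made for the example.

```python
def recursive_split(text, max_words=500, separators=("\n\n", "\n", ". ")):
    """Split on the coarsest separator first; recurse into finer separators
    only for parts that are still too long.

    Hypothetical helper: word counts stand in for real token counts.
    """
    if len(text.split()) <= max_words:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard split on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), max_words)]

    sep, finer = separators[0], separators[1:]
    chunks, current, current_len = [], [], 0
    for part in text.split(sep):
        part_len = len(part.split())
        if part_len == 0:
            continue
        if part_len > max_words:
            # Flush what we have, then split the oversized part more finely.
            if current:
                chunks.append(sep.join(current))
                current, current_len = [], 0
            chunks.extend(recursive_split(part, max_words, finer))
            continue
        if current and current_len + part_len > max_words:
            chunks.append(sep.join(current))
            current, current_len = [], 0
        current.append(part)
        current_len += part_len
    if current:
        chunks.append(sep.join(current))
    return chunks
```

Small neighbouring parts are merged back together up to the limit, so a short paragraph does not end up as a chunk of its own unless it has to.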

Step 2: Embeddings

import openai

client = openai.OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]
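
When indexing more than a handful of chunks, it usually helps to send them in batches instead of one very large request. The helper below is a sketch under assumptions: `embed_in_batches` and the default `batch_size=100` are hypothetical, and `embed_fn` stands for any function with the same shape as `embed` above.

```python
from typing import Callable


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,  # assumed value; tune to your provider's limits
) -> list[list[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors
```

Passing the embedding function in as a parameter also makes the helper easy to test with a fake embedder that never touches the network.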

Alternatives to OpenAI embeddings:

Cohere embed-v3: Great multilingual support
BGE-M3: Open source, runs locally
Voyage AI: Strong for code retrieval

Step 3: Vector Storage

For prototyping, you do not need a vector database. NumPy works:

import numpy as np

class SimpleVectorStore:
    def __init__(self):
        self.vectors = []
        self.documents = []

    def add(self, docs: list[str], vecs: list[list[float]]):
        self.documents.extend(docs)
        self.vectors.extend(vecs)

    def search(self, query_vec: list[float], k: int = 5):
        if not self.vectors:
            return []  # nothing indexed yet
        matrix = np.array(self.vectors)
        query = np.array(query_vec)
        # Cosine similarity between the query and every stored vector
        scores = matrix @ query / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
        )
        # Indices of the k highest scores, best first
        top_k = np.argsort(scores)[-k:][::-1]
        return [(self.documents[i], float(scores[i])) for i in top_k]
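
The store above rebuilds the matrix and recomputes every norm on each search. One possible variation, sketched here with a hypothetical `NormalizedVectorStore` class, is to normalize vectors once when they are added, so that a search reduces to a single matrix-vector product. It assumes no zero vectors are ever inserted.

```python
import numpy as np


class NormalizedVectorStore:
    """Hypothetical variant: vectors are unit-normalized on insert."""

    def __init__(self):
        self.documents: list[str] = []
        self.matrix = None  # shape (n_docs, dim) once populated

    def add(self, docs: list[str], vecs: list[list[float]]) -> None:
        new = np.asarray(vecs, dtype=np.float32)
        # Assumes no zero vectors; a zero row would divide by zero here.
        new = new / np.linalg.norm(new, axis=1, keepdims=True)
        self.matrix = new if self.matrix is None else np.vstack([self.matrix, new])
        self.documents.extend(docs)

    def search(self, query_vec: list[float], k: int = 5) -> list[tuple[str, float]]:
        if self.matrix is None:
            return []
        query = np.asarray(query_vec, dtype=np.float32)
        query = query / np.linalg.norm(query)
        # For unit vectors, the dot product equals cosine similarity.
        scores = self.matrix @ query
        top_k = np.argsort(scores)[::-1][:k]
        return [(self.documents[i], float(scores[i])) for i in top_k]
```

For a few thousand chunks the difference is negligible; the trade-off only starts to matter as the corpus grows.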

For production, use pgvector (PostgreSQL), Qdrant, or Pinecone.

Step 4: Query Pipeline

import anthropic

# The model below is a Claude model, so it is called through the Anthropic
# SDK; the OpenAI client from Step 2 is still used for embeddings.
llm = anthropic.Anthropic()

def query_rag(question: str, store: SimpleVectorStore):
    # Embed the question
    q_vec = embed([question])[0]

    # Retrieve relevant chunks
    results = store.search(q_vec, k=5)
    context = "\n\n---\n\n".join([doc for doc, _ in results])

    # Generate answer
    response = llm.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,  # required by the Messages API; adjust as needed
        messages=[{
            "role": "user",
            "content": f"""Answer based on the following context.
If the context does not contain enough information, say so.

Context:
{context}

Question: {question}"""
        }]
    )
    return response.content[0].text
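
`query_rag` assumes the store has already been populated. The indexing side can be sketched as a small piece of glue code. The name `build_index` and its parameters are hypothetical: `chunk_fn` is any function from a string to a list of chunks, `embed_fn` any function from a list of strings to a list of vectors, and `store` any object with an `add(docs, vecs)` method.

```python
def build_index(documents, chunk_fn, embed_fn, store):
    """Chunk each document, embed the chunks, and add them to the store.

    Hypothetical glue code; returns the number of chunks indexed.
    """
    total = 0
    for doc in documents:
        chunks = chunk_fn(doc)
        if not chunks:
            continue  # skip empty documents rather than sending an empty request
        store.add(chunks, embed_fn(chunks))
        total += len(chunks)
    return total
```

With the pieces from the earlier steps, this would be called as `build_index(docs, chunk_by_paragraphs, embed, store)` on a fresh `SimpleVectorStore` before the first call to `query_rag`.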

Common Pitfalls

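Chunks too large: With 2000-token chunks, most of the retrieved text is irrelevant to the question and the model tends to overlook the part that matters. As a starting point, keep chunks under about 500 tokens.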
Chunks too small: Single sentences lose context. Include surrounding paragraphs.
No reranking: Embedding similarity is approximate. Add a reranker (Cohere, cross-encoder) for better precision.
Ignoring metadata: Filter by date, source, or category before vector search.
No evaluation: Measure retrieval quality separately from generation quality.
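
Measuring retrieval on its own needs little more than a list of test questions and, for each, the IDs of the chunks that should come back. The sketch below shows two common metrics, hit rate at k and mean reciprocal rank. The function names and the idea of identifying chunks by string IDs are assumptions for the example.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    if not retrieved:
        return 0.0
    hits = sum(1 for ids, gold in zip(retrieved, relevant) if gold & set(ids[:k]))
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant chunk (0 when none is found)."""
    if not retrieved:
        return 0.0
    total = 0.0
    for ids, gold in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ids, start=1):
            if chunk_id in gold:
                total += 1.0 / rank
                break
    return total / len(retrieved)
```

Here `retrieved[i]` is the ranked list of chunk IDs returned for question `i`, and `relevant[i]` is the set of IDs a human marked as correct for it. If these numbers are low, no amount of prompt tuning in the generation step will fix the answers.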

What Frameworks Add

LangChain and LlamaIndex provide document loaders, splitters, and integrations. But understanding the underlying pipeline helps you debug when things go wrong, and they will.

Start simple, measure, then add complexity where it helps.