Building a RAG Pipeline from Scratch
Retrieval-augmented generation (RAG) is one of the most practical patterns in LLM applications. Instead of fine-tuning the model on your data, you retrieve relevant context at query time and include it in the prompt. Here is how to build one from scratch.
The Architecture
  User Query
       │
       ▼
┌─────────────┐     ┌───────────────┐
│  Embedding  │────▶│ Vector Store  │
│    Model    │     │   (search)    │
└─────────────┘     └───────┬───────┘
                            │ Top-K chunks
                            ▼
                    ┌───────────────┐
                    │  LLM Prompt   │
                    │ Context+Query │
                    └───────┬───────┘
                            │
                            ▼
                        Response
Step 1: Chunking Documents
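Chunk quality largely determines retrieval quality, and therefore answer quality. Naive splitting by character count cuts through sentences and paragraphs, so a retrieved chunk can lose the context that makes it meaningful. Splitting on paragraph boundaries is a better baseline: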
def chunk_by_paragraphs(text, max_tokens=500, overlap=50):
    # Token counts are approximated by whitespace-separated words.
    # overlap acts as an on/off flag: any non-zero value carries the
    # last paragraph of a chunk over into the next one.
    paragraphs = text.split("\n\n")
    chunks = []
    current = []
    current_len = 0
    for para in paragraphs:
        para_len = len(para.split())
        if current_len + para_len > max_tokens and current:
            chunks.append("\n\n".join(current))
            # Keep last paragraph for overlap
            current = current[-1:] if overlap else []
            current_len = len(current[0].split()) if current else 0
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
Better approaches:
- Recursive splitting: Split by headers, then paragraphs, then sentences
- Semantic chunking: Use embeddings to find natural break points
- Document-aware: Respect markdown headers, code blocks, tables
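As a rough illustration of the recursive approach listed above, here is a minimal sketch. It assumes markdown-style `#` headers, approximates tokens by word count, and uses hypothetical names (`LEVELS`, `word_count`, `recursive_split`) that are not part of any library. It tries the coarsest boundary first and only descends to a finer one for pieces that are still too large.

```python
import re

# (split pattern, joiner) pairs, coarsest first. The ordering is an assumption.
LEVELS = [
    (r"\n(?=#{1,6} )", "\n"),      # before a markdown header line
    (r"\n\s*\n", "\n\n"),          # blank line between paragraphs
    (r"(?<=[.!?])\s+", " "),       # whitespace after sentence-ending punctuation
]

def word_count(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def recursive_split(text: str, max_words: int = 500, level: int = 0) -> list[str]:
    """Split text into chunks of at most max_words, preferring coarse boundaries."""
    text = text.strip()
    if not text:
        return []
    if word_count(text) <= max_words:
        return [text]
    if level == len(LEVELS):
        # No separators left: hard cut on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    pattern, joiner = LEVELS[level]
    chunks, current = [], ""
    for part in re.split(pattern, text):
        part = part.strip()
        if not part:
            continue
        candidate = f"{current}{joiner}{part}" if current else part
        if word_count(candidate) <= max_words:
            current = candidate          # greedily pack adjacent pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if word_count(part) <= max_words:
            current = part
        else:
            # Still too large: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(part, max_words, level + 1))
    if current:
        chunks.append(current)
    return chunks

sample = "# A\n\none two three. four five six.\n\nseven eight.\n\n# B\n\nnine ten"
pieces = recursive_split(sample, max_words=5)
```

Every piece respects the word budget, and the words come out in their original order. A production splitter would also attach each header to the text that follows it and count real tokens.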
Step 2: Embeddings
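import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    # One request embeds the whole batch; results come back in input order.
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]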
Alternatives to OpenAI embeddings:
- Cohere embed-v3: Great multilingual support
- BGE-M3: Open source, runs locally
- Voyage AI: Strong for code retrieval
Step 3: Vector Storage
For prototyping, you do not need a vector database. NumPy works:
import numpy as np

class SimpleVectorStore:
    def __init__(self):
        self.vectors = []
        self.documents = []

    def add(self, docs: list[str], vecs: list[list[float]]):
        self.documents.extend(docs)
        self.vectors.extend(vecs)

    def search(self, query_vec: list[float], k: int = 5):
        matrix = np.array(self.vectors)
        query = np.array(query_vec)
        # Cosine similarity
        scores = matrix @ query / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
        )
        top_k = np.argsort(scores)[-k:][::-1]
        return [(self.documents[i], scores[i]) for i in top_k]
For production, use pgvector (PostgreSQL), Qdrant, or Pinecone.
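One possible refinement of the NumPy prototype, sketched here with hypothetical helper names (`normalize`, `top_k`) and made-up 2-D vectors, is to normalize vectors once when they are stored. Cosine similarity then reduces to a plain dot product at query time, and the norms are not recomputed on every search.

```python
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; all-zero rows are left as zeros."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0, 1.0, norms)

def top_k(matrix_unit: np.ndarray, query: np.ndarray, k: int = 5):
    """Return (row index, cosine score) pairs for the k most similar rows."""
    q = normalize(query.reshape(1, -1))[0]
    scores = matrix_unit @ q              # dot product of unit vectors = cosine
    k = min(k, len(scores))
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors, invented for illustration only.
docs = ["cats", "dogs", "stocks"]
vectors = normalize(np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 2.0]]))
results = top_k(vectors, np.array([2.0, 0.0]), k=2)
ranked_docs = [docs[i] for i, _ in results]
```

Here the query points in the same direction as the "cats" vector, so that row ranks first with a score of 1.0, followed by "dogs" at 0.8. Vector length has no effect on the ranking.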
Step 4: Query Pipeline
def query_rag(question: str, store: SimpleVectorStore):
    # Embed the question
    q_vec = embed([question])[0]
    # Retrieve relevant chunks
    results = store.search(q_vec, k=5)
    context = "\n\n---\n\n".join([doc for doc, _ in results])
    # Generate answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model available to the OpenAI client
        messages=[{
            "role": "user",
            "content": f"""Answer based on the following context.
If the context does not contain enough information, say so.
Context:
{context}
Question: {question}"""
        }]
    )
    return response.choices[0].message.content
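The query step assumes the store has already been filled. The sketch below walks through the whole flow offline: chunk, embed, index, retrieve, build the prompt. To run without an API key it swaps the real embedding model for a toy bag-of-words embedder, which is purely an illustration device. All names here (`tokenize`, `make_toy_embedder`, `unit`, `build_prompt`) are hypothetical, and the final LLM call is left out.

```python
import re
import numpy as np

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def make_toy_embedder(corpus: list[str]):
    """Build a word-count embedder over the corpus vocabulary (NOT a real model)."""
    vocab = {w: i for i, w in enumerate(sorted({w for t in corpus for w in tokenize(t)}))}

    def embed(texts: list[str]) -> np.ndarray:
        out = np.zeros((len(texts), len(vocab)))
        for row, text in enumerate(texts):
            for word in tokenize(text):
                if word in vocab:        # words outside the vocabulary are ignored
                    out[row, vocab[word]] += 1.0
        return out

    return embed

def unit(v: np.ndarray) -> np.ndarray:
    """Normalize rows to unit length, guarding against all-zero rows."""
    return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), 1e-12)

def build_prompt(question: str, context_chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(context_chunks)
    return (
        "Answer based on the following context.\n"
        "If the context does not contain enough information, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

documents = [
    "Paris is the capital of France.\n\nBerlin is the capital of Germany.",
    "Python is a programming language.",
]

# 1. Chunk: one chunk per paragraph.
chunks = [p.strip() for doc in documents for p in doc.split("\n\n") if p.strip()]

# 2. Embed and index: one unit-length row per chunk.
embed = make_toy_embedder(chunks)
index = unit(embed(chunks))

# 3. Retrieve: cosine similarity between the question and every chunk.
question = "What is the capital of France?"
scores = index @ unit(embed([question]))[0]
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# 4. Build the prompt that would be sent to the LLM.
prompt = build_prompt(question, top_chunks)
```

With a real embedding model, only steps 2 and 3 change: the toy embedder is replaced by the API call, and the rest of the flow stays the same.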
Common Pitfalls
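- Chunks too large: A 2000-token chunk usually mixes several topics, so its embedding matches less precisely and much of the retrieved text is irrelevant to the question. As a starting point, keep chunks under about 500 tokens.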
- Chunks too small: Single sentences lose context. Include surrounding paragraphs.
- No reranking: Embedding similarity is approximate. Add a reranker (Cohere, cross-encoder) for better precision.
- Ignoring metadata: Filter by date, source, or category before vector search.
- No evaluation: Measure retrieval quality separately from generation quality.
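To make the last point concrete, retrieval can be scored on its own with a small labeled set of questions and the chunk ids that should come back for each. The sketch below computes recall@k and mean reciprocal rank (MRR). The function names and the chunk ids (`c1`, `c2`, ...) are invented for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over (retrieved ids, relevant ids) pairs."""
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }

# Two hypothetical test questions with hand-labeled relevant chunks.
runs = [
    (["c3", "c1", "c7"], {"c1"}),        # the relevant chunk is at rank 2
    (["c2", "c9", "c4"], {"c4", "c5"}),  # one of two relevant chunks, at rank 3
]
metrics = evaluate(runs, k=3)
```

Tracking these numbers while changing chunk size, embedding model, or reranker shows whether a change helped retrieval, independently of how well the LLM phrases the final answer.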
What Frameworks Add
LangChain and LlamaIndex provide document loaders, splitters, and integrations. But understanding the underlying pipeline helps you debug when things go wrong, and they will.
Start simple, measure, then add complexity where it helps.