
Building a RAG Pipeline from Scratch

March 15, 2026

Retrieval-Augmented Generation (RAG) has quickly become the go-to pattern for grounding large language models in domain-specific knowledge. Instead of fine-tuning a model (expensive and brittle), you retrieve relevant documents at query time and inject them into the prompt context.

This post walks through the core components of a RAG pipeline I built using Node.js, LangChain, ChromaDB, and a locally-running Ollama instance.

Why Local?

Running inference locally with Ollama means zero API costs during development, no data leaves your machine, and latency is predictable. For a knowledge base over internal documentation, this is a significant advantage.

The Pipeline

A RAG system has three distinct phases:

  1. Ingestion — load documents, split them into chunks, embed each chunk, store in a vector database
  2. Retrieval — embed the user query, find the k nearest chunks by cosine similarity
  3. Generation — stuff the retrieved chunks into a prompt and let the LLM answer

```javascript
// Simplified ingestion — `loader` and `splitter` are a LangChain document
// loader and text splitter set up earlier; package paths may vary by version.
import { OllamaEmbeddings } from "@langchain/ollama";
import { Chroma } from "@langchain/community/vectorstores/chroma";

const docs = await loader.load();
const chunks = await splitter.splitDocuments(docs); // splitDocuments is async
const embeddings = new OllamaEmbeddings({ model: "nomic-embed-text" });
await Chroma.fromDocuments(chunks, embeddings, { collectionName: "kb" });
```
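Retrieval and generation are easy to demystify without the vector store: cosine similarity over embeddings is essentially what ChromaDB computes under the hood. Here is a minimal plain-JavaScript sketch of phases 2 and 3 (the `kNearest` and `buildPrompt` helpers and their shapes are mine for illustration, not part of any library):

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Phase 2: return the k chunks whose embeddings are closest to the query's.
function kNearest(queryEmbedding, chunks, k) {
  return chunks
    .map((c) => ({ ...c, score: cosine(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Phase 3: stuff the retrieved chunks into a grounding prompt.
function buildPrompt(question, retrieved) {
  const context = retrieved.map((c) => c.text).join("\n---\n");
  return `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;
}
```

In the real pipeline the query embedding comes from the same `nomic-embed-text` model used at ingestion time; mixing embedding models between ingestion and retrieval silently breaks similarity scores.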

Chunk Size Matters

Chunk size is the most sensitive tuning knob. Too small and retrieved chunks lack context; too large and you waste prompt tokens on irrelevant content. A 512-token chunk with a 64-token overlap is a reasonable default for technical documentation.
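To make the overlap idea concrete, here is a deliberately simplified sliding-window splitter. It is word-based rather than token-based (my simplification), and unlike a real splitter such as LangChain's `RecursiveCharacterTextSplitter` it ignores paragraph and sentence boundaries:

```javascript
// Split text into chunks of `size` words, each sharing `overlap` words
// with the previous chunk so ideas aren't severed mid-thought.
// Assumes size > overlap, otherwise the window would never advance.
function splitWithOverlap(text, size, overlap) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = size - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break;
  }
  return chunks;
}
```

The 512/64 default above is the same shape: each chunk carries a 64-token tail of its predecessor, so a sentence that straddles a boundary is retrievable from either side.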

Lessons Learned

  • Embed your metadata. Storing source file paths and section headings in the vector store lets you surface citations alongside answers.
  • Hybrid search helps. Pure semantic search misses exact keyword matches. A BM25 pre-filter before the vector search improved recall noticeably.
  • Evaluate early. Build a small golden set of question/answer pairs from day one and measure retrieval recall against it as you tune.
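The hybrid-search idea from the second bullet can be sketched in a few lines. This uses a bare keyword-overlap count as a stand-in for BM25 (real BM25 also weights terms by rarity and normalizes for chunk length, so treat this as the shape of the technique, not the scoring function I ran in production):

```javascript
// Keyword pre-filter: score each chunk by how many distinct query terms
// it contains, then keep only the top candidates for the vector search.
// A substring-containment check stands in for real tokenization here.
function keywordPrefilter(query, chunks, keep) {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  return chunks
    .map((chunk) => {
      const text = chunk.text.toLowerCase();
      const hits = terms.filter((t) => text.includes(t)).length;
      return { ...chunk, hits };
    })
    .filter((c) => c.hits > 0)
    .sort((a, b) => b.hits - a.hits)
    .slice(0, keep);
}
```

Running the semantic search only over the surviving candidates is what rescues queries like error codes and function names, where exact lexical matches matter more than meaning.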

RAG is not magic, but it is remarkably effective when the retrieval step is solid. The quality of your answers is bounded by the quality of what you retrieve.