AI & Machine Learning

RAG Systems Explained: Build Your First Retrieval-Augmented Generation App

Uvin Vindula·February 12, 2024·13 min read
TL;DR

RAG systems let you give an LLM access to your own data without fine-tuning. You chunk your documents, convert them to vector embeddings, store them in a vector database, then at query time you fetch the most relevant chunks and pass them as context to the LLM. I build RAG pipelines using Supabase pgvector for storage and Claude as the generation layer. This guide walks through the full architecture — from ingesting documents to generating grounded answers — with production TypeScript code you can deploy today. If your app needs to answer questions about private data (internal docs, product catalogs, legal documents), RAG is the pattern. It is cheaper, faster to ship, and more controllable than fine-tuning. By the end of this article you will have a working RAG pipeline and a clear understanding of where each piece fits.


What RAG Actually Is

Retrieval-augmented generation is a pattern, not a product. The idea is straightforward: before you ask an LLM to generate a response, you retrieve relevant information from your own data store and include it in the prompt. The LLM then generates its answer grounded in that context.

I have shipped RAG systems for client projects ranging from internal knowledge bases to product recommendation engines. Every single one follows the same fundamental loop:

  1. User asks a question
  2. The system converts that question into a vector embedding
  3. That embedding is compared against a database of pre-computed document embeddings
  4. The most similar documents are retrieved
  5. Those documents are injected into the prompt as context
  6. The LLM generates an answer grounded in that context

That is retrieval-augmented generation in its entirety. No magic. No black box. Just search plus generation.

The term was coined by Facebook AI Research in their 2020 paper, but the pattern predates the name. Search engines have been doing retrieval for decades. What changed is that we now have embedding models that can represent semantic meaning as vectors, and LLMs that can synthesize retrieved information into coherent, contextual answers.

Here is the critical distinction: RAG does not teach the model anything new. It does not change the model's weights. It gives the model temporary access to information at inference time. Think of it as an open-book exam versus memorizing the textbook.

This distinction matters because it defines what RAG is good at and what it is not. RAG excels when you need:

  • Answers grounded in specific documents — company policies, product specs, legal texts
  • Up-to-date information — the retrieval layer can be updated without touching the model
  • Source attribution — you can point to exactly which document informed the answer
  • Data privacy — your documents never leave your infrastructure

RAG struggles when you need the model to learn a fundamentally new skill or adopt a completely different reasoning style. For that, you are looking at fine-tuning. But in my experience, 90% of real-world use cases are RAG territory.

Why RAG Beats Fine-Tuning

I get asked about fine-tuning constantly. Clients assume they need to train a custom model. Most of the time, they do not.

Here is the honest comparison based on projects I have actually shipped:

Cost. Fine-tuning a model requires preparing training data, running training jobs, and maintaining the fine-tuned model. With Claude or GPT-4, that is thousands of dollars before you write a line of application code. A RAG pipeline costs the embedding computation (pennies per thousand documents) plus vector storage (Supabase free tier handles most MVPs).

Time to production. I can ship a RAG system in a week. Fine-tuning takes weeks of data preparation alone, then iteration cycles on top. For the client projects I take on (see iamuvin.com/services), speed matters.

Maintainability. When your data changes — and it always changes — RAG lets you re-embed the new documents and you are done. With fine-tuning, you re-train. Every time. On the full dataset, or you risk catastrophic forgetting.

Accuracy control. RAG gives you a retrieval step you can inspect, debug, and tune independently of the generation step. If the answer is wrong, you can check: did we retrieve the right documents? If yes, the prompt needs work. If no, the embedding or chunking needs work. Fine-tuning is a black box by comparison.

Hallucination reduction. This is the big one. By providing the model with source documents in the prompt, you dramatically reduce hallucination. The model has the facts right there. It does not need to rely on parametric memory.

The one case where fine-tuning wins: when you need the model to adopt a specific tone, format, or reasoning pattern that is difficult to achieve with prompting alone. I have done this exactly once in production, for a client who needed a very specific medical terminology style. Everything else has been RAG.

The Architecture: Ingest, Embed, Store, Query, Generate

Every RAG system I build follows five stages. Here is the full pipeline:

Stage 1 — Ingest. Raw documents enter the system. PDFs, markdown files, database records, API responses. You parse them into plain text and split them into chunks. Chunk size matters more than people think — too large and you waste context window space on irrelevant text, too small and you lose the semantic coherence the embedding model needs.

Stage 2 — Embed. Each chunk gets converted into a vector embedding using an embedding model. I use Voyage AI or OpenAI's text-embedding-3-small for this. The embedding captures the semantic meaning of the chunk as a high-dimensional vector (typically 1536 dimensions).

Stage 3 — Store. The embeddings go into a vector database. I use Supabase with the pgvector extension because it gives me a full Postgres database with vector search built in. No separate infrastructure to manage. The vectors sit alongside your relational data.

Stage 4 — Query. When a user asks a question, the question is embedded using the same model. That query vector is compared against all stored vectors using cosine similarity (or inner product, depending on your setup). The top-K most similar chunks are returned.
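To make the similarity comparison concrete, here is what cosine similarity actually computes. pgvector does this in-database via its distance operator, so you never write this yourself; the standalone function below is purely illustrative:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
// Higher means the two vectors point in more similar directions,
// i.e. the two texts are semantically closer.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Note that pgvector's `<=>` operator returns cosine *distance*, which is why the SQL in the storage section computes `1 - (embedding <=> query_embedding)` to recover similarity.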

Stage 5 — Generate. The retrieved chunks are injected into the system prompt along with the user's question. The LLM — Claude in my case — generates an answer grounded in those chunks.

Here is how this looks as a data flow:

Documents → Chunking → Embedding Model → Vector DB (Supabase pgvector)
                                              ↓
User Query → Embedding Model → Similarity Search → Top-K Chunks
                                                        ↓
                                              Claude API + Chunks → Answer

Each stage is independently testable and tunable. That is the beauty of this architecture. You can swap the embedding model, change the chunk size, adjust K, or switch LLMs without rebuilding the whole system.
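That swappability can be captured directly in types. Here is a minimal sketch of the query path with each stage behind an interface (the interface and function names are mine for illustration, not any library's API):

```typescript
// Each stage is an interface, so any implementation can be swapped
// without touching the others: a different embedding model, a different
// vector store, a different LLM.
interface Embedder {
  embed(texts: string[]): Promise<number[][]>;
}

interface VectorStore {
  search(embedding: number[], k: number): Promise<{ content: string }[]>;
}

interface Generator {
  answer(question: string, context: string[]): Promise<string>;
}

// The query path: embed the question, retrieve top-K chunks, generate.
async function ragQuery(
  question: string,
  embedder: Embedder,
  store: VectorStore,
  generator: Generator,
  k = 5
): Promise<string> {
  const [queryEmbedding] = await embedder.embed([question]);
  const chunks = await store.search(queryEmbedding, k);
  return generator.answer(question, chunks.map((c) => c.content));
}
```

Because every dependency arrives as a parameter, each stage can also be stubbed out in tests, which is exactly the "independently testable" property described above.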

Building the Embedding Pipeline

Let me walk through the code. This is TypeScript, production-grade, the same patterns I use in client projects.

First, the document chunking. I use a recursive character splitter because it respects natural text boundaries (paragraphs, sentences) rather than cutting arbitrarily:

typescript
interface DocumentChunk {
  content: string;
  metadata: {
    source: string;
    chunkIndex: number;
    totalChunks: number;
  };
}

function chunkDocument(
  text: string,
  source: string,
  maxChunkSize = 1000,
  overlap = 200
): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  const separators = ["\n\n", "\n", ". ", " "];

  function splitText(text: string, separatorIndex: number): string[] {
    if (separatorIndex >= separators.length) {
      const result: string[] = [];
      for (let i = 0; i < text.length; i += maxChunkSize) {
        result.push(text.slice(i, i + maxChunkSize));
      }
      return result;
    }

    const separator = separators[separatorIndex];
    const parts = text.split(separator);
    const merged: string[] = [];
    let current = "";

    for (const part of parts) {
      const candidate = current ? current + separator + part : part;
      if (candidate.length > maxChunkSize && current) {
        merged.push(current);
        current = part;
      } else {
        current = candidate;
      }
    }
    if (current) merged.push(current);

    return merged.flatMap((segment) =>
      segment.length > maxChunkSize
        ? splitText(segment, separatorIndex + 1)
        : [segment]
    );
  }

  const rawChunks = splitText(text, 0);

  for (let i = 0; i < rawChunks.length; i++) {
    let content = rawChunks[i];

    if (i > 0 && overlap > 0) {
      const previousEnd = rawChunks[i - 1].slice(-overlap);
      content = previousEnd + content;
    }

    chunks.push({
      content: content.trim(),
      metadata: {
        source,
        chunkIndex: i,
        totalChunks: rawChunks.length,
      },
    });
  }

  return chunks;
}

The overlap parameter is critical. Without it, information that spans a chunk boundary gets lost. With 200 characters of overlap, the retriever can still find relevant content even if it was split across two chunks.

Next, the embedding function. I wrap the API call with retry logic because embedding large document sets means hundreds of API calls:

typescript
const EMBEDDING_MODEL = "text-embedding-3-small";
const EMBEDDING_DIMENSIONS = 1536;

async function embedTexts(texts: string[]): Promise<number[][]> {
  const response = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: EMBEDDING_MODEL,
      input: texts,
      dimensions: EMBEDDING_DIMENSIONS,
    }),
  });

  if (!response.ok) {
    throw new Error(`Embedding API error: ${response.status}`);
  }

  const data = await response.json();
  return data.data.map(
    (item: { embedding: number[] }) => item.embedding
  );
}

async function embedWithRetry(
  texts: string[],
  maxRetries = 3
): Promise<number[][]> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await embedTexts(texts);
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      const delay = Math.pow(2, attempt) * 1000;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("Unreachable");
}

I use OpenAI's embedding model even though I use Claude for generation. There is no loyalty requirement between embedding and generation models. Pick the best tool for each job. OpenAI's text-embedding-3-small is fast, cheap, and produces high-quality embeddings.

Now the ingestion orchestrator that ties chunking and embedding together:

typescript
async function ingestDocuments(
  documents: { text: string; source: string }[]
): Promise<{ content: string; embedding: number[]; metadata: Record<string, unknown> }[]> {
  const allChunks: DocumentChunk[] = [];

  for (const doc of documents) {
    const chunks = chunkDocument(doc.text, doc.source);
    allChunks.push(...chunks);
  }

  console.log(`Processing ${allChunks.length} chunks from ${documents.length} documents`);

  const batchSize = 100;
  const results: { content: string; embedding: number[]; metadata: Record<string, unknown> }[] = [];

  for (let i = 0; i < allChunks.length; i += batchSize) {
    const batch = allChunks.slice(i, i + batchSize);
    const texts = batch.map((chunk) => chunk.content);
    const embeddings = await embedWithRetry(texts);

    for (let j = 0; j < batch.length; j++) {
      results.push({
        content: batch[j].content,
        embedding: embeddings[j],
        metadata: batch[j].metadata,
      });
    }

    console.log(`Embedded ${Math.min(i + batchSize, allChunks.length)}/${allChunks.length} chunks`);
  }

  return results;
}

Batching is not optional. Sending one embedding request per chunk is slow and wasteful. The OpenAI embedding API accepts up to 2048 inputs per request. I batch at 100 to stay well within limits while keeping memory usage reasonable.

Vector Storage with Supabase pgvector

Supabase gives you Postgres with the pgvector extension enabled. This means your vectors live in the same database as your application data. No separate vector database to manage, no data synchronization headaches.

Here is the SQL to set up your table:

sql
-- Enable the pgvector extension
create extension if not exists vector;

-- Create the documents table
create table documents (
  id bigserial primary key,
  content text not null,
  metadata jsonb default '{}',
  embedding vector(1536),
  created_at timestamptz default now()
);

-- Create an index for fast similarity search
create index on documents
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);

-- Create the similarity search function
create or replace function match_documents(
  query_embedding vector(1536),
  match_threshold float default 0.7,
  match_count int default 5
)
returns table (
  id bigint,
  content text,
  metadata jsonb,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    documents.metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
$$;

A few things to note here. The ivfflat index type is an approximate nearest neighbor index. It trades a small amount of accuracy for significant speed gains. For most RAG applications, this tradeoff is more than acceptable. The lists = 100 parameter controls the number of clusters; increase it as your dataset grows. pgvector's own guidance is lists = rows / 1000 up to about a million rows, then sqrt(rows) beyond that.

The match_threshold parameter in the search function filters out low-similarity results. I default to 0.7, which in my experience is a good starting point. Below 0.5, you are usually getting noise.

Now the TypeScript code to store and query:

typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

async function storeEmbeddings(
  records: { content: string; embedding: number[]; metadata: Record<string, unknown> }[]
): Promise<void> {
  const batchSize = 500;

  for (let i = 0; i < records.length; i += batchSize) {
    const batch = records.slice(i, i + batchSize);

    const { error } = await supabase.from("documents").insert(
      batch.map((record) => ({
        content: record.content,
        metadata: record.metadata,
        embedding: JSON.stringify(record.embedding),
      }))
    );

    if (error) throw new Error(`Storage error: ${error.message}`);
  }
}

async function searchDocuments(
  query: string,
  matchCount = 5,
  threshold = 0.7
): Promise<{ id: number; content: string; metadata: Record<string, unknown>; similarity: number }[]> {
  const [queryEmbedding] = await embedTexts([query]);

  const { data, error } = await supabase.rpc("match_documents", {
    query_embedding: JSON.stringify(queryEmbedding),
    match_threshold: threshold,
    match_count: matchCount,
  });

  if (error) throw new Error(`Search error: ${error.message}`);

  return data;
}

One thing I learned the hard way: always stringify your embeddings when passing them to Supabase via the JavaScript client. The pgvector extension expects a specific format, and the Supabase client does not handle the conversion automatically for RPC calls.

Query and Generation Layer

This is where everything comes together. The user asks a question, we retrieve relevant context, and Claude generates a grounded answer.

typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY!,
});

interface RAGResponse {
  answer: string;
  sources: { content: string; similarity: number }[];
}

async function generateRAGResponse(
  userQuery: string
): Promise<RAGResponse> {
  const relevantDocs = await searchDocuments(userQuery, 5, 0.7);

  if (relevantDocs.length === 0) {
    return {
      answer:
        "I could not find relevant information in the knowledge base to answer this question.",
      sources: [],
    };
  }

  const context = relevantDocs
    .map(
      (doc, i) =>
        `[Source ${i + 1} | Similarity: ${doc.similarity.toFixed(3)}]\n${doc.content}`
    )
    .join("\n\n---\n\n");

  const message = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: `You are a helpful assistant that answers questions based on the provided context.

RULES:
- Answer ONLY based on the provided context
- If the context does not contain enough information, say so explicitly
- Cite which source(s) informed your answer using [Source N] notation
- Be direct and concise
- Never make up information not present in the context`,
    messages: [
      {
        role: "user",
        content: `Context:\n${context}\n\n---\n\nQuestion: ${userQuery}`,
      },
    ],
  });

  const answer =
    message.content[0].type === "text" ? message.content[0].text : "";

  return {
    answer,
    sources: relevantDocs.map((doc) => ({
      content: doc.content.slice(0, 200),
      similarity: doc.similarity,
    })),
  };
}

The system prompt is doing heavy lifting here. The instruction to answer only based on provided context is what prevents hallucination. The citation requirement ([Source N]) gives users traceability — they can verify the answer against the source material.

I return the sources alongside the answer. In production, I render these as expandable sections in the UI so users can verify. Trust in AI systems comes from transparency, not from hiding the machinery.

Here is how you wire it all together as an API endpoint in Next.js:

typescript
// app/api/ask/route.ts
import { NextRequest, NextResponse } from "next/server";

export async function POST(request: NextRequest) {
  const { question } = await request.json();

  if (!question || typeof question !== "string") {
    return NextResponse.json(
      { error: "Question is required" },
      { status: 400 }
    );
  }

  if (question.length > 1000) {
    return NextResponse.json(
      { error: "Question too long" },
      { status: 400 }
    );
  }

  try {
    const response = await generateRAGResponse(question);
    return NextResponse.json(response);
  } catch (err) {
    // Embedding, retrieval, or generation can each fail independently;
    // never let an unhandled error leak a stack trace to the client.
    console.error("RAG pipeline error:", err);
    return NextResponse.json(
      { error: "Failed to generate answer" },
      { status: 500 }
    );
  }
}

Production Considerations

Shipping a RAG demo is easy. Shipping a RAG system that works reliably at scale is where the real engineering happens. Here is what I have learned from production deployments.

Chunk size tuning. I start at 1000 characters with 200 overlap and adjust based on the data. Legal documents with dense, precise language need smaller chunks (500-600). Conversational content like support tickets can go larger (1200-1500). There is no universal optimal — you have to experiment with your specific dataset.

Embedding model selection. OpenAI's text-embedding-3-small is my default for cost-sensitive projects. For higher accuracy on domain-specific content, text-embedding-3-large at 3072 dimensions is measurably better. Voyage AI's models are worth evaluating if you are working with code or technical documentation. Always benchmark on your own data before deciding.

Hybrid search. Vector similarity alone misses exact keyword matches. If a user searches for "error code ERR_4521," semantic search might not find it because the embedding captures meaning, not exact strings. I combine vector search with full-text search in Postgres using ts_rank and merge the results. Supabase supports this natively:

sql
-- Hybrid search: combine vector similarity with full-text match
create or replace function hybrid_search(
  query_text text,
  query_embedding vector(1536),
  match_count int default 5
)
returns table (
  id bigint,
  content text,
  metadata jsonb,
  similarity float,
  text_rank float
)
language sql stable
as $$
  select
    d.id,
    d.content,
    d.metadata,
    1 - (d.embedding <=> query_embedding) as similarity,
    ts_rank(to_tsvector('english', d.content), plainto_tsquery('english', query_text)) as text_rank
  from documents d
  where
    1 - (d.embedding <=> query_embedding) > 0.5
    or to_tsvector('english', d.content) @@ plainto_tsquery('english', query_text)
  order by
    (1 - (d.embedding <=> query_embedding)) * 0.7 +
    ts_rank(to_tsvector('english', d.content), plainto_tsquery('english', query_text)) * 0.3
    desc
  limit match_count;
$$;

Re-ranking. The initial retrieval casts a wide net. A re-ranker (like Cohere's rerank API or a cross-encoder model) takes the top 20 results and re-scores them with higher accuracy. This consistently improves answer quality. I use it on every production system where latency budget allows the extra 100-200ms.
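The shape of that step is the same regardless of which re-ranker you plug in: retrieve wide, re-score, keep the best few. A sketch, where the `score` argument stands in for whatever cross-encoder or rerank API you choose (the scorer itself is a placeholder, not a real model call):

```typescript
// Generic re-ranking: take a wide candidate set from the first-pass
// retrieval, re-score each (query, document) pair with a more accurate
// but slower scorer, and keep only the top `keep` results.
async function rerank<T extends { content: string }>(
  query: string,
  candidates: T[],
  score: (query: string, doc: string) => Promise<number>,
  keep = 5
): Promise<T[]> {
  const scored = await Promise.all(
    candidates.map(async (c) => ({ c, s: await score(query, c.content) }))
  );
  scored.sort((a, b) => b.s - a.s);
  return scored.slice(0, keep).map((x) => x.c);
}
```

In production the `score` function would wrap a network call, which is why the candidate set is scored concurrently with Promise.all rather than sequentially.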

Caching. Identical questions should not hit the embedding API and vector database every time. I cache at two levels: embedding cache (same text produces same vector, deterministically) and result cache (same question within a TTL window returns cached results). Redis works, but even an in-memory LRU cache eliminates the most common repeated queries.
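The result-cache level can be as small as this. A minimal in-memory TTL cache sketch (parameters like the entry limit are illustrative; in production you would also normalize the question string before using it as a key):

```typescript
// In-memory cache with TTL expiry and a hard size cap. When full, the
// oldest entry is evicted (a Map preserves insertion order).
class TTLCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number, private maxEntries = 1000) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.store.size >= this.maxEntries) {
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

A wrapper around generateRAGResponse would check this cache first and only fall through to the embedding API and vector search on a miss.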

Monitoring. You need to track three metrics: retrieval relevance (are the right documents being found?), generation quality (is the LLM using the context correctly?), and latency per stage (where are the bottlenecks?). I log every query with its retrieved documents and the final answer, then review a sample weekly. This is how you catch drift before users complain.

Document freshness. Set up a pipeline to re-embed documents when they change. For content that updates frequently, I use Supabase Realtime to trigger re-embedding on row changes. For batch updates (like re-indexing a documentation site), I run a nightly cron job that diffs against the last known state and only re-embeds changed content.

Rate limiting and cost control. Embedding and generation both cost money. Rate limit your API endpoint. Set per-user quotas. Monitor spend daily. A single user hammering your RAG endpoint can run up a meaningful bill overnight.
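A per-user quota does not need infrastructure to start with. Here is a minimal fixed-window limiter sketch (the limits are illustrative defaults, and a real deployment behind multiple server instances would back this with Redis instead of process memory):

```typescript
// Fixed-window rate limiter keyed by user id: allow up to maxRequests
// per windowMs, then reject until the window rolls over.
class RateLimiter {
  private counts = new Map<string, { count: number; windowStart: number }>();

  constructor(private maxRequests = 20, private windowMs = 60_000) {}

  allow(userId: string, now = Date.now()): boolean {
    const entry = this.counts.get(userId);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // First request, or the previous window expired: start a new window.
      this.counts.set(userId, { count: 1, windowStart: now });
      return true;
    }
    if (entry.count >= this.maxRequests) return false;
    entry.count++;
    return true;
  }
}
```

The API route would call allow() before touching the embedding API, returning a 429 when it comes back false.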

Key Takeaways

  • RAG is a pattern, not a product: retrieve relevant documents, inject them as context, let the LLM generate grounded answers
  • For 90% of use cases where clients ask about fine-tuning, RAG is the faster, cheaper, more maintainable solution
  • Chunk size and overlap directly impact retrieval quality — start at 1000 characters with 200 overlap, then tune for your data
  • Supabase pgvector keeps your vectors and relational data in one database, eliminating the operational overhead of a separate vector store
  • Hybrid search (vector + full-text) catches what pure semantic search misses, especially for exact matches and technical identifiers
  • Always return sources alongside answers — transparency builds trust and makes debugging straightforward
  • Monitor retrieval relevance, generation quality, and per-stage latency from day one

*Last updated: February 2024*

Written by Uvin Vindula

Uvin Vindula (IAMUVIN) is a Web3 and AI engineer based in Sri Lanka and the United Kingdom. He is the author of The Rise of Bitcoin, Director of Blockchain and Software Solutions at Terra Labz, and founder of uvin.lk — Sri Lanka's Bitcoin education platform with 10,000+ learners.

For development projects: hello@iamuvin.com · Book a call: calendly.com/iamuvin
