RAG step by step for your SaaS: from PDF to chatbot in 2 hours

May 16, 2026 · 8 min · AI, RAG, Next.js, Tutorial

Short answer (60 seconds): RAG (Retrieval-Augmented Generation) is feeding the LLM only the relevant pieces of your documents before it answers. Three steps: (1) index your documents as vectors (embeddings), (2) search for the chunks most similar to the user's question, and (3) inject those chunks into the LLM prompt. In this post you implement all three with Next.js, Supabase pgvector and OpenAI — real code, in 2 hours, no abstractions.

The question isn't "what is RAG?" — any Twitter thread can explain it. The question is "how do I implement it well in my SaaS?" without over-engineering and without unnecessary dependencies.

This post goes step by step, with real code you can copy into your project. The stack is Next.js 15 + Supabase + OpenAI because that's what I see most in LATAM startups in 2026, but the logic translates to any other stack.

What you'll have at the end

A /api/ask endpoint that:

  1. Takes a user's question.
  2. Searches the 5 most relevant chunks in your Supabase.
  3. Passes those chunks as context to OpenAI.
  4. Returns the LLM's answer, grounded in your documents.
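
The first step, taking the user's question, is worth guarding before any tokens are spent. A minimal sketch of a validator for the `question` field; `parseAskRequest` and `MAX_QUESTION_CHARS` are hypothetical names and the length limit is an arbitrary assumption:

```typescript
// Hypothetical upper bound so a huge payload never reaches the embeddings call.
const MAX_QUESTION_CHARS = 2000;

interface AskRequest {
  question: string;
}

// Validates the parsed JSON body and returns a trimmed question, or throws.
function parseAskRequest(body: unknown): AskRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("Request body must be a JSON object");
  }
  const question = (body as Record<string, unknown>)["question"];
  if (typeof question !== "string" || question.trim().length === 0) {
    throw new Error("question must be a non-empty string");
  }
  const trimmed = question.trim();
  if (trimmed.length > MAX_QUESTION_CHARS) {
    throw new Error(`question must be at most ${MAX_QUESTION_CHARS} characters`);
  }
  return { question: trimmed };
}
```

In a route handler you would call it on the result of `await req.json()` and turn a thrown error into a 400 response.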

Total time: ~2 hours if you already have Next.js + Supabase running. Starting from scratch, ~3 hours.

Prerequisites

  • Next.js 15 project with App Router.
  • Supabase account with a project created.
  • OpenAI API key (~USD 5 is enough to experiment).
  • Documents to index (PDFs, markdown, plain text — whatever).

Step 1 · Enable pgvector and create the embeddings table

In Supabase's SQL Editor, run:

```sql
-- Enable the pgvector extension
create extension if not exists vector;

-- Table for indexed chunks
create table documents (
  id bigserial primary key,
  content text not null,
  metadata jsonb,
  embedding vector(1536), -- text-embedding-3-small produces 1536 dims
  tenant_id uuid not null, -- multi-tenancy from day 1
  created_at timestamptz default now()
);

-- Index for fast search (HNSW for up to ~1M vectors)
create index documents_embedding_idx on documents
  using hnsw (embedding vector_cosine_ops);

-- Per-tenant index so the filter is cheap
create index documents_tenant_idx on documents(tenant_id);

-- RLS so each customer sees only their chunks
alter table documents enable row level security;

create policy "tenants can only access their own documents"
  on documents
  using (tenant_id = auth.uid());
```

Why vector(1536): that's the dimension produced by OpenAI's text-embedding-3-small. If you use another model (Voyage AI, Cohere), change the column dimension to match that model's output size.
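
pgvector rejects rows whose vector length differs from the column dimension, so a mismatch only shows up at insert time. A small guard can surface it earlier. This is a sketch; `EMBEDDING_DIM` and `assertEmbeddingDim` are hypothetical names, not part of any library:

```typescript
// Must match the N in the vector(N) column of the documents table.
const EMBEDDING_DIM = 1536;

// Throws before the insert if an embedding would not fit the column.
function assertEmbeddingDim(embedding: number[], expected: number = EMBEDDING_DIM): void {
  if (embedding.length !== expected) {
    throw new Error(
      `Embedding has ${embedding.length} dimensions, expected ${expected}. ` +
        `Check that the embedding model matches the vector(${expected}) column.`
    );
  }
}
```

Calling it once per batch in the indexing script is enough, since every embedding from the same model has the same length.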

Why HNSW instead of IVFFlat: HNSW is faster on queries and doesn't require retraining when you insert new vectors. The difference shows up past 100K chunks.

Step 2 · Chunking and embeddings

Install deps:

```bash
pnpm add openai @supabase/supabase-js
pnpm add -D tsx
```

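Indexing script. It runs one-off from the command line and doesn't go in the API. It reads files as UTF-8 text, so extract the text from PDFs before passing them in: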

```typescript
// scripts/index-documents.ts
import { createClient } from "@supabase/supabase-js";
import OpenAI from "openai";
import fs from "node:fs";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY! // service_role to bypass RLS while indexing
);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const TENANT_ID = process.env.TENANT_ID!;
const CHUNK_SIZE = 600; // approx tokens. 1 token ~= 4 chars in English
const CHUNK_OVERLAP = 80; // overlap so context doesn't break
const EMBED_BATCH_SIZE = 100; // chunks per embeddings request

// Fixed-size character windows with overlap between consecutive chunks
function chunkText(text: string): string[] {
  const chunks: string[] = [];
  const approxCharsPerChunk = CHUNK_SIZE * 4;
  const overlapChars = CHUNK_OVERLAP * 4;
  let i = 0;
  while (i < text.length) {
    chunks.push(text.slice(i, i + approxCharsPerChunk));
    i += approxCharsPerChunk - overlapChars;
  }
  return chunks;
}

async function indexFile(path: string, source: string) {
  const text = fs.readFileSync(path, "utf-8");
  const chunks = chunkText(text);
  console.log(` > ${chunks.length} chunks`);

  // OpenAI accepts up to 2048 inputs per request, but requests are also
  // capped by total tokens, so embed and insert in smaller batches.
  for (let start = 0; start < chunks.length; start += EMBED_BATCH_SIZE) {
    const batch = chunks.slice(start, start + EMBED_BATCH_SIZE);
    const embedRes = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch,
    });
    const rows = batch.map((content, idx) => ({
      content,
      embedding: embedRes.data[idx].embedding,
      tenant_id: TENANT_ID,
      metadata: { source, chunk_index: start + idx },
    }));
    const { error } = await supabase.from("documents").insert(rows);
    if (error) throw error;
  }
}

async function main() {
  const files = process.argv.slice(2);
  for (const f of files) {
    console.log(`Indexing ${f}...`);
    await indexFile(f, f);
  }
  console.log("Done.");
}

// Exit non-zero on failure so a broken run is visible in scripts and CI
main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Run it:

```bash
TENANT_ID=<uuid> pnpm tsx scripts/index-documents.ts docs/*.txt
```

Indexing cost: with text-embedding-3-small at USD 0.02 per 1M tokens, indexing 5,000 chunks of 600 tokens runs ~USD 0.06. Yes, six cents.
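
The arithmetic behind that figure, as a sketch you can rerun with your own numbers. The price is the one quoted above and should be treated as an assumption that may change; `estimateIndexingCostUsd` is a hypothetical helper:

```typescript
// Assumed price from the text: USD 0.02 per 1M tokens for text-embedding-3-small.
const PRICE_PER_MILLION_TOKENS_USD = 0.02;

// Cost of embedding chunkCount chunks of roughly tokensPerChunk tokens each.
function estimateIndexingCostUsd(
  chunkCount: number,
  tokensPerChunk: number,
  pricePerMillionUsd: number = PRICE_PER_MILLION_TOKENS_USD
): number {
  const totalTokens = chunkCount * tokensPerChunk;
  return (totalTokens / 1000000) * pricePerMillionUsd;
}

// 5,000 chunks * 600 tokens = 3M tokens
console.log(estimateIndexingCostUsd(5000, 600).toFixed(2)); // prints "0.06"
```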

Step 3 · Semantic search with an RPC

Create a SQL function for the search in the SQL Editor. You call it from Next.js through supabase.rpc(), without ad-hoc SQL:

```sql
create or replace function match_documents(
  query_embedding vector(1536),
  match_threshold float,
  match_count int,
  filter_tenant_id uuid
)
returns table (
  id bigint,
  content text,
  metadata jsonb,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    documents.metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where documents.tenant_id = filter_tenant_id
    and 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
$$;
```

The <=> operator computes cosine distance (0 = identical, 2 = opposite). Subtracting it from 1 gives cosine similarity (1 = identical, -1 = opposite).
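
To make those endpoints concrete, here is the same calculation in plain TypeScript. It is an illustrative sketch of the math, not how pgvector implements the operator; `cosineDistance` is a hypothetical helper:

```typescript
// Cosine distance = 1 - cosine similarity, the quantity the <=> operator returns.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine distance is undefined for a zero vector");
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineDistance([1, 0], [1, 0]));  // 0: same direction
console.log(cosineDistance([1, 0], [0, 1]));  // 1: orthogonal
console.log(cosineDistance([1, 0], [-1, 0])); // 2: opposite direction
```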

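Typical threshold: 0.7-0.8 for technical text, 0.6 for more conversational. Lower = more recall but more noise. Start at 0.75 and tune if answers aren't good. Absolute scores vary by embedding model, so if the endpoint keeps returning nothing, log the similarity of chunks you know are relevant and lower the threshold accordingly.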
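
One way to tune the threshold with data instead of guesswork is to collect similarity scores for question/chunk pairs you already know are relevant and see how many survive each candidate value. A sketch with made-up scores; `retainedFraction` is a hypothetical helper:

```typescript
// Fraction of known-relevant pairs that would pass a given threshold.
// Uses the same strict ">" comparison as the match_documents function.
function retainedFraction(similarities: number[], threshold: number): number {
  if (similarities.length === 0) return 0;
  const kept = similarities.filter((s) => s > threshold).length;
  return kept / similarities.length;
}

// Hypothetical scores for pairs you have judged relevant by hand.
const knownRelevant = [0.82, 0.78, 0.74, 0.69, 0.61];

for (const threshold of [0.6, 0.7, 0.75, 0.8]) {
  console.log(threshold, retainedFraction(knownRelevant, threshold));
}
```

If a threshold drops most of the pairs you consider relevant, it is too strict for your corpus and model.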

Step 4 · The Next.js endpoint

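```typescript
// app/api/ask/route.ts
import { createClient } from "@supabase/supabase-js";
import OpenAI from "openai";
import { NextResponse } from "next/server";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function POST(req: Request) {
  const { question } = await req.json();

  // anon key + the caller's JWT, so RLS resolves auth.uid() to the user's tenant.
  // With the anon key alone, auth.uid() is null and the Step 1 policy hides every row.
  // Assumes the client sends "Authorization: Bearer <Supabase access token>".
  const authHeader = req.headers.get("Authorization");
  if (!authHeader) {
    return NextResponse.json({ error: "Missing Authorization header" }, { status: 401 });
  }
  const supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_ANON_KEY!,
    { global: { headers: { Authorization: authHeader } } }
  );

  // Take the tenant from the verified token, not from the request body
  const token = authHeader.replace(/^Bearer\s+/i, "");
  const { data: { user } } = await supabase.auth.getUser(token);
  if (!user) {
    return NextResponse.json({ error: "Invalid session" }, { status: 401 });
  }
  const tenantId = user.id; // matches the tenant_id = auth.uid() policy

  // 1. Embed the question
  const embedRes = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  const queryEmbedding = embedRes.data[0].embedding;

  // 2. Find top-5 chunks
  const { data: chunks, error } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_threshold: 0.75,
    match_count: 5,
    filter_tenant_id: tenantId,
  });

  // A failed query is not the same as "nothing relevant": report it as an error
  if (error) {
    console.error("match_documents failed", error);
    return NextResponse.json({ error: "Search failed" }, { status: 500 });
  }
  if (!chunks?.length) {
    return NextResponse.json({
      answer: "I couldn't find relevant information. Could you rephrase the question?",
      sources: [],
    });
  }

  // 3. Build the augmented prompt
  const context = chunks
    .map((c: any, idx: number) => `[Source ${idx + 1}]: ${c.content}`)
    .join("\n\n");

  const systemPrompt = `You're an assistant that answers questions based ONLY on the provided context. If the answer isn't in the context, say you don't know. Cite sources using [Source N] at the end of each statement.`;

  // 4. Call the LLM
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
    temperature: 0.2, // low for factual answers
  });

  return NextResponse.json({
    answer: completion.choices[0].message.content,
    sources: chunks.map((c: any) => c.metadata),
    tokens: completion.usage,
  });
}
```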

Why temperature: 0.2: RAG is for factual answers. High temperature introduces creativity that, in this case, is hallucination in disguise.

Why gpt-4o-mini instead of gpt-4o: for Q&A over already-retrieved context, the model doesn't need much reasoning — it needs good paraphrasing. gpt-4o-mini saves you 10x in token cost.
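
The context-building step of the endpoint is easier to test and log when it lives in a pure function. A sketch that also caps the context size; `buildContext`, `RetrievedChunk` and the 12,000-character default are assumptions for illustration, not part of the endpoint above:

```typescript
interface RetrievedChunk {
  content: string;
}

// Formats chunks as "[Source N]: ..." and stops before maxChars is exceeded,
// so an unusually long chunk cannot blow up the prompt size.
function buildContext(chunks: RetrievedChunk[], maxChars: number = 12000): string {
  const parts: string[] = [];
  let used = 0;
  for (let i = 0; i < chunks.length; i++) {
    const part = `[Source ${i + 1}]: ${chunks[i].content}`;
    if (used + part.length > maxChars) break;
    parts.push(part);
    used += part.length + 2; // +2 for the "\n\n" separator
  }
  return parts.join("\n\n");
}
```

Because chunks arrive ordered by similarity, cutting from the end drops the least relevant ones first.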

Step 5 · Monitoring

Three minimum metrics for production:

  1. Tokens per request — they come in completion.usage. Log to your observability stack or a query_logs table.
  2. Average similarity score — if it drops week over week, your corpus is getting stale or users are asking new things.
  3. "Not found" rate — if it climbs, adjust threshold, add more documents, or review chunking.
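
These three metrics can be computed from a simple per-request log. A sketch under the assumption that each request stores its total tokens and the similarity of its best match (null when nothing passed the threshold); `QueryLog` and `summarizeLogs` are hypothetical names:

```typescript
interface QueryLog {
  totalTokens: number;          // e.g. completion.usage.total_tokens
  topSimilarity: number | null; // best match score, null when nothing was found
}

interface RagMetrics {
  avgTokensPerRequest: number;
  avgTopSimilarity: number | null; // null when no request found anything
  notFoundRate: number;            // between 0 and 1
}

function summarizeLogs(logs: QueryLog[]): RagMetrics {
  if (logs.length === 0) {
    return { avgTokensPerRequest: 0, avgTopSimilarity: null, notFoundRate: 0 };
  }
  const totalTokens = logs.reduce((sum, log) => sum + log.totalTokens, 0);
  const scores = logs
    .map((log) => log.topSimilarity)
    .filter((s): s is number => s !== null);
  const avgTopSimilarity =
    scores.length > 0 ? scores.reduce((a, b) => a + b, 0) / scores.length : null;
  return {
    avgTokensPerRequest: totalTokens / logs.length,
    avgTopSimilarity,
    notFoundRate: (logs.length - scores.length) / logs.length,
  };
}
```

Run it over one week of logs at a time to get the week-over-week trend.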

Minimum setup with Helicone (OpenAI proxy):

```typescript
// tenantId must be in scope here, so create this client inside the request handler
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
    "Helicone-User-Id": tenantId,
  },
});
```

Helicone gives you a per-user cost dashboard, latency, and errors. Free tier up to 100K requests/month.

Pitfalls you'll hit

  1. Chunks too small cut ideas in half. If answers are bad, check retrieved chunks before changing models. That's probably the issue.
  2. Forgotten multi-tenant filters. If your app serves multiple customers, always pass tenant_id in the filter. Forgetting this is how you leak data between customers.
  3. Re-indexing everything when you add one document. Not needed. Just embed new chunks and insert. pgvector keeps the index updated.
  4. Not logging the prompt sent to the LLM. When an answer is bad, you want to see what context was injected. Without that log you're debugging blind.
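
For the third pitfall, incremental indexing needs a way to tell which chunks are already stored. One option, not something the indexing script above does, is to hash each chunk and skip hashes you have seen. A sketch assuming you keep the hash somewhere queryable (for example under a hypothetical `content_hash` key in metadata); `chunkHash` and `selectNewChunks` are hypothetical helpers:

```typescript
import { createHash } from "node:crypto";

// Stable fingerprint of a chunk, scoped to the tenant so identical text
// in two tenants never collides.
function chunkHash(tenantId: string, content: string): string {
  return createHash("sha256")
    .update(tenantId)
    .update("\n")
    .update(content)
    .digest("hex");
}

// Keeps only the chunks whose hash is not already indexed.
function selectNewChunks(
  chunks: string[],
  tenantId: string,
  alreadyIndexed: Set<string>
): string[] {
  return chunks.filter((content) => !alreadyIndexed.has(chunkHash(tenantId, content)));
}
```

Only the chunks returned by `selectNewChunks` need to be embedded and inserted, which also avoids paying twice for the same text.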

Let's talk about your case

If you're considering RAG in your SaaS and want to validate whether your use case justifies it (sometimes it doesn't — see the FAQ), book a 30-minute call at no cost. In 20 minutes I can usually tell you whether RAG is the right tool or whether your case is solved more simply with prompt engineering + a good system prompt.


Frequently asked questions

Why pgvector and not Pinecone or another dedicated vector DB?

For most SaaS up to 1M chunks, pgvector on Supabase is enough: zero extra infra, trivial tenant scoping via RLS, flat costs. Pinecone makes sense when you hit 10M+ vectors or need very complex filters. Start with pgvector and migrate later if needed.

What chunk size should I use?

400-800 tokens with 50-100 overlap works well for technical docs and articles. For legal/contracts, go up to 1000-1500 tokens — the semantic unit is longer. For chat logs, drop to 200-300. If a chunk cuts an idea in half, answers will be bad.
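
To avoid cutting ideas in half, one alternative to fixed-size windows is to pack whole paragraphs into each chunk. A sketch of that technique, without overlap and using the same 4-characters-per-token heuristic (2,400 characters is roughly 600 tokens); `chunkByParagraph` is a hypothetical helper:

```typescript
// Splits on blank lines and packs whole paragraphs into chunks of up to maxChars.
function chunkByParagraph(text: string, maxChars: number = 2400): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    // Fallback: a single paragraph longer than maxChars is hard-split.
    let rest = paragraph;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Chunks come out with uneven sizes, which is usually fine: the goal is that each one reads as a complete thought when you inspect what was retrieved.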

Which embedding model in 2026?

OpenAI's text-embedding-3-small is the workhorse: USD 0.02 per 1M tokens, 1536 dimensions, solid quality for almost any SaaS. text-embedding-3-large has 5-10% better recall but costs about 6x as much. Voyage AI and Cohere's embedding models are valid alternatives, especially if you want to avoid OpenAI dependency.

How much does this cost to run in production?

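For 5K indexed documents (10-50K chunks) and 1K queries/month: initial indexing is a one-time cost of roughly USD 0.12-0.60 in embeddings (10-50K chunks of 600 tokens at USD 0.02 per 1M tokens), plus whatever you spend re-indexing while you tune chunking. For queries in operation (query embeddings + LLM completion), USD 30-80/month is a conservative budget; with gpt-4o-mini and 5 chunks per query the LLM bill itself is usually far lower, so check completion.usage during the first weeks and adjust. Supabase usually fits in free tier up to a point.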

How do I handle multi-tenant permissions?

Add a `tenant_id` column to the embeddings table and filter it in the SQL query before doing the similarity search. With Supabase, configure Row Level Security policies per tenant_id. Never enforce permissions in app code alone — one bug and a customer sees another's data.

When do I NOT use RAG?

When the answer doesn't require external knowledge (tasks like classification, creative generation, translation). When the relevant documents fit in the model's context window (Claude Sonnet accepts 200K tokens — for small corpora it's simpler to stuff everything in). When you need answers that combine information across many documents — naive RAG fails on questions like 'compare all 2025 contracts'.
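
For the second case, a quick size check tells you whether the corpus fits in the context window at all. A rough sketch using the 4-characters-per-token heuristic; the 200,000-token window comes from the text, while the 8,000 tokens reserved for the prompt and answer are an arbitrary assumption, and `fitsInContext` is a hypothetical helper:

```typescript
const CHARS_PER_TOKEN = 4; // rough heuristic for English text

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// True when all documents plus a reserve for instructions and the answer
// fit inside the model's context window.
function fitsInContext(
  documents: string[],
  contextWindowTokens: number = 200000,
  reservedTokens: number = 8000
): boolean {
  const total = documents.reduce((sum, doc) => sum + estimateTokens(doc), 0);
  return total + reservedTokens <= contextWindowTokens;
}
```

Fitting in the window does not make it free: every request then pays for the whole corpus as input tokens, so compare that against the per-query cost of retrieval.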