How to integrate the OpenAI API in a Next.js SaaS without blowing up costs

May 16, 2026 · 9 min · AI, OpenAI, Next.js, Tutorial

Short answer (60 seconds): integrating OpenAI in a SaaS without blowing up costs comes down to five practices you can ship in 3 hours: (1) wrap the SDK with telemetry from day 1, (2) route every call to the cheapest model that can do the task (gpt-4o-mini for 80% of cases), (3) cache deterministic prompts in Redis, (4) stream responses for better UX, and (5) enforce rate limits and budget caps per tenant (split into Steps 5 and 6 below). Applied properly, this cuts the monthly bill 50-70% with no perceptible quality loss.

The initial OpenAI integration in a SaaS is deceptively easy: four lines of code and you have a working chatbot. The problem starts the second month, when you see the bill.

This post is the guide I wish I'd had when I integrated AI into my first SaaS. It is not about how to call the API (that's in the OpenAI docs), but about how to integrate it so it scales without destroying your margin.

The most common mistake I see in LATAM SaaS

Charging USD 49/month to a user who costs you USD 80/month in tokens. That case is real: a client came to me for consulting with exactly that problem. They had an AI feature in a mid-tier plan, heavy users consuming more than they paid, and a negative margin per active user.

The five practices below avoid that scenario.
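To make that margin problem concrete, here is a minimal sketch of the per-user arithmetic. The `grossMarginPerUser` helper and its field names are hypothetical (they do not come from any library); the numbers are the ones from the case above.

```typescript
// Hypothetical helper: monthly gross margin for one user of an AI feature.
export interface UserEconomics {
  planPriceUsd: number; // what the user pays per month
  tokenCostUsd: number; // what their AI usage costs you per month
}

export function grossMarginPerUser(u: UserEconomics): { marginUsd: number; marginPct: number } {
  const marginUsd = u.planPriceUsd - u.tokenCostUsd;
  // Margin as a percentage of revenue; negative means you pay to serve this user.
  const marginPct = (marginUsd / u.planPriceUsd) * 100;
  return { marginUsd, marginPct };
}

// The case from the text: USD 49/month plan, USD 80/month in tokens.
const heavyUser = grossMarginPerUser({ planPriceUsd: 49, tokenCostUsd: 80 });
console.log(heavyUser.marginUsd); // -31
```

Running this per user over your telemetry is one way to spot which plans carry negative-margin users before the bill arrives.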

Step 1 · Wrap the SDK with telemetry from day 1

Before making the first OpenAI call from your app, centralize all calls in one module. This lets you:

  • Log tokens, model, latency, estimated cost.
  • Swap providers or models without touching 50 files.
  • Apply policies (caps, rate limits) at a single point.
```ts
// lib/ai/client.ts
import OpenAI from "openai";
// Assumed helper that persists one row per call for the internal dashboard (not shown in this post).
import { logAiCall } from "./log";

// Exported so route handlers can reuse the same configured client.
export const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Prices in USD per 1M tokens (input / output). Update when OpenAI changes them.
const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  "o3-mini": { input: 1.1, output: 4.4 },
};

export async function callLLM(opts: {
  tenantId: string;
  model: keyof typeof PRICING;
  messages: OpenAI.ChatCompletionMessageParam[];
  temperature?: number;
}) {
  const start = Date.now();
  // At the time of writing, o-series reasoning models such as o3-mini reject
  // `temperature`, so it is sent only to the other models.
  const supportsTemperature = !opts.model.startsWith("o");

  const res = await openai.chat.completions.create(
    {
      model: opts.model,
      messages: opts.messages,
      ...(supportsTemperature ? { temperature: opts.temperature ?? 0.2 } : {}),
    },
    { headers: { "Helicone-User-Id": opts.tenantId } },
  );

  const latencyMs = Date.now() - start;
  const u = res.usage!;
  const pricing = PRICING[opts.model];
  const costUsd =
    (u.prompt_tokens / 1_000_000) * pricing.input +
    (u.completion_tokens / 1_000_000) * pricing.output;

  // Local log: Helicone also stores this; the copy is redundancy for the internal dashboard.
  await logAiCall({
    tenantId: opts.tenantId,
    model: opts.model,
    promptTokens: u.prompt_tokens,
    completionTokens: u.completion_tokens,
    costUsd,
    latencyMs,
  });

  return { content: res.choices[0].message.content, usage: u, costUsd };
}
```

Every AI call in the app goes through callLLM. If you later want to add caching, a kill switch, or a fallback to Anthropic, you do it here.
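The cost line inside the wrapper is plain arithmetic, and it helps to see it in isolation. A small sketch, assuming the same per-million prices as the PRICING table above; `estimateCostUsd` is a hypothetical name and the token counts are made up for illustration.

```typescript
// Prices in USD per 1M tokens, copied from the PRICING table in the wrapper.
const PRICE_PER_MILLION = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  "o3-mini": { input: 1.1, output: 4.4 },
} as const;

type PricedModel = keyof typeof PRICE_PER_MILLION;

export function estimateCostUsd(
  model: PricedModel,
  promptTokens: number,
  completionTokens: number,
): number {
  const price = PRICE_PER_MILLION[model];
  return (
    (promptTokens / 1_000_000) * price.input +
    (completionTokens / 1_000_000) * price.output
  );
}

// Illustrative request: 1,200 prompt tokens and 300 completion tokens.
const onMini = estimateCostUsd("gpt-4o-mini", 1200, 300);
const on4o = estimateCostUsd("gpt-4o", 1200, 300);
console.log(onMini.toFixed(6), on4o.toFixed(6)); // 0.000360 0.006000
```

For the same request, gpt-4o costs between 16x and 17x what gpt-4o-mini does, which is why Step 2 matters so much.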

Step 2 · Pick the model dynamically

This is the practice that has the biggest impact on the bill.

Mental rule for 2026:

| Task type | Model | Why |
| --- | --- | --- |
| Classification, extraction, routing | gpt-4o-mini | 16x cheaper than gpt-4o, indistinguishable quality for structured tasks. |
| Short summary, paraphrase | gpt-4o-mini | Same. |
| Rich content generation | gpt-4o | You need creativity/voice. The difference shows. |
| Multi-step reasoning | o3-mini | Explicit reasoning chains. More expensive but fewer passes needed. |
| Embeddings | text-embedding-3-small | $0.02/1M tokens, enough for almost everything. |

Implementation:

```ts
// lib/ai/routing.ts
type TaskType = "classify" | "extract" | "summarize" | "generate" | "reason";

const MODEL_FOR_TASK: Record<TaskType, "gpt-4o" | "gpt-4o-mini" | "o3-mini"> = {
  classify: "gpt-4o-mini",
  extract: "gpt-4o-mini",
  summarize: "gpt-4o-mini",
  generate: "gpt-4o",
  reason: "o3-mini",
};

export function modelFor(task: TaskType) {
  return MODEL_FOR_TASK[task];
}
```

Then, on every call:

```ts
await callLLM({
  tenantId,
  model: modelFor("classify"),
  messages: [...],
});
```

Changing the central policy from mini → 4o when needed is a single line.

Anti-pattern: picking the model inside business logic (if (premium) model = 'gpt-4o'). That scatters cost decisions across the code.
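If plans really do need different models, one way to keep that decision central is to put the plan into the routing table instead of the business logic. This is only a sketch: the plan names and the `modelForPlan` function are hypothetical, and the override for the free plan is an example policy, not a recommendation.

```typescript
type TaskType = "classify" | "extract" | "summarize" | "generate" | "reason";
type Plan = "free" | "pro" | "enterprise"; // hypothetical plan names
type Model = "gpt-4o" | "gpt-4o-mini" | "o3-mini";

const BASE_MODEL: Record<TaskType, Model> = {
  classify: "gpt-4o-mini",
  extract: "gpt-4o-mini",
  summarize: "gpt-4o-mini",
  generate: "gpt-4o",
  reason: "o3-mini",
};

// Plan-specific overrides live next to the base table, not inside feature code.
const PLAN_OVERRIDES: Partial<Record<Plan, Partial<Record<TaskType, Model>>>> = {
  free: { generate: "gpt-4o-mini", reason: "gpt-4o-mini" },
};

export function modelForPlan(task: TaskType, plan: Plan): Model {
  return PLAN_OVERRIDES[plan]?.[task] ?? BASE_MODEL[task];
}

console.log(modelForPlan("generate", "free")); // gpt-4o-mini
console.log(modelForPlan("generate", "pro")); // gpt-4o
```

Feature code still asks only for a task and a plan; which model that maps to remains a one-file decision.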

Step 3 · Caching deterministic prompts

If the same prompt with the same parameters repeats, there's no reason to pay for it twice.

```ts
// lib/ai/cache.ts
import { Redis } from "@upstash/redis";
import crypto from "node:crypto";
import { callLLM } from "./client";

const redis = Redis.fromEnv();
const DEFAULT_TEMPERATURE = 0.2; // must match the default used in callLLM

// Note: the key does not include tenantId, so identical prompts share one entry across tenants.
function cacheKey(input: { model: string; messages: unknown[]; temperature: number }) {
  const hash = crypto.createHash("sha256").update(JSON.stringify(input)).digest("hex");
  return `ai:cache:${hash}`;
}

export async function cachedCallLLM(opts: Parameters<typeof callLLM>[0]) {
  // Cache only low-temperature calls (0 up to the 0.2 default), where repeated
  // answers are expected to match. Resolve the default once so an omitted
  // temperature is treated the same as an explicit 0.2.
  const temperature = opts.temperature ?? DEFAULT_TEMPERATURE;
  if (temperature > DEFAULT_TEMPERATURE) return callLLM(opts);

  const key = cacheKey({ model: opts.model, messages: opts.messages, temperature });
  const cached = await redis.get<string>(key);
  if (cached !== null) {
    // Hit: no OpenAI cost, latency is one Redis round trip
    return {
      content: cached,
      usage: { prompt_tokens: 0, completion_tokens: 0 },
      costUsd: 0,
      cached: true,
    };
  }

  const result = await callLLM(opts);
  if (result.content !== null) {
    await redis.set(key, result.content, { ex: 60 * 60 * 24 * 7 }); // TTL 7 days
  }
  return { ...result, cached: false };
}
```

When NOT to cache:

  • Conversational chat (each interaction has different context).
  • Creative generation with temperature > 0.5.
  • Any flow where the user expects variation.

When TO cache:

  • Classification ("is this ticket about billing? yes/no").
  • Extraction ("extract email, amount, date from this text").
  • Translations of the same source text.
  • Summaries of documents that don't change.
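The two lists above can be folded into one small policy function so the decision is explicit rather than implied by temperature alone. A sketch under assumptions: the task labels and the `shouldCache` name are hypothetical, and the 0.5 threshold is taken from the "when NOT to cache" list.

```typescript
type Task =
  | "classify"
  | "extract"
  | "translate"
  | "summarize-static" // summaries of documents that don't change
  | "chat"
  | "creative";

// Task types where the same input should produce the same answer.
const CACHEABLE_TASKS = new Set<Task>(["classify", "extract", "translate", "summarize-static"]);

export function shouldCache(task: Task, temperature: number): boolean {
  if (!CACHEABLE_TASKS.has(task)) return false; // chat and creative flows expect variation
  return temperature <= 0.5; // high temperature means the caller wants variation
}

console.log(shouldCache("extract", 0)); // true
console.log(shouldCache("chat", 0)); // false
```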

Typical hit rates in production: 30-60% in extraction flows, which translates into 30-60% lower cost in those endpoints.
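The link between hit rate and savings is a one-line formula: every hit is a call you do not pay OpenAI for. A rough sketch, assuming all requests in the endpoint cost about the same and ignoring what Redis itself costs; `costWithCache` and the USD 400 baseline are illustrative only.

```typescript
// Expected OpenAI spend for an endpoint once a fraction of calls is served from cache.
export function costWithCache(baselineUsd: number, hitRate: number): number {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError("hitRate must be between 0 and 1");
  }
  return baselineUsd * (1 - hitRate);
}

// Illustrative endpoint that costs USD 400/month without caching.
console.log(costWithCache(400, 0.3).toFixed(2)); // 280.00
console.log(costWithCache(400, 0.6).toFixed(2)); // 160.00
```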

Step 4 · Streaming responses

For any response that takes more than 2 seconds, streaming dramatically improves perceived UX. Next.js App Router implementation:

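```ts
// app/api/chat/route.ts
import { openai } from "@/lib/ai/client";

export async function POST(req: Request) {
  // In production, derive tenantId from the authenticated session instead of trusting the body.
  const { messages, tenantId } = await req.json();

  const stream = await openai.chat.completions.create(
    { model: "gpt-4o-mini", messages, stream: true },
    { headers: { "Helicone-User-Id": tenantId } },
  );

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const content = chunk.choices[0]?.delta?.content;
          if (content) controller.enqueue(encoder.encode(content));
        }
        controller.close();
      } catch (err) {
        // Surface upstream failures to the client instead of leaving the stream hanging.
        controller.error(err);
      }
    },
  });

  // The body is raw text chunks (no SSE "data:" framing), so read it with a
  // stream reader on the client rather than EventSource.
  return new Response(readable, {
    headers: { "Content-Type": "text/event-stream" },
  });
}
```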

On the frontend:

```ts
const res = await fetch("/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages, tenantId }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact when they are split across chunks
  const chunk = decoder.decode(value, { stream: true });
  setOutput((prev) => prev + chunk);
}
```

20 lines total, big UX impact.

Step 5 · Per-tenant rate limiting

Without this, an abusive user (or a bug in your own frontend) can generate a USD 500 surprise bill overnight.

```ts
// lib/ai/rate-limit.ts
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const redis = Redis.fromEnv();

export const aiLimiter = new Ratelimit({
  redis,
  limiter: Ratelimit.slidingWindow(60, "1 m"), // 60 requests/min per tenant
  analytics: true,
  prefix: "rl:ai",
});

export async function assertAiLimit(tenantId: string) {
  const { success, limit, reset, remaining } = await aiLimiter.limit(tenantId);
  if (!success) {
    const error: any = new Error("Rate limit exceeded");
    error.status = 429;
    error.headers = {
      "X-RateLimit-Limit": String(limit),
      "X-RateLimit-Remaining": String(remaining),
      // reset is a Unix timestamp in milliseconds
      "Retry-After": String(Math.ceil((reset - Date.now()) / 1000)),
    };
    throw error;
  }
}
```

Use in every AI endpoint before making the call:

```ts
await assertAiLimit(tenantId);
const result = await cachedCallLLM({...});
```

60 requests/min is usually enough for human use. If you need bulk processing, expose a separate endpoint with a queue.
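To illustrate what a sliding window limit does, here is a simplified in-memory sketch. It is not the algorithm `@upstash/ratelimit` implements, and it would not work across multiple server instances (that is what Redis is for). `createSlidingWindowLimiter` is a hypothetical name, and the clock is passed in so the behaviour is easy to follow.

```typescript
// Simplified sliding window: remember recent request timestamps per key.
export function createSlidingWindowLimiter(limit: number, windowMs: number) {
  const hits = new Map<string, number[]>();

  // Returns true if the request is allowed at time `now` (milliseconds).
  return function allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - windowMs;
    // Keep only the requests that are still inside the window.
    const recent = (hits.get(key) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < limit;
    if (allowed) recent.push(now);
    hits.set(key, recent);
    return allowed;
  };
}

// 2 requests per second per tenant, just to keep the example short.
const allow = createSlidingWindowLimiter(2, 1000);
console.log(allow("tenant-a", 0)); // true
console.log(allow("tenant-a", 100)); // true
console.log(allow("tenant-a", 200)); // false (2 requests already in the last second)
console.log(allow("tenant-a", 1001)); // true (the request at t=0 has left the window)
```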

Step 6 · Budget caps with kill switch

The most important protection: a monthly cap per tenant that cuts off when exceeded, not just one that logs.

```ts
// lib/ai/budget.ts
import { Redis } from "@upstash/redis";
// Assumed helper that posts to Slack/email (not shown in this post).
import { notifyApproachingCap } from "./notify";

const redis = Redis.fromEnv();
const TENANT_MONTHLY_CAP_USD = 50; // configurable per plan
const KEY_TTL_SECONDS = 60 * 60 * 24 * 40; // expire after 40 days

function spendKey(tenantId: string) {
  const month = new Date().toISOString().slice(0, 7); // YYYY-MM (UTC)
  return `spend:${tenantId}:${month}`;
}

async function tenantMonthSpend(tenantId: string): Promise<number> {
  const v = await redis.get<number>(spendKey(tenantId));
  return v ?? 0;
}

export async function assertBudget(tenantId: string) {
  const spend = await tenantMonthSpend(tenantId);
  if (spend >= TENANT_MONTHLY_CAP_USD) {
    const error: any = new Error("Monthly AI budget exceeded");
    error.status = 402; // Payment Required
    throw error;
  }
}

export async function recordSpend(tenantId: string, costUsd: number) {
  const key = spendKey(tenantId);
  const total = await redis.incrbyfloat(key, costUsd); // returns the new running total
  await redis.expire(key, KEY_TTL_SECONDS);

  if (total > TENANT_MONTHLY_CAP_USD * 0.8) {
    // Notify Slack/email when a tenant hits 80% of the cap, once per month:
    // SET with NX succeeds only for the first caller, so later calls do not re-alert.
    const firstAlert = await redis.set(`${key}:alerted`, 1, { nx: true, ex: KEY_TTL_SECONDS });
    if (firstAlert) await notifyApproachingCap(tenantId);
  }
}
```

Combine in the flow:

```ts
await assertAiLimit(tenantId);
await assertBudget(tenantId);
const result = await cachedCallLLM({...});
await recordSpend(tenantId, result.costUsd);
```

Why return 402 and not a generic error: you tell the frontend exactly what's wrong ("this tenant exceeded their budget") and can show an upsell UI ("Want more capacity? Upgrade to Pro").
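On the frontend, distinct status codes let you branch cleanly. A minimal sketch of that mapping, assuming the 429 and 402 responses described in Steps 5 and 6; the `AiUiState` type and `uiStateForStatus` function are hypothetical names, and the upsell copy is the example from the paragraph above.

```typescript
export type AiUiState =
  | { kind: "ok" }
  | { kind: "rate_limited"; retryAfterSeconds: number }
  | { kind: "budget_exceeded"; upsell: string }
  | { kind: "error" };

// Map the HTTP status (and Retry-After header, if any) to what the UI should show.
export function uiStateForStatus(status: number, retryAfterHeader: string | null): AiUiState {
  if (status >= 200 && status < 300) return { kind: "ok" };
  if (status === 429) {
    const parsed = Number(retryAfterHeader);
    // Fall back to 1 second when the header is missing or not a positive number.
    const retryAfterSeconds = Number.isFinite(parsed) && parsed > 0 ? parsed : 1;
    return { kind: "rate_limited", retryAfterSeconds };
  }
  if (status === 402) {
    return { kind: "budget_exceeded", upsell: "Want more capacity? Upgrade to Pro" };
  }
  return { kind: "error" };
}

console.log(uiStateForStatus(402, null).kind); // budget_exceeded
```

In a component you would call it with `res.status` and `res.headers.get("Retry-After")` right after the `fetch`.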

What you'll learn in production

Three lessons only earned by paying bills:

  1. 10% of users consume 80% of the tokens. Power-user detection is critical. An alert when an individual user crosses a threshold usually reveals a bug in your UI or a use case you didn't anticipate.

  2. Costs grow non-linearly with user growth. It's not 2x users = 2x cost. It tends to be 2x users = 3-4x cost because each new user discovers heavy flows the rest already knew. Model this in your pricing.

  3. "No breaking change" model updates break things. OpenAI has deprecated models several times without enough notice. Your wrapper from Step 1 lets you migrate in a week instead of a month.
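For the first lesson, the telemetry from Step 1 is enough to measure concentration and flag outliers. A sketch with made-up data: `topUserShare` and `usersOverThreshold` are hypothetical helpers, and the threshold is something you would tune to your own usage.

```typescript
export interface UserUsage {
  userId: string;
  tokens: number; // tokens consumed in the period you are analysing
}

// Share of all tokens consumed by the top `topFraction` of users (0.1 = top 10%).
export function topUserShare(usage: UserUsage[], topFraction = 0.1): number {
  const total = usage.reduce((sum, u) => sum + u.tokens, 0);
  if (total === 0) return 0;
  const sorted = [...usage].sort((a, b) => b.tokens - a.tokens);
  const topCount = Math.max(1, Math.ceil(sorted.length * topFraction));
  const topTokens = sorted.slice(0, topCount).reduce((sum, u) => sum + u.tokens, 0);
  return topTokens / total;
}

// Users whose consumption crosses an alert threshold.
export function usersOverThreshold(usage: UserUsage[], thresholdTokens: number): string[] {
  return usage.filter((u) => u.tokens > thresholdTokens).map((u) => u.userId);
}

// Made-up sample: one heavy user and nine light ones.
const sample: UserUsage[] = [
  { userId: "u0", tokens: 800_000 },
  ...Array.from({ length: 9 }, (_, i) => ({ userId: `u${i + 1}`, tokens: 20_000 })),
];

console.log(topUserShare(sample).toFixed(2)); // 0.82
console.log(usersOverThreshold(sample, 500_000)); // [ 'u0' ]
```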

Let's talk about your case

If you're integrating OpenAI in your SaaS and want to review your architecture before costs surprise you, book a 30-minute call at no cost. 30 minutes is usually enough to identify where the biggest optimization is and how much you'd save.


Frequently asked questions

What's the most expensive mistake when integrating OpenAI for the first time?

Using gpt-4o for everything. The cost difference between gpt-4o ($2.50/1M input tokens) and gpt-4o-mini ($0.15/1M) is roughly 16x. For classification, structured-data extraction, and short summaries, gpt-4o-mini is indistinguishable in quality. Moving 80% of traffic to mini usually drops the bill 60-70% with no perceptible quality loss.

When is caching worth it and when isn't?

Cache when prompt + parameters are deterministic and the answer doesn't need to vary (classification, extraction, translations). Don't cache conversational chat or creative generation at higher temperatures (above 0.5): there the user expects variation. Typical hit rate: 30-60% in extraction flows, near 0% in open chat.

Does streaming add real complexity, or is it worth it?

Worth it almost always. Perceived UX goes from 'waiting 8 seconds' to 'reading in real time'. Implementation: about 20 lines in Next.js App Router using a streamed response. Only caveat: your observability has to handle streams (Helicone does it transparently).

What happens when OpenAI has an outage?

It happens more often than you'd expect (5-10 minutes every 2-3 months). Three options: (1) retry with exponential backoff — handles most; (2) fallback to Anthropic Claude for critical flows — code barely changes; (3) queue in SQS/Inngest to process later if the flow doesn't need an immediate response.
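Option (1) can be sketched in a few lines. This is a generic helper, not the OpenAI SDK's own mechanism (the official Node SDK also has a `maxRetries` client option that retries some errors for you). `withRetry` and `backoffDelayMs` are hypothetical names, the delays are illustrative, and a real version would retry only on timeouts, 429s and 5xx responses.

```typescript
// Delay before the next attempt: 500 ms, 1 s, 2 s, 4 s, ... capped at maxMs.
export function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `fn` up to `maxAttempts` times, waiting longer after each failure.
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Usage would look like `await withRetry(() => callLLM({ ... }))`. Adding random jitter to the delay is a common refinement so that many clients do not retry at the same instant.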

How do I prevent an abusive user from bankrupting me?

Three combined layers: (1) mandatory auth — never expose AI endpoints unauthenticated; (2) per-tenant rate limit in Redis sliding window (60 requests/min is usually enough for human use); (3) hard monthly budget cap — endpoint returns 402 Payment Required when tenant exceeds, notifies the team.

Use a proxy like Helicone or build telemetry in-house?

For startups, Helicone (or LangSmith, Portkey). Free tier up to 100K requests/month and saves 1-2 weeks of work. When you hit 1M+ requests/month or have strict compliance requirements, build in-house. Migrating later is trivial: just change the OpenAI client's baseURL.
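To show why that migration is small, here is a sketch that builds the client options from environment variables, so the proxy becomes a config switch. `buildClientOptions` is a hypothetical helper; the Helicone URL and header are the ones used in Step 1, and leaving `baseURL` unset means the SDK talks to OpenAI directly.

```typescript
export interface ClientOptions {
  apiKey: string;
  baseURL?: string;
  defaultHeaders?: Record<string, string>;
}

// Route through Helicone only when its key is configured; otherwise call OpenAI directly.
export function buildClientOptions(env: Record<string, string | undefined>): ClientOptions {
  const apiKey = env.OPENAI_API_KEY ?? "";
  if (env.HELICONE_API_KEY) {
    return {
      apiKey,
      baseURL: "https://oai.helicone.ai/v1",
      defaultHeaders: { "Helicone-Auth": `Bearer ${env.HELICONE_API_KEY}` },
    };
  }
  return { apiKey }; // SDK default base URL
}

console.log(buildClientOptions({ OPENAI_API_KEY: "sk-test" }).baseURL); // undefined
```

In the wrapper from Step 1 this would become `new OpenAI(buildClientOptions(process.env))`.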