
Vercel AI SDK: Streaming AI Responses in Next.js

Uvin Vindula · May 20, 2024 · 10 min read

Last updated: April 14, 2026


TL;DR

The Vercel AI SDK is the best abstraction I've found for streaming AI responses in Next.js. It gives you useChat and useCompletion hooks on the client, streamText and generateText on the server, and a provider system that lets you swap between Claude, GPT-4, and Gemini without rewriting your application logic. I use it with the @ai-sdk/anthropic provider in every AI project I ship. Streaming isn't optional anymore — users expect tokens to appear as they're generated, not after a 5-second blank screen. This guide covers the patterns I've tested in production: server routes, client hooks, tool calling, error handling, and the cost controls that stop a chatbot from burning through your API budget overnight. If you're building AI features in Next.js, this is the stack.


Why Streaming Matters

Traditional request-response AI feels broken. The user sends a message, stares at a loading spinner for 3-8 seconds, and then a wall of text appears. Streaming fixes that by sending tokens to the browser as the model generates them. The first token arrives in under 500ms. The user starts reading immediately. Perceived latency drops by 80%.

I learned this the hard way building a chatbot for a client project. The non-streaming version had a 40% abandonment rate on the chat interface. Users would send a message, wait, assume it was broken, and leave. After switching to streaming with the Vercel AI SDK, abandonment dropped to under 10%. The model wasn't faster — the experience was.

Streaming also unlocks patterns that aren't possible with batch responses. You can show partial tool results, update progress indicators as the model reasons through a problem, and cancel generation mid-stream if the user navigates away. These aren't nice-to-haves. They're the difference between an AI feature that feels native and one that feels bolted on.

The Vercel AI SDK handles the hard parts of streaming: chunked transfer encoding, proper backpressure, client-side state management, and graceful error recovery. Without it, you're writing custom ReadableStream parsers and managing your own connection state. I've done that before. I don't recommend it.
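For a sense of what the SDK replaces, here's a rough sketch of manual stream handling against a generic SSE-style endpoint. This is illustrative only — the AI SDK's own data stream protocol is different, and real code would also have to buffer events split across chunk boundaries:

```typescript
// Pull "data: <payload>" lines out of a raw SSE chunk.
// Naive on purpose: no buffering of partial events, no error recovery.
export function decodeChunks(raw: string): string[] {
  return raw
    .split('\n')
    .filter((line) => line.startsWith('data: '))
    .map((line) => line.slice('data: '.length))
    .filter((payload) => payload !== '[DONE]');
}

// Wiring it up by hand — connection state, retries, and UI state are all on you.
export async function streamManually(onDelta: (text: string) => void) {
  const res = await fetch('/api/chat', { method: 'POST', body: '{}' });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    for (const delta of decodeChunks(decoder.decode(value, { stream: true }))) {
      onDelta(delta);
    }
  }
}
```

Every line of that is code you no longer own once the SDK manages the stream.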


Setting Up Vercel AI SDK with Claude

Install the core SDK and the Anthropic provider. The SDK uses a provider pattern — you install a thin core and then add providers for the models you want.

bash
npm install ai @ai-sdk/anthropic

Set your Anthropic API key as an environment variable:

env
ANTHROPIC_API_KEY=sk-ant-api03-...

The provider is configured once and used everywhere. I keep mine in a shared module:

typescript
// lib/ai/provider.ts
import { createAnthropic } from '@ai-sdk/anthropic';

export const anthropic = createAnthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

export const claude = anthropic('claude-sonnet-4-20250514');

That's it. The claude export is a model instance you pass to any AI SDK function. When Anthropic releases a new model, you change one string. When a client wants GPT-4 instead, you swap one provider. The rest of your code doesn't change.

I use claude-sonnet-4-20250514 for most production work. It's the sweet spot between quality and speed for streaming use cases. For complex reasoning where latency matters less, I switch to claude-opus-4-20250514. For high-volume, low-complexity tasks like classification, claude-haiku keeps costs manageable.


The useChat Hook

useChat is the client-side hook that manages an entire chat conversation. It handles message history, streaming state, error recovery, and input management. In most cases, it's the only client-side code you need.

typescript
// app/chat/page.tsx
'use client';

import { useChat } from 'ai/react';

export default function ChatPage() {
  const { messages, input, handleInputChange, handleSubmit, isLoading, error } =
    useChat({
      api: '/api/chat',
      onError: (err) => {
        console.error('Chat error:', err.message);
      },
    });

  return (
    <div className="mx-auto flex max-w-2xl flex-col gap-4 p-6">
      <div className="flex flex-col gap-3">
        {messages.map((message) => (
          <div
            key={message.id}
            className={`rounded-lg p-4 ${
              message.role === 'user'
                ? 'ml-auto bg-[#F7931A] text-white'
                : 'bg-[#111827] text-[#C9D1E0]'
            }`}
          >
            {message.content}
          </div>
        ))}
      </div>

      <form onSubmit={handleSubmit} className="flex gap-2">
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask something..."
          disabled={isLoading}
          className="flex-1 rounded-lg border border-[#1A2236] bg-[#0A0E1A] px-4 py-2 text-white"
        />
        <button
          type="submit"
          disabled={isLoading}
          className="rounded-lg bg-[#F7931A] px-6 py-2 font-semibold text-white transition-colors hover:bg-[#E07B0A] disabled:opacity-50"
        >
          Send
        </button>
      </form>

      {error && (
        <p className="text-sm text-[#FF4560]">
          Something went wrong. Please try again.
        </p>
      )}
    </div>
  );
}

A few things I've learned about useChat in production:

Message persistence. useChat keeps messages in local state by default. For persistent conversations, pass an initialMessages prop loaded from your database and use the onFinish callback to save the assistant's response.
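A minimal sketch of that pattern. The `/api/messages` route is hypothetical, and `ChatMessage` is a local structural stand-in for the SDK's `Message` type — adjust both to your own schema and SDK version:

```typescript
// Local structural stand-in for the SDK's Message type.
type ChatMessage = { id: string; role: 'user' | 'assistant' | 'system'; content: string };

// Options to spread into useChat for a persisted conversation.
export function persistenceOptions(conversationId: string, initialMessages: ChatMessage[]) {
  return {
    api: '/api/chat',
    // Loaded from your database in a server component and passed down as a prop.
    initialMessages,
    // Save the assistant's finished reply; '/api/messages' is a hypothetical route.
    onFinish: (message: ChatMessage) =>
      fetch('/api/messages', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ conversationId, role: message.role, content: message.content }),
      }),
  };
}
```

In the page component: `const chat = useChat(persistenceOptions(conversationId, loadedMessages));`.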

Streaming indicators. The isLoading boolean is true while tokens are streaming. I use this to show a typing indicator and disable the input. Don't show a spinner — show the streaming text. That's the whole point.

Multi-conversation support. Pass a unique id prop to useChat to manage multiple independent conversations on the same page. Each ID gets its own message history and streaming state.

typescript
const chat1 = useChat({ id: 'main', api: '/api/chat' });
const chat2 = useChat({ id: 'sidebar', api: '/api/chat' });

For simpler use cases where you just need a single prompt-response without conversation history, useCompletion is the lighter alternative. Same streaming behavior, less state management overhead.


Streaming Server Route

The server route is where the AI SDK connects to Claude. In Next.js App Router, this is a Route Handler that returns a streaming response.

typescript
// app/api/chat/route.ts
import { streamText } from 'ai';
import { claude } from '@/lib/ai/provider';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: claude,
    system: `You are a helpful assistant on iamuvin.com. You help developers 
with TypeScript, React, Next.js, and Web3 questions. Be concise and direct. 
Include code examples when relevant. Never make up library APIs — if you're 
unsure about a specific function signature, say so.`,
    messages,
    maxTokens: 2048,
    temperature: 0.7,
  });

  return result.toDataStreamResponse();
}

The streamText function handles the connection to Claude's API and returns a streamable result. toDataStreamResponse() converts that into the format useChat expects — a data stream with proper headers and chunked encoding.

For edge deployment, add the runtime directive:

typescript
export const runtime = 'edge';

I use Edge Runtime for chat routes because it shaves 100-200ms of latency compared to Node.js serverless functions. The AI SDK works in both runtimes, but Edge wins for streaming: edge isolates cold start in milliseconds rather than seconds, so time to first token stays low even when a request lands on a cold instance.

Important: Never expose your system prompt to the client. The AI SDK sends messages from the client but the system prompt stays server-side. I've seen projects that send the system prompt as the first message from the client — that's a security issue. Anyone can open DevTools and read it.


Tool Use with AI SDK

Tool calling is where the AI SDK really earns its keep. Instead of asking Claude to output JSON and parsing it yourself, you define typed tools with Zod schemas and the SDK handles invocation, validation, and result injection.

typescript
// app/api/chat/route.ts
import { streamText, tool } from 'ai';
import { z } from 'zod';
import { claude } from '@/lib/ai/provider';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: claude,
    system: 'You help users find information about products and orders.',
    messages,
    tools: {
      getProduct: tool({
        description: 'Look up a product by name or SKU',
        parameters: z.object({
          query: z.string().describe('Product name or SKU code'),
        }),
        execute: async ({ query }) => {
          const product = await db.product.findFirst({
            where: {
              OR: [
                { name: { contains: query, mode: 'insensitive' } },
                { sku: { equals: query } },
              ],
            },
          });

          if (!product) {
            return { found: false, message: `No product found for "${query}"` };
          }

          return {
            found: true,
            name: product.name,
            price: product.price,
            stock: product.stockCount,
            sku: product.sku,
          };
        },
      }),

      getOrderStatus: tool({
        description: 'Check the status of an order by order ID',
        parameters: z.object({
          orderId: z.string().describe('The order ID to look up'),
        }),
        execute: async ({ orderId }) => {
          const order = await db.order.findUnique({
            where: { id: orderId },
            select: { status: true, estimatedDelivery: true, trackingUrl: true },
          });

          if (!order) {
            return { found: false, message: 'Order not found' };
          }

          return { found: true, ...order };
        },
      }),
    },
    maxSteps: 3,
  });

  return result.toDataStreamResponse();
}

The maxSteps parameter is critical. It controls how many tool call rounds the model can make before it must produce a final text response. Without it, a misconfigured tool can create an infinite loop. I set it to 3 for most use cases — that covers the pattern of "call tool, get result, maybe call another tool, then respond."

Claude is remarkably good at deciding when to use tools. In my testing, it correctly identifies tool invocations about 97% of the time with well-written descriptions. The key is writing descriptions that explain *when* to use the tool, not just *what* it does. "Look up a product by name or SKU" is better than "product search function."

On the client side, useChat automatically handles tool call messages. You can render them with custom UI:

typescript
{messages.map((message) => (
  <div key={message.id}>
    {message.content}
    {message.toolInvocations?.map((tool) => (
      <div key={tool.toolCallId} className="rounded bg-[#1A2236] p-3 text-sm">
        <span className="text-[#6B7FA3]">Looking up: {tool.toolName}</span>
        {tool.state === 'result' && (
          <pre className="mt-2 text-[#C9D1E0]">
            {JSON.stringify(tool.result, null, 2)}
          </pre>
        )}
      </div>
    ))}
  </div>
))}

Custom Providers

The provider system is one of the AI SDK's strongest design decisions. It abstracts the model layer so you can swap providers without touching application logic. I use this in two scenarios: A/B testing models and fallback chains.

typescript
// lib/ai/provider.ts
import { createAnthropic } from '@ai-sdk/anthropic';
import { createOpenAI } from '@ai-sdk/openai';

const anthropic = createAnthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const openai = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export const models = {
  default: anthropic('claude-sonnet-4-20250514'),
  reasoning: anthropic('claude-opus-4-20250514'),
  fast: anthropic('claude-3-5-haiku-20241022'),
  fallback: openai('gpt-4o'),
} as const;

export type ModelKey = keyof typeof models;

For a fallback chain that switches providers when one is down:

typescript
import { streamText } from 'ai';
import { models } from '@/lib/ai/provider';

async function streamWithFallback(
  params: Omit<Parameters<typeof streamText>[0], 'model'>
) {
  const chain: Array<keyof typeof models> = ['default', 'fallback'];

  for (const modelKey of chain) {
    try {
      // Caveat: streamText defers most provider errors until the stream is
      // consumed, so this try/catch mainly covers setup failures. For stricter
      // failover, await the first chunk (or use onError) before returning.
      const result = streamText({
        ...params,
        model: models[modelKey],
      });

      return result;
    } catch (error) {
      console.error(`Model ${modelKey} failed, trying next:`, error);
      continue;
    }
  }

  throw new Error('All AI providers failed');
}

I always default to Claude, but having a fallback provider has saved me twice in production when Anthropic had brief API issues. The user never noticed because the response came from GPT-4o instead.


Error Handling in Streams

Streaming introduces error scenarios that don't exist with batch responses. The connection can drop mid-stream, the model can hit a rate limit after 200 tokens, or the client can navigate away while tokens are still arriving. The AI SDK handles most of this, but you need to configure it correctly.

Server-side error handling:

typescript
// app/api/chat/route.ts
import { streamText, APICallError } from 'ai';
import { claude } from '@/lib/ai/provider';

export async function POST(req: Request) {
  try {
    const { messages } = await req.json();

    if (!messages || !Array.isArray(messages) || messages.length === 0) {
      return Response.json(
        { error: 'Messages array is required' },
        { status: 400 }
      );
    }

    const lastMessage = messages[messages.length - 1];
    if (lastMessage.content.length > 10000) {
      return Response.json(
        { error: 'Message too long. Maximum 10,000 characters.' },
        { status: 400 }
      );
    }

    const result = streamText({
      model: claude,
      messages,
      maxTokens: 2048,
      abortSignal: req.signal,
    });

    return result.toDataStreamResponse();
  } catch (error) {
    // Errors thrown before streaming starts land here; provider errors that
    // occur mid-stream surface through the stream itself instead.
    if (error instanceof APICallError) {
      if (error.statusCode === 429) {
        return Response.json(
          { error: 'Rate limited. Please wait a moment and try again.' },
          { status: 429, headers: { 'Retry-After': '30' } }
        );
      }

      if (error.statusCode === 529) {
        return Response.json(
          { error: 'AI service is temporarily overloaded. Please try again.' },
          { status: 503 }
        );
      }
    }

    console.error('Chat API error:', error);
    return Response.json(
      { error: 'Internal server error' },
      { status: 500 }
    );
  }
}

The abortSignal: req.signal is important. It tells the SDK to stop generating tokens when the client disconnects. Without it, you're paying for tokens nobody will read.

Client-side, useChat provides the error state and onError callback. I always show a user-friendly message and provide a retry mechanism:

typescript
const { messages, error, reload } = useChat({
  api: '/api/chat',
  onError: (err) => {
    // Log to your error tracking service
    captureException(err);
  },
});

// In the UI
{error && (
  <div className="flex items-center gap-2 text-[#FF4560]">
    <span>Failed to send message.</span>
    <button onClick={() => reload()} className="underline">
      Try again
    </button>
  </div>
)}

The reload function resends the last user message. It's the single most important UX pattern for streaming chat — one click to retry instead of retyping the question.


Token Counting and Cost Control

AI costs sneak up on you. A chatbot with 1,000 daily users and an average conversation of 10 messages can rack up hundreds of dollars per day if you're not tracking token usage. The AI SDK gives you hooks to monitor and control this.
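To make that concrete, here's a back-of-envelope estimate under assumed averages — 1,500 prompt and 500 completion tokens per message, at Sonnet's $3/$15-per-million pricing. Real prompt counts grow with conversation history, so treat this as a floor:

```typescript
// Sonnet pricing (USD per million tokens).
const INPUT_PER_M = 3.0;
const OUTPUT_PER_M = 15.0;

// Assumed averages: 1,500 prompt + 500 completion tokens per message.
const perMessage = (1_500 / 1e6) * INPUT_PER_M + (500 / 1e6) * OUTPUT_PER_M; // $0.012

// 1,000 daily users × 10 messages each.
const perDay = perMessage * 1_000 * 10; // ≈ $120/day, before history growth
```

History growth is the multiplier to watch: by message ten of a conversation, the prompt can easily be several times that 1,500-token average.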

Server-side token tracking with onFinish:

typescript
import { streamText } from 'ai';
import { claude } from '@/lib/ai/provider';

export async function POST(req: Request) {
  const { messages, userId } = await req.json();

  const result = streamText({
    model: claude,
    messages,
    maxTokens: 2048,
    onFinish: async ({ usage }) => {
      await db.aiUsage.create({
        data: {
          userId,
          model: 'claude-sonnet-4-20250514',
          promptTokens: usage.promptTokens,
          completionTokens: usage.completionTokens,
          totalTokens: usage.totalTokens,
          estimatedCost: calculateCost(usage),
          timestamp: new Date(),
        },
      });
    },
  });

  return result.toDataStreamResponse();
}

function calculateCost(usage: { promptTokens: number; completionTokens: number }) {
  const SONNET_INPUT_PER_MILLION = 3.0;
  const SONNET_OUTPUT_PER_MILLION = 15.0;

  const inputCost = (usage.promptTokens / 1_000_000) * SONNET_INPUT_PER_MILLION;
  const outputCost =
    (usage.completionTokens / 1_000_000) * SONNET_OUTPUT_PER_MILLION;

  return inputCost + outputCost;
}

Beyond tracking, here are the controls I put in place for every production AI feature:

Per-user daily limits. I check token usage before each request and return a 429 if the user has exceeded their daily budget. Free tier gets 50K tokens/day. Paid tier gets 500K.
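A sketch of that check. The limits mirror the tiers above; you'd feed it the user's aggregated usage for today from the aiUsage table (that lookup is your own query, not shown):

```typescript
// Daily token budgets per tier.
const DAILY_LIMITS = { free: 50_000, paid: 500_000 } as const;
type Tier = keyof typeof DAILY_LIMITS;

// Pure budget check; pass in the user's token total for the current day.
export function checkDailyBudget(usedToday: number, tier: Tier) {
  const limit = DAILY_LIMITS[tier];
  return { allowed: usedToday < limit, remaining: Math.max(0, limit - usedToday) };
}
```

In the route handler, run this before calling streamText and return a 429 with a Retry-After header when allowed is false.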

Max tokens per response. Always set maxTokens. Never let the model decide how long to talk. 2048 is my default for chat. 4096 for longer-form generation. 512 for classification and short answers.

Message history trimming. Long conversations accumulate tokens fast. I keep the last 20 messages in the context window and summarize older ones. The AI SDK passes the full messages array to the model, so you need to trim it yourself:

typescript
function trimMessages(messages: Message[], maxMessages = 20): Message[] {
  if (messages.length <= maxMessages) return messages;

  const systemMessage = messages.find((m) => m.role === 'system');
  const recentMessages = messages.slice(-maxMessages);

  return systemMessage ? [systemMessage, ...recentMessages] : recentMessages;
}

Model routing by task. Not every request needs Sonnet. I route simple questions to Haiku and save Sonnet for complex multi-step reasoning. This cut my API costs by 45% on one project without any noticeable quality drop for end users.
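The router itself can be as simple as a heuristic over the last user message. This sketch returns a key into the models map from earlier — the threshold and keywords are made up, so tune them against your own traffic:

```typescript
export type ModelKey = 'default' | 'fast';

// Crude complexity heuristic: long messages, multi-line pastes, or "hard"
// keywords go to Sonnet ('default'); everything else goes to Haiku ('fast').
export function routeModel(lastUserMessage: string): ModelKey {
  const looksComplex =
    lastUserMessage.length > 280 ||
    lastUserMessage.includes('\n') || // pasted code or a multi-part question
    /\b(debug|refactor|architecture|step[- ]by[- ]step)\b/i.test(lastUserMessage);
  return looksComplex ? 'default' : 'fast';
}
```

In the route: `const model = models[routeModel(lastMessage.content)];`.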


My Production Patterns

After shipping streaming AI in several projects, here are the patterns that have survived production:

1. Middleware for auth and rate limiting. Never expose your chat API without authentication. I use Next.js middleware to validate the session and check rate limits before the request reaches the route handler.

typescript
// middleware.ts
import { NextRequest, NextResponse } from 'next/server';
import { getToken } from 'next-auth/jwt';

export async function middleware(req: NextRequest) {
  if (req.nextUrl.pathname.startsWith('/api/chat')) {
    const token = await getToken({ req });

    if (!token) {
      return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
    }

    const rateLimit = await checkRateLimit(token.sub as string);
    if (!rateLimit.allowed) {
      return NextResponse.json(
        { error: 'Rate limit exceeded' },
        { status: 429, headers: { 'Retry-After': String(rateLimit.retryAfter) } }
      );
    }
  }

  return NextResponse.next();
}

2. Structured system prompts. I template my system prompts with runtime data — user name, subscription tier, available tools. This makes the same chat route serve different experiences without code duplication.

typescript
function buildSystemPrompt(user: User): string {
  return `You are an AI assistant on iamuvin.com.
Current user: ${user.name} (${user.tier} tier).
Available tools: ${user.tier === 'pro' ? 'product search, order tracking, analytics' : 'product search'}.
Date: ${new Date().toISOString().split('T')[0]}.
Be concise. Use code examples for technical questions.`;
}

3. Optimistic UI for message sending. Don't wait for the server to acknowledge the message before showing it in the chat. useChat does this by default — the user's message appears immediately and the assistant's response streams in. This makes the interface feel instant.

4. Graceful degradation. If the AI service is completely down, show a fallback. I've used a simple FAQ search as a backup — it's not as good as Claude, but it's infinitely better than an error screen.
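The shape of that fallback, sketched with injected functions — `searchFAQ` stands in for your own keyword or vector search:

```typescript
type Answer = { source: 'ai' | 'faq'; text: string };

// Try the AI first; on any failure, degrade to a static FAQ lookup.
export async function answerWithFallback(
  question: string,
  askAI: (q: string) => Promise<string>,
  searchFAQ: (q: string) => string | null,
): Promise<Answer> {
  try {
    return { source: 'ai', text: await askAI(question) };
  } catch {
    const hit = searchFAQ(question);
    return {
      source: 'faq',
      text: hit ?? 'The assistant is unavailable right now. Please try again shortly.',
    };
  }
}
```

Tag the source in the UI so users know when they're seeing a canned answer instead of the assistant.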

5. Analytics on every conversation. Track message count, response time, tool usage, error rate, and user satisfaction. I use a thumbs up/down on each assistant message and pipe it to my analytics dashboard. This data drives prompt improvements more than any amount of testing in a playground.

6. Edge-first deployment. Every chat API route runs on Edge Runtime. The latency difference is noticeable — especially for users in regions far from the default serverless region. Edge routes cold start in under 50ms.


Key Takeaways

  • The Vercel AI SDK eliminates the boilerplate of streaming AI responses. useChat on the client, streamText on the server, and a provider pattern that abstracts the model layer.
  • Streaming is a UX requirement, not an optimization. Users abandon non-streaming chat interfaces. First token in under 500ms is the target.
  • Tool calling with Zod schemas replaces brittle prompt engineering. Define what the model can do, let it decide when.
  • The provider pattern lets you swap models without rewriting application logic. Default to Claude, fall back to GPT-4o, route cheap tasks to Haiku.
  • Cost control is architecture, not afterthought. Track tokens per user, set maxTokens, trim message history, and route by task complexity.
  • Error handling in streams requires abortSignal, retry mechanisms with reload(), and server-side validation before streaming begins.
  • Always authenticate chat API routes. Never expose an unauthenticated endpoint that calls a paid AI API.

The Vercel AI SDK with Claude as the provider is the stack I recommend for any Next.js project that needs AI features. It handles the streaming infrastructure so you can focus on building the product. If you're planning an AI integration and want it done right, check out my services — I've shipped this pattern multiple times and know where the edge cases hide.


*Uvin Vindula is a Web3 and AI engineer based between Sri Lanka and the UK. He builds production AI features, full-stack web applications, and smart contracts at iamuvin.com. Follow his work @IAMUVIN.*

Uvin Vindula

Web3 and AI engineer based in Sri Lanka and the UK. Author of The Rise of Bitcoin. Director of Blockchain and Software Solutions at Terra Labz. Founder of uvin.lk — Sri Lanka's Bitcoin education platform with 10,000+ learners.