I Built an Interactive RAG Demo to Show How Retrieval-Augmented Generation Actually Works
How I built an interactive visualization of the full RAG pipeline — from document chunking to vector search to augmented generation — in an afternoon with Claude Code.
This post was written with AI assistance.
RAG (Retrieval-Augmented Generation) is one of those concepts that comes up in every AI engineering job description. Everyone wants you to know it. But most explanations are either too abstract (“it grounds LLM responses in external knowledge”) or too buried in code to follow.
I wanted something I could point to in a job application and say: here, this shows I understand the full pipeline, not just the buzzword. So I built How does RAG work? — an interactive demo that visualizes every step of the RAG pipeline as it happens.
What it does
The app is a chat interface with a side-by-side pipeline visualization. You ask a question, and the left panel shows exactly what’s happening under the hood in four steps:
- Analyze query terms — each word in your question is weighted by importance (using TF-IDF or a heuristic). High-importance domain terms are blue, common stop words are grey. Each word shows how many source chunks contain it.
- Compare against all chunks — an animated bar chart shows the cosine similarity score for every chunk in the knowledge base. The top 3 are highlighted in blue. You can see at a glance that most chunks are irrelevant and only a few are close matches.
- Retrieve top chunks — the winning chunks slide in with their full text, source file, and similarity score. Each card shows exactly what context the LLM will receive.
- Augment and generate — the retrieved chunks are injected into Claude's system prompt, and the model generates a response with [Source N] citations.
The interactive part is what makes it useful as a demo: you can hover any word in the chat — in your question or the AI’s response — and see it highlighted across the retrieved source chunks in the sidebar. It makes the connection between “what did I ask” and “what did the system find” immediately visible.
How I built it
The whole thing was built in an afternoon using Claude Code. Here’s the stack:
- Next.js 16 with App Router and Tailwind
- Vercel AI SDK for streaming chat and custom data parts
- Anthropic Claude (via Vercel AI Gateway) for generation
- OpenAI text-embedding-3-small for vector embeddings
- In-memory vector store — just an array with cosine similarity, no database
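The "vector store" really is that simple. A minimal sketch of the idea (the type and function names here are illustrative, not the app's actual code):

```typescript
// The entire "vector database": an array of chunks paired with their
// embeddings, scored by plain cosine similarity. No index, no server.
interface StoredChunk {
  text: string;
  source: string;
  embedding: number[];
}

const store: StoredChunk[] = [];

// Cosine similarity: dot(a, b) / (|a| * |b|), for equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

At ~13 chunks, a linear scan over this array is effectively free, which is why no database is needed.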
The knowledge base
I started with four markdown files in a docs/ folder covering RAG, vector databases, prompt engineering, and AI agents. Each is 300-600 words. On first request, the app chunks them by paragraph, embeds all chunks, and holds everything in memory. For a demo with ~13 chunks, this is instant and requires zero infrastructure.
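Paragraph chunking can be sketched in a few lines — split on blank lines, trim, and tag each chunk with its source file (the Chunk shape below is my own, not necessarily the app's):

```typescript
interface Chunk {
  text: string;
  source: string; // e.g. "docs/rag.md"
}

// Split a markdown document into paragraph chunks: one or more blank
// lines delimit paragraphs; empty fragments are dropped.
function chunkByParagraph(markdown: string, source: string): Chunk[] {
  return markdown
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((text) => ({ text, source }));
}
```

Each chunk then gets embedded once, and both text and embedding stay in memory for the life of the process.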
The retrieval pipeline
When you send a message, the API route:
- Embeds your query using the same embedding model
- Computes cosine similarity against every chunk
- Picks the top 3
- Injects them into Claude’s system prompt with source labels
- Streams the response back
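The scoring and prompt-assembly steps might look roughly like this — a sketch with illustrative names, not the route's actual code:

```typescript
// Compact cosine similarity over equal-length vectors.
const cosine = (a: number[], b: number[]): number =>
  a.reduce((s, x, i) => s + x * b[i], 0) / (Math.hypot(...a) * Math.hypot(...b));

interface ScoredChunk {
  text: string;
  source: string;
  score: number;
}

// Score every chunk against the query embedding and keep the top k.
function retrieveTopK(
  queryEmbedding: number[],
  chunks: { text: string; source: string; embedding: number[] }[],
  k = 3
): ScoredChunk[] {
  return chunks
    .map((c) => ({
      text: c.text,
      source: c.source,
      score: cosine(queryEmbedding, c.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Retrieved chunks become labeled context in the system prompt, so the
// model can cite them as [Source N].
function buildSystemPrompt(retrieved: ScoredChunk[]): string {
  const context = retrieved
    .map((c, i) => `[Source ${i + 1}] (${c.source})\n${c.text}`)
    .join("\n\n");
  return `Answer using only the context below. Cite as [Source N].\n\n${context}`;
}
```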
Nothing fancy — this is textbook RAG. But the interesting part is how the metadata gets to the frontend.
Custom data stream parts
The Vercel AI SDK supports custom data-* stream parts that ride alongside the LLM response. I used three:
- data-sources — the full text and metadata of retrieved chunks
- data-scores — similarity scores for all chunks (for the bar chart)
- data-queryTerms — each query word with its importance weight and which sources contain it
These arrive as typed parts in the message.parts array on the client, so the sidebar can render them in real time as the response streams in.
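On the client, consuming these is a matter of filtering message.parts by type. A sketch assuming the SDK's convention that each custom part arrives as an object with a data-* type and a data payload (the SourcePart shape here is my own):

```typescript
// One custom data part carrying retrieved chunks for the sidebar.
interface SourcePart {
  type: "data-sources";
  data: { text: string; source: string; score: number }[];
}

type Part = SourcePart | { type: string; [key: string]: unknown };

// Pull all retrieved-chunk payloads out of a streamed message's parts.
function extractSources(parts: Part[]): SourcePart["data"] {
  return parts
    .filter((p): p is SourcePart => p.type === "data-sources")
    .flatMap((p) => p.data);
}
```

The scores and query-term parts follow the same pattern, each feeding a different panel of the visualization.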
The hover system
A React context (HoveredTermContext) connects three layers: the query term chips in the sidebar, the source card text, and every word in the chat messages. Hovering any word sets the context, and all three layers react — the sidebar highlights matching text with <mark> tags, source cards get a blue border if they contain the term, and the chat word gets a blue background.
This was the feature that took the most iteration to get right. The first version only made query terms hoverable, which meant the fallback mode (no API key) showed no interaction at all. Making every word hoverable and using prefix matching for stemming (“retriev” matches “retrieval”, “retrieved”, etc.) made it work across both modes.
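The matching itself can be as crude as truncating both words to a fixed-length stem before comparing — a sketch where the stem length of 5 is an illustrative choice, not the app's exact rule:

```typescript
// Prefix "stemming": normalize a word and truncate it to a short stem,
// so "retrieval", "retrieved", and "retriever" all collapse to "retri".
function stem(word: string): string {
  const w = word.toLowerCase().replace(/[^a-z0-9]/g, "");
  return w.length > 5 ? w.slice(0, 5) : w;
}

// Two words "match" for highlighting if they share the same stem.
function matchesTerm(a: string, b: string): boolean {
  const sa = stem(a);
  return sa.length > 0 && sa === stem(b);
}
```

Short words skip truncation and must match exactly, which keeps "cat" from lighting up "car".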
Local fallback
The app works without any API key. When AI_GATEWAY_API_KEY isn’t set, it swaps in TF-IDF vectors for embeddings and skips the LLM generation step. The full retrieval pipeline still runs — chunking, vectorization, similarity search, retrieval — so you can see the visualization without spending a cent. The chat just shows a note that says “running in local mode” instead of a Claude response.
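TF-IDF gives each chunk one vector dimension per vocabulary word, weighted by term frequency times inverse document frequency — close enough in spirit to neural embeddings that the same cosine-similarity machinery works unchanged. A sketch; the app's exact weighting may differ:

```typescript
// Build one TF-IDF vector per document over a shared vocabulary.
function tfidfVectors(docs: string[]): number[][] {
  const tokenize = (d: string) => d.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const tokenized = docs.map(tokenize);
  const vocab = [...new Set(tokenized.flat())];

  // df[w] = number of documents containing word w.
  const df = new Map<string, number>();
  for (const tokens of tokenized) {
    for (const w of new Set(tokens)) df.set(w, (df.get(w) ?? 0) + 1);
  }

  return tokenized.map((tokens) =>
    vocab.map((w) => {
      const tf = tokens.filter((t) => t === w).length / tokens.length;
      const idf = Math.log(docs.length / (df.get(w) ?? 1));
      return tf * idf; // words in every doc get weight 0
    })
  );
}
```

For retrieval, the query is vectorized over the same corpus vocabulary, so the similarity search is identical in both modes.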
This was important to me. If an interviewer clicks the link, I don’t want them to hit a “please configure your API key” wall.
What I’d add next
If I were extending this into a production tool rather than a demo:
- Upload your own documents — let users drag in PDFs or paste URLs
- Persistent vector store — pgvector or Pinecone instead of in-memory
- Chunk strategy comparison — show how different chunking approaches (fixed size, semantic, sentence-level) affect retrieval quality
- Evaluation metrics — track retrieval precision and answer faithfulness
But for a portfolio piece that demonstrates RAG knowledge, the current version hits the mark. It shows the pipeline, it’s interactive, and it runs without setup.
The source code is on GitHub and the live demo is at how-does-rag-work.vercel.app.