I built the chatbot on my portfolio: RAG on a static site for $0
The constraint is the design
The “Ask About Me” box in the lab is a retrieval-augmented chatbot over my resume, case studies, and blog posts. The constraint that shaped it: this site is fully prerendered static HTML on Cloudflare Pages, and I wanted the chatbot to cost nothing at rest. That rules out a vector database, a long-running server, and anything with a monthly invoice.
What’s left turns out to be enough — and the corpus makes it obvious why. Everything I’ve written here chunks into 35 pieces. You do not need infrastructure to search 35 things.
Build time: the index is just a JSON file
A script runs before every build: it walks the content, strips frontmatter and code blocks, splits on headings, packs paragraphs into ~200-token chunks, and writes chunks.json with a source label on each chunk (case study: Chatify — Decisions & tradeoffs). The label matters: it goes into the prompt later, so the model can say where an answer came from.
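A minimal sketch of that chunking step, not the actual build script. The function names, the " / " label separator, and the word-count stand-in for a tokenizer are all assumptions made for illustration.

```typescript
// Sketch only: names, label format, and the word-count token estimate are assumptions.
interface Chunk {
  source: string; // label that later goes into the prompt
  text: string;
}

const MAX_TOKENS = 200; // approximate target size per chunk
const FENCE = "`".repeat(3); // built indirectly so this example can live inside a fenced block

function stripFrontmatterAndCode(markdown: string): string {
  const codeBlocks = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");
  return markdown
    .replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n/, "") // leading YAML frontmatter
    .replace(codeBlocks, ""); // fenced code blocks
}

// Crude token estimate: whitespace-separated words.
function countTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

function chunkDocument(markdown: string, docLabel: string): Chunk[] {
  // Put blank lines around headings so each heading becomes its own block.
  const body = stripFrontmatterAndCode(markdown).replace(/^(#{1,6}[ \t]+.*)$/gm, "\n\n$1\n\n");
  const chunks: Chunk[] = [];
  let heading = "";
  let buffer: string[] = [];
  let bufferTokens = 0;

  const flush = (): void => {
    if (buffer.length === 0) return;
    const source = heading ? `${docLabel} / ${heading}` : docLabel;
    chunks.push({ source, text: buffer.join("\n\n") });
    buffer = [];
    bufferTokens = 0;
  };

  for (const block of body.split(/\n{2,}/)) {
    const para = block.trim();
    if (!para) continue;
    const match = /^#{1,6}[ \t]+(.*)$/.exec(para);
    if (match) {
      flush(); // a heading always starts a new chunk
      heading = (match[1] ?? "").trim();
      continue;
    }
    const n = countTokens(para);
    if (bufferTokens + n > MAX_TOKENS) flush(); // pack paragraphs up to the budget
    buffer.push(para); // a single oversized paragraph stays whole
    bufferTokens += n;
  }
  flush();
  return chunks;
}
```

Writing the result is then a matter of serializing the array with `JSON.stringify` into chunks.json. A real tokenizer would give tighter sizes, but at this corpus size a word count is probably close enough.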
The script has two modes. With Cloudflare credentials it embeds every chunk with bge-base-en-v1.5 on Workers AI and stores the vectors — cosine similarity at query time. Without credentials it writes text-only chunks and retrieval falls back to BM25, computed in the worker from the same file. At this corpus size, honestly, the lexical path holds its own: most questions about a person’s work share vocabulary with the answer.
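Here is a rough sketch of how the two retrieval paths could share one function. The `IndexedChunk` shape, the tokenizer, and the BM25 parameters (the common defaults k1 = 1.2, b = 0.75) are assumptions, and the query vector is taken as an input rather than fetched from Workers AI.

```typescript
// Sketch only: the chunk shape, tokenizer, and BM25 parameters are assumptions.
interface IndexedChunk {
  source: string;
  text: string;
  vector?: number[]; // present only when the build had embedding credentials
}

const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function bm25Scores(query: string, chunks: IndexedChunk[], k1 = 1.2, b = 0.75): number[] {
  const docs = chunks.map((c) => tokenize(c.text));
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);

  // Document frequency: in how many chunks each term appears.
  const df = new Map<string, number>();
  for (const d of docs) {
    for (const term of new Set(d)) df.set(term, (df.get(term) ?? 0) + 1);
  }

  const terms = Array.from(new Set(tokenize(query)));
  return docs.map((d) => {
    let score = 0;
    for (const term of terms) {
      const tf = d.filter((w) => w === term).length;
      if (tf === 0) continue;
      const n = df.get(term) ?? 0;
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (d.length / avgLen)));
    }
    return score;
  });
}

function retrieve(
  query: string,
  chunks: IndexedChunk[],
  queryVector?: number[],
  topK = 4,
): IndexedChunk[] {
  const qv = queryVector;
  let scores: number[];
  if (qv !== undefined && chunks.every((c) => c.vector !== undefined)) {
    scores = chunks.map((c) => cosine(qv, c.vector ?? [])); // vector path
  } else {
    scores = bm25Scores(query, chunks); // lexical fallback
  }
  return chunks
    .map((chunk, i) => ({ chunk, score: scores[i] ?? 0 }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((r) => r.chunk);
}
```

Rescoring every chunk on each request is wasteful in general, but with a few dozen chunks it is a handful of microseconds and keeps the index a plain file.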
Runtime: one edge function
POST /api/ask does five things:
- Rate limit by IP — a sliding window, 10 requests per 5 minutes, before any real work happens. Cheap rejection first; I learned that pattern wiring Arcjet into Chatify.
- Cap everything — 4-message context window, 500 chars per message, 512 output tokens. Free tiers stay free because of caps, not good intentions.
- Retrieve top-4 chunks (cosine if the index has vectors and the AI binding is up, BM25 otherwise).
- Prompt Llama 3.1 8B on Workers AI with the chunks and rules: answer only from context, say “I don’t know” when it’s not there, refuse off-topic in one sentence.
- Stream the response straight through as server-sent events, so tokens render as they arrive.
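The first two steps are simple enough to sketch. This version is an illustration under assumptions, not the deployed code: it keeps the sliding window in an in-memory `Map` (Worker isolates do not share memory, so a real deployment would need shared storage or an external service), and it truncates over-long messages where rejecting them would be an equally reasonable choice.

```typescript
// Sketch only: storage choice and truncate-vs-reject behavior are assumptions.
const WINDOW_MS = 5 * 60 * 1000; // 5 minutes
const MAX_REQUESTS = 10; // per IP per window
const MAX_MESSAGES = 4; // context window
const MAX_CHARS = 500; // per message

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Timestamps of accepted requests, keyed by client IP.
const hits = new Map<string, number[]>();

function allowRequest(ip: string, now: number = Date.now()): boolean {
  // Keep only the timestamps that still fall inside the window.
  const recent = (hits.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    hits.set(ip, recent);
    return false; // cheap rejection before any retrieval or model call
  }
  recent.push(now);
  hits.set(ip, recent);
  return true;
}

function capMessages(messages: ChatMessage[]): ChatMessage[] {
  return messages
    .slice(-MAX_MESSAGES) // most recent messages only
    .map((m) => ({ role: m.role, content: m.content.slice(0, MAX_CHARS) }));
}
```

In a Worker the client IP usually comes from the `CF-Connecting-IP` request header, and a rejected request would typically get a 429 response before anything else runs.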
const chunks = retrieve(lastUser.content, queryVector); // top-4
const stream = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'system', content: systemPrompt(context) }, ...messages],
  stream: true,
  max_tokens: 512
});
return new Response(stream, { headers: sseHeaders });
The honesty clause
The system prompt’s most important line isn’t about retrieval — it’s “be honest about what is in progress vs. shipped; understatement beats overstatement.” One of the suggested questions on the page is “Is he actually good at football analytics yet?”, and the indexed answer says: in progress, models run, not validated against a real tournament. A chatbot that admits that is a better credential than one that oversells.
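To make the shape concrete, here is a hypothetical prompt builder. Only the quoted honesty line and the three rules come from this post; the function name, the surrounding wording, and the numbered-context layout are assumptions for illustration.

```typescript
// Sketch only: wording and layout are assumptions, apart from the quoted honesty line.
interface RetrievedChunk {
  source: string; // label written at build time
  text: string;
}

function buildSystemPrompt(chunks: RetrievedChunk[]): string {
  // Each chunk carries its label so the model can say where an answer came from.
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.source}\n${c.text}`)
    .join("\n\n");

  return [
    "You answer questions about the site owner using only the context below.",
    "Rules:",
    "- Answer only from the context. If the answer is not there, say \"I don't know\".",
    "- Refuse off-topic questions in one sentence.",
    "- Be honest about what is in progress vs. shipped; understatement beats overstatement.",
    "- Name the source label of the context you used.",
    "",
    "Context:",
    context,
  ].join("\n");
}
```

Keeping the rules in one short list makes them easy to audit, which matters more than clever phrasing when a small model has to follow them.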
What I’d tell you to steal
- Chunk labels in the prompt. Source attribution costs nothing and grounds the model visibly.
- A lexical fallback. Your RAG demo shouldn’t be down because an embedding API is.
- Caps before quality. Token limits, message windows, and rate limits are the difference between a fun feature and a surprise bill.
- Skip the vector DB until your chunks need pagination. Mine fit in a JSON file the worker imports. Yours might too.