Tech Stack, Limitations & Sharing

How Capybara Actually Works

Capybara isn't magic. It's a specific set of technology choices that happen to work really well together. Here's exactly what's going on.

Voice goes straight to OpenAI. When you click the mic button, your browser opens a WebRTC connection directly to OpenAI's Realtime API. Your audio is not routed through our server — it goes straight from your microphone to OpenAI's servers, and their response comes straight back. This is why the latency feels so natural. There's no middleman transcribing your speech, sending it to a chatbot, getting text back, and then synthesizing speech. It's one continuous real-time audio stream.

The model is gpt-4o-mini-realtime-preview. That's OpenAI's smallest real-time model. No need for the biggest, most expensive option — mini handles conversation, tool calling, and multilingual switching perfectly well for this use case. The voice is called 'shimmer.'

Capybara has 11 tools and they all run in your browser. When Capybara says 'I'll take you to the academics page,' it's not just talking — it's calling a function called navigate_to that actually changes the page. Same with font size, contrast, dyslexia font, reduced motion. The tools also include read_section (reads page content aloud), search_site (keyword search across all pages), and get_page_state (knows what page you're on and what's visible). All of these execute as JavaScript in your browser. OpenAI's model decides which tool to call; your browser executes it.
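The split between "the model decides" and "the browser executes" comes down to a dispatch table. The real tools are JavaScript running in the browser, so the Python sketch below only illustrates the pattern. The tool names navigate_to and get_page_state come from the description above; the handler bodies, the `state` dictionary, and the `path` argument are assumptions made up for this example.

```python
import json

# Hypothetical page state; the real site keeps this in the browser.
state = {"page": "/"}

def navigate_to(path):
    """Pretend to switch pages by recording the new path."""
    state["page"] = path
    return {"ok": True, "page": path}

def get_page_state():
    """Report which page is currently showing."""
    return {"page": state["page"]}

# Tool names map to plain functions. The model only ever picks a name.
TOOLS = {"navigate_to": navigate_to, "get_page_state": get_page_state}

def handle_tool_call(name, arguments_json):
    """Run the tool the model asked for and return a JSON result string."""
    handler = TOOLS.get(name)
    if handler is None:
        return json.dumps({"ok": False, "error": f"unknown tool: {name}"})
    args = json.loads(arguments_json or "{}")
    return json.dumps(handler(**args))
```

The model never touches the page itself. It sends a tool name and a JSON string of arguments, and whatever comes back from the handler is returned to it as the tool result.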

The site is a sneaky SPA. This looks like a normal multi-page website, but under the hood it's a single-page app. When you click a link or Capybara navigates you somewhere, the page doesn't actually reload — the router fetches the new content and swaps out the main area. This is critical because a full page reload would kill the WebRTC connection, and you'd have to reconnect to Capybara every time you changed pages. The SPA router is the reason you can say 'take me to academics' and Capybara is still right there talking to you when you arrive.

The backend is tiny. The Capybara backend is a FastAPI server running on port 8084 behind nginx. Its entire job is two things: (1) create ephemeral API keys so the OpenAI key isn't exposed to the browser, and (2) maintain a WebSocket 'sideband' connection to the same OpenAI session so it can observe what's happening. That's it. The backend doesn't process your audio, doesn't run your tools, and doesn't generate responses. It's a bouncer that hands you a wristband and then stands by the door.
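The "wristband" step could look roughly like the sketch below. It assumes the backend asks OpenAI's Realtime sessions REST endpoint for a short-lived credential and that the response carries it in a `client_secret.value` field; the source doesn't say which call the real backend makes, and the function names here are hypothetical. The sketch only builds the request, so nothing is sent over the network.

```python
import json
import urllib.request

# Assumed endpoint for minting short-lived Realtime session credentials.
SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"

def build_session_request(api_key, model="gpt-4o-mini-realtime-preview", voice="shimmer"):
    """Build (but do not send) the request that asks for an ephemeral key."""
    body = json.dumps({"model": model, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        SESSIONS_URL,
        data=body,
        method="POST",
        headers={
            # The long-lived key stays on the server and never reaches the browser.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def extract_ephemeral_key(response_json):
    """Pull out the short-lived secret, assuming a client_secret.value field."""
    return response_json["client_secret"]["value"]
```

Only the value returned by `extract_ephemeral_key` would be handed to the browser, which is the whole reason the backend exists.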

The personality is in the prompt. Capybara's personality — warm, concise, proud of the school, stays on topic, won't be prompt-injected — is entirely defined in a configuration file. It's about 40 lines of instructions that tell the model who it is, what tools to use when, and what to refuse. There's no fine-tuning, no custom model, no training data. It's a well-written system prompt and a good model.

It was recycled. The WebRTC client, the sideband pattern, the session management — all of it was adapted from Commando, an earlier voice assistant project Gwen built as a learning exercise. The architecture already worked. It just needed a new personality, new tools, and a new purpose — and honestly, finding a real use for that code was one of the most satisfying parts of this whole project.

The Rest of the Tech Stack

The website itself is deliberately simple.

Flask + Jinja2 templates. The site is a Python Flask app. Each page has an HTML template and a YAML content file. The template defines the layout; the YAML file holds every word of text, every card, every fact. This means the student team can eventually edit content by changing YAML files without touching HTML or code.

Vanilla JavaScript. There is no React, no Vue, no Angular, no npm, no webpack, no build step. The JavaScript is plain ES6 modules loaded directly by the browser. The SPA router, the accessibility toolbar, and the entire Capybara frontend are about 1,500 lines of vanilla JS across six files. This keeps things simple, fast, and easy to understand.

The accessibility toolbar is homemade. Font size adjustment, high contrast mode, dark mode, dyslexia-friendly font (OpenDyslexic), and reduced motion — all built from scratch in CSS and JS. Settings persist in localStorage. Every feature is keyboard accessible and voice-controllable through Capybara.

The VM costs almost nothing. Everything runs on a GCP e2-micro instance with 2 shared vCPUs and 1 GB of RAM. That's the free-tier machine. It runs the Flask site, the Capybara backend, and two other unrelated projects. Nginx handles routing and SSL (via Certbot). The whole thing is held together with systemd service files and good intentions.

Content search is YAML-powered. Capybara's search_site tool works by hitting a /site-content.json endpoint, which serves a flat index built from every YAML content file at startup. Each section has 'summary' and 'facts' fields written specifically so Capybara can search them. This is why Capybara can answer questions about pages you haven't visited: it's searching a pre-built index of the entire site.
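A minimal sketch of that flat index and keyword search, using plain dictionaries in place of the parsed YAML files. The 'summary' and 'facts' field names come from the description above; the 'title' field, treating 'facts' as a list, the match-every-keyword rule, and the sample pages are all assumptions for illustration.

```python
# Hypothetical stand-in for the parsed YAML content files.
SAMPLE_PAGES = {
    "/academics": [
        {"title": "Programs", "summary": "Degree programs offered",
         "facts": ["Special education is a major"]},
    ],
    "/about": [
        {"title": "History", "summary": "How the school began", "facts": []},
    ],
}

def build_index(pages):
    """Flatten {page: [section, ...]} into one list of searchable entries."""
    index = []
    for page, sections in pages.items():
        for section in sections:
            parts = [section.get("summary", "")] + section.get("facts", [])
            index.append({
                "page": page,
                "title": section.get("title", ""),
                "text": " ".join(parts).lower(),
            })
    return index

def search(index, query):
    """Return the entries whose text contains every keyword in the query."""
    words = query.lower().split()
    return [e for e in index if words and all(w in e["text"] for w in words)]
```

Calling `search(build_index(SAMPLE_PAGES), "special education")` matches only the /academics entry, even if the visitor has never opened that page.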

AI's Limitations

Capybara is powered by OpenAI. It's impressive. It's also a language model, and language models have real limitations worth being upfront about.

It can hallucinate. There are guardrails — Capybara is instructed to search the site content before answering and to say 'I don't know' rather than guess. But language models can still state things confidently that aren't true. If Capybara tells you something that sounds specific (a date, a number, a policy detail), verify it on the actual page.

No memory. Every time you start a new session with Capybara, it starts completely fresh. It doesn't remember your last conversation, your preferences, or what you asked yesterday. And even within a single session, the conversation history is pruned aggressively to keep responses fast and on-topic — so it may lose track of something you said a few minutes ago. Close the tab and it's all gone.
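The source doesn't say how the pruning is done. One simple approach, assumed here purely for illustration, is to keep only the most recent few conversation items and drop everything older. The function name and the `max_items` limit are hypothetical.

```python
def prune_history(items, max_items=8):
    """Keep only the most recent conversation items, dropping the oldest."""
    if max_items <= 0:
        return []
    return items[-max_items:]
```

With a limit of 3 and twelve turns so far, only the last three survive. That is exactly the "it forgot what I said a few minutes ago" behavior described above.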

Translation is model-quality, not human-quality. When Capybara responds in Spanish or Mandarin or ASL-aware English, it's using the model's multilingual training. It's good — surprisingly good — but it's not a professional translator. Idioms, cultural nuance, and specialized vocabulary may be off. The Google Translate widget on the site has the same limitation — it's machine translation, not professional translation.

Latency depends on a lot of factors. The WebRTC connection is fast, but you're still talking to a server on the other side of the internet. If OpenAI's servers are busy, or your connection is slow, there will be pauses. This is an inherent limitation of any cloud-based real-time AI system.

It can't verify itself. Capybara can't fact-check its own responses. It doesn't know when it's wrong. This is a fundamental limitation of current AI — the model that generates the answer cannot also reliably judge whether that answer is correct.

Gwen's Limitations

Let's be completely honest about who built this and what that means.

Gwen is not a developer. She doesn't write code. She vibe-codes — she describes what she wants to AI tools (primarily Claude and ChatGPT) and they write the code. She troubleshoots by describing error messages to AI and doing what it says. The entire site, including the Capybara integration, was built this way. Every line of Python, JavaScript, HTML, CSS, and YAML on this site was generated or guided by AI.

She is not an accessibility expert. She's a mom whose daughter studies special education. She cares deeply about accessibility but has no formal training, no certifications, and no professional experience in accessible design. She tried to follow WCAG guidelines and best practices, but 'tried to follow best practices' and 'actually followed best practices' are two very different things.

This site has had virtually no testing. No screen reader testing. No switch device testing. No testing with actual users who have disabilities. No automated accessibility audit beyond quick Lighthouse runs. No cross-browser testing matrix. No mobile device lab. A real accessible website would need all of these things. This site has none of them.

Beer may have been involved. Not going to pretend this was a rigorous engineering process. It was a grad school class project that a mom with ADHD and zero impulse control decided to build a website for, because her daughter mentioned needing one and Gwen's brain said 'I could do that' before the rational part could intervene. The development environment may have included an IPA or two.

She doesn't know what she doesn't know. This is the most important limitation. An actual developer or accessibility professional would look at this code and probably find issues that Gwen wouldn't even recognize as issues. The AI tools she used are very good, but they're only as good as the questions you ask them — and you can't ask about problems you don't know exist.

So Why Share It?

Because the point was never to claim this is perfect. The point is to show what's possible.

A non-developer built a voice-controlled accessible website in about 20 hours. That's the headline. Not 'this website is perfectly accessible' — it almost certainly isn't. But the fact that someone with no coding background, using AI tools, could build a site with real-time voice navigation, multilingual support, and a full accessibility toolkit? That means something.

Voice interaction as an accessibility tool is an idea worth taking seriously. Most accessible websites give you toggles and settings. This one gives you a conversation. You can say 'I can't see the screen, describe what's on this page' and get a spoken description. You can say 'make the text bigger and turn on high contrast' without finding a menu. That's not a gimmick — that's a fundamentally different approach to accessibility, and it works.

Imagine what this would look like if professionals built it. If a real development team with accessibility experts, UX researchers, and proper testing infrastructure took this concept and built it right (with screen reader optimization, proper ARIA patterns, user testing with disabled communities, professional translation, and security hardening), it could be extraordinary. This is the proof of concept. Someone else should build the real thing.

Get the Code

Want to see the code? Fork it? Laugh at it? Learn from it? All of the above?

The repo is available — just reach out and give Gwen a minute to clean it up first. (There are probably comments in there that say things like 'why does this work' and 'DO NOT TOUCH THIS LINE' and she'd like to pretend she's more professional than that.)

Seriously though — if you're interested in voice interaction as an accessibility tool, or in how to build a website by vibe coding with AI, or you just want to see what the code looks like when a non-developer builds something ambitious with Claude and duct tape — get in touch. Gwen is happy to share, explain, and collaborate.
