There’s an AI running in your browser tab right now. Not calling an API. Not phoning home to a data center in Iowa. Just sitting there, on your GPU, doing its best with 1 billion parameters and zero supervision.
I gave it one job: be a Seattle weather assistant. It’s… not great at being helpful. But it is running entirely on your machine, which means your embarrassing weather questions die when you close the tab. Like nature intended.
Scroll to the bottom if you want to talk to it. It has opinions about your rain jacket.
Why Run AI in the Browser?
Every time you ask an AI something, your question travels to a server, gets processed, and the answer comes back. That server logs everything. Someone’s paying for it. And if the server goes down, so does your assistant.
What if it didn’t work that way?
I built edge-llm to run LLMs directly on the client, in your browser, on your hardware, with no network dependency. It supports Llama 3, FunctionGemma (a fine-tuned Gemma for tool calling), and other small models via WebGPU and WASM. It even includes a fine-tuning pipeline so you can train your own edge models for specific tool-calling tasks.
Three reasons this matters:
Privacy. Your data never leaves your device. No logs. No server. No one in a data center reading your prompts at 2am. For apps handling financials, health data, or personal notes, this isn’t a nice-to-have, it’s the whole point.
Progressive loading. edge-llm doesn’t force you to pick one approach. It’s a hybrid inference engine. Your app starts immediately with a cloud API and silently hot-swaps to local inference once the model finishes downloading in the background:
API (instant start) → WebGPU (fast, private) → WASM (fallback, widest support)
The user gets a working AI assistant from the first page load via the API. Meanwhile, the local model downloads silently. Once it’s ready, edge-llm swaps the backend and the user never notices, except their data stops leaving the device. One codebase, three tiers. The app adapts to the device and gets more private over time.
Cost. Tokens are expensive. User hardware is free (to you). Offloading inference to the client means your GPU bill doesn’t scale with your user count. For features that don’t need frontier-model intelligence (autocomplete, classification, simple chat), running a small model locally makes more economic sense than calling GPT-4 a million times a day.
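Here’s roughly what that progressive-loading swap looks like from the app’s side. This is a sketch, not edge-llm’s literal API (createHybridEngine and onBackendChange are illustrative names), but the shape is the point: answer through the cloud from the first render, download the local model in the background, and flip backends without touching the calling code.

```ts
// Sketch of the progressive-loading pattern. createHybridEngine and
// onBackendChange are illustrative names, not confirmed edge-llm exports.
import { createHybridEngine } from "edge-llm";

const engine = await createHybridEngine({
  cloud: { endpoint: "/api/chat" },                    // tier 1: instant start
  local: { model: "FunctionGemma", prefer: "webgpu" }, // tiers 2-3: downloads in background
});

// Usable immediately: this first call goes to the cloud backend.
const reply = await engine.chat("Do I need an umbrella today?");

// When the local model finishes downloading, the engine swaps backends.
// The calling code never changes; only where inference happens does.
engine.onBackendChange((backend: string) => {
  console.log(`now running on ${backend}`); // "cloud" -> "webgpu" or "wasm"
});
```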
How It Actually Works
The browser wasn’t designed to run neural networks. But WebGPU changed that. It gives JavaScript direct GPU access, enabling the kind of parallel computation that LLMs need. WebLLM runs MLC-compiled models (Llama 3, Phi-4 Mini) on the GPU at 8-12 tokens/second on a mid-range laptop. Transformers.js handles WASM/ONNX inference for smaller models like FunctionGemma. In 2026, models like Gemma 3n are specifically optimized for on-device inference (the “n” stands for nano) and the 2B version runs comfortably in most browsers.
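The runtime choice itself comes down to a capability check. The browser calls below (navigator.gpu and requestAdapter) are the standard WebGPU detection path; how edge-llm folds the result into its tier selection is paraphrased here.

```ts
// Pick which local runtime this device can actually support.
async function pickLocalRuntime(): Promise<"webgpu" | "wasm"> {
  // navigator.gpu only exists in WebGPU-capable browsers, and
  // requestAdapter() can still return null (blocklisted drivers, headless, etc.).
  const gpu = (navigator as any).gpu;
  if (gpu && (await gpu.requestAdapter()) !== null) {
    return "webgpu"; // GPU path: MLC-compiled models via WebLLM
  }
  return "wasm"; // CPU path: ONNX models via Transformers.js
}
```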
But most browser LLM demos stop at chat. You type, it responds. That’s a toy.
I wanted agents: AI that can do things. Query data, call functions, navigate a UI. That means function calling, which most small models don’t support out of the box. That’s why edge-llm ships with a fine-tuning pipeline. You can take FunctionGemma, train it on your specific tools using LoRA adapters, export to ONNX, and deploy a custom tool-calling model directly to the browser. The whole pipeline runs from npm run pipeline.
edge-llm implements a “Thought-Action-Observation” loop entirely client-side:
while (true) {
  // Rebuild the prompt from the running conversation plus the tool schemas
  const prompt = constructPrompt(history, tools);
  const response = await model.generate(prompt);

  if (isToolCall(response)) {
    // Record the assistant's tool call, execute it, and feed the result back
    history.push({ role: "assistant", content: response });
    const result = await executeTool(response);
    history.push({ role: "tool", content: result });
  } else {
    // Plain-text answer: stop looping and return it to the user
    return response;
  }
}
The model thinks, decides whether to call a tool or respond directly, observes the result, and loops. All in the browser. No server round-trips. Combined with structured output from modern small models, this makes the browser a legitimate platform for agentic applications, not just chatbots.
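For illustration, here’s one way isToolCall and executeTool from the loop above could be implemented. The actual wire format depends on the model (structured JSON shown here; other models use different tags), and the tool registry is my own placeholder.

```ts
// Illustrative helpers for the loop above; registeredTools is a placeholder registry.
type ToolCall = { tool: string; args: Record<string, unknown> };

const registeredTools: Record<string, (args: Record<string, unknown>) => Promise<unknown>> = {
  get_forecast: async ({ city }) => ({ city, summary: "rain, with a chance of rain" }),
};

function isToolCall(response: string): boolean {
  try {
    return typeof JSON.parse(response).tool === "string";
  } catch {
    return false; // plain prose, not a structured tool call
  }
}

async function executeTool(response: string): Promise<string> {
  const call = JSON.parse(response) as ToolCall;
  const handler = registeredTools[call.tool];
  if (!handler) return JSON.stringify({ error: `unknown tool: ${call.tool}` });
  return JSON.stringify(await handler(call.args));
}
```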
The Fallback Chain
This is the part I think most people miss about client-side AI. It’s not “local OR cloud.” It’s a spectrum:
| Phase | Runtime | Speed | Privacy | When |
|---|---|---|---|---|
| Start | Cloud API | Instant | None | App loads, model downloading in background |
| Swap | WebGPU | Fast | Full | Model downloaded, GPU available |
| Fallback | WASM | Slow | Full | No GPU, but model is local |
edge-llm handles this transparently. You define your tools once (Zod or JSON Schema) and they work on any runtime. The framework handles format differences automatically (JSON for Llama 3, XML for FunctionGemma). The app doesn’t need to know which tier it’s on. The interface is the same.
This means your app starts working immediately AND gets more private over time. For a weather assistant, the local model is plenty. For a complex multi-step agent, the API handles it until the local model is ready. The developer chooses the tradeoff, not the infrastructure.
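To make the “define your tools once” point concrete, a tool declaration might look like the sketch below. The Zod calls are real; the surrounding object shape is my guess at the interface, since the exact signature isn’t shown here. The framework’s job is to serialize that one schema into whatever format the active model expects.

```ts
import { z } from "zod";

// One declaration, reused across runtimes and tool-call formats.
// The object shape is illustrative; the Zod schema is the part that matters.
const getForecast = {
  name: "get_forecast",
  description: "Current conditions and forecast for a city",
  parameters: z.object({
    city: z.string().describe("City name, e.g. Seattle"),
    days: z.number().int().min(1).max(7).default(1),
  }),
  // Runs in the browser when the model emits a matching tool call.
  execute: async ({ city, days }: { city: string; days: number }) =>
    `Forecast for ${city} over ${days} day(s): rain, then a break, then rain.`,
};
```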
What’s Next
Small models are getting smarter fast. Phi-4 Mini, Gemma 3n, Qwen 3. Each generation closes the gap between “local toy” and “actually useful.” Consumer hardware is catching up too (NPUs in Apple Silicon and Qualcomm Snapdragon X, better integrated GPUs everywhere). The “browser as an OS” concept is extending to “browser as an AI runtime.”
We’re not there yet. The model below will occasionally say something unhinged about Seattle weather. That’s part of the charm.
Check out the full source on GitHub.
Talk to the Weather Assistant
The demo below loads a small language model entirely in your browser. No server, no API keys, no data leaves your machine. I gave it a system prompt that makes it a passive-aggressive Seattle weather assistant. Think The Needling meets your friend who insists they don’t own an umbrella.
Ask it about the weather. Ask it about your rain jacket. Ask it anything. It’s running on your hardware and it’s trying its best.