Client-Side AI with edge-llm

The AI landscape is bifurcating. On one end, we have massive frontier models (GPT-4, Claude 3.5 Sonnet) running in data centers, consuming gigawatts of power. On the other end, we have a quiet revolution happening on edge devices: laptops, phones, and even browsers.

I’ve been building edge-llm, a framework to bring capable, function-calling AI agents directly to the client. Here’s why I think this matters.

The Case for Local Inference

  1. Privacy: Sending sensitive user data (financials, personal notes, health data) to an API is a non-starter for many apps. Local execution means data never leaves the device.
  2. Latency: A round-trip to an API takes time. Local inference can be instant (after initial load), enabling real-time UI interactions that feel “native” rather than “chatty.”
  3. Cost: Tokens cost money. User hardware is free (to you). Offloading inference to the client slashes infrastructure bills.

Enter WebGPU

WebGPU is the game changer. It allows JavaScript to access the GPU directly, enabling massive parallel computation in the browser. Libraries like Transformers.js leverage this to run quantized models (like Gemma 2B or Llama 3 8B) at surprising speeds.
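To make this concrete, here is a minimal sketch of loading a quantized model with Transformers.js on WebGPU. The model id is illustrative (any ONNX-converted chat model from the Hugging Face Hub works), and the `dtype`/`device` values are the options Transformers.js v3 exposes:

```javascript
// Sketch: load a quantized text-generation model in the browser.
// The model id is illustrative; substitute any ONNX-converted chat model.
async function loadModel() {
  const { pipeline } = await import("@huggingface/transformers");
  return pipeline("text-generation", "onnx-community/Llama-3.2-1B-Instruct", {
    device: "webgpu", // fall back to "wasm" on browsers without WebGPU
    dtype: "q4",      // 4-bit quantization keeps the download manageable
  });
}
```

The first call downloads and caches the weights, so the "instant after initial load" caveat from the latency point above applies here too.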

Building edge-llm

My goal with edge-llm wasn’t just to “run a chatbot.” I wanted agents. That means function calling.

Most local LLM demos are simple chat interfaces. But to build useful software, the LLM needs to do things—query a database, filter a list, or navigate a UI.
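As a sketch of what "doing things" looks like, here is one way a tool registry and dispatcher might be wired up. The tool name and shape are hypothetical, not edge-llm's actual API: each tool pairs a description (injected into the prompt) with a handler the agent can invoke.

```javascript
// Hypothetical tool registry: descriptions are shown to the model in the
// prompt; handlers run locally when the model requests them.
const tools = {
  filterList: {
    description: "Filter an array of items where item[field] === value.",
    handler: ({ items, field, value }) =>
      items.filter((item) => item[field] === value),
  },
};

// Dispatch a parsed tool call { name, args } to its registered handler.
async function executeTool({ name, args }) {
  const tool = tools[name];
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  return tool.handler(args);
}
```

Because the handlers run in the same JavaScript context as the UI, a tool can directly touch application state, which is exactly what makes the client-side agent pattern interesting.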

edge-llm implements a rigorous “Thought-Action-Observation” loop client-side:

// Simplified Thought-Action-Observation loop
while (true) {
  const prompt = constructPrompt(history, tools);
  const response = await model.generate(prompt);

  if (isToolCall(response)) {
    // Record both the tool call and its result, so the next
    // iteration's prompt includes the action and the observation
    history.push({ role: "assistant", content: response });
    const result = await executeTool(response);
    history.push({ role: "tool", content: result });
  } else {
    // No tool call: the model has produced its final answer
    return response;
  }
}
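The `isToolCall` check depends on how the model formats its output. One common convention (assumed here for illustration, not prescribed by edge-llm) is to have the model emit a JSON object with `tool` and `args` keys, which can be detected and parsed like this:

```javascript
// Hypothetical parser: treat the response as a tool call if it is a JSON
// object with a string "tool" key and an "args" key; otherwise it is a
// plain-text final answer and we return null.
function parseToolCall(response) {
  try {
    const parsed = JSON.parse(response.trim());
    if (parsed && typeof parsed.tool === "string" && "args" in parsed) {
      return { name: parsed.tool, args: parsed.args };
    }
  } catch {
    // Not valid JSON: fall through and treat it as a final answer.
  }
  return null;
}
```

Small models do drift from the requested format, so in practice this parser is where most of the robustness work lives.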

This loop, combined with the structured output capabilities of modern small language models (SLMs), makes the browser a legitimate platform for agentic applications.

What’s Next?

We are just scratching the surface. As models get smaller and smarter (Phi-3, Gemma 2), and consumer hardware gets faster (NPU integration), the “browser as an OS” concept will extend to “browser as an AI runtime.”

Check out the project on GitHub to see the code.
