Overview
edge-llm is a TypeScript monorepo for running LLMs as close to the user as possible. It provides a unified API across three runtime tiers — WebLLM (WebGPU), Transformers.js (WASM), and traditional API fallback — with automatic capability detection and hot-swapping between them. The goal: use the fastest available runtime without the app knowing or caring which one is active.
Architecture
The monorepo is split into four packages:
- @edge-llm/core — Runtime abstraction layer. Detects WebGPU support, falls back to WASM via Transformers.js, then to API. Manages model loading, chat sessions, context windows, and streaming.
- @edge-llm/react — React bindings. `LLMProvider` for context, `useLLM` hook for components. Handles loading states, streaming tokens, and tool call lifecycle.
- @edge-llm/server — Server-side utilities for hybrid inference. When a client can’t run a model locally, requests route through here transparently.
- @edge-llm/fine-tune — MLX-based fine-tuning pipeline. LoRA adapters on FunctionGemma, ONNX export with quantization for deployment back to the browser.
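The tier-selection logic in @edge-llm/core can be sketched roughly as below. This is a simplified illustration, not the actual implementation: the names `RuntimeTier` and `detectTier` are hypothetical, and a real probe would also await `navigator.gpu.requestAdapter()` to confirm a usable WebGPU adapter rather than relying on feature detection alone.

```typescript
type RuntimeTier = "webgpu" | "wasm" | "api";

// Hypothetical sketch: pick the fastest runtime the environment supports.
// A production check would also call navigator.gpu.requestAdapter() and
// verify it returns a non-null adapter before committing to WebGPU.
function detectTier(env: { hasWebGPU: boolean; hasWasm: boolean }): RuntimeTier {
  if (env.hasWebGPU) return "webgpu"; // WebLLM path
  if (env.hasWasm) return "wasm";     // Transformers.js path
  return "api";                       // server fallback
}

// Probe the current environment (works in both browser and Node):
const env = {
  hasWebGPU: typeof navigator !== "undefined" && "gpu" in navigator,
  hasWasm:
    typeof WebAssembly === "object" &&
    typeof WebAssembly.instantiate === "function",
};

// "webgpu" in a WebGPU-capable browser, "wasm" in Node, "api" otherwise.
console.log(detectTier(env));
```

Keeping the probe separate from the decision function makes the fallback order easy to test without mocking browser globals.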
Key Features
- 3-Tier Runtime: WebGPU → WASM → API, selected automatically based on device capabilities. Hot-swap mid-session if conditions change.
- Universal Tool Calling: Define tools once and they work across all runtimes. Supports parsing both JSON- and XML-formatted tool calls for broad model compatibility.
- Built-In Fine-Tuning: Train LoRA adapters on FunctionGemma using MLX, export to ONNX with quantization, deploy the result back to edge devices.
- Hybrid Inference: Seamlessly split workloads between client and server. Heavy reasoning goes to the server, quick interactions stay local.
- Privacy-First: On-device inference means data never leaves the user’s machine when running locally.
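Universal tool calling hinges on normalizing the two output formats models emit. A minimal sketch of that normalization is below; the `ToolCall` shape, `parseToolCall` name, and the exact XML tag are illustrative assumptions, not the actual @edge-llm/core API.

```typescript
// Illustrative tool-call representation (not the real library type).
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Normalize a model's raw tool-call output, whether it arrives as
// JSON ({"name": ..., "arguments": {...}}) or as a hypothetical
// XML wrapper (<tool_call name="...">{...}</tool_call>).
function parseToolCall(raw: string): ToolCall | null {
  const text = raw.trim();

  // JSON style.
  if (text.startsWith("{")) {
    try {
      const obj = JSON.parse(text);
      if (typeof obj.name === "string") {
        return { name: obj.name, args: obj.arguments ?? {} };
      }
    } catch {
      // Not valid JSON; fall through to the XML attempt.
    }
  }

  // XML style, with a JSON argument payload inside the tag.
  const m = text.match(/<tool_call name="([^"]+)">([\s\S]*?)<\/tool_call>/);
  if (m) {
    try {
      return { name: m[1], args: JSON.parse(m[2]) };
    } catch {
      return { name: m[1], args: {} };
    }
  }

  return null; // plain text, no tool call detected
}
```

Returning `null` rather than throwing lets the caller treat unparseable output as ordinary assistant text, which is the common failure mode with smaller on-device models.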
Why I Built It
Cloud LLMs are powerful but come with latency, cost, and privacy tradeoffs that don’t make sense for every interaction. A lot of what apps need — form filling, entity extraction, classification, basic tool use — can run locally on modern hardware. I wanted a framework that makes that easy without giving up the ability to fall back to a server when needed.
The fine-tuning piece came from wanting to ship models that are actually good at specific tool schemas rather than relying on generic instruction-following and hoping for the best.