Overview
edge-llm is a TypeScript monorepo for running LLMs as close to the user as possible. It provides a unified API across three runtime tiers — WebLLM (WebGPU), Transformers.js (WASM), and traditional API fallback — with automatic capability detection and hot-swapping between them. The goal: use the fastest available runtime without the app knowing or caring which one is active.
Architecture
The monorepo is split into four packages:
- @edge-llm/core — Runtime abstraction layer. Detects WebGPU support, falls back to WASM via Transformers.js, then to API. Manages model loading, chat sessions, context windows, and streaming.
- @edge-llm/react — React bindings. `LLMProvider` for context, `useLLM` hook for components. Handles loading states, streaming tokens, and tool call lifecycle.
- @edge-llm/server — Server-side utilities for hybrid inference. When a client can’t run a model locally, requests route through here transparently.
- @edge-llm/fine-tune — MLX-based fine-tuning pipeline. LoRA adapters on FunctionGemma, ONNX export with quantization for deployment back to the browser.
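The tier-selection logic in @edge-llm/core can be sketched roughly as below. This is a simplified illustration, not the actual implementation: the names `RuntimeTier` and `detectTier` are hypothetical, and a real probe would also await `navigator.gpu.requestAdapter()` to confirm a usable WebGPU adapter rather than relying on feature detection alone.

```typescript
type RuntimeTier = "webgpu" | "wasm" | "api";

// Hypothetical sketch: pick the fastest runtime the environment supports.
// A production check would also call navigator.gpu.requestAdapter() and
// verify it returns a non-null adapter before committing to WebGPU.
function detectTier(env: { hasWebGPU: boolean; hasWasm: boolean }): RuntimeTier {
  if (env.hasWebGPU) return "webgpu"; // WebLLM path
  if (env.hasWasm) return "wasm";     // Transformers.js path
  return "api";                       // server fallback
}

// Probe the current environment (works in both browser and Node):
const env = {
  hasWebGPU: typeof navigator !== "undefined" && "gpu" in navigator,
  hasWasm:
    typeof WebAssembly === "object" &&
    typeof WebAssembly.instantiate === "function",
};

// "webgpu" in a WebGPU-capable browser, "wasm" in Node, "api" otherwise.
console.log(detectTier(env));
```

Keeping the probe separate from the decision function makes the fallback order easy to test without mocking browser globals.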
Key Features
- 3-Tier Runtime: WebGPU → WASM → API, selected automatically based on device capabilities. Hot-swap mid-session if conditions change.
- Universal Tool Calling: Define tools once and they work across all runtimes. Supports parsing both JSON- and XML-formatted tool calls for broad model compatibility.
- Built-In Fine-Tuning: Train LoRA adapters on FunctionGemma using MLX, export to ONNX with quantization, deploy the result back to edge devices.
- Hybrid Inference: Seamlessly split workloads between client and server. Heavy reasoning goes to the server, quick interactions stay local.
- Privacy-First: On-device inference means data never leaves the user’s machine when running locally.
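Universal tool calling hinges on normalizing the two output formats models emit. A minimal sketch of that normalization is below; the `ToolCall` shape, `parseToolCall` name, and the exact XML tag are illustrative assumptions, not the actual @edge-llm/core API.

```typescript
// Illustrative tool-call representation (not the real library type).
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Normalize a model's raw tool-call output, whether it arrives as
// JSON ({"name": ..., "arguments": {...}}) or as a hypothetical
// XML wrapper (<tool_call name="...">{...}</tool_call>).
function parseToolCall(raw: string): ToolCall | null {
  const text = raw.trim();

  // JSON style.
  if (text.startsWith("{")) {
    try {
      const obj = JSON.parse(text);
      if (typeof obj.name === "string") {
        return { name: obj.name, args: obj.arguments ?? {} };
      }
    } catch {
      // Not valid JSON; fall through to the XML attempt.
    }
  }

  // XML style, with a JSON argument payload inside the tag.
  const m = text.match(/<tool_call name="([^"]+)">([\s\S]*?)<\/tool_call>/);
  if (m) {
    try {
      return { name: m[1], args: JSON.parse(m[2]) };
    } catch {
      return { name: m[1], args: {} };
    }
  }

  return null; // plain text, no tool call detected
}
```

Returning `null` rather than throwing lets the caller treat unparseable output as ordinary assistant text, which is the common failure mode with smaller on-device models.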
Why I Built It
Cloud LLMs are powerful but come with latency, cost, and privacy tradeoffs that don’t make sense for every interaction. A lot of what apps need — form filling, entity extraction, classification, basic tool use — can run locally on modern hardware. I wanted a framework that makes that easy without giving up the ability to fall back to a server when needed.
The fine-tuning piece came from wanting to ship models that are actually good at specific tool schemas rather than relying on generic instruction-following and hoping for the best.