One interface, every model at your command. Your local AI routing butler.
Quick Start · Features · Configuration · Security
Smart Routing Butler is a 100% self-hosted, OpenAI-compatible API smart router purpose-built for AI agents (OpenClaw, Cursor, Continue, etc.) and developer tools. It automatically balances cost, latency, and quality — connect to a single endpoint and seamlessly dispatch requests across cloud LLMs and local models.
📑 Table of Contents
- 💡 Why Smart Routing Butler?
- 🔀 Comparison with Alternatives
- ✨ Core Features
- 📸 UI Preview
- 🎯 Rule Creation — Three Ways
- 🔌 OpenAI-Compatible Local Proxy
- 🔍 Routing Layers Deep Dive
- 🏗️ Architecture Overview
- 🚀 Quick Start (Self-Hosted)
- ⚙️ Configuration Summary
- 📂 Repository Structure
- 🛠️ Development & Health Checks
- 🗺️ Roadmap
- ⚖️ Open-Source Governance
- 🛡️ Security & Privacy
- 🤝 Contributing
- 📜 License & Disclaimer
- 🙏 Acknowledgments
When using AI agents (OpenClaw, Cursor, Continue, etc.) and IDE-assisted coding daily, we constantly hit these pain points:
- Steep API costs — Whether it's a simple spell check or complex architecture design, tools always use default models which may not at the right price.
- Rigid global config — No way to assign the right model per task type (code completion, long-form summarization, multi-step reasoning).
- Black-box fragility — Routing logic is opaque; when a single model provider goes down, the entire agent workflow collapses.
Smart Routing Butler turns "which model to use" into a policy-driven, hot-reloadable configuration problem. It acts as your local proxy layer, intercepts all LLM requests, and intelligently dispatches them based on your rules and semantic understanding.
| Dimension | Typical Cloud API Gateway | Smart Routing Butler |
|---|---|---|
| Integration | Requires dedicated plugins, browser extensions, or SDK wrappers per tool | Standard OpenAI-compatible endpoint — any tool that supports base_url + API key works instantly. No plugins needed. |
| Data privacy | Traffic routed through third parties — leak risk | 100% self-hosted, data stays on your local network |
| Routing logic | Platform black-box, no user control | L0–L3 white-box, transparent, configurable, explainable |
| Rule customization | Limited or no user-defined rules | Full visual editor + natural language + AI wizard for custom routing rules |
| Compliance | Dependent on vendor terms, region-locked | Deploy on your own network, meets the strictest enterprise requirements |
| Cost control | Platform fees or fixed monthly charges | Zero platform fees, route on-demand to maximize free/cheap model value |
- Drop-in OpenAI-compatible proxy — Exposes standard
POST /v1/chat/completionsandGET /v1/modelsendpoints on your local network. Any tool that supports OpenAI API (OpenClaw, Cursor, Continue, ChatBox, etc.) works out of the box — just set the base URL and API token. No plugins, no browser extensions, no SDK changes. All traffic stays on your local network and never passes through any external gateway. See OpenAI-Compatible Local Proxy for details. - Multi-layer intelligent routing — L0 (exact cache) + L0.5 (semantic cache) + L1 (user-defined rules) + L2 (semantic matching) + L3 (local model arbitration) — five-layer decision chain for precise task-to-model matching. See Routing Layers Deep Dive for details.
- Flexible rule creation — Define your own L1 routing rules via a visual editor, or let AI do it for you: describe your intent in natural language, or use the AI questionnaire wizard to auto-generate a complete rule set in minutes. See Rule Creation — Three Ways for details.
- Significant cost reduction — Offload simple tasks to local models or cheap APIs; reserve flagship models for complex tasks only.
- High availability & auto-fallback — Built-in circuit breaker and fallback chains. When the primary model times out or errors, traffic automatically shifts to backups.
- Full observability — Beautiful Next.js web dashboard with request logs, token usage, rule hit analysis at a glance.
- 100% data control — Fully self-hosted, data never leaves your infrastructure. API keys encrypted with AES-256-GCM — no third-party gateway privacy risks.
- Blazing performance — L1 rule engine matches in-memory synchronously (<2ms). Full SSE streaming passthrough for zero-latency feel.
Click any category to browse screenshots.
Smart Routing Butler provides three distinct approaches to creating routing rules, from fully manual to fully AI-driven. Mix and match to suit your workflow.
Build rules visually through the web dashboard — no code required. Define conditions based on task type, keywords, token count, model preferences, and more; set priority, target model, and up to 3 fallback models per rule. Rules take effect immediately via hot-reload.
Example — Route coding tasks to a code-specialized model:
| Field | Value |
|---|---|
| Rule name | Coding Rule |
| Priority | 900 (high) |
| Condition | Task type = coding |
| Target model | Alibaba/qwen3-coder-plus |
| Fallback | Alibaba/qwen3.5-plus |
Once saved, any request classified as a coding task automatically goes to the code-optimized model, with a general-purpose fallback if the primary is unavailable.
Describe your routing intent in plain language, and the built-in LLM translates it into structured rules automatically. Ideal for users who know what they want but prefer not to configure fields manually.
Example prompts:
Use DeepSeek Coder for code; GPT-4o-mini for chatUse cheaper models when budget is under $5 per million tokensIf OpenAI is unhealthy, switch to AnthropicLong docs (>10000 tokens) use Claude; short questions use GPT-4o-miniGPT-4o for math/analysis; DeepSeek for translation
Type a sentence, click Generate rules, and the system produces one or more ready-to-use rules that you can review, edit, and enable in one click.
A 5-step interactive wizard that walks you through your use cases, preferred providers, budget, and priorities — then automatically generates a complete initial rule set tailored to your needs.
Wizard steps:
- Select use cases — Coding & debugging, data analysis, content creation, general chat, translation, math & reasoning, long document processing
- Choose providers — Pick from your configured providers (OpenAI, Anthropic, Alibaba, local Ollama, etc.)
- Set budget preference — Cost-sensitive, balanced, or quality-first
- Define priorities — Latency vs. quality vs. cost trade-offs
- Review & apply — Preview all generated rules, tweak if needed, then activate
Perfect for first-time setup — go from zero rules to a fully operational routing strategy in under 2 minutes.
Smart Routing Butler is purpose-built for AI agents like OpenClaw, Cursor, Continue, ChatBox, and any tool that speaks the OpenAI API protocol. Integration requires zero plugins and zero SDK modifications — configure a local URL and token, and you're done.
The proxy (Node.js, default port 8080) exposes two standard OpenAI-compatible endpoints on your local network:
| Endpoint | Method | Description |
|---|---|---|
/v1/chat/completions |
POST |
Chat completions (streaming and non-streaming) |
/v1/models |
GET |
List all available models (includes a synthetic auto model for smart routing) |
Client configuration (any OpenAI-compatible tool):
Base URL: http://localhost:8080/v1
API Key: <token created in the Dashboard → API Tokens page>
Model: auto (let the router decide)
— or —
Provider/model (e.g. openai/gpt-4o, to bypass routing)
- Your agent sends a standard
POST /v1/chat/completionsrequest withAuthorization: Bearer <token>. - The proxy validates the token (SHA-256 hashed lookup in PostgreSQL, cached in Redis for 60s).
- If
model=auto, the request enters the five-layer routing chain. If a specific model is given, it goes directly to that provider. - The proxy resolves the target provider, decrypts the stored API key (AES-256-GCM), and forwards the request to the upstream provider API.
- For streaming (
stream: true), SSE chunks are relayed in real-time (for await … res.write(chunk)) — no buffering, no extra latency. - For non-streaming responses, the proxy rewrites the
modelfield to the actual target and caches the result.
All traffic stays local: Agent → localhost:8080 → upstream provider API. The proxy runs on your machine or your Docker host; no request is rerouted through any third-party gateway or external relay.
When model is set to auto, the request passes through five decision layers in order. The first layer to produce a match wins. Each miss passes control to the next layer. See the flow diagram in Architecture Overview.
- Key:
exact:<SHA256(model + messages_json)> - Storage: Redis
GET/SET EX; default TTL 24h. - Speed: Sub-millisecond Redis lookup.
- When it fires: Identical request (same model + same messages) seen before and not expired.
- Mechanism: The user message is embedded into a 384-dimensional vector (via
BAAI/bge-small-zh-v1.5). RediSearch performs a KNN-1 cosine similarity query against stored embeddings. - Threshold: Cosine similarity >= 0.95 (configurable). A near-identical question returns the cached response even if wording differs slightly.
- Timeout: 55ms HTTP budget from proxy to router; on timeout the layer is skipped.
This is where your custom routing rules take effect. Rules are loaded into memory at startup and hot-reloaded via Redis Pub/Sub — matching is fully synchronous with < 2ms P99 latency.
Supported conditions (combinable with AND / OR):
| Condition | Description |
|---|---|
taskType |
Auto-detected task category (coding, translation, analysis, math, creative, chat, summarization, general) |
keywords |
Case-insensitive substring match on the last user message |
tokenCount |
Estimated token count within a min/max range |
maxCost |
Input cost per million tokens <= threshold |
maxLatency |
Provider average latency <= threshold |
providerHealth |
Provider health status matches |
Rules are evaluated in priority descending order (0–1000). The first match wins and returns the rule's targetModel plus an optional fallback chain of up to 3 models. You can create rules via the visual editor, natural language generator, or AI wizard (see Rule Creation above).
- Mechanism: The last user message is embedded and compared against pre-configured route utterance embeddings (8 semantic categories, e.g. "code", "translation", "math"). Best cosine similarity match above 0.85 threshold wins.
- Timeout: 55ms; on miss or timeout, passes to L3.
- Model mapping: Each semantic category maps to a
provider/modelviaROUTE_MODEL_MAP.
- Mechanism: Sends the user message to a small local LLM running on the host via Ollama (default:
fauxpaslife/arch-router:1.5b, ~900MB). The model returns a JSON classification{"category": "...", "confidence": ...}that maps to a target model. - Timeout: 140ms read budget; on timeout, error, or unrecognized category → falls through to default model.
- No external calls: Ollama runs on your host machine; the router accesses it via
host.docker.internal:11434.
If all layers miss, the system selects the first enabled model from the database as the default target. A L3_FALLBACK counter is incremented asynchronously for monitoring via the dashboard.
Mermaid source (interactive on GitHub desktop)
graph TD
Client["Client / AI Agent"] -->|"OpenAI-compatible API"| Proxy("Node.js Proxy")
subgraph decisionChain ["Decision & Cache Chain"]
Proxy --> L0{"L0 Exact Cache"}
L0 -->|miss| L05{"L0.5 Semantic Cache"}
L05 -->|miss| L1{"L1 Rule Engine"}
L1 -->|miss| L2{"L2 Semantic Route"}
L2 -->|miss| L3{"L3 Local Model Arbiter"}
end
L3 -->|decision| Dispatch("Request Dispatch")
L1 -->|hit| Dispatch
L2 -->|hit| Dispatch
Dispatch -->|"SSE streaming"| Cloud["Cloud LLM Providers"]
Dispatch -->|"SSE streaming"| Local["Local LLM / Ollama"]
Proxy -.->|"async logging"| DB[("PostgreSQL")]
Proxy -.->|"async caching"| Redis[("Redis")]
| Dependency | Description |
|---|---|
| Docker & Compose | One-command orchestration of proxy / router / dashboard / postgres / redis |
| Ollama (optional) | For L3 local model arbitration; containers access the host via host.docker.internal:11434 |
Source distribution: Currently distributed via GitHub only (git clone, Code → Download ZIP, or Releases). npm install is only used to install third-party dependencies inside proxy/ and dashboard/ after cloning.
-
Clone
git clone https://github.com/Moonaria123/Smart-Routing-Butler-for-OpenClaws.git cd Smart-Routing-Butler-for-OpenClaws -
Environment variables
cp .env.example .env # Edit: DATABASE_URL, REDIS_URL, ENCRYPTION_KEY, BETTER_AUTH_SECRET, etc. -
(Optional) Pull L3 model
ollama pull fauxpaslife/arch-router:1.5b
-
Launch
docker compose up -d docker compose exec dashboard npx prisma migrate deploy -
Access: Dashboard at
http://localhost:3000; point your client to the proxy's OpenAI-compatible endpoint athttp://localhost:8080/v1.
See docker-compose.yml, docker-compose.release.yml, and sub-directory READMEs for details.
When running npm ci in proxy/ or dashboard/, the .npmrc file only affects the dependency package download source — it does not replace git clone. Dockerfiles copy .npmrc automatically. The router uses pip install -r requirements.txt.
| Category | Entry Point |
|---|---|
| Global & ports | .env.example, compose/ports.env |
| Proxy / routing | PYTHON_ROUTER_URL, OLLAMA_URL, ARCH_ROUTER_MODEL, ROUTING_ENABLE_L2 / L3, etc. |
| Dashboard & auth | BETTER_AUTH_URL, BETTER_AUTH_SECRET, DATABASE_URL, PROXY_URL |
| Pre-built images | GHCR_OWNER, SMARTROUTER_IMAGE_TAG |
Never commit .env, API keys, or production connection strings to Git.
| Directory | Description |
|---|---|
proxy/ |
Node.js proxy: OpenAI-compatible API, L0/L1 cache & rules, SSE |
router/ |
FastAPI: semantic routing, caching, L3 integration |
dashboard/ |
Next.js: rules, providers, logs, settings |
contracts/ |
Inter-service contracts |
# proxy/
npm run type-check && npm run lint
# router/
python -m mypy app/ --strict && python -m ruff check app/
# dashboard/
npm run type-check && npm run lintRecently shipped — 20260405
- Multimodal & generative traffic: modalities on
request_logs, overview KPIs, proxy routes (/v1/images/generations, multimodal chat forwarding helpers). - API Token (local key) dimension:
apiTokenId/apiTokenNameon logs, CSV export, rules-hit filters, and cost aggregates. - Dashboard Overview analytics: dedicated API (
/api/stats/overview-analytics) with trend & pie charts and filters (dashboard/src/components/overview/*). - Thinking / reasoning mode: model flags, rule
thinkingStrategy, request log fields; OpenAIreasoning_effortmapping. - Security: Redis sliding-window rate limiting on the proxy (SEC-003); npm
overridesfor audited transitive deps (SEC-002).
Up next (backlog)
- Plugin system for custom routing strategies
- Multi-user team collaboration with role-based access
- Token budget tracking and usage alerts
- More LLM provider integrations (Google Gemini, Mistral, etc.)
- API key rotation and lifecycle management
- Prometheus / Grafana metrics export
Have a feature request? Open an issue and describe your use case.
| Document | Description |
|---|---|
| LICENSE | MIT License |
| CODE_OF_CONDUCT.md | Community standards based on Contributor Covenant 2.1 |
| CONTRIBUTING.md | Contribution workflow, IP policy, and coding standards |
| SECURITY.md | Vulnerability reporting & responsible disclosure |
Please read the Code of Conduct before participating in Issues, PRs, or Discussions. Maintainers reserve the right to moderate disruptive or harassing content.
- Vulnerability reports: Do not disclose exploitable details publicly. Follow SECURITY.md.
- Deployment & data: This software is self-hosted. User prompts, responses, logs, and keys are managed by the deployer on their own infrastructure. You are responsible for reviewing upstream LLM provider terms of service and data residency policies.
- Supply chain: We recommend locking image and dependency versions (
package-lock.json,requirements.txt) in production and monitoring security advisories.
Issues and Pull Requests are welcome — see CONTRIBUTING.md. By contributing, you agree to the CODE_OF_CONDUCT.md and the licensing terms in LICENSE.
- Released under the MIT License.
- Provided "AS IS": No warranties of merchantability, fitness for a particular purpose, or non-infringement — use at your own risk.
- Limitation of liability: To the extent permitted by law, authors and contributors shall not be liable for any indirect, incidental, special, or consequential damages.
Smart Routing Butler is built on the shoulders of these great open-source projects:
- Next.js — React framework for the dashboard
- Fastify / Express — Node.js server framework for the proxy
- FastAPI — Python framework for the semantic router
- Ollama — Local LLM runtime for L3 arbitration
- Prisma — Database ORM
- Redis — In-memory cache
- PostgreSQL — Persistent storage
- For discussions on intelligent routing and cost optimization, see similar projects in the community. This repository makes no claims of feature parity with third-party products.
















