Claude Code with Self-Hosted Models
Claude Code with Self-Hosted Models
Section titled “Claude Code with Self-Hosted Models”Claude Code CLI can be pointed at any backend that implements the Anthropic Messages API — including self-hosted models via vLLM, Ollama, LM Studio, or LiteLLM. This means you can use Claude Code’s superior tooling (skills, team spawners, MCP, plugins) with free, unlimited self-hosted models like Devstral.
How It Works
Section titled “How It Works”Four environment variables redirect Claude Code to your backend:
export ANTHROPIC_BASE_URL="http://your-server:port"export ANTHROPIC_AUTH_TOKEN="your-token"export ANTHROPIC_CUSTOM_MODEL_OPTION="model-name"export ANTHROPIC_MODEL="model-name"The backend must implement the Anthropic Messages API (/v1/messages), not the OpenAI API (/v1/chat/completions). Several tools support this natively:
| Backend | Anthropic API Support | Notes |
|---|---|---|
| LiteLLM | Yes (built-in) | Translates Anthropic requests to any backend |
| Ollama (v0.14+) | Yes (native) | Direct Anthropic endpoint |
| LM Studio (v0.4.1+) | Yes (native) | Direct Anthropic endpoint |
| vLLM | Yes (native) | /v1/messages endpoint |
| Open-WebUI | Via LiteLLM | Routes through LiteLLM which speaks Anthropic |
Why This Matters
Section titled “Why This Matters”| Feature | Claude Code (Anthropic Cloud) | Claude Code (Self-Hosted) |
|---|---|---|
| Tool use (Read, Edit, Bash, Grep) | Yes | Yes (verified) |
| Skills (/team-review, /commit, etc.) | Yes | Yes |
| MCP servers | Yes | Yes (but search disabled by default) |
| Plugins (Superpowers, etc.) | Yes | Yes |
| CLAUDE.md rules | Yes | Yes |
| Context window | 1M tokens | Model-dependent (128K for Devstral) |
| Cost | $20-200/mo | Free (self-hosted) |
| Data stays on-network | No | Yes |
| SWE-bench quality | 80.8% (Opus) | ~72% (Devstral 2) |
Don’t Want a CLI? Le Chat at $15/month
Section titled “Don’t Want a CLI? Le Chat at $15/month”Before you self-host: if your only reason to escape Anthropic’s cloud is the price tag ($20–200/mo), the cheapest path may not be self-hosting at all — it may be chat.mistral.ai (“Le Chat”) Pro at $14.99/month (commonly rounded to “$15/mo”). That’s notably under both Claude Pro and ChatGPT Plus, and you get a web UI to Mistral Large, Codestral, and Pixtral with document upload, web search, image generation, and Canvas (in-browser code editing).
| Claude Code (self-hosted) | Le Chat Pro (web, $15/mo) | |
|---|---|---|
| Cost | $0 software + your compute / hardware | $14.99/mo flat |
| Hardware needed | GPU server, or LiteLLM/Ollama on a beefy Mac | Just a browser |
| Touches your filesystem | Yes (full agentic) | No — upload/download only |
| Runs shell commands | Yes | No |
| Skills, MCP, subagents | Yes | No (web-only) |
| Data stays on-network | Yes (with local model) | No — Mistral cloud |
| Best for | Multi-file refactors, automation, repo-aware work | Q&A, brainstorm, one-off code snippets |
Rule of thumb: if your workflow is “ask the model, copy code back into the editor,” Le Chat Pro is the better deal — same Mistral models, none of the setup. If you need the model to act on your codebase — read files, edit in place, run commands, spawn subagents — keep going with self-hosted Claude Code (or use Mistral Vibe, which is also a terminal agent and runs against the same backends).
Many teams use both: Le Chat Pro for chat-style work, a CLI agent (Claude Code or Vibe) for agentic edits.
Setup with LiteLLM Gateway
Section titled “Setup with LiteLLM Gateway”Best for teams already running LiteLLM (e.g., behind Open-WebUI). LiteLLM translates Anthropic-format requests to your vLLM/Ollama backend.
Quick Setup (Switcher Script)
Section titled “Quick Setup (Switcher Script)”The easiest way is the backend switcher script from the Vibe Coding repo (canonical) — also reachable at the legacy alias git.irregularchat.com/public-repos/vibe-coding:
# Install the switchercp claude-selfhosted/claude-switch.sh ~/.local/bin/claude-switchchmod +x ~/.local/bin/claude-switch
# First run creates ~/.claude-backends.env — edit with your endpointsnano ~/.claude-backends.env
# Switch backendssource claude-switch local # Self-hosted Devstralsource claude-switch cloud # Back to Anthropicsource claude-switch ollama # Local Ollamasource claude-switch status # Show current
# Then run Claude as normal — no --model flag neededclaude --dangerously-skip-permissions --teammate-mode autoRecommended Shell Aliases
Section titled “Recommended Shell Aliases”These three aliases are the daily-driver — switch backends and launch Claude Code in one step. Copy them verbatim into ~/.zshrc (or ~/.bashrc):
# ── Claude Code launchers ───────────────────────────────────────────# Plain launcher — uses whichever backend env vars are currently exported.alias cc='claude --dangerously-skip-permissions --teammate-mode auto'
# Switch to self-hosted (Mistral Medium via LiteLLM @ ai.digitalfacility.io)# AND launch Claude in one step.alias cc-local='source claude-switch local && claude --dangerously-skip-permissions --teammate-mode auto'
# Switch to Anthropic cloud (Opus/Sonnet, paid) AND launch Claude.alias cc-cloud='source claude-switch cloud && claude --dangerously-skip-permissions --teammate-mode auto'After source ~/.zshrc:
| Command | What it does | When to use |
|---|---|---|
cc-local | Sets self-hosted env vars and launches Claude Code | Default daily-driver — free, unlimited tokens, same tooling |
cc-cloud | Sets Anthropic env vars and launches Claude Code | Hard reasoning tasks (architecture, security, ambiguous specs) |
cc-vibe | Routes Claude Code through the local LiteLLM proxy to the official Mistral API | Anyone without IrregularChat backend access — pay-as-you-go from Mistral |
cc | Launches Claude with whichever backend was last selected | Resuming after a source claude-switch in the same shell |
source claude-switch status | Prints current backend (no launch) | Sanity-check what you’re paying for before running heavy work |
The cc-vibe alias is defined as:
alias cc-vibe='source claude-switch vibe && claude --dangerously-skip-permissions --teammate-mode auto'…with a corresponding switch_vibe() in claude-switch that points ANTHROPIC_BASE_URL at http://localhost:4000 and uses LITELLM_MASTER_KEY (the proxy’s auth) as ANTHROPIC_AUTH_TOKEN. The local LiteLLM proxy translates Anthropic Messages format → Mistral chat-completions format because Claude Code cannot speak to api.mistral.ai directly. See the full step-by-step on the Mistral Vibe page.
Switching Mid-Session
Section titled “Switching Mid-Session”You don’t have to commit at launch — you can flip backends inside an existing shell:
# Started with cc-local, hit a hard problem^C # Exit the local Claude sessionsource claude-switch cloud # Flip env to cloudclaude # Or just: cc# ... solve the hard thing ...^Csource claude-switch local # Flip backcc # Continue routine work, free againThe cloud-escalator skill (see below) automates this for a single sub-task — call out to cloud Opus for one prompt, return to local, without exiting your session.
Manual Environment Variables
Section titled “Manual Environment Variables”If you prefer manual setup over the switcher script:
# Point Claude Code at your LiteLLM gatewayexport ANTHROPIC_BASE_URL="https://your-litellm-or-openwebui-server/api"export ANTHROPIC_AUTH_TOKEN="your-litellm-api-key"export ANTHROPIC_CUSTOM_MODEL_OPTION="mistral-medium"export ANTHROPIC_MODEL="mistral-medium"export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1Then run:
claude # ANTHROPIC_MODEL tells Claude which model to useOr Configure in settings.json
Section titled “Or Configure in settings.json”Add to ~/.claude/settings.local.json for persistent config:
{ "env": { "ANTHROPIC_BASE_URL": "https://your-server/api", "ANTHROPIC_AUTH_TOKEN": "your-api-key", "ANTHROPIC_CUSTOM_MODEL_OPTION": "mistral-medium", "ANTHROPIC_CUSTOM_MODEL_OPTION_NAME": "Mistral Medium 3.5 (Self-Hosted)", "ANTHROPIC_MODEL": "mistral-medium", "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1" }}LiteLLM Security Advisory
LiteLLM PyPI versions 1.82.7 and 1.82.8 were compromised with credential-stealing malware. Verify your version: pip show litellm | grep Version. Upgrade to 1.82.9+ if affected.
LiteLLM Config
Section titled “LiteLLM Config”Your LiteLLM config.yaml needs the model accessible:
model_list: - model_name: mistral-medium litellm_params: model: hosted_vllm/mistralai/Mistral-Medium-3.5-128B api_base: http://host.docker.internal:8000/v1 api_key: none temperature: 0.7 timeout: 600LiteLLM automatically handles the Anthropic-to-OpenAI translation.
Setup with Ollama
Section titled “Setup with Ollama”Simplest setup — Ollama v0.14+ exposes the Anthropic Messages API natively.
# Pull a modelollama pull devstral # or any supported model
# Set env varsexport ANTHROPIC_BASE_URL="http://localhost:11434"export ANTHROPIC_AUTH_TOKEN="ollama"export ANTHROPIC_CUSTOM_MODEL_OPTION="devstral"export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1
# Run Claude Code with the local modelclaude --model devstralSetup with vLLM Direct
Section titled “Setup with vLLM Direct”If you’re running vLLM with the Anthropic-compatible endpoint:
# vLLM must be launched with these flagsvllm serve mistralai/Devstral-2-123B-Instruct-2512 \ --served-model-name devstral \ --tool-call-parser mistral \ --enable-auto-tool-choice \ --tensor-parallel-size 2 \ --quantization fp8
# Set env varsexport ANTHROPIC_BASE_URL="http://your-vllm-server:8000"export ANTHROPIC_AUTH_TOKEN="vllm"export ANTHROPIC_CUSTOM_MODEL_OPTION="devstral"export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1
claude --model devstralVerified Test Results
Section titled “Verified Test Results”Tested 2026-04-21 with Claude Code v2.1.98 against Devstral 123B (FP8) on 2x NVIDIA B200 GPUs via LiteLLM gateway:
| Test | Result |
|---|---|
| Simple prompt (“Say OK”) | Pass — correct response |
| File reading (Read tool) | Pass — read and summarized a TOML file |
Headless mode (-p flag) | Pass — non-interactive mode works |
--dangerously-skip-permissions | Pass |
Without --model flag (ANTHROPIC_MODEL) | Pass |
| Code generation | Pass — 21.6s (correct output) |
| Code review (3 bugs) | Pass — 19.5s (3 findings with line numbers) |
| Reasoning (mutex vs semaphore) | Pass — 10.0s (correct, 3 bullet points) |
| Error recovery (missing file) | Pass — 8.8s |
Performance vs Other Tools (Same Devstral 123B Model)
Section titled “Performance vs Other Tools (Same Devstral 123B Model)”| Test | Claude Self-Hosted | Vibe | OpenCode |
|---|---|---|---|
| Code generation | 21.6s | 7.4s | 6.1s |
| Reasoning | 10.0s | 3.6s | 2.6s |
| Error recovery | 8.8s | 2.5s | 2.7s |
Claude Code’s harness adds 2-3x overhead vs Vibe/OpenCode on the same model (system prompt, tool registration, plugin loading, CLAUDE.md parsing). This means:
- For interactive sessions: Self-hosted Claude is great — same experience as cloud, unlimited tokens
- For headless batch dispatch: Vibe/OpenCode are 2-4x faster per invocation
- Self-hosted Claude’s value is tooling, not speed — you get skills, MCP, plugins, CLAUDE.md rules for free
What We Tested
Section titled “What We Tested”# Test 1: Basic promptANTHROPIC_BASE_URL="https://ai.digitalfacility.io/api" \ANTHROPIC_AUTH_TOKEN="sk-your-key" \ANTHROPIC_CUSTOM_MODEL_OPTION="mistral-medium" \CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1 \claude -p "Say SELFHOSTED_CLAUDE_TEST_OK" --model devstral-123b --output-format text
# Result: "SELFHOSTED_CLAUDE_TEST_OK"
# Test 2: Tool use (file reading)claude -p "Read vibe/agents/infra.toml and summarize it" --model devstral-123b --output-format text
# Result: Correctly read and summarized the TOML fileEnvironment Variables Reference
Section titled “Environment Variables Reference”| Variable | Required | Description |
|---|---|---|
ANTHROPIC_BASE_URL | Yes | Your backend URL (LiteLLM, Ollama, vLLM) |
ANTHROPIC_AUTH_TOKEN | Yes | API key or dummy token |
ANTHROPIC_CUSTOM_MODEL_OPTION | Yes | Model name (must match backend’s served name) |
ANTHROPIC_MODEL | Yes | Default model — without this, Claude defaults to sonnet/opus which your backend won’t have |
ANTHROPIC_CUSTOM_MODEL_OPTION_NAME | No | Display name in model picker |
CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS | Recommended | Suppresses beta headers that may cause 400/403 |
Limitations
Section titled “Limitations”- Officially unsupported — Anthropic may change the behavior at any time
- No prompt caching — Anthropic’s prompt caching features won’t work with third-party backends
- MCP tool search disabled — When using a non-first-party host, MCP tool search is disabled by default
- Quality depends on model — Devstral (72.2% SWE-bench) vs Claude Opus (80.8%). Complex multi-step reasoning will be weaker.
- Tool-call fidelity varies — Some models may not handle the Anthropic tool_use/tool_result content blocks perfectly. Devstral with
--tool-call-parser mistralis confirmed working. - Context window limits — Claude Code expects 64K+ tokens. Verify your model supports this.
Switching Between Cloud and Self-Hosted
Section titled “Switching Between Cloud and Self-Hosted”With the Switcher Script (Recommended)
Section titled “With the Switcher Script (Recommended)”source claude-switch local # Self-hosted Devstral (free)claude # Uses Devstral automatically
source claude-switch cloud # Back to Anthropicclaude # Uses Opus/Sonnet
source claude-switch status # Show current backendWith Shell Aliases
Section titled “With Shell Aliases”# These combine switching + launching with preferred flagscc-local # Switch to self-hosted + launch Claudecc-cloud # Switch to Anthropic + launch Claudecc # Launch with whichever backend is currently activeSmart Escalation
Section titled “Smart Escalation”When running self-hosted, most tasks work fine on Mistral Medium. For hard tasks that need Opus-quality reasoning (architecture, security assessment), escalate mid-session:
# Start self-hosted for routine workcc-local
# ... working on routine tasks (free, unlimited) ...
# Hit a hard problem? Switch to cloud for just this sessioncc-cloud
# ... solve the hard problem with Opus ...
# Switch back to self-hosted for the restcc-localTwo Complementary Skills: vibe-orchestrator and cloud-escalator
Section titled “Two Complementary Skills: vibe-orchestrator and cloud-escalator”The shell aliases above let you switch at session granularity. Two installable skills let you switch at sub-task granularity — dispatch one prompt to the other backend without exiting your current session. They are mirror images:
| Skill | You’re running… | Dispatch sub-task to… | Goal |
|---|---|---|---|
vibe-orchestrator | cc-cloud (Anthropic, paid) | Local Vibe or OpenCode on Mistral Medium | Save subscription tokens — keep Claude for judgment, send grunt work to free local model |
cloud-escalator | cc-local (Mistral Medium, free) | One-shot cloud Claude (Anthropic Opus) | Stay on free local — only pay cloud for hard reasoning calls |
Both pursue the same principle: cheapest tool that does the job well. They differ only in starting point.
vibe-orchestrator — cloud Claude offloads grunt work to local
Section titled “vibe-orchestrator — cloud Claude offloads grunt work to local”You’re paying Anthropic per token. Most of what Claude Code does (file reads, mechanical edits, test writing, documentation) doesn’t actually require Opus-quality reasoning. The vibe-orchestrator skill dispatches that mechanical work to a local Mistral Medium via Vibe or OpenCode CLI, then Claude synthesizes the results.
Typical session: Claude plans (Opus, 5K tokens) → 3 Vibe agents implement in parallel (free, unlimited) → Claude reviews and commits (Opus, 5K tokens). Result: ~90% reduction in cloud token usage on multi-file features.
Install: clone git.juntogroups.org/public-repos/vibe-coding, copy orchestrator/vibe-orchestrator-skill.md to ~/.claude/skills/vibe-orchestrator/SKILL.md. Trigger by saying “use vibe-orchestrator” or just letting the skill description auto-match your task.
cloud-escalator — local Claude pulls in cloud for hard sub-tasks
Section titled “cloud-escalator — local Claude pulls in cloud for hard sub-tasks”You’re on cc-local, getting unlimited free tokens. But you’ve hit a genuinely hard sub-question — an architecture choice with subtle tradeoffs, a security finding whose exploitability isn’t clear, an ambiguous spec. Mistral Medium is good but not Opus. The cloud-escalator skill dispatches that one prompt to cloud Claude via claude-switch cloud + claude -p in a subshell (so the env vars don’t leak back), captures the answer, and returns control to your local session.
Typical session: 95% of work stays on cc-local for free, with 2-3 cloud escalations for the genuinely hard calls. The hard calls cost cents each instead of dollars per session.
Install: clone git.juntogroups.org/public-repos/vibe-coding, copy the skill into ~/.claude/skills/cloud-escalator/SKILL.md. Requires claude-switch on PATH and ~/.claude-backends.env configured for both backends.
Quick decision matrix
Section titled “Quick decision matrix”| Your situation | Start with | Use which skill? |
|---|---|---|
| Building a multi-file feature, paying for Claude subscription | cc-cloud | vibe-orchestrator |
| Reading/editing a lot, occasional architecture call needed | cc-local | cloud-escalator |
| Code review of someone else’s PR | cc-local | None — Mistral handles checklist review fine |
| Greenfield architecture, every decision matters | cc-cloud | None — keep everything on Opus |
| Bulk migration / refactor with mechanical changes | cc-local | vibe-orchestrator from inside (dispatch to OpenCode for LSP-aware edits) |
| You don’t know yet | cc-local | Start free, escalate if needed |
Related Resources
Section titled “Related Resources”- Claude Code - Full Claude Code guide (cloud)
- Mistral Vibe - Alternative open-source CLI for self-hosted models
- OpenCode - Another alternative with LSP integration
- Agent Comparison - Benchmark all tools head-to-head
- Claude Code Funding - Pricing for cloud subscriptions
External Links
Section titled “External Links”- Claude Code Environment Variables - Official env var reference
- Claude Code LLM Gateway - Official gateway requirements
- Claude Code Model Configuration - Model selection docs
- Ollama Claude Code Integration - Official Ollama docs
- LM Studio Claude Code Integration - LM Studio setup