Skip to content

Claude Code with Self-Hosted Models

Claude Code CLI can be pointed at any backend that implements the Anthropic Messages API — including self-hosted models via vLLM, Ollama, LM Studio, or LiteLLM. This means you can use Claude Code’s superior tooling (skills, team spawners, MCP, plugins) with free, unlimited self-hosted models like Devstral.

Four environment variables redirect Claude Code to your backend:

Terminal window
export ANTHROPIC_BASE_URL="http://your-server:port"
export ANTHROPIC_AUTH_TOKEN="your-token"
export ANTHROPIC_CUSTOM_MODEL_OPTION="model-name"
export ANTHROPIC_MODEL="model-name"

The backend must implement the Anthropic Messages API (/v1/messages), not the OpenAI API (/v1/chat/completions). Several tools support this natively:

BackendAnthropic API SupportNotes
LiteLLMYes (built-in)Translates Anthropic requests to any backend
Ollama (v0.14+)Yes (native)Direct Anthropic endpoint
LM Studio (v0.4.1+)Yes (native)Direct Anthropic endpoint
vLLMYes (native)/v1/messages endpoint
Open-WebUIVia LiteLLMRoutes through LiteLLM which speaks Anthropic
FeatureClaude Code (Anthropic Cloud)Claude Code (Self-Hosted)
Tool use (Read, Edit, Bash, Grep)YesYes (verified)
Skills (/team-review, /commit, etc.)YesYes
MCP serversYesYes (but search disabled by default)
Plugins (Superpowers, etc.)YesYes
CLAUDE.md rulesYesYes
Context window1M tokensModel-dependent (128K for Devstral)
Cost$20-200/moFree (self-hosted)
Data stays on-networkNoYes
SWE-bench quality80.8% (Opus)~72% (Devstral 2)

Before you self-host: if your only reason to escape Anthropic’s cloud is the price tag ($20–200/mo), the cheapest path may not be self-hosting at all — it may be chat.mistral.ai (“Le Chat”) Pro at $14.99/month (commonly rounded to “$15/mo”). That’s notably under both Claude Pro and ChatGPT Plus, and you get a web UI to Mistral Large, Codestral, and Pixtral with document upload, web search, image generation, and Canvas (in-browser code editing).

Claude Code (self-hosted)Le Chat Pro (web, $15/mo)
Cost$0 software + your compute / hardware$14.99/mo flat
Hardware neededGPU server, or LiteLLM/Ollama on a beefy MacJust a browser
Touches your filesystemYes (full agentic)No — upload/download only
Runs shell commandsYesNo
Skills, MCP, subagentsYesNo (web-only)
Data stays on-networkYes (with local model)No — Mistral cloud
Best forMulti-file refactors, automation, repo-aware workQ&A, brainstorm, one-off code snippets

Rule of thumb: if your workflow is “ask the model, copy code back into the editor,” Le Chat Pro is the better deal — same Mistral models, none of the setup. If you need the model to act on your codebase — read files, edit in place, run commands, spawn subagents — keep going with self-hosted Claude Code (or use Mistral Vibe, which is also a terminal agent and runs against the same backends).

Many teams use both: Le Chat Pro for chat-style work, a CLI agent (Claude Code or Vibe) for agentic edits.

Best for teams already running LiteLLM (e.g., behind Open-WebUI). LiteLLM translates Anthropic-format requests to your vLLM/Ollama backend.

The easiest way is the backend switcher script from the Vibe Coding repo (canonical) — also reachable at the legacy alias git.irregularchat.com/public-repos/vibe-coding:

Terminal window
# Install the switcher
cp claude-selfhosted/claude-switch.sh ~/.local/bin/claude-switch
chmod +x ~/.local/bin/claude-switch
# First run creates ~/.claude-backends.env — edit with your endpoints
nano ~/.claude-backends.env
# Switch backends
source claude-switch local # Self-hosted Devstral
source claude-switch cloud # Back to Anthropic
source claude-switch ollama # Local Ollama
source claude-switch status # Show current
# Then run Claude as normal — no --model flag needed
claude --dangerously-skip-permissions --teammate-mode auto

These three aliases are the daily-driver — switch backends and launch Claude Code in one step. Copy them verbatim into ~/.zshrc (or ~/.bashrc):

Terminal window
# ── Claude Code launchers ───────────────────────────────────────────
# Plain launcher — uses whichever backend env vars are currently exported.
alias cc='claude --dangerously-skip-permissions --teammate-mode auto'
# Switch to self-hosted (Mistral Medium via LiteLLM @ ai.digitalfacility.io)
# AND launch Claude in one step.
alias cc-local='source claude-switch local && claude --dangerously-skip-permissions --teammate-mode auto'
# Switch to Anthropic cloud (Opus/Sonnet, paid) AND launch Claude.
alias cc-cloud='source claude-switch cloud && claude --dangerously-skip-permissions --teammate-mode auto'

After source ~/.zshrc:

CommandWhat it doesWhen to use
cc-localSets self-hosted env vars and launches Claude CodeDefault daily-driver — free, unlimited tokens, same tooling
cc-cloudSets Anthropic env vars and launches Claude CodeHard reasoning tasks (architecture, security, ambiguous specs)
cc-vibeRoutes Claude Code through the local LiteLLM proxy to the official Mistral APIAnyone without IrregularChat backend access — pay-as-you-go from Mistral
ccLaunches Claude with whichever backend was last selectedResuming after a source claude-switch in the same shell
source claude-switch statusPrints current backend (no launch)Sanity-check what you’re paying for before running heavy work

The cc-vibe alias is defined as:

Terminal window
alias cc-vibe='source claude-switch vibe && claude --dangerously-skip-permissions --teammate-mode auto'

…with a corresponding switch_vibe() in claude-switch that points ANTHROPIC_BASE_URL at http://localhost:4000 and uses LITELLM_MASTER_KEY (the proxy’s auth) as ANTHROPIC_AUTH_TOKEN. The local LiteLLM proxy translates Anthropic Messages format → Mistral chat-completions format because Claude Code cannot speak to api.mistral.ai directly. See the full step-by-step on the Mistral Vibe page.

You don’t have to commit at launch — you can flip backends inside an existing shell:

Terminal window
# Started with cc-local, hit a hard problem
^C # Exit the local Claude session
source claude-switch cloud # Flip env to cloud
claude # Or just: cc
# ... solve the hard thing ...
^C
source claude-switch local # Flip back
cc # Continue routine work, free again

The cloud-escalator skill (see below) automates this for a single sub-task — call out to cloud Opus for one prompt, return to local, without exiting your session.

If you prefer manual setup over the switcher script:

Terminal window
# Point Claude Code at your LiteLLM gateway
export ANTHROPIC_BASE_URL="https://your-litellm-or-openwebui-server/api"
export ANTHROPIC_AUTH_TOKEN="your-litellm-api-key"
export ANTHROPIC_CUSTOM_MODEL_OPTION="mistral-medium"
export ANTHROPIC_MODEL="mistral-medium"
export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1

Then run:

Terminal window
claude # ANTHROPIC_MODEL tells Claude which model to use

Add to ~/.claude/settings.local.json for persistent config:

{
"env": {
"ANTHROPIC_BASE_URL": "https://your-server/api",
"ANTHROPIC_AUTH_TOKEN": "your-api-key",
"ANTHROPIC_CUSTOM_MODEL_OPTION": "mistral-medium",
"ANTHROPIC_CUSTOM_MODEL_OPTION_NAME": "Mistral Medium 3.5 (Self-Hosted)",
"ANTHROPIC_MODEL": "mistral-medium",
"CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
}
}

LiteLLM Security Advisory

LiteLLM PyPI versions 1.82.7 and 1.82.8 were compromised with credential-stealing malware. Verify your version: pip show litellm | grep Version. Upgrade to 1.82.9+ if affected.

Your LiteLLM config.yaml needs the model accessible:

model_list:
- model_name: mistral-medium
litellm_params:
model: hosted_vllm/mistralai/Mistral-Medium-3.5-128B
api_base: http://host.docker.internal:8000/v1
api_key: none
temperature: 0.7
timeout: 600

LiteLLM automatically handles the Anthropic-to-OpenAI translation.

Simplest setup — Ollama v0.14+ exposes the Anthropic Messages API natively.

Terminal window
# Pull a model
ollama pull devstral # or any supported model
# Set env vars
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_CUSTOM_MODEL_OPTION="devstral"
export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1
# Run Claude Code with the local model
claude --model devstral

If you’re running vLLM with the Anthropic-compatible endpoint:

Terminal window
# vLLM must be launched with these flags
vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
--served-model-name devstral \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--tensor-parallel-size 2 \
--quantization fp8
# Set env vars
export ANTHROPIC_BASE_URL="http://your-vllm-server:8000"
export ANTHROPIC_AUTH_TOKEN="vllm"
export ANTHROPIC_CUSTOM_MODEL_OPTION="devstral"
export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1
claude --model devstral

Tested 2026-04-21 with Claude Code v2.1.98 against Devstral 123B (FP8) on 2x NVIDIA B200 GPUs via LiteLLM gateway:

TestResult
Simple prompt (“Say OK”)Pass — correct response
File reading (Read tool)Pass — read and summarized a TOML file
Headless mode (-p flag)Pass — non-interactive mode works
--dangerously-skip-permissionsPass
Without --model flag (ANTHROPIC_MODEL)Pass
Code generationPass — 21.6s (correct output)
Code review (3 bugs)Pass — 19.5s (3 findings with line numbers)
Reasoning (mutex vs semaphore)Pass — 10.0s (correct, 3 bullet points)
Error recovery (missing file)Pass — 8.8s

Performance vs Other Tools (Same Devstral 123B Model)

Section titled “Performance vs Other Tools (Same Devstral 123B Model)”
TestClaude Self-HostedVibeOpenCode
Code generation21.6s7.4s6.1s
Reasoning10.0s3.6s2.6s
Error recovery8.8s2.5s2.7s

Claude Code’s harness adds 2-3x overhead vs Vibe/OpenCode on the same model (system prompt, tool registration, plugin loading, CLAUDE.md parsing). This means:

  • For interactive sessions: Self-hosted Claude is great — same experience as cloud, unlimited tokens
  • For headless batch dispatch: Vibe/OpenCode are 2-4x faster per invocation
  • Self-hosted Claude’s value is tooling, not speed — you get skills, MCP, plugins, CLAUDE.md rules for free
Terminal window
# Test 1: Basic prompt
ANTHROPIC_BASE_URL="https://ai.digitalfacility.io/api" \
ANTHROPIC_AUTH_TOKEN="sk-your-key" \
ANTHROPIC_CUSTOM_MODEL_OPTION="mistral-medium" \
CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1 \
claude -p "Say SELFHOSTED_CLAUDE_TEST_OK" --model devstral-123b --output-format text
# Result: "SELFHOSTED_CLAUDE_TEST_OK"
# Test 2: Tool use (file reading)
claude -p "Read vibe/agents/infra.toml and summarize it" --model devstral-123b --output-format text
# Result: Correctly read and summarized the TOML file
VariableRequiredDescription
ANTHROPIC_BASE_URLYesYour backend URL (LiteLLM, Ollama, vLLM)
ANTHROPIC_AUTH_TOKENYesAPI key or dummy token
ANTHROPIC_CUSTOM_MODEL_OPTIONYesModel name (must match backend’s served name)
ANTHROPIC_MODELYesDefault model — without this, Claude defaults to sonnet/opus which your backend won’t have
ANTHROPIC_CUSTOM_MODEL_OPTION_NAMENoDisplay name in model picker
CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETASRecommendedSuppresses beta headers that may cause 400/403
  • Officially unsupported — Anthropic may change the behavior at any time
  • No prompt caching — Anthropic’s prompt caching features won’t work with third-party backends
  • MCP tool search disabled — When using a non-first-party host, MCP tool search is disabled by default
  • Quality depends on model — Devstral (72.2% SWE-bench) vs Claude Opus (80.8%). Complex multi-step reasoning will be weaker.
  • Tool-call fidelity varies — Some models may not handle the Anthropic tool_use/tool_result content blocks perfectly. Devstral with --tool-call-parser mistral is confirmed working.
  • Context window limits — Claude Code expects 64K+ tokens. Verify your model supports this.
Terminal window
source claude-switch local # Self-hosted Devstral (free)
claude # Uses Devstral automatically
source claude-switch cloud # Back to Anthropic
claude # Uses Opus/Sonnet
source claude-switch status # Show current backend
Terminal window
# These combine switching + launching with preferred flags
cc-local # Switch to self-hosted + launch Claude
cc-cloud # Switch to Anthropic + launch Claude
cc # Launch with whichever backend is currently active

When running self-hosted, most tasks work fine on Mistral Medium. For hard tasks that need Opus-quality reasoning (architecture, security assessment), escalate mid-session:

Terminal window
# Start self-hosted for routine work
cc-local
# ... working on routine tasks (free, unlimited) ...
# Hit a hard problem? Switch to cloud for just this session
cc-cloud
# ... solve the hard problem with Opus ...
# Switch back to self-hosted for the rest
cc-local

Two Complementary Skills: vibe-orchestrator and cloud-escalator

Section titled “Two Complementary Skills: vibe-orchestrator and cloud-escalator”

The shell aliases above let you switch at session granularity. Two installable skills let you switch at sub-task granularity — dispatch one prompt to the other backend without exiting your current session. They are mirror images:

SkillYou’re running…Dispatch sub-task to…Goal
vibe-orchestratorcc-cloud (Anthropic, paid)Local Vibe or OpenCode on Mistral MediumSave subscription tokens — keep Claude for judgment, send grunt work to free local model
cloud-escalatorcc-local (Mistral Medium, free)One-shot cloud Claude (Anthropic Opus)Stay on free local — only pay cloud for hard reasoning calls

Both pursue the same principle: cheapest tool that does the job well. They differ only in starting point.

vibe-orchestrator — cloud Claude offloads grunt work to local

Section titled “vibe-orchestrator — cloud Claude offloads grunt work to local”

You’re paying Anthropic per token. Most of what Claude Code does (file reads, mechanical edits, test writing, documentation) doesn’t actually require Opus-quality reasoning. The vibe-orchestrator skill dispatches that mechanical work to a local Mistral Medium via Vibe or OpenCode CLI, then Claude synthesizes the results.

Typical session: Claude plans (Opus, 5K tokens) → 3 Vibe agents implement in parallel (free, unlimited) → Claude reviews and commits (Opus, 5K tokens). Result: ~90% reduction in cloud token usage on multi-file features.

Install: clone git.juntogroups.org/public-repos/vibe-coding, copy orchestrator/vibe-orchestrator-skill.md to ~/.claude/skills/vibe-orchestrator/SKILL.md. Trigger by saying “use vibe-orchestrator” or just letting the skill description auto-match your task.

cloud-escalator — local Claude pulls in cloud for hard sub-tasks

Section titled “cloud-escalator — local Claude pulls in cloud for hard sub-tasks”

You’re on cc-local, getting unlimited free tokens. But you’ve hit a genuinely hard sub-question — an architecture choice with subtle tradeoffs, a security finding whose exploitability isn’t clear, an ambiguous spec. Mistral Medium is good but not Opus. The cloud-escalator skill dispatches that one prompt to cloud Claude via claude-switch cloud + claude -p in a subshell (so the env vars don’t leak back), captures the answer, and returns control to your local session.

Typical session: 95% of work stays on cc-local for free, with 2-3 cloud escalations for the genuinely hard calls. The hard calls cost cents each instead of dollars per session.

Install: clone git.juntogroups.org/public-repos/vibe-coding, copy the skill into ~/.claude/skills/cloud-escalator/SKILL.md. Requires claude-switch on PATH and ~/.claude-backends.env configured for both backends.

Your situationStart withUse which skill?
Building a multi-file feature, paying for Claude subscriptioncc-cloudvibe-orchestrator
Reading/editing a lot, occasional architecture call neededcc-localcloud-escalator
Code review of someone else’s PRcc-localNone — Mistral handles checklist review fine
Greenfield architecture, every decision matterscc-cloudNone — keep everything on Opus
Bulk migration / refactor with mechanical changescc-localvibe-orchestrator from inside (dispatch to OpenCode for LSP-aware edits)
You don’t know yetcc-localStart free, escalate if needed
  • Docker — for containerizing the inference backend (vLLM, Ollama, LiteLLM)
  • Pi LLM — running smaller models on Raspberry Pi hardware