Claude Code with Self-Hosted Models

Claude Code CLI can be pointed at any backend that implements the Anthropic Messages API — including self-hosted models via vLLM, Ollama, LM Studio, or LiteLLM. This means you can use Claude Code’s superior tooling (skills, team spawners, MCP, plugins) with self-hosted open-weights models like Devstral, swapping the per-token cloud bill for a flat hardware/electricity cost.

The fastest way to try self-hosted Claude Code, no separate gateway needed. Requires Claude Code already installed (npm i -g @anthropic-ai/claude-code or official installer) and ~30GB free disk for the model:

brew install ollama && ollama serve &                    # 1. install + start (Linux: see ollama.com/download)
ollama pull devstral                                     # 2. pull a tool-calling model (~22GB)
export ANTHROPIC_BASE_URL="http://localhost:11434" \
       ANTHROPIC_AUTH_TOKEN="ollama" \
       ANTHROPIC_MODEL="devstral" \
       ANTHROPIC_CUSTOM_MODEL_OPTION="devstral" \
       CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1           # 3. point Claude Code at Ollama
claude -p "say SELFHOSTED_OK" --output-format text       # 4. smoke test → SELFHOSTED_OK

If you get SELFHOSTED_OK, you’re done — run claude in any project to start an interactive session. The rest of this page covers LiteLLM gateways (for teams), vLLM (for serious GPUs), shell aliases for switching backends, and orchestration patterns.

How It Works

Four environment variables redirect Claude Code to your backend:

export ANTHROPIC_BASE_URL="http://your-server:port"
export ANTHROPIC_AUTH_TOKEN="your-token"
export ANTHROPIC_CUSTOM_MODEL_OPTION="model-name"
export ANTHROPIC_MODEL="model-name"

The backend must implement the Anthropic Messages API (/v1/messages), not the OpenAI API (/v1/chat/completions). Several tools support this natively:

Backend	Anthropic API Support	Notes
LiteLLM	Yes (built-in)	Translates Anthropic requests to any backend
Ollama (v0.14+)	Yes (native)	Direct Anthropic endpoint
LM Studio (v0.4.1+)	Yes (native)	Direct Anthropic endpoint
vLLM	Yes (native)	`/v1/messages` endpoint
Open-WebUI	Via LiteLLM	Routes through LiteLLM which speaks Anthropic

Why This Matters

Feature	Claude Code (Anthropic Cloud)	Claude Code (Self-Hosted)
Tool use (Read, Edit, Bash, Grep)	Yes	Yes (verified)
Skills (/team-review, /commit, etc.)	Yes	Yes
MCP servers	Yes	Yes (but search disabled by default)
Plugins (Superpowers, etc.)	Yes	Yes
CLAUDE.md rules	Yes	Yes
Context window	1M tokens	Model-dependent (128K for Devstral)
Cost	$20-200/mo	Free (self-hosted)
Data stays on-network	No	Yes
SWE-bench quality	80.8% (Opus)	~72% (Devstral 2)

Setup with LiteLLM Gateway

Best for teams already running LiteLLM (e.g., behind Open-WebUI). LiteLLM translates Anthropic-format requests to your vLLM/Ollama backend.

Quick Setup (Switcher Script)

The easiest way is the backend switcher script from the Vibe Coding repo (canonical) — also reachable at the legacy alias git.irregularchat.com/public-repos/vibe-coding:

# Install the switcher
cp claude-selfhosted/claude-switch.sh ~/.local/bin/claude-switch
chmod +x ~/.local/bin/claude-switch

# First run creates ~/.claude-backends.env — edit with your endpoints
nano ~/.claude-backends.env

# Switch backends
source claude-switch local       # Self-hosted Devstral
source claude-switch cloud       # Back to Anthropic
source claude-switch ollama      # Local Ollama
source claude-switch status      # Show current

# Then run Claude as normal — no --model flag needed
claude --dangerously-skip-permissions --teammate-mode auto

Recommended Shell Aliases

These four aliases are the daily-driver — switch backends and launch Claude Code in one step. Copy the whole block verbatim into ~/.zshrc (or ~/.bashrc):

# ── Claude Code launchers ───────────────────────────────────────────
# Plain launcher — uses whichever backend env vars are currently exported.
alias cc='claude --dangerously-skip-permissions --teammate-mode auto'

# Self-hosted gateway (your own Open-WebUI / LiteLLM in front of vLLM,
# typically running Devstral Small or Devstral 123B).
alias cc-local='source claude-switch local && claude --dangerously-skip-permissions --teammate-mode auto'

# Official Mistral API via local LiteLLM proxy on :4000
# (proxy translates Anthropic Messages -> Mistral chat-completions).
# Full proxy setup: see Mistral Vibe page, "cc-vibe" section.
alias cc-vibe='source claude-switch vibe && claude --dangerously-skip-permissions --teammate-mode auto'

# Anthropic cloud (Opus/Sonnet, paid subscription).
alias cc-cloud='source claude-switch cloud && claude --dangerously-skip-permissions --teammate-mode auto'

After source ~/.zshrc:

Command	Backend	Cost model	When to use
`cc-local`	Your self-hosted gateway (Devstral / Qwen Coder / etc.)	Flat — your hardware + electricity	Default daily-driver if you have a GPU server; no per-token metering
`cc-vibe`	Official Mistral API via local LiteLLM proxy	Pay-per-token via console.mistral.ai	No GPU? Get Mistral Medium quality without self-hosting. Pair with Le Chat Pro for web UI.
`cc-cloud`	Anthropic Opus/Sonnet	Paid subscription ($20–200/mo)	Hard reasoning (architecture, security, ambiguous specs)
`cc`	Whichever backend was last selected	—	Resuming after a `source claude-switch` in the same shell
`source claude-switch status`	—	—	Sanity-check what you’re paying for before running heavy work

How cc-vibe works under the hood: switch_vibe() in claude-switch points ANTHROPIC_BASE_URL at http://localhost:4000 and uses LITELLM_MASTER_KEY (the proxy’s auth) as ANTHROPIC_AUTH_TOKEN. The local LiteLLM proxy translates Anthropic Messages format → Mistral chat-completions format because Claude Code cannot speak to api.mistral.ai directly. Full proxy + key setup: Mistral Vibe → cc-vibe.

Switching Mid-Session

You don’t have to commit at launch — you can flip backends inside an existing shell:

# Started with cc-local, hit a hard problem
^C                                # Exit the local Claude session
source claude-switch cloud        # Flip env to cloud
claude                            # Or just: cc
# ... solve the hard thing ...
^C
source claude-switch local        # Flip back
cc                                # Continue routine work, free again

The cloud-escalator skill (see below) automates this for a single sub-task — call out to cloud Opus for one prompt, return to local, without exiting your session.

Manual Environment Variables

If you prefer manual setup over the switcher script:

# Point Claude Code at your LiteLLM gateway
export ANTHROPIC_BASE_URL="https://your-litellm-or-openwebui-server/api"
export ANTHROPIC_AUTH_TOKEN="your-litellm-api-key"
export ANTHROPIC_CUSTOM_MODEL_OPTION="mistral-medium"
export ANTHROPIC_MODEL="mistral-medium"
export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1

Then run:

claude    # ANTHROPIC_MODEL tells Claude which model to use

Or Configure in settings.json

Add to ~/.claude/settings.local.json for persistent config:

{
  "env": {
    "ANTHROPIC_BASE_URL": "https://your-server/api",
    "ANTHROPIC_AUTH_TOKEN": "your-api-key",
    "ANTHROPIC_CUSTOM_MODEL_OPTION": "mistral-medium",
    "ANTHROPIC_CUSTOM_MODEL_OPTION_NAME": "Mistral Medium 3.5 (Self-Hosted)",
    "ANTHROPIC_MODEL": "mistral-medium",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}

LiteLLM Security Advisory

LiteLLM PyPI versions 1.82.7 and 1.82.8 were compromised with credential-stealing malware. Verify your version: pip show litellm | grep Version. Upgrade to 1.82.9+ if affected.

LiteLLM Config

Your LiteLLM config.yaml needs the model accessible:

model_list:
  # Open-weights option — Apache 2.0, fits on a single RTX 4090 (24GB).
  - model_name: devstral-small
    litellm_params:
      model: hosted_vllm/mistralai/Devstral-Small-2-Instruct-2512
      api_base: http://host.docker.internal:8000/v1
      api_key: none
      temperature: 0.2
      timeout: 600

  # Higher-quality open-weights option — Mistral Research license,
  # needs 2× H100/A100 class GPUs and FP8 quantization.
  - model_name: devstral-123b
    litellm_params:
      model: hosted_vllm/mistralai/Devstral-2-123B-Instruct-2512
      api_base: http://host.docker.internal:8000/v1
      api_key: none
      temperature: 0.2
      timeout: 600

LiteLLM automatically handles the Anthropic-to-OpenAI translation.

Setup with Ollama

Simplest setup — Ollama v0.14+ exposes the Anthropic Messages API natively.

# Pull a model
ollama pull devstral       # or any supported model

# Set env vars
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_CUSTOM_MODEL_OPTION="devstral"
export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1

# Run Claude Code with the local model
claude --model devstral

Setup with vLLM Direct

If you’re running vLLM with the Anthropic-compatible endpoint:

# vLLM must be launched with these flags
vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
  --served-model-name devstral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 2 \
  --quantization fp8

# Set env vars
export ANTHROPIC_BASE_URL="http://your-vllm-server:8000"
export ANTHROPIC_AUTH_TOKEN="vllm"
export ANTHROPIC_CUSTOM_MODEL_OPTION="devstral"
export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1

claude --model devstral

Verified Test Results

Tested 2026-04-21 with Claude Code v2.1.98 against Devstral 123B (FP8) on 2x NVIDIA B200 GPUs via LiteLLM gateway:

Test	Result
Simple prompt (“Say OK”)	Pass — correct response
File reading (Read tool)	Pass — read and summarized a TOML file
Headless mode (`-p` flag)	Pass — non-interactive mode works
`--dangerously-skip-permissions`	Pass
Without `--model` flag (ANTHROPIC_MODEL)	Pass
Code generation	Pass — 21.6s (correct output)
Code review (3 bugs)	Pass — 19.5s (3 findings with line numbers)
Reasoning (mutex vs semaphore)	Pass — 10.0s (correct, 3 bullet points)
Error recovery (missing file)	Pass — 8.8s

Performance vs Other Tools (Same Devstral 123B Model)

Test	Claude Self-Hosted	Vibe	OpenCode
Code generation	21.6s	7.4s	6.1s
Reasoning	10.0s	3.6s	2.6s
Error recovery	8.8s	2.5s	2.7s

Claude Code’s harness adds 2-3x overhead vs Vibe/OpenCode on the same model (system prompt, tool registration, plugin loading, CLAUDE.md parsing). This means:

For interactive sessions: Self-hosted Claude is great — same experience as cloud, unlimited tokens
For headless batch dispatch: Vibe/OpenCode are 2-4x faster per invocation
Self-hosted Claude’s value is tooling, not speed — you get skills, MCP, plugins, CLAUDE.md rules for free

What We Tested

# Test 1: Basic prompt
ANTHROPIC_BASE_URL="https://your-openwebui-instance.example.com/api" \
ANTHROPIC_AUTH_TOKEN="sk-your-key" \
ANTHROPIC_CUSTOM_MODEL_OPTION="mistral-medium" \
CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1 \
claude -p "Say SELFHOSTED_CLAUDE_TEST_OK" --model devstral-123b --output-format text

# Result: "SELFHOSTED_CLAUDE_TEST_OK"

# Test 2: Tool use (file reading)
claude -p "Read vibe/agents/infra.toml and summarize it" --model devstral-123b --output-format text

# Result: Correctly read and summarized the TOML file

Environment Variables Reference

Variable	Required	Description
`ANTHROPIC_BASE_URL`	Yes	Your backend URL (LiteLLM, Ollama, vLLM)
`ANTHROPIC_AUTH_TOKEN`	Yes	API key or dummy token
`ANTHROPIC_CUSTOM_MODEL_OPTION`	Yes	Model name (must match backend’s served name)
`ANTHROPIC_MODEL`	Yes	Default model — without this, Claude defaults to sonnet/opus which your backend won’t have
`ANTHROPIC_CUSTOM_MODEL_OPTION_NAME`	No	Display name in model picker
`CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS`	Recommended	Suppresses beta headers that may cause 400/403

Limitations

Officially unsupported — Anthropic may change the behavior at any time
No prompt caching — Anthropic’s prompt caching features won’t work with third-party backends
MCP tool search disabled — When using a non-first-party host, MCP tool search is disabled by default
Quality depends on model — Devstral (72.2% SWE-bench) vs Claude Opus (80.8%). Complex multi-step reasoning will be weaker.
Tool-call fidelity varies — Some models may not handle the Anthropic tool_use/tool_result content blocks perfectly. Devstral with --tool-call-parser mistral is confirmed working.
Context window limits — Claude Code expects 64K+ tokens. Verify your model supports this.

Switching Between Cloud and Self-Hosted

With the Switcher Script (Recommended)

source claude-switch local       # Self-hosted Devstral (flat cost)
claude                            # Uses Devstral automatically

source claude-switch cloud       # Back to Anthropic
claude                            # Uses Opus/Sonnet

source claude-switch status      # Show current backend

With Shell Aliases

# These combine switching + launching with preferred flags
cc-local    # Switch to self-hosted + launch Claude
cc-cloud    # Switch to Anthropic + launch Claude
cc          # Launch with whichever backend is currently active

Smart Escalation

When running self-hosted, most tasks work fine on Mistral Medium. For hard tasks that need Opus-quality reasoning (architecture, security assessment), escalate mid-session:

# Start self-hosted for routine work
cc-local

# ... working on routine tasks (no per-token bill) ...

# Hit a hard problem? Switch to cloud for just this session
cc-cloud

# ... solve the hard problem with Opus ...

# Switch back to self-hosted for the rest
cc-local

Two Complementary Skills: vibe-orchestrator and cloud-escalator

The shell aliases above let you switch at session granularity. Two installable skills let you switch at sub-task granularity — dispatch one prompt to the other backend without exiting your current session. They are mirror images:

Skill	You’re running…	Dispatch sub-task to…	Goal
`vibe-orchestrator`	`cc-cloud` (Anthropic, paid)	Local Vibe or OpenCode on Mistral Medium	Save subscription tokens — keep Claude for judgment, send grunt work to free local model
`cloud-escalator`	`cc-local` (Mistral Medium, free)	One-shot cloud Claude (Anthropic Opus)	Stay on free local — only pay cloud for hard reasoning calls

Both pursue the same principle: cheapest tool that does the job well. They differ only in starting point.

vibe-orchestrator — cloud Claude offloads grunt work to local

You’re paying Anthropic per token. Most of what Claude Code does (file reads, mechanical edits, test writing, documentation) doesn’t actually require Opus-quality reasoning. The vibe-orchestrator skill dispatches that mechanical work to a local Mistral Medium via Vibe or OpenCode CLI, then Claude synthesizes the results.

Typical session: Claude plans (Opus, 5K tokens) → 3 Vibe agents implement in parallel (free, unlimited) → Claude reviews and commits (Opus, 5K tokens). Result: ~90% reduction in cloud token usage on multi-file features.

Install: clone git.juntogroups.org/public-repos/vibe-coding, copy orchestrator/vibe-orchestrator-skill.md to ~/.claude/skills/vibe-orchestrator/SKILL.md. Trigger by saying “use vibe-orchestrator” or just letting the skill description auto-match your task.

Before dispatching parallel Vibe jobs, read Vibe Headless Mode Gotchas — vibe -p has four non-obvious behaviors (tool-approval default, output buffering, retry budget, per-key rate limits) that have collectively burned days of community debugging time. The page includes a safe-dispatch template to copy.

cloud-escalator — local Claude pulls in cloud for hard sub-tasks

You’re on cc-local, getting unlimited free tokens. But you’ve hit a genuinely hard sub-question — an architecture choice with subtle tradeoffs, a security finding whose exploitability isn’t clear, an ambiguous spec. Mistral Medium is good but not Opus. The cloud-escalator skill dispatches that one prompt to cloud Claude via claude-switch cloud + claude -p in a subshell (so the env vars don’t leak back), captures the answer, and returns control to your local session.

Typical session: 95% of work stays on cc-local for free, with 2-3 cloud escalations for the genuinely hard calls. The hard calls cost cents each instead of dollars per session.

Install: clone git.juntogroups.org/public-repos/vibe-coding, copy the skill into ~/.claude/skills/cloud-escalator/SKILL.md. Requires claude-switch on PATH and ~/.claude-backends.env configured for both backends.

Quick decision matrix

Your situation	Start with	Use which skill?
Building a multi-file feature, paying for Claude subscription	`cc-cloud`	`vibe-orchestrator`
Reading/editing a lot, occasional architecture call needed	`cc-local`	`cloud-escalator`
Code review of someone else’s PR	`cc-local`	None — Mistral handles checklist review fine
Greenfield architecture, every decision matters	`cc-cloud`	None — keep everything on Opus
Bulk migration / refactor with mechanical changes	`cc-local`	`vibe-orchestrator` from inside (dispatch to OpenCode for LSP-aware edits)
You don’t know yet	`cc-local`	Start free, escalate if needed

Alternative — Don’t Want a CLI? Le Chat at $14.99/mo

If your only reason for chasing a self-hosted setup is the Claude Code subscription price tag ($20–200/mo), the cheapest path may not be self-hosting at all — it may be chat.mistral.ai (“Le Chat”) Pro at $14.99/month, which is under both Claude Pro and ChatGPT Plus. You get a web UI to Mistral Large, Codestral, and Pixtral with document upload, web search, image generation, and Canvas (in-browser code editing).

	Claude Code (self-hosted)	Le Chat Pro (web, $14.99/mo)
Cost	$0 software + your compute / hardware	$14.99/mo flat
Hardware needed	GPU server, or LiteLLM/Ollama on a beefy Mac	Just a browser
Touches your filesystem	Yes (full agentic)	No — upload/download only
Runs shell commands	Yes	No
Skills, MCP, subagents	Yes	No (web-only)
Data stays on-network	Yes (with local model)	No — Mistral cloud
Best for	Multi-file refactors, automation, repo-aware work	Q&A, brainstorm, one-off code snippets

Rule of thumb: if your workflow is “ask the model, copy code back into the editor,” Le Chat Pro is the better deal — same Mistral models, none of the setup. If you need the model to act on your codebase — read files, edit in place, run commands, spawn subagents — stick with self-hosted Claude Code (or use Mistral Vibe, which is also a terminal agent and runs against the same backends).

Many teams use both: Le Chat Pro for chat-style work, a CLI agent (Claude Code or Vibe) for agentic edits.

Claude Code - Full Claude Code guide (cloud)
Mistral Vibe - Alternative open-source CLI for self-hosted models
OpenCode - Another alternative with LSP integration
Agent Comparison - Benchmark all tools head-to-head
Claude Code Funding - Pricing for cloud subscriptions

External Links

Claude Code Environment Variables - Official env var reference
Claude Code LLM Gateway - Official gateway requirements
Claude Code Model Configuration - Model selection docs
Ollama Claude Code Integration - Official Ollama docs
LM Studio Claude Code Integration - LM Studio setup

Docker — for containerizing the inference backend (vLLM, Ollama, LiteLLM)
Pi LLM — running smaller models on Raspberry Pi hardware

Claude Code with Self-Hosted Models

Claude Code with Self-Hosted Models

How It Works

Why This Matters

Setup with LiteLLM Gateway

Quick Setup (Switcher Script)

Recommended Shell Aliases

Switching Mid-Session

Manual Environment Variables

Or Configure in settings.json

LiteLLM Config

Setup with Ollama

Setup with vLLM Direct

Verified Test Results

Performance vs Other Tools (Same Devstral 123B Model)

What We Tested

Environment Variables Reference

Limitations

Switching Between Cloud and Self-Hosted

With the Switcher Script (Recommended)

With Shell Aliases

Smart Escalation

Two Complementary Skills: vibe-orchestrator and cloud-escalator

vibe-orchestrator — cloud Claude offloads grunt work to local

cloud-escalator — local Claude pulls in cloud for hard sub-tasks

Quick decision matrix

Alternative — Don’t Want a CLI? Le Chat at $14.99/mo

Related Resources

External Links

Related