
CLI Coding Agent Comparison

A practical comparison of the leading AI coding tools, backed by measured benchmark timings, including Claude Code on both Anthropic’s cloud and a self-hosted Devstral backend.

Each tool has a dedicated wiki page with full setup instructions, configuration, and usage patterns:

| Tool | Guide | What It Covers |
| --- | --- | --- |
| Claude Code | Full Guide | Plugins, skills, team spawners, CLAUDE.md, tech stacks, funding |
| Claude Code Self-Hosted | Setup Guide | Use Claude Code with Devstral/Ollama/vLLM: switcher script, shell aliases, env vars |
| Mistral Vibe | Full Guide | Install, config, MCP, skills, AGENTS.md, self-hosted vLLM, Open-WebUI, orchestration |
| OpenCode | Full Guide | LSP integration, client/server, headless JSON events, MCP OAuth, custom agents |
| OpenHands | Full Guide | Web UI, Docker sandbox, Devstral setup, autonomous batch mode, GitHub integration |
| Gemini CLI | Full Guide | Subagents, Jules extension, multi-model orchestration, GEMINI.md |

| Page | What It Covers |
| --- | --- |
| Agent Pricing | Plan pricing, token limits, self-hosted cost comparison |
| Claude Code Funding | Military/government procurement, billing, subscriptions |
| Project Rules | CLAUDE.md / AGENTS.md patterns and templates |
| Lessons Learned | Rules and incident reports by stack type |

| Resource | Link |
| --- | --- |
| Vibe Coding Repository | git.irregularchat.com/public-repos/vibe-coding (rules, skills, configs, orchestrator, switcher script) |
| IrregularChat Wiki | irregularpedia.org (full knowledge base) |

| Feature | Claude Code | Gemini CLI | Mistral Vibe | OpenCode | OpenHands |
| --- | --- | --- | --- | --- | --- |
| License | Proprietary | Apache 2.0 | Apache 2.0 | MIT | MIT |
| Interface | Terminal CLI | Terminal CLI | Terminal CLI | Terminal TUI | Web UI |
| Model support | Claude only | Gemini only | Any OpenAI-compat | Any (ai-sdk) | Any |
| Self-hosted models | No | No | Yes | Yes | Yes |
| LSP integration | No | No | No | Yes (20+ langs) | No |
| MCP servers | Yes | Yes | Yes (stdio+HTTP) | Yes (stdio+SSE+OAuth) | Yes |
| Client/server mode | No | No | No | Yes (serve+attach) | Yes (web server) |
| Headless mode | -p (print) | -p (prompt) | -p (headless) | run | CLI binary |
| Auto-approve flag | --dangerously-skip-permissions | --approval-mode yolo | (auto in -p) | --dangerously-skip-permissions | |
| JSON output | --output-format stream-json | -o stream-json | --output json\|streaming | --format json (JSONL) | |
| Cost tracking | --max-budget-usd | N/A | --max-price + input_price/output_price in config | JSONL step_finish events include cost/tokens | |
| --workdir flag | inherits CWD | inherits CWD | --workdir DIR | inherits CWD | N/A |
| Session continuity | --continue / --resume | --resume | --continue | --continue, --session | Per-session |
| Context window | 1M tokens | 2M+ tokens | Model-dependent | Model-dependent | Model-dependent |
| Context compaction | Sophisticated | Automatic | auto_compact_threshold | Auto at 75% | Memory condensation |
| Custom instructions | CLAUDE.md (hierarchical) | GEMINI.md / AGENTS.md | AGENTS.md (root only) | AGENTS.md + CLAUDE.md | Settings UI |
| Skills/agents | SKILL.md + team spawners | Extensions + subagents | SKILL.md + TOML agents | opencode.json agents | |
| Sandbox isolation | No | --sandbox option | No | No | Yes (Docker) |
| GitHub/GitLab | Via gh CLI | Via extensions | Manual | Manual | Native (issues → PRs) |
| SWE-bench Verified | 80.8% (Opus 4.6) | ~70% (Gemini 2.5 Pro) | 72.2% (Devstral 2) | Model-dependent | 46.8-61.7% |
| Cost | $20-200/mo | Free (with Google account) | Free (self-host) | Free (self-host) | Free (self-host) |
| GitHub stars | N/A | ~55k | ~5k | ~147k | ~65k |

Tested April 2026. Vibe and OpenCode hit the same Devstral 123B (FP8) on 2x NVIDIA B200 GPUs via vLLM + LiteLLM. Claude Code uses Anthropic’s cloud (Opus 4.6). Gemini CLI uses Google’s cloud (Gemini 2.5 Pro). All four tools ran the exact same prompts.

Equivalent headless flags across tools:

| Action | Claude Code | Gemini CLI | Mistral Vibe | OpenCode |
| --- | --- | --- | --- | --- |
| Headless mode | -p "prompt" | -p "prompt" | -p "prompt" | run "prompt" |
| Auto-approve | --dangerously-skip-permissions | --approval-mode yolo / -y | (auto in -p) | --dangerously-skip-permissions |
| Output format | --output-format text\|json\|stream-json | -o text\|json\|stream-json | --output text\|json | --format json (JSONL) |
| Model override | N/A (uses subscription) | -m model-name | N/A (uses config) | --model provider/model |
| Resume session | -r / --resume | -r / --resume | --continue | --continue / --session |
| Working directory | inherits CWD | inherits CWD | --workdir DIR | inherits CWD |
| Budget limit | --max-budget-usd | N/A | N/A | N/A |
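
The flag mapping above can be captured in a small dispatcher. The following is a minimal sketch, assuming the flags behave as listed in the table for the versions tested; `build_command` and `needs_chdir` are hypothetical helper names, not part of any of the tools.

```python
from __future__ import annotations


def build_command(tool: str, prompt: str, workdir: str | None = None) -> list[str]:
    """Return an argv list for a one-shot headless run of `tool`.

    Flags are copied from the comparison table; verify them against the
    version you have installed before relying on this.
    """
    if tool == "claude":
        return ["claude", "-p", prompt, "--output-format", "text"]
    if tool == "gemini":
        return ["gemini", "-p", prompt, "--approval-mode", "yolo", "-o", "text"]
    if tool == "vibe":
        argv = ["vibe", "-p", prompt, "--output", "text"]
        if workdir is not None:
            # Vibe is the only tool in the table with a --workdir flag.
            argv += ["--workdir", workdir]
        return argv
    if tool == "opencode":
        return ["opencode", "run", prompt, "--dangerously-skip-permissions"]
    raise ValueError(f"unknown tool: {tool}")


def needs_chdir(tool: str) -> bool:
    """Tools that inherit the CWD must be launched from the project directory."""
    return tool in {"claude", "gemini", "opencode"}
```

For the tools where `needs_chdir` is true, the resulting list can be passed to `subprocess.run(argv, cwd=project_dir)` so the process starts in the right directory.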

Prompt: “Write a Python function called merge_sorted that merges two sorted lists. Include type hints. Just the function.”

| Tool | Time | Output | Quality |
| --- | --- | --- | --- |
| Claude Code | 10.2s | Text + insight | Correct, uses list[int] (modern), added educational note |
| Gemini CLI | 8.1s | Text response | Correct, uses TypeVar, clean |
| Vibe | 7.0s | Text + wrote file | Correct, uses TypeVar |
| OpenCode | 5.6s | Text + wrote file | Correct, uses TypeVar |

All four produced correct two-pointer merge implementations. Claude Code added unsolicited educational insights. Both self-hosted tools (Vibe, OpenCode) wrote a file to disk and also showed the code in their text output. Claude Code and Gemini CLI responded with text only (no file created).
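
For reference, a two-pointer merge of the kind described here looks roughly like the following. This is an illustrative sketch using the `list[int]` style of type hint, not the verbatim output of any of the four tools.

```python
def merge_sorted(a: list[int], b: list[int]) -> list[int]:
    """Merge two already-sorted lists into one sorted list (two-pointer merge)."""
    merged: list[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Take the smaller head; `<=` keeps equal elements from `a` first.
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of these slices is non-empty.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged
```

Each element is visited once, so the merge runs in time linear in the combined length of the two inputs.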

Prompt: “Read vibe/install.sh and list the top 3 bugs or improvements with line numbers.”

| Tool | Time | Findings | Quality |
| --- | --- | --- | --- |
| Claude Code | 18.2s | 3 with reasoning | Best: nullglob, mkdir dry-run, grep pattern |
| Gemini CLI | 124.2s | 1 (after 19 tool calls) | Worst: couldn’t find file, SSH’d to server, fetched URL |
| Vibe | 4.3s | 3 with line numbers | Good: grep, set -e, metadata |
| OpenCode | 3.6s | 3 with line numbers | Good: mkdir, grep error handling |

Claude Code’s findings were the most insightful (caught nullglob issue). Gemini spent 2+ minutes trying to find the file — it searched the remote server via SSH instead of reading locally. Vibe and OpenCode were fast and accurate.

Gemini CLI Path Resolution

In this test, Gemini CLI failed to find a local file and attempted to SSH into a remote server to find it. This is a significant issue for local code review tasks. It may relate to CWD handling or the agent’s tendency to use tools aggressively.

Prompt: “Explain the difference between a mutex and a semaphore in exactly 3 bullet points.”

| Tool | Time | Quality |
| --- | --- | --- |
| Claude Code | 9.8s | Best: ownership, counting, purpose distinction |
| Gemini CLI | 7.0s | Good: locking vs signaling, ownership, use cases |
| Vibe | 3.8s | Good: binary vs counter, correct |
| OpenCode | 3.0s | Good: concise, correct |

Pure reasoning, no tools. All four correct. Claude Code’s answer was most technically precise (mentioned ownership semantics). Speed inversely correlated with quality — cloud models (Claude, Gemini) were slower but more detailed.

Prompt: “Read DOES_NOT_EXIST.py and summarize it.”

| Tool | Time | Response |
| --- | --- | --- |
| Claude Code | 8.1s | “File doesn’t exist. Check the path.” |
| Gemini CLI | 9.6s | Attempted to read, reported not found (verbose) |
| Vibe | 2.5s | “The file does not exist.” |
| OpenCode | 2.0s | Shows ✗ read failed, reports not found |

The self-hosted tools (Vibe, OpenCode) reported the missing file roughly 3-5x faster than the cloud tools (2.0-2.5s vs. 8.1-9.6s), with no round trip to a cloud API.

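| Metric | Claude Cloud | Claude Self-Hosted | Gemini CLI | Vibe | OpenCode |
| --- | --- | --- | --- | --- | --- |
| Cold start | 0.04s | 0.04s | 0.63s | 0.49s | 0.85s |
| Code generation | 9.0s | 21.6s | 12.5s | 7.4s | 6.1s |
| Code review | 18.2s | 19.5s | 124.2s | 4.3s | 3.6s |
| Reasoning | 8.0s | 10.0s | 8.5s | 3.6s | 2.6s |
| Error recovery | 6.8s | 8.8s | 9.7s | 2.5s | 2.7s |
| Model | Opus 4.6 | Devstral 123B | Gemini 2.5 Pro | Devstral 123B | Devstral 123B |
| Cost | Subscription | Free | Free (Google) | Free | Free |
| Quality | Best | Good+ | Good | Good | Good |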

Key takeaways:

  • Highest quality: Claude Cloud (Opus) — best reasoning, most detailed
  • Fastest execution: OpenCode and Vibe (2.5-7s on Devstral, no harness overhead)
  • Best value for interactive sessions: Claude Self-Hosted — same tooling as cloud, unlimited tokens, free
  • Best for headless dispatch: Vibe (fastest + --workdir) or OpenCode (LSP + sessions)
  • Setup guide: Claude Code Self-Hosted

Self-Hosted Model Comparison (4-Model Benchmark)


Tested April 2026 on 8x NVIDIA B200 (183GB each). All models served via vLLM 0.19.0 through LiteLLM gateway. 7 tests covering speed, quality, instruction following, security detection, complex code gen, and convention understanding.

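| Model | Parameters | SWE-bench | GPUs (FP8) | License | vLLM Parser |
| --- | --- | --- | --- | --- | --- |
| Devstral 2 123B | 123B dense | 72.2% | 2 | Mod. MIT | mistral |
| MiniMax M2.5 | 230B/10B MoE | 80.2% | 2 | Mod. MIT | minimax_m2 |
| Qwen3.6-35B-A3B | 35B/3B MoE | 73.4% | 1 | Apache 2.0 | qwen3_coder |
| Devstral Small 2 24B | 24B dense | 68.0% | 1 | Mod. MIT | mistral |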
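
| Test (latency, ms) | Devstral 123B | MiniMax M2.5 | Qwen3.6 35B | Devstral Small 24B |
| --- | --- | --- | --- | --- |
| Code generation | 1,791 | 3,830 | 2,314 | 742 |
| Code review | 6,871 | 3,855 | 3,327 | 1,987 |
| Reasoning (3 bullets) | 2,497 | 2,348 | 1,526 | 1,109 |
| Instruction following | 79 | 264 | 194 | 64 |
| Complex decorator | 2,537 | 4,580 | 2,956 | 1,261 |
| Convention quiz | 147 | 260 | 782 | 57 |
| Average | 2,320 | 2,523 | 1,850 | 870 |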
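
The Average row can be reproduced from the six per-test figures. A minimal sketch, assuming the values are milliseconds (the recommendation table further down quotes the Devstral Small figure as "870ms avg") and that the average is a plain arithmetic mean rounded to the nearest millisecond:

```python
from statistics import mean

# Per-test latencies copied from the table (assumed to be milliseconds), in
# row order: code generation, code review, reasoning, instruction following,
# complex decorator, convention quiz.
LATENCY_MS = {
    "Devstral 123B": [1791, 6871, 2497, 79, 2537, 147],
    "MiniMax M2.5": [3830, 3855, 2348, 264, 4580, 260],
    "Qwen3.6 35B": [2314, 3327, 1526, 194, 2956, 782],
    "Devstral Small 24B": [742, 1987, 1109, 64, 1261, 57],
}


def average_latency(samples: list[int]) -> int:
    """Arithmetic mean rounded to the nearest millisecond."""
    return round(mean(samples))


AVERAGES = {model: average_latency(values) for model, values in LATENCY_MS.items()}
```

Under those assumptions the computed values agree with the Average row for all four models.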

| Test | Devstral 123B | MiniMax M2.5 | Qwen3.6 35B | Devstral Small 24B |
| --- | --- | --- | --- | --- |
| Correct function + types | Pass | Pass (in thinking) | Pass (in thinking) | Pass |
| Found 3 review bugs | 3/3 | 0 (thinking) | 2 | 3/3 |
| 3 clean bullet points | 3 | 2 (verbose) | 3 (in thinking) | 3 |
| One-word answer | “Four” | Failed (13 words) | Failed (thinking) | “Four” |
| Working retry decorator | Pass | Pass (in thinking) | Partial | Pass |
| Correct convention answer | B (correct) | Wrong | C (wrong) | B (correct) |
| Quality Score | 6/6 | 1/6 | 2/6 | 6/6 |

Thinking Trace Problem: MiniMax M2.5 and Qwen3.6 are “reasoning models” that output their thinking process inline (e.g., “Here’s a thinking process: 1. Analyze User Input…”). This pollutes the response for coding dispatch — tools like Vibe and OpenCode receive the thinking traces as the actual answer. vLLM’s --reasoning-parser can strip these, but:

  • Qwen3.6: reasoning parser strips ALL content (returns empty)
  • MiniMax M2.5: minimax_m2 parser untested for this
  • Neither supports enable_thinking=false on vLLM 0.19.0

Convention Understanding (T7): Only the Devstral models correctly identified fix(auth): resolve null pointer in session validation as the correct conventional commit format. MiniMax chose wrong. Qwen3.6 chose C) Fix Auth Bug. This matters for coding agents that need to follow project conventions.

Instruction Following (T4): When asked for “one word,” both Devstral models responded “Four.” MiniMax responded with 13 words including its thinking. Qwen3.6 started with “Here’s a thinking process.” For headless dispatch where the response is parsed programmatically, this is a critical failure.
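
A dispatcher that parses responses programmatically can guard against both failure modes before trusting an answer. The sketch below is a heuristic only: the first marker is quoted from the Qwen3.6 output described above, the `<think>` marker is an assumption about other reasoning models, and both function names are hypothetical.

```python
import re

# Markers that suggest leaked reasoning. The first is quoted from the output
# described above; the <think> tag is an assumption, not an observed case.
THINKING_MARKERS = ("here's a thinking process", "<think>")


def looks_like_thinking_trace(response: str) -> bool:
    """Heuristic: True if the response starts with a known reasoning preamble."""
    # Normalize curly apostrophes so both quote styles match.
    head = response.strip().lower().replace("\u2019", "'")
    return head.startswith(THINKING_MARKERS)


def is_single_word(response: str) -> bool:
    """True if the response is exactly one word, ignoring punctuation."""
    words = re.findall(r"[A-Za-z0-9']+", response)
    return len(words) == 1
```

A caller might reject or retry any response where `looks_like_thinking_trace` is true, rather than attempting to strip the trace.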

| Use Case | Best Model | Why |
| --- | --- | --- |
| Production coding agent | Devstral 123B | 6/6 quality, clean output, proven tools |
| Fast-lane dispatch | Devstral Small 24B | 6/6 quality, 870ms avg, same parser |
| Monitor for future | MiniMax M2.5 | 80.2% SWE-bench; needs reasoning parser fix |
| Monitor for future | Qwen3.6 35B | 73.4% SWE-bench; needs thinking mode disable |
| Chat (non-coding) | Gemma 4 31B | Broken tool calling, fine for conversation |

GPU 0: Gemma 4 31B (chat only)
GPU 1: Devstral Small 24B (fast lane, 870ms avg)
GPU 2: FREE (MiniMax future / Kimi hot-swap)
GPU 3: FREE (MiniMax future / Kimi hot-swap)
GPU 4: FREE (2nd Devstral 123B replica)
GPU 5: FREE (2nd Devstral 123B replica)
GPU 6: Devstral 123B (primary coding agent)
GPU 7: Devstral 123B (primary coding agent)

Need highest benchmark scores / complex reasoning?
  → Claude Code (Opus/Sonnet, 80.8% SWE-bench)
Need massive context window (2M+ tokens)?
  → Gemini CLI (Gemini 2.5 Pro, free with Google account)
Need self-hosted / data stays on-network?
  → Is it a web-based batch workflow?
      → OpenHands (web UI, sandbox, GitHub integration)
  → Is it a CLI workflow?
      → Do you need LSP diagnostics? (TypeScript, Go, Rust)
          → OpenCode (LSP catches type errors in-loop)
      → Do you need --workdir or scripted dispatch?
          → Vibe (--workdir flag, cleaner output, faster startup)
      → Do you need session continuity?
          → OpenCode (--continue, --session, serve+attach)
      → Simple one-shot tasks?
          → Vibe (lightest, most predictable)
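
The decision tree can also be written as a function, which is convenient when a script has to choose a tool per task. This is one possible encoding: the parameter names are hypothetical, and treating the two cloud questions as the "not self-hosted" branch is an assumption about how the tree is meant to be read.

```python
def pick_tool(
    self_hosted: bool,
    *,
    huge_context: bool = False,
    web_batch: bool = False,
    needs_lsp: bool = False,
    needs_workdir: bool = False,
    needs_sessions: bool = False,
) -> str:
    """Walk the decision tree and return a tool name."""
    if not self_hosted:
        # Cloud branch: context size decides between the two hosted tools.
        return "Gemini CLI" if huge_context else "Claude Code"
    if web_batch:
        return "OpenHands"
    # CLI branch, checked in the same order as the tree.
    if needs_lsp:
        return "OpenCode"
    if needs_workdir:
        return "Vibe"
    if needs_sessions:
        return "OpenCode"
    return "Vibe"  # simple one-shot tasks
```

Because the CLI questions are checked in order, a task that needs both LSP diagnostics and `--workdir` resolves to OpenCode; reorder the checks if your priorities differ.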

| Task | Best Tool | Why |
| --- | --- | --- |
| Complex architecture / planning | Claude Code | Superior multi-step reasoning, 80.8% SWE-bench |
| Security review (judgment) | Claude Code | Better at exploitability assessment |
| Large codebase analysis | Gemini CLI | 2M+ token context window |
| Multimodal (images, screenshots) | Gemini CLI | Native multimodal support |
| TypeScript / Go / Rust editing | OpenCode | LSP catches type errors in-loop |
| Long multi-step sessions | OpenCode | Context compaction + session persistence |
| CI/CD automation | OpenCode | JSONL events with token/cost tracking |
| Quick one-shot generation | Vibe | Fastest startup, --workdir, clean text |
| Parallel batch dispatch | Vibe | Predictable, --max-turns safety |
| Docs / configs / markdown | Vibe | LSP irrelevant, lighter tool |
| Autonomous issue resolution | OpenHands | Fire-and-forget, sandbox, GitHub native |
| Untrusted / experimental code | OpenHands | Docker sandbox isolation |

The most powerful setup combines all five tools:

| Layer | Tool | Role |
| --- | --- | --- |
| Brain | Claude Code | Plans, reviews, synthesizes, makes judgment calls |
| Large Context | Gemini CLI | Analyze entire codebases, multimodal reviews |
| LSP Worker | OpenCode | TypeScript/Go/Rust implementation with type checking |
| Grunt Worker | Vibe | Bulk generation, research, docs, configs |
| Batch Agent | OpenHands | Autonomous issue resolution, sandbox experiments |

See the Claude Code + Vibe Orchestration guide for the dispatch pattern.

Ran “review vibe/install.sh for bugs” 20 times per tool on the same Devstral 123B backend. Only Vibe and OpenCode tested (self-hosted, no per-run cost).

| Metric | Vibe (20 runs) | OpenCode (20 runs) |
| --- | --- | --- |
| Success rate | 100% | 100% |
| Found 3+ issues | 100% | 100% |
| Avg time | 4,899ms | 3,964ms |
| Min time | 4,368ms | 3,589ms |
| Max time | 6,061ms | 4,402ms |
| Time spread (max - min) | 1,693ms | 813ms |

Both tools are 100% reliable over 20 runs — zero failures, zero hallucinations. OpenCode is 19% faster with tighter variance (more predictable for CI/CD).
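
The 19% figure and the spread of run times follow directly from the summary numbers. A short worked check, using only the values reported in the table (the function names are illustrative):

```python
# Figures copied from the reliability table (milliseconds).
VIBE = {"avg": 4899, "min": 4368, "max": 6061}
OPENCODE = {"avg": 3964, "min": 3589, "max": 4402}


def spread(stats: dict[str, int]) -> int:
    """Range of observed run times: slowest run minus fastest run."""
    return stats["max"] - stats["min"]


def percent_faster(slow_avg: int, fast_avg: int) -> float:
    """Time saved by the faster tool, as a percentage of the slower tool's time."""
    return (slow_avg - fast_avg) / slow_avg * 100


VIBE_SPREAD = spread(VIBE)          # 1693
OPENCODE_SPREAD = spread(OPENCODE)  # 813
SPEEDUP = percent_faster(VIBE["avg"], OPENCODE["avg"])  # about 19.1
```

Note that the spread is a max-minus-min range, not a statistical variance; computing a variance or standard deviation would need the 20 individual run times.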

Created a TypeScript file with 3 deliberate type errors. Asked each tool to find and fix them.

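| Metric | Claude Code | Gemini CLI | Vibe | OpenCode |
| --- | --- | --- | --- | --- |
| Time | 33.9s | 21.4s | 8.8s | TIMEOUT |
| Bugs fixed | 3/3 | 3/3 | 3/3 | 0/3 |
| File modified | Yes | Yes | Yes | No |
| Approach | Minimal diff | Creative (throw Error, better IDs) | Clean fix | Failed |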

Claude Code produced the cleanest fix (minimal changes). Gemini CLI was the most creative: it used .toString(36).substring(2) for better ID generation and threw an Error for missing users. Vibe was fastest at 8.8s with all 3 bugs correctly fixed. OpenCode timed out without modifying the file.

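| Tool | Current | Releases/Month | Breaking Changes (2026) | Automation Risk |
| --- | --- | --- | --- | --- |
| Claude Code | 2.1.98 | ~8-12 | Low | Low |
| Vibe | 2.7.6 | ~2-4 | Low | Low |
| Gemini CLI | 0.38.2 | ~4-6 | Medium | Medium |
| OpenCode | 1.14.20 | 60+ (multi/day) | High (-p deprecated) | High |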

For automated pipelines: pin OpenCode versions (breaks frequently). Claude Code and Vibe have the most stable headless APIs.

All three open-source tools can point at the same vLLM backend:

vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 2 \
  --quantization fp8

Vibe (~/.vibe/config.toml):

[[providers]]
name = "local"
api_base = "http://localhost:8000/v1"
api_key_env_var = "VLLM_KEY"
api_style = "openai"
backend = "generic"
[[models]]
name = "devstral-123b"
provider = "local"
alias = "devstral"
temperature = 0.2

OpenCode (opencode.json):

{
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "options": { "baseURL": "http://localhost:8000/v1" },
      "models": { "devstral-123b": { "name": "Devstral 123B" } }
    }
  },
  "model": "local/devstral-123b"
}

OpenHands (Settings > LLM > Advanced):

Custom Model: openai/devstral
Base URL: http://host.docker.internal:8000/v1
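
Before pointing the three clients at the backend, it can help to confirm that the endpoint answers and which model names it serves, since each config has to use a name the server recognizes. A minimal sketch, assuming the server implements the OpenAI-style `GET /v1/models` route; the helper names are hypothetical, and the `VLLM_KEY` variable in the usage note is the one named in the Vibe config above.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: str) -> list[str]:
    """Extract model IDs from an OpenAI-style model-list JSON response."""
    return [entry["id"] for entry in json.loads(payload).get("data", [])]


def list_models(base_url: str, api_key: str = "") -> list[str]:
    """Query the server and return the model names it is serving."""
    request = urllib.request.Request(models_url(base_url))
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_model_ids(response.read().decode("utf-8"))
```

Calling `list_models("http://localhost:8000/v1", os.environ.get("VLLM_KEY", ""))` should return the names to use in `config.toml`, `opencode.json`, and the OpenHands settings; if the expected alias is missing, the model is probably being served under a different name.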

| Setup | Monthly Cost | Tokens/Month |
| --- | --- | --- |
| Claude Code Pro | $20 | Shared with web (limited) |
| Claude Code Max | $100-200 | Higher limits (unpublished) |
| Claude Code API | Pay-per-token | $3-15/M tokens (model dependent) |
| Self-hosted Devstral (Vibe/OpenCode/OpenHands) | $0 (electricity only) | Unlimited |
| Self-hosted (cloud GPU) | $2-4/hr (H100) | Unlimited while running |
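
One way to read the last row against the subscription rows is as a break-even number of GPU hours per month. A rough sketch using only the price ranges in the table; it treats the hourly rate as the whole cost of the rental, which is a simplification.

```python
def breakeven_hours(subscription_usd: float, gpu_usd_per_hour: float) -> float:
    """Hours of rented GPU time that cost the same as one month of subscription."""
    return subscription_usd / gpu_usd_per_hour


# Bounds taken from the table: Claude Code Max at $100-200/mo, H100 at $2-4/hr.
FEWEST_HOURS = breakeven_hours(100, 4)  # cheapest plan, priciest GPU -> 25.0
MOST_HOURS = breakeven_hours(200, 2)    # priciest plan, cheapest GPU -> 100.0
```

So a rented GPU costs about as much as a Max subscription somewhere between 25 and 100 hours of runtime per month. The 123B model ran on two GPUs in these tests, so a comparable rental would likely multiply the hourly rate and shrink those hours accordingly.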

All tests were run on 2026-04-21. Scripts and raw results are published so you can replicate or extend them.

| Component | Version | Details |
| --- | --- | --- |
| Hardware | Dell XE9780 (Obelisk) | 8x NVIDIA B200, 2TB RAM |
| GPU allocation | GPUs 6-7 | Devstral 123B (FP8, tensor-parallel=2) |
| vLLM | v0.19.0 | --tool-call-parser mistral --enable-auto-tool-choice |
| LiteLLM | 1.82.6 | Gateway on port 4000, drop_params: true |
| Claude Code | 2.1.98 | Anthropic cloud (Opus 4.6) |
| Gemini CLI | 0.38.2 | Google cloud (Gemini 2.5 Pro) |
| Mistral Vibe | 2.7.6 | Self-hosted Devstral via LiteLLM |
| OpenCode | 1.14.20 | Self-hosted Devstral via LiteLLM |
| macOS | Darwin 25.3.0 | Apple Silicon (M-series) |

Test 1: Code Generation (Single Run, 4-Way)

PROMPT="Write a Python function called merge_sorted that merges two sorted lists into one sorted list. Include type hints. Just the function, no explanation."
# Claude Code
time claude -p "$PROMPT" --output-format text
# Gemini CLI
time gemini -p "$PROMPT" --approval-mode yolo -o text
# Mistral Vibe (self-hosted Devstral)
time vibe -p "$PROMPT" --workdir /tmp --max-turns 5 --output text
# OpenCode (self-hosted Devstral)
time opencode run "$PROMPT" --model litellm/devstral-123b --dangerously-skip-permissions

Test 2: Code Review

PROMPT="Read vibe/install.sh and list the top 3 bugs or improvements with line numbers. Be brief."
WORKDIR="/path/to/your/project"
# Claude Code
time claude -p "$PROMPT" --output-format text
# Gemini CLI
time gemini -p "$PROMPT" --approval-mode yolo -o text
# Mistral Vibe
time vibe -p "$PROMPT" --workdir "$WORKDIR" --max-turns 10 --output text \
--enabled-tools "read_file" --enabled-tools "grep" --enabled-tools "bash"
# OpenCode (must cd to project first — no --workdir flag)
cd "$WORKDIR" && time opencode run "$PROMPT" \
--model litellm/devstral-123b --dangerously-skip-permissions

Test 3: Reasoning

PROMPT="Explain the difference between a mutex and a semaphore in exactly 3 bullet points. No code."
# Same 4 commands as Test 1 (just change the prompt)

Test 4: Error Recovery

PROMPT="Read the file DOES_NOT_EXIST.py and summarize it."
# Same 4 commands as Test 2 (just change the prompt)

Test 5: Cold Start

time claude --version
time gemini --version
time vibe --version
time opencode --version
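
The same wall-clock measurements can be taken from a script instead of the shell's `time`, which makes it easier to record results for many runs. A minimal sketch; `time_command` is a hypothetical helper, and the command in the demo is a stand-in so the example runs on any machine.

```python
from __future__ import annotations

import subprocess
import sys
import time


def time_command(argv: list[str], cwd: str | None = None) -> tuple[float, int]:
    """Run `argv` and return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv,
        cwd=cwd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return time.perf_counter() - start, completed.returncode


if __name__ == "__main__":
    # Stand-in command; swap in e.g. ["vibe", "--version"] to reproduce
    # the cold-start measurement for an installed tool.
    seconds, code = time_command([sys.executable, "--version"])
    print(f"exit={code} elapsed={seconds:.3f}s")
```

The measured time includes process startup, any API calls, and the response, which matches how the timings in this comparison are described.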

Reliability Test (20 Runs, Vibe + OpenCode Only)

PROMPT="Read vibe/install.sh and list the top 3 bugs or improvements with line numbers. Be brief."
WORKDIR="/path/to/your/project"
# Create the output directories the redirects below write into
mkdir -p results/vibe results/opencode
# Run 20 times each, capture output to files
for i in $(seq 1 20); do
  # Vibe
  vibe -p "$PROMPT" --workdir "$WORKDIR" --max-turns 10 --output text \
    --enabled-tools "read_file" --enabled-tools "grep" --enabled-tools "bash" \
    > "results/vibe/run-$i.txt" 2>/dev/null
  # OpenCode (no --workdir flag, so cd inside a subshell)
  (cd "$WORKDIR" && opencode run "$PROMPT" \
    --model litellm/devstral-123b --dangerously-skip-permissions \
    2>/dev/null) | cat > "results/opencode/run-$i.txt"
done
# Analyze: a run passes if its output mentions the file or cites a line number.
# This is a rough proxy; it does not count the individual findings.
for f in results/vibe/run-*.txt; do
  grep -qiE "install\.sh|line [0-9]" "$f" && echo "PASS" || echo "FAIL"
done
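
To check the "3+ findings" criterion directly, the captured output can be scanned for distinct line references. A heuristic sketch: the reference formats it recognizes ("line 42", "Line 42:", "L42", "lines 10-12") are assumptions about how the tools phrase their findings, and the function names are hypothetical.

```python
import re

# Assumed output shape: each finding cites a line as "line 42", "Line 42:",
# "L42", or "lines 10-12". Adjust the pattern to match your tool's output.
LINE_REF = re.compile(r"\b(?:lines?\s+|L)(\d+)", re.IGNORECASE)


def count_line_refs(output: str) -> int:
    """Number of distinct line numbers cited in a review."""
    return len({int(number) for number in LINE_REF.findall(output)})


def run_passes(output: str, filename: str = "install.sh", minimum: int = 3) -> bool:
    """PASS only if the review names the file and cites at least `minimum` lines."""
    return filename in output and count_line_refs(output) >= minimum
```

Applied to each `run-*.txt` file, `run_passes` gives a stricter PASS/FAIL than a single pattern match, though it still cannot tell whether a cited finding is correct.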

Create a broken TypeScript file with these 3 deliberate errors:

// src/auth.ts — 3 bugs to fix
interface User {
  id: string;
  name: string;
  email: string;
  role: "admin" | "user";
}

// Bug 1: find() returns User | undefined, but return type says User
function getUser(users: User[], id: string): User {
  const user = users.find(u => u.id === id);
  return user;
}

// Bug 2: 'isAdmin' doesn't exist on User type
function isAdmin(user: User): boolean {
  return user.isAdmin === true;
}

// Bug 3: Math.random() returns number, not string
function createUser(name: string, email: string): User {
  return {
    id: Math.random(),
    name,
    email,
    role: "user",
  };
}

export { getUser, isAdmin, createUser };

Then run each tool with the same prompt:

PROMPT="The file src/auth.ts has TypeScript errors. Read it, identify ALL type errors, and fix them. Write the corrected file."
# Claude Code (use --add-dir to give access to the temp project)
claude -p "$PROMPT" --output-format text --add-dir /path/to/test-project
# Gemini CLI
cd /path/to/test-project && gemini -p "$PROMPT" --approval-mode yolo -o text
# Vibe
vibe -p "$PROMPT" --workdir /path/to/test-project --max-turns 15 --output text
# OpenCode
cd /path/to/test-project && opencode run "$PROMPT" \
--model litellm/devstral-123b --dangerously-skip-permissions

Score each tool on: time to complete, number of bugs found (out of 3), whether it modified the file, quality of the fix.

| Category | How We Scored |
| --- | --- |
| Time | Wall clock from date +%s%N before and after, includes startup + API + response |
| Quality | Did it produce correct, runnable code? Did it find the right bugs? |
| Reliability | Over 20 runs: did it read the file, find 3+ issues, no hallucinated findings? |
| Recovery | Out of 3 planted bugs: how many correctly identified and fixed? |
| File modification | Did the tool actually write the fix to disk, or just show text? |

  • Different model: Change litellm/devstral-123b to your model name. Adjust Vibe’s config.toml and OpenCode’s opencode.json accordingly.
  • Different backend: Replace http://localhost:4000/v1 with your vLLM/LiteLLM/Ollama endpoint.
  • No self-hosted GPU: Skip Vibe and OpenCode tests, or point them at a cloud provider (OpenRouter, Together AI, etc.).
  • Different repo: Replace vibe/install.sh with any file in your project. The reliability test works with any “read file and review it” prompt.

Full test scripts and raw output files are available at:

git.irregularchat.com/irregulars/ai-coding-env — vibe/benchmarks/

vibe/benchmarks/
├── run-scale-test.sh         # Real PR review (26 files, 5400+ lines)
├── run-reliability-test.sh   # 20 runs per tool
├── run-cost-test.sh          # Token measurement (Claude solo vs orchestrator)
├── run-recovery-test.sh      # TypeScript bug fix (4-way)
├── RESULTS.md                # Compiled analysis
└── results/
    ├── reliability/
    │   ├── vibe/run-{1..20}.txt
    │   └── opencode/run-{1..20}.txt
    └── recovery/
        ├── vibe-fixed.ts
        ├── claude-fixed.ts
        ├── gemini-fixed.ts
        └── *-recovery.txt