CLI Coding Agent Comparison
CLI Coding Agent Comparison
Section titled “CLI Coding Agent Comparison”A practical comparison of the leading AI coding tools with real benchmark timing data — including Claude Code on both Anthropic cloud and self-hosted Devstral. Each tool has a dedicated guide linked below.
Tool Guides
Section titled “Tool Guides”Each tool has a dedicated wiki page with full setup instructions, configuration, and usage patterns:
| Tool | Guide | What It Covers |
|---|---|---|
| Claude Code | Full Guide | Plugins, skills, team spawners, CLAUDE.md, tech stacks, funding |
| Claude Code Self-Hosted | Setup Guide | Use Claude Code with Devstral/Ollama/vLLM — switcher script, shell aliases, env vars |
| Mistral Vibe | Full Guide | Install, config, MCP, skills, AGENTS.md, self-hosted vLLM, Open-WebUI, orchestration |
| OpenCode | Full Guide | LSP integration, client/server, headless JSON events, MCP OAuth, custom agents |
| OpenHands | Full Guide | Web UI, Docker sandbox, Devstral setup, autonomous batch mode, GitHub integration |
| Gemini CLI | Full Guide | Subagents, Jules extension, multi-model orchestration, GEMINI.md |
Supporting Pages
Section titled “Supporting Pages”| Page | What It Covers |
|---|---|
| Agent Pricing | Plan pricing, token limits, self-hosted cost comparison |
| Claude Code Funding | Military/government procurement, billing, subscriptions |
| Project Rules | CLAUDE.md / AGENTS.md patterns and templates |
| Lessons Learned | Rules and incident reports by stack type |
Community Resources
Section titled “Community Resources”| Resource | Link |
|---|---|
| Vibe Coding Repository | git.irregularchat.com/public-repos/vibe-coding — Rules, skills, configs, orchestrator, switcher script |
| IrregularChat Wiki | irregularpedia.org — Full knowledge base |
Feature Matrix
Section titled “Feature Matrix”| Feature | Claude Code | Gemini CLI | Mistral Vibe | OpenCode | OpenHands |
|---|---|---|---|---|---|
| License | Proprietary | Apache 2.0 | Apache 2.0 | MIT | MIT |
| Interface | Terminal CLI | Terminal CLI | Terminal CLI | Terminal TUI | Web UI |
| Model support | Claude only | Gemini only | Any OpenAI-compat | Any (ai-sdk) | Any |
| Self-hosted models | No | No | Yes | Yes | Yes |
| LSP integration | No | No | No | Yes (20+ langs) | No |
| MCP servers | Yes | Yes | Yes (stdio+HTTP) | Yes (stdio+SSE+OAuth) | Yes |
| Client/server mode | No | No | No | Yes (serve+attach) | Yes (web server) |
| Headless mode | -p (print) | -p (prompt) | -p (headless) | run | CLI binary |
| Auto-approve flag | --dangerously-skip-permissions | --approval-mode yolo | (auto in -p) | --dangerously-skip-permissions | |
| JSON output | --output-format stream-json | -o stream-json | --output json|streaming | --format json (JSONL) | — |
| Cost tracking | --max-budget-usd | N/A | --max-price + input_price/output_price in config | JSONL step_finish events include cost/tokens | — |
--workdir flag | inherits CWD | inherits CWD | --workdir DIR | inherits CWD | N/A |
| Session continuity | --continue / --resume | --resume | --continue | --continue, --session | Per-session |
| Context window | 1M tokens | 2M+ tokens | Model-dependent | Model-dependent | Model-dependent |
| Context compaction | Sophisticated | Automatic | auto_compact_threshold | Auto at 75% | Memory condensation |
| Custom instructions | CLAUDE.md (hierarchical) | GEMINI.md / AGENTS.md | AGENTS.md (root only) | AGENTS.md + CLAUDE.md | Settings UI |
| Skills/agents | SKILL.md + team spawners | Extensions + subagents | SKILL.md + TOML agents | opencode.json agents | — |
| Sandbox isolation | No | --sandbox option | No | No | Yes (Docker) |
| GitHub/GitLab | Via gh CLI | Via extensions | Manual | Manual | Native (issues → PRs) |
| SWE-bench Verified | 80.8% (Opus 4.6) | ~70% (Gemini 2.5 Pro) | 72.2% (Devstral 2) | Model-dependent | 46.8-61.7% |
| Cost | $20-200/mo | Free (with Google account) | Free (self-host) | Free (self-host) | Free (self-host) |
| GitHub stars | N/A | ~55k | ~5k | ~147k | ~65k |
Benchmark Results
Section titled “Benchmark Results”Tested April 2026. Vibe and OpenCode hit the same Devstral 123B (FP8) on 2x NVIDIA B200 GPUs via vLLM + LiteLLM. Claude Code uses Anthropic’s cloud (Opus 4.6). Gemini CLI uses Google’s cloud (Gemini 2.5 Pro). All four tools ran the exact same prompts.
Equivalent headless flags across tools:
| Action | Claude Code | Gemini CLI | Mistral Vibe | OpenCode |
|---|---|---|---|---|
| Headless mode | -p "prompt" | -p "prompt" | -p "prompt" | run "prompt" |
| Auto-approve | --dangerously-skip-permissions | --approval-mode yolo / -y | (auto in -p) | --dangerously-skip-permissions |
| Output format | --output-format text|json|stream-json | -o text|json|stream-json | --output text|json | --format json (JSONL) |
| Model override | N/A (uses subscription) | -m model-name | N/A (uses config) | --model provider/model |
| Resume session | -r / --resume | -r / --resume | --continue | --continue / --session |
| Working directory | inherits CWD | inherits CWD | --workdir DIR | inherits CWD |
| Budget limit | --max-budget-usd | N/A | N/A | N/A |
Test 1: Code Generation
Section titled “Test 1: Code Generation”Prompt: “Write a Python function called merge_sorted that merges two sorted lists. Include type hints. Just the function.”
| Tool | Time | Output | Quality |
|---|---|---|---|
| Claude Code | 10.2s | Text + insight | Correct, uses list[int] (modern), added educational note |
| Gemini CLI | 8.1s | Text response | Correct, uses TypeVar, clean |
| Vibe | 7.0s | Text + wrote file | Correct, uses TypeVar |
| OpenCode | 5.6s | Text + wrote file | Correct, uses TypeVar |
All four produced correct two-pointer merge implementations. Claude Code added unsolicited educational insights. All self-hosted tools (Vibe, OpenCode) wrote files to disk AND showed the code in text output. Claude Code and Gemini responded with text only (no file created).
Test 2: Code Review
Section titled “Test 2: Code Review”Prompt: “Read vibe/install.sh and list the top 3 bugs or improvements with line numbers.”
| Tool | Time | Findings | Quality |
|---|---|---|---|
| Claude Code | 18.2s | 3 with reasoning | Best — nullglob, mkdir dry-run, grep pattern |
| Gemini CLI | 124.2s | 1 (after 19 tool calls) | Worst — couldn’t find file, SSH’d to server, fetched URL |
| Vibe | 4.3s | 3 with line numbers | Good — grep, set -e, metadata |
| OpenCode | 3.6s | 3 with line numbers | Good — mkdir, grep error handling |
Claude Code’s findings were the most insightful (caught nullglob issue). Gemini spent 2+ minutes trying to find the file — it searched the remote server via SSH instead of reading locally. Vibe and OpenCode were fast and accurate.
Gemini CLI Path Resolution
In this test, Gemini CLI failed to find a local file and attempted to SSH into a remote server to find it. This is a significant issue for local code review tasks. It may relate to CWD handling or the agent’s tendency to use tools aggressively.
Test 3: Reasoning (No Tools)
Section titled “Test 3: Reasoning (No Tools)”Prompt: “Explain the difference between a mutex and a semaphore in exactly 3 bullet points.”
| Tool | Time | Quality |
|---|---|---|
| Claude Code | 9.8s | Best — ownership, counting, purpose distinction |
| Gemini CLI | 7.0s | Good — locking vs signaling, ownership, use cases |
| Vibe | 3.8s | Good — binary vs counter, correct |
| OpenCode | 3.0s | Good — concise, correct |
Pure reasoning, no tools. All four correct. Claude Code’s answer was most technically precise (mentioned ownership semantics). Speed inversely correlated with quality — cloud models (Claude, Gemini) were slower but more detailed.
Test 4: Error Handling
Section titled “Test 4: Error Handling”Prompt: “Read DOES_NOT_EXIST.py and summarize it.”
| Tool | Time | Response |
|---|---|---|
| Claude Code | 8.1s | ”File doesn’t exist. Check the path.” |
| Gemini CLI | 9.6s | Attempted to read, reported not found (verbose) |
| Vibe | 2.5s | ”The file does not exist.” |
| OpenCode | 2.0s | Shows ✗ read failed, reports not found |
Self-hosted tools (Vibe, OpenCode) recovered 4x faster than cloud tools — no network round trip.
Performance Summary (5-Way)
Section titled “Performance Summary (5-Way)”| Metric | Claude Cloud | Claude Self-Hosted | Gemini CLI | Vibe | OpenCode |
|---|---|---|---|---|---|
| Cold start | 0.04s | 0.04s | 0.63s | 0.49s | 0.85s |
| Code generation | 9.0s | 21.6s | 12.5s | 7.4s | 6.1s |
| Code review | 18.2s | 19.5s | 124.2s | 4.3s | 3.6s |
| Reasoning | 8.0s | 10.0s | 8.5s | 3.6s | 2.6s |
| Error recovery | 6.8s | 8.8s | 9.7s | 2.5s | 2.7s |
| Model | Opus 4.6 | Devstral 123B | Gemini 2.5 Pro | Devstral 123B | Devstral 123B |
| Cost | Subscription | Free | Free (Google) | Free | Free |
| Quality | Best | Good+ | Good | Good | Good |
Key takeaways:
- Highest quality: Claude Cloud (Opus) — best reasoning, most detailed
- Fastest execution: OpenCode and Vibe (2.5-7s on Devstral, no harness overhead)
- Best value for interactive sessions: Claude Self-Hosted — same tooling as cloud, unlimited tokens, free
- Best for headless dispatch: Vibe (fastest +
--workdir) or OpenCode (LSP + sessions) - Setup guide: Claude Code Self-Hosted
Self-Hosted Model Comparison (4-Model Benchmark)
Section titled “Self-Hosted Model Comparison (4-Model Benchmark)”Tested April 2026 on 8x NVIDIA B200 (183GB each). All models served via vLLM 0.19.0 through LiteLLM gateway. 7 tests covering speed, quality, instruction following, security detection, complex code gen, and convention understanding.
Models Tested
Section titled “Models Tested”| Model | Parameters | SWE-bench | GPUs (FP8) | License | vLLM Parser |
|---|---|---|---|---|---|
| Devstral 2 123B | 123B dense | 72.2% | 2 | Mod. MIT | mistral |
| MiniMax M2.5 | 230B/10B MoE | 80.2% | 2 | Mod. MIT | minimax_m2 |
| Qwen3.6-35B-A3B | 35B/3B MoE | 73.4% | 1 | Apache 2.0 | qwen3_coder |
| Devstral Small 2 24B | 24B dense | 68.0% | 1 | Mod. MIT | mistral |
Speed Results (ms, lower is better)
Section titled “Speed Results (ms, lower is better)”| Test | Devstral 123B | MiniMax M2.5 | Qwen3.6 35B | Devstral Small 24B |
|---|---|---|---|---|
| Code generation | 1,791 | 3,830 | 2,314 | 742 |
| Code review | 6,871 | 3,855 | 3,327 | 1,987 |
| Reasoning (3 bullets) | 2,497 | 2,348 | 1,526 | 1,109 |
| Instruction following | 79 | 264 | 194 | 64 |
| Complex decorator | 2,537 | 4,580 | 2,956 | 1,261 |
| Convention quiz | 147 | 260 | 782 | 57 |
| Average | 2,320 | 2,523 | 1,850 | 870 |
Quality Results
Section titled “Quality Results”| Test | Devstral 123B | MiniMax M2.5 | Qwen3.6 35B | Devstral Small 24B |
|---|---|---|---|---|
| Correct function + types | Pass | Pass (in thinking) | Pass (in thinking) | Pass |
| Found 3 review bugs | 3/3 | 0 (thinking) | 2 | 3/3 |
| 3 clean bullet points | 3 | 2 (verbose) | 3 (in thinking) | 3 |
| One-word answer | ”Four” | Failed (13 words) | Failed (thinking) | “Four” |
| Working retry decorator | Pass | Pass (in thinking) | Partial | Pass |
| Correct convention answer | B (correct) | Wrong | C (wrong) | B (correct) |
| Quality Score | 6/6 | 1/6 | 2/6 | 6/6 |
Detailed Findings
Section titled “Detailed Findings”Thinking Trace Problem: MiniMax M2.5 and Qwen3.6 are “reasoning models” that output their thinking process inline (e.g., “Here’s a thinking process: 1. Analyze User Input…”). This pollutes the response for coding dispatch — tools like Vibe and OpenCode receive the thinking traces as the actual answer. vLLM’s --reasoning-parser can strip these, but:
- Qwen3.6: reasoning parser strips ALL content (returns empty)
- MiniMax M2.5:
minimax_m2parser untested for this - Neither supports
enable_thinking=falseon vLLM 0.19.0
Convention Understanding (T7): Only the Devstral models correctly identified fix(auth): resolve null pointer in session validation as the correct conventional commit format. MiniMax chose wrong. Qwen3.6 chose C) Fix Auth Bug. This matters for coding agents that need to follow project conventions.
Instruction Following (T4): When asked for “one word,” both Devstral models responded “Four.” MiniMax responded with 13 words including its thinking. Qwen3.6 started with “Here’s a thinking process.” For headless dispatch where the response is parsed programmatically, this is a critical failure.
Recommendation
Section titled “Recommendation”| Use Case | Best Model | Why |
|---|---|---|
| Production coding agent | Devstral 123B | 6/6 quality, clean output, proven tools |
| Fast-lane dispatch | Devstral Small 24B | 6/6 quality, 870ms avg, same parser |
| Monitor for future | MiniMax M2.5 | 80.2% SWE-bench — needs reasoning parser fix |
| Monitor for future | Qwen3.6 35B | 73.4% SWE-bench — needs thinking mode disable |
| Chat (non-coding) | Gemma 4 31B | Broken tool calling, fine for conversation |
GPU Layout (Recommended)
Section titled “GPU Layout (Recommended)”GPU 0: Gemma 4 31B (chat only)GPU 1: Devstral Small 24B (fast lane, 870ms avg)GPU 2: FREE (MiniMax future / Kimi hot-swap)GPU 3: FREE (MiniMax future / Kimi hot-swap)GPU 4: FREE (2nd Devstral 123B replica)GPU 5: FREE (2nd Devstral 123B replica)GPU 6: Devstral 123B (primary coding agent)GPU 7: Devstral 123B (primary coding agent)When to Use What
Section titled “When to Use What”Decision Tree
Section titled “Decision Tree”Need highest benchmark scores / complex reasoning? → Claude Code (Opus/Sonnet, 80.8% SWE-bench)
Need massive context window (2M+ tokens)? → Gemini CLI (Gemini 2.5 Pro, free with Google account)
Need self-hosted / data stays on-network? → Is it a web-based batch workflow? → OpenHands (web UI, sandbox, GitHub integration) → Is it a CLI workflow? → Do you need LSP diagnostics? (TypeScript, Go, Rust) → OpenCode (LSP catches type errors in-loop) → Do you need --workdir or scripted dispatch? → Vibe (--workdir flag, cleaner output, faster startup) → Do you need session continuity? → OpenCode (--continue, --session, serve+attach) → Simple one-shot tasks? → Vibe (lightest, most predictable)By Task Type
Section titled “By Task Type”| Task | Best Tool | Why |
|---|---|---|
| Complex architecture / planning | Claude Code | Superior multi-step reasoning, 80.8% SWE-bench |
| Security review (judgment) | Claude Code | Better at exploitability assessment |
| Large codebase analysis | Gemini CLI | 2M+ token context window |
| Multimodal (images, screenshots) | Gemini CLI | Native multimodal support |
| TypeScript / Go / Rust editing | OpenCode | LSP catches type errors in-loop |
| Long multi-step sessions | OpenCode | Context compaction + session persistence |
| CI/CD automation | OpenCode | JSONL events with token/cost tracking |
| Quick one-shot generation | Vibe | Fastest startup, --workdir, clean text |
| Parallel batch dispatch | Vibe | Predictable, --max-turns safety |
| Docs / configs / markdown | Vibe | LSP irrelevant, lighter tool |
| Autonomous issue resolution | OpenHands | Fire-and-forget, sandbox, GitHub native |
| Untrusted / experimental code | OpenHands | Docker sandbox isolation |
The Orchestration Stack
Section titled “The Orchestration Stack”The most powerful setup combines all four tools:
| Layer | Tool | Role |
|---|---|---|
| Brain | Claude Code | Plans, reviews, synthesizes, makes judgment calls |
| Large Context | Gemini CLI | Analyze entire codebases, multimodal reviews |
| LSP Worker | OpenCode | TypeScript/Go/Rust implementation with type checking |
| Grunt Worker | Vibe | Bulk generation, research, docs, configs |
| Batch Agent | OpenHands | Autonomous issue resolution, sandbox experiments |
See the Claude Code + Vibe Orchestration guide for the dispatch pattern.
Deep Evaluation Results
Section titled “Deep Evaluation Results”Reliability (20 Runs, Same Task)
Section titled “Reliability (20 Runs, Same Task)”Ran “review vibe/install.sh for bugs” 20 times per tool on the same Devstral 123B backend. Only Vibe and OpenCode tested (self-hosted, no per-run cost).
| Metric | Vibe (20 runs) | OpenCode (20 runs) |
|---|---|---|
| Success rate | 100% | 100% |
| Found 3+ issues | 100% | 100% |
| Avg time | 4,899ms | 3,964ms |
| Min time | 4,368ms | 3,589ms |
| Max time | 6,061ms | 4,402ms |
| Time variance | 1,693ms range | 813ms range |
Both tools are 100% reliable over 20 runs — zero failures, zero hallucinations. OpenCode is 19% faster with tighter variance (more predictable for CI/CD).
Failure Recovery (Fix Broken TypeScript)
Section titled “Failure Recovery (Fix Broken TypeScript)”Created a TypeScript file with 3 deliberate type errors. Asked each tool to find and fix them.
| Metric | Claude Code | Gemini CLI | Vibe | OpenCode |
|---|---|---|---|---|
| Time | 33.9s | 21.4s | 8.8s | TIMEOUT |
| Bugs fixed | 3/3 | 3/3 | 3/3 | 0/3 |
| File modified | Yes | Yes | Yes | No |
| Approach | Minimal diff | Creative (throw Error, better IDs) | Clean fix | Failed |
Claude Code produced the cleanest fix (minimal changes). Gemini CLI was most creative — used .toString(36).substring(2) for better ID generation and threw Error for missing users. Vibe was fastest at 8.8s with all 3 bugs correctly fixed.
Stability & Maintenance
Section titled “Stability & Maintenance”| Tool | Current | Releases/Month | Breaking Changes (2026) | Automation Risk |
|---|---|---|---|---|
| Claude Code | 2.1.98 | ~8-12 | Low | Low |
| Vibe | 2.7.6 | ~2-4 | Low | Low |
| Gemini CLI | 0.38.2 | ~4-6 | Medium | Medium |
| OpenCode | 1.14.20 | 60+ (multi/day) | High (-p deprecated) | High |
For automated pipelines: pin OpenCode versions (breaks frequently). Claude Code and Vibe have the most stable headless APIs.
Self-Hosted Comparison
Section titled “Self-Hosted Comparison”All three open-source tools can point at the same vLLM backend:
Minimum vLLM Launch Command
Section titled “Minimum vLLM Launch Command”vllm serve mistralai/Devstral-2-123B-Instruct-2512 \ --tool-call-parser mistral \ --enable-auto-tool-choice \ --tensor-parallel-size 2 \ --quantization fp8Configuration Side-by-Side
Section titled “Configuration Side-by-Side”Vibe (~/.vibe/config.toml):
[[providers]]name = "local"api_base = "http://localhost:8000/v1"api_key_env_var = "VLLM_KEY"api_style = "openai"backend = "generic"
[[models]]name = "devstral-123b"provider = "local"alias = "devstral"temperature = 0.2OpenCode (opencode.json):
{ "provider": { "local": { "npm": "@ai-sdk/openai-compatible", "options": { "baseURL": "http://localhost:8000/v1" }, "models": { "devstral-123b": { "name": "Devstral 123B" } } } }, "model": "local/devstral-123b"}OpenHands (Settings > LLM > Advanced):
Custom Model: openai/devstralBase URL: http://host.docker.internal:8000/v1Cost: Self-Hosted vs Cloud
Section titled “Cost: Self-Hosted vs Cloud”| Setup | Monthly Cost | Tokens/Month |
|---|---|---|
| Claude Code Pro | $20 | Shared with web (limited) |
| Claude Code Max | $100-200 | Higher limits (unpublished) |
| Claude Code API | Pay-per-token | $3-15/M tokens (model dependent) |
| Self-hosted Devstral (Vibe/OpenCode/OpenHands) | $0 (electricity only) | Unlimited |
| Self-hosted (cloud GPU) | $2-4/hr (H100) | Unlimited while running |
Reproduce These Benchmarks
Section titled “Reproduce These Benchmarks”All tests were run on 2026-04-21. Scripts and raw results are published so you can replicate or extend them.
Test Environment
Section titled “Test Environment”| Component | Version | Details |
|---|---|---|
| Hardware | Dell XE9780 (Obelisk) | 8x NVIDIA B200, 2TB RAM |
| GPU allocation | GPUs 6-7 | Devstral 123B (FP8, tensor-parallel=2) |
| vLLM | v0.19.0 | --tool-call-parser mistral --enable-auto-tool-choice |
| LiteLLM | 1.82.6 | Gateway on port 4000, drop_params: true |
| Claude Code | 2.1.98 | Anthropic cloud (Opus 4.6) |
| Gemini CLI | 0.38.2 | Google cloud (Gemini 2.5 Pro) |
| Mistral Vibe | 2.7.6 | Self-hosted Devstral via LiteLLM |
| OpenCode | 1.14.20 | Self-hosted Devstral via LiteLLM |
| macOS | Darwin 25.3.0 | Apple Silicon (M-series) |
Exact Commands Used
Section titled “Exact Commands Used”Test 1: Code Generation (Single Run, 4-Way)
Section titled “Test 1: Code Generation (Single Run, 4-Way)”PROMPT="Write a Python function called merge_sorted that merges two sorted lists into one sorted list. Include type hints. Just the function, no explanation."
# Claude Codetime claude -p "$PROMPT" --output-format text
# Gemini CLItime gemini -p "$PROMPT" --approval-mode yolo -o text
# Mistral Vibe (self-hosted Devstral)time vibe -p "$PROMPT" --workdir /tmp --max-turns 5 --output text
# OpenCode (self-hosted Devstral)time opencode run "$PROMPT" --model litellm/devstral-123b --dangerously-skip-permissionsTest 2: Code Review (Single Run, 4-Way)
Section titled “Test 2: Code Review (Single Run, 4-Way)”PROMPT="Read vibe/install.sh and list the top 3 bugs or improvements with line numbers. Be brief."WORKDIR="/path/to/your/project"
# Claude Codetime claude -p "$PROMPT" --output-format text
# Gemini CLItime gemini -p "$PROMPT" --approval-mode yolo -o text
# Mistral Vibetime vibe -p "$PROMPT" --workdir "$WORKDIR" --max-turns 10 --output text \ --enabled-tools "read_file" --enabled-tools "grep" --enabled-tools "bash"
# OpenCode (must cd to project first — no --workdir flag)cd "$WORKDIR" && time opencode run "$PROMPT" \ --model litellm/devstral-123b --dangerously-skip-permissionsTest 3: Reasoning (No Tools, 4-Way)
Section titled “Test 3: Reasoning (No Tools, 4-Way)”PROMPT="Explain the difference between a mutex and a semaphore in exactly 3 bullet points. No code."
# Same 4 commands as Test 1 (just change the prompt)Test 4: Error Handling (4-Way)
Section titled “Test 4: Error Handling (4-Way)”PROMPT="Read the file DOES_NOT_EXIST.py and summarize it."
# Same 4 commands as Test 2 (just change the prompt)Test 5: Cold Start
Section titled “Test 5: Cold Start”time claude --versiontime gemini --versiontime vibe --versiontime opencode --versionReliability Test (20 Runs, Vibe + OpenCode Only)
Section titled “Reliability Test (20 Runs, Vibe + OpenCode Only)”PROMPT="Read vibe/install.sh and list the top 3 bugs or improvements with line numbers. Be brief."WORKDIR="/path/to/your/project"
# Run 20 times each, capture output to filesfor i in $(seq 1 20); do # Vibe vibe -p "$PROMPT" --workdir "$WORKDIR" --max-turns 10 --output text \ --enabled-tools "read_file" --enabled-tools "grep" --enabled-tools "bash" \ > "results/vibe/run-$i.txt" 2>/dev/null
# OpenCode (cd "$WORKDIR" && opencode run "$PROMPT" \ --model litellm/devstral-123b --dangerously-skip-permissions \ 2>/dev/null) | cat > "results/opencode/run-$i.txt"done
# Analyze: count runs that found the file and produced 3+ findingsfor f in results/vibe/run-*.txt; do grep -qiE "install.sh|line [0-9]" "$f" && echo "PASS" || echo "FAIL"doneRecovery Test (TypeScript Bug Fix, 4-Way)
Section titled “Recovery Test (TypeScript Bug Fix, 4-Way)”Create a broken TypeScript file with these 3 deliberate errors:
// src/auth.ts — 3 bugs to fixinterface User { id: string; name: string; email: string; role: "admin" | "user";}
// Bug 1: find() returns User | undefined, but return type says Userfunction getUser(users: User[], id: string): User { const user = users.find(u => u.id === id); return user;}
// Bug 2: 'isAdmin' doesn't exist on User typefunction isAdmin(user: User): boolean { return user.isAdmin === true;}
// Bug 3: Math.random() returns number, not stringfunction createUser(name: string, email: string): User { return { id: Math.random(), name, email, role: "user", };}
export { getUser, isAdmin, createUser };Then run each tool with the same prompt:
PROMPT="The file src/auth.ts has TypeScript errors. Read it, identify ALL type errors, and fix them. Write the corrected file."
# Claude Code (use --add-dir to give access to the temp project)claude -p "$PROMPT" --output-format text --add-dir /path/to/test-project
# Gemini CLIcd /path/to/test-project && gemini -p "$PROMPT" --approval-mode yolo -o text
# Vibevibe -p "$PROMPT" --workdir /path/to/test-project --max-turns 15 --output text
# OpenCodecd /path/to/test-project && opencode run "$PROMPT" \ --model litellm/devstral-123b --dangerously-skip-permissionsScore each tool on: time to complete, number of bugs found (out of 3), whether it modified the file, quality of the fix.
Scoring Criteria
Section titled “Scoring Criteria”| Category | How We Scored |
|---|---|
| Time | Wall clock from date +%s%N before and after, includes startup + API + response |
| Quality | Did it produce correct, runnable code? Did it find the right bugs? |
| Reliability | Over 20 runs: did it read the file, find 3+ issues, no hallucinated findings? |
| Recovery | Out of 3 planted bugs: how many correctly identified and fixed? |
| File modification | Did the tool actually write the fix to disk, or just show text? |
Adapting for Your Environment
Section titled “Adapting for Your Environment”- Different model: Change
litellm/devstral-123bto your model name. Adjust Vibe’sconfig.tomland OpenCode’sopencode.jsonaccordingly. - Different backend: Replace
http://localhost:4000/v1with your vLLM/LiteLLM/Ollama endpoint. - No self-hosted GPU: Skip Vibe and OpenCode tests, or point them at a cloud provider (OpenRouter, Together AI, etc.).
- Different repo: Replace
vibe/install.shwith any file in your project. The reliability test works with any “read file and review it” prompt.
Raw Results
Section titled “Raw Results”Full test scripts and raw output files are available at:
git.irregularchat.com/irregulars/ai-coding-env — vibe/benchmarks/
vibe/benchmarks/├── run-scale-test.sh # Real PR review (26 files, 5400+ lines)├── run-reliability-test.sh # 20 runs per tool├── run-cost-test.sh # Token measurement (Claude solo vs orchestrator)├── run-recovery-test.sh # TypeScript bug fix (4-way)├── RESULTS.md # Compiled analysis└── results/ ├── reliability/ │ ├── vibe/run-{1..20}.txt │ └── opencode/run-{1..20}.txt └── recovery/ ├── vibe-fixed.ts ├── claude-fixed.ts ├── gemini-fixed.ts └── *-recovery.txtRelated Resources
Section titled “Related Resources”Tool Guides
Section titled “Tool Guides”- Claude Code - Full guide: plugins, skills, team spawners, CLAUDE.md
- Claude Code Self-Hosted - Use Claude Code with Devstral, Ollama, vLLM (free)
- Claude Code Funding - Pricing, military procurement, billing
- Mistral Vibe - Open-source CLI with self-hosted Devstral + orchestration
- OpenCode - Open-source TUI with LSP integration + client/server
- OpenHands - Web-based autonomous agent with Docker sandbox
- Gemini Code - Google’s Gemini CLI with 2M+ context
Reference
Section titled “Reference”- AI Agent Pricing - Plan pricing, token limits, cost comparison
- Project Rules & Lessons Learned - CLAUDE.md / AGENTS.md patterns
Community
Section titled “Community”- Vibe Coding Repository - Rules, skills, configs, orchestrator skill, backend switcher