CLI Coding Agent Comparison

A practical comparison of the leading AI coding tools with real benchmark timing data — including Claude Code on both Anthropic cloud and self-hosted Devstral. Each tool has a dedicated guide linked below.

Tool Guides

Each tool has a dedicated wiki page with full setup instructions, configuration, and usage patterns:

Tool	Guide	What It Covers
Claude Code	Full Guide	Plugins, skills, team spawners, CLAUDE.md, tech stacks, funding
Claude Code Self-Hosted	Setup Guide	Use Claude Code with Devstral/Ollama/vLLM — switcher script, shell aliases, env vars
Mistral Vibe	Full Guide	Install, config, MCP, skills, AGENTS.md, self-hosted vLLM, Open-WebUI, orchestration
OpenCode	Full Guide	LSP integration, client/server, headless JSON events, MCP OAuth, custom agents
OpenHands	Full Guide	Web UI, Docker sandbox, Devstral setup, autonomous batch mode, GitHub integration
Gemini CLI	Full Guide	Subagents, Jules extension, multi-model orchestration, GEMINI.md

Supporting Pages

Page	What It Covers
Agent Pricing	Plan pricing, token limits, self-hosted cost comparison
Claude Code Funding	Military/government procurement, billing, subscriptions
Project Rules	CLAUDE.md / AGENTS.md patterns and templates
Lessons Learned	Rules and incident reports by stack type

Community Resources

Resource	Link
Vibe Coding Repository	git.irregularchat.com/public-repos/vibe-coding — Rules, skills, configs, orchestrator, switcher script
IrregularChat Wiki	irregularpedia.org — Full knowledge base

Feature Matrix

Feature	Claude Code	Gemini CLI	Mistral Vibe	OpenCode	OpenHands
License	Proprietary	Apache 2.0	Apache 2.0	MIT	MIT
Interface	Terminal CLI	Terminal CLI	Terminal CLI	Terminal TUI	Web UI
Model support	Claude only	Gemini only	Any OpenAI-compat	Any (ai-sdk)	Any
Self-hosted models	No	No	Yes	Yes	Yes
LSP integration	No	No	No	Yes (20+ langs)	No
MCP servers	Yes	Yes	Yes (stdio+HTTP)	Yes (stdio+SSE+OAuth)	Yes
Client/server mode	No	No	No	Yes (`serve`+`attach`)	Yes (web server)
Headless mode	`-p` (print)	`-p` (prompt)	`-p` (headless)	`run`	CLI binary
Auto-approve flag	`--dangerously-skip-permissions`	`--approval-mode yolo`	(auto in `-p`)	`--dangerously-skip-permissions`
JSON output	`--output-format stream-json`	`-o stream-json`	`--output json\|streaming`	`--format json` (JSONL)	—
Cost tracking	`--max-budget-usd`	N/A	`--max-price` + `input_price`/`output_price` in config	JSONL `step_finish` events include `cost`/`tokens`	—
`--workdir` flag	inherits CWD	inherits CWD	`--workdir DIR`	inherits CWD	N/A
Session continuity	`--continue` / `--resume`	`--resume`	`--continue`	`--continue`, `--session`	Per-session
Context window	1M tokens	2M+ tokens	Model-dependent	Model-dependent	Model-dependent
Context compaction	Sophisticated	Automatic	`auto_compact_threshold`	Auto at 75%	Memory condensation
Custom instructions	CLAUDE.md (hierarchical)	GEMINI.md / AGENTS.md	AGENTS.md (root only)	AGENTS.md + CLAUDE.md	Settings UI
Skills/agents	SKILL.md + team spawners	Extensions + subagents	SKILL.md + TOML agents	opencode.json agents	—
Sandbox isolation	No	`--sandbox` option	No	No	Yes (Docker)
GitHub/GitLab	Via `gh` CLI	Via extensions	Manual	Manual	Native (issues → PRs)
SWE-bench Verified	80.8% (Opus 4.6)	~70% (Gemini 2.5 Pro)	72.2% (Devstral 2)	Model-dependent	46.8-61.7%
Cost	$20-200/mo	Free (with Google account)	Free (self-host)	Free (self-host)	Free (self-host)
GitHub stars	N/A	~55k	~5k	~147k	~65k

Benchmark Results

Tested April 2026. Vibe and OpenCode hit the same Devstral 123B (FP8) on 2x NVIDIA B200 GPUs via vLLM + LiteLLM. Claude Code uses Anthropic’s cloud (Opus 4.6). Gemini CLI uses Google’s cloud (Gemini 2.5 Pro). All four tools ran the exact same prompts.

Equivalent headless flags across tools:

Action	Claude Code	Gemini CLI	Mistral Vibe	OpenCode
Headless mode	`-p "prompt"`	`-p "prompt"`	`-p "prompt"`	`run "prompt"`
Auto-approve	`--dangerously-skip-permissions`	`--approval-mode yolo` / `-y`	(auto in `-p`)	`--dangerously-skip-permissions`
Output format	`--output-format text\|json\|stream-json`	`-o text\|json\|stream-json`	`--output text\|json`	`--format json` (JSONL)
Model override	N/A (uses subscription)	`-m model-name`	N/A (uses config)	`--model provider/model`
Resume session	`-r` / `--resume`	`-r` / `--resume`	`--continue`	`--continue` / `--session`
Working directory	inherits CWD	inherits CWD	`--workdir DIR`	inherits CWD
Budget limit	`--max-budget-usd`	N/A	N/A	N/A

Test 1: Code Generation

Prompt: “Write a Python function called merge_sorted that merges two sorted lists. Include type hints. Just the function.”

Tool	Time	Output	Quality
Claude Code	10.2s	Text + insight	Correct, uses `list[int]` (modern), added educational note
Gemini CLI	8.1s	Text response	Correct, uses `TypeVar`, clean
Vibe	7.0s	Text + wrote file	Correct, uses `TypeVar`
OpenCode	5.6s	Text + wrote file	Correct, uses `TypeVar`

All four produced correct two-pointer merge implementations. Claude Code added unsolicited educational insights. All self-hosted tools (Vibe, OpenCode) wrote files to disk AND showed the code in text output. Claude Code and Gemini responded with text only (no file created).

Test 2: Code Review

Prompt: “Read vibe/install.sh and list the top 3 bugs or improvements with line numbers.”

Tool	Time	Findings	Quality
Claude Code	18.2s	3 with reasoning	Best — nullglob, mkdir dry-run, grep pattern
Gemini CLI	124.2s	1 (after 19 tool calls)	Worst — couldn’t find file, SSH’d to server, fetched URL
Vibe	4.3s	3 with line numbers	Good — grep, set -e, metadata
OpenCode	3.6s	3 with line numbers	Good — mkdir, grep error handling

Claude Code’s findings were the most insightful (caught nullglob issue). Gemini spent 2+ minutes trying to find the file — it searched the remote server via SSH instead of reading locally. Vibe and OpenCode were fast and accurate.

Gemini CLI Path Resolution

In this test, Gemini CLI failed to find a local file and attempted to SSH into a remote server to find it. This is a significant issue for local code review tasks. It may relate to CWD handling or the agent’s tendency to use tools aggressively.

Test 3: Reasoning (No Tools)

Prompt: “Explain the difference between a mutex and a semaphore in exactly 3 bullet points.”

Tool	Time	Quality
Claude Code	9.8s	Best — ownership, counting, purpose distinction
Gemini CLI	7.0s	Good — locking vs signaling, ownership, use cases
Vibe	3.8s	Good — binary vs counter, correct
OpenCode	3.0s	Good — concise, correct

Pure reasoning, no tools. All four correct. Claude Code’s answer was most technically precise (mentioned ownership semantics). Speed inversely correlated with quality — cloud models (Claude, Gemini) were slower but more detailed.

Test 4: Error Handling

Prompt: “Read DOES_NOT_EXIST.py and summarize it.”

Tool	Time	Response
Claude Code	8.1s	”File doesn’t exist. Check the path.”
Gemini CLI	9.6s	Attempted to read, reported not found (verbose)
Vibe	2.5s	”The file does not exist.”
OpenCode	2.0s	Shows `✗ read failed`, reports not found

Self-hosted tools (Vibe, OpenCode) recovered 4x faster than cloud tools — no network round trip.

Performance Summary (5-Way)

Metric	Claude Cloud	Claude Self-Hosted	Gemini CLI	Vibe	OpenCode
Cold start	0.04s	0.04s	0.63s	0.49s	0.85s
Code generation	9.0s	21.6s	12.5s	7.4s	6.1s
Code review	18.2s	19.5s	124.2s	4.3s	3.6s
Reasoning	8.0s	10.0s	8.5s	3.6s	2.6s
Error recovery	6.8s	8.8s	9.7s	2.5s	2.7s
Model	Opus 4.6	Devstral 123B	Gemini 2.5 Pro	Devstral 123B	Devstral 123B
Cost	Subscription	Free	Free (Google)	Free	Free
Quality	Best	Good+	Good	Good	Good

Key takeaways:

Highest quality: Claude Cloud (Opus) — best reasoning, most detailed
Fastest execution: OpenCode and Vibe (2.5-7s on Devstral, no harness overhead)
Best value for interactive sessions: Claude Self-Hosted — same tooling as cloud, unlimited tokens, free
Best for headless dispatch: Vibe (fastest + --workdir) or OpenCode (LSP + sessions)
Setup guide: Claude Code Self-Hosted

Self-Hosted Model Comparison (4-Model Benchmark)

Tested April 2026 on 8x NVIDIA B200 (183GB each). All models served via vLLM 0.19.0 through LiteLLM gateway. 7 tests covering speed, quality, instruction following, security detection, complex code gen, and convention understanding.

Models Tested

Model	Parameters	SWE-bench	GPUs (FP8)	License	vLLM Parser
Devstral 2 123B	123B dense	72.2%	2	Mod. MIT	`mistral`
MiniMax M2.5	230B/10B MoE	80.2%	2	Mod. MIT	`minimax_m2`
Qwen3.6-35B-A3B	35B/3B MoE	73.4%	1	Apache 2.0	`qwen3_coder`
Devstral Small 2 24B	24B dense	68.0%	1	Mod. MIT	`mistral`

Speed Results (ms, lower is better)

Test	Devstral 123B	MiniMax M2.5	Qwen3.6 35B	Devstral Small 24B
Code generation	1,791	3,830	2,314	742
Code review	6,871	3,855	3,327	1,987
Reasoning (3 bullets)	2,497	2,348	1,526	1,109
Instruction following	79	264	194	64
Complex decorator	2,537	4,580	2,956	1,261
Convention quiz	147	260	782	57
Average	2,320	2,523	1,850	870

Quality Results

Test	Devstral 123B	MiniMax M2.5	Qwen3.6 35B	Devstral Small 24B
Correct function + types	Pass	Pass (in thinking)	Pass (in thinking)	Pass
Found 3 review bugs	3/3	0 (thinking)	2	3/3
3 clean bullet points	3	2 (verbose)	3 (in thinking)	3
One-word answer	”Four”	Failed (13 words)	Failed (thinking)	“Four”
Working retry decorator	Pass	Pass (in thinking)	Partial	Pass
Correct convention answer	B (correct)	Wrong	C (wrong)	B (correct)
Quality Score	6/6	1/6	2/6	6/6

Detailed Findings

Thinking Trace Problem: MiniMax M2.5 and Qwen3.6 are “reasoning models” that output their thinking process inline (e.g., “Here’s a thinking process: 1. Analyze User Input…”). This pollutes the response for coding dispatch — tools like Vibe and OpenCode receive the thinking traces as the actual answer. vLLM’s --reasoning-parser can strip these, but:

Qwen3.6: reasoning parser strips ALL content (returns empty)
MiniMax M2.5: minimax_m2 parser untested for this
Neither supports enable_thinking=false on vLLM 0.19.0

Convention Understanding (T7): Only the Devstral models correctly identified fix(auth): resolve null pointer in session validation as the correct conventional commit format. MiniMax chose wrong. Qwen3.6 chose C) Fix Auth Bug. This matters for coding agents that need to follow project conventions.

Instruction Following (T4): When asked for “one word,” both Devstral models responded “Four.” MiniMax responded with 13 words including its thinking. Qwen3.6 started with “Here’s a thinking process.” For headless dispatch where the response is parsed programmatically, this is a critical failure.

Recommendation

Use Case	Best Model	Why
Production coding agent	Devstral 123B	6/6 quality, clean output, proven tools
Fast-lane dispatch	Devstral Small 24B	6/6 quality, 870ms avg, same parser
Monitor for future	MiniMax M2.5	80.2% SWE-bench — needs reasoning parser fix
Monitor for future	Qwen3.6 35B	73.4% SWE-bench — needs thinking mode disable
Chat (non-coding)	Gemma 4 31B	Broken tool calling, fine for conversation

GPU Layout (Recommended)

GPU 0: Gemma 4 31B          (chat only)
GPU 1: Devstral Small 24B   (fast lane, 870ms avg)
GPU 2: FREE                  (MiniMax future / Kimi hot-swap)
GPU 3: FREE                  (MiniMax future / Kimi hot-swap)
GPU 4: FREE                  (2nd Devstral 123B replica)
GPU 5: FREE                  (2nd Devstral 123B replica)
GPU 6: Devstral 123B         (primary coding agent)
GPU 7: Devstral 123B         (primary coding agent)

When to Use What

Decision Tree

Need highest benchmark scores / complex reasoning?
  → Claude Code (Opus/Sonnet, 80.8% SWE-bench)

Need massive context window (2M+ tokens)?
  → Gemini CLI (Gemini 2.5 Pro, free with Google account)

Need self-hosted / data stays on-network?
  → Is it a web-based batch workflow?
      → OpenHands (web UI, sandbox, GitHub integration)
  → Is it a CLI workflow?
      → Do you need LSP diagnostics? (TypeScript, Go, Rust)
          → OpenCode (LSP catches type errors in-loop)
      → Do you need --workdir or scripted dispatch?
          → Vibe (--workdir flag, cleaner output, faster startup)
      → Do you need session continuity?
          → OpenCode (--continue, --session, serve+attach)
      → Simple one-shot tasks?
          → Vibe (lightest, most predictable)

By Task Type

Task	Best Tool	Why
Complex architecture / planning	Claude Code	Superior multi-step reasoning, 80.8% SWE-bench
Security review (judgment)	Claude Code	Better at exploitability assessment
Large codebase analysis	Gemini CLI	2M+ token context window
Multimodal (images, screenshots)	Gemini CLI	Native multimodal support
TypeScript / Go / Rust editing	OpenCode	LSP catches type errors in-loop
Long multi-step sessions	OpenCode	Context compaction + session persistence
CI/CD automation	OpenCode	JSONL events with token/cost tracking
Quick one-shot generation	Vibe	Fastest startup, `--workdir`, clean text
Parallel batch dispatch	Vibe	Predictable, `--max-turns` safety
Docs / configs / markdown	Vibe	LSP irrelevant, lighter tool
Autonomous issue resolution	OpenHands	Fire-and-forget, sandbox, GitHub native
Untrusted / experimental code	OpenHands	Docker sandbox isolation

The Orchestration Stack

The most powerful setup combines all four tools:

Layer	Tool	Role
Brain	Claude Code	Plans, reviews, synthesizes, makes judgment calls
Large Context	Gemini CLI	Analyze entire codebases, multimodal reviews
LSP Worker	OpenCode	TypeScript/Go/Rust implementation with type checking
Grunt Worker	Vibe	Bulk generation, research, docs, configs
Batch Agent	OpenHands	Autonomous issue resolution, sandbox experiments

See the Claude Code + Vibe Orchestration guide for the dispatch pattern.

Deep Evaluation Results

Reliability (20 Runs, Same Task)

Ran “review vibe/install.sh for bugs” 20 times per tool on the same Devstral 123B backend. Only Vibe and OpenCode tested (self-hosted, no per-run cost).

Metric	Vibe (20 runs)	OpenCode (20 runs)
Success rate	100%	100%
Found 3+ issues	100%	100%
Avg time	4,899ms	3,964ms
Min time	4,368ms	3,589ms
Max time	6,061ms	4,402ms
Time variance	1,693ms range	813ms range

Both tools are 100% reliable over 20 runs — zero failures, zero hallucinations. OpenCode is 19% faster with tighter variance (more predictable for CI/CD).

Failure Recovery (Fix Broken TypeScript)

Created a TypeScript file with 3 deliberate type errors. Asked each tool to find and fix them.

Metric	Claude Code	Gemini CLI	Vibe	OpenCode
Time	33.9s	21.4s	8.8s	TIMEOUT
Bugs fixed	3/3	3/3	3/3	0/3
File modified	Yes	Yes	Yes	No
Approach	Minimal diff	Creative (throw Error, better IDs)	Clean fix	Failed

Claude Code produced the cleanest fix (minimal changes). Gemini CLI was most creative — used .toString(36).substring(2) for better ID generation and threw Error for missing users. Vibe was fastest at 8.8s with all 3 bugs correctly fixed.

Stability & Maintenance

Tool	Current	Releases/Month	Breaking Changes (2026)	Automation Risk
Claude Code	2.1.98	~8-12	Low	Low
Vibe	2.7.6	~2-4	Low	Low
Gemini CLI	0.38.2	~4-6	Medium	Medium
OpenCode	1.14.20	60+ (multi/day)	High (`-p` deprecated)	High

For automated pipelines: pin OpenCode versions (breaks frequently). Claude Code and Vibe have the most stable headless APIs.

Self-Hosted Comparison

All three open-source tools can point at the same vLLM backend:

Minimum vLLM Launch Command

vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 2 \
  --quantization fp8

Configuration Side-by-Side

Vibe (~/.vibe/config.toml):

[[providers]]
name = "local"
api_base = "http://localhost:8000/v1"
api_key_env_var = "VLLM_KEY"
api_style = "openai"
backend = "generic"

[[models]]
name = "devstral-123b"
provider = "local"
alias = "devstral"
temperature = 0.2

OpenCode (opencode.json):

{
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "options": { "baseURL": "http://localhost:8000/v1" },
      "models": { "devstral-123b": { "name": "Devstral 123B" } }
    }
  },
  "model": "local/devstral-123b"
}

OpenHands (Settings > LLM > Advanced):

Custom Model: openai/devstral
Base URL: http://host.docker.internal:8000/v1

Cost: Self-Hosted vs Cloud

Setup	Monthly Cost	Tokens/Month
Claude Code Pro	$20	Shared with web (limited)
Claude Code Max	$100-200	Higher limits (unpublished)
Claude Code API	Pay-per-token	$3-15/M tokens (model dependent)
Self-hosted Devstral (Vibe/OpenCode/OpenHands)	$0 (electricity only)	Unlimited
Self-hosted (cloud GPU)	$2-4/hr (H100)	Unlimited while running

Reproduce These Benchmarks

All tests were run on 2026-04-21. Scripts and raw results are published so you can replicate or extend them.

Test Environment

Component	Version	Details
Hardware	Dell XE9780 (Obelisk)	8x NVIDIA B200, 2TB RAM
GPU allocation	GPUs 6-7	Devstral 123B (FP8, tensor-parallel=2)
vLLM	v0.19.0	`--tool-call-parser mistral --enable-auto-tool-choice`
LiteLLM	1.82.6	Gateway on port 4000, `drop_params: true`
Claude Code	2.1.98	Anthropic cloud (Opus 4.6)
Gemini CLI	0.38.2	Google cloud (Gemini 2.5 Pro)
Mistral Vibe	2.7.6	Self-hosted Devstral via LiteLLM
OpenCode	1.14.20	Self-hosted Devstral via LiteLLM
macOS	Darwin 25.3.0	Apple Silicon (M-series)

Exact Commands Used

Test 1: Code Generation (Single Run, 4-Way)

PROMPT="Write a Python function called merge_sorted that merges two sorted lists into one sorted list. Include type hints. Just the function, no explanation."

# Claude Code
time claude -p "$PROMPT" --output-format text

# Gemini CLI
time gemini -p "$PROMPT" --approval-mode yolo -o text

# Mistral Vibe (self-hosted Devstral)
time vibe -p "$PROMPT" --workdir /tmp --max-turns 5 --output text

# OpenCode (self-hosted Devstral)
time opencode run "$PROMPT" --model litellm/devstral-123b --dangerously-skip-permissions

Test 2: Code Review (Single Run, 4-Way)

PROMPT="Read vibe/install.sh and list the top 3 bugs or improvements with line numbers. Be brief."
WORKDIR="/path/to/your/project"

# Claude Code
time claude -p "$PROMPT" --output-format text

# Gemini CLI
time gemini -p "$PROMPT" --approval-mode yolo -o text

# Mistral Vibe
time vibe -p "$PROMPT" --workdir "$WORKDIR" --max-turns 10 --output text \
  --enabled-tools "read_file" --enabled-tools "grep" --enabled-tools "bash"

# OpenCode (must cd to project first — no --workdir flag)
cd "$WORKDIR" && time opencode run "$PROMPT" \
  --model litellm/devstral-123b --dangerously-skip-permissions

Test 3: Reasoning (No Tools, 4-Way)

PROMPT="Explain the difference between a mutex and a semaphore in exactly 3 bullet points. No code."

# Same 4 commands as Test 1 (just change the prompt)

Test 4: Error Handling (4-Way)

PROMPT="Read the file DOES_NOT_EXIST.py and summarize it."

# Same 4 commands as Test 2 (just change the prompt)

Test 5: Cold Start

time claude --version
time gemini --version
time vibe --version
time opencode --version

Reliability Test (20 Runs, Vibe + OpenCode Only)

PROMPT="Read vibe/install.sh and list the top 3 bugs or improvements with line numbers. Be brief."
WORKDIR="/path/to/your/project"

# Run 20 times each, capture output to files
for i in $(seq 1 20); do
  # Vibe
  vibe -p "$PROMPT" --workdir "$WORKDIR" --max-turns 10 --output text \
    --enabled-tools "read_file" --enabled-tools "grep" --enabled-tools "bash" \
    > "results/vibe/run-$i.txt" 2>/dev/null

  # OpenCode
  (cd "$WORKDIR" && opencode run "$PROMPT" \
    --model litellm/devstral-123b --dangerously-skip-permissions \
    2>/dev/null) | cat > "results/opencode/run-$i.txt"
done

# Analyze: count runs that found the file and produced 3+ findings
for f in results/vibe/run-*.txt; do
  grep -qiE "install.sh|line [0-9]" "$f" && echo "PASS" || echo "FAIL"
done

Recovery Test (TypeScript Bug Fix, 4-Way)

Create a broken TypeScript file with these 3 deliberate errors:

// src/auth.ts — 3 bugs to fix
interface User {
  id: string;
  name: string;
  email: string;
  role: "admin" | "user";
}

// Bug 1: find() returns User | undefined, but return type says User
function getUser(users: User[], id: string): User {
  const user = users.find(u => u.id === id);
  return user;
}

// Bug 2: 'isAdmin' doesn't exist on User type
function isAdmin(user: User): boolean {
  return user.isAdmin === true;
}

// Bug 3: Math.random() returns number, not string
function createUser(name: string, email: string): User {
  return {
    id: Math.random(),
    name,
    email,
    role: "user",
  };
}

export { getUser, isAdmin, createUser };

Then run each tool with the same prompt:

PROMPT="The file src/auth.ts has TypeScript errors. Read it, identify ALL type errors, and fix them. Write the corrected file."

# Claude Code (use --add-dir to give access to the temp project)
claude -p "$PROMPT" --output-format text --add-dir /path/to/test-project

# Gemini CLI
cd /path/to/test-project && gemini -p "$PROMPT" --approval-mode yolo -o text

# Vibe
vibe -p "$PROMPT" --workdir /path/to/test-project --max-turns 15 --output text

# OpenCode
cd /path/to/test-project && opencode run "$PROMPT" \
  --model litellm/devstral-123b --dangerously-skip-permissions

Score each tool on: time to complete, number of bugs found (out of 3), whether it modified the file, quality of the fix.

Scoring Criteria

Category	How We Scored
Time	Wall clock from `date +%s%N` before and after, includes startup + API + response
Quality	Did it produce correct, runnable code? Did it find the right bugs?
Reliability	Over 20 runs: did it read the file, find 3+ issues, no hallucinated findings?
Recovery	Out of 3 planted bugs: how many correctly identified and fixed?
File modification	Did the tool actually write the fix to disk, or just show text?

Adapting for Your Environment

Different model: Change litellm/devstral-123b to your model name. Adjust Vibe’s config.toml and OpenCode’s opencode.json accordingly.
Different backend: Replace http://localhost:4000/v1 with your vLLM/LiteLLM/Ollama endpoint.
No self-hosted GPU: Skip Vibe and OpenCode tests, or point them at a cloud provider (OpenRouter, Together AI, etc.).
Different repo: Replace vibe/install.sh with any file in your project. The reliability test works with any “read file and review it” prompt.

Raw Results

Full test scripts and raw output files are available at:

git.irregularchat.com/irregulars/ai-coding-env — vibe/benchmarks/

vibe/benchmarks/
├── run-scale-test.sh          # Real PR review (26 files, 5400+ lines)
├── run-reliability-test.sh    # 20 runs per tool
├── run-cost-test.sh           # Token measurement (Claude solo vs orchestrator)
├── run-recovery-test.sh       # TypeScript bug fix (4-way)
├── RESULTS.md                 # Compiled analysis
└── results/
    ├── reliability/
    │   ├── vibe/run-{1..20}.txt
    │   └── opencode/run-{1..20}.txt
    └── recovery/
        ├── vibe-fixed.ts
        ├── claude-fixed.ts
        ├── gemini-fixed.ts
        └── *-recovery.txt

Tool Guides

Claude Code - Full guide: plugins, skills, team spawners, CLAUDE.md
Claude Code Self-Hosted - Use Claude Code with Devstral, Ollama, vLLM (free)
Claude Code Funding - Pricing, military procurement, billing
Mistral Vibe - Open-source CLI with self-hosted Devstral + orchestration
OpenCode - Open-source TUI with LSP integration + client/server
OpenHands - Web-based autonomous agent with Docker sandbox
Gemini Code - Google’s Gemini CLI with 2M+ context

Reference

AI Agent Pricing - Plan pricing, token limits, cost comparison
Project Rules & Lessons Learned - CLAUDE.md / AGENTS.md patterns

Community

Vibe Coding Repository - Rules, skills, configs, orchestrator skill, backend switcher

CLI Coding Agent Comparison

CLI Coding Agent Comparison

Tool Guides

Supporting Pages

Community Resources

Feature Matrix

Benchmark Results

Test 1: Code Generation

Test 2: Code Review

Test 3: Reasoning (No Tools)

Test 4: Error Handling

Performance Summary (5-Way)

Self-Hosted Model Comparison (4-Model Benchmark)

Models Tested

Speed Results (ms, lower is better)

Quality Results

Detailed Findings

Recommendation

GPU Layout (Recommended)

When to Use What

Decision Tree

By Task Type

The Orchestration Stack

Deep Evaluation Results

Reliability (20 Runs, Same Task)

Failure Recovery (Fix Broken TypeScript)

Stability & Maintenance

Self-Hosted Comparison

Minimum vLLM Launch Command

Configuration Side-by-Side

Cost: Self-Hosted vs Cloud

Reproduce These Benchmarks

Test Environment

Exact Commands Used

Test 1: Code Generation (Single Run, 4-Way)

Test 2: Code Review (Single Run, 4-Way)

Test 3: Reasoning (No Tools, 4-Way)

Test 4: Error Handling (4-Way)

Test 5: Cold Start

Reliability Test (20 Runs, Vibe + OpenCode Only)

Recovery Test (TypeScript Bug Fix, 4-Way)

Scoring Criteria

Adapting for Your Environment

Raw Results

Related Resources

Tool Guides

Reference

Community