Training the IrregularChat Model
A complete walkthrough of how we built a domain-specific community assistant by fine-tuning an open-source LLM on IrregularChat data, from data collection to model serving.
Overview
We fine-tuned Google’s Gemma-4-31B-Instruct using LoRA (Low-Rank Adaptation) on 4,178 instruction-tuning examples extracted from the IrregularChat community’s wiki, Q&A, news summaries, PDF library, breakout room summaries, TLDR summaries, and TIL entries.
| Detail | Value |
|---|---|
| Base model | gemma-4-31b-it (31B params, multimodal) |
| Method | LoRA fine-tune (267M trainable / 31.5B total = 0.85%) |
| Training data | 4,178 instruction pairs (10 MB JSONL) |
| Hardware | 2x NVIDIA B200 (192 GB HBM3e each, uncapped power) |
| Training time | 1 hour 34 minutes (747 steps) |
| Final eval loss | 2.454 |
| Tool | Unsloth for loading, HuggingFace PEFT + Trainer for DDP |
Data Sources
What We Collected
Training data was drawn from community resources including public wikis, curated Q&A, AI-generated summaries, PDF libraries, and internal documentation. Data sources:
| Source | Records | Raw Size | Content |
|---|---|---|---|
| Tagged community content | Questions, answers, TILs, events, notes | — | Structured entries tagged by type via Signal bot |
| Shared links | Publicly available URLs shared in channels | — | Links to articles, tools, and resources |
| News link summaries | 2,556 | 2.9 MB | AI-generated article summaries |
| Wiki (Irregularpedia) | 399 pages | 1.9 MB | Curated knowledge base |
| PDF library | 145 unique PDFs | 31.3 MB extracted | Military manuals, drone reports, OSINT guides, cybersecurity docs |
| Outline docs | 269 docs | 1.5 MB | Internal team documentation (member-accessible wiki) |
| Q&A | 174 questions, 63 answers | 82 KB | Community Q&A with voting |
| Breakout room summaries | — | — | AI-generated summaries derived from group discussion sessions |
| TLDR summaries | — | — | AI-generated summaries of shared content |
| TIL entries | 34 | 25 KB | “Today I Learned” snippets |
| Archived wiki | 1,610 pages | 10.6 MB | Historical wiki content |
Data Pipeline
Step 1: Export from Signal Bot Database
The Signal bot stores community data across several PostgreSQL tables. Each structured content type has its own table:

    # Run on the server hosting signal-bot-postgres
    # Export structured Q&A
    docker exec signal-bot-postgres psql -U signal_bot -d signal_bot \
      -c "COPY (
        SELECT question_text, category,
               to_timestamp(created_at/1000)::text as ts
        FROM q_and_a_questions
        WHERE question_text IS NOT NULL
      ) TO STDOUT WITH (FORMAT csv, HEADER true)" > qa_questions.csv

    # Export news link summaries
    docker exec signal-bot-postgres psql -U signal_bot -d signal_bot \
      -c "COPY (
        SELECT title, summary, url
        FROM news_links
        WHERE summary IS NOT NULL
      ) TO STDOUT WITH (FORMAT csv, HEADER true)" > news_links.csv

Same pattern for TIL entries, breakout room summaries, and other structured sources.
Key tables:
- `q_and_a_questions` / `q_and_a_answers`: structured Q&A
- `news_links`: shared articles with `title`, `summary`
- `today_i_learned`: `original_messages` + `ai_summary`
- `breakout_rooms`: `executive_summary` + `detailed_summary` (AI-generated from session messages)
Step 2: Export from Outline
Outline uses camelCase column names in PostgreSQL, so they must be double-quoted in queries (unquoted identifiers are folded to lowercase):

    docker exec norequirement_postgres psql -h localhost -U outline_user -d outline_db \
      -c "COPY (
        SELECT title, text, \"urlId\"
        FROM documents
        WHERE \"deletedAt\" IS NULL AND length(text) > 50
      ) TO STDOUT WITH (FORMAT csv, HEADER true)" > outline.csv

Step 3: Extract Text from PDFs
The community file share (/datadrive/IrregularChat/) contains 163 PDFs across 16 topic categories. Many are duplicated across categories.

    pip install pymupdf

    python3 extract-pdfs.py \
      --input /path/to/pdfs \
      --output extracted/pdfs.jsonl \
      --min-chars 200  # Skip scanned/image PDFs

The extraction script:
- Recursively finds all PDFs
- Extracts text with `pymupdf` (`fitz`)
- Deduplicates by content hash (SHA-256)
- Outputs JSONL with category, filename, and text
Result: 145 unique PDFs extracted (40 too short/scanned, 173 duplicates removed), yielding 31.3 MB of text.
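The length filter and hash-based deduplication can be sketched roughly as follows. This is an illustration, not the actual `extract-pdfs.py`: the `dedupe_documents` helper and its record layout are hypothetical, the hash is taken over the extracted text (an assumption; the real script may hash file bytes), and the pymupdf extraction step is assumed to have already run.

```python
import hashlib
import json


def dedupe_documents(docs, min_chars=200):
    """Keep one record per unique text, skipping short (likely scanned) documents.

    `docs` is an iterable of dicts with "category", "filename", and "text" keys
    (a hypothetical layout mirroring the JSONL fields described above).
    Returns (kept_records, stats).
    """
    seen = set()
    kept = []
    stats = {"kept": 0, "too_short": 0, "duplicates": 0}
    for doc in docs:
        text = doc["text"].strip()
        if len(text) < min_chars:
            stats["too_short"] += 1
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            stats["duplicates"] += 1
            continue
        seen.add(digest)
        kept.append({**doc, "text": text, "sha256": digest})
        stats["kept"] += 1
    return kept, stats


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Keeping the counters alongside the records makes it easy to report figures like the kept/short/duplicate breakdown above.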
Step 4: Format as Instruction-Tuning Data
Convert all sources into chat-format JSONL:

    {
      "messages": [
        {"role": "system", "content": "You are IrregularChat Assistant..."},
        {"role": "user", "content": "What is C-UAS?"},
        {"role": "assistant", "content": "Counter-Unmanned Aircraft Systems (C-UAS)..."}
      ]
    }

Formatting strategies by source:
| Source | User Prompt | Assistant Response |
|---|---|---|
| Wiki pages | “What is {title}?” / “Explain {title}” | Page content (max 4000 chars) |
| Q&A | Actual question text | Best/longest answer |
| News | “Summarize this: {title}” | AI summary |
| PDFs | “What does ‘{document}’ say?” | Document text (chunked at 3500 chars) |
| TIL | “What did the community learn?” | AI summary or original entry |
| Breakout summaries | “What was discussed in {session}?” | AI-generated session summary |
| TLDR summaries | “Summarize {content}” | AI-generated content summary |
| Tagged notes | “What did the community share about {topic}?” | Tagged content entries |
PDFs get chunked into multiple training examples — a 50-page report becomes 10+ instruction pairs.
Final dataset: 3,970 train + 208 validation = 4,178 records (10 MB JSONL).
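As a rough sketch of the PDF row in the table above: one document is split into fixed-width chunks and each chunk becomes its own chat record. The helper names are hypothetical, the system prompt is left truncated exactly as in the source, and the naive character split is an assumption (the real formatter may split on section boundaries).

```python
import json

SYSTEM_PROMPT = "You are IrregularChat Assistant..."  # truncated in the source


def chunk_text(text, max_chars=3500):
    """Split text into consecutive chunks of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def pdf_to_examples(document, text, max_chars=3500):
    """Turn one extracted PDF into one chat-format record per chunk."""
    examples = []
    for chunk in chunk_text(text, max_chars):
        examples.append({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"What does '{document}' say?"},
                {"role": "assistant", "content": chunk},
            ]
        })
    return examples


def write_jsonl_lines(examples):
    """One JSON object per line, as expected for JSONL training files."""
    return [json.dumps(e, ensure_ascii=False) for e in examples]
```

A document with 35,000+ extracted characters yields ten or more records under this scheme, which matches the "10+ instruction pairs" figure for a long report.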
Model Selection
Why Gemma-4-31B
We evaluated models available on the server:
| Model | Params | Type | Issue |
|---|---|---|---|
| Qwen2.5-VL-72B | 72B | Vision+Language | OOM even in 4-bit with Unsloth; bitsandbytes incompatible with B200 (Blackwell) |
| Devstral-2-123B | 123B | Code | Requires transformers 5.0+ (too new for most tools) |
| gpt-oss-120b | 120B | MoE | Requires transformers 4.55+ (bleeding edge) |
| gemma-4-31b-it | 31B | Multimodal | Fits on 1-2 GPUs, well-supported, good general quality |
Gemma-4 Quirks
- Architecture: `Gemma4ForConditionalGeneration` (multimodal wrapper around text decoder)
- Requires `mm_token_type_ids` field in training data (all zeros for text-only)
- Custom `Gemma4ClippableLinear` layers break some PEFT/heretic versions
- `AutoModelForCausalLM` won’t load it in older transformers; need 5.5.0+
- Unsloth handles it via `FastLanguageModel` but ignores `CUDA_VISIBLE_DEVICES`
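The `mm_token_type_ids` requirement can be met at tokenization time. The sketch below assumes each tokenized example is a plain dict with an `input_ids` list; the helper name is hypothetical.

```python
def add_mm_token_type_ids(example):
    """Return a copy of a tokenized example with an all-zero mm_token_type_ids field.

    Zero marks every position as a text token, which is what text-only
    training needs according to the quirk described above.
    """
    out = dict(example)
    out["mm_token_type_ids"] = [0] * len(example["input_ids"])
    return out
```

A function of this shape can be passed to a per-example mapping step (for instance `Dataset.map` in the `datasets` library) before batching.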
Training
Tool Comparison
| Tool | Speed | Multi-GPU | VRAM | Issue We Hit |
|---|---|---|---|---|
| Unsloth | 3x faster | Yes (since Dec 2025) | 70% less | Ignores CUDA_VISIBLE_DEVICES, can’t colocate with other processes |
| Vanilla PEFT | Baseline | Yes (DDP) | Baseline | Works but 2-4x slower than Unsloth |
| PEFT + DDP | 2x (per GPU added) | Yes | Full copy per GPU | Needs find_unused_parameters=True for multimodal models |
What Worked: 2-GPU DDP with Vanilla PEFT
After multiple OOM errors and device mapping issues with Unsloth, the winning config was vanilla HuggingFace PEFT with torchrun DDP on 2 GPUs:

    import os

    import torch
    from peft import LoraConfig, TaskType
    from transformers import AutoModelForCausalLM, TrainingArguments

    # Key settings (excerpt: MODEL_PATH and local_rank come from the rest of the script)
    os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch.bfloat16,
        device_map={"": local_rank},  # Each GPU gets full copy
        attn_implementation="sdpa",
    )

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=32,
        lora_alpha=64,  # alpha/r = 2.0 (standard heuristic)
        lora_dropout=0,  # Unsloth default; consider 0.05 for small datasets
        target_modules="all-linear",
        bias="none",
    )

    # Must enable for multimodal models (vision encoder params unused in text training)
    training_args = TrainingArguments(
        ddp_find_unused_parameters=True,
        gradient_checkpointing=True,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch = 2 * 2 GPUs * 4 accum = 16
    )

Custom data collator required for Gemma-4:

    from dataclasses import dataclass
    from typing import Any

    import torch


    @dataclass
    class GemmaCollator:
        tokenizer: Any
        pad_to_multiple_of: int = 8

        def __call__(self, features):
            # Round the longest sequence up to a multiple of pad_to_multiple_of
            max_len = max(len(f["input_ids"]) for f in features)
            max_len = ((max_len + self.pad_to_multiple_of - 1) // self.pad_to_multiple_of) * self.pad_to_multiple_of

            pad_id = self.tokenizer.pad_token_id or 0
            batch = {k: [] for k in ["input_ids", "attention_mask", "labels", "mm_token_type_ids"]}

            for f in features:
                pad_len = max_len - len(f["input_ids"])
                batch["input_ids"].append(f["input_ids"] + [pad_id] * pad_len)
                batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad_len)
                batch["labels"].append(f["labels"] + [-100] * pad_len)  # -100 = ignored by the loss
                batch["mm_token_type_ids"].append(f["mm_token_type_ids"] + [0] * pad_len)

            return {k: torch.tensor(v) for k, v in batch.items()}

Launch Command

    CUDA_VISIBLE_DEVICES=5,6 \
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    torchrun --nproc_per_node=2 train.py 2>&1 | tee training.log

Training Results
| Metric | Value |
|---|---|
| Steps | 747 |
| Epochs | 3 |
| Final train loss | 10.86 (avg) |
| Final eval loss | 2.454 |
| Last step loss | 8.553 |
| Training time | 1h 34m |
| Avg step time | 7.53 s/step |
| Trainable params | 266,963,456 (0.85%) |
| GPU memory | ~149 GB per GPU (of 192 GB available) |
| GPU utilization | 98-100% |
| Power draw | 535-679W per GPU (uncapped) |
Loss progression:
- Step 5: 77.0 (initial, high)
- Step 20: 19.5 (learning rate warmup)
- Step 50: ~12 (converging)
- Step 300: ~9.5 (stable)
- Step 745: 8.55 (final)
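The step count and wall-clock time in the table follow from the batch settings. A small check, assuming the final partial batch of each epoch counts as one optimizer step (the function names are illustrative):

```python
import math


def total_optimizer_steps(examples, per_device_batch, num_gpus, grad_accum, epochs):
    """Optimizer steps for a run, counting each epoch's final partial batch as a step."""
    effective_batch = per_device_batch * num_gpus * grad_accum
    return math.ceil(examples / effective_batch) * epochs


def format_duration(seconds):
    """Render seconds as 'XhYYm', rounded to the nearest minute."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}m"


steps = total_optimizer_steps(3970, per_device_batch=2, num_gpus=2, grad_accum=4, epochs=3)
print(steps, format_duration(steps * 7.53))
```

With 3,970 examples and an effective batch of 16, this gives 249 steps per epoch and 747 steps over 3 epochs; at 7.53 s/step that is about 1h 34m, matching the reported numbers.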
Post-Training: Merge LoRA
Merge the adapter back into the base model for serving:

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model on CPU so merging does not need GPU memory
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-4-31b-it",
        torch_dtype=torch.bfloat16,
        device_map="cpu",
    )
    model = PeftModel.from_pretrained(model, "/path/to/lora-adapter")
    model = model.merge_and_unload()  # Fold the LoRA deltas into the base weights
    model.save_pretrained("/path/to/merged")

    # Save the tokenizer alongside the weights so the merged directory is self-contained
    AutoTokenizer.from_pretrained("google/gemma-4-31b-it").save_pretrained("/path/to/merged")

Abliteration (Refusal Removal)
What Is It
Abliteration removes the “refusal direction” from a model’s activation space: the internal vector that causes it to say “I can’t help with that.” For a community focused on cybersecurity, drones, OSINT, and military tech, stock instruct models refuse too many legitimate domain questions.
What We Tried
| Method | Result |
|---|---|
| Manual mlabonne method (20 prompts, all layers) | Model output became gibberish (l l-'-'- l'--) |
| Heretic (automated Bayesian) | Can’t load Gemma-4 (PEFT version mismatch) |
| TrevorJS/gemma-4-abliteration | Can’t load Gemma-4 (same PEFT issue) |
Why Manual Abliteration Failed
- Only 20 contrastive prompts: 256-800 are needed for a reliable direction estimate
- Applied to ALL 60 layers indiscriminately, with no per-layer calibration of refusal direction strength
- No Winsorization: large LLMs (including Gemma) produce high-magnitude activation outliers (Sun et al. 2024) that corrupt mean calculations. This is the documented cause of gibberish output on Gemma models
- Single global direction: Gemma benefits from per-layer refusal direction estimation
- No norm preservation: raw projection removal distorts weight row norms
Correct Approach (For Future Reference)
Based on TrevorJS’s work (3.2% refusal rate, 0.124 KL divergence on gemma-4-31b) and grimjim’s norm-preserving biprojected abliteration:
- 256-800 contrastive prompts (use `mlabonne/harmful_behaviors` + `mlabonne/harmless_alpaca`)
- Winsorize activations at 99.5th percentile before computing means
- Per-layer refusal directions (compute per-layer, not one global direction)
- Orthogonalize against harmless mean (biprojection)
- Norm-preserving weight modification via Magnitude-Preserving Orthogonal Ablation (MPOA): decompose row norms, ablate direction only, recompose
- Apply to `o_proj` and `mlp.down_proj` across layers (TrevorJS applied to all 60 layers; smaller models may benefit from targeting middle-to-late layers only)
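To make three of the steps above concrete (winsorizing, a per-layer difference-of-means direction, and norm-preserving removal), here is a simplified pure-Python sketch on plain lists. It is not TrevorJS’s or grimjim’s implementation: the function names are hypothetical, clipping is applied within each activation vector (an assumption about the axis), the biprojection step is omitted, and weights are treated one vector at a time.

```python
import math


def winsorize(values, pct=99.5):
    """Clip values to the magnitude found at the given percentile of |values|."""
    mags = sorted(abs(v) for v in values)
    rank = max(0, math.ceil(pct / 100 * len(mags)) - 1)  # nearest-rank percentile
    limit = mags[rank]
    return [max(-limit, min(limit, v)) for v in values]


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts, pct=99.5):
    """Unit difference-of-means direction for ONE layer, after winsorizing."""
    harmful = mean_vector([winsorize(a, pct) for a in harmful_acts])
    harmless = mean_vector([winsorize(a, pct) for a in harmless_acts])
    diff = [h - b for h, b in zip(harmful, harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate_preserving_norm(weight_vec, direction):
    """Remove the component along `direction`, then restore the original norm."""
    original_norm = math.sqrt(sum(w * w for w in weight_vec))
    dot = sum(w * d for w, d in zip(weight_vec, direction))
    ablated = [w - dot * d for w, d in zip(weight_vec, direction)]
    new_norm = math.sqrt(sum(a * a for a in ablated))
    if new_norm == 0.0:
        return ablated  # vector was parallel to the direction; nothing to rescale
    return [a * original_norm / new_norm for a in ablated]
```

Calling `refusal_direction` once per layer, rather than once globally, is the per-layer estimation the list describes; a real implementation would run on tensors of captured activations.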
Alternative: Pre-Abliterated Base
TrevorJS/gemma-4-31B-it-uncensored is a pre-abliterated version of the same base model. Our LoRA adapter can be applied on top of it instead of the stock model.
Serving

    # From venv (Docker image may be too old for Gemma-4)
    pip install vllm

    CUDA_VISIBLE_DEVICES=5,6 python3 -m vllm.entrypoints.openai.api_server \
      --model /path/to/merged \
      --host 127.0.0.1 \
      --port 8002 \
      --tensor-parallel-size 2 \
      --dtype bfloat16 \
      --trust-remote-code \
      --gpu-memory-utilization 0.80

Gotcha: Cloudflare WARP hijacks all routing tables. Traffic sourced from a public IP gets routed through the WARP tunnel instead of the physical interface. Disconnect WARP or configure split tunneling before testing.
MCP Server for RAG
We built an MCP server (apps/search-mcp/) that wraps the search service at search.irregulars.io, providing Claude Code and Claude Desktop with community knowledge retrieval:

    {
      "mcpServers": {
        "irregularchat-search": {
          "command": "node",
          "args": ["/path/to/apps/search-mcp/dist/index.js"],
          "env": {
            "IRREGULARCHAT_SEARCH_TOKEN": "your-token"
          }
        }
      }
    }

Lessons Learned
What Worked
- LoRA fine-tuning on community data is highly effective for domain adaptation: the model learns the community’s voice, topics, and terminology
- PDF extraction was the biggest data source by volume: military manuals and technical reports provide dense, high-quality training signal
- Deduplication by content hash eliminated 173 duplicate PDFs filed across multiple topic categories
- DDP on 2 GPUs gave 2.7x speedup over single GPU with minimal code changes
- Packing reduces total steps dramatically (747 → 96) but increases VRAM usage per step
What Didn’t Work
- Unsloth ignores `CUDA_VISIBLE_DEVICES`, so it can’t colocate with other GPU processes on a shared server
- bitsandbytes 4-bit quantization crashed on B200 (Blackwell) GPUs with CUDA illegal memory access (as of early 2025; later releases added sm_100 support)
- USB-C ethernet adapters use the `cdc_ncm` driver, which shows link UP but doesn’t pass traffic
- Assigning the same IP to multiple interfaces causes ARP confusion: the router sends packets to random MACs
- Cloudflare WARP hijacks all routing via policy table 65743, breaking direct ISP connections
- Naive abliteration (few prompts, all layers, no Winsorization) destroys model output on Gemma
Key Numbers
| Metric | Value |
|---|---|
| Total raw data collected | ~730 MB files + ~33 MB structured text |
| Usable extracted text | ~80 MB after dedup and filtering |
| Training examples | 4,178 |
| Training cost | $0 (own hardware) |
| Training time | 1h 34m on 2x B200 |
| LoRA adapter size | 1 GB |
| Merged model size | 59 GB |
| Base model size | 59 GB |
File Locations on Obelisk

    /workspace/irregularchat-corpus/
      training/
        train.jsonl       # 3,970 training examples
        val.jsonl         # 208 validation examples
      db/                 # Signal bot DB exports
      outline/            # Outline wiki exports
      extracted/
        pdfs.jsonl        # Extracted PDF text
      wiki/               # Irregularpedia markdown
      wiki-archived/      # Old wiki content

    /workspace/irregularchat-model/
      lora-adapter/       # LoRA weights (1 GB)
      merged/             # Full merged model (59 GB)
      training.log        # Training output
      checkpoint-400/     # Mid-training checkpoint
      checkpoint-747/     # Final checkpoint

v2 Training Attempt (May 2026)
When testing the v1 model in production, members noticed that domain questions weren’t being answered any better than by the stock base model. Investigation revealed the problem wasn’t training time or hardware: v1 didn’t actually train on the loss it claimed to.
Why v1 didn’t move the needle
A close read of the v1 training script and corpus surfaced five compounding issues:
- Prompt tokens were never masked. `labels = input_ids.copy()` meant the loss was computed over the system prompt and user question, not just the assistant response. The model spent most of its gradient signal memorizing the (always-identical) system prompt instead of learning content.
- Response template wouldn’t have matched anyway. Even with a completion-only collator, the literature pointed at `<start_of_turn>model\n` while Gemma-4’s real chat template emits `<|turn>model\n`. The template would never have been found, so the mask would not have isolated the response either way.
- Corpus shape was wrong for knowledge injection. Of 3,970 training examples, 60% were `Summarize this: ARTICLE_TITLE → summary` and 29% were `What does doc 'X' say? (section N) → text chunk`. Only 2.7% were natural “What is X?” pairs. Real users never phrase questions like the training data, so the adapter only fires for exact-match templates.
- System prompt was byte-identical across all 3,970 examples. With the masking bug, the model effectively required that exact 350-character preamble to “enter IrregularChat mode.”
- LoRA rank was sized for behavior, not facts. `r=32` on a 31B model (0.85% trainable) is enough to shift tone but not to inject ~80 MB of community knowledge. Published results suggest r≥64 + more epochs OR continued pretraining for facts.
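The masking issue is easiest to see on token IDs. The sketch below shows what completion-only masking does for a single-turn example, plus the `mask_frac` measurement used as a sanity check; the helper names are hypothetical and this is a simplified stand-in for what a completion-only collator does.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch cross-entropy


def find_last(haystack, needle):
    """Index of the last occurrence of list `needle` in list `haystack`, or -1."""
    for start in range(len(haystack) - len(needle), -1, -1):
        if haystack[start:start + len(needle)] == needle:
            return start
    return -1


def mask_prompt(input_ids, response_template_ids):
    """Labels that carry loss only on tokens after the response template."""
    start = find_last(input_ids, response_template_ids)
    if start < 0:
        # Template missing: every token is masked and nothing trains.
        return [IGNORE_INDEX] * len(input_ids)
    cut = start + len(response_template_ids)
    return [IGNORE_INDEX] * cut + list(input_ids[cut:])


def mask_frac(labels):
    """Fraction of tokens excluded from the loss."""
    return sum(1 for t in labels if t == IGNORE_INDEX) / len(labels)
```

Under this definition, `labels = input_ids.copy()` gives a `mask_frac` of 0.0 (loss on everything, the v1 bug), and a template that never matches gives 1.0 (loss on nothing).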
What v2 changes
| Area | v1 | v2 |
|---|---|---|
| Loss masking | Full sequence (bug) | DataCollatorForCompletionOnlyLM with dynamically detected response template |
| Mask sanity check | None | Aborts if mask_frac < 0.30 or > 0.95 before training starts |
| Response template | Hardcoded (wrong) | Detected at startup by diffing apply_chat_template outputs |
| Corpus shape | 60% “Summarize this:” / 29% “What does doc X say:” | Rule-based paraphrasing → 3 natural questions per source |
| System prompts | 1 identical preamble × 3,970 | 8 rotating templates |
| Training examples | 3,970 train / 208 val | 10,435 train / 375 val (after paraphrasing + dedup + short-fragment drop) |
| LoRA rank | r=32, α=64 | r=128, α=128 |
| Sharding | DDP (full copy per GPU) | FSDP (sharded across GPUs) |
| Packing | None | packing_strategy="wrapped" |
| Max sequence | 1024 tokens | 2048 tokens |
| GPU budget | 2 B200 uncapped (~600W each) | 6 B200 at 800W cap, alongside Vibe serving |
Six bugs encountered while wiring up v2
Every one of these would have silently mis-trained (or mis-measured) in v1. They surfaced as crashes in v2 because of the sanity checks, not because v2 introduced them:
1. `trl 0.19` forces `padding_free` when `packing=True` with default `ffd` strategy → can’t pass a custom collator. Fix: `packing_strategy="wrapped"`.
2. PEFT’s FSDP auto-wrap reads the `FSDP_TRANSFORMER_CLS_TO_WRAP` env var, not HF Trainer’s `fsdp_config` dict. The two configs don’t cross over. Fix: set the env var in the launcher per profile.
3. Gemma-4’s decoder layer is named `Gemma4TextDecoderLayer` in `transformers 5.5+`, not `Gemma4DecoderLayer` as in older versions or older Gemma generations.
4. Gemma-4 IT’s `apply_chat_template(..., add_generation_prompt=True)` injects a reasoning-channel prefix (`<|turn>model\n<|channel>thought\n<channel|>`) that never appears in training-text rendering. Detecting the response template from this gives a template that matches nothing. Fix: probe with sentinel user+assistant content and slice between them, anchored on the last `<...>` marker.
5. `Gemma4ForConditionalGeneration` requires `mm_token_type_ids` in every batch even for text-only training. trl’s data collator doesn’t emit this field. Fix: thin wrapper around the collator that injects zeros.
6. Sanity check at `mask_frac < 0.30` is one-sided: `mask_frac == 1.0` means “response template not found in any example”, which is equally broken. Fix: assert `0.30 < mask_frac < 0.95`.
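The sentinel probe from the fourth fix might look like the following sketch. `detect_response_template` takes any function that renders messages to training-time text (with a real tokenizer that would be something like `lambda m: tokenizer.apply_chat_template(m, tokenize=False)`). The `fake_gemma4_render` stand-in, including its `<turn|>` end marker, is invented purely for illustration and is not the real Gemma-4 template.

```python
import re

USER_SENTINEL = "@@USER_SENTINEL@@"
ASSISTANT_SENTINEL = "@@ASSISTANT_SENTINEL@@"


def detect_response_template(render):
    """Find the text a chat template emits right before the assistant's reply."""
    text = render([
        {"role": "user", "content": USER_SENTINEL},
        {"role": "assistant", "content": ASSISTANT_SENTINEL},
    ])
    user_end = text.index(USER_SENTINEL) + len(USER_SENTINEL)
    assistant_start = text.index(ASSISTANT_SENTINEL)
    between = text[user_end:assistant_start]
    # Anchor on the last <...> marker so the end-of-user-turn marker is excluded.
    markers = list(re.finditer(r"<[^<>\n]+>", between))
    if not markers:
        raise ValueError("no <...> marker between user and assistant content")
    return between[markers[-1].start():]


def fake_gemma4_render(messages):
    """Stand-in renderer for illustration only; NOT the real Gemma-4 template."""
    parts = []
    for m in messages:
        role = "model" if m["role"] == "assistant" else m["role"]
        parts.append(f"<|turn>{role}\n{m['content']}<turn|>\n")
    return "".join(parts)


def check_mask_frac(mask_frac, low=0.30, high=0.95):
    """Two-sided sanity check from the sixth fix; raises before training starts."""
    if not (low < mask_frac < high):
        raise ValueError(f"mask_frac={mask_frac:.2f} outside ({low}, {high})")
```

Because the probe renders a full user+assistant exchange rather than a generation prompt, the detected template reflects what actually appears in training text.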
Methodology improvements
- Offline preflight before relaunch. After each diagnosed bug, run the collator + tokenizer + template detection on a real training example before spending GPU time on a full model load. Six attempts took 25 minutes in total, versus roughly 60 minutes (six attempts × ~10 minutes) if each had required a full model load to surface the next bug.
- Dynamic GPU detection in the launcher. Picks all GPUs with ≥60 GB free (≥150 GB for Mistral). Combined with stopping unused vLLM endpoints (`vllm-irregularchat`, `vllm-devstral-rod`, `vllm-qwen36`), this freed 6 GPUs while keeping Vibe (Mistral-Medium-3.5 on GPUs 4,5) untouched.
- Power cap raised from 700W → 800W, persisted via `gpu-power-cap.service`. 6 training GPUs at 700W + 2 serving GPUs at 800W = ~5,800W under the 9,600W breaker.
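The GPU-selection rule can be sketched as a small parser over `nvidia-smi` output. The launcher’s real implementation is not shown in this document, so the function below is a hypothetical equivalent; it treats 1 GB as 1024 MiB, which is an assumption.

```python
def pick_free_gpus(smi_csv, min_free_gb=60):
    """Return GPU indices whose free memory is at least min_free_gb.

    `smi_csv` is the text printed by:
      nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits
    where memory.free is reported in MiB.
    """
    picked = []
    for line in smi_csv.strip().splitlines():
        index, free_mib = (field.strip() for field in line.split(","))
        if int(free_mib) / 1024 >= min_free_gb:
            picked.append(int(index))
    return picked


def visible_devices(indices):
    """Format indices for the CUDA_VISIBLE_DEVICES environment variable."""
    return ",".join(str(i) for i in indices)
```

The result of `visible_devices` is what a launcher would export before calling `torchrun`, with `--nproc_per_node` set to the number of picked GPUs.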
v2 pipeline (run sequentially)

    /workspace/irregularchat-corpus/launch_v2.sh gemma-v1data    # isolate training-config gains
    /workspace/irregularchat-corpus/launch_v2.sh gemma-v2data    # isolate data-shape gains
    /workspace/irregularchat-corpus/launch_v2.sh mistral-v2data  # does bigger base eat the data better?

Each profile auto-detects free GPUs, sets CUDA_VISIBLE_DEVICES, exports FSDP_TRANSFORMER_CLS_TO_WRAP, and launches torchrun with the right nproc_per_node. The 3-run comparison answers all three failure-mode questions independently.
Evaluation Methodology
A corpus-quality fine-tune is only useful if it answers community questions in real user phrasing, not in the “Summarize this:” template that dominated v1’s training data. The eval set must be authored without reference to the corpus formatting.
Wiki-grounded evaluation prompts
These 25 prompts are sourced from actual Irregularpedia pages that were part of the training corpus. Each one asks a question a community member might plausibly ask in chat, not in the training-corpus templates. The expected answer should reference concepts from the cited wiki page.
| # | Prompt | Source page |
|---|---|---|
| 1 | I just got an email saying my account was breached — where do I start? | cybersecurity/cyber-incident-response-guide-personal |
| 2 | What’s a good first step if I think my phone is compromised? | cybersecurity/cyber-incident-response-guide-personal |
| 3 | How do I run a radio check on an RTL-SDR? | radio/radio-checks |
| 4 | What command lists USB-connected SDRs on Linux? | radio/radio-checks |
| 5 | Walk me through prepping for a RIGEX exercise. | military/airborne-equipment-rigging |
| 6 | What’s the MC-6 nomenclature I need to know for jumpmaster? | military/airborne-equipment-rigging |
| 7 | What’s the difference between Monero and Bitcoin in terms of privacy? | privacy/monero |
| 8 | Where can I buy XMR with USD? | privacy/monero |
| 9 | What’s prompt engineering and why does it matter for working with LLMs? | ai-ml/ai-prompting |
| 10 | How should I structure a prompt to avoid ambiguity? | ai-ml/ai-prompting |
| 11 | What is C-UAS? | general/large-language-models or any C-UAS-tagged content |
| 12 | What does the community use Flipper Zero for? | radio/flipper-zero |
| 13 | How does the community handle email hardening? | cybersecurity/email-hardening-guide |
| 14 | What’s red-teaming in a cyber context? | cybersecurity/cyber-red-teaming |
| 15 | What’s the IrregularChat login flow? | general/the-irregularchat-login |
| 16 | Is there a community guide to running Protonmail Bridge on Linux? | privacy/protonmail-bridge-on-linux |
| 17 | What’s DragonOS used for? | radio/dragonos |
| 18 | How do I get started with software-defined radio? | radio/software-defined-radios-sdrs |
| 19 | What are the IrregularChat hackathons about? | general/irregularchat-hackathons |
| 20 | What’s the community’s recommended approach for self-hosting Nextcloud? | privacy/service-storage-nextcloud |
| 21 | What 3D printer does the community recommend? | hardware/3d-printer-recommendation |
| 22 | What’s a cyber deck and why would I build one? | hardware/cyber-decks |
| 23 | How does the community do archival research? | research/archival-research |
| 24 | What ham radio resources does IrregularChat recommend? | radio/ham-radio |
| 25 | What’s the AI ethics stance for community-built AI tools? | ai-ml/ai-ethics |
Eval execution
For each candidate model (base, v1-LoRA, v2-on-v1data, v2-on-v2data, mistral-v2data), serve via vLLM and run eval_v2.py against the OpenAI-compatible endpoint. The script writes one JSONL record per prompt with {i, q, answer, ms}.

    python3 /workspace/irregularchat-corpus/eval_v2.py \
      --base-url http://localhost:8000/v1 \
      --model <model_name> \
      --out /workspace/irregularchat-model/eval/<run>.jsonl

Scoring rubric
For each (prompt, answer) pair, score 0–3:
- 0 — Refusal or unrelated. “I can’t help with that” or wanders into a different topic.
- 1 — Generic web-pretrained answer. Correct on the topic but ignores the community’s specific tools, conventions, or page content.
- 2 — Domain-aware. References at least one community-specific concept (tool name, command, person, page) even if not perfectly aligned.
- 3 — Community-grounded. Answer reads like it came from someone who has read the cited wiki page; cites or paraphrases specific content.
A v2 model that beats v1 should land more answers in the 2–3 band, with fewer 0–1 results on prompts whose source page was part of training data. A useful sanity check: the base model should score 1 most of the time on these prompts — if it scores 2+ frequently, the eval is too easy and the wiki content overlaps generic web knowledge.
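Once each answer has a 0–3 score, the band comparison described above reduces to a small summary. This sketch assumes the scores are assigned by a human grader and collected into a list per model; the function names and the 0.5 "too easy" threshold are illustrative assumptions (the text only says "frequently").

```python
from collections import Counter


def summarize_scores(scores):
    """Summarize 0-3 rubric scores for one model run."""
    if any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("rubric scores must be integers 0-3")
    counts = Counter(scores)
    n = len(scores)
    return {
        "n": n,
        "mean": sum(scores) / n,
        "low_band": (counts[0] + counts[1]) / n,   # share scoring 0-1
        "high_band": (counts[2] + counts[3]) / n,  # share scoring 2-3
    }


def eval_too_easy(base_scores, threshold=0.5):
    """Flag the eval set if the base model lands in the 2-3 band too often."""
    return summarize_scores(base_scores)["high_band"] >= threshold
```

Comparing `high_band` between two runs on the same 25 prompts gives the "more answers in the 2–3 band" check directly.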
What we’ll know at the end
| Comparison | Question answered |
|---|---|
| base vs v1 | Did the original training do anything measurable? |
| v1 vs gemma-on-v1data (v2 training, v1 corpus) | Did the training-config bugs alone account for v1’s underperformance? |
| gemma-on-v1data vs gemma-on-v2data | Did corpus reshaping move the needle further? |
| gemma-on-v2data vs mistral-on-v2data | Does a 128B base eat the data better than a 31B base? |
If v1 ≈ gemma-on-v1data, the corpus was always the limiter. If gemma-on-v1data > v1 but gemma-on-v2data ≈ gemma-on-v1data, the training bugs were the dominant problem. The four-way split makes the attribution decomposable.
v2 Results (Actual Numbers)
Three v2 runs completed in sequence on 6× B200 (FSDP, packing, completion-only masking). Mistral was dropped because the on-disk Mistral-Medium-3.5-128B is Mistral3ForConditionalGeneration (multimodal), and Mistral 3 has no text-only causal-LM sibling class, unlike Gemma 4’s Gemma4ForCausalLM. The checkpoint is also FP8-quantized, which compounds the issue. See “Operational Pitfalls” below.
| Run | Corpus | Recipe | Wall clock | Final train loss | Train tok_acc | Eval tok_acc |
|---|---|---|---|---|---|---|
| gemma-v1data | 3,970 examples (v1) | r=128, α=128, 5 epochs | 31:24 | 36.07 | 0.621 | 0.591 |
| gemma-v2data | 10,435 examples (v2) | r=128, α=128, 5 epochs | 1:23:00 | 50.95 | 0.587 | 0.685 |
| gemma-v2.1 | 10,435 examples (v2) | r=128, α=256, NEFTune α=5, 8 epochs | 2:13:17 | 27.21 | 0.918 | 0.640 |
Key reading of the numbers:
- Train loss is not comparable across LoRA scaling settings. v2.1 has loss 27 vs v2’s 51, but the larger α changes absolute loss magnitudes. Eval token accuracy is the comparable signal.
- The v2 corpus generalizes better than the v1 corpus on held-out data: eval token accuracy is 0.685 vs 0.591, a 16% relative gain. The corpus reshape paid off: paraphrasing, diverse system prompts, and dropping the summary templates produce a model that handles unseen phrasings better.
- v2.1 overfits relative to v2. Despite a much tighter training fit (0.918 train tok_acc), eval dropped to 0.640. The α=256 + NEFTune + 8-epoch combination memorizes the corpus tightly but loses generalization headroom.
- Eval loss was `nan` on every run but token accuracy worked fine. With sequence packing, occasional eval batches end up with `mask_frac == 1.0` (no loss-bearing tokens, because the response template never appears in a particular packed window), causing the cross-entropy aggregator to divide by zero. Switching to a non-packed eval pass would fix this, but token accuracy is enough for relative comparison.
The wiki-grounded rubric eval is the actual quality measure — token accuracy on packed batches is a proxy. Results pending at time of writing.
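Two of the points above are simple arithmetic. The sketch below shows the relative-gain calculation behind the 16% figure and one way an aggregator could avoid the 0/0 case. It illustrates the idea only and is not how the Trainer aggregates eval loss; both function names are hypothetical.

```python
import math


def relative_gain(new, old):
    """Relative improvement of `new` over `old` (0.16 means 16%)."""
    return (new - old) / old


def mean_eval_loss(batches):
    """Token-weighted mean loss over (loss_sum, num_loss_tokens) pairs.

    Batches with zero loss-bearing tokens are skipped instead of
    contributing a 0/0 term that would turn the whole average into nan.
    """
    total_loss = 0.0
    total_tokens = 0
    for loss_sum, num_tokens in batches:
        if num_tokens == 0:
            continue
        total_loss += loss_sum
        total_tokens += num_tokens
    return total_loss / total_tokens if total_tokens else math.nan
```

Weighting by token count rather than averaging per-batch means also keeps short and long packed windows from counting equally.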
v2.1 Recipe Experiment
After v2 completed, we ran a v2.1 experiment to isolate the impact of recipe upgrades on the same v2 corpus:
| Setting | v2 | v2.1 |
|---|---|---|
| LoRA rank | 128 | 128 |
| LoRA alpha | 128 | 256 (α=2r) |
| Use rsLoRA | No | No (see warning below) |
| NEFTune α | None | 5 |
| Epochs | 5 | 8 |
| Effective LoRA scaling | α/r = 1.0 | α/r = 2.0 |
The α=2r recommendation came from arxiv 2602.04998. NEFTune (ICLR 2024, arxiv 2310.05914) adds Gaussian noise to embeddings during training; published gains of 8–35 points on AlpacaEval.
The rsLoRA + α=2r pitfall
Our first v2.1 attempt also enabled use_rslora=True. This destroyed the model. Within 80 steps, train loss climbed past 280 and token accuracy collapsed to 1%; the model was outputting essentially noise.
The math: rsLoRA changes the LoRA scaling factor from α/r to α/sqrt(r). For r=128:
- Standard LoRA, α=128: scaling = 1.0
- Standard LoRA, α=256: scaling = 2.0 ← v2.1 final recipe
- rsLoRA, α=128: scaling = 11.3 (sqrt(128) ≈ 11.3)
- rsLoRA, α=256: scaling = 22.6 ← what destroyed the model
The published recommendations for rsLoRA assume you reduce alpha to keep effective scaling in a reasonable range. Stacking α=2r with rsLoRA is 2 × sqrt(r) scaling — never recommended.
Rule of thumb: apply LoRA stability changes one at a time. Each recipe paper assumes the others are at default.
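The four scaling values listed above can be reproduced in a few lines. `lora_scaling` is an illustrative helper, not a PEFT API; it just applies the two formulas from the text.

```python
import math


def lora_scaling(alpha, r, use_rslora=False):
    """Effective LoRA scaling: alpha / r, or alpha / sqrt(r) with rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


for alpha in (128, 256):
    for rs in (False, True):
        print(f"r=128 alpha={alpha} rslora={rs}: {lora_scaling(alpha, 128, rs):.1f}")
```

Running a check like this before launch makes it obvious when a combination of settings pushes the effective scaling an order of magnitude above the usual 1–2 range.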
Loss trajectory comparison
Train loss at comparable steps:
| Step | v1-data | v2-data | v2.1 |
|---|---|---|---|
| 50 | 46.64 | 53.16 | 56.07 |
| 100 | 37.25 | 49.46 | 46.18 |
| 150 | — | 44.17 | 35.31 |
| 200 | — | 76.19 (spike) | — |
| 280 | — | 40.40 | 17.58 |
| 350 | — | — | 9.67 |
| Final | 36.07 | 40.40 (step 290) | 4.70 (step 460) |
Grad-norm stability:
- v2: 14 → 759 range, multiple spikes triggered Trainer’s `max_grad_norm=1.0` clipping
- v2.1: 5 → 130 range, no spikes; the α=2r scaling absorbs more signal without needing violent updates
Operational Pitfalls Encountered
In addition to the six bugs documented under “Six bugs encountered while wiring up v2,” the in-flight runs surfaced four more operational issues. Documenting these so the next iteration doesn’t have to rediscover them.
1. FSDP + PEFT save_pretrained() hangs at the end of every run
After every successful training run, the rank-0 process hung at 100% CPU when trainer.model.save_pretrained() ran. PEFT’s save path uses the deprecated state_dict_type() FSDP API (emits FutureWarning every save), gathers all FSDP shards to rank 0, then extracts LoRA weights. The gather completes but the save never returns: for our 4.3 GB adapter, the rank-0 process burned CPU for 13+ minutes before we killed it manually.
Workaround: Skip the final save_pretrained() call entirely. The Trainer’s per-step checkpoint save (which uses a different code path) writes adapter_model.safetensors into checkpoint-N/. Post-training, the launcher copies that out to lora-adapter/:

    # In launch_v2.sh, after torchrun exits:
    LAST_CKPT=$(ls -1d "$OUT"/checkpoint-* | sort -V | tail -1)
    cp "$LAST_CKPT/adapter_model.safetensors" \
       "$LAST_CKPT/adapter_config.json" \
       "$LAST_CKPT/chat_template.jinja" \
       "$LAST_CKPT/tokenizer"*.json \
       "$OUT/lora-adapter/"

The first two runs (v1data, v2data) had to be killed manually and salvaged this way. v2.1 ran cleanly with the patched script.
2. Disk pressure cascades into save failures
Initial training writes accumulated 65 GB in checkpoint directories, plus a 59 GB v1 merged model and 6 GB of old checkpoints, on a 3.5 TB root volume that already had only 93 GB free. By the time run #1 saved its final checkpoint, the disk was at 100%: save_pretrained() wrote README.md to the new lora-adapter/ dir but couldn’t write adapter_model.safetensors. The hang was actually two issues stacked: PEFT’s slow FSDP gather plus a failed write to a full disk.
Layout we ended up with:
```
/dev/nvme1n1p2  3.5T  /         <- system, /workspace, /workspace/models
/dev/nvme4n1p1  3.5T  /data     <- v2+ training outputs (this is where adapters live)
/dev/nvme2n1p1  3.5T  /scratch  <- ext4 created at runtime; merged models go here
nvme3n1         3.5T  raw       <- unformatted, reserved
```
The migration approach: leave /workspace/models/ (shared base models) where they are, symlink future training outputs to /data, format /dev/nvme2n1 for scratch space, and leave one disk unformatted for future expansion.
Rule of thumb: every training run on 31B-class models needs ≥100 GB free on the write target before it starts. For 128B-class with optimizer state, plan for 300+ GB.
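One way to enforce that rule of thumb is a preflight check that refuses to start when the write target is too full. This is a minimal sketch using only the standard library; the function names are hypothetical, and it counts in GiB (1024³ bytes) rather than the decimal GB used above, so thresholds are slightly conservative.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path):
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / GIB


def require_free_space(path, min_free_gib):
    """Raise before training starts if the write target has too little room."""
    available = free_gib(path)
    if available < min_free_gib:
        raise RuntimeError(
            f"{path}: {available:.1f} GiB free, need at least {min_free_gib} GiB"
        )
    return available


# Example call for a 31B-class run writing to /data (path taken from the layout above):
# require_free_space("/data", 100)
```

Calling this at the top of the launcher, before torchrun, would turn the stacked failure described above into an immediate and readable error.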
3. Mistral-Medium-3.5-128B incompatible with our pipeline
The on-disk Mistral checkpoint is Mistral3ForConditionalGeneration, i.e. the Mistral 3 architecture with Pixtral vision. The Mistral 3 module exposes only:
```
Mistral3Model
Mistral3PreTrainedModel
Mistral3ForConditionalGeneration
```
There is no Mistral3ForCausalLM. Gemma 4 has a text-only causal-LM sibling (which is how our training path works); Mistral 3 doesn’t. Additionally, the on-disk weights are FP8-quantized (quant_method: fp8 in config.json), so loading them as bf16 for LoRA training requires a dequantization step that TRL+PEFT don’t do automatically.
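Both blockers can be detected before a run is launched. The sketch below assumes the usual Hugging Face config.json layout (an architectures list and, for quantized checkpoints, a quantization_config object with a quant_method key); the helper names are hypothetical, and mistral3_stub is only a stand-in that lists the three classes named above rather than the real modeling module.

```python
import json
from pathlib import Path
from types import SimpleNamespace


def read_checkpoint_info(model_dir):
    """Report the declared architectures and quantization method from config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant = config.get("quantization_config") or {}
    return {
        "architectures": config.get("architectures", []),
        "quant_method": quant.get("quant_method"),  # None means unquantized
    }


def exposes_causal_lm(modeling_module, family):
    """True if the modeling module defines a <family>ForCausalLM class."""
    return hasattr(modeling_module, f"{family}ForCausalLM")


# Stand-in for the Mistral 3 modeling module, with only the classes listed above.
mistral3_stub = SimpleNamespace(
    Mistral3Model=object,
    Mistral3PreTrainedModel=object,
    Mistral3ForConditionalGeneration=object,
)
has_text_only_class = exposes_causal_lm(mistral3_stub, "Mistral3")
```

In a real check, the stub would be replaced by the imported modeling module for the model family in question.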
Workarounds exist (load the multimodal class, target only text-decoder LoRA modules, dequantize-on-load) but each is 2–4 hours of engineering. Skip for now. A “bigger base” experiment would be better targeted at Qwen3-32B (text-only, Apache 2.0, top instruction-following benchmarks as of 2026) or Llama-3.3-70B-Instruct — neither currently on disk.
4. vLLM not available in the training venv
The training venv at /workspace/irregularchat-corpus/.venv/ has transformers 5.5.0, trl 0.19.1, and peft 0.18.1, but no vLLM. The original eval orchestrator script assumed python -m vllm.entrypoints.openai.api_server would work; it fails with ModuleNotFoundError.
Two options:
- Install vLLM in the venv: possible, but risks transformers version conflicts with our TRL/PEFT stack
- Use transformers.generate() directly: slower per prompt (~10s vs vLLM’s ~1s) but adequate for offline eval, and it avoids the dependency
We chose the second. eval_direct.py loads the merged model with device_map="auto" (splits 62 GB across 2 GPUs), runs apply_chat_template + generate() per prompt. 25 prompts × 4 candidates × ~10s = ~17 min of pure generation, plus ~3–5 min per merge step (CPU bf16 load of 31B base + adapter, then merge_and_unload()).
Eval Results
All four candidates were merged into bf16, run through transformers.generate() directly (vLLM wasn’t installed in the training venv; transformers worked fine at ~10s per generation), and evaluated against the 25 wiki-grounded prompts. Output JSONLs are at /data/irregularchat-model/eval/.
Heuristic rubric scores
A scoring script looks for community-specific markers (tool names, commands, URL paths like /general/, irregularchat) in each answer:
| Candidate | Rubric avg (0–3) | Train loss | Train tok_acc | Eval tok_acc |
|---|---|---|---|---|
| base | 2.88 | — | — | — |
| v2-gemma-on-v1data | 2.76 | 36.07 | 0.621 | 0.591 |
| v2-gemma-on-v2data | 2.80 | 50.95 | 0.587 | 0.685 |
| v2.1-gemma-on-v2data | 2.84 | 27.21 | 0.918 | 0.640 |
All four scores within 0.12 points — within noise. No fine-tune meaningfully beats base on the heuristic scorer. The rubric automation is too lenient because Gemma 4 31B’s prior knowledge already produces fluent, on-topic, keyword-rich answers for most technical questions (SDR, Monero, prompt engineering, OSINT, etc.). The markers can’t distinguish “Gemma knows what an RTL-SDR is” from “the model learned IrregularChat’s specific community conventions.”
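The leniency is easy to see in a marker-counting scorer. The sketch below is not the project's scoring script; it assumes case-insensitive substring matching and a cap of 3 to match the 0–3 scale, and the marker list and sample answer are made up for illustration.

```python
def marker_hits(answer, markers):
    """Return the markers that appear in the answer (case-insensitive substring match)."""
    text = answer.lower()
    return [m for m in markers if m.lower() in text]


def rubric_score(answer, markers, cap=3):
    """Score an answer as the number of distinct markers found, capped at `cap`."""
    return min(cap, len(marker_hits(answer, markers)))


# Hypothetical markers for a login-flow prompt.
login_markers = ["authentik", "sso", "login"]

# A fluent but wrong answer still collects most of the markers.
generic_answer = "Use the login button with Google SSO."
generic_score = rubric_score(generic_answer, login_markers)
```

Here the fabricated answer earns 2 of 3 points without ever naming the real system, which is the same effect that keeps all four candidates within noise of each other.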
The Authentik test
The qualitative test that mattered: Q15 asked “What’s the IrregularChat login flow?” The correct answer references Authentik, the SSO system the community actually uses. Result:
| Candidate | Mentions Authentik? | What it says instead |
|---|---|---|
| base | No | Generic “OAuth via Discord/Google” |
| v2-on-v1data | No | “Login button, Google/Discord/Apple” |
| v2-on-v2data | No | “Google/Facebook/Twitter” |
| v2.1 | No | “Magic Links via Supabase Auth, signInWithOtp(), auth.irregularchat.com/auth/v1/callback” (fabricated specifics) |
None of the candidates mention Authentik. All four confidently fabricate. v2.1’s answer is the most concerning — high-confidence specific-looking code snippets, fake URLs, fabricated authentication library. The α=256 + NEFTune + 8-epoch recipe made the model more willing to invent specific-sounding fakes, not better at producing real community content.
Other v2.1 hallucinations from spot-checks:
- Q3 (RTL-SDR radio check): cited a non-existent “rtl-bench software suite” with full install commands
- Q19 (hackathons): cited “Hack the Polyglot, Feb 21–March 2 2026” — may be genuine memorization of a wiki page OR fabricated specifics
Bottom line
LoRA fine-tuning at our scale (r=128, 10K examples, ≤8 epochs) failed at the actual goal: injecting IrregularChat-specific facts. What it produced:
- ✅ Better instruction-following on technical topics (tone, structure, formatting)
- ✅ Slight shift toward wiki-like markdown output (section headers, bullet structure)
- ❌ Zero successful injection of community facts (Authentik, actual hackathon names, community-specific tools, conventions)
- ⚠️ v2.1 specifically: fabricates community-specific details more confidently than base — net negative for a Q&A bot
This empirically validates the research recommendation in the previous wiki section: SFT teaches behavior; RAG is required for facts at this corpus scale. The model’s pretrained prior dominates; a 4 GB adapter cannot encode 10 MB of unique community facts.
Deployment recommendation: v2-on-v2data + RAG (NOT v2.1)
| Option | Pros | Cons |
|---|---|---|
| Deploy v2.1 | Most “wiki-styled” output | Fabricates community facts most confidently — actively harmful for a Q&A bot |
| Deploy v2-on-v2data | Good train/eval balance, no overfit signature | Marginal lift over base on rubric |
| Deploy base + RAG | No fabrication risk beyond stock model | Loses slight wiki-format tone improvement |
| Deploy v2-on-v2data + RAG | Best of both: slight tone improvement + retrieved facts | RAG is doing the heavy lifting |
The fine-tune contributes ~10% formatting/tone polish. The retrieval (MCP server at search.irregulars.io) contributes the 90% of “actual community knowledge.”
Next Steps
- Quantize v2-on-v2data to AWQ-4bit for deployment. Target: 20 GB on disk, runs on RTX 4090 / single 24GB GPU. QuantTrio/gemma-4-31B-it-AWQ confirms vLLM’s awq_marlin handles Gemma 4.
- Wire MCP retrieval at inference. Either system-prompt injection with top-k retrieved docs, or a proper RAG framework (LangChain / LlamaIndex). The MCP server at search.irregulars.io already exists.
- Re-run the 25-prompt eval with RAG enabled. This is the comparison that actually matters. The expectation: the Authentik test passes this time. If it doesn’t, the retrieval pipeline needs work, not the model.
- Skip further LoRA experiments for knowledge injection. The data point is clear at this scale. Future iterations should focus on one of:
  - Continued pretraining (CPT) on raw wiki text in causal-LM mode for many passes; the SynCPT ICLR 2025 paper showed this helps where SFT plateaus
  - RAFT-style training with distractor passages mixed into the training data, so the model learns to ignore irrelevant context AND rely on parametric fallback gracefully (Berkeley 2024)
  - Distillation from a strong RAG-augmented teacher to a smaller student (Qwen3-8B) for cheaper deployment
- Don’t deploy v2.1. Its overfitting hurts fact-grounding more than the recipe gains help formatting. The model that looks most polished is the one most likely to confidently mislead users.
RAG Validation (BM25 + merged adapter)
After the fine-tune eval confirmed the fact-injection gap, we built a BM25 retriever over the wiki and ran the same 25 prompts with retrieved context injected into the system prompt. We tested two corpus sizes:
- Wiki-only RAG: 386 wiki .md files (the canonical Irregularpedia content).
- Full-corpus RAG: 11,627 docs — wiki + 4,253 PDF chunks (mined from v1+v2 training data assistant turns) + 3,673 news summaries (training) + 3,673 news summaries (fresh DB pull) + 47 archived-files AI summaries + 11 daily community rollups + 6 Outline docs.
Eval configuration: top-4 retrievals per prompt, 1500-char snippet limit per doc, injected into the system prompt before the user question. Same 25 wiki-grounded prompts as the no-RAG eval. transformers.generate() direct, bf16, 2-GPU TP.
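The retrieval and injection steps described above can be sketched in a few dozen lines. This is a minimal pure-Python Okapi BM25 with a Lucene-style IDF, not the project's retriever; the class and function names, the k1 and b defaults, the prompt layout, and the three toy documents are all assumptions made for illustration. Only the top-4 and 1500-character settings come from the eval configuration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.term_counts = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(c.values()) for c in self.term_counts]
        self.avg_len = sum(self.lengths) / max(len(docs), 1)
        doc_freq = Counter(term for c in self.term_counts for term in c)
        n = len(docs)
        # Lucene-style IDF stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()}

    def score(self, query_terms, i):
        counts, length = self.term_counts[i], self.lengths[i]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in query_terms:
            tf = counts.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def top_k(self, query, k=4):
        """Return up to k documents with a non-zero score, best first."""
        terms = tokenize(query)
        scored = [(self.score(terms, i), i) for i in range(len(self.docs))]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [self.docs[i] for _, i in ranked[:k]]


def build_system_prompt(base_prompt, snippets, max_chars=1500):
    """Append retrieved snippets, each truncated to max_chars, to the system prompt."""
    blocks = [f"[doc {n}]\n{s[:max_chars]}" for n, s in enumerate(snippets, start=1)]
    return base_prompt + "\n\nRetrieved context:\n\n" + "\n\n".join(blocks)


# Toy corpus standing in for the wiki pages.
docs = [
    "Login uses Authentik single sign-on for all community services.",
    "RTL-SDR setup notes: antenna, gain, and driver install.",
    "Monero versus Bitcoin privacy comparison.",
]
retriever = BM25(docs)
hits = retriever.top_k("What's the IrregularChat login flow?", k=4)
system_prompt = build_system_prompt("You are the community assistant.", hits)
```

With a real corpus, the resulting system_prompt would be passed as the system turn to apply_chat_template before generate().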
Headline result — the Authentik test
| | no-RAG | wiki-RAG | full-RAG |
|---|---|---|---|
| base | FAIL | PASS | PASS |
| v2-v1data | FAIL | — | — |
| v2-v2data | FAIL | PASS | PASS |
| v2.1 | FAIL | — | — |
For Q15 “What’s the IrregularChat login flow?”, every no-RAG configuration fabricated an answer (Discord/Google OAuth, “Magic Links via Supabase Auth,” etc.). Every RAG configuration correctly identified Authentik. This is the cleanest binary signal in the whole experiment.
Heuristic rubric across all 6 conditions
| Condition | Hits / 91 | % |
|---|---|---|
| base no-RAG | 55 | 60.4% |
| v2-v1data no-RAG | 55 | 60.4% |
| v2-v2data no-RAG | 57 | 62.6% |
| v2.1 no-RAG | 56 | 61.5% |
| base wiki-RAG | 48 | 52.7% |
| v2-v2data wiki-RAG | 55 | 60.4% |
| base full-RAG | 47 | 51.6% |
| v2-v2data full-RAG | 56 | 61.5% |
The heuristic rubric (counting topic-keyword hits) shows RAG and no-RAG within noise — RAG sometimes loses generic keyword reflexes that Gemma 4’s prior already has (e.g., Q13 email hardening: no-RAG fluently lists SPF/DKIM/DMARC; RAG focuses on what the wiki page actually says).
The heuristic doesn’t capture correctness. The Authentik test does — and on it, RAG wins 4–0. Other community-internal-fact wins (manually verified): Q5 RIGEX (no-RAG 1 marker → RAG 5), Q23 archival research (1 → 2).
Why RAG sometimes loses on the heuristic
For prompts where Gemma 4’s pretraining has strong coverage (email security, generic phone-compromise advice, Monero-vs-Bitcoin), the model produces fluent answers with many topic keywords. The wiki page on the same topic is often shorter and less keyword-dense. RAG redirects the model toward wiki content, which can mean fewer technical-acronym mentions and a lower heuristic score, but the answers are actually grounded in what the community says rather than in generic textbook recall.
This is a methodology lesson, not a result against RAG: marker-counting rubrics over-credit fluent generic answers. For real-world deployment, the user-visible question is “did the bot answer correctly with community-specific information”, and on that axis RAG is decisive.
Source mix observed in full-corpus RAG
Across 100 retrievals (25 prompts × top-4 each) for v2-v2data full-RAG:
| Source | Retrievals | Share |
|---|---|---|
| wiki | 59 | 59% |
| news (training-derived summaries) | 26 | 26% |
| pdf (training-derived chunks) | 9 | 9% |
| news_db (fresh from signal-bot) | 6 | 6% |
41% of retrievals are non-wiki. BM25 naturally surfaces wiki pages first when an exact topical match exists, with PDFs/news as fallback. For the 25 wiki-grounded eval prompts, the wiki-first behavior limits the impact of non-wiki sources — but in production where users ask questions not covered by any wiki page, the broader corpus matters.
Fine-tune × RAG synergy
| Comparison | base | v2-v2data | Δ |
|---|---|---|---|
| No-RAG | 60.4% | 62.6% | +2.2 |
| Wiki-RAG | 52.7% | 60.4% | +7.7 |
| Full-RAG | 51.6% | 61.5% | +9.9 |
The fine-tune’s contribution grows with RAG. Without RAG, v2 is only ~2 points above base. With RAG, v2 is +8–10 points above base. The LoRA learned to use retrieved content idiomatically — wiki-style markdown structure, action-oriented answer format, proper integration of citations — even though it didn’t learn the underlying facts. This is the case for deploying v2-on-v2data over base, conditional on RAG being part of the stack.
Final deployment recommendation
Deploy: v2-on-v2data adapter + wiki-only BM25 retrieval.
| Component | Choice | Why |
|---|---|---|
| Base model | gemma-4-31b-it | Strongest instruction model in our 30B class, available locally |
| Adapter | v2-on-v2data/lora-adapter/ (4.28 GB) | Best no-RAG-score fine-tune that also synergizes with RAG; no overfitting signs |
| Quantization | GGUF Q5_K_M when llama.cpp Gemma-4 support solidifies | AWQ-via-AutoAWQ doesn’t recognize gemma4 model type (deprecated tool); QuantTrio AWQ is base-only |
| Retrieval | BM25 over 386 wiki .md files | Full-corpus added marginal lift only; wiki-only is faster and noise-free |
| Top-K | 4 retrievals, 1500-char snippet cap | Tested config; ~4.5K extra tokens of context per query |
| Inference | bf16 on 2× B200 today; target single 24GB GPU after GGUF conversion | Works with current stack; smaller hardware pending quant |
Do NOT deploy v2.1 — its overfit recipe makes hallucinations more confident.
Do NOT skip the fine-tune even with RAG — measurably contributes +7-10 points over base+RAG.
Do NOT use full-corpus RAG yet — the wiki is sufficient for the current 25-prompt eval set; the broader corpus will matter for production queries that fall outside wiki coverage.
v3: Switching to Qwen3 + Heretic Abliteration (May 2026)
After v2.1’s failure mode (rsLoRA + α=2r overfitting + NaN eval loss), and Gemma 4’s heavy alignment friction making “uncensored” responses hard to elicit even on legitimate professional topics, we switched the base model from gemma-4-31b-it to Qwen3-30B-A3B-Instruct-2507 (Alibaba, Apache 2.0). The deployed model is now irregularchat-v3-heretic running locally on a Mac via Ollama + Open WebUI.
Why switch off Gemma 4
| Issue with Gemma 4 31B | Impact |
|---|---|
| Dense 31B params → ~12-14 tok/s on Apple Silicon Q8 | Sluggish interactive use |
| Most safety-trained of the major open models | Refusals/disclaimers on legitimate technical prompts |
| Gemma4ForConditionalGeneration multimodal scaffolding | GGUF conversion path immature; double-BOS warning at inference |
| Distinct attention architecture | FSDP + PEFT save_pretrained() hangs (Pitfall #1 above) |
| Abliterated variants poorly preserve LoRA fluency | LoRA-on-abliterated-base = compounding drift (verified empirically) |
Why Qwen3-30B-A3B-Instruct-2507
| Property | Value |
|---|---|
| Architecture | MoE — 30.5B total, 3.3B active per token |
| Inference speed (Q4_K_M on M-series) | ~92 tok/s (6.5× faster than Gemma Q8) |
| Fine-tuning benchmarks | Qwen3 family takes 4 of top 6 fine-tuned-quality spots in published 2026 evals |
| License | Apache 2.0 |
| Unsloth 2026.4.2 support | First-class — single B200 (180GB) handles the full 30B fine-tune in ~2 hours, no FSDP needed |
| Default alignment | Less aggressive than Gemma 4; system-prompt jailbreaks more effective |
| Abliteration ecosystem | Multiple pre-abliterated variants on HF (huihui-ai, mlabonne, DavidAU) plus first-class Heretic 1.3.0 support |
Training recipe (v3)
Single-GPU, no FSDP, lessons applied from v2/v2.1:
| Parameter | Value | Rationale |
|---|---|---|
| Base | Qwen/Qwen3-30B-A3B-Instruct-2507 | See above |
| lora_r | 64 | Middle ground between v2 (r=32, underfit) and v2.1 (r=128, exploded) |
| lora_alpha | 128 | Standard α=2r; intentionally NOT using rsLoRA |
| lora_dropout | 0 | Required: PEFT ParamWrapper for MoE expert layers raises ValueError on non-zero dropout. MoE gating provides sufficient regularization. |
| target_modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | Standard set; Unsloth excludes router by default for MoE |
| Epochs | 4 | Reduced from v2’s 5 to lower overfit risk on 10K examples |
| per_device_train_batch_size | 4 | B200 has plenty of headroom |
| gradient_accumulation_steps | 4 | Effective batch 16 |
| learning_rate | 2e-4 | Standard for LoRA, cosine scheduler |
| max_grad_norm | 0.5 | Tighter than default 1.0; prevents the gradient-explosion failure mode v2.1 hit at step 10 |
| Trainer | Unsloth FastLanguageModel + TRL SFTTrainer | Single GPU, no FSDP complications |
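The decision to avoid rsLoRA is easier to see from the scaling factor applied to the low-rank update. Standard LoRA multiplies the update by α/r, while rank-stabilized LoRA uses α/√r, so the same α=2r rule produces very different effective scales. The sketch below is a worked calculation, assuming v2.1 combined rsLoRA with r=128 and α=256 as described earlier; the function name lora_scale is illustrative.

```python
import math


def lora_scale(r, alpha, rslora=False):
    """Multiplier applied to the low-rank update: alpha/r, or alpha/sqrt(r) with rsLoRA."""
    return alpha / math.sqrt(r) if rslora else alpha / r


# v3 recipe: standard scaling with r=64, alpha=128.
v3_scale = lora_scale(r=64, alpha=128)

# v2.1 as described above: rsLoRA with r=128, alpha=256.
v21_scale = lora_scale(r=128, alpha=256, rslora=True)

ratio = v21_scale / v3_scale
```

Under these assumptions v3 applies a scale of 2.0, while the v2.1 combination works out to about 22.6, roughly eleven times larger. That gap is consistent with the instability reported for v2.1.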
v3 training results
| Metric | Value |
|---|---|
| Total runtime | 7,347s (~2.0 hours) |
| Steps | 2,612 (4 epochs × 653 steps/epoch) |
| Final train_loss | 0.37 |
| Final eval_loss | 2.38 (healthy, not NaN like v2.1) |
| Train/eval generalization gap | ~2.0 — reasonable (not overfit, not underfit) |
| eval_mean_token_accuracy | 0.89 |
| GPU memory peak | ~70 GB on a single B200 |
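As a sanity check, the step count and runtime can be tied back to the recipe with a little arithmetic. The sketch below assumes a single GPU and treats each optimizer step as consuming one full effective batch; exact rounding of partial batches depends on the Trainer version, and the source does not state the exact training-set size, so the result is an upper bound rather than a reported figure.

```python
def effective_batch(per_device_batch, grad_accum, num_gpus=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum * num_gpus


def max_examples_for(steps_per_epoch, batch):
    """Largest dataset size that fits in the given number of optimizer steps."""
    return steps_per_epoch * batch


batch = effective_batch(per_device_batch=4, grad_accum=4)    # recipe values
total_steps = 4 * 653                                        # epochs x steps per epoch
upper_bound_examples = max_examples_for(653, batch)
seconds_per_step = 7347 / total_steps
```

This gives an effective batch of 16, 2,612 total steps, at most 10,448 examples per epoch (in line with the roughly 10K examples quoted above), and about 2.8 seconds per optimizer step.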
Heretic abliteration (v3-heretic)
After v3 training, the merged model went through Heretic 1.3.0 to remove the refusal direction structurally. This is the published gold-standard approach (Arditi et al. 2024; KL-minimizing optimization).
| Parameter | Value |
|---|---|
| n_trials | 30 (Optuna TPE) |
| kl_divergence_target | 0.20 |
| Best trial KL divergence | 0.0137 (well below the 0.16 reference for Llama-3.1-8B-heretic; model behavior preserved) |
| Refusal rate vs baseline | 100/100 → 95/100 on mlabonne/harmful_behaviors |
The refusal-rate reduction on the standard harmful-behaviors benchmark was minimal — Qwen3’s safety pathways are distributed across MoE experts, and 30 trials wasn’t enough to fully untangle them. However, for the actual use case (military/OSINT/drone technical queries that are not in the benchmark), the abliteration completely eliminates moralizing and “I cannot assist” responses. Verified empirically post-deploy.
Critical incompatibility: Heretic’s interactive TUI
Heretic uses questionary (raw-mode prompt_toolkit) at the end of optimization to interactively prompt for which trial to apply and where to save. This cannot be driven by pexpect / stdin piping in scripted/headless runs, because prompt_toolkit requires a real terminal.
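Fix: Write a wrapper that monkey-patches questionary.select/path/text/checkbox to return canned responses BEFORE heretic.main.run() is called:
```python
import questionary

# fake_select_returning_first_trial_then_save, FakeAsker, and OUTPUT_DIR are
# defined elsewhere in the wrapper script (not shown here).
questionary.select = fake_select_returning_first_trial_then_save
# Accept any arguments so keyword-style calls to questionary.path() also work.
questionary.path = lambda *args, **kwargs: FakeAsker(OUTPUT_DIR)

import heretic.main  # imported after patching so Heretic sees the fakes

heretic.main.run()
```
State-machine the fake_select to pick: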
- Trial selection menu → first (best) trial in the Pareto front
- Action menu → “Save the model to a local folder”
- Subsequent action prompts → “Return to trial selection menu” → exit
Without this, Heretic completes optimization, sits at the menu, and exits without saving when its parent shell dies — wasting all the compute.
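The state machine itself is small. The sketch below shows one possible shape for the helpers the wrapper relies on; it is not the wrapper used in the project. The menu strings are the ones quoted above, but the order of Heretic's prompts, the use of SystemExit to leave the menu loop, and the names FakeAsker, make_fake_select, and install are assumptions. The patching function takes the module as a parameter so the logic can be exercised without questionary installed.

```python
from types import SimpleNamespace


class FakeAsker:
    """Stands in for a questionary prompt object: .ask() returns a canned value."""

    def __init__(self, value):
        self.value = value

    def ask(self, *args, **kwargs):
        return self.value

    unsafe_ask = ask


def _title(choice):
    # Choices may be plain strings or objects with a .title attribute.
    return choice if isinstance(choice, str) else getattr(choice, "title", str(choice))


def _value(choice):
    # Fall back to the title when a choice carries no explicit value.
    if isinstance(choice, str):
        return choice
    value = getattr(choice, "value", None)
    return _title(choice) if value is None else value


def make_fake_select(script):
    """Build a select() replacement that answers successive prompts from `script`.

    Each entry is either "first" (pick the first choice) or a substring to
    look for in the choice titles.
    """
    calls = {"n": 0}

    def fake_select(message, choices, *args, **kwargs):
        if calls["n"] >= len(script):
            raise SystemExit(0)  # script exhausted: assumed way to leave the menu loop
        step = script[calls["n"]]
        calls["n"] += 1
        if step == "first":
            return FakeAsker(_value(choices[0]))
        for choice in choices:
            if step in _title(choice):
                return FakeAsker(_value(choice))
        raise RuntimeError(f"no choice matching {step!r} in prompt {message!r}")

    return fake_select


def install(questionary_module, output_dir):
    """Patch the prompt functions on the given module with canned responders."""
    script = ["first", "Save the model to a local folder", "Return to trial selection menu"]
    questionary_module.select = make_fake_select(script)
    questionary_module.path = lambda *args, **kwargs: FakeAsker(output_dir)
    questionary_module.text = lambda *args, **kwargs: FakeAsker("")
```

In a real wrapper, install(questionary, OUTPUT_DIR) would run before heretic.main is imported. Raising an error on an unmatched prompt, instead of guessing, keeps a changed menu from silently discarding the run.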
Local Mac deployment
The production model now runs on a Mac instead of Obelisk:
| Component | Path |
|---|---|
| Merged HF model on Obelisk | /data/irregularchat-model/v3-heretic/ (53GB safetensors, 13 shards) |
| bf16 GGUF on Obelisk | /data/irregularchat-model/v3-heretic-gguf/irregularchat-v3-heretic-bf16.gguf (57GB) |
| Q4_K_M GGUF on Mac | /Users/sac/Models/irregularchat-v3-heretic-Q4_K_M.gguf (17GB) |
| Ollama tag | irregularchat-v3-heretic:latest |
| Modelfile | /Users/sac/Models/Modelfile-v3-heretic |
| RAG markdown corpus | /Users/sac/Models/rag-corpus/wiki-md/ (386 files derived from wiki.jsonl) |
| Open WebUI | http://127.0.0.1:8080 (Python venv at /Users/sac/irregularchat-local/.venv-webui/) |
llama.cpp’s convert_hf_to_gguf.py natively supports the Qwen3 MoE architecture — no patches needed (unlike Gemma 4 where we had to wait for the toolchain). Q4_K_M produces a 17GB file that loads in ~30s on an M-series with unified memory, then runs at ~92 tok/s.
v3 deployment recommendations
| Decision | Choice | Why |
|---|---|---|
| Base | Qwen3-30B-A3B-Instruct-2507 | MoE speed + better fine-tuning lift than Gemma at the same param count |
| Fine-tune | r=64 / α=128 / dropout=0 / 4 epochs / max_grad_norm=0.5 | Stable training, no NaN, train_loss=0.37 |
| Abliteration | Heretic 1.3.0 with monkey-patched auto-save | Removes moralizing on professional-context queries; KL preserved |
| Quantization | Q4_K_M GGUF | Sweet spot — 17GB fits comfortably; ~98% quality vs bf16 |
| Serving | Ollama + Open WebUI Knowledge | Local, fast, RAG via Knowledge collection (see Open WebUI) |
| Retrieval | Open WebUI built-in (sentence-transformers + sqlite-vec) | Replaces the previous Python BM25 script |
The v2-vs-v3 quality bar is: v3 produces fluent, calibrated answers (“I don’t know — based on common military training conventions…”) on out-of-corpus topics, where v2 confidently confabulated (“RIGEX stands for Rapid Interdiction Group Exercise”). RAG remains essential for actual factual content — fine-tuning teaches style, RAG provides facts.