
Training the IrregularChat Model

A complete walkthrough of how we built a domain-specific community assistant by fine-tuning an open-source LLM on IrregularChat data — from data collection to model serving.

We fine-tuned Google’s Gemma-4-31B-Instruct using LoRA (Low-Rank Adaptation) on 4,178 instruction-tuning examples extracted from the IrregularChat community’s wiki, Q&A, news summaries, PDF library, breakout room summaries, TLDR summaries, and TIL entries.

| Detail | Value |
| --- | --- |
| Base model | gemma-4-31b-it (31B params, multimodal) |
| Method | LoRA fine-tune (267M trainable / 31.5B total = 0.85%) |
| Training data | 4,178 instruction pairs (10 MB JSONL) |
| Hardware | 2x NVIDIA B200 (192 GB HBM3e each, uncapped power) |
| Training time | 1 hour 34 minutes (747 steps) |
| Final eval loss | 2.454 |
| Tool | Unsloth for loading, HuggingFace PEFT + Trainer for DDP |

Training data was drawn from community resources including public wikis, curated Q&A, AI-generated summaries, PDF libraries, and internal documentation. Data sources:

| Source | Records | Raw Size | Content |
| --- | --- | --- | --- |
| Tagged community content | Questions, answers, TILs, events, notes | | Structured entries tagged by type via Signal bot |
| Shared links | Publicly available URLs shared in channels | | Links to articles, tools, and resources |
| News link summaries | 2,556 | 2.9 MB | AI-generated article summaries |
| Wiki (Irregularpedia) | 399 pages | 1.9 MB | Curated knowledge base |
| PDF library | 145 unique PDFs | 31.3 MB extracted | Military manuals, drone reports, OSINT guides, cybersecurity docs |
| Outline docs | 269 docs | 1.5 MB | Internal team documentation (member-accessible wiki) |
| Q&A | 174 questions, 63 answers | 82 KB | Community Q&A with voting |
| Breakout room summaries | | | AI-generated summaries derived from group discussion sessions |
| TLDR summaries | | | AI-generated summaries of shared content |
| TIL entries | 34 | 25 KB | “Today I Learned” snippets |
| Archived wiki | 1,610 pages | 10.6 MB | Historical wiki content |

The Signal bot stores community data across several PostgreSQL tables. Each structured content type has its own table:

# Run on the server hosting signal-bot-postgres
# Export structured Q&A
docker exec signal-bot-postgres psql -U signal_bot -d signal_bot \
  -c "COPY (
    SELECT question_text, category, to_timestamp(created_at/1000)::text AS ts
    FROM q_and_a_questions WHERE question_text IS NOT NULL
  ) TO STDOUT WITH (FORMAT csv, HEADER true)" > qa_questions.csv

# Export news link summaries
docker exec signal-bot-postgres psql -U signal_bot -d signal_bot \
  -c "COPY (
    SELECT title, summary, url
    FROM news_links WHERE summary IS NOT NULL
  ) TO STDOUT WITH (FORMAT csv, HEADER true)" > news_links.csv

The same pattern applies to TIL entries, breakout room summaries, and the other structured sources.

Key tables:

  • q_and_a_questions / q_and_a_answers: structured Q&A
  • news_links: shared articles with title, summary
  • today_i_learned: original_messages + ai_summary
  • breakout_rooms: executive_summary + detailed_summary (AI-generated from session messages)

Outline uses camelCase columns in PostgreSQL:

docker exec norequirement_postgres psql -h localhost -U outline_user -d outline_db \
  -c "COPY (
    SELECT title, text, \"urlId\"
    FROM documents
    WHERE \"deletedAt\" IS NULL AND length(text) > 50
  ) TO STDOUT WITH (FORMAT csv, HEADER true)" > outline.csv

The community file share (/datadrive/IrregularChat/) contains 163 PDFs across 16 topic categories. Many are duplicated across categories.

pip install pymupdf
python3 extract-pdfs.py \
  --input /path/to/pdfs \
  --output extracted/pdfs.jsonl \
  --min-chars 200  # Skip scanned/image PDFs

The extraction script:

  • Recursively finds all PDFs
  • Extracts text with pymupdf (fitz)
  • Deduplicates by content hash (SHA-256)
  • Outputs JSONL with category, filename, and text

Result: 145 unique PDFs extracted (40 too short/scanned, 173 duplicates removed), yielding 31.3 MB of text.
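
The filtering and deduplication steps can be sketched as follows. This is a minimal illustration using only the standard library, not the actual extract-pdfs.py: it assumes text has already been extracted (for example with pymupdf), and the function name `dedupe_by_content` and the sample records are hypothetical.

```python
import hashlib

def dedupe_by_content(records, min_chars=200):
    """Drop too-short texts and exact duplicates (by SHA-256 of the text)."""
    seen = set()
    kept, too_short, duplicates = [], 0, 0
    for rec in records:
        text = rec["text"].strip()
        if len(text) < min_chars:
            too_short += 1  # likely a scanned/image-only PDF
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1  # same content filed under another category
            continue
        seen.add(digest)
        kept.append(rec)
    return kept, too_short, duplicates

# Hypothetical sample: one real document, one copy of it, one near-empty scan
records = [
    {"category": "drones", "filename": "report.pdf", "text": "A" * 300},
    {"category": "osint", "filename": "report-copy.pdf", "text": "A" * 300},
    {"category": "cyber", "filename": "scan.pdf", "text": "x"},
]
kept, too_short, duplicates = dedupe_by_content(records)
print(len(kept), too_short, duplicates)  # 1 1 1
```

Hashing the extracted text rather than the file bytes means two copies of a PDF with different metadata still collapse to one record; whether the real script hashes text or bytes is not stated above.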

Convert all sources into chat-format JSONL:

{
  "messages": [
    {"role": "system", "content": "You are IrregularChat Assistant..."},
    {"role": "user", "content": "What is C-UAS?"},
    {"role": "assistant", "content": "Counter-Unmanned Aircraft Systems (C-UAS)..."}
  ]
}

Formatting strategies by source:

| Source | User Prompt | Assistant Response |
| --- | --- | --- |
| Wiki pages | “What is {title}?” / “Explain {title}” | Page content (max 4000 chars) |
| Q&A | Actual question text | Best/longest answer |
| News | “Summarize this: {title}” | AI summary |
| PDFs | “What does ‘{document}’ say?” | Document text (chunked at 3500 chars) |
| TIL | “What did the community learn?” | AI summary or original entry |
| Breakout summaries | “What was discussed in {session}?” | AI-generated session summary |
| TLDR summaries | “Summarize {content}” | AI-generated content summary |
| Tagged notes | “What did the community share about {topic}?” | Tagged content entries |

PDFs get chunked into multiple training examples — a 50-page report becomes 10+ instruction pairs.
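
A minimal sketch of the PDF chunking step, assuming fixed-width character chunks (the real script may split on paragraph or page boundaries instead). The helper names are hypothetical, and the system prompt is left elided exactly as in the example above.

```python
import json

SYSTEM_PROMPT = "You are IrregularChat Assistant..."  # elided in the source, kept as-is

def chunk_text(text, size=3500):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def pdf_to_examples(document, text, size=3500):
    """Turn one extracted PDF into one chat-format example per chunk."""
    examples = []
    for n, chunk in enumerate(chunk_text(text, size), start=1):
        examples.append({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"What does '{document}' say? (section {n})"},
                {"role": "assistant", "content": chunk},
            ]
        })
    return examples

# Hypothetical 12,000-character document -> 4 chunks of up to 3,500 characters
examples = pdf_to_examples("Example Report", "lorem ipsum " * 1000)
jsonl_lines = [json.dumps(ex, ensure_ascii=False) for ex in examples]
print(len(jsonl_lines))  # 4
```

Each element of `jsonl_lines` is one line of the training JSONL file.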

Final dataset: 3,970 train + 208 validation = 4,178 records (10 MB JSONL).

We evaluated models available on the server:

| Model | Params | Type | Issue |
| --- | --- | --- | --- |
| Qwen2.5-VL-72B | 72B | Vision+Language | OOM even in 4-bit with Unsloth; bitsandbytes incompatible with B200 (Blackwell) |
| Devstral-2-123B | 123B | Code | Requires transformers 5.0+ (too new for most tools) |
| gpt-oss-120b | 120B | MoE | Requires transformers 4.55+ (bleeding edge) |
| gemma-4-31b-it | 31B | Multimodal | Fits on 1-2 GPUs, well-supported, good general quality |

  • Architecture: Gemma4ForConditionalGeneration (multimodal wrapper around text decoder)
  • Requires mm_token_type_ids field in training data (all zeros for text-only)
  • Custom Gemma4ClippableLinear layers break some PEFT/heretic versions
  • AutoModelForCausalLM won’t load it in older transformers — need 5.5.0+
  • Unsloth handles it via FastLanguageModel but ignores CUDA_VISIBLE_DEVICES

| Tool | Speed | Multi-GPU | VRAM | Issue We Hit |
| --- | --- | --- | --- | --- |
| Unsloth | 3x faster | Yes (since Dec 2025) | 70% less | Ignores CUDA_VISIBLE_DEVICES, can’t colocate with other processes |
| Vanilla PEFT | Baseline | Yes (DDP) | Baseline | Works but 2-4x slower than Unsloth |
| PEFT + DDP | 2x (per GPU added) | Yes | Full copy per GPU | Needs find_unused_parameters=True for multimodal models |

After multiple OOM errors and device mapping issues with Unsloth, the winning config was vanilla HuggingFace PEFT with torchrun DDP on 2 GPUs:

# Key settings (excerpt from train.py; MODEL_PATH is defined elsewhere in the script)
import os

# Must be set before CUDA is initialized
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"

import torch
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM, TrainingArguments

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun for each process

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map={"": local_rank},  # Each GPU gets full copy
    attn_implementation="sdpa",
)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,
    lora_alpha=64,  # alpha/r = 2.0 (standard heuristic)
    lora_dropout=0,  # Unsloth default; consider 0.05 for small datasets
    target_modules="all-linear",
    bias="none",
)
training_args = TrainingArguments(
    # Must enable for multimodal models (vision encoder params unused in text training)
    ddp_find_unused_parameters=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    # Effective batch = 2 * 2 GPUs * 4 accum = 16
)

Custom data collator required for Gemma-4:

from dataclasses import dataclass
from typing import Any

import torch

@dataclass
class GemmaCollator:
    tokenizer: Any
    pad_to_multiple_of: int = 8

    def __call__(self, features):
        # Pad to the longest sequence in the batch, rounded up to a multiple of 8
        max_len = max(len(f["input_ids"]) for f in features)
        max_len = ((max_len + self.pad_to_multiple_of - 1) //
                   self.pad_to_multiple_of) * self.pad_to_multiple_of
        # Falls back to 0 if the tokenizer defines no pad token
        pad_id = self.tokenizer.pad_token_id or 0
        batch = {k: [] for k in ["input_ids", "attention_mask", "labels", "mm_token_type_ids"]}
        for f in features:
            pad_len = max_len - len(f["input_ids"])
            batch["input_ids"].append(f["input_ids"] + [pad_id] * pad_len)
            batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad_len)
            batch["labels"].append(f["labels"] + [-100] * pad_len)  # -100 is ignored by the loss
            batch["mm_token_type_ids"].append(f["mm_token_type_ids"] + [0] * pad_len)  # 0 = text token
        return {k: torch.tensor(v) for k, v in batch.items()}

CUDA_VISIBLE_DEVICES=5,6 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --nproc_per_node=2 train.py 2>&1 | tee training.log

| Metric | Value |
| --- | --- |
| Steps | 747 |
| Epochs | 3 |
| Final train loss | 10.86 (avg) |
| Final eval loss | 2.454 |
| Last step loss | 8.553 |
| Training time | 1h 34m |
| Avg step time | 7.53 s/step |
| Trainable params | 266,963,456 (0.85%) |
| GPU memory | ~149 GB per GPU (of 192 GB available) |
| GPU utilization | 98-100% |
| Power draw | 535-679W per GPU (uncapped) |
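
As a quick cross-check, the step count and wall-clock time follow from the batch settings and dataset size reported above. This is plain arithmetic, assuming the trainer rounds the last partial batch of each epoch up to a full optimizer step.

```python
import math

train_examples = 3970
per_device_batch = 2
num_gpus = 2
grad_accum = 4
epochs = 3
sec_per_step = 7.53

effective_batch = per_device_batch * num_gpus * grad_accum
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
minutes = total_steps * sec_per_step / 60

print(effective_batch, steps_per_epoch, total_steps, round(minutes))
# 16 249 747 94
```

94 minutes matches the reported 1h 34m.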

Loss progression:

  • Step 5: 77.0 (initial, high)
  • Step 20: 19.5 (learning rate warmup)
  • Step 50: ~12 (converging)
  • Step 300: ~9.5 (stable)
  • Step 745: 8.55 (final)

Merge the adapter back into the base model for serving:

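import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it",
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # merge on CPU so no GPU memory is needed
)
model = PeftModel.from_pretrained(model, "/path/to/lora-adapter")
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
model.save_pretrained("/path/to/merged")
# Save the tokenizer alongside the weights so the merged directory is self-contained
AutoTokenizer.from_pretrained("google/gemma-4-31b-it").save_pretrained("/path/to/merged")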

Abliteration removes the “refusal direction” from a model’s activation space — the internal vector that causes it to say “I can’t help with that.” For a community focused on cybersecurity, drones, OSINT, and military tech, stock instruct models refuse too many legitimate domain questions.

| Method | Result |
| --- | --- |
| Manual mlabonne method (20 prompts, all layers) | Model output became gibberish (l l-'-'- l'--) |
| Heretic (automated Bayesian) | Can’t load Gemma-4 (PEFT version mismatch) |
| TrevorS/gemma-4-abliteration | Can’t load Gemma-4 (same PEFT issue) |

  1. Only 20 contrastive prompts — need 256-800 for a reliable direction estimate
  2. Applied to ALL 60 layers indiscriminately — no per-layer calibration of refusal direction strength
  3. No Winsorization — large LLMs (including Gemma) produce high-magnitude activation outliers (Sun et al. 2024) that corrupt mean calculations. This is the documented cause of gibberish output on Gemma models
  4. Single global direction — Gemma benefits from per-layer refusal direction estimation
  5. No norm preservation — raw projection removal distorts weight row norms

Based on TrevorJS’s work (3.2% refusal rate, 0.124 KL divergence on gemma-4-31b) and grimjim’s norm-preserving biprojected abliteration:

  1. 256-800 contrastive prompts (use mlabonne/harmful_behaviors + mlabonne/harmless_alpaca)
  2. Winsorize activations at 99.5th percentile before computing means
  3. Per-layer refusal directions (compute per-layer, not one global direction)
  4. Orthogonalize against harmless mean (biprojection)
  5. Norm-preserving weight modification — Magnitude-Preserving Orthogonal Ablation (MPOA): decompose row norms, ablate direction only, recompose
  6. Apply to o_proj and mlp.down_proj across layers (TrevorJS applied to all 60 layers; smaller models may benefit from targeting middle-to-late layers only)
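
The core of steps 2 through 5 can be sketched on toy vectors. This is an illustrative reading of the recipe, not TrevorJS’s or grimjim’s actual code: it uses plain Python lists instead of tensors, assumes each weight row is a vector in the residual-stream space (transpose first if your framework stores the matrix the other way), and all function names are hypothetical. For per-layer directions, call `refusal_direction` once per layer with that layer’s activations.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def winsorize(vec, pct=99.5):
    """Clip entries to +/- the pct-th percentile of the vector's magnitudes."""
    mags = sorted(abs(x) for x in vec)
    idx = max(min(len(mags) - 1, math.ceil(pct / 100 * len(mags)) - 1), 0)
    cap = mags[idx]
    return [max(-cap, min(cap, x)) for x in vec]

def refusal_direction(harmful_acts, harmless_acts, pct=99.5):
    """Difference of winsorized means, orthogonalized against the harmless mean."""
    harmful_mean = mean_vec([winsorize(a, pct) for a in harmful_acts])
    harmless_mean = mean_vec([winsorize(a, pct) for a in harmless_acts])
    diff = [h - b for h, b in zip(harmful_mean, harmless_mean)]
    u = normalize(harmless_mean)
    proj = dot(diff, u)
    diff = [d - proj * x for d, x in zip(diff, u)]  # biprojection step
    return normalize(diff)

def ablate_rows_norm_preserving(weight_rows, direction):
    """Remove `direction` from each row, then rescale the row to its original norm."""
    out = []
    for row in weight_rows:
        norm = math.sqrt(dot(row, row))
        coeff = dot(row, direction)
        ablated = [w - coeff * d for w, d in zip(row, direction)]
        new_norm = math.sqrt(dot(ablated, ablated))
        scale = norm / new_norm if new_norm > 0 else 0.0
        out.append([a * scale for a in ablated])
    return out

# Toy activations for one layer (hypothetical numbers)
harmful = [[2.0, 1.0, 0.0], [2.0, 1.2, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.2, 0.0]]
direction = refusal_direction(harmful, harmless)
new_rows = ablate_rows_norm_preserving([[3.0, 4.0, 0.0]], direction)
print(direction, new_rows)
```

In the toy case the row keeps its original length of 5 while its component along the refusal direction drops to zero, which is the property the norm-preserving step is meant to provide.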

TrevorJS/gemma-4-31B-it-uncensored is a pre-abliterated version of the same base model. Our LoRA adapter can be applied on top of it instead of the stock model.

# From venv (Docker image may be too old for Gemma-4)
pip install vllm
CUDA_VISIBLE_DEVICES=5,6 python3 -m vllm.entrypoints.openai.api_server \
  --model /path/to/merged \
  --host 127.0.0.1 \
  --port 8002 \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.80

Gotcha: Cloudflare WARP hijacks all routing tables. Traffic sourced from a public IP gets routed through the WARP tunnel instead of the physical interface. Disconnect WARP or configure split tunneling before testing.

We built an MCP server (apps/search-mcp/) that wraps the search service at search.irregulars.io, providing Claude Code and Claude Desktop with community knowledge retrieval:

{
  "mcpServers": {
    "irregularchat-search": {
      "command": "node",
      "args": ["/path/to/apps/search-mcp/dist/index.js"],
      "env": {
        "IRREGULARCHAT_SEARCH_TOKEN": "your-token"
      }
    }
  }
}

  • LoRA fine-tuning on community data is highly effective for domain adaptation — the model learns the community’s voice, topics, and terminology
  • PDF extraction was the biggest data source by volume — military manuals and technical reports provide dense, high-quality training signal
  • Deduplication by content hash eliminated 173 duplicate PDFs filed across multiple topic categories
  • DDP on 2 GPUs gave 2.7x speedup over single GPU with minimal code changes
  • Packing reduces total steps dramatically (747 → 96) but increases VRAM usage per step
  • Unsloth ignores CUDA_VISIBLE_DEVICES — can’t colocate with other GPU processes on a shared server
  • bitsandbytes 4-bit quantization crashed on B200 (Blackwell) GPUs with CUDA illegal memory access (as of early 2025; later releases added sm_100 support)
  • USB-C ethernet adapters use cdc_ncm driver that shows link UP but doesn’t pass traffic
  • Assigning same IP to multiple interfaces causes ARP confusion — router sends packets to random MACs
  • Cloudflare WARP hijacks all routing via policy table 65743, breaking direct ISP connections
  • Naive abliteration (few prompts, all layers, no Winsorization) destroys model output on Gemma

| Metric | Value |
| --- | --- |
| Total raw data collected | ~730 MB files + ~33 MB structured text |
| Usable extracted text | ~80 MB after dedup and filtering |
| Training examples | 4,178 |
| Training cost | $0 (own hardware) |
| Training time | 1h 34m on 2x B200 |
| LoRA adapter size | 1 GB |
| Merged model size | 59 GB |
| Base model size | 59 GB |

/workspace/irregularchat-corpus/
  training/
    train.jsonl        # 3,970 training examples
    val.jsonl          # 208 validation examples
  db/                  # Signal bot DB exports
  outline/             # Outline wiki exports
  extracted/
    pdfs.jsonl         # Extracted PDF text
  wiki/                # Irregularpedia markdown
  wiki-archived/       # Old wiki content
/workspace/irregularchat-model/
  lora-adapter/        # LoRA weights (1 GB)
  merged/              # Full merged model (59 GB)
  training.log         # Training output
  checkpoint-400/      # Mid-training checkpoint
  checkpoint-747/      # Final checkpoint

When testing the v1 model in production, members noticed that domain questions weren’t being answered any better than by the stock base model. Investigation revealed the problem wasn’t training time or hardware — it was that v1 didn’t actually train on the loss it claimed to.

A close read of the v1 training script and corpus surfaced five compounding issues:

  1. Prompt tokens were never masked. labels = input_ids.copy() meant the loss was computed over the system prompt and user question, not just the assistant response. The model spent most of its gradient signal memorizing the (always-identical) system prompt instead of learning content.
  2. Response template wouldn’t have matched anyway. Even with a completion-only collator, the literature pointed at <start_of_turn>model\n while Gemma-4’s real chat template emits <|turn>model\n. The mask would have been empty either way.
  3. Corpus shape was wrong for knowledge injection. Of 3,970 training examples, 60% were Summarize this: ARTICLE_TITLE → summary and 29% were What does doc 'X' say? (section N) → text chunk. Only 2.7% were natural “What is X?” pairs. Real users never phrase questions like the training data, so the adapter only fires for exact-match templates.
  4. System prompt was byte-identical across all 3,970 examples. With the masking bug, the model effectively required that exact 350-character preamble to “enter IrregularChat mode.”
  5. LoRA rank was sized for behavior, not facts. r=32 on a 31B model (0.85% trainable) is enough to shift tone but not to inject ~80 MB of community knowledge. Published results suggest r≥64 + more epochs OR continued pretraining for facts.
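
Issue 1 and the sanity check that would have caught it can be illustrated on plain token-id lists. This is a simplified single-turn sketch, not the v1 or v2 training code: the function names and token ids are hypothetical, and `mask_frac` is taken to mean the fraction of tokens excluded from the loss, with the 0.30 and 0.95 bounds used in v2.

```python
IGNORE_INDEX = -100  # label value that cross-entropy loss skips

def find_last(seq, sub):
    """Start index of the last occurrence of `sub` in `seq`, or -1."""
    for start in range(len(seq) - len(sub), -1, -1):
        if seq[start:start + len(sub)] == sub:
            return start
    return -1

def mask_prompt(input_ids, response_template_ids):
    """Mask everything up to and including the response template."""
    start = find_last(input_ids, response_template_ids)
    if start < 0:
        # Template not found: nothing can be trained on in this example
        return [IGNORE_INDEX] * len(input_ids)
    end = start + len(response_template_ids)
    return [IGNORE_INDEX] * end + list(input_ids[end:])

def mask_fraction(batch_labels):
    total = sum(len(labels) for labels in batch_labels)
    masked = sum(t == IGNORE_INDEX for labels in batch_labels for t in labels)
    return masked / total

# Hypothetical ids: 4 prompt tokens, a 2-token response template, 6 answer tokens
input_ids = [1, 2, 3, 4, 90, 91, 5, 6, 7, 8, 9, 10]
template = [90, 91]

v1_labels = list(input_ids)                   # the v1 bug: no masking at all
v2_labels = mask_prompt(input_ids, template)  # completion-only masking
missing = mask_prompt(input_ids, [77, 78])    # wrong template: everything masked

for name, labels in [("v1", v1_labels), ("v2", v2_labels), ("missing", missing)]:
    frac = mask_fraction([labels])
    print(name, frac, "ok" if 0.30 < frac < 0.95 else "ABORT")
```

Both failure modes land outside the accepted band: the unmasked v1 labels give 0.0 and a template that never matches gives 1.0.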

| Area | v1 | v2 |
| --- | --- | --- |
| Loss masking | Full sequence (bug) | DataCollatorForCompletionOnlyLM with dynamically detected response template |
| Mask sanity check | None | Aborts if mask_frac < 0.30 or > 0.95 before training starts |
| Response template | Hardcoded (wrong) | Detected at startup by diffing apply_chat_template outputs |
| Corpus shape | 60% “Summarize this:” / 29% “What does doc X say:” | Rule-based paraphrasing → 3 natural questions per source |
| System prompts | 1 identical preamble × 3,970 | 8 rotating templates |
| Training examples | 3,970 train / 208 val | 10,435 train / 375 val (after paraphrasing + dedup + short-fragment drop) |
| LoRA rank | r=32, α=64 | r=128, α=128 |
| Sharding | DDP (full copy per GPU) | FSDP (sharded across GPUs) |
| Packing | None | packing_strategy="wrapped" |
| Max sequence | 1024 tokens | 2048 tokens |
| GPU budget | 2 B200 uncapped (~600W each) | 6 B200 at 800W cap, alongside Vibe serving |

Every one of these would have silently mis-trained (or mis-measured) in v1. They surfaced as crashes in v2 because of the sanity checks, not because v2 introduced them:

  1. trl 0.19 forces padding_free when packing=True with default ffd strategy → can’t pass a custom collator. Fix: packing_strategy="wrapped".
  2. PEFT’s FSDP auto-wrap reads FSDP_TRANSFORMER_CLS_TO_WRAP env var, not HF Trainer’s fsdp_config dict. The two configs don’t cross over. Fix: set env var in the launcher per profile.
  3. Gemma-4’s decoder layer is named Gemma4TextDecoderLayer in transformers 5.5+, not Gemma4DecoderLayer as in older versions or older Gemma generations.
  4. Gemma-4 IT’s apply_chat_template(..., add_generation_prompt=True) injects a reasoning-channel prefix (<|turn>model\n<|channel>thought\n<channel|>) that never appears in training-text rendering. Detecting the response template from this gives a template that matches nothing. Fix: probe with sentinel user+assistant content and slice between them, anchored on the last <...> marker.
  5. Gemma4ForConditionalGeneration requires mm_token_type_ids in every batch even for text-only training. trl’s data collator doesn’t emit this field. Fix: thin wrapper around the collator that injects zeros.
  6. Sanity check at mask_frac < 0.30 is one-sided: mask_frac == 1.0 means “response template not found in any example,” which is equally broken. Fix: assert 0.30 < mask_frac < 0.95.
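
The sentinel probe from bug 4 can be sketched without loading a tokenizer. `render_chat` below is a hypothetical stand-in for `tokenizer.apply_chat_template(messages, tokenize=False)`; its turn markers are modeled on the `<|turn>model\n` form quoted above, and the closing `<turn|>` marker is an assumption.

```python
import re

def render_chat(messages):
    """HYPOTHETICAL stand-in for the real chat template (training-text rendering)."""
    role_names = {"user": "user", "assistant": "model"}
    return "".join(
        f"<|turn>{role_names[m['role']]}\n{m['content']}<turn|>\n" for m in messages
    )

def detect_response_template(render):
    """Slice the text between sentinel user and assistant content,
    anchored on the last <...> marker in that slice."""
    user_sentinel = "@@USER_SENTINEL@@"
    assistant_sentinel = "@@ASSISTANT_SENTINEL@@"
    text = render([
        {"role": "user", "content": user_sentinel},
        {"role": "assistant", "content": assistant_sentinel},
    ])
    start = text.index(user_sentinel) + len(user_sentinel)
    end = text.index(assistant_sentinel)
    between = text[start:end]
    markers = list(re.finditer(r"<[^<>\n]+>", between))
    if not markers:
        raise ValueError("no <...> marker between user and assistant content")
    return between[markers[-1].start():]

template = detect_response_template(render_chat)
print(repr(template))  # '<|turn>model\n'
```

Because the probe renders a complete user plus assistant exchange instead of using add_generation_prompt=True, the detected string is guaranteed to occur in training text rendered by the same template.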
  • Offline preflight before relaunch. After each diagnosed bug, run the collator + tokenizer + template detection on a real training example before spending GPU time on a full model load. Six attempts in 25 minutes vs. what would have been six attempts × ~10 minutes if each required a full model load to surface the next bug.
  • Dynamic GPU detection in the launcher. Picks all GPUs with ≥60 GB free (≥150 GB for Mistral). Combined with stopping unused vLLM endpoints (vllm-irregularchat, vllm-devstral-rod, vllm-qwen36), this freed 6 GPUs while leaving Vibe (Mistral-Medium-3.5 on GPUs 4,5) untouched.
  • Power cap raised from 700W → 800W persisted via gpu-power-cap.service. 6 training GPUs at 700W + 2 serving GPUs at 800W = ~5,800W under the 9,600W breaker.
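
A minimal sketch of the free-GPU selection, assuming the launcher reads the output of `nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits` (values in MiB). The actual launcher is a shell script whose implementation is not shown here; the function name and sample output are hypothetical.

```python
def pick_free_gpus(smi_csv, min_free_gb=60):
    """Return the indices of GPUs whose free memory meets the threshold."""
    selected = []
    for line in smi_csv.strip().splitlines():
        index, free_mib = (field.strip() for field in line.split(","))
        if int(free_mib) / 1024 >= min_free_gb:
            selected.append(int(index))
    return selected

# Hypothetical nvidia-smi output: GPU 1 is busy serving another model
sample = """0, 183000
1, 12000
2, 70000
"""
gpus = pick_free_gpus(sample)
print(",".join(str(g) for g in gpus))  # 0,2  (value for CUDA_VISIBLE_DEVICES)
print(pick_free_gpus(sample, min_free_gb=150))  # [0]
```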
/workspace/irregularchat-corpus/launch_v2.sh gemma-v1data # isolate training-config gains
/workspace/irregularchat-corpus/launch_v2.sh gemma-v2data # isolate data-shape gains
/workspace/irregularchat-corpus/launch_v2.sh mistral-v2data # does bigger base eat the data better?

Each profile auto-detects free GPUs, sets CUDA_VISIBLE_DEVICES, exports FSDP_TRANSFORMER_CLS_TO_WRAP, and launches torchrun with the right nproc_per_node. The 3-run comparison answers all three failure-mode questions independently.

A corpus-quality fine-tune is only useful if it answers community questions in real user phrasing — not in the “Summarize this:” template that dominated v1’s training data. The eval set must be authored without reference to the corpus formatting.

These 25 prompts are sourced from actual Irregularpedia pages that were part of the training corpus. Each one asks a question a community member might plausibly ask in chat, not in the training-corpus templates. The expected answer should reference concepts from the cited wiki page.

| # | Prompt | Source page |
| --- | --- | --- |
| 1 | I just got an email saying my account was breached — where do I start? | cybersecurity/cyber-incident-response-guide-personal |
| 2 | What’s a good first step if I think my phone is compromised? | cybersecurity/cyber-incident-response-guide-personal |
| 3 | How do I run a radio check on an RTL-SDR? | radio/radio-checks |
| 4 | What command lists USB-connected SDRs on Linux? | radio/radio-checks |
| 5 | Walk me through prepping for a RIGEX exercise. | military/airborne-equipment-rigging |
| 6 | What’s the MC-6 nomenclature I need to know for jumpmaster? | military/airborne-equipment-rigging |
| 7 | What’s the difference between Monero and Bitcoin in terms of privacy? | privacy/monero |
| 8 | Where can I buy XMR with USD? | privacy/monero |
| 9 | What’s prompt engineering and why does it matter for working with LLMs? | ai-ml/ai-prompting |
| 10 | How should I structure a prompt to avoid ambiguity? | ai-ml/ai-prompting |
| 11 | What is C-UAS? | general/large-language-models or any C-UAS-tagged content |
| 12 | What does the community use Flipper Zero for? | radio/flipper-zero |
| 13 | How does the community handle email hardening? | cybersecurity/email-hardening-guide |
| 14 | What’s red-teaming in a cyber context? | cybersecurity/cyber-red-teaming |
| 15 | What’s the IrregularChat login flow? | general/the-irregularchat-login |
| 16 | Is there a community guide to running Protonmail Bridge on Linux? | privacy/protonmail-bridge-on-linux |
| 17 | What’s DragonOS used for? | radio/dragonos |
| 18 | How do I get started with software-defined radio? | radio/software-defined-radios-sdrs |
| 19 | What are the IrregularChat hackathons about? | general/irregularchat-hackathons |
| 20 | What’s the community’s recommended approach for self-hosting Nextcloud? | privacy/service-storage-nextcloud |
| 21 | What 3D printer does the community recommend? | hardware/3d-printer-recommendation |
| 22 | What’s a cyber deck and why would I build one? | hardware/cyber-decks |
| 23 | How does the community do archival research? | research/archival-research |
| 24 | What ham radio resources does IrregularChat recommend? | radio/ham-radio |
| 25 | What’s the AI ethics stance for community-built AI tools? | ai-ml/ai-ethics |

For each candidate model (base, v1-LoRA, v2-on-v1data, v2-on-v2data, mistral-v2data), serve via vLLM and run eval_v2.py against the OpenAI-compatible endpoint. The script writes one JSONL record per prompt with {i, q, answer, ms}.

python3 /workspace/irregularchat-corpus/eval_v2.py \
  --base-url http://localhost:8000/v1 \
  --model <model_name> \
  --out /workspace/irregularchat-model/eval/<run>.jsonl
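
The core of such an eval loop might look like the sketch below. It is not the actual eval_v2.py: it assumes the standard OpenAI-compatible `/chat/completions` route that vLLM serves, uses only the standard library, and the function names and `temperature` setting are assumptions. Only the `{i, q, answer, ms}` record shape comes from the description above.

```python
import json
import time
import urllib.request

def build_request(base_url, model, question):
    """Build a chat-completions POST for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,  # assumption: deterministic decoding for comparability
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(base_url, model, i, question):
    """Send one prompt and return a record in the {i, q, answer, ms} shape."""
    started = time.perf_counter()
    with urllib.request.urlopen(build_request(base_url, model, question)) as resp:
        body = json.load(resp)
    ms = int((time.perf_counter() - started) * 1000)
    answer = body["choices"][0]["message"]["content"]
    return {"i": i, "q": question, "answer": answer, "ms": ms}

def run_eval(base_url, model, prompts, out_path):
    """Write one JSONL record per prompt."""
    with open(out_path, "w", encoding="utf-8") as out:
        for i, question in enumerate(prompts, start=1):
            out.write(json.dumps(ask(base_url, model, i, question)) + "\n")
```

Calling `run_eval("http://localhost:8000/v1", model_name, prompts, out_path)` against a running server would produce one output file per candidate model.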

For each (prompt, answer) pair, score 0–3:

  • 0 — Refusal or unrelated. “I can’t help with that” or wanders into a different topic.
  • 1 — Generic web-pretrained answer. Correct on the topic but ignores the community’s specific tools, conventions, or page content.
  • 2 — Domain-aware. References at least one community-specific concept (tool name, command, person, page) even if not perfectly aligned.
  • 3 — Community-grounded. Answer reads like it came from someone who has read the cited wiki page; cites or paraphrases specific content.

A v2 model that beats v1 should land more answers in the 2–3 band, with fewer 0–1 results on prompts whose source page was part of training data. A useful sanity check: the base model should score 1 most of the time on these prompts — if it scores 2+ frequently, the eval is too easy and the wiki content overlaps generic web knowledge.

| Comparison | Question answered |
| --- | --- |
| base vs v1 | Did the original training do anything measurable? |
| v1 vs gemma-on-v1data (v2 training, v1 corpus) | Did the training-config bugs alone account for v1’s underperformance? |
| gemma-on-v1data vs gemma-on-v2data | Did corpus reshaping move the needle further? |
| gemma-on-v2data vs mistral-on-v2data | Does a 128B base eat the data better than a 31B base? |

If v1 ≈ gemma-on-v1data, the corpus was always the limiter. If gemma-on-v1data > v1 but gemma-on-v2data ≈ gemma-on-v1data, the training bugs were the dominant problem. The four-way split makes the attribution decomposable.

Three v2 runs completed in sequence on 6× B200 (FSDP, packing, completion-only masking). Mistral was dropped because the on-disk Mistral-Medium-3.5-128B is Mistral3ForConditionalGeneration (multimodal) — Mistral 3 has no text-only causal-LM sibling class, unlike Gemma 4’s Gemma4ForCausalLM. The checkpoint is also FP8-quantized, which compounds the issue. See “Operational Pitfalls” below.

| Run | Corpus | Recipe | Wall clock | Final train loss | Train tok_acc | Eval tok_acc |
| --- | --- | --- | --- | --- | --- | --- |
| gemma-v1data | 3,970 examples (v1) | r=128, α=128, 5 epochs | 31:24 | 36.07 | 0.621 | 0.591 |
| gemma-v2data | 10,435 examples (v2) | r=128, α=128, 5 epochs | 1:23:00 | 50.95 | 0.587 | 0.685 |
| gemma-v2.1 | 10,435 examples (v2) | r=128, α=256, NEFTune α=5, 8 epochs | 2:13:17 | 27.21 | 0.918 | 0.640 |

Key reading of the numbers:

  • Train loss is not comparable across LoRA scaling settings. v2.1 has loss 27 vs v2’s 51, but the larger α changes absolute loss magnitudes. Eval token accuracy is the comparable signal.
  • gemma-v2data generalizes 16% better (relative) than gemma-v1data on held-out data (0.685 vs 0.591 eval token accuracy). The corpus reshape paid off: paraphrasing, diverse system prompts, and dropping the summary templates produce a model that handles unseen phrasings better.
  • v2.1 overfits relative to v2. Despite a much tighter training fit (0.918 train tok_acc), eval dropped to 0.640. The α=256 + NEFTune + 8-epoch combination memorizes the corpus tightly but loses generalization headroom.
  • Eval loss was nan on every run but token accuracy worked fine. With sequence packing, occasional eval batches end up with mask_frac == 1.0 (no loss-bearing tokens — the response template never appears in a particular packed window), causing the cross-entropy aggregator to divide by zero. Switching to a non-packed eval pass would fix this, but token accuracy is enough for relative comparison.

The wiki-grounded rubric eval is the actual quality measure — token accuracy on packed batches is a proxy. Results pending at time of writing.

After v2 completed, we ran a v2.1 experiment to isolate the impact of recipe upgrades on the same v2 corpus:

| Setting | v2 | v2.1 |
| --- | --- | --- |
| LoRA rank | 128 | 128 |
| LoRA alpha | 128 | 256 (α=2r) |
| Use rsLoRA | No | No (see warning below) |
| NEFTune α | None | 5 |
| Epochs | 5 | 8 |
| Effective LoRA scaling | α/r = 1.0 | α/r = 2.0 |

The α=2r recommendation came from arxiv 2602.04998. NEFTune (ICLR 2024, arxiv 2310.05914) adds Gaussian noise to embeddings during training; published gains of 8–35 points on AlpacaEval.

Our first v2.1 attempt also enabled use_rslora=True. This destroyed the model. Within 80 steps, train loss climbed past 280 and token accuracy collapsed to 1% — the model was outputting essentially noise.

The math: rsLoRA changes the LoRA scaling factor from α/r to α/sqrt(r). For r=128:

  • Standard LoRA, α=128: scaling = 1.0
  • Standard LoRA, α=256: scaling = 2.0 ← v2.1 final recipe
  • rsLoRA, α=128: scaling = 11.3 (sqrt(128) ≈ 11.3)
  • rsLoRA, α=256: scaling = 22.6 ← what destroyed the model

The published recommendations for rsLoRA assume you reduce alpha to keep effective scaling in a reasonable range. Stacking α=2r with rsLoRA is 2 × sqrt(r) scaling — never recommended.
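
The four scaling factors listed above can be reproduced with a few lines of arithmetic. The helper name is hypothetical; the two formulas are the ones stated in the text.

```python
import math

def lora_scaling(alpha, r, use_rslora=False):
    """Multiplier applied to the LoRA update: alpha/r, or alpha/sqrt(r) with rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

r = 128
for alpha, use_rslora in [(128, False), (256, False), (128, True), (256, True)]:
    label = "rsLoRA" if use_rslora else "standard"
    print(f"{label:8s} alpha={alpha}: scaling = {lora_scaling(alpha, r, use_rslora):.1f}")
# standard alpha=128: scaling = 1.0
# standard alpha=256: scaling = 2.0
# rsLoRA   alpha=128: scaling = 11.3
# rsLoRA   alpha=256: scaling = 22.6
```

Printing the effective scaling before launch is a cheap way to notice that two individually reasonable settings multiply into an 11x jump.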

Rule of thumb: apply LoRA stability changes one at a time. Each recipe paper assumes the others are at default.

Train loss at comparable steps:

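| Step | v1-data | v2-data | v2.1 |
| --- | --- | --- | --- |
| 50 | 46.64 | 53.16 | 56.07 |
| 100 | 37.25 | 49.46 | 46.18 |
| 150 | | 44.17 | 35.31 |
| 200 | | 76.19 (spike) | |
| 280 | | 40.40 | 17.58 |
| 350 | | | 9.67 |
| Final | 36.07 | 40.40 (step 290) | 4.70 (step 460) |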

Grad-norm stability:

  • v2: 14 → 759 range, multiple spikes triggered Trainer’s max_grad_norm=1.0 clipping
  • v2.1: 5 → 130 range, no spikes — the α=2r scaling absorbs more signal without needing violent updates

In addition to the six bugs documented under “Six bugs encountered while wiring up v2,” the in-flight runs surfaced four more operational issues. Documenting these so the next iteration doesn’t have to rediscover them.

1. FSDP + PEFT save_pretrained() hangs at the end of every run

After every successful training run, the rank-0 process hung at 100% CPU when trainer.model.save_pretrained() ran. PEFT’s save path uses the deprecated state_dict_type() FSDP API (emits FutureWarning every save), gathers all FSDP shards to rank 0, then extracts LoRA weights. The gather completes but the save never returns — at least for our 4.3 GB adapter, the rank-0 process burns CPU for 13+ minutes before manual kill.

Workaround: Skip the final save_pretrained() call entirely. The Trainer’s per-step checkpoint save (which uses a different code path) writes adapter_model.safetensors into checkpoint-N/. Post-training, the launcher copies that out to lora-adapter/:

# In launch_v2.sh, after torchrun exits:
LAST_CKPT=$(ls -1d "$OUT"/checkpoint-* | sort -V | tail -1)
cp "$LAST_CKPT/adapter_model.safetensors" \
   "$LAST_CKPT/adapter_config.json" \
   "$LAST_CKPT/chat_template.jinja" \
   "$LAST_CKPT/tokenizer"*.json \
   "$OUT/lora-adapter/"

The first two runs (v1data, v2data) had to be manually killed + salvaged this way. v2.1 ran cleanly with the patched script.

2. Disk pressure cascades into save failures

Initial training writes accumulated 65 GB in checkpoint directories + 59 GB v1 merged model + 6 GB old checkpoints on a 3.5 TB root volume that was already at 93 GB free. By the time run #1 saved its final checkpoint, the disk was at 100% — save_pretrained() wrote README.md to the new lora-adapter/ dir but couldn’t write adapter_model.safetensors. The hang was actually two issues stacked: PEFT’s slow FSDP gather + a failed write into a full disk.

Layout we ended up with:

/dev/nvme1n1p2 3.5T / <- system, /workspace, /workspace/models
/dev/nvme4n1p1 3.5T /data <- v2+ training outputs (this is where adapters live)
/dev/nvme2n1p1 3.5T /scratch <- ext4 created at runtime; merged models go here
nvme3n1 3.5T raw <- unformatted, reserved

The migration approach: leave /workspace/models/ (shared base models) where they are, symlink future training outputs to /data, format /dev/nvme2n1 for scratch space, leave one disk unformatted as future-expansion.

Rule of thumb: every training run on 31B-class models needs ≥100 GB free on the write target before it starts. For 128B-class with optimizer state, plan for 300+ GB.
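
A preflight check for this rule of thumb can be a few lines in the launcher. The sketch below uses `shutil.disk_usage` from the standard library; the function name is hypothetical and the 100 GB threshold is the 31B-class figure from the rule above.

```python
import shutil

def has_free_space(path, required_gb):
    """Return (ok, free_gb) for the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb, free_gb

# "." stands in for the real write target, e.g. the training output directory
ok, free_gb = has_free_space(".", required_gb=100)
if not ok:
    print(f"Only {free_gb:.0f} GB free on the write target; refusing to start training")
```

Running this against the output directory before the model loads would have turned the run #1 save failure into an immediate, cheap abort.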

3. Mistral-Medium-3.5-128B incompatible with our pipeline

The on-disk Mistral checkpoint is Mistral3ForConditionalGeneration — Mistral 3 architecture with Pixtral vision. The Mistral 3 module exposes only:

Mistral3Model
Mistral3PreTrainedModel
Mistral3ForConditionalGeneration

No Mistral3ForCausalLM. Gemma 4 has its text-only causal-LM sibling (which is how our training path works); Mistral 3 doesn’t. Additionally the on-disk weights are FP8-quantized (quant_method: fp8 in config.json), so loading them as bf16 for LoRA training requires a dequantization step that TRL+PEFT don’t do automatically.

Workarounds exist (load the multimodal class, target only text-decoder LoRA modules, dequantize-on-load) but each is 2–4 hours of engineering. Skip for now. A “bigger base” experiment would be better targeted at Qwen3-32B (text-only, Apache 2.0, top instruction-following benchmarks as of 2026) or Llama-3.3-70B-Instruct — neither currently on disk.

4. vLLM not available in the training venv

The training venv at /workspace/irregularchat-corpus/.venv/ has transformers 5.5.0, trl 0.19.1, peft 0.18.1 — but no vLLM. The original eval orchestrator script assumed python -m vllm.entrypoints.openai.api_server would work; it fails with ModuleNotFoundError.

Two options:

  • Install vLLM in the venv — possible but risks transformers version conflicts with our TRL/PEFT stack
  • Use transformers.generate() directly — slower per-prompt (~10s vs vLLM’s ~1s) but adequate for offline eval and avoids the dependency

We chose the second. eval_direct.py loads the merged model with device_map="auto" (splits 62 GB across 2 GPUs), runs apply_chat_template + generate() per prompt. 25 prompts × 4 candidates × ~10s = ~17 min of pure generation, plus ~3–5 min per merge step (CPU bf16 load of 31B base + adapter, then merge_and_unload()).

All four candidates were merged into bf16, loaded via transformers.generate() directly (vLLM wasn’t installed in the training venv — transformers worked fine, ~10s per generation), and evaluated against the 25 wiki-grounded prompts. Output JSONLs at /data/irregularchat-model/eval/.

A scoring script looks for community-specific markers in each answer (tool names, commands, URL paths like /general/, the string irregularchat) and assigns a rubric score per prompt:

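| Candidate | Rubric avg (0–3) | Train loss | Train tok_acc | Eval tok_acc |
| --- | --- | --- | --- | --- |
| base | 2.88 | | | |
| v2-gemma-on-v1data | 2.76 | 36.07 | 0.621 | 0.591 |
| v2-gemma-on-v2data | 2.80 | 50.95 | 0.587 | 0.685 |
| v2.1-gemma-on-v2data | 2.84 | 27.21 | 0.918 | 0.640 |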

All four scores fall within 0.12 points of each other, which is within noise. No fine-tune meaningfully beats base on the heuristic scorer. The rubric automation is too lenient because Gemma 4 31B’s prior knowledge already produces fluent, on-topic, keyword-rich answers for most technical questions (SDR, Monero, prompt engineering, OSINT, etc.). The markers can’t distinguish “Gemma knows what an RTL-SDR is” from “the model learned IrregularChat’s specific community conventions.”
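The leniency is easy to reproduce with a toy marker counter. The sketch below is not the actual scoring script: the marker list, the cap at 3, and both sample answers are assumptions chosen for illustration. It shows a generic answer and a grounded answer receiving the same score.

```python
def marker_score(answer, markers, cap=3):
    """Count distinct markers present in the answer (case-insensitive), capped at `cap`."""
    text = answer.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return min(hits, cap)


# Hypothetical marker list for a login-flow prompt.
MARKERS = ["sso", "oauth", "login", "irregularchat", "authentik"]

# Illustrative answers, not verbatim model output.
generic = "Click the login button on IrregularChat and use OAuth SSO via Google."
grounded = "IrregularChat login goes through Authentik, the community SSO."

print(marker_score(generic, MARKERS), marker_score(grounded, MARKERS))
```

Both answers hit the cap, even though only the second one names the system the community actually uses.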

The qualitative test that mattered: Q15 asked “What’s the IrregularChat login flow?” The correct answer references Authentik — the SSO system the community actually uses. Result:

| Candidate | Mentions Authentik? | What it says instead |
|---|---|---|
| base | No | Generic “OAuth via Discord/Google” |
| v2-on-v1data | No | “Login button, Google/Discord/Apple” |
| v2-on-v2data | No | “Google/Facebook/Twitter” |
| v2.1 | No | “Magic Links via Supabase Auth, signInWithOtp(), auth.irregularchat.com/auth/v1/callback” (fabricated specifics) |

None of the candidates mention Authentik. All four confidently fabricate. v2.1’s answer is the most concerning — high-confidence specific-looking code snippets, fake URLs, fabricated authentication library. The α=256 + NEFTune + 8-epoch recipe made the model more willing to invent specific-sounding fakes, not better at producing real community content.

Other v2.1 hallucinations from spot-checks:

  • Q3 (RTL-SDR radio check): cited a non-existent “rtl-bench software suite” with full install commands
  • Q19 (hackathons): cited “Hack the Polyglot, Feb 21–March 2 2026” — may be genuine memorization of a wiki page OR fabricated specifics

LoRA fine-tuning at our scale (r=128, 10K examples, ≤8 epochs) failed at the actual goal: injecting IrregularChat-specific facts. What it produced:

  • ✅ Better instruction-following on technical topics (tone, structure, formatting)
  • ✅ Slight shift toward wiki-like markdown output (section headers, bullet structure)
  • ❌ Zero successful injection of community facts (Authentik, actual hackathon names, community-specific tools, conventions)
  • ⚠️ v2.1 specifically: fabricates community-specific details more confidently than base — net negative for a Q&A bot

This empirically validates the research recommendation in the previous wiki section: SFT teaches behavior; RAG is required for facts at this corpus scale. The model’s pretrained prior dominates; a 4 GB adapter cannot encode 10 MB of unique community facts.

Deployment recommendation: v2-on-v2data + RAG (NOT v2.1)

| Option | Pros | Cons |
|---|---|---|
| Deploy v2.1 | Most “wiki-styled” output | Fabricates community facts most confidently; actively harmful for a Q&A bot |
| Deploy v2-on-v2data | Good train/eval balance, no overfit signature | Marginal lift over base on rubric |
| Deploy base + RAG | No fabrication risk beyond stock model | Loses slight wiki-format tone improvement |
| Deploy v2-on-v2data + RAG | Best of both: slight tone improvement + retrieved facts | RAG is doing the heavy lifting |

The fine-tune contributes roughly 10%: formatting and tone polish. The retrieval (MCP server at search.irregulars.io) contributes the other ~90%: the actual community knowledge.

  1. Quantize v2-on-v2data to AWQ-4bit for deployment. Target: 20 GB on disk, runs on RTX 4090 / single 24GB GPU. QuantTrio/gemma-4-31B-it-AWQ confirms vLLM’s awq_marlin handles Gemma 4.
  2. Wire MCP retrieval at inference. Either system-prompt injection with top-k retrieved docs, or a proper RAG framework (LangChain / LlamaIndex). The MCP server at search.irregulars.io already exists.
  3. Re-run the 25-prompt eval with RAG enabled. This is the comparison that actually matters. The expectation: the Authentik test passes this time. If it doesn’t, the retrieval pipeline needs work, not the model.
  4. Skip further LoRA experiments for knowledge injection. The data point is clear at this scale. Future iterations should focus on either:
    • Continued pretraining (CPT) on raw wiki text in causal-LM mode for many passes — the SynCPT ICLR 2025 paper showed this helps where SFT plateaus
    • RAFT-style training with distractor passages mixed into the training data, so the model learns to ignore irrelevant context AND rely on parametric fallback gracefully (Berkeley 2024)
    • Distillation from a strong RAG-augmented teacher to a smaller student (Qwen3-8B) for cheaper deployment
  5. Don’t deploy v2.1. Its overfitting hurts fact-grounding more than the recipe gains help formatting. The model that looks most polished is the one most likely to confidently mislead users.

After eval-of-fine-tunes confirmed the fact-injection gap, we built a BM25 retriever over the wiki and ran the same 25 prompts with retrieved context injected into the system prompt. Two corpus sizes tested:

  • Wiki-only RAG: 386 wiki .md files (the canonical Irregularpedia content).
  • Full-corpus RAG: 11,627 docs — wiki + 4,253 PDF chunks (mined from v1+v2 training data assistant turns) + 3,673 news summaries (training) + 3,673 news summaries (fresh DB pull) + 47 archived-files AI summaries + 11 daily community rollups + 6 Outline docs.

Eval configuration: top-4 retrievals per prompt, 1500-char snippet limit per doc, injected into the system prompt before the user question. Same 25 wiki-grounded prompts as the no-RAG eval. transformers.generate() direct, bf16, model split across 2 GPUs.
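The retrieval-and-injection step can be sketched in plain Python. The source does not say which BM25 implementation was used, so the BM25 class below is a from-scratch Okapi BM25 with common default parameters (k1=1.5, b=0.75), and the prompt layout in build_system_prompt is an assumption. Only the top-4 and 1500-character settings come from the eval configuration above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Okapi BM25 over an in-memory, non-empty list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lens) / len(docs)
        # Document frequency: number of docs containing each term.
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf = self.tfs[i]
        norm = 1 - self.b + self.b * self.lens[i] / self.avg_len
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + self.k1 * norm)
            for t in tokenize(query) if t in tf
        )

    def top_k(self, query, k=4):
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Drop documents that share no term with the query.
        return [self.docs[i] for s, i in scored[:k] if s > 0]


def build_system_prompt(base_prompt, retrieved, snippet_chars=1500):
    """Append truncated snippets to the system prompt (layout is an assumption)."""
    snippets = [f"[doc {n}]\n{doc[:snippet_chars]}" for n, doc in enumerate(retrieved, 1)]
    return base_prompt + "\n\nRetrieved context:\n\n" + "\n\n".join(snippets)


# Toy corpus for illustration; the real index covers the wiki .md files.
docs = [
    "Login uses Authentik single sign-on for all community services.",
    "RTL-SDR dongles receive radio from 24 MHz upward.",
    "Monero and Bitcoin differ in privacy defaults.",
]
hits = BM25(docs).top_k("What's the IrregularChat login flow?")
prompt = build_system_prompt("Answer from the retrieved context.", hits)
```

With this toy corpus only the first document shares a term with the query, so it is the only snippet injected.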

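| Candidate | no-RAG | wiki-RAG | full-RAG |
|---|---|---|---|
| base | FAIL | PASS | PASS |
| v2-v1data | FAIL | not run | not run |
| v2-v2data | FAIL | PASS | PASS |
| v2.1 | FAIL | not run | not run |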

For Q15 “What’s the IrregularChat login flow?”, every no-RAG configuration fabricated an answer (Discord/Google OAuth, “Magic Links via Supabase Auth,” etc.). Every RAG configuration correctly identified Authentik. This is the cleanest binary signal in the whole experiment.
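A binary check of this kind is simple to automate. The sketch below assumes a hypothetical passes_fact_check helper and uses paraphrased answers for illustration, not verbatim model output.

```python
def passes_fact_check(answer, required_terms):
    """True only if every required term appears in the answer (case-insensitive)."""
    text = answer.lower()
    return all(term.lower() in text for term in required_terms)


# Paraphrased answers for illustration only.
answers = {
    "base no-RAG": "Log in with OAuth via Discord or Google.",
    "base wiki-RAG": "IrregularChat uses Authentik for single sign-on.",
}

results = {
    name: "PASS" if passes_fact_check(text, ["Authentik"]) else "FAIL"
    for name, text in answers.items()
}
print(results)
```

Unlike a marker count, this check cannot be satisfied by fluent generic text: the answer either names the required fact or it does not.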

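| Condition | Hits / 91 | % |
|---|---|---|
| base no-RAG | 55 | 60.4% |
| v2-v1data no-RAG | 55 | 60.4% |
| v2-v2data no-RAG | 57 | 62.6% |
| v2.1 no-RAG | 56 | 61.5% |
| base wiki-RAG | 48 | 52.7% |
| v2-v2data wiki-RAG | 55 | 60.4% |
| base full-RAG | 47 | 51.6% |
| v2-v2data full-RAG | 56 | 61.5% |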

The heuristic rubric (counting topic-keyword hits) shows RAG and no-RAG within noise — RAG sometimes loses generic keyword reflexes that Gemma 4’s prior already has (e.g., Q13 email hardening: no-RAG fluently lists SPF/DKIM/DMARC; RAG focuses on what the wiki page actually says).

The heuristic doesn’t capture correctness. The Authentik test does — and on it, RAG wins 4–0. Other community-internal-fact wins (manually verified): Q5 RIGEX (no-RAG 1 marker → RAG 5), Q23 archival research (1 → 2).

For prompts where Gemma 4’s pretraining has strong coverage (email security, generic phone-compromise advice, Monero-vs-Bitcoin), the model produces fluent answers with many topic keywords. The wiki page on the same topic is often shorter and less keyword-dense. RAG redirects the model toward wiki content, which can mean fewer technical-acronym mentions and lower heuristic score — but answers that are actually grounded in what the community says, rather than just generic textbook recall.

This is a methodology lesson, not a result against RAG: marker-counting rubrics over-credit fluent generic answers. For real-world deployment, the user-visible question is “did the bot answer correctly with community-specific information”, and on that axis RAG is decisive.

Across 100 retrievals (25 prompts × top-4 each) for v2-v2data full-RAG:

| Source | Retrievals | Share |
|---|---|---|
| wiki | 59 | 59% |
| news (training-derived summaries) | 26 | 26% |
| pdf (training-derived chunks) | 9 | 9% |
| news_db (fresh from signal-bot) | 6 | 6% |

41% of retrievals are non-wiki. BM25 naturally surfaces wiki pages first when an exact topical match exists, with PDFs/news as fallback. For the 25 wiki-grounded eval prompts, the wiki-first behavior limits the impact of non-wiki sources — but in production where users ask questions not covered by any wiki page, the broader corpus matters.
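The source breakdown is a straightforward tally over the retrieved documents. A small sketch, assuming each retrieved document carries a source label; the list below is rebuilt from the reported counts rather than from the actual retrieval log.

```python
from collections import Counter

# Source labels of the 100 retrieved documents, using the counts reported above.
retrieved_sources = ["wiki"] * 59 + ["news"] * 26 + ["pdf"] * 9 + ["news_db"] * 6

counts = Counter(retrieved_sources)
total = sum(counts.values())
shares = {source: 100 * n / total for source, n in counts.items()}
non_wiki_share = 100 - shares["wiki"]

print(shares, non_wiki_share)
```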

| Comparison | base | v2-v2data | Δ |
|---|---|---|---|
| No-RAG | 60.4% | 62.6% | +2.2 |
| Wiki-RAG | 52.7% | 60.4% | +7.7 |
| Full-RAG | 51.6% | 61.5% | +9.9 |

The fine-tune’s contribution grows with RAG. Without RAG, v2 is only ~2 points above base. With RAG, v2 is +8–10 points above base. The LoRA learned to use retrieved content idiomatically — wiki-style markdown structure, action-oriented answer format, proper integration of citations — even though it didn’t learn the underlying facts. This is the case for deploying v2-on-v2data over base, conditional on RAG being part of the stack.
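The deltas follow directly from the raw hit counts out of 91 markers. As a worked check, using only the counts reported earlier (the helper names are illustrative):

```python
TOTAL_MARKERS = 91

# Marker hits per condition, taken from the hit-count table.
hits = {
    "no-RAG": {"base": 55, "v2-v2data": 57},
    "wiki-RAG": {"base": 48, "v2-v2data": 55},
    "full-RAG": {"base": 47, "v2-v2data": 56},
}


def pct(hit_count):
    """Hit rate as a percentage, rounded to one decimal as in the tables."""
    return round(100 * hit_count / TOTAL_MARKERS, 1)


def deltas(table):
    """Fine-tune minus base, in percentage points, per retrieval condition."""
    return {
        condition: round(pct(row["v2-v2data"]) - pct(row["base"]), 1)
        for condition, row in table.items()
    }


print(deltas(hits))
```

The gap widens under RAG mostly because base loses keyword hits when redirected to wiki content, while the fine-tune holds roughly steady.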

Deploy: v2-on-v2data adapter + wiki-only BM25 retrieval.

| Component | Choice | Why |
|---|---|---|
| Base model | gemma-4-31b-it | Strongest instruction model in our 30B class, available locally |
| Adapter | v2-on-v2data/lora-adapter/ (4.28 GB) | Best no-RAG-score fine-tune that also synergizes with RAG; no overfitting signs |
| Quantization | GGUF Q5_K_M when llama.cpp Gemma-4 support solidifies | AWQ-via-AutoAWQ doesn’t recognize gemma4 model type (deprecated tool); QuantTrio AWQ is base-only |
| Retrieval | BM25 over 386 wiki .md files | Full-corpus added marginal lift only; wiki-only is faster and noise-free |
| Top-K | 4 retrievals, 1500-char snippet cap | Tested config; ~4.5K extra tokens of context per query |
| Inference | bf16 on 2× B200 today; target single 24GB GPU after GGUF conversion | Works with current stack; smaller hardware pending quant |

Do NOT deploy v2.1 — its overfit recipe makes hallucinations more confident.

Do NOT skip the fine-tune even with RAG: it measurably contributes +7.7 to +9.9 points over base+RAG.

Do NOT use full-corpus RAG yet — the wiki is sufficient for the current 25-prompt eval set; the broader corpus will matter for production queries that fall outside wiki coverage.

v3: Switching to Qwen3 + Heretic Abliteration (May 2026)

After v2.1’s failure mode (rsLoRA + α=2r overfitting + NaN eval loss), and Gemma 4’s heavy alignment friction making “uncensored” responses hard to elicit even on legitimate professional topics, we switched the base model from gemma-4-31b-it to Qwen3-30B-A3B-Instruct-2507 (Alibaba, Apache 2.0). The deployed model is now irregularchat-v3-heretic running locally on a Mac via Ollama + Open WebUI.

| Issue with Gemma 4 31B | Impact |
|---|---|
| Dense 31B params → ~12-14 tok/s on Apple Silicon Q8 | Sluggish interactive use |
| Most safety-trained of the major open models | Refusals/disclaimers on legitimate technical prompts |
| Gemma4ForConditionalGeneration multimodal scaffolding | GGUF conversion path immature; double-BOS warning at inference |
| Distinct attention architecture | FSDP + PEFT save_pretrained() hangs (Pitfall #1 above) |
| Abliterated variants poorly preserve LoRA fluency | LoRA-on-abliterated-base = compounding drift (verified empirically) |

| Property | Value |
|---|---|
| Architecture | MoE: 30.5B total, 3.3B active per token |
| Inference speed (Q4_K_M on M-series) | ~92 tok/s (6.5× faster than Gemma Q8) |
| Fine-tuning benchmarks | Qwen3 family takes 4 of top 6 fine-tuned-quality spots in published 2026 evals |
| License | Apache 2.0 |
| Unsloth 2026.4.2 support | First-class: single B200 (180GB) handles the full 30B fine-tune in ~2 hours, no FSDP needed |
| Default alignment | Less aggressive than Gemma 4; system-prompt jailbreaks more effective |
| Abliteration ecosystem | Multiple pre-abliterated variants on HF (huihui-ai, mlabonne, DavidAU) plus first-class Heretic 1.3.0 support |

Single-GPU, no FSDP, lessons applied from v2/v2.1:

| Parameter | Value | Rationale |
|---|---|---|
| Base | Qwen/Qwen3-30B-A3B-Instruct-2507 | See above |
| lora_r | 64 | Middle ground between v2 (r=32, underfit) and v2.1 (r=128, exploded) |
| lora_alpha | 128 | Standard α=2r; intentionally NOT using rsLoRA |
| lora_dropout | 0 | Required: PEFT ParamWrapper for MoE expert layers raises ValueError on non-zero dropout. MoE gating provides sufficient regularization. |
| target_modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | Standard set; Unsloth excludes router by default for MoE |
| Epochs | 4 | Reduced from v2’s 5 to lower overfit risk on 10K examples |
| per_device_train_batch_size | 4 | B200 has plenty of headroom |
| gradient_accumulation_steps | 4 | Effective batch 16 |
| learning_rate | 2e-4 | Standard for LoRA, cosine scheduler |
| max_grad_norm | 0.5 | Tighter than default 1.0; prevents the gradient-explosion failure mode v2.1 hit at step 10 |
| Trainer | Unsloth FastLanguageModel + TRL SFTTrainer | Single GPU, no FSDP complications |

| Metric | Value |
|---|---|
| Total runtime | 7,347s (~2.0 hours) |
| Steps | 2,612 (4 epochs × 653 steps/epoch) |
| Final train_loss | 0.37 |
| Final eval_loss | 2.38 (healthy, not NaN like v2.1) |
| Train/eval generalization gap | ~2.0, reasonable (not overfit, not underfit) |
| eval_mean_token_accuracy | 0.89 |
| GPU memory peak | ~70 GB on a single B200 |
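The step count can be cross-checked against the batch settings. The sketch below holds the v3 hyperparameters in a plain dict (the key names follow common PEFT/Trainer argument names, but this is not the actual training script) and derives the effective batch and total steps. The exact training-set size is not stated; 10,440 is a hypothetical value consistent with the reported 653 steps per epoch, and rounding up the last partial batch is an assumption about the trainer.

```python
import math

# v3 hyperparameters as reported in the table; the dict layout is illustrative.
V3_CONFIG = {
    "lora_r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.0,
    "use_rslora": False,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "num_train_epochs": 4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "max_grad_norm": 0.5,
}


def effective_batch(cfg, num_gpus=1):
    return cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"] * num_gpus


def total_steps(cfg, num_examples, num_gpus=1):
    # Assumes the final partial batch of each epoch still counts as one step.
    steps_per_epoch = math.ceil(num_examples / effective_batch(cfg, num_gpus))
    return steps_per_epoch * cfg["num_train_epochs"]


print(effective_batch(V3_CONFIG), total_steps(V3_CONFIG, 10_440))
```

Any training-set size from 10,433 to 10,448 gives 653 steps per epoch under these assumptions, which matches the "10K examples" figure.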

After v3 training, the merged model went through Heretic 1.3.0 to remove the refusal direction structurally. This is the published gold-standard approach (Arditi et al. 2024; KL-minimizing optimization).

| Parameter | Value |
|---|---|
| n_trials | 30 (Optuna TPE) |
| kl_divergence_target | 0.20 |
| Best trial KL divergence | 0.0137 (well below the 0.16 reference for Llama-3.1-8B-heretic; model behavior preserved) |
| Refusal rate vs baseline | 100/100 → 95/100 on mlabonne/harmful_behaviors |

The refusal-rate reduction on the standard harmful-behaviors benchmark was minimal — Qwen3’s safety pathways are distributed across MoE experts, and 30 trials wasn’t enough to fully untangle them. However, for the actual use case (military/OSINT/drone technical queries that are not in the benchmark), the abliteration completely eliminates moralizing and “I cannot assist” responses. Verified empirically post-deploy.
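The core operation behind abliteration, as described by Arditi et al., is projecting a single "refusal direction" out of the model's activations. The sketch below shows only that projection on toy vectors in pure Python. The function name and values are illustrative, and Heretic's Optuna-driven search over how strongly to ablate is not reproduced here.

```python
import math


def ablate(x, direction):
    """Remove the component of x along `direction`: x - (x . r_hat) * r_hat."""
    norm = math.sqrt(sum(d * d for d in direction))
    r_hat = [d / norm for d in direction]
    projection = sum(a * r for a, r in zip(x, r_hat))
    return [a - projection * r for a, r in zip(x, r_hat)]


activation = [2.0, 1.0, -1.0]   # toy hidden state
refusal_dir = [1.0, 0.0, 0.0]   # toy "refusal direction"
cleaned = ablate(activation, refusal_dir)
print(cleaned)
```

After the projection the activation has no component along the chosen direction, while the orthogonal components are untouched.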

Critical incompatibility: Heretic’s interactive TUI

Heretic uses questionary (raw-mode prompt_toolkit) at the end of optimization to interactively prompt for which trial to apply and where to save. This cannot be driven by pexpect / stdin piping in scripted/headless runs — prompt_toolkit requires a real terminal.

Fix: Write a wrapper that monkey-patches questionary.select/path/text/checkbox to return canned responses BEFORE heretic.main.run() is called:

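import questionary

# Patch the prompt functions BEFORE Heretic runs, so its menus receive canned answers.
# fake_select_returning_first_trial_then_save, FakeAsker and OUTPUT_DIR are defined by the wrapper.
questionary.select = fake_select_returning_first_trial_then_save
questionary.path = lambda message, *args, **kwargs: FakeAsker(OUTPUT_DIR)

import heretic.main  # imported only after the patches are in place
heretic.main.run()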

State-machine the fake_select to pick:

  1. Trial selection menu → first (best) trial in the Pareto front
  2. Action menu → “Save the model to a local folder”
  3. Subsequent action prompts → “Return to trial selection menu” → exit

Without this, Heretic completes optimization, sits at the menu, and exits without saving when its parent shell dies — wasting all the compute.
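The state machine described above might look roughly like the following. This is a sketch under stated assumptions, not the wrapper that was actually run: AutoSelect and install_patches are hypothetical names, the two menu labels are taken from the steps above and matched by substring, and exiting via SystemExit on the second visit to the trial menu is an assumption about how to stop cleanly.

```python
class FakeAsker:
    """Stands in for a questionary question object: .ask() returns a canned value."""

    def __init__(self, value):
        self.value = value

    def ask(self, *args, **kwargs):
        return self.value

    unsafe_ask = ask


def _title(choice):
    # Choices may be plain strings or objects exposing a .title attribute.
    return choice if isinstance(choice, str) else str(getattr(choice, "title", choice))


def _value(choice):
    # Return the choice's .value when it has one, otherwise the string itself.
    return choice if isinstance(choice, str) else getattr(choice, "value", choice)


def _find(choices, label):
    return next((c for c in choices if label in _title(c)), None)


class AutoSelect:
    """Pick the first trial, save once, return to the trial menu, then exit."""

    SAVE = "Save the model to a local folder"
    BACK = "Return to trial selection menu"

    def __init__(self):
        self.trial_picked = False
        self.saved = False

    def __call__(self, message, choices, *args, **kwargs):
        save, back = _find(choices, self.SAVE), _find(choices, self.BACK)
        if save is not None and not self.saved:
            self.saved = True
            return FakeAsker(_value(save))
        if back is not None:
            return FakeAsker(_value(back))
        if not self.trial_picked:
            self.trial_picked = True
            return FakeAsker(_value(choices[0]))  # first entry = best trial
        # Assumption: a second visit to the trial menu means the save is done.
        raise SystemExit(0)


def install_patches(questionary_module, output_dir):
    """Replace the interactive prompts on the given module with canned ones."""
    questionary_module.select = AutoSelect()
    questionary_module.path = lambda *args, **kwargs: FakeAsker(output_dir)
```

In a real wrapper, install_patches(questionary, OUTPUT_DIR) would be called before heretic.main.run(); the text and checkbox prompts would need similar stand-ins if Heretic reaches them.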

The production model now runs on a Mac instead of Obelisk:

| Component | Path |
|---|---|
| Merged HF model on Obelisk | /data/irregularchat-model/v3-heretic/ (53GB safetensors, 13 shards) |
| bf16 GGUF on Obelisk | /data/irregularchat-model/v3-heretic-gguf/irregularchat-v3-heretic-bf16.gguf (57GB) |
| Q4_K_M GGUF on Mac | /Users/sac/Models/irregularchat-v3-heretic-Q4_K_M.gguf (17GB) |
| Ollama tag | irregularchat-v3-heretic:latest |
| Modelfile | /Users/sac/Models/Modelfile-v3-heretic |
| RAG markdown corpus | /Users/sac/Models/rag-corpus/wiki-md/ (386 files derived from wiki.jsonl) |
| Open WebUI | http://127.0.0.1:8080 (Python venv at /Users/sac/irregularchat-local/.venv-webui/) |

llama.cpp’s convert_hf_to_gguf.py natively supports the Qwen3 MoE architecture — no patches needed (unlike Gemma 4 where we had to wait for the toolchain). Q4_K_M produces a 17GB file that loads in ~30s on an M-series with unified memory, then runs at ~92 tok/s.
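The two-stage conversion (HF safetensors to bf16 GGUF, then bf16 GGUF to Q4_K_M) can be scripted. The sketch below only builds the command lines and does not run them. It assumes a recent llama.cpp checkout where convert_hf_to_gguf.py accepts --outfile and --outtype and where the llama-quantize binary is on PATH; the gguf_commands helper and the llama_cpp_dir default are hypothetical.

```python
def gguf_commands(hf_dir, bf16_gguf, quant_gguf,
                  quant_type="Q4_K_M", llama_cpp_dir="llama.cpp"):
    """Return argv lists for HF -> bf16 GGUF and bf16 GGUF -> quantized GGUF."""
    convert = [
        "python", f"{llama_cpp_dir}/convert_hf_to_gguf.py", hf_dir,
        "--outfile", bf16_gguf,
        "--outtype", "bf16",
    ]
    quantize = ["llama-quantize", bf16_gguf, quant_gguf, quant_type]
    return convert, quantize


convert_cmd, quantize_cmd = gguf_commands(
    hf_dir="/data/irregularchat-model/v3-heretic/",
    bf16_gguf="/data/irregularchat-model/v3-heretic-gguf/irregularchat-v3-heretic-bf16.gguf",
    quant_gguf="irregularchat-v3-heretic-Q4_K_M.gguf",
)
print(" ".join(convert_cmd))
print(" ".join(quantize_cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths have been verified on the machine doing the conversion.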

| Decision | Choice | Why |
|---|---|---|
| Base | Qwen3-30B-A3B-Instruct-2507 | MoE speed + better fine-tuning lift than Gemma at the same param count |
| Fine-tune | r=64 / α=128 / dropout=0 / 4 epochs / max_grad_norm=0.5 | Stable training, no NaN, train_loss=0.37 |
| Abliteration | Heretic 1.3.0 with monkey-patched auto-save | Removes moralizing on professional-context queries; KL preserved |
| Quantization | Q4_K_M GGUF | Sweet spot: 17GB fits comfortably; ~98% quality vs bf16 |
| Serving | Ollama + Open WebUI Knowledge | Local, fast, RAG via Knowledge collection (see Open WebUI) |
| Retrieval | Open WebUI built-in (sentence-transformers + sqlite-vec) | Replaces the previous Python BM25 script |

The v2-vs-v3 quality bar is: v3 produces fluent, calibrated answers (“I don’t know — based on common military training conventions…”) on out-of-corpus topics, where v2 confidently confabulated (“RIGEX stands for Rapid Interdiction Group Exercise”). RAG remains essential for actual factual content — fine-tuning teaches style, RAG provides facts.