Training the IrregularChat Model

A complete walkthrough of how we built a domain-specific community assistant by fine-tuning an open-source LLM on IrregularChat data — from data collection to model serving.

Overview

We fine-tuned Google’s Gemma-4-31B-Instruct using LoRA (Low-Rank Adaptation) on 4,178 instruction-tuning examples extracted from the IrregularChat community’s wiki, Q&A, news summaries, PDF library, breakout room summaries, TLDR summaries, and TIL entries.

Detail	Value
Base model	gemma-4-31b-it (31B params, multimodal)
Method	LoRA fine-tune (267M trainable / 31.5B total = 0.85%)
Training data	4,178 instruction pairs (10 MB JSONL)
Hardware	2x NVIDIA B200 (192 GB HBM3e each, uncapped power)
Training time	1 hour 34 minutes (747 steps)
Final eval loss	2.454
Tool	Unsloth for loading, HuggingFace PEFT + Trainer for DDP

Data Sources

What We Collected

Training data was drawn from community resources including public wikis, curated Q&A, AI-generated summaries, PDF libraries, and internal documentation. Data sources:

Source	Records	Raw Size	Content
Tagged community content	Questions, answers, TILs, events, notes	—	Structured entries tagged by type via Signal bot
Shared links	Publicly available URLs shared in channels	—	Links to articles, tools, and resources
News link summaries	2,556	2.9 MB	AI-generated article summaries
Wiki (Irregularpedia)	399 pages	1.9 MB	Curated knowledge base
PDF library	145 unique PDFs	31.3 MB extracted	Military manuals, drone reports, OSINT guides, cybersecurity docs
Outline docs	269 docs	1.5 MB	Internal team documentation (member-accessible wiki)
Q&A	174 questions, 63 answers	82 KB	Community Q&A with voting
Breakout room summaries	—	—	AI-generated summaries derived from group discussion sessions
TLDR summaries	—	—	AI-generated summaries of shared content
TIL entries	34	25 KB	“Today I Learned” snippets
Archived wiki	1,610 pages	10.6 MB	Historical wiki content

Data Pipeline

Step 1: Export from Signal Bot Database

The Signal bot stores community data across several PostgreSQL tables. Each structured content type has its own table:

# Run on the server hosting signal-bot-postgres
# Export structured Q&A
docker exec signal-bot-postgres psql -U signal_bot -d signal_bot \
  -c "COPY (
    SELECT question_text, category, to_timestamp(created_at/1000)::text as ts
    FROM q_and_a_questions WHERE question_text IS NOT NULL
  ) TO STDOUT WITH (FORMAT csv, HEADER true)" > qa_questions.csv

# Export news link summaries
docker exec signal-bot-postgres psql -U signal_bot -d signal_bot \
  -c "COPY (
    SELECT title, summary, url
    FROM news_links WHERE summary IS NOT NULL
  ) TO STDOUT WITH (FORMAT csv, HEADER true)" > news_links.csv

Same pattern for TIL entries, breakout room summaries, and other structured sources.

Key tables:

q_and_a_questions / q_and_a_answers — structured Q&A
news_links — shared articles with title, summary
today_i_learned — original_messages + ai_summary
breakout_rooms — executive_summary + detailed_summary (AI-generated from session messages)

Step 2: Export from Outline

Outline uses camelCase columns in PostgreSQL:

docker exec norequirement_postgres psql -h localhost -U outline_user -d outline_db \
  -c "COPY (
    SELECT title, text, \"urlId\"
    FROM documents
    WHERE \"deletedAt\" IS NULL AND length(text) > 50
  ) TO STDOUT WITH (FORMAT csv, HEADER true)" > outline.csv

Step 3: Extract Text from PDFs

The community file share (/datadrive/IrregularChat/) contains 163 PDFs across 16 topic categories. Many are duplicated across categories.

pip install pymupdf

python3 extract-pdfs.py \
  --input /path/to/pdfs \
  --output extracted/pdfs.jsonl \
  --min-chars 200  # Skip scanned/image PDFs

The extraction script:

Recursively finds all PDFs
Extracts text with pymupdf (fitz)
Deduplicates by content hash (SHA-256)
Outputs JSONL with category, filename, and text

Result: 145 unique PDFs extracted (40 too short/scanned, 173 duplicates removed), yielding 31.3 MB of text.
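The filtering and deduplication steps described above can be sketched as follows. This is a minimal illustration and not the actual extract-pdfs.py: the function names `dedupe_extracted` and `to_jsonl` and the record layout are assumptions, and the pymupdf text extraction is left out so the example runs with the standard library only.

```python
import hashlib
import json


def dedupe_extracted(records, min_chars=200):
    """Drop short texts and exact duplicates (by SHA-256 of the text).

    `records` is an iterable of dicts with "category", "filename" and "text"
    keys (a hypothetical layout mirroring the JSONL fields listed above).
    Returns (kept_records, n_too_short, n_duplicates).
    """
    seen = set()
    kept, too_short, duplicates = [], 0, 0
    for rec in records:
        text = rec["text"].strip()
        if len(text) < min_chars:
            # Scanned/image-only PDFs typically yield little or no text
            too_short += 1
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            # Same content filed under another category or filename
            duplicates += 1
            continue
        seen.add(digest)
        kept.append({**rec, "text": text})
    return kept, too_short, duplicates


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Hashing the extracted text rather than the file bytes means two copies of the same document are treated as duplicates even if their PDF metadata differs; whether the real script hashes text or bytes is not stated in this writeup.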

Step 4: Format as Instruction-Tuning Data

Convert all sources into chat-format JSONL:

{
  "messages": [
    {"role": "system", "content": "You are IrregularChat Assistant..."},
    {"role": "user", "content": "What is C-UAS?"},
    {"role": "assistant", "content": "Counter-Unmanned Aircraft Systems (C-UAS)..."}
  ]
}

Formatting strategies by source:

Source	User Prompt	Assistant Response
Wiki pages	“What is {title}?” / “Explain {title}”	Page content (max 4000 chars)
Q&A	Actual question text	Best/longest answer
News	“Summarize this: {title}”	AI summary
PDFs	“What does ‘{document}’ say?”	Document text (chunked at 3500 chars)
TIL	“What did the community learn?”	AI summary or original entry
Breakout summaries	“What was discussed in {session}?”	AI-generated session summary
TLDR summaries	“Summarize {content}”	AI-generated content summary
Tagged notes	“What did the community share about {topic}?”	Tagged content entries

PDFs get chunked into multiple training examples — a 50-page report becomes 10+ instruction pairs.
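As a rough sketch of how one PDF turns into several chat-format records, assuming a plain fixed-width character split (the real formatter may split on section or paragraph boundaries instead). The names `chunk_text` and `pdf_to_examples` are hypothetical, and the system prompt is left elided as it is in the example record above.

```python
SYSTEM_PROMPT = "You are IrregularChat Assistant..."  # elided in the source


def chunk_text(text, size=3500):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def pdf_to_examples(document, text, size=3500):
    """Build one chat-format training record per chunk of a PDF's text."""
    examples = []
    for chunk in chunk_text(text, size):
        examples.append({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"What does '{document}' say?"},
                {"role": "assistant", "content": chunk},
            ]
        })
    return examples
```

With this split, a document with 35,000 characters of extracted text yields 10 records, which matches the order of magnitude quoted for a 50-page report.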

Final dataset: 3,970 train + 208 validation = 4,178 records (10 MB JSONL).

Model Selection

Why Gemma-4-31B

We evaluated models available on the server:

Model	Params	Type	Issue
Qwen2.5-VL-72B	72B	Vision+Language	OOM even in 4-bit with Unsloth; bitsandbytes incompatible with B200 (Blackwell)
Devstral-2-123B	123B	Code	Requires transformers 5.0+ (too new for most tools)
gpt-oss-120b	120B	MoE	Requires transformers 4.55+ (bleeding edge)
gemma-4-31b-it	31B	Multimodal	Fits on 1-2 GPUs, well-supported, good general quality

Gemma-4 Quirks

Architecture: Gemma4ForConditionalGeneration (multimodal wrapper around text decoder)
Requires mm_token_type_ids field in training data (all zeros for text-only)
Custom Gemma4ClippableLinear layers break some PEFT/heretic versions
AutoModelForCausalLM won’t load it in older transformers — need 5.5.0+
Unsloth handles it via FastLanguageModel but ignores CUDA_VISIBLE_DEVICES

Training

Tool Comparison

Tool	Speed	Multi-GPU	VRAM	Issue We Hit
Unsloth	3x faster	Yes (since Dec 2025)	70% less	Ignores CUDA_VISIBLE_DEVICES, can’t colocate with other processes
Vanilla PEFT	Baseline	Yes (DDP)	Baseline	Works but 2-4x slower than Unsloth
PEFT + DDP	2x (per GPU added)	Yes	Full copy per GPU	Needs `find_unused_parameters=True` for multimodal models

What Worked: 2-GPU DDP with Vanilla PEFT

After multiple OOM errors and device mapping issues with Unsloth, the winning config was vanilla HuggingFace PEFT with torchrun DDP on 2 GPUs:

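# Key settings
import os

# Set before CUDA is initialized; the launch command below also sets it
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"

import torch
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM, TrainingArguments

# torchrun exports LOCAL_RANK for each worker process
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# MODEL_PATH points at the local gemma-4-31b-it checkpoint
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map={"": local_rank},  # Each GPU gets full copy
    attn_implementation="sdpa",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,
    lora_alpha=64,       # alpha/r = 2.0 (standard heuristic)
    lora_dropout=0,      # Unsloth default; consider 0.05 for small datasets
    target_modules="all-linear",
    bias="none",
)

# Must enable for multimodal models (vision encoder params unused in text training)
training_args = TrainingArguments(
    ddp_find_unused_parameters=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    # Effective batch = 2 * 2 GPUs * 4 accum = 16
)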

Custom data collator required for Gemma-4:

from dataclasses import dataclass
from typing import Any

import torch


@dataclass
class GemmaCollator:
    """Pads a batch and carries mm_token_type_ids, which Gemma-4 requires."""
    tokenizer: Any
    pad_to_multiple_of: int = 8

    def __call__(self, features):
        # Round the longest sequence up to a multiple of pad_to_multiple_of
        max_len = max(len(f["input_ids"]) for f in features)
        max_len = ((max_len + self.pad_to_multiple_of - 1) //
                   self.pad_to_multiple_of) * self.pad_to_multiple_of

        pad_id = self.tokenizer.pad_token_id
        if pad_id is None:
            pad_id = 0
        batch = {k: [] for k in ["input_ids", "attention_mask", "labels", "mm_token_type_ids"]}

        for f in features:
            pad_len = max_len - len(f["input_ids"])
            batch["input_ids"].append(f["input_ids"] + [pad_id] * pad_len)
            batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad_len)
            # Padding positions never contribute to the loss
            batch["labels"].append(f["labels"] + [-100] * pad_len)
            # All zeros for text-only examples
            batch["mm_token_type_ids"].append(f["mm_token_type_ids"] + [0] * pad_len)

        return {k: torch.tensor(v) for k, v in batch.items()}

Launch Command

CUDA_VISIBLE_DEVICES=5,6 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --nproc_per_node=2 train.py 2>&1 | tee training.log

Training Results

Metric	Value
Steps	747
Epochs	3
Final train loss	10.86 (avg)
Final eval loss	2.454
Last step loss	8.553
Training time	1h 34m
Avg step time	7.53 s/step
Trainable params	266,963,456 (0.85%)
GPU memory	~149 GB per GPU (of 192 GB available)
GPU utilization	98-100%
Power draw	535-679W per GPU (uncapped)
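The step count and wall-clock time in the table are consistent with the batch settings shown earlier. The check below assumes that a partial final accumulation window counts as one optimizer step (hence the `ceil`); the helper name `optimizer_steps` is illustrative.

```python
import math


def optimizer_steps(n_examples, per_device_batch, n_gpus, grad_accum, epochs):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * n_gpus * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = optimizer_steps(3970, 2, 2, 4, 3)
print(effective_batch, total_steps)        # 16 747
print(round(total_steps * 7.53 / 60))      # 94 (minutes, about 1h 34m)
```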

Loss progression:

Step 5: 77.0 (initial, high)
Step 20: 19.5 (learning rate warmup)
Step 50: ~12 (converging)
Step 300: ~9.5 (stable)
Step 745: 8.55 (final)

The high initial loss (77.0) and the gap between train and eval loss (8.55 train vs 2.454 eval) indicate that the training loss was likely computed over the full sequence, prompt tokens included, rather than with prompt positions masked to -100 before they reach the custom GemmaCollator. For a 262K-token vocabulary, a uniform random predictor has a cross-entropy of ln(262,144) ≈ 12.48. An initial loss of 77.0 is far above that reference point (the value is not a hard ceiling, since a confidently wrong model can exceed it), which points to loss being computed over tokens the model has no reason to predict well, such as system and user prompt tokens. The eval loss of 2.454 is plausible for an instruct model on in-domain data.

Confirmed in v2 attempt: Inspection of train_ddp2.py showed labels = input_ids.copy() with no -100 masking — the loss was indeed computed over the full sequence. Additionally, the response-template literature suggested using <start_of_turn>model\n, but Gemma-4’s actual chat template emits <|turn>model\n and <turn|> — so even if a completion-only collator had been used, it would have masked nothing.
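For contrast with labels = input_ids.copy(), the sketch below shows what prompt masking looks like for a single-turn example: every position up to and including the response template is set to -100, so only the assistant response contributes to the loss. The helper names are hypothetical, and the token IDs in any usage would come from tokenizing the real response template.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy


def find_sublist(haystack, needle, start=0):
    """Return the first index >= start where `needle` occurs, or -1."""
    n = len(needle)
    for i in range(start, len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def mask_prompt_tokens(input_ids, response_template_ids):
    """Build labels that train only on tokens after the response template.

    If the template is not found, every position is masked, which a
    mask-fraction check can then catch before training starts.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = find_sublist(input_ids, response_template_ids)
    if pos == -1:
        return labels
    start = pos + len(response_template_ids)
    labels[start:] = input_ids[start:]
    return labels
```

A wrong template (for example one tokenized from <start_of_turn>model\n when the chat template emits <|turn>model\n) falls into the not-found branch for every example, which is exactly the failure described above.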

Post-Training: Merge LoRA

Merge the adapter back into the base model for serving:

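import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "google/gemma-4-31b-it"

model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # merge on CPU
)
model = PeftModel.from_pretrained(model, "/path/to/lora-adapter")
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
model.save_pretrained("/path/to/merged")

# Save the tokenizer next to the weights so the merged directory is self-contained
AutoTokenizer.from_pretrained(BASE).save_pretrained("/path/to/merged")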

Abliteration (Refusal Removal)

What Is It

Abliteration removes the “refusal direction” from a model’s activation space — the internal vector that causes it to say “I can’t help with that.” For a community focused on cybersecurity, drones, OSINT, and military tech, stock instruct models refuse too many legitimate domain questions.

What We Tried

Method	Result
Manual mlabonne method (20 prompts, all layers)	Model output became gibberish (`l l-'-'- l'--`)
Heretic (automated Bayesian)	Can’t load Gemma-4 (PEFT version mismatch)
TrevorJS/gemma-4-abliteration	Can’t load Gemma-4 (same PEFT issue)

Why Manual Abliteration Failed

Only 20 contrastive prompts — need 256-800 for a reliable direction estimate
Applied to ALL 60 layers indiscriminately — no per-layer calibration of refusal direction strength
No Winsorization — large LLMs (including Gemma) produce high-magnitude activation outliers (Sun et al. 2024) that corrupt mean calculations. This is the documented cause of gibberish output on Gemma models
Single global direction — Gemma benefits from per-layer refusal direction estimation
No norm preservation — raw projection removal distorts weight row norms

Correct Approach (For Future Reference)

Based on TrevorJS’s work (3.2% refusal rate, 0.124 KL divergence on gemma-4-31b) and grimjim’s norm-preserving biprojected abliteration:

256-800 contrastive prompts (use mlabonne/harmful_behaviors + mlabonne/harmless_alpaca)
Winsorize activations at 99.5th percentile before computing means
Per-layer refusal directions (compute per-layer, not one global direction)
Orthogonalize against harmless mean (biprojection)
Norm-preserving weight modification — Magnitude-Preserving Orthogonal Ablation (MPOA): decompose row norms, ablate direction only, recompose
Apply to o_proj and mlp.down_proj across layers (TrevorJS applied to all 60 layers; smaller models may benefit from targeting middle-to-late layers only)

Alternative: Pre-Abliterated Base

TrevorJS/gemma-4-31B-it-uncensored is a pre-abliterated version of the same base model. Our LoRA adapter can be applied on top of it instead of the stock model.

Serving

vLLM

# From venv (Docker image may be too old for Gemma-4)
pip install vllm

CUDA_VISIBLE_DEVICES=5,6 python3 -m vllm.entrypoints.openai.api_server \
    --model /path/to/merged \
    --host 127.0.0.1 \
    --port 8002 \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --trust-remote-code \
    --gpu-memory-utilization 0.80

Gotcha: Cloudflare WARP hijacks all routing tables. Traffic sourced from a public IP gets routed through the WARP tunnel instead of the physical interface. Disconnect WARP or configure split tunneling before testing.

MCP Server for RAG

We built an MCP server (apps/search-mcp/) that wraps the search service at search.irregulars.io, providing Claude Code and Claude Desktop with community knowledge retrieval:

{
  "mcpServers": {
    "irregularchat-search": {
      "command": "node",
      "args": ["/path/to/apps/search-mcp/dist/index.js"],
      "env": {
        "IRREGULARCHAT_SEARCH_TOKEN": "your-token"
      }
    }
  }
}

Lessons Learned

What Worked

LoRA fine-tuning on community data is highly effective for domain adaptation — the model learns the community’s voice, topics, and terminology
PDF extraction was the biggest data source by volume — military manuals and technical reports provide dense, high-quality training signal
Deduplication by content hash eliminated 173 duplicate PDFs filed across multiple topic categories
DDP on 2 GPUs gave 2.7x speedup over single GPU with minimal code changes
Packing reduces total steps dramatically (747 → 96) but increases VRAM usage per step

What Didn’t Work

Unsloth ignores CUDA_VISIBLE_DEVICES — can’t colocate with other GPU processes on a shared server
bitsandbytes 4-bit quantization crashed on B200 (Blackwell) GPUs with CUDA illegal memory access (as of early 2025; later releases added sm_100 support)
USB-C ethernet adapters use cdc_ncm driver that shows link UP but doesn’t pass traffic
Assigning same IP to multiple interfaces causes ARP confusion — router sends packets to random MACs
Cloudflare WARP hijacks all routing via policy table 65743, breaking direct ISP connections
Naive abliteration (few prompts, all layers, no Winsorization) destroys model output on Gemma

Key Numbers

Metric	Value
Total raw data collected	~730 MB files + ~33 MB structured text
Usable extracted text	~80 MB after dedup and filtering
Training examples	4,178
Training cost	$0 (own hardware)
Training time	1h 34m on 2x B200
LoRA adapter size	1 GB
Merged model size	59 GB
Base model size	59 GB

File Locations on Obelisk

/workspace/irregularchat-corpus/
    training/
        train.jsonl          # 3,970 training examples
        val.jsonl            # 208 validation examples
    db/                      # Signal bot DB exports
    outline/                 # Outline wiki exports
    extracted/
        pdfs.jsonl           # Extracted PDF text
    wiki/                    # Irregularpedia markdown
    wiki-archived/           # Old wiki content

/workspace/irregularchat-model/
    lora-adapter/            # LoRA weights (1 GB)
    merged/                  # Full merged model (59 GB)
    training.log             # Training output
    checkpoint-400/          # Mid-training checkpoint
    checkpoint-747/          # Final checkpoint

v2 Training Attempt (May 2026)

When testing the v1 model in production, members noticed that it answered domain questions no better than the stock base model. Investigation showed that the problem was not training time or hardware: v1 was not optimizing the loss we thought it was.

Why v1 didn’t move the needle

A close read of the v1 training script and corpus surfaced five compounding issues:

Prompt tokens were never masked. labels = input_ids.copy() meant the loss was computed over the system prompt and user question, not just the assistant response. The model spent most of its gradient signal memorizing the (always-identical) system prompt instead of learning content.
Response template wouldn’t have matched anyway. Even with a completion-only collator, the literature pointed at <start_of_turn>model\n while Gemma-4’s real chat template emits <|turn>model\n. The mask would have been empty either way.
Corpus shape was wrong for knowledge injection. Of 3,970 training examples, 60% were Summarize this: ARTICLE_TITLE → summary and 29% were What does doc 'X' say? (section N) → text chunk. Only 2.7% were natural “What is X?” pairs. Real users never phrase questions like the training data, so the adapter only fires for exact-match templates.
System prompt was byte-identical across all 3,970 examples. With the masking bug, the model effectively required that exact 350-character preamble to “enter IrregularChat mode.”
LoRA rank was sized for behavior, not facts. r=32 on a 31B model (0.85% trainable) is enough to shift tone but not to inject ~80 MB of community knowledge. Published results suggest r≥64 + more epochs OR continued pretraining for facts.

What v2 changes

Area	v1	v2
Loss masking	Full sequence (bug)	`DataCollatorForCompletionOnlyLM` with dynamically detected response template
Mask sanity check	None	Aborts if mask_frac < 0.30 or > 0.95 before training starts
Response template	Hardcoded (wrong)	Detected at startup by diffing `apply_chat_template` outputs
Corpus shape	60% “Summarize this:” / 29% “What does doc X say?”	Rule-based paraphrasing → 3 natural questions per source
System prompts	1 identical preamble × 3,970	8 rotating templates
Training examples	3,970 train / 208 val	10,435 train / 375 val (after paraphrasing + dedup + short-fragment drop)
LoRA rank	r=32, α=64	r=128, α=128
Sharding	DDP (full copy per GPU)	FSDP (sharded across GPUs)
Packing	None	`packing_strategy="wrapped"`
Max sequence	1024 tokens	2048 tokens
GPU budget	2 B200 uncapped (~600W each)	6 B200 at 800W cap, alongside Vibe serving
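The mask sanity check in the table can be sketched as below, taking mask_frac to mean the fraction of label positions set to -100 in a sample of collated examples. The function names are illustrative, and the thresholds are the ones quoted in the table.

```python
def mask_fraction(labels_batch, ignore_index=-100):
    """Fraction of label positions that are masked out of the loss."""
    total = sum(len(labels) for labels in labels_batch)
    if total == 0:
        return 1.0  # nothing to train on counts as fully masked
    masked = sum(
        1 for labels in labels_batch for label in labels if label == ignore_index
    )
    return masked / total


def check_mask_fraction(labels_batch, low=0.30, high=0.95):
    """Raise before training if too little or too much of the batch is masked."""
    frac = mask_fraction(labels_batch)
    if not (low < frac < high):
        raise ValueError(
            f"mask_frac={frac:.2f} outside ({low}, {high}): "
            "prompt masking is probably broken"
        )
    return frac
```

A value near 0 means prompts are not being masked at all (the v1 bug); a value of 1.0 means the response template was never found, so no token would be trained on.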

Six bugs encountered while wiring up v2

Every one of these would have silently mis-trained (or mis-measured) in v1. They surfaced as crashes in v2 because of the sanity checks, not because v2 introduced them:

trl 0.19 forces padding_free when packing=True with default ffd strategy → can’t pass a custom collator. Fix: packing_strategy="wrapped".
PEFT’s FSDP auto-wrap reads FSDP_TRANSFORMER_CLS_TO_WRAP env var, not HF Trainer’s fsdp_config dict. The two configs don’t cross over. Fix: set env var in the launcher per profile.
Gemma-4’s decoder layer is named Gemma4TextDecoderLayer in transformers 5.5+, not Gemma4DecoderLayer as in older versions or older Gemma generations.
Gemma-4 IT’s apply_chat_template(..., add_generation_prompt=True) injects a reasoning-channel prefix (<|turn>model\n<|channel>thought\n<channel|>) that never appears in training-text rendering. Detecting the response template from this gives a template that matches nothing. Fix: probe with sentinel user+assistant content and slice between them, anchored on the last <...> marker.
Gemma4ForConditionalGeneration requires mm_token_type_ids in every batch even for text-only training. trl’s data collator doesn’t emit this field. Fix: thin wrapper around the collator that injects zeros.
Sanity check at mask_frac < 0.30 is one-sided — mask_frac == 1.0 means “response template not found in any example” which is equally broken. Fix: assert 0.30 < mask_frac < 0.95.
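The sentinel-based template detection from the fourth item can be sketched as follows. It assumes the conversation has already been rendered to text with `tokenizer.apply_chat_template(messages, tokenize=False)` and no generation prompt, using the two sentinel strings as the user and assistant content. The sentinel values, the marker regex, and the function name are all assumptions made for illustration.

```python
import re

USER_SENTINEL = "@@USER_SENTINEL@@"
ASSISTANT_SENTINEL = "@@ASSISTANT_SENTINEL@@"
MARKER = re.compile(r"<[^<>\s]+>")  # matches control markers such as <|turn>


def detect_response_template(rendered):
    """Extract the text that introduces the assistant turn.

    Takes the text between the user and assistant sentinels and keeps
    everything from the last <...> marker onward, so the end-of-turn
    marker of the user message is not part of the template.
    """
    after_user = rendered.index(USER_SENTINEL) + len(USER_SENTINEL)
    before_assistant = rendered.index(ASSISTANT_SENTINEL, after_user)
    between = rendered[after_user:before_assistant]
    markers = list(MARKER.finditer(between))
    if not markers:
        raise ValueError("no <...> marker between user and assistant content")
    return between[markers[-1].start():]
```

For a rendering shaped like `<|turn>user\n@@USER_SENTINEL@@<turn|>\n<|turn>model\n@@ASSISTANT_SENTINEL@@<turn|>\n`, this returns `<|turn>model\n`. Because the template is read from rendered training text rather than from the generation prompt, the reasoning-channel prefix never enters it.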

Methodology improvements

Offline preflight before relaunch. After each diagnosed bug, run the collator, tokenizer, and template detection on a real training example before spending GPU time on a full model load. This took six attempts in 25 minutes, versus roughly 60 minutes (six attempts × ~10 minutes) if each attempt had needed a full model load to surface the next bug.
Dynamic GPU detection in the launcher. Picks all GPUs with ≥60 GB free (≥150 GB for Mistral). Combined with stopping unused vLLM endpoints (vllm-irregularchat, vllm-devstral-rod, vllm-qwen36) freed 6 GPUs while keeping Vibe (Mistral-Medium-3.5 on GPUs 4,5) untouched.
Power cap raised from 700W → 800W persisted via gpu-power-cap.service. 6 training GPUs at 700W + 2 serving GPUs at 800W = ~5,800W under the 9,600W breaker.
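A minimal sketch of the free-GPU selection, assuming the launcher reads the output of `nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits` (which reports free memory in MiB). The function name `pick_free_gpus` is hypothetical, and the actual launcher is a shell script whose logic may differ.

```python
def pick_free_gpus(smi_csv, min_free_gb=60):
    """Return indices of GPUs with at least `min_free_gb` of free memory.

    `smi_csv` is the text printed by the nvidia-smi query above, one
    "index, free_mib" pair per line. MiB are converted with a factor of
    1024, so the threshold is effectively in GiB.
    """
    picked = []
    for line in smi_csv.strip().splitlines():
        index, free_mib = (field.strip() for field in line.split(","))
        if int(free_mib) / 1024 >= min_free_gb:
            picked.append(int(index))
    return picked


def visible_devices(gpu_indices):
    """Format GPU indices for the CUDA_VISIBLE_DEVICES variable."""
    return ",".join(str(i) for i in gpu_indices)
```

The result of `visible_devices(pick_free_gpus(output))` can be exported as CUDA_VISIBLE_DEVICES, and the length of the list passed to torchrun as nproc_per_node.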

v2 pipeline (run sequentially)

/workspace/irregularchat-corpus/launch_v2.sh gemma-v1data    # isolate training-config gains
/workspace/irregularchat-corpus/launch_v2.sh gemma-v2data    # isolate data-shape gains
/workspace/irregularchat-corpus/launch_v2.sh mistral-v2data  # does bigger base eat the data better?

Each profile auto-detects free GPUs, sets CUDA_VISIBLE_DEVICES, exports FSDP_TRANSFORMER_CLS_TO_WRAP, and launches torchrun with the right nproc_per_node. The 3-run comparison answers all three failure-mode questions independently.

Evaluation Methodology

A corpus-quality fine-tune is only useful if it answers community questions in real user phrasing — not in the “Summarize this:” template that dominated v1’s training data. The eval set must be authored without reference to the corpus formatting.

Wiki-grounded evaluation prompts

These 25 prompts are sourced from actual Irregularpedia pages that were part of the training corpus. Each one asks a question a community member might plausibly ask in chat, not in the training-corpus templates. The expected answer should reference concepts from the cited wiki page.

#	Prompt	Source page
1	I just got an email saying my account was breached — where do I start?	`cybersecurity/cyber-incident-response-guide-personal`
2	What’s a good first step if I think my phone is compromised?	`cybersecurity/cyber-incident-response-guide-personal`
3	How do I run a radio check on an RTL-SDR?	`radio/radio-checks`
4	What command lists USB-connected SDRs on Linux?	`radio/radio-checks`
5	Walk me through prepping for a RIGEX exercise.	`military/airborne-equipment-rigging`
6	What’s the MC-6 nomenclature I need to know for jumpmaster?	`military/airborne-equipment-rigging`
7	What’s the difference between Monero and Bitcoin in terms of privacy?	`privacy/monero`
8	Where can I buy XMR with USD?	`privacy/monero`
9	What’s prompt engineering and why does it matter for working with LLMs?	`ai-ml/ai-prompting`
10	How should I structure a prompt to avoid ambiguity?	`ai-ml/ai-prompting`
11	What is C-UAS?	`general/large-language-models` or any C-UAS-tagged content
12	What does the community use Flipper Zero for?	`radio/flipper-zero`
13	How does the community handle email hardening?	`cybersecurity/email-hardening-guide`
14	What’s red-teaming in a cyber context?	`cybersecurity/cyber-red-teaming`
15	What’s the IrregularChat login flow?	`general/the-irregularchat-login`
16	Is there a community guide to running Protonmail Bridge on Linux?	`privacy/protonmail-bridge-on-linux`
17	What’s DragonOS used for?	`radio/dragonos`
18	How do I get started with software-defined radio?	`radio/software-defined-radios-sdrs`
19	What are the IrregularChat hackathons about?	`general/irregularchat-hackathons`
20	What’s the community’s recommended approach for self-hosting Nextcloud?	`privacy/service-storage-nextcloud`
21	What 3D printer does the community recommend?	`hardware/3d-printer-recommendation`
22	What’s a cyber deck and why would I build one?	`hardware/cyber-decks`
23	How does the community do archival research?	`research/archival-research`
24	What ham radio resources does IrregularChat recommend?	`radio/ham-radio`
25	What’s the AI ethics stance for community-built AI tools?	`ai-ml/ai-ethics`

Eval execution

For each candidate model (base, v1-LoRA, v2-on-v1data, v2-on-v2data, mistral-v2data), serve via vLLM and run eval_v2.py against the OpenAI-compatible endpoint. The script writes one JSONL record per prompt with {i, q, answer, ms}.

python3 /workspace/irregularchat-corpus/eval_v2.py \
  --base-url http://localhost:8000/v1 \
  --model <model_name> \
  --out /workspace/irregularchat-model/eval/<run>.jsonl

Scoring rubric

For each (prompt, answer) pair, score 0–3:

0 — Refusal or unrelated. “I can’t help with that” or wanders into a different topic.
1 — Generic web-pretrained answer. Correct on the topic but ignores the community’s specific tools, conventions, or page content.
2 — Domain-aware. References at least one community-specific concept (tool name, command, person, page) even if not perfectly aligned.
3 — Community-grounded. Answer reads like it came from someone who has read the cited wiki page; cites or paraphrases specific content.

A v2 model that beats v1 should land more answers in the 2–3 band, with fewer 0–1 results on prompts whose source page was part of training data. A useful sanity check: the base model should score 1 most of the time on these prompts — if it scores 2+ frequently, the eval is too easy and the wiki content overlaps generic web knowledge.
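Once each answer has been hand-scored, the comparison described above reduces to a per-model summary. The sketch below assumes the scores for one model are collected into a list of integers; the function name and the returned keys are illustrative, and no scores from the actual runs are shown here.

```python
from collections import Counter


def summarize_scores(scores):
    """Summarize rubric scores (0-3) for one model over the eval prompts."""
    if not scores:
        raise ValueError("no scores to summarize")
    if any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("scores must be integers from 0 to 3")
    counts = Counter(scores)
    n = len(scores)
    return {
        "mean": sum(scores) / n,
        # Share of answers that are at least domain-aware (2 or 3)
        "share_2_plus": (counts[2] + counts[3]) / n,
        "counts": {k: counts[k] for k in range(4)},
    }
```

Running this on the base model first gives the sanity check for free: a high `share_2_plus` for the base model would suggest the prompts overlap generic web knowledge.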

What we’ll know at the end

Comparison	Question answered
`base` vs `v1`	Did the original training do anything measurable?
`v1` vs `gemma-on-v1data` (v2 training, v1 corpus)	Did the training-config bugs alone account for v1’s underperformance?
`gemma-on-v1data` vs `gemma-on-v2data`	Did corpus reshaping move the needle further?
`gemma-on-v2data` vs `mistral-on-v2data`	Does a 128B base eat the data better than a 31B base?

If v1 ≈ gemma-on-v1data, the corpus was always the limiter. If gemma-on-v1data > v1 but gemma-on-v2data ≈ gemma-on-v1data, the training bugs were the dominant problem. The four-way split makes the attribution decomposable.

v2 Results (Actual Numbers)

Three v2 runs completed in sequence on 6× B200 (FSDP, packing, completion-only masking). Mistral was dropped because the on-disk Mistral-Medium-3.5-128B is Mistral3ForConditionalGeneration (multimodal) — Mistral 3 has no text-only causal-LM sibling class, unlike Gemma 4’s Gemma4ForCausalLM. The checkpoint is also FP8-quantized, which compounds the issue. See “Operational Pitfalls” below.

Run	Corpus	Recipe	Wall clock	Final train loss	Train tok_acc	Eval tok_acc
gemma-v1data	3,970 examples (v1)	r=128, α=128, 5 epochs	31:24	36.07	0.621	0.591
gemma-v2data	10,435 examples (v2)	r=128, α=128, 5 epochs	1:23:00	50.95	0.587	0.685
gemma-v2.1	10,435 examples (v2)	r=128, α=256, NEFTune α=5, 8 epochs	2:13:17	27.21	0.918	0.640

Key reading of the numbers:

Train loss is not comparable across LoRA scaling settings. v2.1 has loss 27 vs v2’s 51, but the larger α changes absolute loss magnitudes. Eval token accuracy is the comparable signal.
v2 (gemma-v2data) scores about 16% higher in relative terms than the v1-corpus run (gemma-v1data) on held-out token accuracy (0.685 vs 0.591). The corpus reshape paid off: paraphrasing, diverse system prompts, and dropping the summary templates produce a model that handles unseen phrasings better.
v2.1 overfits relative to v2. Despite a much tighter training fit (0.918 train tok_acc), eval dropped to 0.640. The α=256 + NEFTune + 8-epoch combination memorizes the corpus tightly but loses generalization headroom.
Eval loss was nan on every run but token accuracy worked fine. With sequence packing, occasional eval batches end up with mask_frac == 1.0 (no loss-bearing tokens — the response template never appears in a particular packed window), causing the cross-entropy aggregator to divide by zero. Switching to a non-packed eval pass would fix this, but token accuracy is enough for relative comparison.
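The NaN mechanism in the last item can be illustrated without any training framework. This is a simplified model of the failure, not the Trainer's actual aggregation code: one fully masked batch yields an undefined mean, and averaging batch means then propagates it. A token-weighted aggregate is shown as one possible guard; all names are hypothetical.

```python
import math


def batch_mean_loss(token_losses, labels, ignore_index=-100):
    """Mean loss over unmasked tokens; NaN when every token is masked."""
    kept = [l for l, y in zip(token_losses, labels) if y != ignore_index]
    return sum(kept) / len(kept) if kept else math.nan


def mean_of_batch_means(batches):
    """Naive aggregate: a single NaN batch makes the result NaN."""
    means = [batch_mean_loss(losses, labels) for losses, labels in batches]
    return sum(means) / len(means)


def token_weighted_loss(batches, ignore_index=-100):
    """Aggregate over loss-bearing tokens only, ignoring empty batches."""
    total, count = 0.0, 0
    for losses, labels in batches:
        for l, y in zip(losses, labels):
            if y != ignore_index:
                total += l
                count += 1
    return total / count if count else math.nan
```

With one normal batch and one fully masked batch, `mean_of_batch_means` returns NaN while `token_weighted_loss` returns the mean over the tokens that do carry loss.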

The wiki-grounded rubric eval is the actual quality measure — token accuracy on packed batches is a proxy. Results pending at time of writing.

v2.1 Recipe Experiment

After v2 completed, we ran a v2.1 experiment to isolate the impact of recipe upgrades on the same v2 corpus:

Setting	v2	v2.1
LoRA rank	128	128
LoRA alpha	128	256 (α=2r)
Use rsLoRA	No	No (see warning below)
NEFTune α	None	5
Epochs	5	8
Effective LoRA scaling	α/r = 1.0	α/r = 2.0

The α=2r recommendation came from arxiv 2602.04998. NEFTune (ICLR 2024, arxiv 2310.05914) adds uniform random noise to the token embeddings during training; the paper reports gains of roughly 8–35 points on AlpacaEval.

The rsLoRA + α=2r pitfall

Our first v2.1 attempt also enabled use_rslora=True. This destroyed the model. Within 80 steps, train loss climbed past 280 and token accuracy collapsed to 1% — the model was outputting essentially noise.

The math: rsLoRA changes the LoRA scaling factor from α/r to α/sqrt(r). For r=128:

Standard LoRA, α=128: scaling = 1.0
Standard LoRA, α=256: scaling = 2.0 ← v2.1 final recipe
rsLoRA, α=128: scaling = 11.3 (sqrt(128) ≈ 11.3)
rsLoRA, α=256: scaling = 22.6 ← what destroyed the model

The published recommendations for rsLoRA assume you reduce alpha to keep effective scaling in a reasonable range. Stacking α=2r with rsLoRA is 2 × sqrt(r) scaling — never recommended.

Rule of thumb: apply LoRA stability changes one at a time. Each recipe paper assumes the others are at default.
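The four scaling factors listed above follow directly from the two formulas, α/r for standard LoRA and α/sqrt(r) for rsLoRA. A short check, with `lora_scaling` as an illustrative helper name:

```python
import math


def lora_scaling(alpha, r, use_rslora=False):
    """Effective multiplier applied to the LoRA update."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


for alpha in (128, 256):
    for use_rslora in (False, True):
        kind = "rsLoRA" if use_rslora else "standard"
        print(f"{kind:8s} r=128 alpha={alpha}: {lora_scaling(alpha, 128, use_rslora):.1f}")
```

To keep an rsLoRA run at the same effective scaling as standard LoRA with α=2r, alpha would need to be 2·sqrt(r), roughly 22.6 for r=128, rather than 256.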

Loss trajectory comparison

Train loss at comparable steps:

Step	v1-data	v2-data	v2.1
50	46.64	53.16	56.07
100	37.25	49.46	46.18
150	—	44.17	35.31
200	—	76.19 (spike)	—
280	—	40.40	17.58
350	—	—	9.67
Final	36.07	40.40 (step 290)	4.70 (step 460)

Grad-norm stability:

v2: 14 → 759 range, multiple spikes triggered Trainer’s max_grad_norm=1.0 clipping
v2.1: 5 → 130 range, no spikes — the α=2r scaling absorbs more signal without needing violent updates

Operational Pitfalls Encountered

In addition to the six bugs documented under “Six bugs encountered while wiring up v2,” the in-flight runs surfaced four more operational issues. Documenting these so the next iteration doesn’t have to rediscover them.

1. FSDP + PEFT `save_pretrained()` hangs at the end of every run

After every successful training run, the rank-0 process hung at 100% CPU when trainer.model.save_pretrained() ran. PEFT’s save path uses the deprecated state_dict_type() FSDP API (emits FutureWarning every save), gathers all FSDP shards to rank 0, then extracts LoRA weights. The gather completes but the save never returns — at least for our 4.3 GB adapter, the rank-0 process burns CPU for 13+ minutes before manual kill.

Workaround: Skip the final save_pretrained() call entirely. The Trainer’s per-step checkpoint save (which uses a different code path) writes adapter_model.safetensors into checkpoint-N/. Post-training, the launcher copies that out to lora-adapter/:

# In launch_v2.sh, after torchrun exits:
LAST_CKPT=$(ls -1d "$OUT"/checkpoint-* | sort -V | tail -1)
cp "$LAST_CKPT/adapter_model.safetensors" \
   "$LAST_CKPT/adapter_config.json" \
   "$LAST_CKPT/chat_template.jinja" \
   "$LAST_CKPT/tokenizer"*.json \
   "$OUT/lora-adapter/"

The first two runs (v1data, v2data) had to be manually killed + salvaged this way. v2.1 ran cleanly with the patched script.

2. Disk pressure cascades into save failures

Initial training writes accumulated 65 GB in checkpoint directories + 59 GB v1 merged model + 6 GB old checkpoints on a 3.5 TB root volume that was already at 93 GB free. By the time run #1 saved its final checkpoint, the disk was at 100% — save_pretrained() wrote README.md to the new lora-adapter/ dir but couldn’t write adapter_model.safetensors. The hang was actually two issues stacked: PEFT’s slow FSDP gather + a failed write into a full disk.

Layout we ended up with:

/dev/nvme1n1p2   3.5T   /        <- system, /workspace, /workspace/models
/dev/nvme4n1p1   3.5T   /data    <- v2+ training outputs (this is where adapters live)
/dev/nvme2n1p1   3.5T   /scratch <- ext4 created at runtime; merged models go here
nvme3n1          3.5T   raw      <- unformatted, reserved

The migration approach: leave /workspace/models/ (shared base models) where they are, symlink future training outputs to /data, format /dev/nvme2n1 for scratch space, leave one disk unformatted as future-expansion.

Rule of thumb: every training run on 31B-class models needs ≥100 GB free on the write target before it starts. For 128B-class with optimizer state, plan for 300+ GB.
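This rule of thumb is easy to enforce as a preflight check before the run starts. The sketch below uses `shutil.disk_usage` from the standard library; the function names and the injectable `free_gb_fn` parameter (there only to make the check testable) are assumptions, not part of launch_v2.sh.

```python
import shutil


def free_gb(path):
    """Free space on the filesystem containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def require_free_space(path, min_free_gb=100, free_gb_fn=free_gb):
    """Raise before training if the write target has too little free space."""
    available = free_gb_fn(path)
    if available < min_free_gb:
        raise RuntimeError(
            f"{path}: {available:.0f} GB free, need at least {min_free_gb} GB"
        )
    return available
```

Calling `require_free_space("/data", 100)` at the top of the training script would have stopped run #1 before it started, since the root volume had only 93 GB free at that point.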

3. Mistral-Medium-3.5-128B incompatible with our pipeline

The on-disk Mistral checkpoint is Mistral3ForConditionalGeneration — Mistral 3 architecture with Pixtral vision. The Mistral 3 module exposes only:

Mistral3Model
Mistral3PreTrainedModel
Mistral3ForConditionalGeneration

There is no Mistral3ForCausalLM. Gemma 4 has a text-only causal-LM sibling class (which our training path relies on); Mistral 3 doesn’t. The on-disk weights are also FP8-quantized (quant_method: fp8 in config.json), so loading them as bf16 for LoRA training requires a dequantization step that TRL + PEFT don’t perform automatically.
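Both blockers are visible in config.json before any weights are loaded. The sketch below reads the declared architecture and quantization method; it assumes the usual Hugging Face layout in which `quant_method` sits under a `quantization_config` key, and the function name `inspect_checkpoint` is hypothetical.

```python
import json
from pathlib import Path


def inspect_checkpoint(model_dir: str) -> dict:
    """Summarize what a checkpoint's config.json declares about class and quantization."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    architectures = config.get("architectures") or []
    # Assumption: quant_method lives under "quantization_config", as in typical HF configs.
    quant_method = (config.get("quantization_config") or {}).get("quant_method")
    return {
        "architectures": architectures,
        "declares_causal_lm": any(name.endswith("ForCausalLM") for name in architectures),
        "quant_method": quant_method,
    }
```

For the Mistral checkpoint described above, this would be expected to report a conditional-generation architecture, `declares_causal_lm` as false, and `quant_method` as `fp8`.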

Workarounds exist (load the multimodal class, target only text-decoder LoRA modules, dequantize-on-load) but each is 2–4 hours of engineering. Skip for now. A “bigger base” experiment would be better targeted at Qwen3-32B (text-only, Apache 2.0, top instruction-following benchmarks as of 2026) or Llama-3.3-70B-Instruct — neither currently on disk.

4. vLLM not available in the training venv

The training venv at /workspace/irregularchat-corpus/.venv/ has transformers 5.5.0, trl 0.19.1, peft 0.18.1 — but no vLLM. The original eval orchestrator script assumed python -m vllm.entrypoints.openai.api_server would work; it fails with ModuleNotFoundError.

Two options:

Install vLLM in the venv — possible but risks transformers version conflicts with our TRL/PEFT stack
Use transformers.generate() directly — slower per-prompt (~10s vs vLLM’s ~1s) but adequate for offline eval and avoids the dependency

We chose the second. eval_direct.py loads the merged model with device_map="auto" (splits 62 GB across 2 GPUs), runs apply_chat_template + generate() per prompt. 25 prompts × 4 candidates × ~10s = ~17 min of pure generation, plus ~3–5 min per merge step (CPU bf16 load of 31B base + adapter, then merge_and_unload()).
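The time estimate above is simple arithmetic and can be reproduced as follows. The helper name is hypothetical, and the merge count of 3 assumes the base model needs no merge step.

```python
def eval_budget_minutes(prompts, candidates, secs_per_generation, merges, minutes_per_merge):
    """Return (generation_minutes, total_minutes) for an offline eval pass."""
    generation = prompts * candidates * secs_per_generation / 60
    return generation, generation + merges * minutes_per_merge


# 25 prompts x 4 candidates x ~10 s each; 3 merges at 3 to 5 minutes apiece.
generation, best_case = eval_budget_minutes(25, 4, 10, merges=3, minutes_per_merge=3)
_, worst_case = eval_budget_minutes(25, 4, 10, merges=3, minutes_per_merge=5)
print(f"generation: {generation:.1f} min, total: {best_case:.1f} to {worst_case:.1f} min")
```

Under those assumptions the whole pass fits in roughly half an hour, which is why the slower per-prompt path was acceptable.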

Eval Results

The three fine-tuned candidates were merged into bf16. Together with the unmodified base, they were run through transformers.generate() directly (vLLM wasn’t installed in the training venv; transformers worked fine at ~10s per generation) and evaluated against the 25 wiki-grounded prompts. Output JSONLs are at /data/irregularchat-model/eval/.

Heuristic rubric scores

A scoring script counts community-specific markers (tool names, commands, URL paths like /general/, irregularchat) in each answer:

Candidate	Rubric avg (0–3)	Train loss	Train tok_acc	Eval tok_acc
base	2.88	—	—	—
v2-gemma-on-v1data	2.76	36.07	0.621	0.591
v2-gemma-on-v2data	2.80	50.95	0.587	0.685
v2.1-gemma-on-v2data	2.84	27.21	0.918	0.640

All four scores within 0.12 points — within noise. No fine-tune meaningfully beats base on the heuristic scorer. The rubric automation is too lenient because Gemma 4 31B’s prior knowledge already produces fluent, on-topic, keyword-rich answers for most technical questions (SDR, Monero, prompt engineering, OSINT, etc.). The markers can’t distinguish “Gemma knows what an RTL-SDR is” from “the model learned IrregularChat’s specific community conventions.”
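The leniency is easy to see in a toy version of such a scorer. The sketch below is not the actual scoring script: the substring matching, the cap at 3, and the example markers are all assumptions chosen to mirror the 0–3 scale in the table.

```python
def rubric_score(answer: str, markers: list[str], cap: int = 3) -> int:
    """Count how many markers appear in the answer (case-insensitive), capped at `cap`."""
    text = answer.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return min(hits, cap)


# Hypothetical markers for an RTL-SDR prompt: two generic, one community-specific.
markers = ["rtl-sdr", "rtl_test", "/general/"]
generic_answer = "Plug in the RTL-SDR dongle and run rtl_test to check for dropped samples."
print(rubric_score(generic_answer, markers))
```

A textbook answer that never touches the community wiki still collects most of the available points, which is the failure mode described above.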

The Authentik test

The qualitative test that mattered: Q15 asked “What’s the IrregularChat login flow?” The correct answer references Authentik — the SSO system the community actually uses. Result:

Candidate	Mentions Authentik?	What it says instead
base	No	Generic “OAuth via Discord/Google”
v2-on-v1data	No	“Login button, Google/Discord/Apple”
v2-on-v2data	No	“Google/Facebook/Twitter”
v2.1	No	“Magic Links via Supabase Auth, `signInWithOtp()`, `auth.irregularchat.com/auth/v1/callback`” (fabricated specifics)

None of the candidates mention Authentik. All four confidently fabricate. v2.1’s answer is the most concerning — high-confidence specific-looking code snippets, fake URLs, fabricated authentication library. The α=256 + NEFTune + 8-epoch recipe made the model more willing to invent specific-sounding fakes, not better at producing real community content.

Other v2.1 hallucinations from spot-checks:

Q3 (RTL-SDR radio check): cited a non-existent “rtl-bench software suite” with full install commands
Q19 (hackathons): cited “Hack the Polyglot, Feb 21–March 2 2026”. Unverified: this may be genuine memorization of a wiki page or fabricated specifics

Bottom line

LoRA fine-tuning at our scale (r=128, 10K examples, ≤8 epochs) failed at the actual goal: injecting IrregularChat-specific facts. What it produced:

✅ Better instruction-following on technical topics (tone, structure, formatting)
✅ Slight shift toward wiki-like markdown output (section headers, bullet structure)
❌ Zero successful injection of community facts (Authentik, actual hackathon names, community-specific tools, conventions)
⚠️ v2.1 specifically: fabricates community-specific details more confidently than base — net negative for a Q&A bot

This is consistent with the research recommendation in the previous wiki section: SFT teaches behavior; RAG is required for facts at this corpus scale. The model’s pretrained prior dominates; in our runs, a 4 GB adapter did not encode the 10 MB of unique community facts.

Deployment recommendation: v2-on-v2data + RAG (NOT v2.1)

Option	Pros	Cons
Deploy v2.1	Most “wiki-styled” output	Fabricates community facts most confidently — actively harmful for a Q&A bot
Deploy v2-on-v2data	Good train/eval balance, no overfit signature	Marginal lift over base on rubric
Deploy base + RAG	No fabrication risk beyond stock model	Loses slight wiki-format tone improvement
Deploy v2-on-v2data + RAG	Best of both: slight tone improvement + retrieved facts	RAG is doing the heavy lifting

The fine-tune contributes ~10% formatting/tone polish. The retrieval (MCP server at search.irregulars.io) contributes the 90% of “actual community knowledge.”

Next Steps

Quantize v2-on-v2data to AWQ-4bit for deployment. Target: 20 GB on disk, runs on RTX 4090 / single 24GB GPU. QuantTrio/gemma-4-31B-it-AWQ confirms vLLM’s awq_marlin handles Gemma 4.
Wire MCP retrieval at inference. Either system-prompt injection with top-k retrieved docs, or a proper RAG framework (LangChain / LlamaIndex). The MCP server at search.irregulars.io already exists.
Re-run the 25-prompt eval with RAG enabled. This is the comparison that actually matters. The expectation: the Authentik test passes this time. If it doesn’t, the retrieval pipeline needs work, not the model.
Skip further LoRA experiments for knowledge injection. The data point is clear at this scale. Future iterations should focus on either:
- Continued pretraining (CPT) on wiki text in causal-LM mode for many passes. Note that the Synthetic Continued Pretraining paper (ICLR 2025) got its gains by first expanding a small corpus synthetically, so raw wiki text alone may not be enough
- RAFT-style training with distractor passages mixed into the training data, so the model learns to ignore irrelevant context AND rely on parametric fallback gracefully (Berkeley 2024)
- Distillation from a strong RAG-augmented teacher to a smaller student (Qwen3-8B) for cheaper deployment
Don’t deploy v2.1. Its overfitting hurts fact-grounding more than the recipe gains help formatting. The model that looks most polished is the one most likely to confidently mislead users.
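The 20 GB AWQ target in the first step can be sanity-checked with back-of-the-envelope arithmetic: weight storage is parameters times bits per weight, divided by 8. The helper below is a rough sketch that ignores quantization scales, embeddings kept in higher precision, and the KV cache.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    return params_billion * bits_per_weight / 8


print(weight_gb(31, 16))  # bf16 weights for a 31B model
print(weight_gb(31, 4))   # ideal 4-bit weights, before any overhead
```

The bf16 figure matches the 62 GB quoted for the merged model. The ideal 4-bit figure is 15.5 GB, so a 20 GB target leaves a few GB for quantization metadata, and a 24 GB card leaves only a few more for activations and KV cache.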

RAG Validation (BM25 + merged adapter)

After the fine-tune eval confirmed the fact-injection gap, we built a BM25 retriever over the wiki and ran the same 25 prompts with retrieved context injected into the system prompt. Two corpus sizes were tested:

Wiki-only RAG: 386 wiki .md files (the canonical Irregularpedia content).
Full-corpus RAG: 11,627 docs — wiki + 4,253 PDF chunks (mined from v1+v2 training data assistant turns) + 3,673 news summaries (training) + 3,673 news summaries (fresh DB pull) + 47 archived-files AI summaries + 11 daily community rollups + 6 Outline docs.

Eval configuration: top-4 retrievals per prompt, 1500-char snippet limit per doc, injected into the system prompt before the user question. Same 25 wiki-grounded prompts as the no-RAG eval. transformers.generate() direct, bf16, 2-GPU TP.
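The retrieval setup can be sketched in a few dozen lines. This is a minimal illustration, not the retriever we ran: it implements standard Okapi BM25 with conventional defaults (k1=1.5, b=0.75), and the tokenizer, the system-prompt wording, and the snippet separator are all assumptions. Only the top-4 and 1500-character settings come from the configuration above.

```python
import math
import re
from collections import Counter

TOP_K = 4             # retrievals per prompt, as in the eval configuration
SNIPPET_CHARS = 1500  # per-document snippet cap


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Plain Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.docs, self.k1, self.b = docs, k1, b
        self.term_counts = [Counter(tokenize(doc)) for doc in docs]
        self.lengths = [sum(counts.values()) for counts in self.term_counts]
        self.avg_len = (sum(self.lengths) / len(docs)) if docs else 1.0
        self.avg_len = self.avg_len or 1.0  # guard against all-empty documents
        doc_freq = Counter(term for counts in self.term_counts for term in counts)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()}

    def score(self, terms: list[str], i: int) -> float:
        counts = self.term_counts[i]
        norm = self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
        return sum(
            self.idf[t] * counts[t] * (self.k1 + 1) / (counts[t] + norm)
            for t in terms
            if t in counts
        )

    def top_k(self, query: str, k: int = TOP_K) -> list[str]:
        terms = tokenize(query)
        scored = [(self.score(terms, i), i) for i in range(len(self.docs))]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [self.docs[i] for _, i in ranked[:k]]


def build_messages(question: str, retriever: BM25) -> list[dict]:
    """Inject capped snippets into the system prompt, ahead of the user question."""
    snippets = [doc[:SNIPPET_CHARS] for doc in retriever.top_k(question)]
    system = "Answer using the community wiki excerpts below.\n\n" + "\n\n---\n\n".join(snippets)
    return [{"role": "system", "content": system}, {"role": "user", "content": question}]
```

The resulting message list can be passed to `apply_chat_template` in the same way as the no-RAG prompts, so the only variable between the two evals is the injected context.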

Headline result — the Authentik test

	no-RAG	wiki-RAG	full-RAG
base	FAIL	PASS	PASS
v2-v1data	FAIL	—	—
v2-v2data	FAIL	PASS	PASS
v2.1	FAIL	—	—

For Q15 “What’s the IrregularChat login flow?”, every no-RAG configuration fabricated an answer (Discord/Google OAuth, “Magic Links via Supabase Auth,” etc.). Every RAG configuration correctly identified Authentik. This is the cleanest binary signal in the whole experiment.

Heuristic rubric across all 8 conditions

Condition	Hits / 91	%
base no-RAG	55	60.4%
v2-v1data no-RAG	55	60.4%
v2-v2data no-RAG	57	62.6%
v2.1 no-RAG	56	61.5%
base wiki-RAG	48	52.7%
v2-v2data wiki-RAG	55	60.4%
base full-RAG	47	51.6%
v2-v2data full-RAG	56	61.5%

The heuristic rubric (counting topic-keyword hits) shows RAG and no-RAG within noise — RAG sometimes loses generic keyword reflexes that Gemma 4’s prior already has (e.g., Q13 email hardening: no-RAG fluently lists SPF/DKIM/DMARC; RAG focuses on what the wiki page actually says).

The heuristic doesn’t capture correctness. The Authentik test does: all four RAG runs pass and all four no-RAG runs fail. Other community-internal-fact wins (manually verified): Q5 RIGEX (no-RAG 1 marker → RAG 5), Q23 archival research (1 → 2).

Why RAG sometimes loses on the heuristic

For prompts where Gemma 4’s pretraining has strong coverage (email security, generic phone-compromise advice, Monero-vs-Bitcoin), the model produces fluent answers with many topic keywords. The wiki page on the same topic is often shorter and less keyword-dense. RAG redirects the model toward wiki content, which can mean fewer technical-acronym mentions and lower heuristic score — but answers that are actually grounded in what the community says, rather than just generic textbook recall.

This is a methodology lesson, not a result against RAG: marker-counting rubrics over-credit fluent generic answers. For real-world deployment, the user-visible question is “did the bot answer correctly with community-specific information”, and on that axis RAG is decisive.

Source mix observed in full-corpus RAG

Across 100 retrievals (25 prompts × top-4 each) for v2-v2data full-RAG:

Source	Retrievals	Share
wiki	59	59%
news (training-derived summaries)	26	26%
pdf (training-derived chunks)	9	9%
news_db (fresh from signal-bot)	6	6%

41% of retrievals are non-wiki. BM25 naturally surfaces wiki pages first when an exact topical match exists, with PDFs/news as fallback. For the 25 wiki-grounded eval prompts, the wiki-first behavior limits the impact of non-wiki sources — but in production where users ask questions not covered by any wiki page, the broader corpus matters.

Fine-tune × RAG synergy

Comparison	base	v2-v2data	Δ
No-RAG	60.4%	62.6%	+2.2
Wiki-RAG	52.7%	60.4%	+7.7
Full-RAG	51.6%	61.5%	+9.9

The fine-tune’s contribution grows with RAG. Without RAG, v2 is only ~2 points above base. With RAG, v2 is 8–10 points above base. Most of that gap comes from base losing roughly 8–9 points once retrieved context is injected, while v2-v2data stays within about 2 points of its no-RAG score. The likely explanation is that the LoRA learned to use retrieved content idiomatically (wiki-style markdown structure, action-oriented answer format, proper integration of citations) even though it didn’t learn the underlying facts. This is the case for deploying v2-on-v2data over base, conditional on RAG being part of the stack.
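The deltas in the table follow directly from the hit counts in the rubric table (out of 91 markers). The short sketch below recomputes them and also shows how each model moves when RAG is switched on; the variable and function names are illustrative.

```python
TOTAL_MARKERS = 91
HITS = {
    "no-RAG": {"base": 55, "v2-v2data": 57},
    "wiki-RAG": {"base": 48, "v2-v2data": 55},
    "full-RAG": {"base": 47, "v2-v2data": 56},
}


def pct(hits: int) -> float:
    return 100 * hits / TOTAL_MARKERS


def lift(condition: str) -> float:
    """Fine-tune minus base, in percentage points, for one retrieval condition."""
    row = HITS[condition]
    return round(pct(row["v2-v2data"]) - pct(row["base"]), 1)


def rag_shift(model: str, condition: str) -> float:
    """Change in a model's score when moving from no-RAG to the given RAG condition."""
    return round(pct(HITS[condition][model]) - pct(HITS["no-RAG"][model]), 1)


for condition in HITS:
    print(condition, lift(condition))
print("base under full-RAG:", rag_shift("base", "full-RAG"))
print("v2-v2data under full-RAG:", rag_shift("v2-v2data", "full-RAG"))
```

Reading the two shifts side by side makes the source of the widening gap explicit: the base model's heuristic score drops under RAG while the fine-tune's barely moves.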

Final deployment recommendation

Deploy: v2-on-v2data adapter + wiki-only BM25 retrieval.

Component	Choice	Why
Base model	gemma-4-31b-it	Strongest instruction model in our 30B class, available locally
Adapter	`v2-on-v2data/lora-adapter/` (4.28 GB)	Best no-RAG-score fine-tune that also synergizes with RAG; no overfitting signs
Quantization	GGUF Q5_K_M when llama.cpp Gemma-4 support solidifies	AWQ-via-AutoAWQ doesn’t recognize `gemma4` model type (deprecated tool); QuantTrio AWQ is base-only
Retrieval	BM25 over 386 wiki .md files	Full-corpus added marginal lift only; wiki-only is faster and noise-free
Top-K	4 retrievals, 1500-char snippet cap	Tested config; at most 6,000 characters (4 × 1,500) of retrieved context per query
Inference	bf16 on 2× B200 today; target single 24GB GPU after GGUF conversion	Works with current stack; smaller hardware pending quant

Do NOT deploy v2.1 — its overfit recipe makes hallucinations more confident.

Do NOT skip the fine-tune even with RAG — measurably contributes +7-10 points over base+RAG.

Do NOT use full-corpus RAG yet — the wiki is sufficient for the current 25-prompt eval set; the broader corpus will matter for production queries that fall outside wiki coverage.

v3: Switching to Qwen3 + Heretic Abliteration (May 2026)

v2.1 failed through rsLoRA + α=2r overfitting and a NaN eval loss, and Gemma 4’s heavy alignment made “uncensored” responses hard to elicit even on legitimate professional topics. We therefore switched the base model from gemma-4-31b-it to Qwen3-30B-A3B-Instruct-2507 (Alibaba, Apache 2.0). The deployed model is now irregularchat-v3-heretic, running locally on a Mac via Ollama + Open WebUI.

Why switch off Gemma 4

Issue with Gemma 4 31B	Impact
Dense 31B params → ~12-14 tok/s on Apple Silicon Q8	Sluggish interactive use
Most safety-trained of the major open models	Refusals/disclaimers on legitimate technical prompts
`Gemma4ForConditionalGeneration` multimodal scaffolding	GGUF conversion path immature; double-BOS warning at inference
Distinct attention architecture	FSDP + PEFT `save_pretrained()` hangs (Pitfall #1 above)
Abliterated variants poorly preserve LoRA fluency	LoRA-on-abliterated-base = compounding drift (verified empirically)

Why Qwen3-30B-A3B-Instruct-2507

Property	Value
Architecture	MoE — 30.5B total, 3.3B active per token
Inference speed (Q4_K_M on M-series)	~92 tok/s (6.5× faster than Gemma Q8)
Fine-tuning benchmarks	Qwen3 family takes 4 of top 6 fine-tuned-quality spots in published 2026 evals
License	Apache 2.0
Unsloth 2026.4.2 support	First-class — single B200 (180GB) handles the full 30B fine-tune in ~2 hours, no FSDP needed
Default alignment	Less aggressive than Gemma 4; system-prompt jailbreaks more effective
Abliteration ecosystem	Multiple pre-abliterated variants on HF (huihui-ai, mlabonne, DavidAU) plus first-class Heretic 1.3.0 support

Training recipe (v3)

Single-GPU, no FSDP, lessons applied from v2/v2.1:

Parameter	Value	Rationale
Base	`Qwen/Qwen3-30B-A3B-Instruct-2507`	See above
`lora_r`	64	Middle ground between v2 (r=32, underfit) and v2.1 (r=128, exploded)
`lora_alpha`	128	Standard α=2r; intentionally NOT using rsLoRA
`lora_dropout`	0	Required — PEFT `ParamWrapper` for MoE expert layers raises `ValueError` on non-zero dropout. MoE gating provides sufficient regularization.
`target_modules`	`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`	Standard set; Unsloth excludes router by default for MoE
Epochs	4	Reduced from v2’s 5 to lower overfit risk on 10K examples
`per_device_train_batch_size`	4	B200 has plenty of headroom
`gradient_accumulation_steps`	4	Effective batch 16
`learning_rate`	2e-4	Standard for LoRA, cosine scheduler
`max_grad_norm`	0.5	Tighter than default 1.0 — prevents the gradient-explosion failure mode v2.1 hit at step 10
Trainer	Unsloth `FastLanguageModel` + TRL `SFTTrainer`	Single GPU, no FSDP complications
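Two derived quantities in the table are worth making explicit: the effective batch size and the LoRA scaling factor. The sketch below is a plain-Python summary of the recipe, not the training script itself; the class name `V3Recipe` is hypothetical. It uses the standard LoRA scale α/r and, for comparison, the rsLoRA scale α/√r.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class V3Recipe:
    lora_r: int = 64
    lora_alpha: int = 128
    lora_dropout: float = 0.0
    epochs: int = 4
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    num_gpus: int = 1
    learning_rate: float = 2e-4
    max_grad_norm: float = 0.5

    @property
    def effective_batch(self) -> int:
        """Examples consumed per optimizer step."""
        return self.per_device_train_batch_size * self.gradient_accumulation_steps * self.num_gpus

    @property
    def lora_scale(self) -> float:
        """Standard LoRA multiplier on the adapter update: alpha / r."""
        return self.lora_alpha / self.lora_r

    @property
    def rslora_scale(self) -> float:
        """What the multiplier would be under rsLoRA: alpha / sqrt(r)."""
        return self.lora_alpha / math.sqrt(self.lora_r)


recipe = V3Recipe()
print(recipe.effective_batch, recipe.lora_scale, recipe.rslora_scale)
```

With α=2r the standard scale is 2 regardless of rank, whereas the rsLoRA scale grows with √r. For v2.1's r=128 and α=256 that works out to roughly 22.6, which is one way to see why combining rsLoRA with α=2r was so much more aggressive than intended.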

v3 training results

Metric	Value
Total runtime	7,347s (~2.0 hours)
Steps	2,612 (4 epochs × 653 steps/epoch)
Final `train_loss`	0.37
Final `eval_loss`	2.38 (healthy, not NaN like v2.1)
Train/eval generalization gap	~2.0 — reasonable (not overfit, not underfit)
`eval_mean_token_accuracy`	0.89
GPU memory peak	~70 GB on a single B200

Heretic abliteration (v3-heretic)

After v3 training, the merged model went through Heretic 1.3.0 to remove the refusal direction structurally. Heretic automates directional ablation (Arditi et al. 2024) and uses an Optuna TPE search to pick ablation parameters that minimize both refusals and KL divergence from the original model.

Parameter	Value
`n_trials`	30 (Optuna TPE)
`kl_divergence_target`	0.20
Best trial KL divergence	0.0137 (well below the 0.16 reference for Llama-3.1-8B-heretic; model behavior preserved)
Refusal rate vs baseline	100/100 → 95/100 on `mlabonne/harmful_behaviors`

The refusal-rate reduction on the standard harmful-behaviors benchmark was minimal. Our working explanation is that Qwen3’s safety pathways are distributed across MoE experts and that 30 trials wasn’t enough to untangle them. For the actual use case (military/OSINT/drone technical queries that are not in the benchmark), post-deploy spot checks no longer showed moralizing or “I cannot assist” responses.

Critical incompatibility: Heretic’s interactive TUI

Heretic uses questionary (raw-mode prompt_toolkit) at the end of optimization to interactively prompt for which trial to apply and where to save. This cannot be driven by pexpect / stdin piping in scripted/headless runs — prompt_toolkit requires a real terminal.

Fix: Write a wrapper that monkey-patches questionary.select/path/text/checkbox to return canned responses BEFORE heretic.main.run() is called:

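import questionary

# Patch the prompt functions BEFORE importing heretic.main, so Heretic only ever sees the fakes.
# fake_select_returning_first_trial_then_save, FakeAsker and OUTPUT_DIR are defined in the wrapper.
questionary.select = fake_select_returning_first_trial_then_save
# Accept any extra arguments Heretic passes (default=, only_directories=, ...).
questionary.path = lambda message, *args, **kwargs: FakeAsker(OUTPUT_DIR)

import heretic.main
heretic.main.run()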

Implement fake_select as a small state machine that picks:

Trial selection menu → first (best) trial in the Pareto front
Action menu → “Save the model to a local folder”
Subsequent action prompts → “Return to trial selection menu” → exit

Without this, Heretic completes optimization, sits at the menu, and exits without saving when its parent shell dies — wasting all the compute.
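The helper names in the wrapper snippet above are placeholders, so here is a fuller sketch of what they might look like. Everything in it is an assumption except the menu label “Save the model to a local folder”: the sketch works on any module-like object so it can be exercised without Heretic installed, and it simplifies the final step by raising SystemExit on the first prompt after the save instead of navigating back through the trial menu.

```python
import types

SAVE_ACTION = "Save the model to a local folder"


class FakeAsker:
    """Stands in for a questionary question object: ask() returns a canned value."""

    def __init__(self, value):
        self.value = value

    def ask(self, *args, **kwargs):
        return self.value

    unsafe_ask = ask


def _title(choice):
    # Choices may be plain strings or objects carrying a .title attribute.
    return choice if isinstance(choice, str) else getattr(choice, "title", str(choice))


def _value(choice):
    if isinstance(choice, str):
        return choice
    value = getattr(choice, "value", None)
    return _title(choice) if value is None else value


def make_fake_select():
    """Build a select() replacement: first trial, then save, then exit."""
    calls = {"n": 0}

    def fake_select(message, choices, *args, **kwargs):
        calls["n"] += 1
        if calls["n"] == 1:
            picked = choices[0]  # trial menu: first entry
        elif calls["n"] == 2:
            picked = next(c for c in choices if _title(c) == SAVE_ACTION)
        else:
            raise SystemExit(0)  # simplification: leave once the save has been requested
        return FakeAsker(_value(picked))

    return fake_select


def install_canned_prompts(prompt_module, output_dir):
    """Replace select() and path() on the given module with canned answers."""
    prompt_module.select = make_fake_select()
    prompt_module.path = lambda message, *args, **kwargs: FakeAsker(output_dir)
```

In a real wrapper, `install_canned_prompts(questionary, OUTPUT_DIR)` would be called before `import heretic.main`. Whether the save has fully flushed by the time the next prompt appears is worth confirming on a small model first.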

Local Mac deployment

The production model now runs on a Mac instead of Obelisk:

Component	Path
Merged HF model on Obelisk	`/data/irregularchat-model/v3-heretic/` (53GB safetensors, 13 shards)
bf16 GGUF on Obelisk	`/data/irregularchat-model/v3-heretic-gguf/irregularchat-v3-heretic-bf16.gguf` (57GB)
Q4_K_M GGUF on Mac	`/Users/sac/Models/irregularchat-v3-heretic-Q4_K_M.gguf` (17GB)
Ollama tag	`irregularchat-v3-heretic:latest`
Modelfile	`/Users/sac/Models/Modelfile-v3-heretic`
RAG markdown corpus	`/Users/sac/Models/rag-corpus/wiki-md/` (386 files derived from `wiki.jsonl`)
Open WebUI	`http://127.0.0.1:8080` (Python venv at `/Users/sac/irregularchat-local/.venv-webui/`)

llama.cpp’s convert_hf_to_gguf.py natively supports the Qwen3 MoE architecture — no patches needed (unlike Gemma 4 where we had to wait for the toolchain). Q4_K_M produces a 17GB file that loads in ~30s on an M-series with unified memory, then runs at ~92 tok/s.
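The conversion is a two-step pipeline: convert the merged HF model to a bf16 GGUF, then quantize it to Q4_K_M. The sketch below only builds the two command lines so they can be inspected or passed to `subprocess.run`. The llama.cpp checkout location and the `build/bin` path for `llama-quantize` are assumptions about the local build, and the function name is hypothetical.

```python
def gguf_commands(llama_cpp_dir, hf_model_dir, bf16_gguf, quant_gguf, quant_type="Q4_K_M"):
    """Return argv lists for HF -> bf16 GGUF conversion and subsequent quantization."""
    convert = [
        "python",
        f"{llama_cpp_dir}/convert_hf_to_gguf.py",
        hf_model_dir,
        "--outfile", bf16_gguf,
        "--outtype", "bf16",
    ]
    # Assumes a default CMake build, which places binaries under build/bin.
    quantize = [f"{llama_cpp_dir}/build/bin/llama-quantize", bf16_gguf, quant_gguf, quant_type]
    return [convert, quantize]


commands = gguf_commands(
    "/opt/llama.cpp",  # hypothetical checkout location
    "/data/irregularchat-model/v3-heretic",
    "irregularchat-v3-heretic-bf16.gguf",
    "irregularchat-v3-heretic-Q4_K_M.gguf",
)
for argv in commands:
    print(" ".join(argv))
```

Keeping the bf16 GGUF as an intermediate artifact means other quantization levels can be produced later without repeating the conversion.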

v3 deployment recommendations

Decision	Choice	Why
Base	`Qwen3-30B-A3B-Instruct-2507`	MoE speed + better fine-tuning lift than Gemma at the same param count
Fine-tune	r=64 / α=128 / dropout=0 / 4 epochs / max_grad_norm=0.5	Stable training, no NaN, train_loss=0.37
Abliteration	Heretic 1.3.0 with monkey-patched auto-save	Removes moralizing on professional-context queries; KL preserved
Quantization	Q4_K_M GGUF	Sweet spot — 17GB fits comfortably; ~98% quality vs bf16
Serving	Ollama + Open WebUI Knowledge	Local, fast, RAG via Knowledge collection (see Open WebUI)
Retrieval	Open WebUI built-in (sentence-transformers + sqlite-vec)	Replaces the previous Python BM25 script

The v2-vs-v3 quality bar is: v3 produces fluent, calibrated answers (“I don’t know — based on common military training conventions…”) on out-of-corpus topics, where v2 confidently confabulated (“RIGEX stands for Rapid Interdiction Group Exercise”). RAG remains essential for actual factual content — fine-tuning teaches style, RAG provides facts.
