
Gemma4Bolmor — Postmortem.

Seven specialised LoRAs on a shared base. Turing-grade German prose, not generic LLM output.

Gemma4Bolmor replaces generic LLM German with a verifiable, trained voice. Seven LoRA adapters on a shared Gemma-4 31B base, each tuned to one pipeline role. The target was not "passable" — it was output indistinguishable from human writer references in blind A/B review (val-loss < 0.25, n=40). Getting there meant rebuilding the training-data stack from the ground up.

// failure modes

Failure modes.

01

Zero extractable training pairs for the two most important roles

The Writer S1 and S2 loaders searched for fields named draft and prose. The actual data used text and the nested repair_patch.patched_text. Both writer LoRAs — the models that actually produce prose — trained on zero pairs. Silently.

SEVERITY: blocking · DISCOVERED VIA: training-loss audit · STATUS: FIXED
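The fix pattern can be sketched as a loader that tries every known field name and refuses to succeed on zero pairs. The field names (`draft`, `text`, `repair_patch.patched_text`) come from the postmortem; the function names and record shape here are assumptions.

```python
# Hypothetical sketch: a loader that tries several candidate field paths
# instead of silently yielding nothing when the data schema drifts.

def get_field(record: dict, *paths: str):
    """Return the first value found among dotted candidate paths."""
    for path in paths:
        node = record
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if node is not None:
            return node
    return None

def extract_pairs(records: list[dict]) -> list[tuple[str, str]]:
    pairs = []
    for rec in records:
        prompt = get_field(rec, "prompt", "instruction")
        target = get_field(rec, "draft", "text", "repair_patch.patched_text")
        if prompt and target:
            pairs.append((prompt, target))
    # Fail loudly instead of training on zero pairs.
    if not pairs:
        raise ValueError(f"0 pairs extracted from {len(records)} records")
    return pairs
```

The `raise` is the point: a loader that can return an empty set without complaint is how both writer LoRAs trained on nothing.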
02

Data augmentation pipeline crashed on first call

augment_data.py called randomize_protagonist(). That function never existed — the real name was randomize_all_names(). The pipeline had never successfully augmented anything.

SEVERITY: blocking · STATUS: FIXED
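A call to a function that never existed is exactly what a pre-run smoke check catches. A minimal sketch, assuming nothing about the project beyond the two function names mentioned above:

```python
# Hypothetical smoke check: verify that every function a pipeline script
# references by name actually exists on its module, before a real run.
import types

def check_callables(module: types.ModuleType, names: list[str]) -> list[str]:
    """Return the names that are missing or not callable."""
    return [n for n in names if not callable(getattr(module, n, None))]
```

Run once at pipeline start-up; a nonempty return aborts before any data is touched.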
03

Curation stage that did not curate

curate_source_data.py printed statistics and exited 0 without writing output. Every run looked green; no curation happened.

SEVERITY: blocking · STATUS: FIXED
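The remedy is to make exit 0 conditional on a verified artefact on disk. A minimal sketch; the `curate` callable and paths are assumptions, not the script's actual interface:

```python
# Hypothetical wrapper: exit code 0 is only reachable when an output file
# with a nonzero pair count was actually written and read back.
import json
from pathlib import Path

def run_curation(curate, records: list[dict], out_path: Path) -> int:
    curated = curate(records)
    out_path.write_text(json.dumps(curated, ensure_ascii=False))
    written = json.loads(out_path.read_text())  # verify the round-trip
    if not written:
        raise RuntimeError(f"curation wrote 0 pairs to {out_path}")
    print(f"curated {len(written)}/{len(records)} records -> {out_path}")
    return len(written)
```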
04

Classic overfitting signature in Writer S2

Eval test T01 produced "Ich habe ihn gefunden" twenty times in sequence. Cause: training without response-masking — the model was learning to predict the instructions as well as the response.

SEVERITY: critical · DISCOVERED VIA: T01–T10 eval harness · STATUS: FIXED
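A check in the spirit of T01 is easy to make deterministic: count how often the same sentence repeats back-to-back. The threshold here is an assumption, not the project's actual gate.

```python
# Hypothetical regression check: flag output that repeats the same
# sentence many times in sequence (the "Ich habe ihn gefunden" signature).
import re

def max_consecutive_repeats(text: str) -> int:
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0
    best = run = 1
    for prev, cur in zip(sentences, sentences[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def passes_repetition_gate(text: str, limit: int = 3) -> bool:
    return max_consecutive_repeats(text) <= limit
```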
05

BPE tokenization corrupting German compounds

The tokenizer was splitting German compounds into corrupted fragments. This fault ran upstream of everything else: until it was fixed, no amount of retraining would converge.

SEVERITY: blocking · STATUS: FIXED
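One cheap way to catch this class of bug is an encode→decode round-trip audit over a list of compounds. The tokenizer interface (`encode`/`decode`) is a generic assumption and the word list is illustrative:

```python
# Hypothetical audit: check that encode->decode round-trips German
# compounds exactly. Any word that comes back altered is reported.
COMPOUNDS = ["Donaudampfschifffahrt", "Straßenbahnhaltestelle", "Schattenjäger"]

def roundtrip_failures(tokenizer, words=COMPOUNDS) -> list[str]:
    bad = []
    for w in words:
        decoded = tokenizer.decode(tokenizer.encode(w))
        if decoded != w:
            bad.append(w)
    return bad
```

Run this against the production tokenizer before any training: a nonempty list means retraining is pointless until the tokenizer is fixed.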
// engineered response

Engineered response.

Harvest from real pipeline artefacts

harvest_project_artifacts.py extracts 267 training pairs from real production pipeline runs (Lena_Schatten, Ankerbrecher). Total dataset grew from 262 to 1,055 pairs — a 4× step up without any synthetic generation.
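The harvesting idea can be sketched as a walk over run artefacts that turns prompt/output pairs into training records. The directory layout and field names below are assumptions; only the approach mirrors what harvest_project_artifacts.py is described as doing.

```python
# Hypothetical sketch: walk a pipeline run directory and collect
# prompt/output pairs from artefact JSON files as training records.
import json
from pathlib import Path

def harvest(run_dir: Path) -> list[dict]:
    pairs = []
    for path in sorted(run_dir.rglob("*.json")):
        rec = json.loads(path.read_text())
        prompt, output = rec.get("prompt"), rec.get("output")
        if prompt and output:
            pairs.append({"instruction": prompt, "response": output, "source": path.name})
    return pairs
```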

Response-masking as default

answers failure mode #04

train_on_responses_only enabled across all LoRAs. The model learns to produce responses, not to parrot instructions back.
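Mechanically, response-masking means prompt-token labels are excluded from the loss. A minimal sketch of that idea, using the conventional ignore index of -100 (the token ids are toy values; this illustrates what a `train_on_responses_only` collator does, not the project's exact code):

```python
# Minimal response-masking sketch: labels for the prompt prefix are set
# to -100 so cross-entropy loss only covers the response tokens.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids as labels, but ignore the prompt prefix in the loss."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]
```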

Ten-test eval harness (T01–T10)

answers failure mode #04

Deterministic regression tests for POV repair, meta-suppression, speech-tag variety, sensory openings, markdown bleed, and overall quality. Each training run is gated against them.
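The gating shape is simple: each check is a pure function of the model output, and a run is accepted only when all checks pass. Test ids mirror the T01–T10 naming; the single check shown (markdown bleed) is a simplified stand-in, not the harness's actual logic.

```python
# Hypothetical eval-gate sketch: deterministic pass/fail checks keyed by
# test id; a training run is accepted only if every check passes.
def check_markdown_bleed(prose: str) -> bool:
    """German prose output must not leak markdown markers."""
    return not any(tok in prose for tok in ("**", "##", "]("))

def gate(outputs: dict[str, str], checks: dict) -> dict[str, bool]:
    return {tid: fn(outputs[tid]) for tid, fn in checks.items()}
```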

Shared base + cold-swap

Seven LoRAs on one gemma4:31b base. Ollama Cloud cold-swap in under 800ms means the pipeline can switch roles mid-chapter without a reload penalty.
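The role-swap can be sketched as a role→model routing table plus a request payload that keeps the model resident between swaps. The model names and `ROLE_MODELS` table are illustrative, not the project's; the payload shape follows Ollama's `/api/generate` request format with its `keep_alive` parameter.

```python
# Hypothetical routing sketch: each pipeline role maps to a model name on
# the shared base, and keep_alive keeps it resident between role swaps.
ROLE_MODELS = {
    "writer_s1": "gemma4bolmor-writer-s1",
    "writer_s2": "gemma4bolmor-writer-s2",
    "repair": "gemma4bolmor-repair",
}

def generate_payload(role: str, prompt: str, keep_alive: str = "10m") -> dict:
    return {
        "model": ROLE_MODELS[role],
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,  # keep the model warm across role swaps
    }
```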

Turing-style blind A/B target

Quality is not "good enough" — it is val-loss < 0.25 on the repair-writer set, confirmed by n=40 blind A/B against human references. The bar is measurable, not aesthetic.

// what transfers

What transfers.

  • Training-data infrastructure fails silently. If you don't audit extraction counts per stage, your "training run" is training on nothing.
  • Real pipeline artefacts outperform synthetic generators. Lived data carries distribution the synthesiser cannot.
  • Response-masking is not optional for instruction-tuned LoRAs on creative tasks.
  • Specialised LoRAs on a shared base give you vertical quality at horizontal cost — one model in memory, seven roles served.
  • The acceptance metric decides the ceiling. Blind A/B at n=40 beats a vibe check every time.
  • Open frontier: base-swap to gemma4-omni-31b; the 4× larger curated pair set is already done.
7 LoRAs · 1,055 curated pairs · 4× scaling · 0.21 mean val-loss