The moment it got expensive
I was running the Fiskal AI tax prototype on a single Switzerland-wide model. The reasoning: one good LoRA on Gemma-4-31B, trained on tax rulings from across Switzerland, plus good prompts carrying the cantonal context per request. That should have been clean.
Then came the test with a real mandate from the canton of Vaud. A self-employed person, property held as private assets, imputed rental value. The question: which lump-sum deduction for building maintenance applies?
The model answered with a reference to Art. 32 DBG and returned a clean calculation. The number was plausible. It was also wrong. Vaud applies a different lump-sum rate to properties in the canton than most German-speaking cantons do, and its imputed-rental-value calculation follows a cantonal formula that the model had blended with the federal one.
I traced it back. With a more precise prompt ("Please apply only Vaudois tax law") the answer improved. With an even more precise prompt, it improved further. But never reliably. The model knew something about Vaud, but it knew more about Zurich, because there were more Zurich rulings in the training data. Under uncertainty it pulled toward the most frequent representation in its head. And that was not Vaud.
That was the day I understood: a single Switzerland-wide model is not the right unit for this problem. Switzerland does not have one tax law. It has 26 tax laws, plus one at federal level. And the differences are not nuances — they are structural.
The heterogeneity horizontal tools cannot capture
To a non-Swiss observer, the Swiss tax system looks like one system. To a trustee who works in it, it is 26 different systems plus the federation. A few examples to make clear why this is not a detail:
Imputed rental value. Calculated by a different formula in every canton. Zurich uses a different percentage range than Vaud, which differs again from Bern. Geneva's formula weights zone and location differently. (A sketch of what this looks like in code follows after this list.)
Professional-expense lump sum. Federally uniform (3% of net salary, with minimum and maximum), but cantonal supplements and additional deductions vary. Valais allows deductions that Basel-Stadt does not.
Wealth tax. Exists at the cantonal level, not the federal one. Rates range from minimal (Obwalden, Nidwalden) to significant progression (Vaud, Neuchâtel).
Church tax. Cantonally regulated: mandatory in some cantons, voluntary in others, with differing rates. A non-federal factor that can shift the total tax burden by a double-digit percentage.
Withholding-tax tariffs. Cantonally different. A cross-border worker from France commuting into Geneva pays under a different tariff than one commuting into Baselland.
Tax-return forms. Different numbering, different mandatory fields, different annexes. The form "tax return 2025" is 26 different documents.
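To make the imputed-rental-value point concrete: encoding these rules means a different formula shape per canton, not just different constants. Here is a sketch of such a registry; the formulas and percentages are placeholders, explicitly not the real cantonal values:

```python
from typing import Callable

# Placeholder formulas for illustration only. The point is that the shapes
# differ per canton, not just the constants; none of these numbers are real.
IMPUTED_RENTAL_VALUE: dict[str, Callable[..., float]] = {
    "ZH": lambda tax_value: tax_value * 0.035,                            # flat share of tax value
    "VD": lambda reference_rent: reference_rent * 0.9,                    # derived from a reference rent
    "GE": lambda tax_value, zone_factor: tax_value * 0.03 * zone_factor,  # zone- and location-weighted
}
```

A model that has internalised mostly the Zurich shape will happily apply it to a Vaud property, which is exactly the failure mode above.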
For a trustee in Lucerne with mandates in Zug, Schwyz, Obwalden and Nidwalden, this is not an edge topic. It is daily business: working with four different tax laws every day. Every error is liability.
A horizontal AI tool has to hold all 26 laws plus the federal layer in one model. That is not just inefficient, it is structurally wrong, because under uncertainty the model always falls back to its most frequent representation. For the trustee in Jura that means Zurich answers with a Jura coat of paint.
Why not one big model per canton
The obvious approach would be: 26 full models, each fine-tuned per canton. One Gemma-4-31B per canton.
That would be 26 models × 18GB each (31B parameters at Q4) = 468GB of model storage. Ollama Cloud does not charge the world for that, but the cold-load times between cantons would be brutal. A trustee working a Zurich mandate and switching to a Zug one waits 15 seconds for a model swap. At eight mandate switches per hour, that is two minutes lost per hour: not dramatic, but clearly noticeable.
The real catch lies elsewhere: it is wasteful. 90% of tax knowledge is federal and shared Switzerland-wide. Only 10% is genuinely cantonal. Why train the same base model 26 times if only the adapter differs?
The stack: one base, 26 adapters
The architecture I settled on:
One base model. Gemma-4-31B, fine-tuned on Swiss German, Swiss legal language and the federal tax foundations. This model already knows everything that applies in every canton: federal tax, the VStG, the MWSTG, federal terminology, the structure of Swiss rulings.
26 LoRA adapters. One per canton. Trained on cantonal tax rulings, cantonal tax laws, cantonal forms, cantonal practice notes. Each adapter is small — a few hundred megabytes. All 26 together cost less storage than a single additional full model.
Cold-swap via Ollama Cloud. Switching between adapters happens in under 800ms. No longer perceptible to a trustee.
Routing layer. Before the model runs, a small classifier decides which canton is relevant for the request. The decision is mandate-specific, not guessed per request. For multi-canton mandates there is a clear process: requests are executed explicitly per canton, never mixed. A sketch of the routing and the adapter swap follows below.
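A minimal sketch of how this could be wired up, assuming the adapters are exported as GGUF LoRA files and packaged as one Ollama model per canton via a Modelfile. All names here (model tags, file paths, the mandate registry) are illustrative, not the production setup:

```
# Hypothetical Modelfile for the Vaud model: shared base weights plus the cantonal adapter
FROM gemma-base:31b-q4
ADAPTER ./adapters/vd-tax.gguf
```

Once each canton's model is created this way, the swap reduces to choosing a model name per request, and the routing layer stays small:

```python
import requests

# Illustrative mandate registry. In production the canton comes from the
# mandate database, never from guessing based on the request text.
MANDATE_CANTONS = {
    "mandate-4711": ["VD"],
    "mandate-4712": ["ZH", "ZG"],  # multi-canton mandate
}

OLLAMA_URL = "http://localhost:11434/api/chat"  # endpoint is an assumption

def ask(mandate_id: str, question: str) -> dict[str, str]:
    """Run the question once per relevant canton, explicitly, never mixed."""
    answers = {}
    for canton in MANDATE_CANTONS[mandate_id]:
        resp = requests.post(OLLAMA_URL, json={
            "model": f"fiskal-{canton.lower()}",  # e.g. fiskal-vd = base + VD adapter
            "messages": [{"role": "user", "content": question}],
            "stream": False,
        })
        resp.raise_for_status()
        answers[canton] = resp.json()["message"]["content"]
    return answers
```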
The cost picture changes completely. Instead of 468GB of model storage I need 18GB for the base plus 26 × ~300MB of adapters, around 26GB in total. Less storage, faster swaps, and updating a single canton's adapter when its law changes is trivial.
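The arithmetic behind the comparison, using the figures above:

```python
full_models   = 26 * 18          # GB: one Q4-quantised 31B model per canton
adapter_stack = 18 + 26 * 0.3    # GB: one shared base plus 26 LoRA adapters

print(full_models)    # 468
print(adapter_stack)  # ~25.8
```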
And most importantly: every adapter sees only its canton's knowledge plus the federal foundation at inference. There is no cross-contamination anymore. The Vaud adapter thinks Vaud. It does not fall back under uncertainty to Zurich-average answers.
What emerged during implementation
Training-data acquisition is the actual work
Training 26 adapters is easy, if you have the data. Swiss tax rulings are in principle public, but they live in different cantonal systems, in different formats, at different levels of completeness. Some cantons publish structured data, some scanned PDFs, some only anonymised excerpts.
That means most of the work is not ML, it is data engineering: per canton a crawler, a normaliser, an anonymised training corpus. That takes time, and it is the prerequisite for everything else. Whoever cuts corners here ships 26 adapters of varying quality, which is worse than one mediocre model, because the inconsistency damages trust. The shape of the per-canton pipeline is sketched below.
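A minimal sketch of that pipeline shape. The function bodies are stubs on purpose, since every canton's source format needs its own implementation; all names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Ruling:
    canton: str
    source_url: str
    text: str  # normalised, anonymised plain text, ready for the training corpus

def crawl(canton: str) -> list[tuple[str, bytes]]:
    """Fetch raw rulings as (url, payload): structured feeds for some cantons, scanned PDFs for others."""
    ...

def normalise(canton: str, raw: bytes) -> str:
    """OCR where needed, repair encoding, strip boilerplate; one variant per source format."""
    ...

def anonymise(text: str) -> str:
    """Remove names, addresses and identifiers before anything enters a corpus."""
    ...

def build_corpus(canton: str) -> list[Ruling]:
    return [
        Ruling(canton, url, anonymise(normalise(canton, raw)))
        for url, raw in crawl(canton)
    ]
```

The ML part at the end, the actual LoRA training run, is the smallest step in this chain.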
Constrained decoding goes on top
A cantonal LoRA adapter is trained on the right ground truth. That does not make it immune to hallucination: when uncertain, it can still cite an invented paragraph or fabricate a VAT code. The combination I now run is LoRA per canton plus constrained decoding in the decoder. The LoRA makes the model think canton-specifically; the decoder makes sure it does not slip out of the known world while writing.
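In production that check sits inside the decoding loop, masking out continuations that would leave the canton's citation registry. As a minimal sketch of the same whitelist idea, here is the post-hoc version; the registry entries are illustrative, not real provisions:

```python
import re

# Illustrative per-canton registry of citable provisions. The real registry
# is generated from the cantonal statutes, not hand-written like this.
KNOWN_PROVISIONS = {
    "VD": {"Art. 30 LI", "Art. 31 LI"},
    "CH": {"Art. 32 DBG", "Art. 33 DBG"},  # federal layer, allowed everywhere
}

CITATION = re.compile(r"Art\.\s*\d+[a-z]?\s+[A-Z]{2,6}")

def invalid_citations(answer: str, canton: str) -> list[str]:
    """Return citations that exist in neither the cantonal nor the federal registry."""
    allowed = KNOWN_PROVISIONS.get(canton, set()) | KNOWN_PROVISIONS["CH"]
    found = (" ".join(c.split()) for c in CITATION.findall(answer))  # normalise whitespace
    return [c for c in found if c not in allowed]
```

Anything this returns is a red flag, no matter how plausible the number next to it looks.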
The adapter is the moat, not the model
The interesting commercial consequence: the base model is replaceable. If a better open-weight model appears tomorrow — Gemma-5, Qwen4, whatever — I swap the foundation and re-train the adapters. The 26 cantonal corpora remain. The routing logic remains. The integrations with Abacus, Sage and Bexio remain.
The value does not lie in the model. It lies in the 26 curated cantonal training sets and the infrastructure that serves them as adapters. That is the moat. A new competitor would have to repeat the same data-engineering effort — and that takes months they do not have if Fiskal AI builds market share in that window.
What this teaches for other verticals
The lesson I draw from this setup is not limited to tax law. It applies to every market with structural heterogeneity.
Swiss healthcare. Cantonal hospital financing, KVG implementation, care financing. One adapter per canton for billing copilots would be technically analogous.
Swiss building law. The BGBB at the federal level, planning law that differs sharply by canton, and municipal building regulations as another layer below. An adapter stack per planning commission would be conceivable.
European data-protection law. GDPR at EU level, but national implementations diverge. One adapter per jurisdiction could be more precise than a pan-European model.
Tourism copilot per region. Every Swiss tourism region has its own offerings, traditions, recommendation patterns. Gipfel AI will be an analogous stack: one base model with regionally specialised adapters.
The pattern is always the same: when a market consists of several small but structurally different segments, the right model is not one big one for all, but a shared foundation plus specialised adapters. The cost advantage shows up directly in storage and swap times; the quality advantage is structural: no cross-contamination under uncertainty.
And the founder's edge: this architecture cannot be copied with capital alone. The curated corpora have to be built, not bought. That is craft. That is time. That is the moat.
The lesson cost me one wrong Vaudois imputed rental value. It set the architecture of Fiskal AI, and probably also that of Gipfel AI.