2026-03-29 · Decoding · 9 min read

Constraint decoding against hallucinations.

A field report: how hard grammar constraints in the decoder keep bookkeeping outputs under control.

The moment it got expensive

I had spent two days building the receipt-to-booking prototype for Fiskal AI. Invoice in, posted entry out. The model was good — solid German, Swiss chart of accounts (Abacus structure) in the prompt, clear system message.

Then the first test with a real receipt from a trustee in Zug. Restaurant bill, CHF 87.40, VAT shown separately.

Output:

{
  "konto_soll": "6570 Bewirtungsspesen",
  "konto_haben": "1020 Bankkonto",
  "betrag": 87.40,
  "mwst_code": "N5"
}

Looks clean. The problem: N5 does not exist. The Swiss VAT-code catalogue has N1, N2, N3, N4, N6 and N7. The model invented N5, slotted neatly in between, plausible-sounding.

I ran another thirty receipts through. Twelve of them contained invented account numbers, non-existent VAT codes, or mixes of Swiss and German chart of accounts. In every single case the output sounded professional, confident, unambiguously wrong.

That was the day I understood: a trustee cannot use such a tool. Not because the AI is bad, but because the wrong answers do not look like wrong answers. They look like correct ones. The trustee would have to verify every single entry — and that is exactly what the tool was supposed to eliminate.

Why prompting alone is not enough

My first instinct was the one everyone has: tighten the prompting.

"Use exclusively accounts from the following list: …" — a list with 230 entries. "Use exclusively VAT codes from this list: N1, N2, N3, N4, N6, N7." "If no fitting account exists, return UNKNOWN."

That worked. To 94%. In the remaining 6% the model invented anyway. Sometimes it hallucinated a plausible-sounding account name in an existing number range that did not belong to that account number. Sometimes it returned a VAT code with the right letter but the wrong digit. Sometimes N5 came back despite the explicit list.

94% sounds like success. For a trustee posting 800 receipts per month, 94% is a disaster. That is 48 wrong postings per month — and they look like correct ones.

The problem is structural: the model generates token by token. At every token it decides probabilistically which token comes next. The prompt steers this probability, but it cannot enforce it. If the token space after "VAT-Code: N" still has nine plausible continuations, the model sometimes reaches for the tenth — even if it is semantically nonsense.

The solution to that is not more prompting. The solution is to restrict the token space in the decoder itself.
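To make that concrete, here is a minimal sketch of masking before sampling, with toy logits and hypothetical token names (no real model involved): forbidden tokens are removed from the distribution entirely, not merely discouraged.

```python
import math

def constrained_pick(logits, allowed):
    """Greedy pick over a logit dict, after masking every token the
    grammar forbids. Masked tokens get probability zero, not a penalty."""
    masked = {tok: lgt for tok, lgt in logits.items() if tok in allowed}
    z = sum(math.exp(l) for l in masked.values())
    probs = {tok: math.exp(l) / z for tok, l in masked.items()}
    return max(probs, key=probs.get)

# Toy preference: the invented "N5" actually scores highest raw.
logits = {"N1": 1.2, "N3": 0.8, "N5": 2.5, "N6": 0.4}
allowed = {"N1", "N2", "N3", "N4", "N6", "N7"}

print(constrained_pick(logits, allowed))  # "N5" is masked out; "N1" wins
```

The point of the sketch: no amount of prompt pressure changes `logits`, but the mask makes the illegal outcome unreachable regardless of its score.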

What constraint decoding actually does

Constraint decoding (also called grammar-constrained decoding; llama.cpp implements it via GBNF grammars) is a technique where the inference engine is given a formal grammar. At every token step the engine asks not only "what is likely?", but first: "which tokens are allowed at all?" Everything outside the grammar is masked to probability zero before sampling happens.

For our case the grammar (heavily simplified) looks like this:

buchung       ::= "{" konto_soll "," konto_haben "," betrag "," mwst "}"
konto_soll    ::= "\"konto_soll\":" konto_nr
konto_haben   ::= "\"konto_haben\":" konto_nr
konto_nr      ::= "\"1020\"" | "\"6570\"" | "\"6630\"" | ... (230 entries)
mwst          ::= "\"mwst_code\":" mwst_code
mwst_code     ::= "\"N1\"" | "\"N2\"" | "\"N3\"" | "\"N4\"" | "\"N6\"" | "\"N7\""
betrag        ::= "\"betrag\":" zahl
zahl          ::= digit+ ("." digit digit)?
digit         ::= [0-9]

After "mwst_code": the model can only produce N1, N2, N3, N4, N6 or N7. Not because it was asked to, but because the other tokens do not exist in the token space it samples from.
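The same point seen from the prefix side, as a small illustration (the codes mirror the grammar above; the helper is illustrative, not real tooling): once the decoder has emitted "N", only six digit characters are legal continuations, and the prefix "N5" leads nowhere.

```python
# Allowed VAT codes, as in the mwst_code rule of the grammar.
ALLOWED_CODES = {"N1", "N2", "N3", "N4", "N6", "N7"}

def legal_next_chars(prefix):
    """Which single characters may follow `prefix` under the grammar?"""
    return sorted({code[len(prefix)] for code in ALLOWED_CODES
                   if code.startswith(prefix) and len(code) > len(prefix)})

print(legal_next_chars("N"))   # ['1', '2', '3', '4', '6', '7']
print(legal_next_chars("N5"))  # [] - no allowed code starts with "N5"
```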

After the rebuild: invented VAT codes at zero. Invented account numbers at zero. Output format always valid, always parseable. The 6% residual errors disappeared as a class. Not reduced — eliminated.

What constraint decoding does not do

Important distinction: constraint decoding prevents the model from writing outside the allowed token space. It does not guarantee that the decision within the space is correct.

If the model has to choose between 6570 Bewirtungsspesen or 6530 Kundenanlässe for a restaurant bill, it still decides probabilistically. Both are valid accounts. Which is correct depends on context (client entertainment vs. internal lunch). Constraint decoding cannot decide whether the bookkeeper meant "hospitality".

What it can do: enforce that the answer lies within the known world. When the model is uncertain, it no longer flees to N5. It picks the statistically most likely of the six available codes. That is still fallible, but the errors all lie in a class a trustee knows and can evaluate.

That fundamentally changes the character of the tool. A human verifying a system can work with "wrong but plausible". They cannot work with "invented from nothing". The first class is correction work. The second class is sabotage.

Where I now use it everywhere

After this field test I made constraint decoding the default policy for every output where an enumerable result exists:

Bookkeeping accounts. One chart of accounts per client, one grammar per chart.

VAT codes. Cantonally specific, hard-wired as grammar.

YAML/JSON schemas in the OpenClaw pipeline. Every artefact (scene_brief, draft_package, P4 review) has a JSON schema. The grammar is generated from the schema and enforced at decoding time. P1 can no longer produce a creative YAML missing required fields; the decoder will not let it stop until every field is there.

Tax cluster at Fiskal AI. Each cantonal tax-law rule has a finite set of output tokens (deductions, lump sums, classifications). Grammar instead of prompt.

Function-calling arguments. Always constrained. Tool-call arguments must never be hallucinated.
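Writing those grammars by hand does not scale to 230 accounts per client, so in practice the alternation rules are rendered from data. A hypothetical helper in that spirit (function name and data are illustrative, not Fiskal AI's actual tooling):

```python
def gbnf_alternation(rule_name, values):
    """Render a finite value set as one GBNF alternation rule,
    quoting each value as a JSON string literal."""
    alts = " | ".join(f'"\\"{v}\\""' for v in values)
    return f"{rule_name} ::= {alts}"

# Per-client catalogues live as plain data; the grammar is derived.
mwst_codes = ["N1", "N2", "N3", "N4", "N6", "N7"]
print(gbnf_alternation("mwst_code", mwst_codes))
# mwst_code ::= "\"N1\"" | "\"N2\"" | "\"N3\"" | "\"N4\"" | "\"N6\"" | "\"N7\""
```

Because the grammar is generated, swapping a client's chart of accounts or a canton's VAT catalogue means regenerating one rule, not editing a prompt.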

The rule I drew from this: whenever I know which outputs are admissible, the knowledge belongs in the decoder, not in the prompt. The prompt can ask the model. The decoder can force it.

The connection to the Swiss market

This is not just a technical detail. It is the reason Fiskal AI can work vertically where horizontal AI tools fail.

A generic bookkeeping copilot cannot guarantee that it only emits valid VAT codes, because it has to serve every country. A Swiss tool with a cantonal LoRA and constraint decoding can. The trustee does not have to check whether the code exists — only whether it fits. That is the difference between "AI tool I cannot trust" and "AI tool that saves me 30% of my time".

For a trustee on CHF 150 per hour, 30% time saved on 800 receipts per month is the difference between "nice try" and "I pay CHF 200 per seat per month". And between those two poles lies only a grammar.

The lesson cost me two days and a stack of verified test data. It was worth every franc.