keinontolibrary

A fast, embeddable Rust library that declines Finnish nouns. Give it a noun lemma, a number, and a case; it returns the inflected form(s).

decline("hevonen", Number::Plural, Case::Inessive) -> ["hevosissa"]

It is data-backed — precomputed forms from a reference corpus collected over three years and labeled with Voikko — with a rule-based fallback over the Kotus declension classes (taivutustyypit 1–49 with consonant gradation). It covers 100% of declinable Kotus 2024 nominal rows; verbs and the indeclinable tn 99/100 classes are out of scope.

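These guides are task-oriented; pick the one that matches what you want to do:

  • Embed in a Rust program
  • Decline from the command line
  • Run the HTTP service
  • Build the data artifact
  • Fix a wrong declension (the right way)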

For background on how compounds are handled, see Compound nouns.

The source lives on GitHub. If you’d rather browse declensions than embed them, try humalapaikallissija.com, a toy built on this library.

How it works

Finnish nouns inflect across 15 cases and two numbers — roughly 30 forms per word — with consonant gradation, vowel harmony, and a long tail of irregulars that break any naive lookup table. keinontolibrary produces the forms through several cooperating layers, and then checks every one of them against an independent oracle.

The layers

  1. Rule generator (Kotus classes 1–49). Implements the Kotus declension types with consonant gradation (grades A–M) and vowel harmony. It handles the bulk of the language directly — about 98% agreement with the reference corpus.
  2. Compound classes (tn 50–51). Compounds inflect through their parts: tn 50 head-inflecting (koirankeksi → koirankeksissä), tn 51 both-parts-inflecting (isoveli → isoissaveljissä). A segmenter splits compounds; a plural-head reverse index and a combining-head registry reach the transparent compounds and derivations that Kotus leaves without a declension type.
  3. Corpus-backed lookup. Precomputed surface forms from a reference corpus (collected over three years) are the primary source where available; the rule generator is the fallback.
  4. Exception registry + overrides. Genuine irregulars the rules can’t derive — aika → ajan, the suppletive kuka → kenet, vaaka → vaa'an — live in a registry and per-slot overrides.
  5. Productive inference & resolvers. Class inference (-nen→38, -uus→40, -ias→41), nested compounds, and compound numerals extend reach to 100% of declinable Kotus nominal rows.
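The suffix-based class inference in layer 5 can be pictured with a small sketch. The function name `infer_class` and its return type are hypothetical, not the library's API; only the three suffix-to-class pairs come from the list above, and a real implementation would presumably know many more endings.

```rust
/// Hypothetical helper (not the library's API): guess a Kotus class
/// from a lemma's ending, using only the three pairs named in the text.
fn infer_class(lemma: &str) -> Option<u8> {
    // Suffix -> Kotus declension class (tn).
    const SUFFIX_CLASSES: [(&str, u8); 3] = [("nen", 38), ("uus", 40), ("ias", 41)];
    SUFFIX_CLASSES
        .iter()
        .find(|(suffix, _)| lemma.ends_with(*suffix))
        .map(|&(_, tn)| tn)
}
```

With this table, `infer_class("hevonen")` yields `Some(38)`, while a lemma with none of the listed endings, such as `talo`, yields `None` and would have to be resolved by another layer.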

Verification

Correctness isn’t asserted — it’s checked. Every form the engine can produce is validated against Voikko, an open-source Finnish morphological analyzer, used purely as an independent oracle: it analyzes each generated form, and a form is accepted only when Voikko confirms it as the expected lemma + case + number. Where the rule generator and the corpus disagree, the corpus is the witness and Voikko the tie-breaker. The quality gate runs at zero wrong forms across the in-scope inventory.

That is why “verified” is a claim rather than a slogan: each form is checked by a tool that had no hand in generating it.
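The acceptance rule can be sketched as follows. This is a minimal illustration, not the project's QA code: the `Oracle` trait and the table-backed `TableOracle` are hypothetical stand-ins, and nothing here reflects Voikko's actual API. The point is only that a form passes when some independent reading matches the expected lemma, number, and case.

```rust
use std::collections::HashMap;

/// One reading of a surface form: which lemma, number and case it realizes.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Analysis {
    lemma: String,
    number: String,
    case: String,
}

/// Stand-in for an independent analyzer (hypothetical; not Voikko's API).
trait Oracle {
    fn analyze(&self, form: &str) -> Vec<Analysis>;
}

/// Toy oracle backed by a fixed table, for demonstration only.
struct TableOracle(HashMap<String, Vec<Analysis>>);

impl Oracle for TableOracle {
    fn analyze(&self, form: &str) -> Vec<Analysis> {
        self.0.get(form).cloned().unwrap_or_default()
    }
}

/// Accept a generated form only if some oracle reading matches the expected slot.
fn accept(oracle: &dyn Oracle, form: &str, expected: &Analysis) -> bool {
    oracle.analyze(form).iter().any(|reading| reading == expected)
}

/// Build the toy oracle with a single known form.
fn demo_oracle() -> TableOracle {
    let reading = Analysis {
        lemma: "hevonen".into(),
        number: "plural".into(),
        case: "inessive".into(),
    };
    TableOracle(HashMap::from([("hevosissa".to_string(), vec![reading])]))
}
```

Because the oracle returns all readings of a form, a homograph that has the expected reading among several still passes, while a form the oracle does not recognize at all is rejected.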

Data & provenance

  • Lemma inventory and inflection metadata — the Kotus Nykysuomen sanalista 2024, licensed CC BY 4.0.
  • Surface forms — our own reference corpus, collected over three years and labeled using Voikko as an analysis tool.

Scope: all nominal Kotus classes 1–51 — substantives, adjectives, numerals, and pronouns. Verbs and the indeclinable tn 99/100 classes are out of scope.

Embed in a Rust program

Decline Finnish nominals from your own crate.

Add the dependency

[dependencies]
keinontolibrary-core = "0.1"
keinontolibrary-data = "0.1"   # the artifact-backed lookup store

keinontolibrary-core is the API (enums, Engine, decline, paradigm). keinontolibrary-data loads the packed artifact and wires the rule fallback behind it.

Build an engine and decline a word

use keinontolibrary_core::{Case, Number};
use keinontolibrary_data::build_engine;

// Loads the packed artifact + (optional) overlay, with the rule engine as fallback.
let bundle = build_engine("data/artifact/keinontolibrary.bin", "data/overlay.jsonl")?;
let engine = &bundle.engine;

let forms = engine.decline("hevonen", Number::Plural, Case::Inessive)?;
assert_eq!(forms.primary(), Some("hevosissa"));
assert_eq!(forms.variants, ["hevosissa"]);   // primary first; some slots have several

decline returns Forms { variants, status, source, coincides_with }:

  • variants: surface forms, primary first (genitive plural and illative often have more than one valid form).
  • source: Lookup (corpus), Generated (rules/registry), or Overlay.
  • coincides_with: set on the accusative (singular = genitive, plural = nominative).

The whole paradigm

let p = engine.paradigm("talo")?;
for (number, case, forms) in p.iter() {
    println!("{number} {case}: {}", forms.variants.join(", "));
}

Handle the three error cases

decline/paradigm return Result<_, Error>:

use keinontolibrary_core::Error;

match engine.decline("kuusi", Number::Singular, Case::Inessive) {
    Ok(f) => println!("{:?}", f.primary()),
    Err(Error::UnknownWord(w)) => eprintln!("not a known word: {w}"),
    Err(Error::Ambiguous { lemma, paradigms }) => {
        // Homonyms: kuusi is "spruce" (tn 24) and "six" (tn 27). Pick one and retry.
        eprintln!("{lemma} is ambiguous: {paradigms:?}");
    }
    Err(Error::DefectiveForm { lemma, number, case }) => {
        // The slot genuinely does not exist (e.g. sakset has no singular).
        eprintln!("{lemma} has no {number} {case}");
    }
}

Disambiguate homonyms

Pass an explicit paradigm with decline_with / paradigm_with:

use keinontolibrary_core::ParadigmRef;

// "kuusi" → spruce (Kotus class tn 24; tn 27 is the numeral "six", which gives kuudessa)
let f = engine.decline_with("kuusi", Number::Singular, Case::Inessive,
                            &ParadigmRef::new(None, 24))?;
assert_eq!(f.primary(), Some("kuusessa"));

ParadigmRef::new(hn, tn) — a None field is a wildcard (matches any homonym number / any class). To require a specific reading, pass Some(_).
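The wildcard behaviour described above can be illustrated with a standalone sketch. `Selector` is a hypothetical mirror written for this example; its field types and the `matches` method are assumptions and may differ from how `ParadigmRef` is actually implemented.

```rust
/// Hypothetical mirror of a paradigm selector; field names and types are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Selector {
    hn: Option<u8>, // homonym number, None = any
    tn: Option<u8>, // Kotus class, None = any
}

impl Selector {
    /// True when every Some(_) field equals the candidate's value;
    /// a None field places no constraint on the candidate.
    fn matches(&self, hn: u8, tn: u8) -> bool {
        self.hn.map_or(true, |want| want == hn) && self.tn.map_or(true, |want| want == tn)
    }
}
```

A selector with both fields set to `None` matches every candidate, which is why an ambiguous lemma needs at least one `Some(_)` field to narrow the choice.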

Notes

  • All inputs are normalized (trimmed, NFC, lowercased) — "Talo", " talo ", and "talo" are equivalent.
  • Resolution order is overlay → lookup → rule fallback: an overlay entry wins, then the corpus, then the rules.
  • The artifact is not committed (it embeds corpus-derived data; see LICENSING.md). Build it once with cargo run -p keinontolibrary-ingest (see Build the data artifact), or embed your own via LookupData::from_bytes.
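The normalization and resolution-order notes above can be sketched together. This is an illustration under stated assumptions, not the engine's code: `Layers`, `resolve`, and the single-string values are hypothetical, the maps are keyed by lemma only (one slot), and NFC normalization is omitted because it needs a Unicode crate outside the standard library.

```rust
use std::collections::HashMap;

/// Trim and lowercase a query. NFC normalization is left out of this sketch
/// because std has no Unicode normalizer; the notes say the library applies it too.
fn normalize(input: &str) -> String {
    input.trim().to_lowercase()
}

/// Hypothetical three-layer store: overlay first, then lookup, then a rule fallback.
struct Layers {
    overlay: HashMap<String, String>,
    lookup: HashMap<String, String>,
}

impl Layers {
    /// Returns the form together with the name of the layer that answered.
    fn resolve(
        &self,
        word: &str,
        rules: impl Fn(&str) -> Option<String>,
    ) -> Option<(String, &'static str)> {
        let key = normalize(word);
        if let Some(form) = self.overlay.get(&key) {
            return Some((form.clone(), "overlay"));
        }
        if let Some(form) = self.lookup.get(&key) {
            return Some((form.clone(), "lookup"));
        }
        // Only reached when neither data layer has an entry.
        rules(&key).map(|form| (form, "generated"))
    }
}
```

Returning the answering layer alongside the form mirrors the idea behind the `source` field: a caller can tell whether a result was attested, generated, or manually overridden.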

Decline from the command line

Quick lookups and scripting, no code.

Install

# From a checkout (until published to a package channel — see ../DISTRIBUTION.md):
cargo install --path crates/keinontolibrary-cli
# or run in place:
cargo run -p keinontolibrary-cli -- <args>

The CLI reads the artifact at data/artifact/keinontolibrary.bin (override with --artifact or $KEINONTO_ARTIFACT) and an overlay at data/overlay.jsonl (--overlay / $KEINONTO_OVERLAY).

Decline one slot

keinontolibrary decline hevonen --number plural --case inessive
# hevonen (plural inessive): hevosissa  (Present, Lookup)

keinontolibrary decline kuka --number singular --case accusative
# kuka (singular accusative): kenet  (Present, Generated)

The whole paradigm

keinontolibrary paradigm talo
# talo (tn=1)
#   singular nominative   talo
#   singular genitive     talon
#   ...

Declension tables

table renders the full paradigm as a case-rows × singular/plural-columns grid, for one or more words at once. Defective slots show an em dash (—).

keinontolibrary table talo
# talo (tn 1)
# case         singular  plural
# nominative   talo      talot
# genitive     talon     talojen
# ...
# comitative   —         taloineen, taloinensa

Pick the output with --format (text default, markdown, csv, json):

keinontolibrary table aika --format markdown    # GitHub table, with a tn heading
keinontolibrary table parfait --format csv      # case,singular,plural (RFC-4180 quoting)
keinontolibrary table talo --format json        # the full Paradigm as JSON

# Several words; --tn/--hn disambiguate and apply to each:
keinontolibrary table talo koira kissa
keinontolibrary table kuusi --tn 27

Exit code 3 if any requested word could not be resolved.

Disambiguate homonyms

keinontolibrary decline kuusi --number singular --case inessive
# 'kuusi' is ambiguous; pass --tn (or --hn):
#   tn=24
#   tn=27
keinontolibrary decline kuusi --number singular --case inessive --tn 24
# kuusi (singular inessive): kuusessa

Add or correct a word (overlay)

keinontolibrary add --lemma uudissana --tn 9 \
    --number singular --case inessive --forms uudissanassa
# overlay: uudissana singular inessive = ["uudissanassa"]

override is an alias of add (the overlay is upsert-by-key — last write wins). New overlay entries are immediately declinable and persist to the overlay file.

JSON output

keinontolibrary decline talo --number plural --case adessive --json
# {"variants":["taloilla"],"status":"present","source":"lookup","coincides_with":null}

Exit codes (for scripts)

code  meaning
0     success
1     setup/usage error (bad artifact path, I/O)
2     argument parsing error (from clap)
3     the word could not be declined: unknown, ambiguous, or defective form

if keinontolibrary decline "$w" --number singular --case genitive --json >/tmp/out; then
    jq -r '.variants[0]' /tmp/out
else
    echo "no form for $w (exit $?)"
fi

validate — inspect the loaded artifact

keinontolibrary validate
# version, lemma count, form count, and the Kotus / reference-corpus provenance.

selftest — verify an install

selftest declines a built-in golden set through the rule engine and registry and checks each form. It needs no artifact or data file, so it’s the smoke test to run right after installing from any channel (cargo, brew, apt, the container, …). Exit 0 if every check passes, 1 on any mismatch.

keinontolibrary selftest
# ok   talo singular inessive: talossa (want talossa)
# ok   aika singular genitive: ajan (want ajan)
# ...
# selftest: 8 checks passed

Run the HTTP service

A small axum server for declension lookups — the container deployment.

Run

cargo run -p keinontolibrary-server         # listens on 0.0.0.0:8080

Configuration is via environment:

var                   default                            purpose
KEINONTO_ARTIFACT     data/artifact/keinontolibrary.bin  the packed artifact
KEINONTO_OVERLAY      data/overlay.jsonl                 persistent overlay (admin writes)
KEINONTO_ADDR         0.0.0.0:8080                       bind address
KEINONTO_ADMIN_TOKEN  (unset)                            bearer token; admin endpoints are disabled unless set
RUST_LOG              info                               log level (structured tracing; requests are traced)

Endpoints

curl 'localhost:8080/decline?word=hevonen&number=plural&case=inessive'
# {"variants":["hevosissa"],"status":"present","source":"lookup","coincides_with":null}

curl 'localhost:8080/paradigm?word=talo'          # full table as JSON
curl 'localhost:8080/healthz'                      # "ok"
curl 'localhost:8080/about'                        # version, data metadata, attribution

Both /decline and /paradigm accept &hn= and &tn= to disambiguate homonyms.

Response status codes mirror the engine: 200 ok, 400 bad number/case (or overlong word), 404 unknown word, 409 ambiguous (body lists the candidate paradigms), 422 defective form.
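The mapping from engine outcomes to status codes can be written down as a small function. The `Failure` enum below is a hypothetical stand-in that echoes the listed cases; it is not the server's actual error type, and the real handler may be structured differently.

```rust
/// Hypothetical failure modes, echoing the engine errors plus request validation.
#[derive(Debug)]
enum Failure {
    BadRequest, // unrecognized number/case value, or overlong word
    UnknownWord,
    Ambiguous,
    DefectiveForm,
}

/// Map an outcome to the HTTP status codes listed in the text.
fn http_status(outcome: &Result<(), Failure>) -> u16 {
    match outcome {
        Ok(()) => 200,
        Err(Failure::BadRequest) => 400,
        Err(Failure::UnknownWord) => 404,
        Err(Failure::Ambiguous) => 409,
        Err(Failure::DefectiveForm) => 422,
    }
}
```

Keeping the mapping in one exhaustive `match` means a new failure mode cannot be added without the compiler asking for its status code.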

Admin (overlay mutation)

Enabled only when KEINONTO_ADMIN_TOKEN is set. Both paths are aliases (create-or-replace):

curl -X POST localhost:8080/admin/add \
  -H "authorization: Bearer $KEINONTO_ADMIN_TOKEN" \
  -H 'content-type: application/json' \
  -d '{"lemma":"uudissana","tn":9,"number":"singular","case":"inessive","variants":["uudissanassa"]}'

The token is compared in constant time (SHA-256 digests); bad tokens get 403. Request bodies are capped at 16 KiB. Put the service behind a proxy for TLS and rate limiting.
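As an illustration of the constant-time idea, the sketch below compares two 32-byte digests without returning at the first differing byte. It is not the server's code: hashing is out of scope (the arrays merely stand in for SHA-256 digests), and `digests_equal` is a hypothetical name. Production code would normally rely on a vetted crate such as `subtle`, since a hand-written loop gives no guarantee against compiler optimizations.

```rust
/// Compare two 32-byte digests without exiting early on the first difference.
/// The inputs stand in for SHA-256 digests computed elsewhere.
fn digests_equal(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        // Accumulate differences instead of returning early, so the loop
        // always visits all 32 bytes regardless of where a mismatch occurs.
        diff |= x ^ y;
    }
    diff == 0
}
```

Comparing fixed-length digests rather than the raw tokens also avoids leaking the token length through the comparison.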

Container

cargo run -p keinontolibrary-ingest          # produce data/artifact/keinontolibrary.bin first
docker build -t keinontolibrary .            # ~10 MB static-musl scratch image
docker run -p 8080:8080 keinontolibrary

The image runs unprivileged (USER 65532) and ships a HEALTHCHECK (the binary self-probes via --health). The server drains in-flight requests on SIGTERM/SIGINT, so it stops cleanly under an orchestrator.

Build the data artifact

The packed artifact (data/artifact/keinontolibrary.bin) is not committed — it embeds reference-corpus-derived forms whose redistribution license is unresolved (see LICENSING.md). Build it yourself from the sources.

Sources

  1. Kotus Nykysuomen sanalista 2024 (CC BY 4.0), the lemma inventory: https://kaino.kotus.fi/lataa/nykysuomensanalista2024.txt → data/sources/
  2. Reference corpus — Voikko-labeled JSONL shards (collected by the project). Place the *.jsonl shards in data/sources/voikko/.

Ingest

cargo run -p keinontolibrary-ingest          # Kotus + corpus -> data/artifact/keinontolibrary.bin

The artifact is framed (KEIN magic + format-version byte + CRC32); a corrupt, truncated, or version-mismatched file is rejected loudly on load. The build is deterministic — same sources produce a byte-identical artifact.
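A framed-file check of this kind can be sketched as below. The byte layout is an assumption made for the example (magic, then one version byte, then a little-endian CRC-32 of the payload, then the payload); the actual artifact format, the supported version value, and the function names are not taken from the project.

```rust
const MAGIC: [u8; 4] = *b"KEIN";
const SUPPORTED_VERSION: u8 = 1; // assumed value, for illustration only
const HEADER_LEN: usize = 9; // magic (4) + version (1) + CRC-32 (4)

#[derive(Debug, PartialEq, Eq)]
enum FrameError {
    Truncated,
    BadMagic,
    VersionMismatch(u8),
    ChecksumMismatch,
}

/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all ones if the low bit is set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Wrap a payload in the assumed frame layout.
fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len());
    out.extend_from_slice(&MAGIC);
    out.push(SUPPORTED_VERSION);
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Validate the frame and return the payload, rejecting anything suspicious.
fn open_frame(bytes: &[u8]) -> Result<&[u8], FrameError> {
    if bytes.len() < HEADER_LEN {
        return Err(FrameError::Truncated);
    }
    if bytes[..4] != MAGIC {
        return Err(FrameError::BadMagic);
    }
    let version = bytes[4];
    if version != SUPPORTED_VERSION {
        return Err(FrameError::VersionMismatch(version));
    }
    let stored = u32::from_le_bytes([bytes[5], bytes[6], bytes[7], bytes[8]]);
    let payload = &bytes[HEADER_LEN..];
    if crc32(payload) != stored {
        return Err(FrameError::ChecksumMismatch);
    }
    Ok(payload)
}
```

In this layout a file cut off inside the payload surfaces as a checksum mismatch, while one cut off inside the header is reported as truncated; either way the load fails instead of yielding partial data.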

Overrides (optional, Voikko-required)

Four probe-minted sidecars in data/ refine forms the spelling alone cannot determine — vowel harmony (antigeenissä), comitative style (-ine vs -ineen), and foreign citations (parfait'n, cd:n). They are committed; regenerate after rule changes with:

scripts/qa/run.sh harmony      # needs libvoikko + the Python venv (run.sh setup)

Verify with the QA loop

The QA loop generates every form, checks each against Voikko + the corpus, and gates on regressions and total coverage:

scripts/qa/run.sh setup        # one-time: venv + libvoikko
scripts/qa/run.sh all          # ingest -> dump -> verify -> report --gate

The gate holds two invariants: 0 failing slots and 100% total coverage (every Kotus nominal × every slot answered or declared defective). See scripts/qa/README.md for the full workflow and the accepted-list mechanism.
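The two invariants amount to a simple check over the run's tallies. The sketch below is illustrative only: the `Report` struct and its field names are hypothetical, and the real report format is whatever scripts/qa produces.

```rust
/// Tally of one QA run. Field names are hypothetical.
struct Report {
    total_slots: u64,        // every Kotus nominal x every slot
    answered: u64,           // slots with at least one verified form
    declared_defective: u64, // slots explicitly marked as nonexistent
    failing: u64,            // slots whose forms were rejected
}

/// Pass only when there are no failing slots and every slot is accounted for.
fn gate(report: &Report) -> Result<(), String> {
    if report.failing > 0 {
        return Err(format!("{} failing slots", report.failing));
    }
    let covered = report.answered + report.declared_defective;
    if covered != report.total_slots {
        return Err(format!("coverage {covered}/{}", report.total_slots));
    }
    Ok(())
}
```

Counting declared-defective slots toward coverage is what lets the gate demand 100% without forcing the engine to invent forms that do not exist.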

Fix a wrong declension (the right way)

The library is gated at 100% coverage / 0 failing slots against Voikko + the corpus. Every change keeps it there. The workflow for fixing or adding a form:

1. Reproduce

keinontolibrary decline <word> --number <n> --case <c>

Decide whether it’s a rule problem (a whole class is wrong), a lexical one (one irregular word), or a gap (no form at all).

2. Mint Voikko-verified gold data

Never hand-write forms from memory — mint them. The finnish-testgen skill pulls the fi.wiktionary table and validates every form through Voikko:

DYLD_LIBRARY_PATH=/opt/homebrew/lib .venv/bin/python \
  .claude/skills/finnish-testgen/scripts/mint_testdata.py <word> \
  --kotus data/sources/nykysuomensanalista2024.txt

3. Make the fix at the right altitude

  • Rule — edit the class arm in crates/keinontolibrary-rules/src/generate.rs (or the gradation/harmony helpers). Prefer generalizing over special-casing.
  • Irregular — add Voikko-verified rows to crates/keinontolibrary-rules/exceptions.toml (the registry rejects duplicate keys and is CI-capped).
  • Override — for harmony / comitative / citation quirks, regenerate the sidecar (scripts/qa/run.sh harmony) rather than editing forms by hand.

Add a unit test with the Voikko-verified forms next to the change.

4. Run the gate

scripts/qa/run.sh quick      # while iterating (sampled)
scripts/qa/run.sh all        # full: ingest -> dump -> verify -> report --gate

The gate fails on any new failing slot or coverage drop. If a slot genuinely cannot be judged by Voikko (a lemma outside its lexicon, a Kotus↔Voikko disagreement), add it to qa/accepted.jsonl with a reason — never re-baseline over a real regression. Update the baseline only in the same PR as the fix:

scripts/qa/run.sh report -- --update-baseline

5. Standard checks

cargo fmt --all && cargo clippy --all-targets --all-features -- -D warnings && cargo test --all-features

CI additionally runs the MSRV build (1.85) and cargo-audit. See scripts/qa/README.md for the loop’s internals.

Compound nouns (Kotus 50/51) — design & test-data plan

Compounds are the bulk of any Finnish dictionary, and they’re the largest remaining gap in keinontolibrary. This doc records the design and, in particular, how we get enough test data to trust it.

What ships today

Engine has a final-component fallback (see keinontolibrary-core/src/engine.rs): when a lemma is unknown as a whole, it splits off the longest suffix that is a known lemma (using the existing inventory via resolve()), declines that component through the normal lookup→rules path, and re-attaches the fixed modifier prefix. So:

koirankeksi → split → koiran + keksi → decline keksi → keksissä → koirankeksissä

This is correct for the overwhelmingly common Finnish pattern — the head (final component) inflects; the modifier is frozen — and it makes vowel harmony fall out for free (harmony follows the head). It is the right 80% with ~30 lines and no new data.
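The fallback described above can be sketched in a few lines. This is not the code in engine.rs: `split_head`, `decline_compound`, the `HashSet` inventory, and the minimum-length constants are all assumptions chosen to make the idea concrete.

```rust
use std::collections::HashSet;

/// Split `word` into (frozen prefix, head), where the head is the longest
/// proper suffix found in `inventory`. The minimum lengths are illustrative.
fn split_head<'a>(word: &'a str, inventory: &HashSet<&str>) -> Option<(&'a str, &'a str)> {
    const MIN_PREFIX_CHARS: usize = 2;
    const MIN_HEAD_CHARS: usize = 3;
    // Walk split points left to right, so the first hit is the longest suffix.
    for (idx, _) in word.char_indices().skip(MIN_PREFIX_CHARS) {
        let (prefix, head) = word.split_at(idx);
        if head.chars().count() >= MIN_HEAD_CHARS && inventory.contains(head) {
            return Some((prefix, head));
        }
    }
    None
}

/// Decline the head with a caller-supplied function and re-attach the prefix.
fn decline_compound(
    word: &str,
    inventory: &HashSet<&str>,
    decline_head: impl Fn(&str) -> Option<String>,
) -> Option<String> {
    let (prefix, head) = split_head(word, inventory)?;
    decline_head(head).map(|form| format!("{prefix}{form}"))
}
```

Because only the head is passed to the declension function, whatever vowel harmony that function applies carries over to the compound unchanged, which is the "falls out for free" property noted above.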

Known limits of the heuristic

  1. Greedy/wrong splits. Longest-suffix matching can mis-segment when a long suffix happens to be a word (taatelitaikina → taatelita + ikina? No, but adversarial cases exist). Minimum prefix and component lengths mitigate this but do not eliminate it.
  2. Ambiguous heads. If the head has several paradigms (viini = tn 5 / 26), we take the first; harmony is unaffected but the stem could be wrong for the minority reading.
  3. Modifier-inflecting compounds. A minority inflect the modifier too: numerals (kahdeksankymmentä → kahdeksaakymmentä) and a few lexicalized nouns. The heuristic keeps the modifier frozen, which is wrong for these.
  4. Linking elements & foreign modifiers. -n-/-en- linkers (the n in koirankeksi) are part of the frozen prefix and need no handling; foreign modifiers (beaujolaisviini) work because only the head is looked up.

What “full Kotus 50/51” adds

Kotus marks compounds 50/51 (modifier+head, with/without modifier inflection). Full support means:

  1. Reliable segmentation. Prefer splits where both parts are known lemmas, score candidates (head frequency, prefix plausibility, linker shape), and fall back to the single-known-head heuristic. Consider Voikko’s own compound analysis as an oracle at ingest time (not at runtime).
  2. Modifier inflection. A small, explicit class (numerals + a curated lexicalized list) that inflects both parts; everything else freezes the modifier.
  3. Head paradigm selection. When the head is ambiguous, disambiguate via the compound’s own Kotus entry (50/51 rows carry the head’s class) rather than guessing.

The runtime stays lexicon-light: segmentation uses the packed inventory already loaded; the heavy lifting (which compounds exist, their head class, modifier-inflection flag) is resolved at ingest and baked into the artifact.

The hard part: collecting enough test data

We cannot hand-write paradigms for compounds at scale, and most compounds have no Wiktionary page. Three complementary sources, all Voikko-validated so they’re trustworthy:

1. Mine the reference corpus (highest value)

The corpus already contains compound surface forms — they’re currently dropped at the Kotus join because compounds aren’t in the 1–49 list. Instead:

  • Keep corpus rows whose BASEFORM Voikko analyses as a compound (Voikko exposes the word-part boundaries in WORDBASES/STRUCTURE).
  • For each, record (compound_lemma, number, case) → surface form as attested gold.
  • This yields tens of thousands of real, attested compound forms for free, and is the primary parity target: run the segmentation+decline path and check it reproduces them.

2. Voikko-validated synthetic compounds (coverage at scale)

The corpus is sparse per slot (same funnel as simple nouns). To fill gaps:

  • Take the ~25.7k Kotus heads × a curated set of frequent modifiers (in genitive and nominative linking forms), forming candidate compounds (koiran- + head, työ- + head…).
  • For each candidate, ask Voikko to generate/validate the full paradigm (Voikko knows the compound boundary and the head’s inflection), keeping only forms Voikko confirms.
  • This is the finnish-testgen skill generalized from one word to a compound matrix; output is the same ingest-compatible JSONL. Gate on Voikko agreement so synthetic ≠ wrong.

3. Wiktionary for the lexicalized/irregular tail

For the modifier-inflecting and lexicalized compounds (numerals, fixed expressions), pull the explicit tables from fi.wiktionary via finnish-testgen and add them as exceptions / gold. Small set, high value, where rules and synthetics are least reliable.

Acceptance gate

Extend the rule↔lookup parity harness with a compound parity metric: % of mined-corpus compound slots the segmentation+decline path reproduces, reported per split-confidence bucket. Ship when corpus-compound parity clears a documented threshold (target ≥ 98%, same bar as the simple-noun rule engine), and never regress it.
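One way to compute such a metric is sketched below. The `SlotResult` record, the bucket labels, and the exact-match criterion are assumptions for illustration; the harness may define buckets and matching differently (for example, by accepting any attested variant).

```rust
use std::collections::BTreeMap;

/// One mined corpus slot: its split-confidence bucket, the attested form,
/// and what the engine produced. All names here are hypothetical.
struct SlotResult<'a> {
    bucket: &'a str,
    attested: &'a str,
    produced: Option<&'a str>,
}

/// Percentage of slots reproduced exactly, reported per bucket.
fn parity_by_bucket(results: &[SlotResult<'_>]) -> BTreeMap<String, f64> {
    let mut tally: BTreeMap<String, (u32, u32)> = BTreeMap::new(); // (hits, total)
    for result in results {
        let entry = tally.entry(result.bucket.to_string()).or_insert((0, 0));
        entry.1 += 1;
        if result.produced == Some(result.attested) {
            entry.0 += 1;
        }
    }
    tally
        .into_iter()
        .map(|(bucket, (hits, total))| (bucket, 100.0 * f64::from(hits) / f64::from(total)))
        .collect()
}
```

Reporting per bucket makes it visible when overall parity is being carried by high-confidence splits while low-confidence ones lag behind.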

Rollout

  1. Heuristic final-component fallback — shipped (this PR).
  2. Ingest: keep Voikko-analysed compound forms as gold; build the corpus-compound parity harness (data + metric, no engine change).
  3. Synthetic compound matrix via Voikko; raise coverage; tune segmentation scoring.
  4. Modifier-inflection class + head-paradigm disambiguation from the 50/51 Kotus rows.
  5. Flip remaining “Out” wording once parity clears the gate.