keinontolibrary
A fast, embeddable Rust library that declines Finnish nouns. Give it a noun lemma and a case; it returns the inflected form(s).
decline("hevonen", Number::Plural, Case::Inessive) -> ["hevosissa"]
It is data-backed — precomputed forms from a reference corpus collected over three years and labeled with Voikko — with a rule-based fallback over the Kotus declension classes (taivutustyypit 1–49 with consonant gradation). It covers 100% of declinable Kotus 2024 nominal rows; verbs and the indeclinable tn 99/100 classes are out of scope.
These guides are task-oriented — pick the one that matches what you want to do:
- Embed in a Rust program — add the crate and decline words in your own code.
- Decline from the command line — look up forms from a terminal or script.
- Run the HTTP service — stand up the declension service / container.
- Build the data artifact — rebuild the packed data from sources.
- Fix a wrong declension — the right way to correct a form.
For background on how compounds are handled, see Compound nouns.
The source lives on GitHub. If you’d rather browse declensions than embed them, try humalapaikallissija.com, a toy built on this library.
How it works
Finnish nouns inflect across 15 cases and two numbers — roughly 30 forms per word — with consonant gradation, vowel harmony, and a long tail of irregulars that break any naive lookup table. keinontolibrary produces the forms through several cooperating layers, and then checks every one of them against an independent oracle.
The layers
- Rule generator (Kotus classes 1–49). Implements the Kotus declension types with consonant gradation (grades A–M) and vowel harmony. It handles the bulk of the language directly — about 98% agreement with the reference corpus.
- Compound classes (tn 50–51). Compounds inflect through their parts: tn 50 is head-inflecting (koirankeksi → koirankeksissä), tn 51 inflects both parts (isoveli → isoissaveljissä). A segmenter splits compounds; a plural-head reverse index and a combining-head registry reach the transparent compounds and derivations that Kotus leaves without a declension type.
- Corpus-backed lookup. Precomputed surface forms from a reference corpus (collected over three years) are the primary source where available; the rule generator is the fallback.
- Exception registry + overrides. Genuine irregulars the rules can’t derive (aika → ajan, the suppletive kuka → kenet, vaaka → vaa'an) live in a registry and per-slot overrides.
- Productive inference & resolvers. Class inference (-nen → 38, -uus → 40, -ias → 41), nested compounds, and compound numerals extend reach to 100% of declinable Kotus nominal rows.
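As a rough illustration of the class-inference step, the sketch below maps a lemma ending to a Kotus class using only the three endings named above. The function name `infer_class` and the table layout are hypothetical, and the library's real inference presumably covers more endings (front-vowel variants such as -yys, for instance).

```rust
/// Hypothetical helper: guess a Kotus class from the lemma's ending.
/// Only the three endings named in the text are covered.
fn infer_class(lemma: &str) -> Option<u8> {
    const RULES: [(&str, u8); 3] = [("nen", 38), ("uus", 40), ("ias", 41)];
    RULES
        .iter()
        .find(|(ending, _)| lemma.ends_with(*ending))
        .map(|&(_, class)| class)
}

fn main() {
    for lemma in ["hevonen", "uutuus", "utelias", "talo"] {
        // Lemmas with none of the known endings yield None.
        println!("{lemma}: {:?}", infer_class(lemma));
    }
}
```

A lemma that matches none of the endings returns `None`, which is where the other layers (lookup, registry, compound resolvers) would have to take over.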
Verification
Correctness isn’t asserted — it’s checked. Every form the engine can produce is validated against Voikko, an open-source Finnish morphological analyzer, used purely as an independent oracle: it analyzes each generated form, and a form is accepted only when Voikko confirms it as the expected lemma + case + number. Where the rule generator and the corpus disagree, the corpus is the witness and Voikko the tie-breaker. The quality gate runs at zero wrong forms across the in-scope inventory.
That is why “verified” is a claim rather than a slogan: each form is checked by a tool that had no hand in generating it.
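The acceptance rule can be sketched as follows. This is not Voikko's API: the `Oracle` trait, the `Analysis` struct, and `FakeOracle` are stand-ins invented for the example, and a real check would call the analyzer instead of a hard-coded table.

```rust
/// One reading returned by the oracle (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Analysis {
    lemma: String,
    number: &'static str,
    case: &'static str,
}

/// Stand-in for an independent morphological analyzer.
trait Oracle {
    fn analyze(&self, surface: &str) -> Vec<Analysis>;
}

/// A form is accepted only if some reading matches the expected
/// lemma + number + case exactly.
fn accepted(oracle: &dyn Oracle, surface: &str, expected: &Analysis) -> bool {
    oracle.analyze(surface).iter().any(|a| a == expected)
}

/// Toy oracle that knows a single form.
struct FakeOracle;

impl Oracle for FakeOracle {
    fn analyze(&self, surface: &str) -> Vec<Analysis> {
        match surface {
            "hevosissa" => vec![Analysis {
                lemma: "hevonen".into(),
                number: "plural",
                case: "inessive",
            }],
            _ => Vec::new(),
        }
    }
}

fn main() {
    let want = Analysis { lemma: "hevonen".into(), number: "plural", case: "inessive" };
    println!("hevosissa accepted: {}", accepted(&FakeOracle, "hevosissa", &want));
    println!("hevosisa accepted: {}", accepted(&FakeOracle, "hevosisa", &want));
}
```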
Data & provenance
- Lemma inventory and inflection metadata — the Kotus Nykysuomen sanalista 2024, licensed CC BY 4.0.
- Surface forms — our own reference corpus, collected over three years and labeled using Voikko as an analysis tool.
Scope: all nominal Kotus classes 1–51 — substantives, adjectives, numerals, and pronouns. Verbs and the indeclinable tn 99/100 classes are out of scope.
Embed in a Rust program
Decline Finnish nominals from your own crate.
Add the dependency
[dependencies]
keinontolibrary-core = "0.1"
keinontolibrary-data = "0.1" # the artifact-backed lookup store
keinontolibrary-core is the API (enums, Engine, decline, paradigm).
keinontolibrary-data loads the packed artifact and wires the rule fallback behind it.
Build an engine and decline a word
use keinontolibrary_core::{Case, Number};
use keinontolibrary_data::build_engine;
// Loads the packed artifact + (optional) overlay, with the rule engine as fallback.
let bundle = build_engine("data/artifact/keinontolibrary.bin", "data/overlay.jsonl")?;
let engine = &bundle.engine;
let forms = engine.decline("hevonen", Number::Plural, Case::Inessive)?;
assert_eq!(forms.primary(), Some("hevosissa"));
assert_eq!(forms.variants, ["hevosissa"]); // primary first; some slots have several
decline returns Forms { variants, status, source, coincides_with }:
- variants: surface forms, primary first (genitive plural and illative often have more than one valid form).
- source: Lookup (corpus), Generated (rules/registry), or Overlay.
- coincides_with: set on the accusative (singular = genitive, plural = nominative).
The whole paradigm
let p = engine.paradigm("talo")?;
for (number, case, forms) in p.iter() {
    println!("{number} {case}: {}", forms.variants.join(", "));
}
Handle the three error cases
decline/paradigm return Result<_, Error>:
use keinontolibrary_core::Error;
match engine.decline("kuusi", Number::Singular, Case::Inessive) {
    Ok(f) => println!("{:?}", f.primary()),
    Err(Error::UnknownWord(w)) => eprintln!("not a known word: {w}"),
    Err(Error::Ambiguous { lemma, paradigms }) => {
        // Homonyms: kuusi is "spruce" (tn24) and "six" (tn27). Pick one and retry.
        eprintln!("{lemma} is ambiguous: {paradigms:?}");
    }
    Err(Error::DefectiveForm { lemma, number, case }) => {
        // The slot genuinely does not exist (e.g. sakset has no singular).
        eprintln!("{lemma} has no {number} {case}");
    }
}
Disambiguate homonyms
Pass an explicit paradigm with decline_with / paradigm_with:
use keinontolibrary_core::ParadigmRef;
// "kuusi" → spruce (Kotus class tn24; the numeral "six" is tn27 and gives kuudessa)
let f = engine.decline_with("kuusi", Number::Singular, Case::Inessive,
    &ParadigmRef::new(None, 24))?;
assert_eq!(f.primary(), Some("kuusessa"));
ParadigmRef::new(hn, tn) — a None field is a wildcard (matches any homonym number /
any class). To require a specific reading, pass Some(_).
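The wildcard semantics can be pictured with a small stand-alone filter. This is a sketch, not the crate's `ParadigmRef`: the `Filter` and `Reading` types, their field types, and the homonym numbers used in `main` are assumptions made for the example.

```rust
/// One candidate reading of a lemma (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Reading {
    hn: u8, // homonym number
    tn: u8, // Kotus declension class
}

/// A `None` field matches anything; `Some(x)` requires equality.
#[derive(Debug, Clone, Copy)]
struct Filter {
    hn: Option<u8>,
    tn: Option<u8>,
}

impl Filter {
    fn matches(&self, r: &Reading) -> bool {
        self.hn.map_or(true, |hn| hn == r.hn) && self.tn.map_or(true, |tn| tn == r.tn)
    }
}

fn main() {
    // Two readings of one lemma; the hn values are made up for the example.
    let readings = [Reading { hn: 1, tn: 24 }, Reading { hn: 2, tn: 27 }];

    let any = Filter { hn: None, tn: None };
    let only_27 = Filter { hn: None, tn: Some(27) };

    println!("wildcard matches {}", readings.iter().filter(|r| any.matches(r)).count());
    println!("tn=27 matches {}", readings.iter().filter(|r| only_27.matches(r)).count());
}
```

A filter that still matches more than one reading leaves the lookup ambiguous, so the caller needs to pin at least one distinguishing field.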
Notes
- All inputs are normalized (trimmed, NFC, lowercased): "Talo", " talo ", and "talo" are equivalent.
- Resolution order is overlay → lookup → rule fallback: an overlay entry wins, then the corpus, then the rules.
- The artifact is not committed (it embeds corpus-derived data; see LICENSING.md). Build it once with cargo run -p keinontolibrary-ingest (see build-artifact), or embed your own via LookupData::from_bytes.
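The normalization and resolution order described in these notes can be sketched with plain maps. Everything here is illustrative: `normalize`, `resolve`, and the `Source` enum are hypothetical names, the maps are keyed by lemma for a single fixed slot to keep the example short, and NFC normalization is left out because the standard library does not provide it.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    Overlay,
    Lookup,
    Generated,
}

/// Trim and lowercase. NFC is omitted here: it needs a Unicode crate.
fn normalize(input: &str) -> String {
    input.trim().to_lowercase()
}

/// Overlay wins, then the corpus lookup, then the rule fallback.
fn resolve(
    word: &str,
    overlay: &HashMap<String, String>,
    lookup: &HashMap<String, String>,
    rules: impl Fn(&str) -> Option<String>,
) -> Option<(String, Source)> {
    let key = normalize(word);
    if let Some(form) = overlay.get(&key) {
        return Some((form.clone(), Source::Overlay));
    }
    if let Some(form) = lookup.get(&key) {
        return Some((form.clone(), Source::Lookup));
    }
    rules(&key).map(|form| (form, Source::Generated))
}

fn main() {
    let overlay = HashMap::from([("uudissana".to_string(), "uudissanassa".to_string())]);
    let lookup = HashMap::from([("talo".to_string(), "talossa".to_string())]);
    // Toy rule fallback that only knows one word.
    let rules = |w: &str| (w == "koira").then(|| "koirassa".to_string());

    for word in [" Talo ", "uudissana", "koira", "xyzzy"] {
        println!("{word:?} -> {:?}", resolve(word, &overlay, &lookup, &rules));
    }
}
```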
Decline from the command line
Quick lookups and scripting, no code.
Install
# From a checkout (until published to a package channel — see ../DISTRIBUTION.md):
cargo install --path crates/keinontolibrary-cli
# or run in place:
cargo run -p keinontolibrary-cli -- <args>
The CLI reads the artifact at data/artifact/keinontolibrary.bin (override with
--artifact or $KEINONTO_ARTIFACT) and an overlay at data/overlay.jsonl
(--overlay / $KEINONTO_OVERLAY).
Decline one slot
keinontolibrary decline hevonen --number plural --case inessive
# hevonen (plural inessive): hevosissa (Present, Lookup)
keinontolibrary decline kuka --number singular --case accusative
# kuka (singular accusative): kenet (Present, Generated)
The whole paradigm
keinontolibrary paradigm talo
# talo (tn=1)
# singular nominative talo
# singular genitive talon
# ...
Declension tables
table renders the full paradigm as a case-rows × singular/plural-columns grid, for one
or more words at once. Defective slots show an em dash (—).
keinontolibrary table talo
# talo (tn 1)
# case singular plural
# nominative talo talot
# genitive talon talojen
# ...
# comitative — taloineen, taloinensa
Pick the output with --format (text default, markdown, csv, json):
keinontolibrary table aika --format markdown # GitHub table, with a tn heading
keinontolibrary table parfait --format csv # case,singular,plural (RFC-4180 quoting)
keinontolibrary table talo --format json # the full Paradigm as JSON
# Several words; --tn/--hn disambiguate and apply to each:
keinontolibrary table talo koira kissa
keinontolibrary table kuusi --tn 27
Exit code 3 if any requested word could not be resolved.
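For readers consuming the CSV output, RFC 4180 quoting matters because a cell with several variants can contain a comma. The sketch below shows the quoting rule itself; `csv_field` and `csv_row` are hypothetical helpers, and joining variants with ", " inside one cell is an assumption about the format rather than something confirmed by the CLI.

```rust
/// Quote a field per RFC 4180: wrap in double quotes when it contains a
/// comma, a double quote, or a line break, and double any embedded quotes.
fn csv_field(value: &str) -> String {
    let needs_quotes = value.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r'));
    if needs_quotes {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Join already-quoted fields into one CSV record.
fn csv_row(fields: &[&str]) -> String {
    fields.iter().map(|f| csv_field(f)).collect::<Vec<_>>().join(",")
}

fn main() {
    println!("{}", csv_row(&["case", "singular", "plural"]));
    println!("{}", csv_row(&["genitive", "talon", "talojen"]));
    // A multi-variant cell contains a comma, so it gets quoted.
    println!("{}", csv_row(&["comitative", "", "taloineen, taloinensa"]));
}
```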
Disambiguate homonyms
keinontolibrary decline kuusi --number singular --case inessive
# 'kuusi' is ambiguous; pass --tn (or --hn):
# tn=24
# tn=27
keinontolibrary decline kuusi --number singular --case inessive --tn 24
# kuusi (singular inessive): kuusessa
Add or correct a word (overlay)
keinontolibrary add --lemma uudissana --tn 9 \
--number singular --case inessive --forms uudissanassa
# overlay: uudissana singular inessive = ["uudissanassa"]
override is an alias of add (the overlay is upsert-by-key — last write wins). New
overlay entries are immediately declinable and persist to the overlay file.
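Upsert-by-key behaves like a map insert: writing the same (lemma, number, case) key twice keeps only the later value. The sketch below is an in-memory model under that assumption; the `Overlay` type and its methods are hypothetical, and persistence to the JSONL file is not shown.

```rust
use std::collections::HashMap;

/// Key of one overlay entry: (lemma, number, case).
type Key = (String, String, String);

#[derive(Default)]
struct Overlay {
    entries: HashMap<Key, Vec<String>>,
}

impl Overlay {
    fn key(lemma: &str, number: &str, case: &str) -> Key {
        (lemma.to_string(), number.to_string(), case.to_string())
    }

    /// Create or replace; returns the previous forms if the key existed.
    fn upsert(&mut self, lemma: &str, number: &str, case: &str, forms: &[&str]) -> Option<Vec<String>> {
        let forms = forms.iter().map(|f| f.to_string()).collect();
        self.entries.insert(Self::key(lemma, number, case), forms)
    }

    /// First (primary) form stored for the slot, if any.
    fn primary(&self, lemma: &str, number: &str, case: &str) -> Option<&str> {
        self.entries
            .get(&Self::key(lemma, number, case))
            .and_then(|forms| forms.first())
            .map(String::as_str)
    }
}

fn main() {
    let mut overlay = Overlay::default();
    overlay.upsert("uudissana", "singular", "inessive", &["uudissanasa"]);
    // The second write replaces the first: last write wins.
    let previous = overlay.upsert("uudissana", "singular", "inessive", &["uudissanassa"]);
    println!("replaced {previous:?}");
    println!("now {:?}", overlay.primary("uudissana", "singular", "inessive"));
}
```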
JSON output
keinontolibrary decline talo --number plural --case adessive --json
# {"variants":["taloilla"],"status":"present","source":"lookup","coincides_with":null}
Exit codes (for scripts)
| code | meaning |
|---|---|
| 0 | success |
| 3 | the word could not be declined — unknown, ambiguous, or defective form |
| 1 | setup/usage error (bad artifact path, I/O) |
| 2 | argument parsing error (from clap) |
if keinontolibrary decline "$w" --number singular --case genitive --json >/tmp/out; then
jq -r '.variants[0]' /tmp/out
else
echo "no form for $w (exit $?)"
fi
validate — inspect the loaded artifact
keinontolibrary validate
# version, lemma count, form count, and the Kotus / reference-corpus provenance.
selftest — verify an install
selftest declines a built-in golden set through the rule engine and registry and checks
each form. It needs no artifact or data file, so it’s the smoke test to run right after
installing from any channel (cargo, brew, apt, the container, …). Exit 0 if every check
passes, 1 on any mismatch.
keinontolibrary selftest
# ok talo singular inessive: talossa (want talossa)
# ok aika singular genitive: ajan (want ajan)
# ...
# selftest: 8 checks passed
Run the HTTP service
A small axum server for declension lookups — the container deployment.
Run
cargo run -p keinontolibrary-server # listens on 0.0.0.0:8080
Configuration is via environment:
| var | default | purpose |
|---|---|---|
| KEINONTO_ARTIFACT | data/artifact/keinontolibrary.bin | the packed artifact |
| KEINONTO_OVERLAY | data/overlay.jsonl | persistent overlay (admin writes) |
| KEINONTO_ADDR | 0.0.0.0:8080 | bind address |
| KEINONTO_ADMIN_TOKEN | (unset) | bearer token; admin endpoints are disabled unless set |
| RUST_LOG | info | log level (structured tracing; requests are traced) |
Endpoints
curl 'localhost:8080/decline?word=hevonen&number=plural&case=inessive'
# {"variants":["hevosissa"],"status":"present","source":"lookup","coincides_with":null}
curl 'localhost:8080/paradigm?word=talo' # full table as JSON
curl 'localhost:8080/healthz' # "ok"
curl 'localhost:8080/about' # version, data metadata, attribution
Both /decline and /paradigm accept &hn= and &tn= to disambiguate homonyms.
Response status codes mirror the engine: 200 ok, 400 bad number/case (or overlong
word), 404 unknown word, 409 ambiguous (body lists the candidate paradigms), 422
defective form.
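A client can branch on these codes without parsing the body first. The mapping below restates the list above as code; the `Outcome` enum and `status_code` function are hypothetical names for the example and are not taken from the server's source.

```rust
/// Possible outcomes of a declension request (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    Ok,
    BadRequest,    // bad number/case, or overlong word
    UnknownWord,
    Ambiguous,     // body lists the candidate paradigms
    DefectiveForm,
}

/// HTTP status for each outcome, as listed in the text.
fn status_code(outcome: Outcome) -> u16 {
    match outcome {
        Outcome::Ok => 200,
        Outcome::BadRequest => 400,
        Outcome::UnknownWord => 404,
        Outcome::Ambiguous => 409,
        Outcome::DefectiveForm => 422,
    }
}

fn main() {
    for outcome in [
        Outcome::Ok,
        Outcome::BadRequest,
        Outcome::UnknownWord,
        Outcome::Ambiguous,
        Outcome::DefectiveForm,
    ] {
        println!("{outcome:?} -> {}", status_code(outcome));
    }
}
```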
Admin (overlay mutation)
Enabled only when KEINONTO_ADMIN_TOKEN is set. Both paths are aliases (create-or-replace):
curl -X POST localhost:8080/admin/add \
-H "authorization: Bearer $KEINONTO_ADMIN_TOKEN" \
-H 'content-type: application/json' \
-d '{"lemma":"uudissana","tn":9,"number":"singular","case":"inessive","variants":["uudissanassa"]}'
The token is compared in constant time (SHA-256 digests); bad tokens get 403. Request
bodies are capped at 16 KiB. Put the service behind a proxy for TLS and rate limiting.
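The idea behind a constant-time comparison is to inspect every byte regardless of where the first mismatch occurs, so response timing does not leak how much of the token was right. The sketch below shows only that comparison loop plus bearer-header parsing. It is an illustration, not the server's code: the SHA-256 digest step is omitted because the standard library has no SHA-256, both function names are hypothetical, and production code would normally rely on a vetted crate because compilers may optimize hand-written loops.

```rust
/// Extract the token from an `authorization` header value.
fn bearer_token(header: &str) -> Option<&str> {
    header.strip_prefix("Bearer ")
}

/// Compare two byte strings without returning early on the first mismatch.
/// With fixed-length digests as inputs, the length check leaks nothing.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y; // accumulate differences; never branch on them
    }
    diff == 0
}

fn main() {
    let expected = b"s3cret-token";
    let presented = bearer_token("Bearer s3cret-token").unwrap_or("");
    println!("authorized: {}", constant_time_eq(presented.as_bytes(), expected));
}
```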
Container
cargo run -p keinontolibrary-ingest # produce data/artifact/keinontolibrary.bin first
docker build -t keinontolibrary . # ~10 MB static-musl scratch image
docker run -p 8080:8080 keinontolibrary
The image runs unprivileged (USER 65532) and ships a HEALTHCHECK (the binary
self-probes via --health). The server drains in-flight requests on SIGTERM/SIGINT, so
it stops cleanly under an orchestrator.
Build the data artifact
The packed artifact (data/artifact/keinontolibrary.bin) is not committed — it embeds
reference-corpus-derived forms whose redistribution license is unresolved (see
LICENSING.md). Build it yourself from the sources.
Sources
- Kotus Nykysuomen sanalista 2024 (CC BY 4.0), the lemma inventory: https://kaino.kotus.fi/lataa/nykysuomensanalista2024.txt → data/sources/
- Reference corpus: Voikko-labeled JSONL shards (collected by the project). Place the *.jsonl shards in data/sources/voikko/.
Ingest
cargo run -p keinontolibrary-ingest # Kotus + corpus -> data/artifact/keinontolibrary.bin
The artifact is framed (KEIN magic + format-version byte + CRC32); a corrupt, truncated,
or version-mismatched file is rejected loudly on load. The build is deterministic — same
sources produce a byte-identical artifact.
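To make the framing concrete, here is a minimal sketch of a loader that rejects corrupt, truncated, or version-mismatched input. The byte layout is an assumption made for the example (4-byte KEIN magic, one version byte, payload, then a little-endian CRC32 of the payload); the real artifact may order or scope these fields differently. `frame`, `unframe`, `FrameError`, and the `VERSION` value are likewise hypothetical.

```rust
const MAGIC: &[u8; 4] = b"KEIN";
const VERSION: u8 = 1; // assumed value for the example

#[derive(Debug, PartialEq, Eq)]
enum FrameError {
    Truncated,
    BadMagic,
    VersionMismatch(u8),
    BadChecksum,
}

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Wrap a payload: magic | version | payload | crc32(payload) LE.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 9);
    out.extend_from_slice(MAGIC);
    out.push(VERSION);
    out.extend_from_slice(payload);
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Validate the frame and return the payload slice.
fn unframe(bytes: &[u8]) -> Result<&[u8], FrameError> {
    if bytes.len() < 9 {
        return Err(FrameError::Truncated);
    }
    if bytes[..4] != MAGIC[..] {
        return Err(FrameError::BadMagic);
    }
    if bytes[4] != VERSION {
        return Err(FrameError::VersionMismatch(bytes[4]));
    }
    let (payload, tail) = bytes[5..].split_at(bytes.len() - 9);
    let stored = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if crc32(payload) != stored {
        return Err(FrameError::BadChecksum);
    }
    Ok(payload)
}

fn main() {
    let framed = frame(b"packed forms");
    println!("round trip ok: {}", unframe(&framed).is_ok());

    let mut corrupt = framed.clone();
    corrupt[6] ^= 0xFF; // flip bits inside the payload
    println!("corrupt: {:?}", unframe(&corrupt));
}
```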
Overrides (optional, Voikko-required)
Four probe-minted sidecars in data/ refine forms the spelling alone cannot determine —
vowel harmony (antigeenissä), comitative style (-ine vs -ineen), and foreign
citations (parfait'n, cd:n). They are committed; regenerate after rule changes with:
scripts/qa/run.sh harmony # needs libvoikko + the Python venv (run.sh setup)
Verify with the QA loop
The QA loop generates every form, checks each against Voikko + the corpus, and gates on regressions and total coverage:
scripts/qa/run.sh setup # one-time: venv + libvoikko
scripts/qa/run.sh all # ingest -> dump -> verify -> report --gate
The gate holds two invariants: 0 failing slots and 100% total coverage (every
Kotus nominal × every slot answered or declared defective). See
scripts/qa/README.md for the full workflow and the
accepted-list mechanism.
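The two invariants reduce to a simple check over the report totals. The sketch below is illustrative only: the `Report` fields and the `gate` function are hypothetical, and the real QA loop also handles baselines and the accepted list, which are not modeled here.

```rust
/// Totals from one QA run (hypothetical shape).
struct Report {
    total_slots: usize,
    answered: usize,
    defective: usize, // slots declared not to exist
    failing: usize,
}

/// Pass only with zero failing slots and every slot either answered
/// or declared defective.
fn gate(report: &Report) -> Result<(), String> {
    if report.failing != 0 {
        return Err(format!("{} failing slots", report.failing));
    }
    let covered = report.answered + report.defective;
    if covered != report.total_slots {
        return Err(format!("coverage {covered}/{}", report.total_slots));
    }
    Ok(())
}

fn main() {
    let clean = Report { total_slots: 30, answered: 29, defective: 1, failing: 0 };
    let gap = Report { total_slots: 30, answered: 28, defective: 1, failing: 0 };
    println!("clean: {:?}", gate(&clean));
    println!("gap: {:?}", gate(&gap));
}
```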
Fix a wrong declension (the right way)
The library is gated at 100% coverage / 0 failing slots against Voikko + the corpus. Every change keeps it there. The workflow for fixing or adding a form:
1. Reproduce
keinontolibrary decline <word> --number <n> --case <c>
Decide whether it’s a rule problem (a whole class is wrong), a lexical one (one irregular word), or a gap (no form at all).
2. Mint Voikko-verified gold data
Never hand-write forms from memory — mint them. The finnish-testgen skill pulls the
fi.wiktionary table and validates every form through Voikko:
DYLD_LIBRARY_PATH=/opt/homebrew/lib .venv/bin/python \
.claude/skills/finnish-testgen/scripts/mint_testdata.py <word> \
--kotus data/sources/nykysuomensanalista2024.txt
3. Make the fix at the right altitude
- Rule: edit the class arm in crates/keinontolibrary-rules/src/generate.rs (or the gradation/harmony helpers). Prefer generalizing over special-casing.
- Irregular: add Voikko-verified rows to crates/keinontolibrary-rules/exceptions.toml (the registry rejects duplicate keys and is CI-capped).
- Override: for harmony / comitative / citation quirks, regenerate the sidecar (scripts/qa/run.sh harmony) rather than editing forms by hand.
Add a unit test with the Voikko-verified forms next to the change.
4. Run the gate
scripts/qa/run.sh quick # while iterating (sampled)
scripts/qa/run.sh all # full: ingest -> dump -> verify -> report --gate
The gate fails on any new failing slot or coverage drop. If a slot genuinely
cannot be judged by Voikko (a lemma outside its lexicon, a Kotus↔Voikko disagreement),
add it to qa/accepted.jsonl with a reason — never re-baseline over a real
regression. Update the baseline only in the same PR as the fix:
scripts/qa/run.sh report -- --update-baseline
5. Standard checks
cargo fmt --all && cargo clippy --all-targets --all-features -- -D warnings && cargo test --all-features
CI additionally runs the MSRV build (1.85) and cargo-audit. See
scripts/qa/README.md for the loop’s internals.
Compound nouns (Kotus 50/51) — design & test-data plan
Compounds are the bulk of any Finnish dictionary, and they’re the largest remaining gap in keinontolibrary. This doc records the design and, in particular, how we get enough test data to trust it.
What ships today
Engine has a final-component fallback (see keinontolibrary-core/src/engine.rs): when a
lemma is unknown as a whole, it splits off the longest suffix that is a known lemma
(using the existing inventory via resolve()), declines that component through the normal
lookup→rules path, and re-attaches the fixed modifier prefix. So:
koirankeksi → split → koiran + keksi → decline keksi → keksissä → koirankeksissä
This is correct for the overwhelmingly common Finnish pattern — the head (final component) inflects; the modifier is frozen — and it makes vowel harmony fall out for free (harmony follows the head). It is the right 80% with ~30 lines and no new data.
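The fallback can be sketched in a few lines. This is a simplified model, not the code in engine.rs: `split_head`, `decline_compound`, and the `MIN_PREFIX` value are hypothetical, the inventory is a plain set of lemmas, and the head decliner is passed in as a closure instead of going through the lookup→rules path.

```rust
use std::collections::HashSet;

/// Assumed minimum modifier length, in characters.
const MIN_PREFIX: usize = 3;

/// Split off the longest suffix that is a known lemma.
/// Scanning split points left to right finds the longest suffix first.
fn split_head<'a>(word: &'a str, inventory: &HashSet<&str>) -> Option<(&'a str, &'a str)> {
    word.char_indices()
        .map(|(i, _)| i)
        .filter(|&i| word[..i].chars().count() >= MIN_PREFIX)
        .map(|i| word.split_at(i))
        .find(|(_, head)| inventory.contains(*head))
}

/// Decline the head and re-attach the frozen modifier prefix.
fn decline_compound(
    word: &str,
    inventory: &HashSet<&str>,
    decline_head: impl Fn(&str) -> Option<String>,
) -> Option<String> {
    let (prefix, head) = split_head(word, inventory)?;
    let form = decline_head(head)?;
    Some(format!("{prefix}{form}"))
}

fn main() {
    let inventory: HashSet<&str> = ["keksi", "koira"].into_iter().collect();
    // Toy head decliner: singular inessive of one known word.
    let inessive = |head: &str| (head == "keksi").then(|| "keksissä".to_string());

    println!("{:?}", split_head("koirankeksi", &inventory));
    println!("{:?}", decline_compound("koirankeksi", &inventory, inessive));
}
```

Because the head is declined on its own, its suffix already carries the head's vowel harmony, which is why the frozen prefix needs no harmony handling.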
Known limits of the heuristic
- Greedy/wrong splits. Longest-suffix can mis-segment when a long suffix is coincidentally a word (taatelitaikina → taatelita + ikina? No, but adversarial cases exist). Mitigated by min prefix/component lengths; not eliminated.
- Ambiguous heads. If the head has several paradigms (viini = tn 5 / 26), we take the first; harmony is unaffected but the stem could be wrong for the minority reading.
- Modifier-inflecting compounds. A minority inflect the modifier too: numerals (kahdeksankymmentä → kahdeksaakymmentä) and a few lexicalized nouns. The heuristic keeps the modifier frozen, which is wrong for these.
- Linking elements & foreign modifiers. -n-/-en- linkers (the n in koirankeksi) are part of the frozen prefix and need no handling; foreign modifiers (beaujolaisviini) work because only the head is looked up.
What “full Kotus 50/51” adds
Kotus marks compounds as tn 50/51 (modifier + head; tn 50 without modifier inflection, tn 51 with it). Full support means:
- Reliable segmentation. Prefer splits where both parts are known lemmas, score candidates (head frequency, prefix plausibility, linker shape), and fall back to the single-known-head heuristic. Consider Voikko’s own compound analysis as an oracle at ingest time (not at runtime).
- Modifier inflection. A small, explicit class (numerals + a curated lexicalized list) that inflects both parts; everything else freezes the modifier.
- Head paradigm selection. When the head is ambiguous, disambiguate via the compound’s own Kotus entry (50/51 rows carry the head’s class) rather than guessing.
The runtime stays lexicon-light: segmentation uses the packed inventory already loaded; the heavy lifting (which compounds exist, their head class, modifier-inflection flag) is resolved at ingest and baked into the artifact.
The hard part: collecting enough test data
We cannot hand-write paradigms for compounds at scale, and most compounds have no Wiktionary page. Three complementary sources, all Voikko-validated so they’re trustworthy:
1. Mine the reference corpus (highest value)
The corpus already contains compound surface forms — they’re currently dropped at the Kotus join because compounds aren’t in the 1–49 list. Instead:
- Keep corpus rows whose BASEFORM Voikko analyses as a compound (Voikko exposes the word-part boundaries in WORDBASES/STRUCTURE).
- For each, record (compound_lemma, number, case) → surface form as attested gold.
- This yields tens of thousands of real, attested compound forms for free, and is the primary parity target: run the segmentation+decline path and check it reproduces them.
2. Voikko-validated synthetic compounds (coverage at scale)
The corpus is sparse per slot (same funnel as simple nouns). To fill gaps:
- Take the ~25.7k Kotus heads × a curated set of frequent modifiers (in genitive and nominative linking forms), forming candidate compounds (koiran- + head, työ- + head, …).
- For each candidate, ask Voikko to generate/validate the full paradigm (Voikko knows the compound boundary and the head’s inflection), keeping only forms Voikko confirms.
- This is the finnish-testgen skill generalized from one word to a compound matrix; output is the same ingest-compatible JSONL. Gate on Voikko agreement so synthetic ≠ wrong.
3. Wiktionary for the lexicalized/irregular tail
For the modifier-inflecting and lexicalized compounds (numerals, fixed expressions), pull
the explicit tables from fi.wiktionary via finnish-testgen and add them as exceptions /
gold. Small set, high value, where rules and synthetics are least reliable.
Acceptance gate
Extend the rule↔lookup parity harness with a compound parity metric: % of mined-corpus compound slots the segmentation+decline path reproduces, reported per split-confidence bucket. Ship when corpus-compound parity clears a documented threshold (target ≥ 98%, same bar as the simple-noun rule engine), and never regress it.
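The compound parity metric could be computed along these lines. The sketch assumes each mined slot carries a gold form, the form the segmentation+decline path produced (if any), and a split-confidence bucket label; the `Slot` struct, the bucket names, and `parity_by_bucket` are all hypothetical.

```rust
use std::collections::BTreeMap;

/// One mined-corpus compound slot (hypothetical shape).
struct Slot {
    bucket: &'static str,            // split-confidence bucket
    gold: &'static str,              // attested surface form
    produced: Option<&'static str>,  // what the engine generated, if anything
}

/// Percentage of slots reproduced exactly, per bucket.
fn parity_by_bucket(slots: &[Slot]) -> BTreeMap<&'static str, f64> {
    let mut counts: BTreeMap<&'static str, (usize, usize)> = BTreeMap::new();
    for slot in slots {
        let entry = counts.entry(slot.bucket).or_insert((0, 0));
        entry.1 += 1;
        if slot.produced == Some(slot.gold) {
            entry.0 += 1;
        }
    }
    counts
        .into_iter()
        .map(|(bucket, (hits, total))| (bucket, 100.0 * hits as f64 / total as f64))
        .collect()
}

fn main() {
    let slots = [
        Slot { bucket: "high", gold: "koirankeksissä", produced: Some("koirankeksissä") },
        Slot { bucket: "high", gold: "koirankeksin", produced: Some("koirankeksin") },
        Slot { bucket: "low", gold: "isoissaveljissä", produced: Some("isoveljissä") },
        Slot { bucket: "low", gold: "isonveljen", produced: Some("isonveljen") },
    ];
    for (bucket, pct) in parity_by_bucket(&slots) {
        // The text's target is at least 98% before shipping.
        println!("{bucket}: {pct:.1}% (meets 98% target: {})", pct >= 98.0);
    }
}
```

Reporting per bucket keeps a weak low-confidence bucket from hiding behind a strong overall average.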
Rollout
- Heuristic final-component fallback — shipped (this PR).
- Ingest: keep Voikko-analysed compound forms as gold; build the corpus-compound parity harness (data + metric, no engine change).
- Synthetic compound matrix via Voikko; raise coverage; tune segmentation scoring.
- Modifier-inflection class + head-paradigm disambiguation from the 50/51 Kotus rows.
- Flip remaining “Out” wording once parity clears the gate.