Build the data artifact
The packed artifact (data/artifact/keinontolibrary.bin) is not committed — it embeds
reference-corpus-derived forms whose redistribution license is unresolved (see
LICENSING.md). Build it yourself from the sources.
Sources
- Kotus Nykysuomen sanalista 2024 (CC BY 4.0) — the lemma inventory:
https://kaino.kotus.fi/lataa/nykysuomensanalista2024.txt → data/sources/
- Reference corpus: Voikko-labeled JSONL shards (collected by the project). Place the
*.jsonl shards in data/sources/voikko/.
Ingest
cargo run -p keinontolibrary-ingest # Kotus + corpus -> data/artifact/keinontolibrary.bin
The artifact is framed (KEIN magic + format-version byte + CRC32); a corrupt, truncated,
or version-mismatched file is rejected loudly on load. The build is deterministic — same
sources produce a byte-identical artifact.
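To make the framing concrete, here is a minimal sketch of the kind of check a loader could run before trusting the file. The byte layout is an assumption for illustration only: 4 magic bytes, 1 version byte, the payload, then a little-endian CRC32 of the payload. The names `FORMAT_VERSION`, `FrameError`, `frame`, and `validate_frame` are hypothetical, and the real artifact may order or size these fields differently.

```rust
// ASSUMED layout (hypothetical, not taken from the project's source):
//   bytes 0..4      magic b"KEIN"
//   byte  4         format version
//   bytes 5..n-4    payload
//   last 4 bytes    CRC32 (IEEE polynomial, little-endian) of the payload
const MAGIC: &[u8; 4] = b"KEIN";
const FORMAT_VERSION: u8 = 1; // hypothetical version number

#[derive(Debug, PartialEq)]
enum FrameError {
    Truncated,
    BadMagic,
    VersionMismatch { found: u8 },
    BadChecksum,
}

/// Bitwise CRC32 (reflected polynomial 0xEDB88320), no external crates.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, else zero
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Wrap a payload in the assumed frame.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 9);
    out.extend_from_slice(MAGIC);
    out.push(FORMAT_VERSION);
    out.extend_from_slice(payload);
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Return the payload only if magic, version, and checksum all match.
fn validate_frame(bytes: &[u8]) -> Result<&[u8], FrameError> {
    // Too short to hold magic + version + checksum at all.
    if bytes.len() < 9 {
        return Err(FrameError::Truncated);
    }
    if !bytes.starts_with(MAGIC) {
        return Err(FrameError::BadMagic);
    }
    if bytes[4] != FORMAT_VERSION {
        return Err(FrameError::VersionMismatch { found: bytes[4] });
    }
    let (body, tail) = bytes[5..].split_at(bytes.len() - 9);
    let stored = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if crc32(body) != stored {
        return Err(FrameError::BadChecksum);
    }
    Ok(body)
}

fn main() {
    let good = frame(b"example payload");
    assert_eq!(validate_frame(&good), Ok(&b"example payload"[..]));

    // Flip one payload byte: the stored checksum no longer matches.
    let mut corrupt = good.clone();
    corrupt[7] ^= 0xFF;
    assert_eq!(validate_frame(&corrupt), Err(FrameError::BadChecksum));
    println!("frame checks passed");
}
```

In this sketch a file cut off mid-payload is normally caught by the checksum comparison rather than the length check, because the last four bytes it reads are no longer the real CRC.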
Overrides (optional, Voikko-required)
Four probe-minted sidecars in data/ refine forms the spelling alone cannot determine —
vowel harmony (antigeenissä), comitative style (-ine vs -ineen), and foreign
citations (parfait'n, cd:n). They are committed; regenerate after rule changes with:
scripts/qa/run.sh harmony # needs libvoikko + the Python venv (run.sh setup)
Verify with the QA loop
The QA loop generates every form, checks each against Voikko + the corpus, and gates on regressions and total coverage:
scripts/qa/run.sh setup # one-time: venv + libvoikko
scripts/qa/run.sh all # ingest -> dump -> verify -> report --gate
The gate holds two invariants: 0 failing slots and 100% total coverage (every
Kotus nominal × every slot answered or declared defective). See
scripts/qa/README.md for the full workflow and the
accepted-list mechanism.
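The two invariants can be read as a simple predicate over the nominal × slot grid. The sketch below is only an illustration of that arithmetic, not the QA scripts' actual logic: the `Slot`, `GateReport`, and `evaluate` names are hypothetical, and it assumes a failing slot still counts as answered for coverage purposes.

```rust
/// Outcome of one (nominal, slot) cell. Hypothetical names for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Slot {
    Pass,       // form generated and accepted
    Fail,       // form generated but rejected (ASSUMED to count as answered)
    Defective,  // slot declared defective for this nominal
    Unanswered, // no form and no defective declaration
}

#[derive(Debug, PartialEq)]
struct GateReport {
    failing: usize,
    covered: usize,
    total: usize,
}

impl GateReport {
    /// Both invariants must hold: 0 failing slots and 100% coverage.
    fn passes(&self) -> bool {
        self.failing == 0 && self.covered == self.total
    }
}

/// Tally a flattened nominal x slot grid.
fn evaluate(cells: &[Slot]) -> GateReport {
    let failing = cells.iter().filter(|&&c| c == Slot::Fail).count();
    let covered = cells.iter().filter(|&&c| c != Slot::Unanswered).count();
    GateReport { failing, covered, total: cells.len() }
}

fn main() {
    // Two nominals x two slots, flattened into one slice.
    let cells = [Slot::Pass, Slot::Pass, Slot::Defective, Slot::Unanswered];
    let report = evaluate(&cells);
    println!(
        "{}/{} covered, {} failing, gate passes: {}",
        report.covered,
        report.total,
        report.failing,
        report.passes()
    );
}
```

Under these assumptions a single unanswered cell is enough to fail the gate even when nothing is failing, which is why coverage is tracked as a separate invariant.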