Build the data artifact

The packed artifact (data/artifact/keinontolibrary.bin) is not committed — it embeds reference-corpus-derived forms whose redistribution license is unresolved (see LICENSING.md). Build it yourself from the sources.

Sources

  1. Kotus Nykysuomen sanalista 2024 (CC BY 4.0), the lemma inventory. Download it from https://kaino.kotus.fi/lataa/nykysuomensanalista2024.txt and place it in data/sources/.
  2. Reference corpus — Voikko-labeled JSONL shards (collected by the project). Place the *.jsonl shards in data/sources/voikko/.

Ingest

cargo run -p keinontolibrary-ingest          # Kotus + corpus -> data/artifact/keinontolibrary.bin

The artifact is framed (KEIN magic + format-version byte + CRC32); a corrupt, truncated, or version-mismatched file is rejected loudly on load. The build is deterministic — same sources produce a byte-identical artifact.
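To make the framing concrete, here is a minimal sketch of how a magic + version + CRC32 frame can be written and checked. It is not the project's loader. The byte layout (4-byte magic, 1-byte version, payload, trailing little-endian CRC32 over everything before it), the version value 1, and the names `frame`, `unframe`, and `FrameError` are all assumptions made for illustration.

```rust
// Hypothetical layout, assumed for illustration only:
// [ b"KEIN" | version: u8 | payload ... | CRC32 (little-endian) of all preceding bytes ]
const MAGIC: &[u8; 4] = b"KEIN";
const FORMAT_VERSION: u8 = 1; // assumed value, not taken from the project
const HEADER_LEN: usize = 5;
const TRAILER_LEN: usize = 4;

#[derive(Debug, PartialEq)]
enum FrameError {
    Truncated,
    BadMagic,
    VersionMismatch { found: u8 },
    BadChecksum,
}

/// Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and PNG).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Wrap a payload in the assumed frame.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len() + TRAILER_LEN);
    out.extend_from_slice(MAGIC);
    out.push(FORMAT_VERSION);
    out.extend_from_slice(payload);
    let crc = crc32(&out);
    out.extend_from_slice(&crc.to_le_bytes());
    out
}

/// Validate the frame and return the payload, or say why it was rejected.
fn unframe(bytes: &[u8]) -> Result<&[u8], FrameError> {
    // Too short to hold even an empty payload.
    if bytes.len() < HEADER_LEN + TRAILER_LEN {
        return Err(FrameError::Truncated);
    }
    if bytes[..4] != MAGIC[..] {
        return Err(FrameError::BadMagic);
    }
    if bytes[4] != FORMAT_VERSION {
        return Err(FrameError::VersionMismatch { found: bytes[4] });
    }
    let (body, trailer) = bytes.split_at(bytes.len() - TRAILER_LEN);
    let stored = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    // A file cut off mid-payload, or with flipped bytes, fails here.
    if crc32(body) != stored {
        return Err(FrameError::BadChecksum);
    }
    Ok(&body[HEADER_LEN..])
}
```

In this sketch, `unframe(&frame(b"hello"))` returns the original payload, while flipping any payload byte yields `FrameError::BadChecksum`. Checking magic and version before the checksum lets a version mismatch be reported as such rather than as generic corruption.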

Overrides (optional; requires Voikko)

Four probe-minted sidecars in data/ refine forms the spelling alone cannot determine — vowel harmony (antigeenissä), comitative style (-ine vs -ineen), and foreign citations (parfait'n, cd:n). They are committed; regenerate after rule changes with:

scripts/qa/run.sh harmony      # needs libvoikko + the Python venv (run.sh setup)

Verify with the QA loop

The QA loop generates every form, checks each against Voikko + the corpus, and gates on regressions and total coverage:

scripts/qa/run.sh setup        # one-time: venv + libvoikko
scripts/qa/run.sh all          # ingest -> dump -> verify -> report --gate

The gate holds two invariants: 0 failing slots and 100% total coverage (every Kotus nominal × every slot answered or declared defective). See scripts/qa/README.md for the full workflow and the accepted-list mechanism.
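The two invariants can be sketched as a small check over per-slot outcomes. This is an illustration under stated assumptions, not the logic in scripts/qa: the `SlotStatus` and `GateReport` types are hypothetical, a failing slot is assumed to count as answered for coverage purposes, and the accepted-list mechanism is not modeled.

```rust
/// Hypothetical outcome for one (nominal, slot) cell.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SlotStatus {
    Pass,       // a form was generated and verified
    Defective,  // explicitly declared to have no form; counts as covered
    Fail,       // a form was generated but rejected
    Unanswered, // no form and no defective declaration
}

#[derive(Debug, PartialEq)]
struct GateReport {
    failing: usize,
    covered: usize,
    total: usize,
}

impl GateReport {
    /// The gate passes only with zero failures and every cell covered.
    /// Integer comparison avoids rounding a near-100% ratio up to 100%.
    fn passes(&self) -> bool {
        self.failing == 0 && self.covered == self.total
    }
}

fn evaluate(statuses: &[SlotStatus]) -> GateReport {
    let mut report = GateReport { failing: 0, covered: 0, total: statuses.len() };
    for status in statuses {
        match status {
            SlotStatus::Pass | SlotStatus::Defective => report.covered += 1,
            // Assumption: a failing slot is still "answered", so it is
            // covered but trips the zero-failures invariant.
            SlotStatus::Fail => {
                report.failing += 1;
                report.covered += 1;
            }
            SlotStatus::Unanswered => {}
        }
    }
    report
}
```

Under these assumptions, a single `Fail` or a single `Unanswered` cell is enough to make `passes()` return false, which matches the all-or-nothing wording of the gate.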