Package org.egothor.stemmer


Provides the core Egothor-style stemming infrastructure based on compact patch-command tries.

The package centers on a read-only FrequencyTrie that maps word forms to one or more values together with their recorded local frequencies. In the stemming use case, these values are compact patch commands that reconstruct a canonical stem from an observed surface form. The trie is built through FrequencyTrie.Builder, reduced into a canonical immutable structure, and then queried through deterministic get(String), getAll(String), and getEntries(String) operations.
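The lookup model described above can be illustrated with a minimal sketch. This is not the library's FrequencyTrie: the class name FrequencyTrieSketch, the put method, and the tie-breaking rule (equal counts ordered by value) are assumptions made for illustration, and the sketch stays mutable instead of compiling into a reduced read-only form.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a tiny trie that counts how often each value is stored under a key. */
final class FrequencyTrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        final Map<String, Integer> counts = new HashMap<>();
    }

    private final Node root = new Node();

    /** Records one more occurrence of value under key (hypothetical builder-style method). */
    void put(String key, String value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.counts.merge(value, 1, Integer::sum);
    }

    /** Returns every value stored under key, most frequent first. */
    List<String> getAll(String key) {
        Node node = root;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.children.get(key.charAt(i));
        }
        if (node == null) {
            return List.of();
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(node.counts.entrySet());
        // Assumed ordering: higher count first, then value order, so results are deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            values.add(entry.getKey());
        }
        return values;
    }

    /** Returns the most frequent value stored under key, or null when the key is unknown. */
    String get(String key) {
        List<String> all = getAll(key);
        return all.isEmpty() ? null : all.get(0);
    }
}
```

In this sketch, get(key) is simply the first element of getAll(key), which is one straightforward way to keep the two operations consistent with each other.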

Patch commands are produced and interpreted by PatchCommandEncoder. The encoder follows the historical Egothor convention in which edit instructions are serialized for application from the end of the source word toward its beginning. The implementation supports canonical no-operation patches for identity transformations and compact commands for insertion, deletion, replacement, and suffix-preserving transitions.
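The idea of a patch that is applied from the end of the word can be sketched as follows. The format used here (a count of trailing characters to delete, a separator, and a suffix to append) is a deliberately simplified stand-in and is not the encoding produced by PatchCommandEncoder; the class name PatchSketch and both method names are hypothetical.

```java
/** Illustrative only: a simplified end-anchored patch, not the Egothor command format. */
final class PatchSketch {

    /** Assumed representation of the identity transformation. */
    static final String NOOP = "";

    /** Builds a patch that turns source into target by editing only the tail of source. */
    static String encode(String source, String target) {
        if (source.equals(target)) {
            return NOOP;
        }
        int common = 0;
        int limit = Math.min(source.length(), target.length());
        while (common < limit && source.charAt(common) == target.charAt(common)) {
            common++;
        }
        int deletions = source.length() - common;
        return deletions + "|" + target.substring(common);
    }

    /** Applies a patch created by encode to a source word. */
    static String apply(String source, String patch) {
        if (patch.equals(NOOP)) {
            return source;
        }
        int separator = patch.indexOf('|');
        if (separator < 0) {
            throw new IllegalArgumentException("Malformed patch: " + patch);
        }
        int deletions = Integer.parseInt(patch.substring(0, separator));
        if (deletions > source.length()) {
            throw new IllegalArgumentException("Patch deletes more characters than the word has");
        }
        // Drop the trailing characters, then append the replacement suffix.
        return source.substring(0, source.length() - deletions) + patch.substring(separator + 1);
    }
}
```

Because the patch records only what happens at the tail, many different surface forms that share the same ending can share the same patch value, which is what makes the trie compact.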

Dictionary loading is provided by StemmerPatchTrieLoader, which reads the traditional line-oriented stemmer resource format: each non-empty logical line starts with a canonical stem followed by its known surface variants. Parsing is delegated to StemmerDictionaryParser, which normalizes input to lower case using Locale.ROOT and supports both whole-line and trailing remarks introduced by # or //. During loading, each variant is converted into a patch command targeting the canonical stem, and the stem itself may optionally be stored under the canonical no-operation patch.
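A minimal sketch of how one dictionary line might be processed is shown below. It is not StemmerDictionaryParser: the class name DictionaryLineSketch, the parseLine method, and the assumption that tokens are separated by whitespace are all illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative only: parses one line of a stem-plus-variants dictionary. */
final class DictionaryLineSketch {

    /**
     * Returns the lower-cased tokens of a line, with the stem first and the variants after it.
     * Returns an empty list for blank lines and for lines that contain only a remark.
     */
    static List<String> parseLine(String line) {
        int cut = line.length();
        int hash = line.indexOf('#');
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        int slashes = line.indexOf("//");
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }
        // Locale.ROOT keeps case folding independent of the default locale of the JVM.
        String content = line.substring(0, cut).trim().toLowerCase(Locale.ROOT);
        if (content.isEmpty()) {
            return List.of();
        }
        // Assumption: tokens are separated by runs of whitespace.
        return Arrays.asList(content.split("\\s+"));
    }
}
```

For example, parseLine("Run running RUNS // verbs") yields the stem "run" followed by the variants "running" and "runs", while a line such as "# header" yields an empty list.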

Trie compilation behavior is controlled by ReductionMode and ReductionSettings. These types define how semantically equivalent subtrees may be merged during compilation in order to reduce the size of the final immutable trie while preserving the intended lookup semantics. Depending on the selected mode, reduction may preserve full ranked getAll() semantics, unordered value equivalence, or dominant get() semantics subject to configurable dominance thresholds.
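The general idea of merging equivalent subtrees can be sketched with a bottom-up canonicalization pass. This sketch merges only subtrees that are structurally identical and carry identical values; it does not model ReductionMode, ReductionSettings, frequencies, or dominance thresholds, and every name in it (SubtreeMergeSketch, Node, insert, reduce, distinctNodes) is hypothetical.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: shares identical subtrees of a trie through signature lookup. */
final class SubtreeMergeSketch {

    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        String value; // null when no value is stored at this node
    }

    private final Map<String, Node> canonical = new HashMap<>();
    private final Map<Node, Integer> ids = new IdentityHashMap<>();

    /** Stores value under key in an unreduced trie. */
    static void insert(Node root, String key, String value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.value = value;
    }

    /** Returns the canonical node for the subtree rooted at node, reducing children first. */
    Node reduce(Node node) {
        StringBuilder signature = new StringBuilder();
        // The length prefix keeps the value part of the signature unambiguous.
        signature.append(node.value == null ? "-" : "=" + node.value.length() + ":" + node.value);
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            Node child = reduce(edge.getValue());
            edge.setValue(child); // re-point the edge at the shared instance
            signature.append('(').append(edge.getKey()).append(ids.get(child)).append(')');
        }
        Node existing = canonical.putIfAbsent(signature.toString(), node);
        if (existing != null) {
            return existing;
        }
        ids.put(node, ids.size());
        return node;
    }

    /** Number of distinct nodes that survived reduction so far. */
    int distinctNodes() {
        return ids.size();
    }
}
```

With the keys "walked" and "talked" both mapped to the same value, the unreduced trie has 13 nodes, while the reduced one has 7 because everything below the first character is shared. Looser modes, such as those that compare only the dominant value, would correspond to a coarser signature than the one used here.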

Persisted compiled tries are supported through StemmerPatchTrieBinaryIO and the corresponding binary loading and saving methods on StemmerPatchTrieLoader. The persisted form wraps the native FrequencyTrie binary format in GZIP compression and is intended for efficient deployment and runtime loading. Reconstructing a writable builder from an already compiled trie is supported by FrequencyTrieBuilders.
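The wrapping of a binary payload in GZIP can be sketched with the standard java.util.zip streams. The payload layout below (an entry count followed by UTF strings) is invented for illustration and is not the native FrequencyTrie format; the class name GzipPayloadSketch and its methods are hypothetical and do not reflect the StemmerPatchTrieBinaryIO API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: writes and reads a GZIP-wrapped binary payload. */
final class GzipPayloadSketch {

    /** Serializes the entries into an invented binary layout and compresses the result. */
    static byte[] write(List<String> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the outermost stream finishes the GZIP trailer before the bytes are read.
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(entries.size());
            for (String entry : entries) {
                out.writeUTF(entry);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decompresses data and reads back the entries written by write. */
    static List<String> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            int count = in.readInt();
            List<String> entries = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                entries.add(in.readUTF());
            }
            return entries;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Layering the data stream on top of the compression stream keeps the payload format independent of the compression step, so the same reader and writer logic could be used with or without the GZIP wrapper.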

For offline preparation of deployment artifacts, the package also provides the Compile command-line utility, which reads a dictionary source, applies the configured reduction strategy, and writes the resulting compressed binary trie.

The package is designed for deterministic behavior, compact persisted representation, and efficient runtime lookup. Public APIs are intentionally focused on immutable compiled structures for read paths, with separate explicit builder-oriented entry points for mutation and reconstruction.