Package org.egothor.stemmer
The package centers on a read-only FrequencyTrie
that maps word forms to one or more values together with their recorded local
frequencies. In the stemming use case, these values are compact patch
commands that reconstruct a canonical stem from an observed surface form. The
trie is built through FrequencyTrie.Builder,
reduced into a canonical immutable structure, and then queried through
deterministic get(String), getAll(String), and
getEntries(String) operations.
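The lookup model described above can be illustrated with a small self-contained sketch. It is not the library's FrequencyTrie: the class name FrequencyTrieSketch, the use of String values, and the tie-breaking rule are assumptions made for illustration. It only shows how per-key value counts lead to a deterministic get (most frequent value) and a ranked getAll.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical simplified model; not the library's FrequencyTrie. */
final class FrequencyTrieSketch {

    /** Trie node: child edges by character, plus value counts stored at this key. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final Map<String, Integer> counts = new TreeMap<>();
    }

    private final Node root = new Node();

    /** Records one more observation of the value under the key. */
    void put(String key, String value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.counts.merge(value, 1, Integer::sum);
    }

    /** Returns all values for the key, most frequent first; ties keep value order. */
    List<String> getAll(String key) {
        Node node = root;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.children.get(key.charAt(i));
        }
        List<String> ranked = new ArrayList<>();
        if (node == null) {
            return ranked;
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(node.counts.entrySet());
        // The sort is stable, so equal counts stay in the TreeMap's key order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        for (Map.Entry<String, Integer> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }

    /** Returns the most frequent value for the key, or null when none is stored. */
    String get(String key) {
        List<String> ranked = getAll(key);
        return ranked.isEmpty() ? null : ranked.get(0);
    }
}
```

In this sketch the sorted value map plus a stable sort is what makes the ranking deterministic; the real implementation may order ties differently.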
Patch commands are produced and interpreted by
PatchCommandEncoder. The encoder follows the
historical Egothor convention in which edit instructions are serialized for
application from the end of the source word toward its beginning. The
implementation supports canonical no-operation patches for identity
transformations and compact commands for insertion, deletion, replacement,
and suffix-preserving transitions.
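Backward application of a patch can be sketched as follows. The opcode letters ('-', 'R', 'D', 'I'), the letter-based counts ('a' = 1, 'b' = 2, ...), and the use of an empty string as the identity patch are assumptions chosen for illustration; the actual serialization used by PatchCommandEncoder may differ.

```java
/** Hypothetical simplified patch interpreter; not the library's PatchCommandEncoder. */
final class BackwardPatchSketch {

    /** Applies two-character commands (opcode + argument) from the end of the word. */
    static String apply(String source, String patch) {
        StringBuilder word = new StringBuilder(source);
        int pos = word.length() - 1;                 // cursor starts at the last character
        for (int i = 0; i + 1 < patch.length(); i += 2) {
            char op = patch.charAt(i);
            char arg = patch.charAt(i + 1);
            int count = arg - 'a' + 1;               // assumed count encoding: 'a' = 1
            switch (op) {
                case '-':                            // keep 'count' characters unchanged
                    pos -= count;
                    break;
                case 'R':                            // replace the character at the cursor
                    word.setCharAt(pos, arg);
                    pos--;
                    break;
                case 'D':                            // delete 'count' characters ending at the cursor
                    word.delete(pos - count + 1, pos + 1);
                    pos -= count;
                    break;
                case 'I':                            // insert 'arg' right after the cursor
                    word.insert(pos + 1, arg);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown opcode: " + op);
            }
        }
        return word.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("running", "Dd"));   // delete the last four characters
        System.out.println(apply("flies", "DbRy"));   // delete "es", then replace 'i' with 'y'
        System.out.println(apply("geese", "-bRoRo")); // keep "se", then replace two vowels
    }
}
```

Working from the end of the word keeps patches short for suffix changes, because the unchanged prefix never has to be mentioned.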
Dictionary loading is provided by
StemmerPatchTrieLoader, which reads the
traditional line-oriented stemmer resource format in which each non-empty
logical line starts with a canonical stem followed by known surface variants.
Parsing is delegated to StemmerDictionaryParser,
which normalizes input to lower case using Locale.ROOT and
supports whole-line as well as trailing remarks introduced by # or
//. During loading, each variant is converted into a patch command
targeting the canonical stem, and the stem itself may optionally be stored
under the canonical no-operation patch.
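The line format described above can be sketched with a small stand-alone parser. This is not StemmerDictionaryParser: the names DictionaryLineSketch, Entry, stripRemark, and parseLine are hypothetical, and whitespace-separated tokens are an assumption about the resource format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical line parser; not the library's StemmerDictionaryParser. */
final class DictionaryLineSketch {

    /** One logical line: a canonical stem and its known surface variants. */
    static final class Entry {
        final String stem;
        final List<String> variants;

        Entry(String stem, List<String> variants) {
            this.stem = stem;
            this.variants = variants;
        }
    }

    /** Cuts the line at the first '#' or '//' marker, whichever comes first. */
    static String stripRemark(String line) {
        int cut = line.length();
        int hash = line.indexOf('#');
        int slashes = line.indexOf("//");
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }
        return line.substring(0, cut);
    }

    /** Returns the parsed entry, or null for blank lines and whole-line remarks. */
    static Entry parseLine(String line) {
        String text = stripRemark(line).trim().toLowerCase(Locale.ROOT);
        if (text.isEmpty()) {
            return null;
        }
        String[] tokens = text.split("\\s+");    // assumed separator: whitespace
        List<String> variants =
                new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
        return new Entry(tokens[0], variants);
    }
}
```

Using Locale.ROOT keeps lower-casing independent of the JVM's default locale, which matters for deterministic output (for example with the Turkish dotted and dotless i).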
Trie compilation behavior is controlled by
ReductionMode and
ReductionSettings. These types define how
semantically equivalent subtrees may be merged during compilation in order to
reduce the size of the final immutable trie while preserving the intended
lookup semantics. Depending on the selected mode, reduction may preserve full
ranked getAll() semantics, unordered value equivalence, or dominant
get() semantics subject to configurable dominance thresholds.
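The general idea of merging equivalent subtrees can be sketched as a bottom-up canonicalization. The sketch below is hypothetical and covers only the strictest case, where two subtrees are merged when their stored values and all of their children are exactly equal; it does not model ReductionMode, ReductionSettings, frequencies, or dominance thresholds, and every name in it is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical exact-equality subtree merging; not the library's reducer. */
final class SubtreeMergeSketch {

    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        String value;                                 // null when no value is stored here
    }

    // One canonical node per structural signature, and a stable id per canonical node.
    private final Map<List<Object>, Node> registry = new HashMap<>();
    private final Map<Node, Integer> ids = new IdentityHashMap<>();

    static void insert(Node root, String key, String value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.value = value;
    }

    static String lookup(Node root, String key) {
        Node node = root;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.children.get(key.charAt(i));
        }
        return node == null ? null : node.value;
    }

    /** Reduces children first, then returns the canonical node for this subtree. */
    Node reduce(Node node) {
        List<Object> signature = new ArrayList<>();
        signature.add(node.value);
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            Node child = reduce(edge.getValue());
            edge.setValue(child);                     // point the edge at the shared child
            signature.add(edge.getKey());
            signature.add(ids.get(child));
        }
        Node existing = registry.putIfAbsent(signature, node);
        if (existing != null) {
            return existing;                          // an equal subtree already exists
        }
        ids.put(node, ids.size());
        return node;
    }
}
```

After reduction, keys such as "walked" and "talked" that carry the same value share one chain of nodes below their first letter, while lookups return the same results as before. A less strict mode would relax the signature, for example by ignoring the order of values or by comparing only the dominant value.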
Persisted compiled tries are supported through
StemmerPatchTrieBinaryIO and the corresponding
binary loading and saving methods on
StemmerPatchTrieLoader. The persisted form wraps
the native FrequencyTrie binary format in GZip
compression and is intended for efficient deployment and runtime loading.
Reconstructing a writable builder from an already compiled trie is supported
by FrequencyTrieBuilders.
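Wrapping a binary payload in GZip compression can be sketched with the standard java.util.zip streams. The payload layout below (an entry count followed by UTF key and value pairs) is purely illustrative and is not the native FrequencyTrie binary format; the class and method names are hypothetical, and StemmerPatchTrieBinaryIO is not used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical GZip-wrapped persistence; not the library's binary format. */
final class GzipTrieIoSketch {

    /** Serializes the entries into an illustrative binary layout inside a GZip stream. */
    static byte[] write(Map<String, String> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(entries.size());
            for (Map.Entry<String, String> entry : entries.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the stream above finishes the GZip trailer before the bytes are read.
        return bytes.toByteArray();
    }

    /** Reads back what write produced, preserving the stored entry order. */
    static Map<String, String> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            int size = in.readInt();
            Map<String, String> entries = new LinkedHashMap<>();
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                entries.put(key, value);
            }
            return entries;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Layering the compression stream outside the data stream keeps the inner format unaware of compression, which matches the description of a native format wrapped in GZip.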
For offline preparation of deployment artifacts, the package also provides
the Compile command-line utility, which reads a
dictionary source, applies the configured reduction strategy, and writes the
resulting compressed binary trie.
The package is designed for deterministic behavior, compact persisted representation, and efficient runtime lookup. Public APIs are intentionally focused on immutable compiled structures for read paths, with separate explicit builder-oriented entry points for mutation and reconstruction.
Class                     Description
Compile                   Command-line compiler of stemmer dictionary files into compressed binary FrequencyTrie artifacts.
FrequencyTrie             Read-only trie mapping String keys to one or more values with frequency tracking.
FrequencyTrie.Builder     Builder of FrequencyTrie.
(name missing)            Codec used to persist values stored in the trie.
FrequencyTrieBuilders     Factory utilities related to FrequencyTrie.Builder.
PatchCommandEncoder       Encodes a compact patch command that transforms one word form into another and applies such commands back to source words.
ReductionMode             Defines the subtree reduction strategy applied during trie compilation.
ReductionSettings         Immutable reduction configuration used by FrequencyTrie.Builder.
StemmerDictionaryParser   Parser of line-oriented stemmer dictionary files.
(name missing)            Callback receiving one parsed dictionary line.
(name missing)            Immutable parsing statistics.
StemmerPatchTrieBinaryIO  Binary persistence helper for patch-command stemmer tries.
StemmerPatchTrieLoader    Loader of patch-command tries from bundled stemmer dictionaries.
(name missing)            Supported bundled stemmer dictionaries.
ValueCount<V>             Immutable value-count pair returned by read-only trie queries.