Package org.egothor.stemmer
The package centers on a read-only FrequencyTrie
that maps word forms to one or more values together with their recorded local
frequencies. In the stemming use case, these values are compact patch
commands that reconstruct a canonical stem from an observed surface form. The
trie is built through FrequencyTrie.Builder,
reduced into a canonical immutable structure, and then queried through
deterministic get(String), getAll(String), and
getEntries(String) operations.
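The lookup model described above can be illustrated with a small stand-alone sketch. `MiniFrequencyTrie` and its methods are hypothetical names chosen for illustration and do not reproduce the actual FrequencyTrie API. The sketch assumes that values are plain strings and that ties between equally frequent values are broken lexicographically to keep results deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical miniature of a frequency trie; not the library's FrequencyTrie. */
final class MiniFrequencyTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        final Map<String, Integer> counts = new HashMap<>();
    }

    private final Node root = new Node();

    /** Records one observation of {@code value} under {@code key}. */
    void put(String key, String value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.counts.merge(value, 1, Integer::sum);
    }

    /** Returns all values stored for {@code key}, most frequent first. */
    List<String> getAll(String key) {
        Node node = root;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.children.get(key.charAt(i));
        }
        if (node == null) {
            return List.of();
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(node.counts.entrySet());
        // Higher count first; assumed lexicographic tie-break for determinism.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            result.add(entry.getKey());
        }
        return result;
    }

    /** Returns the most frequent value for {@code key}, or null if none is stored. */
    String get(String key) {
        List<String> all = getAll(key);
        return all.isEmpty() ? null : all.get(0);
    }
}
```

In this sketch the frequency counts live directly on the node reached by the key, which is what makes them "local" to a word form.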
Patch commands are produced and interpreted by
PatchCommandEncoder. The encoder follows the
historical Egothor convention in which edit instructions are serialized for
application from the end of the source word toward its beginning. The
implementation supports canonical no-operation patches for identity
transformations and compact commands for insertion, deletion, replacement,
and suffix-preserving transitions.
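As a rough illustration of end-to-start patch application, the following sketch interprets a simple two-character command format. The opcode letters (`-`, `D`, `R`, `I`), the letter-encoded counts, and the use of an empty string as the no-operation patch are assumptions loosely modeled on the historical convention. They are not guaranteed to match the serialization used by PatchCommandEncoder.

```java
/** Hypothetical patch interpreter; the command format is an assumption for illustration. */
final class PatchSketch {

    private PatchSketch() {
    }

    /**
     * Applies a patch made of (opcode, argument) character pairs, moving a
     * cursor from the last character of the word toward its beginning.
     * Malformed patches that run past the start of the word fail with an
     * index exception from StringBuilder.
     */
    static String apply(String source, String patch) {
        StringBuilder word = new StringBuilder(source);
        int pos = word.length() - 1; // cursor starts on the last character
        for (int i = 0; i + 1 < patch.length(); i += 2) {
            char op = patch.charAt(i);
            char arg = patch.charAt(i + 1);
            int count = arg - 'a' + 1; // 'a' means 1, 'b' means 2, ...
            switch (op) {
                case '-': // keep `count` characters and move left past them
                    pos -= count;
                    break;
                case 'D': // delete `count` characters ending at the cursor
                    word.delete(pos - count + 1, pos + 1);
                    pos -= count;
                    break;
                case 'R': // replace the character under the cursor with `arg`
                    word.setCharAt(pos, arg);
                    pos -= 1;
                    break;
                case 'I': // insert `arg` immediately after the cursor
                    word.insert(pos + 1, arg);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown opcode: " + op);
            }
        }
        return word.toString();
    }
}
```

Under these assumptions, `PatchSketch.apply("cities", "DbRy")` deletes the trailing two characters and then replaces the new last character, and an empty patch leaves the word unchanged.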
Dictionary loading is provided by
StemmerPatchTrieLoader, which reads the
traditional line-oriented tab-separated values resource format in which each
non-empty logical line starts with a canonical stem followed by known surface
variants in subsequent tab-separated columns.
Parsing is delegated to StemmerDictionaryParser,
which applies configurable case processing through
CaseProcessingMode (default:
CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT),
supports both whole-line and trailing remarks introduced by # or
//, and currently skips dictionary items that contain Unicode
whitespace characters, reporting each skipped item through warning-level diagnostics.
During loading, each variant is converted into a patch command
targeting the canonical stem, and the stem itself may optionally be stored
under the canonical no-operation patch.
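The line format described above can be sketched as follows. `DictionaryLineSketch` is a hypothetical helper, not StemmerDictionaryParser. It assumes that lowercasing uses Locale.ROOT, that columns are trimmed before validation, that a line whose stem column is unusable is dropped entirely, and that skipped items are silently ignored rather than logged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical line parser for the tab-separated stem/variants format. */
final class DictionaryLineSketch {

    private DictionaryLineSketch() {
    }

    /** Returns normalized items (stem first), or an empty list if the line has no usable stem. */
    static List<String> parseLine(String line) {
        String[] columns = stripRemark(line).split("\t");
        String stem = normalize(columns[0]);
        if (stem == null) {
            return List.of(); // blank line, remark-only line, or unusable stem (assumed policy)
        }
        List<String> items = new ArrayList<>();
        items.add(stem);
        for (int i = 1; i < columns.length; i++) {
            String variant = normalize(columns[i]);
            if (variant != null) {
                items.add(variant);
            }
        }
        return items;
    }

    /** Cuts the line at the first remark marker, either '#' or "//". */
    private static String stripRemark(String line) {
        int cut = line.length();
        int hash = line.indexOf('#');
        int slashes = line.indexOf("//");
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }
        return line.substring(0, cut);
    }

    /** Trims and lowercases one column; returns null for empty or whitespace-containing items. */
    private static String normalize(String column) {
        String item = column.strip().toLowerCase(Locale.ROOT);
        if (item.isEmpty() || item.codePoints().anyMatch(Character::isWhitespace)) {
            return null;
        }
        return item;
    }
}
```

A loader built on such a parser would then turn every variant after the first item into a patch command that targets the stem.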
Trie compilation behavior is controlled by
ReductionMode and
ReductionSettings. These types define how
semantically equivalent subtrees may be merged during compilation in order to
reduce the size of the final immutable trie while preserving the intended
lookup semantics. Depending on the selected mode, reduction may preserve full
ranked getAll() semantics, unordered value equivalence, or dominant
get() semantics subject to configurable dominance thresholds.
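One way to picture subtree merging is bottom-up canonicalization, where each node is replaced by the first previously seen node with the same signature. The sketch below is only an illustration of that general technique. The `Mode` constants are hypothetical names, the signature strings are a simplification, and the dominant variant ignores the configurable dominance thresholds mentioned above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical subtree merger; not the library's reduction implementation. */
final class ReductionSketch {

    enum Mode { RANKED, UNORDERED, DOMINANT } // hypothetical names

    static final class Node {
        final List<String> rankedValues; // most frequent value first
        final Map<Character, Node> children = new TreeMap<>();

        Node(String... rankedValues) {
            this.rankedValues = Arrays.asList(rankedValues);
        }
    }

    private final Mode mode;
    private final Map<String, Node> canonical = new HashMap<>();
    private final Map<Node, Integer> ids = new IdentityHashMap<>();

    ReductionSketch(Mode mode) {
        this.mode = mode;
    }

    /** Reduces children first, then returns the canonical representative of this subtree. */
    Node reduce(Node node) {
        StringBuilder signature = new StringBuilder(valueSignature(node)).append('|');
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            Node child = reduce(edge.getValue());
            edge.setValue(child); // point the edge at the shared representative
            signature.append(edge.getKey()).append(':').append(ids.get(child)).append(',');
        }
        Node existing = canonical.putIfAbsent(signature.toString(), node);
        if (existing != null) {
            return existing;
        }
        ids.put(node, ids.size());
        return node;
    }

    /** Number of distinct nodes kept after reduction. */
    int uniqueNodes() {
        return canonical.size();
    }

    /** What counts as "the same values" depends on the selected mode. */
    private String valueSignature(Node node) {
        switch (mode) {
            case RANKED:
                return node.rankedValues.toString(); // order matters
            case UNORDERED:
                return new TreeSet<>(node.rankedValues).toString(); // order ignored
            default:
                return node.rankedValues.isEmpty() ? "" : node.rankedValues.get(0); // top value only
        }
    }

    /** Small demo tree: three leaves with value lists [X, Y], [Y, X], and [X]. */
    static Node sampleTree() {
        Node root = new Node();
        root.children.put('a', new Node("X", "Y"));
        root.children.put('b', new Node("Y", "X"));
        root.children.put('c', new Node("X"));
        return root;
    }
}
```

On the sample tree, the ranked mode keeps all three leaves distinct, the unordered mode merges the leaves under 'a' and 'b', and the dominant mode merges the leaves under 'a' and 'c'. This mirrors the trade-off described above: weaker equivalence allows more merging but preserves less of the lookup result.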
Persisted compiled tries are supported through
StemmerPatchTrieBinaryIO and the corresponding
binary loading and saving methods on
StemmerPatchTrieLoader. The persisted form wraps
the native FrequencyTrie binary format in GZIP
compression and is intended for efficient deployment and runtime loading.
Reconstructing a writable builder from an already compiled trie is supported
by FrequencyTrieBuilders.
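The idea of wrapping a binary payload in GZIP compression can be sketched with the standard java.util.zip streams. The payload layout below (an entry count followed by UTF string pairs) is invented for illustration and is not the native FrequencyTrie binary format, and `GzipPersistenceSketch` is a hypothetical helper rather than StemmerPatchTrieBinaryIO.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical GZIP-wrapped persistence of a simple key/value payload. */
final class GzipPersistenceSketch {

    private GzipPersistenceSketch() {
    }

    /** Serializes the entries into an invented binary layout and compresses it. */
    static byte[] write(Map<String, String> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(entries.size());
            for (Map.Entry<String, String> entry : entries.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the stream above has already written the GZIP trailer.
        return bytes.toByteArray();
    }

    /** Decompresses the data and reads the entries back in their original order. */
    static Map<String, String> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            int size = in.readInt();
            Map<String, String> entries = new LinkedHashMap<>();
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                entries.put(key, in.readUTF());
            }
            return entries;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Layering the compression stream outside the data stream keeps the inner format independent of compression, so the same reader and writer logic could operate on an uncompressed stream as well.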
For offline preparation of deployment artifacts, the package also provides
the Compile command-line utility, which reads a
dictionary source, applies the configured reduction strategy, and writes the
resulting compressed binary trie.
The package is designed for deterministic behavior, compact persisted representation, and efficient runtime lookup. Public APIs are intentionally focused on immutable compiled structures for read paths, with separate explicit builder-oriented entry points for mutation and reconstruction.
Class Description
Defines how dictionary items are normalized with respect to letter casing.
Command-line compiler of stemmer dictionary files into compressed binary FrequencyTrie artifacts.
Defines how dictionary loading and trie traversal should treat diacritics.
Read-only trie mapping String keys to one or more values with frequency tracking.
Builder of FrequencyTrie.
Codec used to persist values stored in the trie.
Factory utilities related to FrequencyTrie.Builder.
Encodes a compact patch command that transforms one word form into another and applies such commands back to source words.
Fluent builder for creating direction-specialized PatchCommandEncoder instances.
Defines the subtree reduction strategy applied during trie compilation.
Immutable reduction configuration used by FrequencyTrie.Builder.
Parser of line-oriented stemmer dictionary files.
Callback receiving one parsed dictionary line.
Immutable parsing statistics.
Evaluates how stemming quality degrades when the compiled trie is built from only a deterministic subset of the available dictionary knowledge.
One immutable result row of the knowledge experiment.
Command-line entry point for the stemmer knowledge experiment.
Binary persistence helper for patch-command stemmer tries.
Loader of patch-command tries from bundled stemmer dictionaries.
Supported bundled stemmer dictionaries.
Immutable metadata persisted together with a compiled trie artifact.
ValueCount<V> Immutable value-count pair returned by read-only trie queries.
Defines the logical direction in which word characters are traversed.