CLI Compilation
Radixor provides a command-line compiler for turning line-oriented dictionary files into compact binary stemmer artifacts.
This is the preferred preparation workflow when stemming should run against an already compiled artifact rather than against raw dictionary input. The CLI reads the dictionary, derives patch commands, builds a mutable trie, applies the selected subtree reduction strategy, and writes the final compiled trie in the project binary format under GZip compression. The result is a deployment-ready .radixor.gz file that can be loaded directly by application code.
What the CLI does
The Compile tool performs the following steps:
- reads the input dictionary in the standard Radixor stemmer format,
- parses each line into a canonical stem and its known variants,
- converts variants into patch commands,
- builds a mutable trie of patch-command values,
- applies the configured reduction mode,
- writes the compiled trie as a GZip-compressed binary artifact.
This workflow is intentionally aligned with the same dictionary semantics used elsewhere in the library. Remarks introduced by # or // are supported through the shared dictionary parser.
Basic usage
java org.egothor.stemmer.Compile \
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original \
--overwrite
Supported arguments
The CLI supports the following arguments:
--input <file>
--output <file>
--reduction-mode <mode>
[--store-original]
[--dominant-winner-min-percent <1..100>]
[--dominant-winner-over-second-ratio <1..n>]
[--overwrite]
[--help]
--input <file>
Path to the source dictionary file.
The file must use the standard line-oriented dictionary format. Each non-empty logical line starts with the canonical stem and may contain zero or more variants. The parser expects UTF-8 input, lowercases it using Locale.ROOT, and ignores trailing remarks introduced by # or //.
Example:
--input ./data/stemmer.txt
--output <file>
Path to the output binary artifact.
The output file is written as a GZip-compressed binary trie. Parent directories are created automatically when needed.
Example:
--output ./build/english.radixor.gz
--reduction-mode <mode>
Selects the subtree reduction strategy used during compilation.
Supported values are:
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTSMERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTSMERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS
Example:
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
This argument is required.
--store-original
When this flag is present, the canonical stem itself is inserted using the no-op patch command.
--store-original
This is usually a sensible default for real dictionaries because it ensures that canonical forms are directly representable in the compiled trie rather than relying only on their variants.
--dominant-winner-min-percent <1..100>
Sets the minimum winner percentage used by dominant-result reduction settings.
Example:
--dominant-winner-min-percent 75
This option matters primarily when --reduction-mode is MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS. The default value is 75.
--dominant-winner-over-second-ratio <1..n>
Sets the minimum winner-over-second ratio used by dominant-result reduction settings.
Example:
--dominant-winner-over-second-ratio 3
This option also matters primarily for MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS. The default value is 3.
--overwrite
Allows the CLI to replace an already existing output file.
--overwrite
Without this flag, compilation fails when the output path already exists.
--help
Prints usage help and exits successfully.
--help
The short form -h is also supported.
Reduction modes in practice
Reduction mode is not only a storage decision. It also influences what semantics are preserved when the mutable trie is compiled into its canonical read-only form.
Ranked getAll() equivalence
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS merges subtrees whose getAll() results remain equivalent for every reachable key suffix and whose local result ordering is the same.
This is the best general-purpose choice when result ordering and ambiguity handling matter. It preserves ranked multi-result semantics while still achieving useful structural reduction.
This is the recommended default for most users.
Unordered getAll() equivalence
MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS also uses getAll()-level equivalence, but it ignores local ordering differences in addition to absolute frequencies.
This can yield stronger reduction, but it also weakens the precision of ordered multi-result semantics.
Choose this mode only when the application does not depend on the ordering of alternative results.
Dominant get() equivalence
MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS focuses on preserving preferred-result semantics for get(), subject to dominance thresholds.
If a node does not satisfy the configured dominance constraints, compilation falls back to ranked getAll() semantics for that node to avoid unsafe over-reduction.
This mode is most suitable when the application primarily consumes the preferred result and does not rely on preserving richer ambiguity information.
Recommended usage patterns
Use offline preparation
The CLI is best used as a preparation step during packaging, deployment, or controlled artifact generation. This keeps compilation outside the runtime startup path and allows services to load only the finished binary trie.
Treat compiled files as versioned assets
A .radixor.gz file should be handled as a versioned output artifact. It represents a specific dictionary state, a specific reduction mode, and, where relevant, specific dominant-result thresholds.
Choose reduction mode deliberately
The ranked getAll() mode is the safest default. The unordered and dominant modes should be chosen only when their trade-offs are acceptable for the consuming application.
Expect memory pressure during preparation, not runtime
Compilation is usually a one-time step and is generally fast. The more important operational consideration is memory usage during preparation, because the dictionary-derived mutable structure exists before reduction compacts it into the final read-only trie. This is especially relevant for very large source dictionaries.
Example workflow
1. Prepare a dictionary
run running runs ran
connect connected connecting
2. Compile it
java org.egothor.stemmer.Compile \
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original
3. Load it in an application
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.StemmerPatchTrieLoader;
final FrequencyTrie<String> trie =
StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
Exit codes and error handling
The CLI uses three exit outcomes:
0for success,1for processing failures such as I/O or compilation errors,2for invalid command-line usage.
When argument parsing fails, the CLI prints the error message, prints the usage summary, and exits with usage error status.
When compilation fails during processing, the CLI prints a Compilation failed: ... message to standard error and exits with processing error status.
Examples of failure conditions include:
- missing required arguments,
- unknown arguments,
- invalid integer values for dominant thresholds,
- missing input files,
- unreadable input,
- existing output file without
--overwrite, - general I/O failures during reading or writing.
Relation to programmatic usage
The CLI and the programmatic API implement the same conceptual preparation step. The CLI is the operationally convenient choice when you want a ready-made binary artifact. The programmatic API is the better fit when compilation must be integrated directly into custom Java workflows.