Built-in Languages
Radixor provides a set of bundled stemmer dictionaries that can be loaded directly without preparing custom lexical data first.
These resources are intended as practical default dictionaries for common use. They provide a solid starting point for evaluation, integration, and general-purpose stemming workloads, while still fitting naturally into workflows where the bundled baseline is later refined, extended, or replaced by a custom dictionary.
Overview
Bundled dictionaries are exposed through:
StemmerPatchTrieLoader.Language
They are packaged with the library as text resources and compiled into a FrequencyTrie<String> when loaded.
Supported languages
The following bundled language identifiers are currently available:
| Language | Enum constant | Notes |
|---|---|---|
| Danish | DA_DK |
Bundled general-purpose dictionary |
| German | DE_DE |
Bundled general-purpose dictionary |
| Spanish | ES_ES |
Bundled general-purpose dictionary |
| French | FR_FR |
Bundled general-purpose dictionary |
| Italian | IT_IT |
Bundled general-purpose dictionary |
| Dutch | NL_NL |
Bundled general-purpose dictionary |
| Norwegian | NO_NO |
Bundled general-purpose dictionary |
| Portuguese | PT_PT |
Bundled general-purpose dictionary |
| Russian | RU_RU |
Currently supplied in normalized transliterated form |
| Swedish | SV_SE |
Bundled general-purpose dictionary |
| English | US_UK |
Standard English dictionary |
| English | US_UK_PROFI |
Extended English dictionary |
Basic usage
Load a bundled stemmer like this:
import java.io.IOException;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;
public final class BuiltInExample {
private BuiltInExample() {
throw new AssertionError("No instances.");
}
public static void main(final String[] arguments) throws IOException {
final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
StemmerPatchTrieLoader.Language.US_UK_PROFI,
true,
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
}
}
The loader reads the bundled dictionary resource, parses the textual entries, derives patch-command mappings, and compiles the result into a read-only trie.
Example: stemming with US_UK_PROFI
import java.io.IOException;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.PatchCommandEncoder;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;
public final class EnglishExample {
private EnglishExample() {
throw new AssertionError("No instances.");
}
public static void main(final String[] arguments) throws IOException {
final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
StemmerPatchTrieLoader.Language.US_UK_PROFI,
true,
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
final String word = "running";
final String patch = trie.get(word);
final String stem = PatchCommandEncoder.apply(word, patch);
System.out.println(word + " -> " + stem);
}
}
US_UK and US_UK_PROFI
Radixor currently provides two bundled English variants.
US_UK
US_UK is the lighter-weight bundled English resource. It is suitable where a smaller default dictionary is preferred and maximal lexical coverage is not the primary goal.
US_UK_PROFI
US_UK_PROFI is the more extensive bundled English resource. It offers broader lexical coverage and is the better default for most applications that want stronger out-of-the-box behavior.
Recommendation
For most English-language deployments, prefer:
US_UK_PROFI
Use US_UK when a smaller bundled baseline is more appropriate.
Intended role of bundled dictionaries
Bundled dictionaries should be understood as general-purpose default resources.
They are a good fit when:
- a supported language is already available,
- immediate usability matters,
- a reasonable baseline is sufficient,
- the goal is evaluation, prototyping, or straightforward integration.
They are also well suited to staged refinement workflows in which the bundled base is loaded first, then extended with domain-specific vocabulary, and finally persisted as a custom binary artifact.
Character representation
The current bundled resources follow a pragmatic normalization convention.
At present, bundled dictionaries are supplied in normalized plain-ASCII form. For some languages, this is simply a lightweight maintenance convention. For others, especially languages commonly written in another script, it reflects a transliterated lexical resource. Russian is the clearest example in the current bundled set.
This convention belongs to the supplied dictionary resources, not to the core stemming model. The parser reads UTF-8 text, the dictionary model works with ordinary Java strings, and the trie and patch-command mechanism operate on general character sequences. In practical terms, the architecture is compatible with native-script dictionaries when suitable lexical resources are available.
When to prefer custom dictionaries
A custom dictionary is usually the better choice when:
- domain-specific vocabulary materially affects stemming quality,
- lexical coverage must be controlled more precisely,
- a stronger language resource is available than the bundled baseline,
- native-script support is needed beyond the currently bundled resources.
Typical examples include:
- technical terminology,
- biomedical language,
- legal or financial vocabulary,
- organization-specific product and process names,
- language resources maintained in native scripts.
Production recommendation
For production systems, the most robust workflow is usually:
- start from a bundled dictionary when it is suitable,
- extend it with domain-specific forms if needed,
- compile or rebuild it into a binary
.radixor.gzartifact, - deploy that compiled artifact,
- load it at runtime using
loadBinary(...).
This avoids repeated startup parsing and makes the deployed stemming behavior explicit and versionable.
Example refinement workflow
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.FrequencyTrieBuilders;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.ReductionSettings;
import org.egothor.stemmer.StemmerPatchTrieBinaryIO;
import org.egothor.stemmer.StemmerPatchTrieLoader;
public final class BundledRefinementExample {
private BundledRefinementExample() {
throw new AssertionError("No instances.");
}
public static void main(final String[] arguments) throws IOException {
final FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
StemmerPatchTrieLoader.Language.US_UK_PROFI,
true,
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
final FrequencyTrie.Builder<String> builder = FrequencyTrieBuilders.copyOf(
base,
String[]::new,
ReductionSettings.withDefaults(
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));
builder.put("microservices", "Na");
final FrequencyTrie<String> compiled = builder.build();
StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
}
}
Extending language support
The built-in set is intentionally a practical baseline rather than a closed catalog. High-quality dictionaries for additional languages, improved language coverage, and stronger native-script resources are all natural extension paths for the project.
What matters most is not only the number of entries, but the quality, consistency, and operational usefulness of the lexical resource being added.
Next steps
Summary
Radixor’s built-in language support provides immediate usability, practical default dictionaries, and a strong starting point for custom refinement. The current bundled resources follow a pragmatic normalization convention, while the underlying architecture remains well suited to richer language resources and future extensions.