Package org.egothor.stemmer
Class StemmerKnowledgeExperiment
java.lang.Object
org.egothor.stemmer.StemmerKnowledgeExperiment
Evaluates how stemming quality degrades when the compiled trie is built from
only a deterministic subset of the available dictionary knowledge.
The experiment operates on whole dictionary entries. For a chosen knowledge
percentage, each parsed dictionary line is deterministically included or
excluded from the training subset using a seeded SplittableRandom.
The resulting subset is compiled into a FrequencyTrie, while the
evaluation is performed against all word forms from the original dictionary.
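The per-entry inclusion step can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the class name KnowledgeSamplingSketch, the sample method, and the rule of drawing one integer in [0, 100) per entry are assumptions. Only the use of a seeded SplittableRandom comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

/** Hypothetical sketch: seeded, per-entry inclusion at a given knowledge percentage. */
final class KnowledgeSamplingSketch {

    private KnowledgeSamplingSketch() {
    }

    /**
     * Keeps each entry with probability percent/100, preserving input order.
     * The same seed and the same input always produce the same subset.
     */
    static List<String> sample(List<String> entries, int percent, long seed) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]: " + percent);
        }
        SplittableRandom random = new SplittableRandom(seed);
        List<String> subset = new ArrayList<>();
        for (String entry : entries) {
            // nextInt(100) is uniform over 0..99; exactly one draw is consumed
            // per entry, whether or not the entry is kept.
            if (random.nextInt(100) < percent) {
                subset.add(entry);
            }
        }
        return subset;
    }

    public static void main(String[] args) {
        // Placeholder entries; the real dictionary line format is not shown here.
        List<String> entries = List.of("entry-1", "entry-2", "entry-3", "entry-4");
        System.out.println(sample(entries, 50, 42L));
    }
}
```

Under this particular rule, every entry consumes exactly one draw, so for a fixed seed the subset at a lower percentage is contained in the subset at any higher percentage. Whether the real experiment has the same nesting property is not stated here.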
Two lookup APIs are evaluated:
- FrequencyTrie.get(String), through top-1 accuracy
- FrequencyTrie.getAll(String), through global precision, recall, and F1
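The two kinds of metric named above can be sketched as follows. The sketch does not call FrequencyTrie. It takes already-collected predictions as plain strings and sets, and every name in it (LookupMetricsSketch, Scores, top1Accuracy, globalScores) is hypothetical. Reading "global" as micro-averaged, meaning counts are summed over all word forms before dividing, is also an assumption.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of top-1 accuracy and micro-averaged precision, recall, F1. */
final class LookupMetricsSketch {

    /** Aggregated scores over all evaluated word forms. */
    record Scores(double precision, double recall, double f1) {
    }

    private LookupMetricsSketch() {
    }

    /** Fraction of positions where the single predicted answer equals the expected one. */
    static double top1Accuracy(List<String> predicted, List<String> expected) {
        if (predicted.size() != expected.size()) {
            throw new IllegalArgumentException("size mismatch");
        }
        if (expected.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < expected.size(); i++) {
            // A null prediction (no answer) never equals the expected value.
            if (expected.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / expected.size();
    }

    /** Sums true positives, false positives, and false negatives over all word forms. */
    static Scores globalScores(List<Set<String>> predicted, List<Set<String>> expected) {
        if (predicted.size() != expected.size()) {
            throw new IllegalArgumentException("size mismatch");
        }
        long tp = 0;
        long fp = 0;
        long fn = 0;
        for (int i = 0; i < expected.size(); i++) {
            for (String candidate : predicted.get(i)) {
                if (expected.get(i).contains(candidate)) {
                    tp++;
                } else {
                    fp++;
                }
            }
            for (String wanted : expected.get(i)) {
                if (!predicted.get(i).contains(wanted)) {
                    fn++;
                }
            }
        }
        double precision = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
        double f1 = precision + recall == 0.0 ? 0.0 : 2.0 * precision * recall / (precision + recall);
        return new Scores(precision, recall, f1);
    }
}
```

Micro-averaging weights every candidate equally, so word forms with many candidates influence the totals more than word forms with one.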
Nested Class Summary
Nested Classes
static final record StemmerKnowledgeExperiment.ResultRow
    One immutable result row of the knowledge experiment.
Field Summary
Fields
static final int KNOWLEDGE_PERCENT_STEP
    Step between adjacent evaluated knowledge percentages.
static final int MAXIMUM_KNOWLEDGE_PERCENT
    Maximum supported knowledge percentage.
static final int MINIMUM_KNOWLEDGE_PERCENT
    Minimum supported knowledge percentage.
Constructor Summary
Constructors
StemmerKnowledgeExperiment()
    Creates a new experiment harness.
Method Summary
List<StemmerKnowledgeExperiment.ResultRow> evaluate(Reader reader, String sourceDescription, String languageLabel, long seed)
    Evaluates a dictionary provided through an arbitrary reader.
List<StemmerKnowledgeExperiment.ResultRow> evaluateAllBundledLanguages(long seed)
    Evaluates all supported bundled dictionaries using the supplied seed.
List<StemmerKnowledgeExperiment.ResultRow> evaluateBundledLanguage(StemmerPatchTrieLoader.Language language, long seed)
    Evaluates one bundled dictionary across all supported experiment configurations.
List<StemmerKnowledgeExperiment.ResultRow> evaluatePath(Path dictionaryPath, long seed)
    Evaluates one filesystem dictionary across all supported experiment configurations.
static void writeCsv(Path outputPath, List<StemmerKnowledgeExperiment.ResultRow> rows)
    Writes result rows as UTF-8 CSV with a stable fixed header.
Field Details
MINIMUM_KNOWLEDGE_PERCENT
public static final int MINIMUM_KNOWLEDGE_PERCENT
Minimum supported knowledge percentage.
MAXIMUM_KNOWLEDGE_PERCENT
public static final int MAXIMUM_KNOWLEDGE_PERCENT
Maximum supported knowledge percentage.
KNOWLEDGE_PERCENT_STEP
public static final int KNOWLEDGE_PERCENT_STEP
Step between adjacent evaluated knowledge percentages.
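A minimum, a maximum, and a step together define a grid of evaluated percentages. The sketch below enumerates such a grid. The class KnowledgeGridSketch and its percentages method are hypothetical, and the values 10, 100, and 10 used in main are placeholders chosen for illustration. The actual constant values are not given in this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: enumerating knowledge percentages from min to max by a fixed step. */
final class KnowledgeGridSketch {

    private KnowledgeGridSketch() {
    }

    /** Returns min, min + step, ... up to and including the last value not above max. */
    static List<Integer> percentages(int min, int max, int step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive: " + step);
        }
        if (min > max) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        List<Integer> grid = new ArrayList<>();
        for (int percent = min; percent <= max; percent += step) {
            grid.add(percent);
        }
        return grid;
    }

    public static void main(String[] args) {
        // Placeholder values, not the real constants of StemmerKnowledgeExperiment.
        System.out.println(percentages(10, 100, 10));
    }
}
```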
Constructor Details
StemmerKnowledgeExperiment
public StemmerKnowledgeExperiment()
Creates a new experiment harness.
Method Details
evaluateAllBundledLanguages
public List<StemmerKnowledgeExperiment.ResultRow> evaluateAllBundledLanguages(long seed) throws IOException
Evaluates all supported bundled dictionaries using the supplied seed.
Parameters:
seed - deterministic sampling seed
Returns:
immutable ordered list of experiment rows
Throws:
IOException - if reading a bundled dictionary fails
evaluateBundledLanguage
public List<StemmerKnowledgeExperiment.ResultRow> evaluateBundledLanguage(StemmerPatchTrieLoader.Language language, long seed) throws IOException
Evaluates one bundled dictionary across all supported experiment configurations.
Parameters:
language - bundled language dictionary
seed - deterministic sampling seed
Returns:
immutable ordered list of experiment rows
Throws:
NullPointerException - if language is null
IOException - if reading the bundled dictionary fails
evaluatePath
public List<StemmerKnowledgeExperiment.ResultRow> evaluatePath(Path dictionaryPath, long seed) throws IOException
Evaluates one filesystem dictionary across all supported experiment configurations.
Parameters:
dictionaryPath - path to a dictionary file
seed - deterministic sampling seed
Returns:
immutable ordered list of experiment rows
Throws:
NullPointerException - if dictionaryPath is null
IOException - if reading fails
evaluate
public List<StemmerKnowledgeExperiment.ResultRow> evaluate(Reader reader, String sourceDescription, String languageLabel, long seed) throws IOException
Evaluates a dictionary provided through an arbitrary reader.
Parameters:
reader - source reader
sourceDescription - logical source description
languageLabel - label stored in the result rows
seed - deterministic sampling seed
Returns:
immutable ordered list of experiment rows
Throws:
NullPointerException - if any argument except seed is null
IOException - if parsing fails
writeCsv
public static void writeCsv(Path outputPath, List<StemmerKnowledgeExperiment.ResultRow> rows) throws IOException
Writes result rows as UTF-8 CSV with a stable fixed header.
Parameters:
outputPath - target file path
rows - rows to write
Throws:
NullPointerException - if any argument is null
IOException - if writing fails
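Writing rows as UTF-8 CSV under a fixed header can be sketched with the standard library alone. The sketch is not the library's writeCsv. The Row record, its three components, the header text, and the six-decimal number format are all hypothetical, because the components of ResultRow and the real header are not listed in this page.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Objects;

/** Hypothetical sketch: UTF-8 CSV output with a fixed header line. */
final class CsvWriterSketch {

    /** Assumed row shape for illustration; not the real ResultRow. */
    record Row(String language, int knowledgePercent, double top1Accuracy) {
    }

    /** Assumed header; written first on every call so column order never varies. */
    static final String HEADER = "language,knowledge_percent,top1_accuracy";

    private CsvWriterSketch() {
    }

    static void writeCsv(Path outputPath, List<Row> rows) throws IOException {
        Objects.requireNonNull(outputPath, "outputPath");
        Objects.requireNonNull(rows, "rows");
        try (BufferedWriter writer = Files.newBufferedWriter(outputPath, StandardCharsets.UTF_8)) {
            writer.write(HEADER);
            writer.write('\n');
            for (Row row : rows) {
                // Locale.ROOT keeps the decimal separator a dot on every machine.
                writer.write(String.format(Locale.ROOT, "%s,%d,%.6f",
                        escape(row.language()), row.knowledgePercent(), row.top1Accuracy()));
                writer.write('\n');
            }
        }
    }

    /** Quotes a field if it contains a comma, a quote, or a line break. */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }
}
```

A fixed line terminator and a fixed locale keep the output byte-identical across platforms, which makes result files from different runs easy to compare.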