Class StemmerDictionaryParser

java.lang.Object
org.egothor.stemmer.StemmerDictionaryParser

public final class StemmerDictionaryParser extends Object
Parser of line-oriented stemmer dictionary files.

Each non-empty logical line is a sequence of tab-separated columns. The first column is interpreted as the canonical stem, and every following column on the same line is interpreted as a variant belonging to that stem.

Input line case normalization is controlled by CaseProcessingMode. Leading and trailing whitespace around each column is ignored.
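
As an illustration of the column layout and trimming rules, the following minimal sketch splits one logical line into a stem and its variants. It is not this class's implementation: DictionaryLineSketch, splitColumns, and the normalizer parameter (a stand-in for the effect of CaseProcessingMode, whose constants are not listed here) are hypothetical names. Skipping empty columns is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not part of org.egothor.stemmer.
final class DictionaryLineSketch {

    private DictionaryLineSketch() {
    }

    /**
     * Splits one logical line into trimmed, normalized columns.
     * Index 0 is the stem; the remaining entries are its variants.
     */
    static List<String> splitColumns(String line, UnaryOperator<String> normalizer) {
        List<String> columns = new ArrayList<>();
        // Limit -1 keeps trailing empty columns so they are skipped explicitly below.
        for (String raw : line.split("\t", -1)) {
            // trim() removes leading and trailing ASCII whitespace and control characters.
            String column = raw.trim();
            // Assumption: empty columns are skipped; the real parser may differ.
            if (!column.isEmpty()) {
                columns.add(normalizer.apply(column));
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Assumed stand-in for a lower-casing mode.
        UnaryOperator<String> lower = s -> s.toLowerCase(Locale.ROOT);
        List<String> columns = splitColumns("Run\t running \tRAN", lower);
        // Prints: stem=run variants=[running, ran]
        System.out.println("stem=" + columns.get(0)
                + " variants=" + columns.subList(1, columns.size()));
    }
}
```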

The parser supports line remarks and trailing remarks. The remark markers # and // terminate the logical content of the line, and the remainder of that line is ignored.
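
The remark rule can be sketched as cutting each line at the first marker. This is a hedged illustration, not the parser's code: RemarkStripSketch and stripRemark are hypothetical names, and the sketch assumes a marker is recognized anywhere in the line, with no escaping mechanism.

```java
// Hypothetical illustration only; not part of org.egothor.stemmer.
final class RemarkStripSketch {

    private RemarkStripSketch() {
    }

    /** Returns the logical content of a line: everything before the first # or // marker. */
    static String stripRemark(String line) {
        int hash = line.indexOf('#');
        int slashes = line.indexOf("//");
        int cut = line.length();
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }
        return line.substring(0, cut);
    }

    public static void main(String[] args) {
        // Trailing remark: only the columns before the marker remain.
        System.out.println("[" + stripRemark("walk\twalked // past tense") + "]");
        // Line remark: nothing remains, so the line carries no dictionary content.
        System.out.println("[" + stripRemark("# irregular verbs") + "]");
    }
}
```

Under this reading, a line that consists only of a remark reduces to an empty string and would be treated like any other empty line.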

Dictionary items containing any Unicode whitespace character are currently not supported and are ignored. Each physical line with ignored items produces a single warning-level log entry that records the source line number, the normalized stem column, and the list of items ignored from that line.
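
A minimal sketch of the whitespace check and the one-warning-per-line behavior follows. Everything named here is hypothetical (WhitespaceItemSketch, containsWhitespace, ignoredItems, warnIfNeeded), the message wording is invented, and java.util.logging is used only as a convenient stand-in for whatever logging the class actually uses. The sketch also assumes that "Unicode whitespace" covers both Character.isWhitespace and Character.isSpaceChar, so that non-breaking spaces are caught.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical illustration only; not part of org.egothor.stemmer.
final class WhitespaceItemSketch {

    private static final Logger LOG = Logger.getLogger(WhitespaceItemSketch.class.getName());

    private WhitespaceItemSketch() {
    }

    /** Returns true if the item contains any whitespace or space-separator code point. */
    static boolean containsWhitespace(String item) {
        return item.codePoints()
                .anyMatch(cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    /** Collects the items of one line that would be ignored. */
    static List<String> ignoredItems(List<String> items) {
        List<String> ignored = new ArrayList<>();
        for (String item : items) {
            if (containsWhitespace(item)) {
                ignored.add(item);
            }
        }
        return ignored;
    }

    /** Emits at most one warning for the whole line, listing every ignored item. */
    static void warnIfNeeded(int lineNumber, String stem, List<String> items) {
        List<String> ignored = ignoredItems(items);
        if (!ignored.isEmpty()) {
            LOG.warning(String.format("Line %d, stem '%s': ignored items %s",
                    lineNumber, stem, ignored));
        }
    }

    public static void main(String[] args) {
        List<String> variants = new ArrayList<>();
        variants.add("running");
        variants.add("run away");   // contains an ASCII space
        variants.add("run\u00A0up"); // contains a non-breaking space
        warnIfNeeded(12, "run", variants);
    }
}
```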

This class is intentionally stateless and allocation-light so it can be used both by runtime loading and by offline compilation tooling.