Token filters

Token filters receive the stream of tokens from the tokenizer and add, remove, or modify tokens. For example, a token filter may lowercase the tokens so that Actions becomes actions, remove stopwords like than, or add synonyms like talk for the word speak.
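The following Python sketch illustrates the idea of chaining token filters. It is a simplified model, not Lucenia or Lucene code: the function names (lowercase, make_stop_filter, make_synonym_filter, apply_filters) and the sample sentence are hypothetical, and the synonym step simply emits the synonym right after the original token rather than at the same position, as a real synonym filter would.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Set

# A token filter takes a stream of tokens and returns a new stream.
TokenFilter = Callable[[Iterable[str]], Iterable[str]]


def lowercase(tokens: Iterable[str]) -> Iterator[str]:
    """Modify tokens: convert each token to lowercase."""
    for token in tokens:
        yield token.lower()


def make_stop_filter(stopwords: Set[str]) -> TokenFilter:
    """Remove tokens: drop every token found in the stopword set."""
    def stop_filter(tokens: Iterable[str]) -> Iterator[str]:
        for token in tokens:
            if token not in stopwords:
                yield token
    return stop_filter


def make_synonym_filter(synonyms: Dict[str, List[str]]) -> TokenFilter:
    """Add tokens: emit the original token followed by its synonyms."""
    def synonym_filter(tokens: Iterable[str]) -> Iterator[str]:
        for token in tokens:
            yield token
            yield from synonyms.get(token, [])
    return synonym_filter


def apply_filters(tokens: Iterable[str], filters: List[TokenFilter]) -> List[str]:
    """Run the token stream through each filter in order."""
    stream: Iterable[str] = tokens
    for token_filter in filters:
        stream = token_filter(stream)
    return list(stream)


# Hypothetical tokenizer output for "Actions speak louder than words".
tokens = ["Actions", "speak", "louder", "than", "words"]
chain = [
    lowercase,
    make_stop_filter({"than"}),
    make_synonym_filter({"speak": ["talk"]}),
]
result = apply_filters(tokens, chain)
print(result)  # ['actions', 'speak', 'talk', 'louder', 'words']
```

The order of the filters matters: because lowercase runs first, the stopword and synonym lists only need lowercase entries.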

The following table lists all token filters that Lucenia supports.

Token filter Underlying Lucene token filter Description
apostrophe ApostropheFilter In each token that contains an apostrophe, removes the apostrophe and all characters that follow it.
asciifolding ASCIIFoldingFilter Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists. For example, changes à to a.
cjk_bigram CJKBigramFilter Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
cjk_width CJKWidthFilter Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters.
classic ClassicFilter Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives ('s) and removes periods (.) from acronyms.
common_grams CommonGramsFilter Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
conditional ConditionalTokenFilter Applies an ordered list of token filters to tokens that match the conditions provided in a script.
decimal_digit DecimalDigitFilter Converts all digits in the Unicode decimal number general category to basic Latin digits (0–9).
delimited_payload DelimitedPayloadTokenFilter Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters before the delimiter, and a payload consists of all characters after the delimiter. For example, if the delimiter is |, then for the string foo|bar, foo is the token and bar is the payload.
delimited_term_freq DelimitedTermFrequencyTokenFilter Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is |, then for the string foo|5, foo is the token and 5 is the term frequency.
dictionary_decompounder DictionaryCompoundWordTokenFilter Decomposes compound words found in many Germanic languages.
edge_ngram EdgeNGramTokenFilter Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between min_gram and max_gram. Optionally, keeps the original token.
elision ElisionFilter Removes the specified elisions from the beginning of tokens. For example, changes l'avion (the plane) to avion (plane).
fingerprint FingerprintFilter Sorts and deduplicates the token list and concatenates tokens into a single token.
flatten_graph FlattenGraphFilter Flattens a token graph produced by a graph token filter, such as synonym_graph or word_delimiter_graph, making the graph suitable for indexing.
hunspell HunspellStemFilter Uses Hunspell rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
hyphenation_decompounder HyphenationCompoundWordTokenFilter Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
keep_types TypeTokenFilter Keeps or removes tokens of a specific type.
keep_word KeepWordFilter Checks the tokens against the specified word list and keeps only those that are in the list.
keyword_marker KeywordMarkerFilter Marks specified tokens as keywords, preventing them from being stemmed.
keyword_repeat KeywordRepeatFilter Emits each incoming token twice: once as a keyword and once as a non-keyword.
kstem KStemFilter Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary.
kuromoji_completion JapaneseCompletionFilter Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a mode parameter, which should be set to index when used in an index analyzer and query when used in a search analyzer. Requires the analysis-kuromoji plugin. For information about installing the plugin, see Additional plugins.
length LengthFilter Removes tokens whose lengths are shorter or longer than the length range specified by min and max.
limit LimitTokenCountFilter Limits the number of output tokens. A common use case is to limit the size of document field values based on token count.
lowercase LowerCaseFilter Converts tokens to lowercase. The default LowerCaseFilter is for the English language. You can set the language parameter to greek (uses GreekLowerCaseFilter), irish (uses IrishLowerCaseFilter), or turkish (uses TurkishLowerCaseFilter).
min_hash MinHashFilter Uses the MinHash technique to estimate document similarity. Performs the following operations on a token stream sequentially:
1. Hashes each token in the stream.
2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket.
3. Outputs the smallest hash from each bucket as a token stream.
multiplexer N/A Emits multiple tokens at the same position. Runs each token through each of the specified filter lists separately and outputs the results as separate tokens.
ngram NGramTokenFilter Tokenizes the given token into n-grams of lengths between min_gram and max_gram.
Normalization (multiple filters, listed below) Normalizes the characters of one of the listed languages. Each filter uses the following Lucene class:
- arabic_normalization: ArabicNormalizer
- german_normalization: GermanNormalizationFilter
- hindi_normalization: HindiNormalizer
- indic_normalization: IndicNormalizer
- sorani_normalization: SoraniNormalizer
- persian_normalization: PersianNormalizer
- scandinavian_normalization: ScandinavianNormalizationFilter
- scandinavian_folding: ScandinavianFoldingFilter
- serbian_normalization: SerbianNormalizationFilter
pattern_capture N/A Generates a token for every capture group in the provided regular expression. Uses Java regular expression syntax.
pattern_replace N/A Matches a pattern in the provided regular expression and replaces matching substrings. Uses Java regular expression syntax.
phonetic N/A Uses a phonetic encoder to emit a metaphone token for each token in the token stream. Requires installing the analysis-phonetic plugin.
porter_stem PorterStemFilter Uses the Porter stemming algorithm to perform algorithmic stemming for the English language.
predicate_token_filter N/A Removes tokens that don’t match the specified predicate script. Supports inline Painless scripts only.
remove_duplicates RemoveDuplicatesTokenFilter Removes duplicate tokens that are in the same position.
reverse ReverseStringFilter Reverses the string corresponding to each token in the token stream. For example, the token dog becomes god.
shingle ShingleFilter Generates shingles of lengths between min_shingle_size and max_shingle_size for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [contribute, to, lucenia] are [contribute to, to lucenia].
snowball N/A Stems words using a Snowball-generated stemmer. You can use the snowball token filter with the following languages in the language field: Arabic, Armenian, Basque, Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, German2, Hungarian, Irish, Italian, Kp, Lithuanian, Lovins, Norwegian, Porter, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
stemmer N/A Provides algorithmic stemming for the following languages in the language field: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, czech, danish, dutch, dutch_kp, english, light_english, lovins, minimal_english, porter2, possessive_english, estonian, finnish, light_finnish, french, light_french, minimal_french, galician, minimal_galician, german, german2, light_german, minimal_german, greek, hindi, hungarian, light_hungarian, indonesian, irish, italian, light_italian, latvian, lithuanian, norwegian, light_norwegian, minimal_norwegian, light_nynorsk, minimal_nynorsk, portuguese, light_portuguese, minimal_portuguese, portuguese_rslp, romanian, russian, light_russian, sorani, spanish, light_spanish, swedish, light_swedish, turkish.
stemmer_override N/A Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
stop StopFilter Removes stop words from a token stream.
synonym N/A Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
synonym_graph N/A Supplies a synonym list, including multiword synonyms, for the analysis process.
trim TrimFilter Trims leading and trailing white space from each token in a stream.
truncate TruncateTokenFilter Truncates tokens whose length exceeds the specified character limit.
unique N/A Ensures each token is unique by removing duplicate tokens from a stream.
uppercase UpperCaseFilter Converts tokens to uppercase.
word_delimiter WordDelimiterFilter Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules.
word_delimiter_graph WordDelimiterGraphFilter Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. Assigns multi-position tokens a positionLength attribute.
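
Several of the descriptions in the table are easier to follow with a concrete run. The Python sketch below mimics the described behavior of edge_ngram, shingle, and delimited_payload on plain strings. It is an illustration only, not the Lucene implementation: the function names and the keep_original argument are hypothetical, and details such as token positions, offsets, and payload encoding are left out.

```python
from typing import List, Optional, Tuple


def edge_ngrams(token: str, min_gram: int, max_gram: int,
                keep_original: bool = False) -> List[str]:
    """Return the n-grams that start at the beginning of the token."""
    longest = min(max_gram, len(token))
    grams = [token[:n] for n in range(min_gram, longest + 1)]
    # Optionally keep the full token if it was not already produced.
    if keep_original and token not in grams:
        grams.append(token)
    return grams


def shingles(tokens: List[str], min_size: int = 2, max_size: int = 2) -> List[str]:
    """Return word-level n-grams (shingles) built from adjacent tokens."""
    result = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


def split_payload(token: str, delimiter: str = "|") -> Tuple[str, Optional[str]]:
    """Split a token into (text, payload) at the first delimiter."""
    text, found, payload = token.partition(delimiter)
    return text, (payload if found else None)


print(edge_ngrams("search", 1, 3))                  # ['s', 'se', 'sea']
print(shingles(["contribute", "to", "lucenia"]))    # ['contribute to', 'to lucenia']
print(split_payload("foo|bar"))                     # ('foo', 'bar')
```

The shingles function returns only the shingles themselves. As the table row notes, the shingles are added to the list of unigrams, so a full token stream would contain both.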
