Normalizers
A normalizer functions similarly to an analyzer but outputs only a single token. It does not contain a tokenizer and can only include specific types of character and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported.
A normalizer is useful in keyword search (that is, in term-based queries) because it allows you to run token and character filters on any given input. For instance, it makes it possible to match an incoming query Naïve
with the index term naive
.
Consider the following example.
Create a new index with a custom normalizer:
PUT /sample-index
{
"settings": {
"analysis": {
"normalizer": {
"normalized_keyword": {
"type": "custom",
"char_filter": [],
"filter": [ "asciifolding", "lowercase" ]
}
}
}
},
"mappings": {
"properties": {
"approach": {
"type": "keyword",
"normalizer": "normalized_keyword"
}
}
}
}
Index a document:
POST /sample-index/_doc/
{
"approach": "naive"
}
The following query matches the document. This is expected:
GET /sample-index/_search
{
"query": {
"term": {
"approach": "naive"
}
}
}
But this query matches the document as well:
GET /sample-index/_search
{
"query": {
"term": {
"approach": "Naïve"
}
}
}
To understand why, consider the effect of the normalizer:
GET /sample-index/_analyze
{
"normalizer" : "normalized_keyword",
"text" : "Naïve"
}
Internally, a normalizer accepts only filters that are instances of either NormalizingTokenFilterFactory
or NormalizingCharFilterFactory
. The following is a list of compatible filters found in modules and plugins that are part of the core Lucenia repository.
The common-analysis
module
This module does not require installation; it is available by default.
Character filters: pattern_replace
, mapping
Token filters: arabic_normalization
, asciifolding
, bengali_normalization
, cjk_width
, decimal_digit
, elision
, german_normalization
, hindi_normalization
, indic_normalization
, lowercase
, persian_normalization
, scandinavian_folding
, scandinavian_normalization
, serbian_normalization
, sorani_normalization
, trim
, uppercase
The analysis-icu
plugin
Character filters: icu_normalizer
Token filters: icu_normalizer
, icu_folding
, icu_transform
The analysis-kuromoji
plugin
Character filters: normalize_kanji
, normalize_kana
The analysis-nori
plugin
Character filters: normalize_kanji
, normalize_kana
These lists of filters include only analysis components found in the additional plugins that are part of the core Lucenia repository.