Version: 0.9.0

Normalizers

A normalizer functions similarly to an analyzer but outputs only a single token. It does not contain a tokenizer and can include only specific types of character and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that operations such as synonym replacement and stemming are not supported.
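To make the single-token behavior concrete, the following minimal Python sketch contrasts the two. It does not call Lucenia; `analyze_like` and `normalize_like` are hypothetical stand-ins that assume a whitespace tokenizer with a lowercase filter for the analyzer, and a lowercase filter alone for the normalizer.

```python
def analyze_like(text: str) -> list[str]:
    # Hypothetical stand-in for an analyzer:
    # a whitespace tokenizer followed by a lowercase token filter.
    return [token.lower() for token in text.split()]


def normalize_like(text: str) -> list[str]:
    # Hypothetical stand-in for a normalizer:
    # there is no tokenizer, so the whole input stays one token.
    return [text.lower()]


print(analyze_like("New York City"))    # ['new', 'york', 'city']
print(normalize_like("New York City"))  # ['new york city']
```

The normalizer-like function always returns exactly one token, which is why it suits `keyword` fields, where the entire field value is the term.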

A normalizer is useful in keyword search (that is, in term-based queries) because it allows you to run token and character filters on any given input. For instance, it makes it possible to match the incoming query `Naïve` with the indexed term `naive`.

Consider the following example.

Create a new index with a custom normalizer:

PUT /sample-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "normalized_keyword": {
          "type": "custom",
          "char_filter": [],
          "filter": [ "asciifolding", "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "approach": {
        "type": "keyword",
        "normalizer": "normalized_keyword"
      }
    }
  }
}

Index a document:

POST /sample-index/_doc/
{
  "approach": "naive"
}

The following query matches the document. This is expected:

GET /sample-index/_search
{
  "query": {
    "term": {
      "approach": "naive"
    }
  }
}

The following query also matches the document, even though its term differs from the indexed value in case and diacritics:

GET /sample-index/_search
{
  "query": {
    "term": {
      "approach": "Naïve"
    }
  }
}

To understand why, consider the effect of the normalizer:

GET /sample-index/_analyze
{
  "normalizer" : "normalized_keyword",
  "text" : "Naïve"
}
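With the `normalized_keyword` normalizer defined earlier, this request should return a single token, `naive`. The same normalizer is applied to the field value at index time and to the term at query time, so both sides reduce to the same term. The Python sketch below approximates that chain locally. It is an illustration under assumptions, not the actual filter implementation: `ascii_fold` is a hypothetical helper that strips combining marks after Unicode decomposition, which covers `ï` but not every character the real `asciifolding` filter handles.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    # Rough approximation of the asciifolding filter (an assumption):
    # decompose each character, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalized_keyword(text: str) -> str:
    # Same filter order as in the index settings: asciifolding, then lowercase.
    return ascii_fold(text).lower()


indexed_term = normalized_keyword("naive")  # value stored at index time
query_term = normalized_keyword("Naïve")    # term supplied at query time

print(query_term)                  # naive
print(indexed_term == query_term)  # True
```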

Internally, a normalizer accepts only filters that are instances of either NormalizingTokenFilterFactory or NormalizingCharFilterFactory. The following is a list of compatible filters found in modules and plugins that are part of the core Lucenia repository.

The `common-analysis` module

This module does not require installation; it is available by default.

Character filters: pattern_replace, mapping

Token filters: arabic_normalization, asciifolding, bengali_normalization, cjk_width, decimal_digit, elision, german_normalization, hindi_normalization, indic_normalization, lowercase, persian_normalization, scandinavian_folding, scandinavian_normalization, serbian_normalization, sorani_normalization, trim, uppercase

The `analysis-icu` plugin

Character filters: icu_normalizer

Token filters: icu_normalizer, icu_folding, icu_transform

The `analysis-kuromoji` plugin

Character filters: normalize_kanji, normalize_kana

The `analysis-nori` plugin

Character filters: normalize_kanji, normalize_kana

Note: These lists include only analysis components found in the modules and plugins that are part of the core Lucenia repository.
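As a rough client-side aid, the lists above can be turned into a check that runs before you send index settings to the cluster. The sketch below is an assumption-laden illustration: `COMPATIBLE` and `find_incompatible` are hypothetical names, the filter names are copied from the lists in this section, and the check only recognizes built-in filters referenced directly by name (it would flag custom-named filters defined elsewhere in the `analysis` settings).

```python
# Built-in filter names copied from the lists in this section.
COMPATIBLE = {
    "char_filter": {
        "pattern_replace", "mapping",        # common-analysis
        "icu_normalizer",                    # analysis-icu
        "normalize_kanji", "normalize_kana", # analysis-kuromoji, analysis-nori
    },
    "filter": {
        "arabic_normalization", "asciifolding", "bengali_normalization",
        "cjk_width", "decimal_digit", "elision", "german_normalization",
        "hindi_normalization", "indic_normalization", "lowercase",
        "persian_normalization", "scandinavian_folding",
        "scandinavian_normalization", "serbian_normalization",
        "sorani_normalization", "trim", "uppercase",  # common-analysis
        "icu_normalizer", "icu_folding", "icu_transform",  # analysis-icu
    },
}


def find_incompatible(normalizer: dict) -> list[tuple[str, str]]:
    # Return (kind, name) pairs for filters that are not in the lists above.
    problems = []
    for kind in ("char_filter", "filter"):
        for name in normalizer.get(kind, []):
            if name not in COMPATIBLE[kind]:
                problems.append((kind, name))
    return problems


definition = {"type": "custom", "char_filter": [], "filter": ["asciifolding", "lowercase"]}
print(find_incompatible(definition))                  # []
print(find_incompatible({"filter": ["stemmer"]}))     # [('filter', 'stemmer')]
```

The cluster remains the authority on what a normalizer accepts; a check like this only catches obvious mistakes, such as a stemmer in the filter chain, earlier.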
