
Tokenizers

A tokenizer receives a stream of characters and splits the text into individual tokens. A token consists of a term (usually, a word) and metadata about this term. For example, a tokenizer can split text on white space so that the text Actions speak louder than words. becomes [Actions, speak, louder, than, words.].

The output of a tokenizer is a stream of tokens. Tokenizers also maintain the following metadata about tokens:

  • The order or position of each token: This information is used for word and phrase proximity queries.
  • The starting and ending positions (offsets) of the tokens in the text: This information is used for highlighting search terms.
  • The token type: Some tokenizers (for example, standard) classify tokens by type, for example, <ALPHANUM> or <NUM>. Simpler tokenizers (for example, letter) only classify tokens as type word.
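
You can inspect this metadata by running a tokenizer directly. The following is a minimal sketch that assumes Lucenia exposes an OpenSearch-style _analyze API; each token in the response includes the term itself along with its start and end offsets, its type, and its position:

```json
POST _analyze
{
  "tokenizer": "standard",
  "text": "Actions speak louder than words."
}
```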

You can use tokenizers to define custom analyzers.
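
As a sketch of how a tokenizer plugs into a custom analyzer, the following example defines an analyzer that uses the whitespace tokenizer together with a lowercase token filter. The index and analyzer names are illustrative, and the settings syntax assumes the OpenSearch-style analysis configuration:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```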

Built-in tokenizers

The following tables list the built-in tokenizers that Lucenia provides.

Word tokenizers

Word tokenizers parse full text into words.

Tokenizer | Description | Example
--- | --- | ---
standard | Parses strings into tokens at word boundaries and removes most punctuation. | It’s fun to contribute a brand-new PR or 2 to Lucenia! becomes [It’s, fun, to, contribute, a, brand, new, PR, or, 2, to, Lucenia]
letter | Parses strings into tokens on any non-letter character and removes non-letter characters. | It’s fun to contribute a brand-new PR or 2 to Lucenia! becomes [It, s, fun, to, contribute, a, brand, new, PR, or, to, Lucenia]
lowercase | Parses strings into tokens on any non-letter character, removes non-letter characters, and converts terms to lowercase. | It’s fun to contribute a brand-new PR or 2 to Lucenia! becomes [it, s, fun, to, contribute, a, brand, new, pr, or, to, lucenia]
whitespace | Parses strings into tokens at white space characters. | It’s fun to contribute a brand-new PR or 2 to Lucenia! becomes [It’s, fun, to, contribute, a, brand-new, PR, or, 2, to, Lucenia!]
uax_url_email | Similar to the standard tokenizer but, unlike it, leaves URLs and email addresses as single terms. | It’s fun to contribute a brand-new PR or 2 to Lucenia lucenia-project@github.com! becomes [It’s, fun, to, contribute, a, brand, new, PR, or, 2, to, Lucenia, lucenia-project@github.com]
classic | Parses strings into tokens on punctuation characters that are followed by a white space character and on hyphens if the term does not contain numbers. Removes punctuation. Leaves URLs and email addresses as single terms. | Part number PA-35234, single-use product (128.32) becomes [Part, number, PA-35234, single, use, product, 128.32]
thai | Parses Thai text into terms. | สวัสดีและยินดีต becomes [สวัสด, และ, ยินดี, ]
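
For example, you can try the uax_url_email tokenizer on the sentence from the table using the _analyze API (a sketch assuming the OpenSearch-style API); the email address comes back as a single token rather than being split apart:

```json
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "It's fun to contribute a brand-new PR or 2 to Lucenia lucenia-project@github.com!"
}
```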

Partial word tokenizers

Partial word tokenizers parse text into words and generate fragments of those words for partial word matching.

Tokenizer | Description | Example
--- | --- | ---
ngram | Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word. | My repo becomes [M, My, y, y ,  ,  r, r, re, e, ep, p, po, o] because the default n-gram length is 1–2 characters
edge_ngram | Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word). | My repo becomes [M, My] because the default n-gram length is 1–2 characters
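
The 1–2 character default is often too short for partial word matching, so these tokenizers are usually configured with custom gram lengths. The following sketch configures an edge_ngram tokenizer inside a custom analyzer; the index and analyzer names are illustrative, and the min_gram, max_gram, and token_chars parameter names follow the OpenSearch convention, so verify them against the Lucenia tokenizer reference:

```json
PUT /products
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer"
        }
      }
    }
  }
}
```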

Structured text tokenizers

Structured text tokenizers parse structured text, such as identifiers, email addresses, paths, or ZIP Codes.

Tokenizer | Description | Example
--- | --- | ---
keyword | No-op tokenizer. Outputs the entire string unchanged. Can be combined with token filters, like lowercase, to normalize terms. | My repo becomes My repo
pattern | Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms. Uses Java regular expressions. Can be configured with a regex pattern. | https://lucenia.io/blogs becomes [https, lucenia, io, blogs] because by default the tokenizer splits terms at word boundaries (\W+)
simple_pattern | Uses a regular expression pattern to return matching text as terms. Uses Lucene regular expressions, so it is faster than the pattern tokenizer because it supports only a subset of the pattern tokenizer's regular expression syntax. | Returns an empty array by default. Must be configured with a pattern because the pattern defaults to an empty string.
simple_pattern_split | Uses a regular expression pattern to split the text at matches rather than returning the matches as terms. Uses Lucene regular expressions, so it is faster than the pattern tokenizer because it supports only a subset of the pattern tokenizer's regular expression syntax. | No-op by default. Must be configured with a pattern.
char_group | Parses on a set of configurable characters. Faster than tokenizers that run regular expressions. | No-op by default. Must be configured with a list of characters.
path_hierarchy | Parses text on the path separator (by default, /) and returns a full path to each component in the tree hierarchy. | one/two/three becomes [one, one/two, one/two/three]
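
To reproduce the path_hierarchy example from the table, you can run the tokenizer through the _analyze API (a sketch assuming the OpenSearch-style API):

```json
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "one/two/three"
}
```

The response contains the tokens one, one/two, and one/two/three, which makes this tokenizer useful for matching documents at any level of a directory-like hierarchy.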