Tokenizers
A tokenizer receives a stream of characters and splits the text into individual tokens. A token consists of a term (usually, a word) and metadata about this term. For example, a tokenizer can split text on white space so that the text `Actions speak louder than words.` becomes [`Actions`, `speak`, `louder`, `than`, `words.`].
The output of a tokenizer is a stream of tokens. Tokenizers also maintain the following metadata about tokens:
- The order or position of each token: This information is used for word and phrase proximity queries.
- The starting and ending positions (offsets) of the tokens in the text: This information is used for highlighting search terms.
- The token type: Some tokenizers (for example, `standard`) classify tokens by type, for example, `<ALPHANUM>` or `<NUM>`. Simpler tokenizers (for example, `letter`) only classify tokens as type `word`.
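
As a rough illustration of the metadata listed above, the following Python sketch splits text on white space and records each token's position, offsets, and type. The `Token` class and the `whitespace_tokenize` function are hypothetical names used for illustration only; they are not part of Lucenia, and the built-in tokenizers may differ in detail.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    term: str          # the text of the token
    position: int      # order of the token in the stream (0-based here)
    start_offset: int  # index of the first character in the source text
    end_offset: int    # index one past the last character
    type: str = "word" # simple tokenizers assign a single type


def whitespace_tokenize(text):
    """Split text on runs of white space, keeping position and offsets."""
    return [
        Token(match.group(), position, match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


tokens = whitespace_tokenize("Actions speak louder than words.")
for token in tokens:
    print(token)
```

The offsets make it possible to map a matching term back to the exact span of the original text, which is what highlighting relies on.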
You can use tokenizers to define custom analyzers.
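
The following Python sketch builds a request body for a custom analyzer that pairs the `keyword` tokenizer with the `lowercase` token filter. It assumes that Lucenia accepts the `settings.analysis.analyzer` layout used by OpenSearch-compatible engines, which this page does not confirm; the analyzer name `my_keyword_lowercase` is hypothetical.

```python
import json

# Assumption: the index settings follow the OpenSearch-style
# "settings" -> "analysis" -> "analyzer" layout.
# "my_keyword_lowercase" is a hypothetical analyzer name.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_keyword_lowercase": {
                    "type": "custom",
                    "tokenizer": "keyword",   # emit the whole input as one token
                    "filter": ["lowercase"],  # then normalize the term's case
                }
            }
        }
    }
}

# Serialize the body so it can be sent with an index creation request.
print(json.dumps(index_settings, indent=2))
```

Under these assumptions, the printed JSON would be supplied as the body of an index creation request.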
Built-in tokenizers
The following tables list the built-in tokenizers that Lucenia provides.
Word tokenizers
Word tokenizers parse full text into words.
Tokenizer | Description | Example |
---|---|---|
`standard` | Parses strings into tokens at word boundaries. Removes most punctuation. | `It’s fun to contribute a brand-new PR or 2 to Lucenia!` becomes [`It’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `2`, `to`, `Lucenia`] |
`letter` | Parses strings into tokens on any non-letter character. Removes non-letter characters. | `It’s fun to contribute a brand-new PR or 2 to Lucenia!` becomes [`It`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `to`, `Lucenia`] |
`lowercase` | Parses strings into tokens on any non-letter character. Removes non-letter characters. Converts terms to lowercase. | `It’s fun to contribute a brand-new PR or 2 to Lucenia!` becomes [`it`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `pr`, `or`, `to`, `lucenia`] |
`whitespace` | Parses strings into tokens at white space characters. | `It’s fun to contribute a brand-new PR or 2 to Lucenia!` becomes [`It’s`, `fun`, `to`, `contribute`, `a`, `brand-new`, `PR`, `or`, `2`, `to`, `Lucenia!`] |
`uax_url_email` | Similar to the `standard` tokenizer. Unlike the `standard` tokenizer, leaves URLs and email addresses as single terms. | `It’s fun to contribute a brand-new PR or 2 to Lucenia lucenia-project@github.com!` becomes [`It’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `2`, `to`, `Lucenia`, `lucenia-project@github.com`] |
`classic` | Parses strings into tokens on punctuation characters that are followed by a white space character and on hyphens if the term does not contain numbers. Removes punctuation. Leaves URLs and email addresses as single terms. | `Part number PA-35234, single-use product (128.32)` becomes [`Part`, `number`, `PA-35234`, `single`, `use`, `product`, `128.32`] |
`thai` | Parses Thai text into terms. | `สวัสดีและยินดีต` becomes [`สวัสด`, `และ`, `ยินดี`, `ต`] |
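
The simpler rows in the table above can be approximated in a few lines of Python. This is a sketch of the described behavior, not the Lucenia implementation: the function names are hypothetical, and Python's notion of white space and letters may differ from the engine's in edge cases.

```python
import re

TEXT = "It’s fun to contribute a brand-new PR or 2 to Lucenia!"


def whitespace_tokenizer(text):
    """Split at white space and keep everything else, including punctuation."""
    return text.split()


def letter_tokenizer(text):
    """Keep runs of letters; any non-letter character ends a token."""
    # [^\W\d_] matches word characters that are neither digits nor underscores.
    return re.findall(r"[^\W\d_]+", text)


def lowercase_tokenizer(text):
    """Same as the letter tokenizer, but lowercase each term."""
    return [term.lower() for term in letter_tokenizer(text)]


print(whitespace_tokenizer(TEXT))
print(letter_tokenizer(TEXT))
print(lowercase_tokenizer(TEXT))
```

Note how the typographic apostrophe in `It’s` and the hyphen in `brand-new` survive white space splitting but act as separators for the letter-based tokenizers, and how the digit `2` is dropped entirely by the latter.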
Partial word tokenizers
Partial word tokenizers parse text into words and generate fragments of those words for partial word matching.
Tokenizer | Description | Example |
---|---|---|
`ngram` | Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word. | `My repo` becomes [`M`, `My`, `y`, `y `, ` `, ` r`, `r`, `re`, `e`, `ep`, `p`, `po`, `o`] because the default n-gram length is 1–2 characters (by default, the white space is kept as part of the n-grams) |
`edge_ngram` | Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word). | `My repo` becomes [`M`, `My`] because the default n-gram length is 1–2 characters |
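
To make the two examples above concrete, here is a minimal Python sketch of n-gram and edge n-gram generation over a single string, using the default lengths of 1 to 2 characters from the table. The function names and the `min_gram`/`max_gram` parameter names are illustrative assumptions, and the sketch skips the optional step of first splitting the text into words on specified characters.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Return every substring of length min_gram..max_gram, by start position."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def edge_ngrams(text, min_gram=1, max_gram=2):
    """Return only the n-grams anchored at the beginning of the text."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


print(ngrams("My repo"))
print(edge_ngrams("My repo"))
```

Because no splitting characters are applied here, the space in `My repo` appears inside some of the n-grams, and the edge n-grams are taken from the start of the whole string rather than from each word.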
Structured text tokenizers
Structured text tokenizers parse structured text, such as identifiers, email addresses, paths, or ZIP Codes.
Tokenizer | Description | Example |
---|---|---|
`keyword` | No-op tokenizer. Outputs the entire string unchanged. Can be combined with token filters, like `lowercase`, to normalize terms. | `My repo` becomes `My repo` |
`pattern` | Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms. Uses Java regular expressions. | `https://lucenia.io/blogs` becomes [`https`, `lucenia`, `io`, `blogs`] because by default the tokenizer splits terms at word boundaries (`\W+`). Can be configured with a regex pattern. |
`simple_pattern` | Uses a regular expression pattern to return matching text as terms. Uses Lucene regular expressions. Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions. | Returns an empty array by default. Must be configured with a pattern because the pattern defaults to an empty string. |
`simple_pattern_split` | Uses a regular expression pattern to split the text at matches rather than returning the matches as terms. Uses Lucene regular expressions. Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions. | No-op by default. Must be configured with a pattern. |
`char_group` | Parses on a set of configurable characters. Faster than tokenizers that run regular expressions. | No-op by default. Must be configured with a list of characters. |
`path_hierarchy` | Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy. | `one/two/three` becomes [`one`, `one/two`, `one/two/three`] |
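
The behaviors described in the table above can be sketched in Python as follows. The function names are hypothetical, and Python's `re` module stands in for the Java and Lucene regular expression engines, so pattern syntax and edge cases may differ from what Lucenia does.

```python
import re


def pattern_split(text, pattern=r"\W+"):
    """Split text at each match of the pattern and drop empty terms."""
    return [term for term in re.split(pattern, text) if term]


def simple_pattern(text, pattern=""):
    """Return the non-empty matches of the pattern as terms."""
    return [m.group() for m in re.finditer(pattern, text) if m.group()]


def char_group(text, split_chars):
    """Split text at any character in split_chars, without using regexes."""
    terms, current = [], []
    for ch in text:
        if ch in split_chars:
            if current:
                terms.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        terms.append("".join(current))
    return terms


def path_hierarchy(text, delimiter="/"):
    """Return the path to each component, from the root downward."""
    parts = text.split(delimiter)
    paths = [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]
    return [path for path in paths if path]


print(pattern_split("https://lucenia.io/blogs"))
print(simple_pattern("order 66 of 100", r"[0-9]+"))
print(char_group("part-35 rev_2", {"-", " ", "_"}))
print(path_hierarchy("one/two/three"))
```

With an empty pattern, `simple_pattern` returns no terms, and with an empty character set, `char_group` returns the input unchanged, which mirrors the defaults noted in the table.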