Delimited term frequency token filter
The delimited_term_freq
token filter separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is |
, then for the string foo|5
, foo
is the token and 5
is its term frequency. If there is no delimiter, the token filter does not modify the term frequency.
You can either use a preconfigured delimited_term_freq
token filter or create a custom one.
Preconfigured delimited_term_freq
token filter
The preconfigured delimited_term_freq
token filter uses the |
default delimiter. To analyze text with the preconfigured token filter, send the following request to the _analyze
endpoint:
POST /_analyze
{
"text": "foo|100",
"tokenizer": "keyword",
"filter": ["delimited_term_freq"],
"attributes": ["termFrequency"],
"explain": true
}
The attributes
array specifies that you want to filter the output of the explain
parameter to return only termFrequency
. The response contains both the original token and the parsed output of the token filter that includes the term frequency:
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": {
"name": "keyword",
"tokens": [
{
"token": "foo|100",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0,
"termFrequency": 1
}
]
},
"tokenfilters": [
{
"name": "delimited_term_freq",
"tokens": [
{
"token": "foo",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0,
"termFrequency": 100
}
]
}
]
}
}
Custom delimited_term_freq
token filter
To configure a custom delimited_term_freq
token filter, first specify the delimiter in the mapping request, in this example, ^
:
PUT /testindex
{
"settings": {
"analysis": {
"filter": {
"my_delimited_term_freq": {
"type": "delimited_term_freq",
"delimiter": "^"
}
}
}
}
}
Then analyze text with the custom token filter you created:
POST /testindex/_analyze
{
"text": "foo^3",
"tokenizer": "keyword",
"filter": ["my_delimited_term_freq"],
"attributes": ["termFrequency"],
"explain": true
}
The response contains both the original token and the parsed version with the term frequency:
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": {
"name": "keyword",
"tokens": [
{
"token": "foo|100",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0,
"termFrequency": 1
}
]
},
"tokenfilters": [
{
"name": "delimited_term_freq",
"tokens": [
{
"token": "foo",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0,
"termFrequency": 100
}
]
}
]
}
}
Combining delimited_token_filter
with scripts
You can write Painless scripts to calculate custom scores for the documents in the results.
First, create an index and provide the following mappings and settings:
PUT /test
{
"settings": {
"number_of_shards": 1,
"analysis": {
"tokenizer": {
"keyword_tokenizer": {
"type": "keyword"
}
},
"filter": {
"my_delimited_term_freq": {
"type": "delimited_term_freq",
"delimiter": "^"
}
},
"analyzer": {
"custom_delimited_analyzer": {
"tokenizer": "keyword_tokenizer",
"filter": ["my_delimited_term_freq"]
}
}
}
},
"mappings": {
"properties": {
"f1": {
"type": "keyword"
},
"f2": {
"type": "text",
"analyzer": "custom_delimited_analyzer",
"index_options": "freqs"
}
}
}
}
The test
index uses a keyword tokenizer, a delimited term frequency token filter (where the delimiter is ^
), and a custom analyzer that includes a keyword tokenizer and a delimited term frequency token filter. The mappings specify that the field f1
is a keyword field and the field f2
is a text field. The field f2
uses the custom analyzer defined in the settings for text analysis. Additionally, specifying index_options
signals to Lucenia to add the term frequencies to the inverted index. You’ll use the term frequencies to give documents with repeated terms a higher score.
Next, index two documents using bulk upload:
POST /_bulk?refresh=true
{"index": {"_index": "test", "_id": "doc1"}}
{"f1": "v0|100", "f2": "v1^30"}
{"index": {"_index": "test", "_id": "doc2"}}
{"f2": "v2|100"}
The following query searches for all documents in the index and calculates document scores as the term frequency of the term v1
in the field f2
:
GET /test/_search
{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"script_score": {
"script": {
"source": "termFreq(params.field, params.term)",
"params": {
"field": "f2",
"term": "v1"
}
}
}
}
}
}
In the response, document 1 has a score of 30 because the term frequency of the term v1
in the field f2
is 30. Document 2 has a score of 0 because the term v1
does not appear in f2
:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 30,
"hits": [
{
"_index": "test",
"_id": "doc1",
"_score": 30,
"_source": {
"f1": "v0|100",
"f2": "v1^30"
}
},
{
"_index": "test",
"_id": "doc2",
"_score": 0,
"_source": {
"f2": "v2|100"
}
}
]
}
}
Parameters
The following table lists all parameters that the delimited_term_freq
supports.
Parameter | Required/Optional | Description |
---|---|---|
delimiter | Optional | The delimiter used to separate tokens from term frequencies. Must be a single non-null character. Default is | . |