Significant text aggregations
The `significant_text` aggregation is similar to the `significant_terms` aggregation, but it works on raw text fields. It uses statistical analysis to measure the change in a term's popularity between the foreground and background sets. For example, it might suggest Tesla when you search for its stock ticker symbol TSLA.
The `significant_text` aggregation re-analyzes the source text on the fly, filtering out noisy data like duplicate paragraphs and boilerplate headers and footers, which might otherwise skew the results.
Re-analyzing high-cardinality datasets can be very CPU intensive. We recommend using the `significant_text` aggregation inside a `sampler` aggregation to limit the analysis to a small selection of top-matching documents, for example 200.
You can set the following parameters:
- `min_doc_count` - Return results that match more than the configured number of top hits. We recommend not setting `min_doc_count` to 1 because that tends to return terms that are typos or misspellings. Finding more than one instance of a term helps confirm that its significance is not the result of a one-off accident. The default value of 3 provides a minimum weight of evidence.
- `shard_size` - Setting a high value increases stability (and accuracy) at the expense of computational performance.
- `shard_min_doc_count` - If your text contains many low-frequency words and you're not interested in them (for example, typos), you can set the `shard_min_doc_count` parameter to filter out candidate terms at the shard level that, with reasonable certainty, will not reach the required `min_doc_count` even after merging the local significant text frequencies. The default value is 1, which has no effect until you explicitly set it. We recommend setting this value much lower than the `min_doc_count` value.
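For instance, all three parameters can be set on the aggregation body itself. The following is only a sketch; the `my-index` index and `body` field are placeholders:

```json
GET my-index/_search
{
  "query": {
    "match": { "body": "example" }
  },
  "aggregations": {
    "keywords": {
      "significant_text": {
        "field": "body",
        "min_doc_count": 4,
        "shard_min_doc_count": 2,
        "shard_size": 200
      }
    }
  }
}
```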
Assume that you have the complete works of Shakespeare indexed in a Lucenia cluster. You can find significant texts in relation to the word "breathe" in the `text_entry` field:
```json
GET shakespeare/_search
{
  "query": {
    "match": {
      "text_entry": "breathe"
    }
  },
  "aggregations": {
    "my_sample": {
      "sampler": {
        "shard_size": 100
      },
      "aggregations": {
        "keywords": {
          "significant_text": {
            "field": "text_entry",
            "min_doc_count": 4
          }
        }
      }
    }
  }
}
```
Example response
"aggregations" : {
"my_sample" : {
"doc_count" : 59,
"keywords" : {
"doc_count" : 59,
"bg_count" : 111396,
"buckets" : [
{
"key" : "breathe",
"doc_count" : 59,
"score" : 1887.0677966101694,
"bg_count" : 59
},
{
"key" : "air",
"doc_count" : 4,
"score" : 2.641295376716233,
"bg_count" : 189
},
{
"key" : "dead",
"doc_count" : 4,
"score" : 0.9665839666414213,
"bg_count" : 495
},
{
"key" : "life",
"doc_count" : 5,
"score" : 0.9090787433467572,
"bg_count" : 805
}
]
}
}
}
}
The most significant texts in relation to `breathe` are `air`, `dead`, and `life`.
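Each bucket's `score` reflects how disproportionately the term appears in the foreground set (documents matching the query) compared with the background set (the whole index). The scores in the response above match the JLH heuristic, the usual default for significance scoring, which multiplies the absolute and relative change in term frequency. A minimal sketch of that calculation (the function name is illustrative):

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """JLH significance heuristic: (fg% - bg%) * (fg% / bg%)."""
    fg = fg_count / fg_total  # term frequency in the foreground (query) set
    bg = bg_count / bg_total  # term frequency in the background (index) set
    return (fg - bg) * (fg / bg)

# "air" appears in 4 of the 59 matching documents but in only 189 of the
# 111,396 documents overall, reproducing the score of ~2.6413 shown above.
print(jlh_score(4, 59, 189, 111396))
```

A term like `dead` scores lower than `air` despite the same foreground count because its background count (495 versus 189) makes it less surprising.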
The `significant_text` aggregation has the following limitations:

- It doesn't support child aggregations because child aggregations come at a high memory cost. As a workaround, you can add a follow-up query using a `terms` aggregation with an `include` clause and a child aggregation.
- It doesn't support nested objects because it works with the document JSON source.
- The document counts might have some (typically small) inaccuracies because they are based on summing the samples returned from each shard. You can use the `shard_size` parameter to fine-tune the trade-off between accuracy and performance. By default, `shard_size` is set to -1, which estimates it automatically based on the number of shards and the `size` parameter.
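The follow-up `terms` query mentioned in the first limitation might be sketched as follows: restrict the `terms` aggregation to the previously identified significant terms with an `include` clause, then nest the child aggregation you need. The included terms here are illustrative (taken from the earlier response), and note that a `terms` aggregation over a `text` field requires fielddata to be enabled on that field:

```json
GET shakespeare/_search
{
  "query": {
    "match": { "text_entry": "breathe" }
  },
  "aggregations": {
    "keywords": {
      "terms": {
        "field": "text_entry",
        "include": [ "air", "dead", "life" ]
      },
      "aggregations": {
        "top_speakers": {
          "terms": { "field": "speaker" }
        }
      }
    }
  }
}
```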
The default source of statistical information for background term frequencies is the entire index. You can narrow this scope by specifying a background filter:
```json
GET shakespeare/_search
{
  "query": {
    "match": {
      "text_entry": "breathe"
    }
  },
  "aggregations": {
    "my_sample": {
      "sampler": {
        "shard_size": 100
      },
      "aggregations": {
        "keywords": {
          "significant_text": {
            "field": "text_entry",
            "background_filter": {
              "term": {
                "speaker": "JOHN OF GAUNT"
              }
            }
          }
        }
      }
    }
  }
}
```