Chunk processor
The chunk processor splits text content into smaller, overlapping segments optimized for vector search and retrieval-augmented generation (RAG). It operates on the ContentBlock array produced by the content_extract processor and outputs a flat array of chunk objects ready for the embed processor.
Four chunking algorithms are available, each suited to different content types and retrieval strategies.
Syntax
{
"chunk": {
"field": "extracted.blocks",
"target_field": "chunks",
"algorithm": "recursive",
"chunk_size": 2000,
"chunk_overlap": 200
}
}
Algorithms
The following diagram shows how each algorithm splits the same input text:
Input text (6000 chars, 3 paragraphs):
┌──────────────────────────────────────────────────────────────────┐
│ Paragraph 1 (2500 chars) │ Paragraph 2 (2000 chars) │ P3 ... │
└──────────────────────────────────────────────────────────────────┘
recursive (default):
┌─── chunk 0 ───┐overlap┌─── chunk 1 ───┐overlap┌─── chunk 2 ───┐
│ Para 1 (pt1) │◄─200─►│ Para 1 (pt2) │ │ Paragraph 2 │
│ split at ¶ │ │ + start of P2 │ │ + Para 3 │
└───────────────┘ └───────────────┘ └───────────────┘
Splits: paragraph → sentence → word boundaries
fixed:
┌── chunk 0 ──┐overlap┌── chunk 1 ──┐overlap┌── chunk 2 ──┐
│ chars 0-2000│◄─200─►│chars 1800- │◄─200─►│chars 3600- │
│ │ │ 3800 │ │ 5600 │
└─────────────┘ └─────────────┘ └─────────────┘
Splits: at word boundaries within chunk_size window
semantic:
┌───── chunk 0 ─────┐┌───── chunk 1 ─────┐┌── chunk 2 ──┐
│ Sentences 1-4 ││ Sentences 5-9 ││ Sent. 10-12 │
│ (similar topics) ││ (similar topics) ││ │
└───────────────────┘└───────────────────┘└─────────────┘
Splits: where embedding cosine similarity drops below threshold
topic_shift:
┌──── chunk 0 ────┐┌────── chunk 1 ──────┐┌── chunk 2 ──┐
│ Sentences 1-5 ││ Sentences 6-10 ││ Sent. 11-12 │
│ (vocabulary A) ││ (vocabulary B) ││ (vocab. C) │
└─────────────────┘└─────────────────────┘└─────────────┘
Splits: where Jaccard word overlap between windows drops below threshold
Recursive (default)
Splits text hierarchically: paragraphs first, then sentences within oversized paragraphs, then word boundaries as a last resort. This preserves the natural structure of documents and minimizes mid-sentence breaks.
- Paragraph detection: double newlines (
\n\n) - Sentence detection: locale-aware
BreakIterator(respects abbreviations like Germanz.B.) - Overlap: character-level tail from previous chunk
Fixed
Character-based splitting with word-boundary awareness. Deterministic and uniform -- every chunk is exactly chunk_size characters (except the last). Breaks at the last space within the chunk window.
Semantic
Groups sentences by embedding similarity. Sentences are embedded, then greedily merged while the cosine similarity between the chunk centroid and the next sentence stays above similarity_threshold. Requires an embedding model at chunking time.
- Computes L2-normalized centroid of sentence embeddings in each chunk
- Falls back to
recursiveif the embedding call fails (configurable) - Produces the highest-quality chunks for vector search but costs one embedding call per document
Topic shift
Detects topic boundaries using vocabulary overlap (Jaccard similarity) between sliding windows of sentences. No embedding model required -- purely lexical.
- Slides a window of
window_sizesentences across the text - Marks a boundary where Jaccard similarity between left and right windows drops below
similarity_threshold - Skips ahead by
window_sizeafter each boundary to suppress over-fragmentation
Configuration parameters
Common parameters
| Parameter | Data type | Required/Optional | Description |
|---|---|---|---|
field | String | Optional | Source field containing content blocks or direct text. Default is extracted.blocks. |
target_field | String | Optional | Field where chunk array is written. Default is chunks. |
block_types | Array | Optional | Content block types to chunk. Non-matching blocks pass through unchanged. Default is ["text", "audio"]. |
algorithm | String | Optional | Chunking algorithm: recursive, fixed, semantic, or topic_shift. Default is recursive. |
chunk_size | Integer | Optional | Target chunk size in characters. Default is 2000. |
chunk_overlap | Integer | Optional | Character-level overlap between consecutive chunks. Must be less than chunk_size. Default is 200. |
min_chunk_size | Integer | Optional | Minimum chunk size. Smaller trailing chunks are merged with the previous chunk. Default is 100. |
sentence_locale | String | Optional | IETF BCP 47 language tag for sentence boundary detection (e.g., de-DE, fr-FR, ja-JP). Default is ROOT (language-neutral). |
description | String | Optional | A brief description of the processor. |
tag | String | Optional | An identifier tag for the processor. |
Semantic algorithm parameters
These parameters are required only when algorithm is semantic.
| Parameter | Data type | Required/Optional | Description |
|---|---|---|---|
model_id | String | Required | Embedding model identifier (e.g., amazon.titan-embed-text-v2:0). |
provider | String | Required | Embedding provider: bedrock, openai, or http. |
dimensions | Integer | Optional | Embedding dimensions. Default is 1536. |
provider_config | Object | Optional | Provider-specific configuration (region, API key setting, endpoint). |
similarity_threshold | Float | Optional | Minimum cosine similarity to keep merging sentences. Range [0.0, 1.0]. Default is 0.8. |
overlap_sentences | Integer | Optional | Number of sentences carried forward from the previous chunk. Default is 0. |
on_failure_action | String | Optional | fallback (silently use recursive) or fail (reject document). Default is fallback. |
Topic shift algorithm parameters
These parameters apply only when algorithm is topic_shift.
| Parameter | Data type | Required/Optional | Description |
|---|---|---|---|
similarity_threshold | Float | Optional | Minimum Jaccard similarity to consider sentences as same topic. Range [0.0, 1.0]. Default is 0.3. |
window_size | Integer | Optional | Sliding window size in sentences. Must be at least 1. Default is 3. |
overlap_sentences | Integer | Optional | Number of sentences carried forward from the previous chunk. Default is 0. |
Output structure
Each chunk in the output array contains the following fields:
{
"chunks": [
{
"text": "The chunked text content...",
"chunk_index": 0,
"source_type": "text"
},
{
"text": "Next chunk with overlap...",
"chunk_index": 1,
"source_type": "text"
}
]
}
| Field | Description |
|---|---|
text | The chunk text content. |
chunk_index | Sequential zero-based index across all chunks in the document. |
source_type | The original block type (e.g., text, audio). Present only when input is a content block array. |
Using the processor
Example 1: Recursive chunking (default)
PUT _ingest/pipeline/chunk-recursive
{
"processors": [
{
"chunk": {
"field": "extracted.blocks",
"target_field": "chunks",
"chunk_size": 2000,
"chunk_overlap": 200,
"min_chunk_size": 100
}
}
]
}
Example 2: Semantic chunking with Bedrock
PUT _ingest/pipeline/chunk-semantic
{
"processors": [
{
"chunk": {
"field": "extracted.blocks",
"target_field": "chunks",
"algorithm": "semantic",
"model_id": "amazon.titan-embed-text-v2:0",
"provider": "bedrock",
"dimensions": 1024,
"similarity_threshold": 0.75,
"overlap_sentences": 1,
"on_failure_action": "fallback",
"provider_config": {
"region": "us-east-2"
}
}
}
]
}
Example 3: Topic shift chunking for multilingual content
PUT _ingest/pipeline/chunk-topic
{
"processors": [
{
"chunk": {
"field": "extracted.blocks",
"target_field": "chunks",
"algorithm": "topic_shift",
"window_size": 5,
"similarity_threshold": 0.25,
"sentence_locale": "de-DE",
"chunk_size": 1500,
"overlap_sentences": 2
}
}
]
}
Example 4: Full pipeline (extract + chunk + embed)
PUT _ingest/pipeline/doc-pipeline
{
"processors": [
{
"content_extract": {
"input_mode": "reference",
"source_uri_field": "source_uri",
"reference_config": { "region": "us-east-2" }
}
},
{
"chunk": {
"field": "extracted.blocks",
"target_field": "chunks",
"algorithm": "recursive",
"chunk_size": 2000,
"chunk_overlap": 200
}
},
{
"embed": {
"field": "chunks",
"model_id": "amazon.titan-embed-text-v2:0",
"provider": "bedrock",
"dimensions": 1024,
"provider_config": { "region": "us-east-2" }
}
}
]
}
Algorithm selection guide
┌─────────────────────┐
│ What kind of text? │
└─────────┬───────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
Structured Mixed/unknown Uniform
documents content data
(reports, (web scrapes, (logs, CSVs,
papers) forums) structured)
│ │ │
▼ ▼ ▼
recursive topic_shift fixed
(default)
│
Need highest quality
vector chunks?
│
┌────┴────┐
▼ ▼
Yes No
│ │
semantic topic_shift
(costs 1 (free, no
embed model needed)
call)
| Algorithm | Embedding cost | Best for | Trade-off |
|---|---|---|---|
recursive | None | Structured documents (reports, papers, manuals) | Good general-purpose default |
fixed | None | Uniform content, deterministic sizing | May break mid-sentence |
semantic | 1 embed call per document | Highest-quality vector retrieval | Slower, costs per document |
topic_shift | None | Topic-diverse content, multilingual | Heuristic -- less precise than semantic |