Version: 0.11.0

Chunk processor

Introduced 0.11.0

The chunk processor splits text content into smaller, overlapping segments optimized for vector search and retrieval-augmented generation (RAG). It operates on the ContentBlock array produced by the content_extract processor and outputs a flat array of chunk objects ready for the embed processor.

Four chunking algorithms are available, each suited to different content types and retrieval strategies.

Syntax

{
  "chunk": {
    "field": "extracted.blocks",
    "target_field": "chunks",
    "algorithm": "recursive",
    "chunk_size": 2000,
    "chunk_overlap": 200
  }
}

Algorithms

The following diagram shows how each algorithm splits the same input text:

Input text (6000 chars, 3 paragraphs):
┌──────────────────────────────────────────────────────────────────┐
│ Paragraph 1 (2500 chars) │ Paragraph 2 (2000 chars) │ P3 ...     │
└──────────────────────────────────────────────────────────────────┘

recursive (default):
┌─── chunk 0 ───┐overlap┌─── chunk 1 ───┐overlap┌─── chunk 2 ───┐
│ Para 1 (pt1)  │◄─200─►│ Para 1 (pt2)  │       │ Paragraph 2   │
│ split at ¶    │       │ + start of P2 │       │ + Para 3      │
└───────────────┘       └───────────────┘       └───────────────┘
Splits: paragraph → sentence → word boundaries

fixed:
┌── chunk 0 ──┐overlap┌── chunk 1 ──┐overlap┌── chunk 2 ──┐
│ chars 0-2000│◄─200─►│chars 1800-  │◄─200─►│chars 3600-  │
│             │       │   3800      │       │   5600      │
└─────────────┘       └─────────────┘       └─────────────┘
Splits: at word boundaries within chunk_size window

semantic:
┌───── chunk 0 ─────┐┌───── chunk 1 ─────┐┌── chunk 2 ──┐
│ Sentences 1-4     ││ Sentences 5-9     ││ Sent. 10-12 │
│ (similar topics)  ││ (similar topics)  ││             │
└───────────────────┘└───────────────────┘└─────────────┘
Splits: where embedding cosine similarity drops below threshold

topic_shift:
┌──── chunk 0 ────┐┌────── chunk 1 ──────┐┌── chunk 2 ──┐
│ Sentences 1-5   ││ Sentences 6-10      ││ Sent. 11-12 │
│ (vocabulary A)  ││ (vocabulary B)      ││ (vocab. C)  │
└─────────────────┘└─────────────────────┘└─────────────┘
Splits: where Jaccard word overlap between windows drops below threshold

Recursive (default)

Splits text hierarchically: paragraphs first, then sentences within oversized paragraphs, then word boundaries as a last resort. This preserves the natural structure of documents and minimizes mid-sentence breaks.

  • Paragraph detection: double newlines (\n\n)
  • Sentence detection: locale-aware BreakIterator (respects abbreviations like German z.B.)
  • Overlap: character-level tail from previous chunk
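The hierarchy above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the processor's implementation: the function names `split_units` and `recursive_chunks` are hypothetical, a naive regex stands in for the locale-aware BreakIterator, and units are rejoined with a single space, so paragraph breaks are not preserved.

```python
import re


def split_units(text, budget):
    """Break text into pieces no longer than `budget`, preferring
    paragraph, then sentence, then word boundaries."""
    units = []
    for para in text.split("\n\n"):
        if len(para) <= budget:
            units.append(para)
            continue
        # Assumption: naive sentence split. The processor is documented
        # to use a locale-aware BreakIterator instead.
        for sent in re.split(r"(?<=[.!?])\s+", para):
            if len(sent) <= budget:
                units.append(sent)
                continue
            # Last resort: pack words greedily up to the budget.
            piece = ""
            for word in sent.split():
                candidate = f"{piece} {word}".strip()
                if len(candidate) > budget and piece:
                    units.append(piece)
                    piece = word
                else:
                    piece = candidate
            if piece:
                units.append(piece)
    return [u for u in units if u.strip()]


def recursive_chunks(text, chunk_size=2000, chunk_overlap=200):
    """Pack units into chunks, seeding each new chunk with the
    character-level tail of the previous one."""
    # Reserve room for the overlap tail plus one joining space.
    budget = max(1, chunk_size - chunk_overlap - 1)
    chunks, current = [], ""
    for unit in split_units(text, budget):
        candidate = f"{current} {unit}".strip() if current else unit
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            tail = current[-chunk_overlap:] if chunk_overlap else ""
            current = f"{tail} {unit}".strip()
    if current:
        chunks.append(current)
    return chunks
```

In this sketch a chunk can exceed chunk_size only when a single word is longer than the reserved budget; how the processor handles that case is not specified here.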

Fixed

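Character-based splitting with word-boundary awareness. Deterministic and near-uniform: each window spans chunk_size characters, and the chunk breaks at the last space within that window, so every chunk is at most chunk_size characters long. Sentence boundaries are ignored, so a chunk may end mid-sentence.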
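A minimal Python sketch of window-based splitting that backs up to the last space follows. The function name `fixed_chunks` is hypothetical, and the rule that the break must fall after the overlap region (so each step advances) is an assumption of this sketch rather than documented behavior.

```python
def fixed_chunks(text, chunk_size=2000, chunk_overlap=200):
    """Split text into windows of at most chunk_size characters,
    breaking at the last space in each window where possible."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be less than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Assumption: only accept a space beyond the overlap region,
            # which guarantees the next window starts further along.
            space = text.rfind(" ", start + chunk_overlap + 1, end + 1)
            if space != -1:
                end = space
        chunks.append(text[start:end])
        if end == len(text):
            break
        # The next window re-reads the last chunk_overlap characters.
        start = end - chunk_overlap
    return chunks
```

Because the break backs up to a space, chunks in this sketch are at most chunk_size characters long, and dropping the first chunk_overlap characters of every chunk after the first reproduces the input exactly.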

Semantic

Groups sentences by embedding similarity. Sentences are embedded, then greedily merged while the cosine similarity between the chunk centroid and the next sentence stays above similarity_threshold. Requires an embedding model at chunking time.

  • Computes L2-normalized centroid of sentence embeddings in each chunk
  • Falls back to recursive if the embedding call fails (configurable)
  • Produces the highest-quality chunks for vector search but costs one embedding call per document
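The greedy centroid merge can be sketched as follows. This is an illustration under assumptions: `semantic_chunks` is a hypothetical name, `embed` is a caller-supplied function that returns one vector per sentence in a single batched call, and the chunk_size limit and overlap_sentences are left out for brevity.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def semantic_chunks(sentences, embed, similarity_threshold=0.8):
    """Merge consecutive sentences while each new sentence stays close
    to the centroid of the chunk built so far."""
    vectors = [normalize(v) for v in embed(sentences)]
    chunks, current, total = [], [], None
    for sentence, vec in zip(sentences, vectors):
        if current:
            # The normalized sum points the same way as the normalized mean,
            # so this is the L2-normalized centroid of the current chunk.
            centroid = normalize(total)
            cosine = sum(a * b for a, b in zip(centroid, vec))
            if cosine < similarity_threshold:
                chunks.append(" ".join(current))
                current, total = [], None
        current.append(sentence)
        total = vec if total is None else [a + b for a, b in zip(total, vec)]
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a toy two-dimensional embedding that maps sentences about one subject to one axis and everything else to the other, the function splits exactly where the subject changes.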

Topic shift

Detects topic boundaries using vocabulary overlap (Jaccard similarity) between sliding windows of sentences. No embedding model required -- purely lexical.

  • Slides a window of window_size sentences across the text
  • Marks a boundary where Jaccard similarity between left and right windows drops below similarity_threshold
  • Skips ahead by window_size after each boundary to suppress over-fragmentation
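The three steps above can be sketched in Python. The names `topic_boundaries`, `words`, and `jaccard` are hypothetical, and tokenizing with a lowercase `\w+` regex is an assumption; the processor's actual tokenization is not described here.

```python
import re


def words(sentences):
    """Lowercased vocabulary of a group of sentences."""
    return {w for s in sentences for w in re.findall(r"\w+", s.lower())}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def topic_boundaries(sentences, window_size=3, similarity_threshold=0.3):
    """Return each index i at which a new chunk starts at sentences[i]."""
    boundaries, i = [], window_size
    while i + window_size <= len(sentences):
        left = words(sentences[i - window_size:i])
        right = words(sentences[i:i + window_size])
        if jaccard(left, right) < similarity_threshold:
            boundaries.append(i)
            i += window_size  # skip ahead to suppress over-fragmentation
        else:
            i += 1
    return boundaries
```

Slicing the sentence list at the returned indices yields the chunks; a lower similarity_threshold produces fewer, larger chunks.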

Configuration parameters

Common parameters

| Parameter | Data type | Required/Optional | Description |
| --- | --- | --- | --- |
| field | String | Optional | Source field containing content blocks or direct text. Default is extracted.blocks. |
| target_field | String | Optional | Field where chunk array is written. Default is chunks. |
| block_types | Array | Optional | Content block types to chunk. Non-matching blocks pass through unchanged. Default is ["text", "audio"]. |
| algorithm | String | Optional | Chunking algorithm: recursive, fixed, semantic, or topic_shift. Default is recursive. |
| chunk_size | Integer | Optional | Target chunk size in characters. Default is 2000. |
| chunk_overlap | Integer | Optional | Character-level overlap between consecutive chunks. Must be less than chunk_size. Default is 200. |
| min_chunk_size | Integer | Optional | Minimum chunk size. Smaller trailing chunks are merged with the previous chunk. Default is 100. |
| sentence_locale | String | Optional | IETF BCP 47 language tag for sentence boundary detection (e.g., de-DE, fr-FR, ja-JP). Default is ROOT (language-neutral). |
| description | String | Optional | A brief description of the processor. |
| tag | String | Optional | An identifier tag for the processor. |
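As a rough illustration of how the documented defaults and the chunk_overlap constraint interact, the following Python sketch resolves a partial configuration. The helper `resolve_chunk_config` is hypothetical and only mirrors the table above; it is not part of the processor.

```python
# Defaults copied from the common parameters table.
DEFAULTS = {
    "field": "extracted.blocks",
    "target_field": "chunks",
    "block_types": ["text", "audio"],
    "algorithm": "recursive",
    "chunk_size": 2000,
    "chunk_overlap": 200,
    "min_chunk_size": 100,
    "sentence_locale": "ROOT",
}
ALGORITHMS = {"recursive", "fixed", "semantic", "topic_shift"}


def resolve_chunk_config(user_config):
    """Fill in defaults, then check the documented constraints."""
    config = {**DEFAULTS, **user_config}
    if config["algorithm"] not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {config['algorithm']}")
    if config["chunk_overlap"] >= config["chunk_size"]:
        raise ValueError("chunk_overlap must be less than chunk_size")
    return config
```

Note that lowering chunk_size to 200 or less without also lowering chunk_overlap violates the constraint, because the default overlap of 200 still applies.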

Semantic algorithm parameters

These parameters apply only when algorithm is semantic. In that case, model_id and provider are required; the rest are optional.

| Parameter | Data type | Required/Optional | Description |
| --- | --- | --- | --- |
| model_id | String | Required | Embedding model identifier (e.g., amazon.titan-embed-text-v2:0). |
| provider | String | Required | Embedding provider: bedrock, openai, or http. |
| dimensions | Integer | Optional | Embedding dimensions. Default is 1536. |
| provider_config | Object | Optional | Provider-specific configuration (region, API key setting, endpoint). |
| similarity_threshold | Float | Optional | Minimum cosine similarity to keep merging sentences. Range [0.0, 1.0]. Default is 0.8. |
| overlap_sentences | Integer | Optional | Number of sentences carried forward from the previous chunk. Default is 0. |
| on_failure_action | String | Optional | fallback (silently use recursive) or fail (reject document). Default is fallback. |
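The two on_failure_action values can be pictured with a small Python sketch. The wrapper `chunk_with_fallback` and its callable arguments are hypothetical; they only model the documented behavior of falling back to recursive chunking or rejecting the document.

```python
def chunk_with_fallback(text, semantic, recursive, on_failure_action="fallback"):
    """Try semantic chunking; on error either fall back or re-raise.

    `semantic` and `recursive` are caller-supplied chunking functions
    (assumptions of this sketch) that take text and return a list.
    """
    if on_failure_action not in ("fallback", "fail"):
        raise ValueError("on_failure_action must be 'fallback' or 'fail'")
    try:
        return semantic(text)
    except Exception:
        if on_failure_action == "fail":
            raise  # reject the document
        return recursive(text)  # silently use the recursive algorithm
```

With fallback, an embedding outage degrades chunk quality without dropping documents; with fail, the error surfaces at ingest time instead.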

Topic shift algorithm parameters

These parameters apply only when algorithm is topic_shift.

| Parameter | Data type | Required/Optional | Description |
| --- | --- | --- | --- |
| similarity_threshold | Float | Optional | Minimum Jaccard similarity to consider sentences as same topic. Range [0.0, 1.0]. Default is 0.3. |
| window_size | Integer | Optional | Sliding window size in sentences. Must be at least 1. Default is 3. |
| overlap_sentences | Integer | Optional | Number of sentences carried forward from the previous chunk. Default is 0. |

Output structure

Each chunk in the output array contains the following fields:

{
  "chunks": [
    {
      "text": "The chunked text content...",
      "chunk_index": 0,
      "source_type": "text"
    },
    {
      "text": "Next chunk with overlap...",
      "chunk_index": 1,
      "source_type": "text"
    }
  ]
}

| Field | Description |
| --- | --- |
| text | The chunk text content. |
| chunk_index | Sequential zero-based index across all chunks in the document. |
| source_type | The original block type (e.g., text, audio). Present only when input is a content block array. |
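To show how chunk_index runs across all blocks of a document, here is a short Python sketch that assembles the output array. The function `build_output` is hypothetical, and the block shape (dictionaries with `type` and `text` keys) is an assumption made for illustration; the actual ContentBlock schema is defined by the content_extract processor.

```python
def build_output(blocks, chunker, block_types=("text", "audio")):
    """Chunk every matching block and number the chunks consecutively."""
    chunks = []
    for block in blocks:
        # Assumption: blocks are dicts with "type" and "text" keys.
        if block["type"] not in block_types:
            continue  # non-matching blocks are not chunked
        for text in chunker(block["text"]):
            chunks.append({
                "text": text,
                "chunk_index": len(chunks),  # zero-based, document-wide
                "source_type": block["type"],
            })
    return {"chunks": chunks}
```

The index does not restart at each block, so a chunk from the second block continues the numbering of the first.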

Using the processor

Example 1: Recursive chunking (default)

PUT _ingest/pipeline/chunk-recursive
{
  "processors": [
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "chunk_size": 2000,
        "chunk_overlap": 200,
        "min_chunk_size": 100
      }
    }
  ]
}

Example 2: Semantic chunking with Bedrock

PUT _ingest/pipeline/chunk-semantic
{
  "processors": [
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "semantic",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "similarity_threshold": 0.75,
        "overlap_sentences": 1,
        "on_failure_action": "fallback",
        "provider_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}

Example 3: Topic shift chunking for multilingual content

PUT _ingest/pipeline/chunk-topic
{
  "processors": [
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "topic_shift",
        "window_size": 5,
        "similarity_threshold": 0.25,
        "sentence_locale": "de-DE",
        "chunk_size": 1500,
        "overlap_sentences": 2
      }
    }
  ]
}

Example 4: Full pipeline (extract + chunk + embed)

PUT _ingest/pipeline/doc-pipeline
{
  "processors": [
    {
      "content_extract": {
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "reference_config": { "region": "us-east-2" }
      }
    },
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "recursive",
        "chunk_size": 2000,
        "chunk_overlap": 200
      }
    },
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": { "region": "us-east-2" }
      }
    }
  ]
}

Algorithm selection guide

              ┌─────────────────────┐
              │ What kind of text?  │
              └──────────┬──────────┘
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
   Structured      Mixed/unknown       Uniform
   documents          content           data
   (reports,       (web scrapes,     (logs, CSVs,
    papers)           forums)         structured)
        │                │                │
        ▼                ▼                ▼
    recursive       topic_shift         fixed
    (default)

           Need highest quality
             vector chunks?
                    │
               ┌────┴────┐
               ▼         ▼
              Yes        No
               │         │
           semantic   topic_shift
           (costs 1   (free, no
            embed      model
            call)      needed)

| Algorithm | Embedding cost | Best for | Trade-off |
| --- | --- | --- | --- |
| recursive | None | Structured documents (reports, papers, manuals) | Good general-purpose default |
| fixed | None | Uniform content, deterministic sizing | May break mid-sentence |
| semantic | 1 embed call per document | Highest-quality vector retrieval | Slower, costs per document |
| topic_shift | None | Topic-diverse content, multilingual | Heuristic -- less precise than semantic |