Version: 0.11.0

Chunk processor

Introduced 0.11.0

The chunk processor splits text content into smaller, overlapping segments optimized for vector search and retrieval-augmented generation (RAG). It operates on the ContentBlock array produced by the content_extract processor and outputs a flat array of chunk objects ready for the embed processor.

Four chunking algorithms are available, each suited to different content types and retrieval strategies.

Syntax

{
  "chunk": {
    "field": "extracted.blocks",
    "target_field": "chunks",
    "algorithm": "recursive",
    "chunk_size": 2000,
    "chunk_overlap": 200
  }
}

Algorithms

The following diagram shows how each algorithm splits the same input text:

Input text (6000 chars, 3 paragraphs):
┌──────────────────────────────────────────────────────────────────┐
│ Paragraph 1 (2500 chars)  │ Paragraph 2 (2000 chars)  │ P3 ... │
└──────────────────────────────────────────────────────────────────┘

recursive (default):
┌─── chunk 0 ───┐overlap┌─── chunk 1 ───┐overlap┌─── chunk 2 ───┐
│  Para 1 (pt1) │◄─200─►│  Para 1 (pt2) │       │  Paragraph 2  │
│ split at ¶    │       │ + start of P2 │       │  + Para 3     │
└───────────────┘       └───────────────┘       └───────────────┘
  Splits: paragraph → sentence → word boundaries

fixed:
┌── chunk 0 ──┐overlap┌── chunk 1 ──┐overlap┌── chunk 2 ──┐
│ chars 0-2000│◄─200─►│chars 1800-  │◄─200─►│chars 3600-  │
│             │       │     3800    │       │     5600    │
└─────────────┘       └─────────────┘       └─────────────┘
  Splits: at word boundaries within chunk_size window

semantic:
┌───── chunk 0 ─────┐┌───── chunk 1 ─────┐┌── chunk 2 ──┐
│ Sentences 1-4     ││ Sentences 5-9     ││ Sent. 10-12 │
│ (similar topics)  ││ (similar topics)  ││             │
└───────────────────┘└───────────────────┘└─────────────┘
  Splits: where embedding cosine similarity drops below threshold

topic_shift:
┌──── chunk 0 ────┐┌────── chunk 1 ──────┐┌── chunk 2 ──┐
│ Sentences 1-5   ││ Sentences 6-10      ││ Sent. 11-12 │
│ (vocabulary A)  ││ (vocabulary B)      ││ (vocab. C)  │
└─────────────────┘└─────────────────────┘└─────────────┘
  Splits: where Jaccard word overlap between windows drops below threshold

Recursive (default)

Splits text hierarchically: paragraphs first, then sentences within oversized paragraphs, then word boundaries as a last resort. This preserves the natural structure of documents and minimizes mid-sentence breaks.

Paragraph detection: double newlines (\n\n)
Sentence detection: locale-aware BreakIterator (respects abbreviations like German z.B.)
Overlap: character-level tail from previous chunk
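
The hierarchy above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the processor's implementation: the helper names (`pack`, `split_recursive`, `recursive_chunks`) are hypothetical, sentence detection uses a naive regular expression rather than a locale-aware BreakIterator, and the sketch reserves room for the overlap so that each chunk stays within chunk_size.

```python
import re

def pack(units, budget, sep):
    """Greedily join units with sep, keeping each result within budget characters."""
    out, cur = [], ""
    for unit in units:
        candidate = unit if not cur else cur + sep + unit
        if not cur or len(candidate) <= budget:
            cur = candidate  # a single unit longer than budget is kept whole
        else:
            out.append(cur)
            cur = unit
    if cur:
        out.append(cur)
    return out

def split_recursive(text, budget):
    """Paragraphs first, then sentences, then words as a last resort."""
    pieces = []
    for para in text.split("\n\n"):          # paragraph detection: double newline
        para = para.strip()
        if not para:
            continue
        if len(para) <= budget:
            pieces.append(para)
            continue
        # Oversized paragraph: naive sentence split (assumption, not locale-aware).
        sentences = re.split(r"(?<=[.!?])\s+", para)
        for group in pack(sentences, budget, " "):
            if len(group) <= budget:
                pieces.append(group)
            else:
                # Oversized sentence: fall back to word boundaries.
                pieces.extend(pack(group.split(), budget, " "))
    return pieces

def recursive_chunks(text, chunk_size=2000, chunk_overlap=200):
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and less than chunk_size")
    budget = chunk_size - chunk_overlap      # leave room for the overlap tail
    bodies = pack(split_recursive(text, budget), budget, "\n\n")
    chunks = []
    for i, body in enumerate(bodies):
        # Overlap: character-level tail of the previous chunk body.
        tail = bodies[i - 1][-chunk_overlap:] if i > 0 and chunk_overlap > 0 else ""
        chunks.append(tail + body)
    return chunks

sample = (("Alpha sentence one. Alpha sentence two. " * 5).strip()
          + "\n\nBeta paragraph.\n\nGamma paragraph.")
for chunk in recursive_chunks(sample, chunk_size=80, chunk_overlap=10):
    print(len(chunk), repr(chunk[:30]))
```

Short paragraphs are merged into one chunk while they fit, and only the oversized first paragraph is broken at sentence boundaries.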

Fixed

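Character-based splitting with word-boundary awareness. Deterministic and near-uniform: each chunk is at most chunk_size characters and ends at the last space within the chunk window, so most chunks come out slightly shorter than chunk_size. Because only word boundaries are considered, a chunk can end mid-sentence.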
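
A minimal sketch of this windowing, assuming the next window starts chunk_overlap characters before the end of the previous chunk. The function name `fixed_chunks` is hypothetical and the processor's exact boundary rules may differ.

```python
def fixed_chunks(text, chunk_size=2000, chunk_overlap=200):
    """Character windows that back up to the last space inside each window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and less than chunk_size")
    chunks, start, n = [], 0, len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            # Prefer breaking at the last space in the window, but only if the
            # next window would still move forward.
            space = text.rfind(" ", start + 1, end)
            if space > start + chunk_overlap:
                end = space
        chunks.append(text[start:end])
        if end == n:
            break
        start = end - chunk_overlap  # step back to create the overlap
    return chunks

# With no spaces to break on, windows advance by chunk_size - chunk_overlap.
print([len(c) for c in fixed_chunks("x" * 6000)])
```

For a 6000-character input with no spaces and the default settings, the windows start at offsets 0, 1800, 3600, and 5400.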

Semantic

Groups sentences by embedding similarity. Sentences are embedded, then greedily merged while the cosine similarity between the chunk centroid and the next sentence stays above similarity_threshold. Requires an embedding model at chunking time.

Computes L2-normalized centroid of sentence embeddings in each chunk
Falls back to recursive if the embedding call fails (configurable)
Produces the highest-quality chunks for vector search but costs one embedding call per document
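
The greedy merge can be illustrated with precomputed vectors. The sketch below is an assumption-laden illustration: the function names are hypothetical, the two-dimensional vectors stand in for real sentence embeddings, and it ignores chunk_size limits and overlap_sentences.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def centroid(vectors):
    """L2-normalized mean of the vectors in the current chunk."""
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    return normalize(mean)

def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def semantic_groups(sentences, embeddings, similarity_threshold=0.8):
    """Merge sentences while the next one stays close to the chunk centroid."""
    groups, current, current_vecs = [], [], []
    for sentence, vec in zip(sentences, embeddings):
        if current and cosine(centroid(current_vecs), vec) < similarity_threshold:
            groups.append(current)            # similarity dropped: close the chunk
            current, current_vecs = [], []
        current.append(sentence)
        current_vecs.append(vec)
    if current:
        groups.append(current)
    return groups

# Toy vectors: the first two point one way, the last two another.
sentences = ["a1", "a2", "b1", "b2"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_groups(sentences, embeddings))
```

In a real pipeline the vectors would come from the configured embedding model, for example by embedding all sentences of a document in one batched call.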

Topic shift

Detects topic boundaries using vocabulary overlap (Jaccard similarity) between sliding windows of sentences. No embedding model required -- purely lexical.

Slides a window of window_size sentences across the text
Marks a boundary where Jaccard similarity between left and right windows drops below similarity_threshold
Skips ahead by window_size after each boundary to suppress over-fragmentation
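
The three steps above can be sketched as follows. The tokenization (lowercased `\w+` matches) and the function names are assumptions made for illustration; the processor's own tokenizer and tie-breaking rules are not documented here.

```python
import re

def vocab(sentences):
    """Set of lowercased word tokens in a window of sentences."""
    return {w for s in sentences for w in re.findall(r"\w+", s.lower())}

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| (two empty sets count as identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def topic_boundaries(sentences, window_size=3, similarity_threshold=0.3):
    """Return sentence indexes at which a new topic is assumed to start."""
    boundaries = []
    i = window_size
    while i + window_size <= len(sentences):
        left = vocab(sentences[i - window_size:i])
        right = vocab(sentences[i:i + window_size])
        if jaccard(left, right) < similarity_threshold:
            boundaries.append(i)
            i += window_size   # skip ahead to suppress over-fragmentation
        else:
            i += 1
    return boundaries

sentences = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the cat sat on the bed",
    "stock prices rose sharply today",
    "stock prices fell sharply today",
    "stock prices rose sharply again",
]
print(topic_boundaries(sentences))
```

Because the comparison is purely lexical, the result depends heavily on window_size and on how much vocabulary neighboring sentences share.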

Configuration parameters

Common parameters

Parameter	Data type	Required/Optional	Description
`field`	String	Optional	Source field containing content blocks or direct text. Default is `extracted.blocks`.
`target_field`	String	Optional	Field where chunk array is written. Default is `chunks`.
`block_types`	Array	Optional	Content block types to chunk. Non-matching blocks pass through unchanged. Default is `["text", "audio"]`.
`algorithm`	String	Optional	Chunking algorithm: `recursive`, `fixed`, `semantic`, or `topic_shift`. Default is `recursive`.
`chunk_size`	Integer	Optional	Target chunk size in characters. Default is `2000`.
`chunk_overlap`	Integer	Optional	Character-level overlap between consecutive chunks. Must be less than `chunk_size`. Default is `200`.
`min_chunk_size`	Integer	Optional	Minimum chunk size in characters. A trailing chunk smaller than this value is merged into the previous chunk. Default is `100`.
`sentence_locale`	String	Optional	IETF BCP 47 language tag for sentence boundary detection (e.g., `de-DE`, `fr-FR`, `ja-JP`). Default is `ROOT` (language-neutral).
`description`	String	Optional	A brief description of the processor.
`tag`	String	Optional	An identifier tag for the processor.
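
The size constraints in this table can be illustrated with a short sketch. The function names are hypothetical, and the way the processor joins a merged trailing chunk (here, plain concatenation) is an assumption.

```python
def validate_sizes(chunk_size=2000, chunk_overlap=200, min_chunk_size=100):
    """Check the documented constraint that chunk_overlap is less than chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and less than chunk_size")
    if min_chunk_size < 0:
        raise ValueError("min_chunk_size must be non-negative")

def merge_small_tail(chunks, min_chunk_size=100):
    """Merge a trailing chunk shorter than min_chunk_size into the previous chunk."""
    if len(chunks) >= 2 and len(chunks[-1]) < min_chunk_size:
        return chunks[:-2] + [chunks[-2] + chunks[-1]]  # assumption: plain concatenation
    return list(chunks)

validate_sizes()  # the defaults pass
print([len(c) for c in merge_small_tail(["a" * 300, "b" * 50])])
```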

Semantic algorithm parameters

These parameters apply only when algorithm is semantic. Of these, model_id and provider are required; the rest are optional.

Parameter	Data type	Required/Optional	Description
`model_id`	String	Required	Embedding model identifier (e.g., `amazon.titan-embed-text-v2:0`).
`provider`	String	Required	Embedding provider: `bedrock`, `openai`, or `http`.
`dimensions`	Integer	Optional	Embedding dimensions. Default is `1536`.
`provider_config`	Object	Optional	Provider-specific configuration (region, API key setting, endpoint).
`similarity_threshold`	Float	Optional	Minimum cosine similarity to keep merging sentences. Range `[0.0, 1.0]`. Default is `0.8`.
`overlap_sentences`	Integer	Optional	Number of sentences carried forward from the previous chunk. Default is `0`.
`on_failure_action`	String	Optional	`fallback` (silently use recursive) or `fail` (reject document). Default is `fallback`.

Topic shift algorithm parameters

These parameters apply only when algorithm is topic_shift.

Parameter	Data type	Required/Optional	Description
`similarity_threshold`	Float	Optional	Minimum Jaccard similarity for adjacent sentence windows to be treated as the same topic. A boundary is marked where similarity drops below this value. Range `[0.0, 1.0]`. Default is `0.3`.
`window_size`	Integer	Optional	Sliding window size in sentences. Must be at least `1`. Default is `3`.
`overlap_sentences`	Integer	Optional	Number of sentences carried forward from the previous chunk. Default is `0`.

Output structure

Each chunk in the output array contains the following fields:

{
  "chunks": [
    {
      "text": "The chunked text content...",
      "chunk_index": 0,
      "source_type": "text"
    },
    {
      "text": "Next chunk with overlap...",
      "chunk_index": 1,
      "source_type": "text"
    }
  ]
}

Field	Description
`text`	The chunk text content.
`chunk_index`	Sequential zero-based index across all chunks in the document.
`source_type`	The original block type (e.g., `text`, `audio`). Present only when input is a content block array.
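
The following sketch shows how a flat chunk array with a document-wide chunk_index could be assembled from several content blocks. The block shape (`type` and `text` keys) and the function name `build_output` are assumptions for illustration; the actual ContentBlock schema is defined by the content_extract processor. Blocks whose type is not in block_types are simply skipped here.

```python
def build_output(blocks, split, block_types=("text", "audio")):
    """Flatten chunked blocks into one array with a sequential zero-based index."""
    chunks = []
    for block in blocks:
        if block.get("type") not in block_types:
            continue  # non-matching blocks are not chunked in this sketch
        for piece in split(block["text"]):
            chunks.append({
                "text": piece,
                "chunk_index": len(chunks),    # continues across blocks
                "source_type": block["type"],  # original block type
            })
    return {"chunks": chunks}

# Hypothetical blocks and a trivial splitter that cuts every 10 characters.
blocks = [
    {"type": "text", "text": "0123456789abcdefghij"},
    {"type": "image", "text": "ignored"},
    {"type": "audio", "text": "transcript"},
]
split_every_10 = lambda s: [s[i:i + 10] for i in range(0, len(s), 10)]
print(build_output(blocks, split_every_10))
```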

Using the processor

Example 1: Recursive chunking (default)

PUT _ingest/pipeline/chunk-recursive
{
  "processors": [
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "chunk_size": 2000,
        "chunk_overlap": 200,
        "min_chunk_size": 100
      }
    }
  ]
}

Example 2: Semantic chunking with Bedrock

PUT _ingest/pipeline/chunk-semantic
{
  "processors": [
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "semantic",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "similarity_threshold": 0.75,
        "overlap_sentences": 1,
        "on_failure_action": "fallback",
        "provider_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}

Example 3: Topic shift chunking for multilingual content

PUT _ingest/pipeline/chunk-topic
{
  "processors": [
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "topic_shift",
        "window_size": 5,
        "similarity_threshold": 0.25,
        "sentence_locale": "de-DE",
        "chunk_size": 1500,
        "overlap_sentences": 2
      }
    }
  ]
}

Example 4: Full pipeline (extract + chunk + embed)

PUT _ingest/pipeline/doc-pipeline
{
  "processors": [
    {
      "content_extract": {
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "reference_config": { "region": "us-east-2" }
      }
    },
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "recursive",
        "chunk_size": 2000,
        "chunk_overlap": 200
      }
    },
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": { "region": "us-east-2" }
      }
    }
  ]
}

Algorithm selection guide

                    ┌─────────────────────┐
                    │  What kind of text?  │
                    └─────────┬───────────┘
                              │
                 ┌────────────┼────────────┐
                 ▼            ▼            ▼
          Structured    Mixed/unknown   Uniform
          documents     content         data
          (reports,     (web scrapes,   (logs, CSVs,
           papers)       forums)        structured)
                 │            │            │
                 ▼            ▼            ▼
            recursive    topic_shift     fixed
            (default)
                              │
                    Need highest quality
                    vector chunks?
                              │
                         ┌────┴────┐
                         ▼         ▼
                        Yes        No
                         │         │
                    semantic   topic_shift
                    (costs 1   (free, no
                     embed     model needed)
                     call)

Algorithm	Embedding cost	Best for	Trade-off
`recursive`	None	Structured documents (reports, papers, manuals)	Good general-purpose default
`fixed`	None	Uniform content, deterministic sizing	May break mid-sentence
`semantic`	1 embed call per document	Highest-quality vector retrieval	Slower, costs per document
`topic_shift`	None	Topic-diverse content, multilingual	Heuristic -- less precise than semantic
