Version: 0.11.1

Content extraction processor

Introduced 0.11.0

The content_extract processor extracts structured content blocks from documents in various formats, including PDF, DOCX, HTML, plain text, and images. It uses Apache Tika under the hood and supports multiple input modes: inline text, S3/HTTPS references, multipart streams, and base64-encoded attachments.

The processor outputs a normalized array of ContentBlock objects containing extracted text and detected images, along with document-level metadata (language, page count, MIME type). Downstream processors such as chunk, ocr, and embed consume these blocks.

Syntax

The following is the syntax for the content_extract processor:

{
  "content_extract": {
    "field": "content",
    "target_field": "extracted",
    "input_mode": "reference",
    "source_uri_field": "source_uri"
  }
}

Input modes

The processor supports the following input modes:

Mode	Description	Max size (default)	Use case
`inline`	Text provided directly in the document field	1 MB	Small text payloads via API
`reference`	URI (S3 or HTTPS) pointing to the source content -- the cluster streams the document directly	250 MB	PDFs, DOCX, images in S3 or on the web
`stream`	Multipart binary upload	100 MB	Direct upload from client applications
`attachment`	Base64-encoded content in the document field (deprecated -- use `reference` mode instead)	10 MB	Legacy compatibility

Tip

Use reference mode for most use cases. It lets the cluster stream content directly from S3 or HTTPS without uploading data through the API. This is the most efficient approach for large documents and supports both private (presigned) and public (anonymous) S3 buckets.
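
As a rough client-side illustration of the input modes table, the following Python sketch checks a payload size against the default limit for each mode before a document is sent. The helper name `fits_default_limit` is hypothetical and not part of the processor, and the sketch assumes the limits are inclusive, which this page does not state.

```python
# Default size limits per input mode, in bytes, copied from the input modes table.
DEFAULT_LIMITS = {
    "inline": 1 * 1024 * 1024,        # 1 MB
    "reference": 250 * 1024 * 1024,   # 250 MB
    "stream": 100 * 1024 * 1024,      # 100 MB
    "attachment": 10 * 1024 * 1024,   # 10 MB (deprecated mode)
}


def fits_default_limit(input_mode: str, size_bytes: int) -> bool:
    """Return True if size_bytes is within the default limit for input_mode.

    Hypothetical pre-flight check; the processor enforces its own limits,
    and treating the limit as inclusive is an assumption.
    """
    if input_mode not in DEFAULT_LIMITS:
        raise ValueError(f"unknown input_mode: {input_mode!r}")
    return size_bytes <= DEFAULT_LIMITS[input_mode]
```

A check like this can help route oversized inline payloads to `reference` mode instead of letting the request fail at ingest time.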

Configuration parameters

The following table lists the required and optional parameters for the content_extract processor.

Parameter	Data type	Required/Optional	Description
`field`	String	Optional	The source field that contains the content. In `reference` mode, it serves only as a placeholder. Default is `content`.
`target_field`	String	Optional	The field where extracted content blocks are stored. Default is `extracted`.
`input_mode`	String	Optional	The input mode. Valid values are `inline`, `reference`, `stream`, `attachment`. Default is `inline`.
`source_uri_field`	String	Optional	The document field containing the S3 or HTTPS URI. Required when `input_mode` is `reference`.
`region_field`	String	Optional	A document field that overrides the S3 region per document. Enables multi-region ingestion in a single pipeline.
`mime_type_field`	String	Optional	The document field containing the MIME type. If not specified, the MIME type is auto-detected.
`mime_type`	String	Optional	A static MIME type override for all documents processed by this pipeline.
`preserve_image_data`	Boolean	Optional	Whether to keep base64-encoded image data in the extracted content blocks. Set to `true` if a downstream ocr processor needs to read embedded images. Default is `false`.
`max_inline_bytes`	Integer	Optional	Maximum size for inline content. Default is `1048576` (1 MB).
`max_reference_bytes`	Integer	Optional	Maximum size for reference content. Default is `262144000` (250 MB).
`max_stream_bytes`	Integer	Optional	Maximum size for stream content. Default is `104857600` (100 MB).
`reference_config`	Object	Optional	Configuration for reference resolution. See Reference configuration.
`description`	String	Optional	A brief description of the processor.
`tag`	String	Optional	An identifier tag for the processor.

Reference configuration

When using reference input mode, the reference_config object controls how URIs are resolved.

Parameter	Data type	Required/Optional	Description
`region`	String	Optional	The AWS region for S3 URIs.
`anonymous`	String	Optional	Set to `"true"` for public S3 buckets that don't require credentials (e.g., NASA open data). Default is `"false"`.
`s3_client`	String	Optional	The named S3 client configuration to use (configured via cluster settings). If not specified, the default S3 client is used.
`presign_duration_minutes`	String	Optional	The validity period, in minutes, for presigned S3 URLs. Default is `"60"`.
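
Note that `anonymous` and `presign_duration_minutes` are typed as strings in the table above. The Python sketch below shows one way a client could build a `reference_config` object while keeping those values as strings. The function `build_reference_config` is a hypothetical helper, not part of any client library.

```python
def build_reference_config(region=None, anonymous=False,
                           s3_client=None, presign_duration_minutes=60):
    """Build a reference_config dict with string-typed values.

    Hypothetical helper: the table above lists `anonymous` and
    `presign_duration_minutes` as String, so both are serialized as strings.
    """
    config = {
        "anonymous": "true" if anonymous else "false",
        "presign_duration_minutes": str(presign_duration_minutes),
    }
    # Optional settings are only included when the caller supplies them.
    if region is not None:
        config["region"] = region
    if s3_client is not None:
        config["s3_client"] = s3_client
    return config


public_bucket_config = build_reference_config(region="us-west-2", anonymous=True)
```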

Output structure

The processor writes the following structure to the target_field:

{
  "extracted": {
    "blocks": [
      {
        "type": "text",
        "text": "Extracted text content from the document..."
      },
      {
        "type": "image",
        "image_data": "base64-encoded image bytes (if preserve_image_data is true)",
        "image_mime_type": "image/png",
        "metadata": {
          "page": 3
        }
      }
    ],
    "metadata": {
      "content_type": "application/pdf",
      "language": "en",
      "page_count": 12
    }
  }
}
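
To make the shape of this structure concrete, here is a minimal Python sketch that walks an `extracted` object, joins the text blocks, and lists the pages of detected images. The function `summarize_extracted` and the sample data are illustrative assumptions; only the field names come from the structure shown above.

```python
def summarize_extracted(extracted: dict) -> dict:
    """Collect text and image information from an `extracted` object."""
    blocks = extracted.get("blocks", [])
    # Text blocks carry their content in the "text" key.
    text_parts = [b.get("text", "") for b in blocks if b.get("type") == "text"]
    # Image blocks may carry a page number under "metadata".
    image_pages = [
        b.get("metadata", {}).get("page")
        for b in blocks
        if b.get("type") == "image"
    ]
    metadata = extracted.get("metadata", {})
    return {
        "text": "\n".join(text_parts),
        "image_count": len(image_pages),
        "image_pages": image_pages,
        "content_type": metadata.get("content_type"),
        "page_count": metadata.get("page_count"),
    }


# Sample data shaped like the structure above (values are made up).
sample = {
    "blocks": [
        {"type": "text", "text": "First paragraph."},
        {"type": "image", "image_mime_type": "image/png", "metadata": {"page": 3}},
        {"type": "text", "text": "Second paragraph."},
    ],
    "metadata": {"content_type": "application/pdf", "language": "en", "page_count": 12},
}
summary = summarize_extracted(sample)
```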

Supported formats

The processor supports the following document formats:

Format	MIME type	Extraction
PDF	`application/pdf`	Text extraction via Tika. Embedded images detected as `image` blocks for downstream OCR.
Word (DOCX)	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	Full text extraction including tables.
Excel (XLSX)	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`	Cell values extracted as text.
PowerPoint (PPTX)	`application/vnd.openxmlformats-officedocument.presentationml.presentation`	Slide text extraction.
HTML	`text/html`	Text extracted from rendered HTML.
Plain text	`text/plain`	Passed through directly.
Markdown	`text/markdown`	Passed through directly.
JSON	`application/json`	Passed through directly.
Images	`image/*`	EXIF metadata extracted (GPS, camera info, dimensions). Optionally preserves image data for embedding.
GeoTIFF	`image/tiff`	CRS, bands, dimensions, and geo-transform metadata extracted. Use with image_tiling for tile-based processing.

Using the processor

Example 1: Extract text from an S3 document

Step 1: Create a pipeline

PUT _ingest/pipeline/doc-extract
{
  "description": "Extract content from S3 documents",
  "processors": [
    {
      "content_extract": {
        "field": "content",
        "target_field": "extracted",
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "preserve_image_data": true,
        "reference_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}

Step 2: Ingest a document

Set the source_uri field to an S3 object URI. The cluster streams the object directly:

PUT /my-documents/_doc/1?pipeline=doc-extract
{
  "source_uri": "s3://my-bucket/reports/annual-report.pdf",
  "title": "2024 Annual Report"
}

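Example 2: Extract from a public HTTPS URL

Reference mode also accepts HTTPS URIs, so this request reuses the doc-extract pipeline from Example 1: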

PUT /my-documents/_doc/2?pipeline=doc-extract
{
  "source_uri": "https://example.com/research-paper.pdf",
  "title": "Research Paper"
}

Example 3: Extract from a public S3 bucket (anonymous access)

PUT _ingest/pipeline/nasa-extract
{
  "description": "Extract from NASA public S3 bucket",
  "processors": [
    {
      "content_extract": {
        "field": "content",
        "target_field": "extracted",
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "reference_config": {
          "region": "us-west-2",
          "anonymous": "true"
        }
      }
    }
  ]
}

Example 4: Inline text extraction

PUT /my-documents/_doc/3?pipeline=inline-extract
{
  "content": "This is inline text content that will be extracted and split into content blocks."
}
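
The request above refers to a pipeline named inline-extract, which is not defined elsewhere on this page. The Python sketch below builds and prints a plausible definition for it, assuming the processor only needs `inline` input mode and the default field names. Treat the pipeline body as an assumption rather than the definition the example was written against.

```python
import json

# Assumed definition of the inline-extract pipeline used in Example 4.
# Only the pipeline name comes from the example; the body is a guess
# based on the parameter defaults documented above.
inline_pipeline = {
    "description": "Extract content from inline text",
    "processors": [
        {
            "content_extract": {
                "field": "content",
                "target_field": "extracted",
                "input_mode": "inline",
            }
        }
    ],
}

# Print the request in the same console form used by the other examples.
print("PUT _ingest/pipeline/inline-extract")
print(json.dumps(inline_pipeline, indent=2))
```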

Example 5: Multi-region ingestion with per-document region override

If your documents span multiple S3 regions, use region_field to override the region per document:

PUT _ingest/pipeline/multi-region
{
  "processors": [
    {
      "content_extract": {
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "region_field": "doc_region",
        "reference_config": {
          "region": "us-east-1"
        }
      }
    }
  ]
}

Then each document can specify its own region:

PUT /docs/_doc/1?pipeline=multi-region
{
  "source_uri": "s3://east-bucket/report.pdf",
  "doc_region": "us-east-1"
}

PUT /docs/_doc/2?pipeline=multi-region
{
  "source_uri": "s3://west-bucket/report.pdf",
  "doc_region": "us-west-2"
}
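
One plausible reading of the override logic is sketched below in Python: the value in the document's region field wins, and the pipeline-level `reference_config.region` is used otherwise. This page does not say what happens when a document omits the region field, so the fallback is an assumption, and `resolve_region` is a hypothetical function rather than the processor's code.

```python
def resolve_region(document: dict, processor_config: dict):
    """Pick the S3 region for one document.

    Assumed precedence: per-document value from `region_field`,
    then `reference_config.region`, then None.
    """
    region_field = processor_config.get("region_field")
    if region_field and document.get(region_field):
        return document[region_field]
    return processor_config.get("reference_config", {}).get("region")


# Processor settings from the multi-region pipeline above.
multi_region_config = {
    "input_mode": "reference",
    "source_uri_field": "source_uri",
    "region_field": "doc_region",
    "reference_config": {"region": "us-east-1"},
}

west_doc = {"source_uri": "s3://west-bucket/report.pdf", "doc_region": "us-west-2"}
no_region_doc = {"source_uri": "s3://east-bucket/report.pdf"}

west_region = resolve_region(west_doc, multi_region_config)
fallback_region = resolve_region(no_region_doc, multi_region_config)
```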

Combining with other processors

The content_extract processor is typically the first step in a pipeline chain. Common patterns:

Document pipeline (extract + OCR + chunk + embed):

PUT _ingest/pipeline/full-document-pipeline
{
  "processors": [
    { "content_extract": { "input_mode": "reference", "source_uri_field": "source_uri", "preserve_image_data": true } },
    { "ocr": { "field": "extracted.blocks", "model_id": "anthropic.claude-3-haiku-20240307-v1:0", "provider": "bedrock", "block_types": ["image"] } },
    { "chunk": { "field": "extracted.blocks", "target_field": "chunks", "algorithm": "recursive", "chunk_size": 2000 } },
    { "embed": { "field": "chunks", "model_id": "amazon.titan-embed-text-v2:0", "provider": "bedrock", "dimensions": 1024 } }
  ]
}

GeoTIFF pipeline (extract + tile + embed):

PUT _ingest/pipeline/geotiff-pipeline
{
  "processors": [
    { "content_extract": { "input_mode": "reference", "source_uri_field": "source_uri" } },
    { "image_tiling": { "field": "extracted.blocks", "target_field": "chunks", "profile": "geo_search" } },
    { "embed": { "field": "chunks", "model_id": "amazon.titan-embed-image-v1", "provider": "bedrock", "dimensions": 1024 } }
  ]
}

Custom extractors

Developers can add custom content extractors by implementing the ContentExtractor SPI interface and registering the implementation via META-INF/services. Custom extractors are selected by MIME type and priority, allowing you to override the default extraction behavior for specific formats. See Extending ingest-content for details.
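
The selection rule can be modeled conceptually as follows. This Python sketch is not the ContentExtractor interface, which is a Java SPI. It only illustrates choosing among registered extractors by MIME type and priority, and it assumes that a higher priority number wins, which this page does not specify. All names in it are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ExtractorEntry:
    """Hypothetical registry entry: one extractor for one MIME type."""
    name: str
    mime_type: str
    priority: int


def select_extractor(entries: Sequence[ExtractorEntry],
                     mime_type: str) -> Optional[ExtractorEntry]:
    """Return the matching entry with the highest priority, or None.

    Assumption: a larger priority value takes precedence.
    """
    candidates = [e for e in entries if e.mime_type == mime_type]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e.priority)


registry = [
    ExtractorEntry("default-pdf", "application/pdf", priority=0),
    ExtractorEntry("custom-pdf", "application/pdf", priority=10),
    ExtractorEntry("default-html", "text/html", priority=0),
]
chosen = select_extractor(registry, "application/pdf")
```

Under this model, registering a custom extractor with a higher priority for `application/pdf` replaces the default extractor for that format only and leaves other formats untouched.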
