Version: 0.11.0

Content extraction processor

Introduced 0.11.0

The content_extract processor extracts structured content blocks from documents in various formats, including PDF, DOCX, HTML, plain text, and images. It uses Apache Tika under the hood and supports multiple input modes: inline text, S3/HTTPS references, multipart streams, and base64-encoded attachments.

The processor outputs a normalized array of ContentBlock objects containing extracted text and detected images, along with document-level metadata (language, page count, MIME type). Downstream processors such as chunk, ocr, and embed consume these blocks.

Syntax

The following is the syntax for the content_extract processor:

{
  "content_extract": {
    "field": "content",
    "target_field": "extracted",
    "input_mode": "reference",
    "source_uri_field": "source_uri"
  }
}

Input modes

The processor supports the following input modes:

| Mode | Description | Max size (default) | Use case |
| --- | --- | --- | --- |
| inline | Text provided directly in the document field | 1 MB | Small text payloads via API |
| reference | URI (S3 or HTTPS) pointing to the source content; the cluster streams the document directly | 250 MB | PDFs, DOCX, images in S3 or on the web |
| stream | Multipart binary upload | 100 MB | Direct upload from client applications |
| attachment | Base64-encoded content in the document field (deprecated; use reference mode instead) | 10 MB | Legacy compatibility |
Tip: Use reference mode for most use cases. It lets the cluster stream content directly from S3 or HTTPS without uploading data through the API. This is the most efficient approach for large documents and supports both private (presigned) and public (anonymous) S3 buckets.
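
The mode choice described above can be sketched as a small client-side helper. This is only an illustration: `choose_input_mode` and its parameters are hypothetical names that are not part of the processor, and the limits are the documented defaults, which a pipeline may override.

```python
from typing import Optional

# Default size limits from the input modes table, in bytes.
DEFAULT_LIMITS = {
    "inline": 1_048_576,       # 1 MB
    "reference": 262_144_000,  # 250 MB
    "stream": 104_857_600,     # 100 MB
}


def choose_input_mode(size_bytes: int, source_uri: Optional[str] = None,
                      is_text: bool = False) -> str:
    """Pick an input mode for a payload (hypothetical helper, not a cluster API)."""
    if source_uri is not None:
        # Reference mode accepts S3 or HTTPS URIs only.
        if not source_uri.startswith(("s3://", "https://")):
            raise ValueError("reference mode expects an S3 or HTTPS URI")
        mode = "reference"
    elif is_text and size_bytes <= DEFAULT_LIMITS["inline"]:
        mode = "inline"
    else:
        # Binary payloads, and text too large for inline, go through stream.
        mode = "stream"
    if size_bytes > DEFAULT_LIMITS[mode]:
        raise ValueError(f"{size_bytes} bytes exceeds the default {mode} limit")
    return mode


print(choose_input_mode(5_000_000, "s3://my-bucket/reports/annual-report.pdf"))
```

The deprecated attachment mode is deliberately never returned, in line with the recommendation to use reference mode instead.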

Configuration parameters

The following table lists the required and optional parameters for the content_extract processor.

| Parameter | Data type | Required/Optional | Description |
| --- | --- | --- | --- |
| field | String | Optional | The source field that contains the content. In reference mode, this field serves only as a placeholder. Default is content. |
| target_field | String | Optional | The field where extracted content blocks are stored. Default is extracted. |
| input_mode | String | Optional | The input mode. Valid values are inline, reference, stream, and attachment. Default is inline. |
| source_uri_field | String | Optional | The document field containing the S3 or HTTPS URI. Required when input_mode is reference. |
| region_field | String | Optional | A document field that overrides the S3 region per document. Enables multi-region ingestion in a single pipeline. |
| mime_type_field | String | Optional | The document field containing the MIME type. If not specified, the MIME type is auto-detected. |
| mime_type | String | Optional | A static MIME type override for all documents processed by this pipeline. |
| preserve_image_data | Boolean | Optional | Whether to keep base64-encoded image data in the extracted content blocks. Required if a downstream ocr processor needs to read embedded images. Default is false. |
| max_inline_bytes | Integer | Optional | Maximum size, in bytes, for inline content. Default is 1048576 (1 MB). |
| max_reference_bytes | Integer | Optional | Maximum size, in bytes, for reference content. Default is 262144000 (250 MB). |
| max_stream_bytes | Integer | Optional | Maximum size, in bytes, for stream content. Default is 104857600 (100 MB). |
| reference_config | Object | Optional | Configuration for reference resolution. See Reference configuration. |
| description | String | Optional | A brief description of the processor. |
| tag | String | Optional | An identifier tag for the processor. |
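
To make the one conditional requirement concrete (source_uri_field is required in reference mode), the following sketch builds a processor definition and checks it before it is sent. `build_content_extract` is a hypothetical helper written for this illustration; only the parameter names and valid modes come from the table.

```python
import json

# Valid values for input_mode, as listed in the parameter table.
VALID_INPUT_MODES = {"inline", "reference", "stream", "attachment"}


def build_content_extract(input_mode="inline", source_uri_field=None, **options):
    """Return a content_extract processor definition as a dict (hypothetical helper)."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"unknown input_mode: {input_mode!r}")
    if input_mode == "reference" and not source_uri_field:
        raise ValueError("source_uri_field is required when input_mode is reference")
    config = {"input_mode": input_mode}
    if source_uri_field:
        config["source_uri_field"] = source_uri_field
    # Remaining options (target_field, preserve_image_data, ...) pass through as-is.
    config.update(options)
    return {"content_extract": config}


processor = build_content_extract(
    "reference", source_uri_field="source_uri", preserve_image_data=True
)
print(json.dumps(processor, indent=2))
```

Validating on the client is a design choice for faster feedback; the cluster presumably performs its own validation when the pipeline is created.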

Reference configuration

When using reference input mode, the reference_config object controls how URIs are resolved.

| Parameter | Data type | Required/Optional | Description |
| --- | --- | --- | --- |
| region | String | Optional | The AWS region for S3 URIs. |
| anonymous | String | Optional | Set to "true" for public S3 buckets that don't require credentials (for example, NASA open data). Default is "false". |
| s3_client | String | Optional | The named S3 client configuration to use (configured via cluster settings). If omitted, the default S3 client is used. |
| presign_duration_minutes | String | Optional | The validity period, in minutes, of presigned S3 URLs. Default is "60". |
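
Note that anonymous and presign_duration_minutes are typed as strings, not as a Boolean and an integer. A minimal sketch of a helper that serializes native Python values accordingly is shown below; `build_reference_config` is a hypothetical name, and whether the cluster also accepts native JSON booleans or numbers here is not stated in this page.

```python
def build_reference_config(region=None, anonymous=False, s3_client=None,
                           presign_duration_minutes=60):
    """Build a reference_config object with string-typed values (hypothetical helper)."""
    config = {
        # Both values are documented as String, so serialize them explicitly.
        "anonymous": "true" if anonymous else "false",
        "presign_duration_minutes": str(int(presign_duration_minutes)),
    }
    # Optional keys are left out entirely when not provided.
    if region is not None:
        config["region"] = region
    if s3_client is not None:
        config["s3_client"] = s3_client
    return config


print(build_reference_config(region="us-west-2", anonymous=True))
```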

Output structure

The processor writes the following structure to the target_field:

{
  "extracted": {
    "blocks": [
      {
        "type": "text",
        "text": "Extracted text content from the document..."
      },
      {
        "type": "image",
        "image_data": "base64-encoded image bytes (if preserve_image_data is true)",
        "image_mime_type": "image/png",
        "metadata": {
          "page": 3
        }
      }
    ],
    "metadata": {
      "content_type": "application/pdf",
      "language": "en",
      "page_count": 12
    }
  }
}
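
As an illustration of how a client might read this structure back from an indexed document, the sketch below joins the text blocks and records the pages on which images were detected. `summarize_extracted` is a hypothetical helper; it relies only on the field names shown in the structure above and tolerates missing keys.

```python
def summarize_extracted(extracted: dict) -> dict:
    """Collect text and image information from the target_field value."""
    blocks = extracted.get("blocks", [])
    # Join the text of all text blocks, skipping empty ones.
    text = "\n".join(
        b["text"] for b in blocks if b.get("type") == "text" and b.get("text")
    )
    # Record the page of each image block (None when no page metadata exists).
    image_pages = [
        b.get("metadata", {}).get("page") for b in blocks if b.get("type") == "image"
    ]
    return {
        "text": text,
        "image_count": len(image_pages),
        "image_pages": image_pages,
        "page_count": extracted.get("metadata", {}).get("page_count"),
    }


sample = {
    "blocks": [
        {"type": "text", "text": "Extracted text content from the document..."},
        {"type": "image", "image_mime_type": "image/png", "metadata": {"page": 3}},
    ],
    "metadata": {"content_type": "application/pdf", "language": "en", "page_count": 12},
}
print(summarize_extracted(sample))
```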

Supported formats

The processor supports the following document formats:

| Format | MIME type | Extraction |
| --- | --- | --- |
| PDF | application/pdf | Text extraction via Tika. Embedded images detected as image blocks for downstream OCR. |
| Word (DOCX) | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Full text extraction including tables. |
| Excel (XLSX) | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Cell values extracted as text. |
| PowerPoint (PPTX) | application/vnd.openxmlformats-officedocument.presentationml.presentation | Slide text extraction. |
| HTML | text/html | Text extracted from rendered HTML. |
| Plain text | text/plain | Passed through directly. |
| Markdown | text/markdown | Passed through directly. |
| JSON | application/json | Passed through directly. |
| Images | image/* | EXIF metadata extracted (GPS, camera info, dimensions). Optionally preserves image data for embedding. |
| GeoTIFF | image/tiff | CRS, bands, dimensions, and geo-transform metadata extracted. Use with image_tiling for tile-based processing. |
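
If you prefer to supply the MIME type yourself rather than rely on auto-detection, a client can populate the document field named by mime_type_field. The sketch below maps file extensions to the MIME types in the table. The extension list is an assumption made for this example (the table lists MIME types, not extensions), and `mime_type_for` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

OOXML = "application/vnd.openxmlformats-officedocument."

# Extension-to-MIME mapping; the extensions themselves are assumed, not documented.
MIME_BY_EXTENSION = {
    ".pdf": "application/pdf",
    ".docx": OOXML + "wordprocessingml.document",
    ".xlsx": OOXML + "spreadsheetml.sheet",
    ".pptx": OOXML + "presentationml.presentation",
    ".html": "text/html",
    ".txt": "text/plain",
    ".md": "text/markdown",
    ".json": "application/json",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def mime_type_for(uri: str) -> Optional[str]:
    """Return the MIME type for a URI, or None to leave auto-detection in charge."""
    # Use only the path component so query strings do not hide the extension.
    suffix = PurePosixPath(urlparse(uri).path).suffix.lower()
    return MIME_BY_EXTENSION.get(suffix)


print(mime_type_for("s3://my-bucket/reports/annual-report.pdf"))
```

Returning None for unknown extensions means the field can simply be omitted, in which case the processor auto-detects the type.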

Using the processor

Example 1: Extract text from an S3 document

Step 1: Create a pipeline

PUT _ingest/pipeline/doc-extract
{
  "description": "Extract content from S3 documents",
  "processors": [
    {
      "content_extract": {
        "field": "content",
        "target_field": "extracted",
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "preserve_image_data": true,
        "reference_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}

Step 2: Ingest a document

Set source_uri to the location of an S3 object. The cluster streams the object directly:

PUT /my-documents/_doc/1?pipeline=doc-extract
{
  "source_uri": "s3://my-bucket/reports/annual-report.pdf",
  "title": "2024 Annual Report"
}

Example 2: Extract from a public HTTPS URL

This request reuses the doc-extract pipeline from Example 1. Only the URI scheme changes:

PUT /my-documents/_doc/2?pipeline=doc-extract
{
  "source_uri": "https://example.com/research-paper.pdf",
  "title": "Research Paper"
}

Example 3: Extract from a public S3 bucket (anonymous access)

PUT _ingest/pipeline/nasa-extract
{
  "description": "Extract from NASA public S3 bucket",
  "processors": [
    {
      "content_extract": {
        "field": "content",
        "target_field": "extracted",
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "reference_config": {
          "region": "us-west-2",
          "anonymous": "true"
        }
      }
    }
  ]
}

Example 4: Inline text extraction

This example assumes that a pipeline named inline-extract already exists and that its content_extract processor uses input_mode inline (the default):

PUT /my-documents/_doc/3?pipeline=inline-extract
{
  "content": "This is inline text content that will be extracted and split into content blocks."
}

Example 5: Multi-region ingestion with per-document region override

If your documents span multiple S3 regions, use region_field to override the region per document:

PUT _ingest/pipeline/multi-region
{
  "processors": [
    {
      "content_extract": {
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "region_field": "doc_region",
        "reference_config": {
          "region": "us-east-1"
        }
      }
    }
  ]
}

Then each document can specify its own region:

PUT /docs/_doc/1?pipeline=multi-region
{
  "source_uri": "s3://east-bucket/report.pdf",
  "doc_region": "us-east-1"
}

PUT /docs/_doc/2?pipeline=multi-region
{
  "source_uri": "s3://west-bucket/report.pdf",
  "doc_region": "us-west-2"
}
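
When documents are generated programmatically, the region field can be filled in from a lookup keyed by bucket name. The following sketch assumes you maintain such a mapping yourself; `BUCKET_REGIONS` and `add_region` are hypothetical names. Documents without a known bucket are left unchanged, on the assumption that the pipeline then falls back to reference_config.region.

```python
from urllib.parse import urlparse

# Hypothetical bucket-to-region lookup maintained by the client.
BUCKET_REGIONS = {
    "east-bucket": "us-east-1",
    "west-bucket": "us-west-2",
}


def add_region(doc: dict, bucket_regions: dict, region_field: str = "doc_region") -> dict:
    """Return a copy of doc with the per-document region set when it is known."""
    parsed = urlparse(doc["source_uri"])
    if parsed.scheme != "s3":
        return dict(doc)  # HTTPS sources have no S3 region
    region = bucket_regions.get(parsed.netloc)  # netloc is the bucket name
    if region is None:
        # Assumed behavior: the pipeline uses reference_config.region instead.
        return dict(doc)
    return {**doc, region_field: region}


docs = [
    add_region({"source_uri": "s3://east-bucket/report.pdf"}, BUCKET_REGIONS),
    add_region({"source_uri": "s3://west-bucket/report.pdf"}, BUCKET_REGIONS),
]
print(docs)
```

The region_field argument must match the region_field value configured in the pipeline (doc_region in this example).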

Combining with other processors

The content_extract processor is typically the first step in a pipeline chain. Common patterns:

Document pipeline (extract + OCR + chunk + embed):

PUT _ingest/pipeline/full-document-pipeline
{
  "processors": [
    { "content_extract": { "input_mode": "reference", "source_uri_field": "source_uri", "preserve_image_data": true } },
    { "ocr": { "field": "extracted.blocks", "model_id": "anthropic.claude-3-haiku-20240307-v1:0", "provider": "bedrock", "block_types": ["image"] } },
    { "chunk": { "field": "extracted.blocks", "target_field": "chunks", "algorithm": "recursive", "chunk_size": 2000 } },
    { "embed": { "field": "chunks", "model_id": "amazon.titan-embed-text-v2:0", "provider": "bedrock", "dimensions": 1024 } }
  ]
}
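
Because the ocr processor can read embedded images only when content_extract keeps them, it can help to check a pipeline definition before creating it. The sketch below is a hypothetical client-side check, not a feature of the cluster; `check_pipeline` inspects only the preserve_image_data setting and the processor order.

```python
def check_pipeline(pipeline: dict) -> list:
    """Return warnings about ocr processors that will not receive image data."""
    warnings = []
    preserve = None  # stays None until a content_extract processor is seen
    for entry in pipeline.get("processors", []):
        # Each entry is a single-key object: {processor_name: config}.
        name, config = next(iter(entry.items()))
        if name == "content_extract":
            preserve = bool(config.get("preserve_image_data", False))
        elif name == "ocr":
            if preserve is None:
                warnings.append("ocr appears before any content_extract processor")
            elif not preserve:
                warnings.append(
                    "ocr follows content_extract without preserve_image_data: true"
                )
    return warnings


pipeline = {
    "processors": [
        {"content_extract": {"input_mode": "reference", "source_uri_field": "source_uri"}},
        {"ocr": {"field": "extracted.blocks", "block_types": ["image"]}},
    ]
}
print(check_pipeline(pipeline))
```

For the pipeline above, which omits preserve_image_data, the check reports one warning. The full document pipeline shown earlier sets the flag and passes cleanly.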

GeoTIFF pipeline (extract + tile + embed):

PUT _ingest/pipeline/geotiff-pipeline
{
  "processors": [
    { "content_extract": { "input_mode": "reference", "source_uri_field": "source_uri" } },
    { "image_tiling": { "field": "extracted.blocks", "target_field": "chunks", "profile": "geo_search" } },
    { "embed": { "field": "chunks", "model_id": "amazon.titan-embed-image-v1", "provider": "bedrock", "dimensions": 1024 } }
  ]
}

Custom extractors

Developers can add custom content extractors by implementing the ContentExtractor service provider interface (SPI) and registering the implementation in META-INF/services. Custom extractors are selected by MIME type and priority, so a custom extractor can override the default extraction behavior for specific formats. See Extending ingest-content for details.