Content extraction processor
The content_extract processor extracts structured content blocks from documents in various formats, including PDF, DOCX, HTML, plain text, and images. It uses Apache Tika under the hood and supports multiple input modes: inline text, S3/HTTPS references, multipart streams, and base64-encoded attachments.
The processor outputs a normalized array of ContentBlock objects containing extracted text and detected images, along with document-level metadata (language, page count, MIME type). Downstream processors such as chunk, ocr, and embed consume these blocks.
Syntax
The following is the syntax for the content_extract processor:
{
  "content_extract": {
    "field": "content",
    "target_field": "extracted",
    "input_mode": "reference",
    "source_uri_field": "source_uri"
  }
}
Input modes
The processor supports the following input modes:
| Mode | Description | Max size (default) | Use case |
|---|---|---|---|
| inline | Text provided directly in the document field | 1 MB | Small text payloads via API |
| reference | URI (S3 or HTTPS) pointing to the source content; the cluster streams the document directly | 250 MB | PDFs, DOCX, images in S3 or on the web |
| stream | Multipart binary upload | 100 MB | Direct upload from client applications |
| attachment | Base64-encoded content in the document field (deprecated; use reference mode instead) | 10 MB | Legacy compatibility |
Use reference mode for most use cases. It lets the cluster stream content directly from S3 or HTTPS without uploading data through the API. This is the most efficient approach for large documents and supports both private (presigned) and public (anonymous) S3 buckets.
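As a rough client-side illustration of the table above, the following sketch checks a payload size against the documented default limit for each mode. The helper fits_default_limit is hypothetical and not part of the processor. It assumes that 1 MB means 1,048,576 bytes (consistent with the max_inline_bytes default) and that each limit is inclusive, which this page does not state.

```python
# Default size limits per input mode, in bytes, copied from the table above.
# Assumption: 1 MB = 1024 * 1024 bytes.
MIB = 1024 * 1024
DEFAULT_LIMITS = {
    "inline": 1 * MIB,
    "reference": 250 * MIB,
    "stream": 100 * MIB,
    "attachment": 10 * MIB,
}


def fits_default_limit(input_mode: str, size_bytes: int) -> bool:
    """Hypothetical pre-check: is the payload within the default limit for this mode?"""
    if input_mode not in DEFAULT_LIMITS:
        raise ValueError(f"unknown input mode: {input_mode!r}")
    # Assumption: the limit is inclusive.
    return size_bytes <= DEFAULT_LIMITS[input_mode]


if __name__ == "__main__":
    # A 5 MiB payload exceeds the inline default but is well within the reference default.
    five_mib = 5 * MIB
    print(fits_default_limit("inline", five_mib))
    print(fits_default_limit("reference", five_mib))
```

A check like this only mirrors the defaults; a pipeline that overrides max_inline_bytes, max_reference_bytes, or max_stream_bytes would need the same values passed in.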
Configuration parameters
The following table lists the required and optional parameters for the content_extract processor.
| Parameter | Data type | Required/Optional | Description |
|---|---|---|---|
| field | String | Optional | The source field containing content or used as a placeholder for reference mode. Default is content. |
| target_field | String | Optional | The field where extracted content blocks are stored. Default is extracted. |
| input_mode | String | Optional | The input mode. Valid values are inline, reference, stream, and attachment. Default is inline. |
| source_uri_field | String | Optional | The document field containing the S3 or HTTPS URI. Required when input_mode is reference. |
| region_field | String | Optional | A document field that overrides the S3 region per document. Enables multi-region ingestion in a single pipeline. |
| mime_type_field | String | Optional | The document field containing the MIME type. If not specified, the MIME type is auto-detected. |
| mime_type | String | Optional | A static MIME type override for all documents processed by this pipeline. |
| preserve_image_data | Boolean | Optional | Whether to keep base64-encoded image data in the extracted content blocks. Required if a downstream ocr processor needs to read embedded images. Default is false. |
| max_inline_bytes | Integer | Optional | Maximum size for inline content. Default is 1048576 (1 MB). |
| max_reference_bytes | Integer | Optional | Maximum size for reference content. Default is 262144000 (250 MB). |
| max_stream_bytes | Integer | Optional | Maximum size for stream content. Default is 104857600 (100 MB). |
| reference_config | Object | Optional | Configuration for reference resolution. See Reference configuration. |
| description | String | Optional | A brief description of the processor. |
| tag | String | Optional | An identifier tag for the processor. |
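If you would rather supply the MIME type yourself than rely on auto-detection, one option is to guess it from the URI's file extension on the client and store it in the field named by mime_type_field. The following is a minimal sketch using Python's standard mimetypes module. The document field name mime_type and the helper with_mime_type are assumptions made for illustration, not names defined by the processor.

```python
import mimetypes
from urllib.parse import urlparse


def with_mime_type(doc: dict, uri_field: str = "source_uri",
                   mime_field: str = "mime_type") -> dict:
    """Return a copy of doc with a MIME type guessed from the URI's file extension.

    The field names are illustrative; mime_field must match the pipeline's
    mime_type_field setting. If no type can be guessed, the field is omitted
    so that the processor can fall back to auto-detection.
    """
    result = dict(doc)
    # Use only the path component so the scheme and bucket/host are ignored.
    path = urlparse(doc[uri_field]).path
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed is not None:
        result[mime_field] = guessed
    return result


if __name__ == "__main__":
    print(with_mime_type({"source_uri": "s3://my-bucket/reports/annual-report.pdf"}))
```

Extension-based guessing is only a hint; it says nothing about the actual bytes, so leaving the field unset for unknown extensions is the more conservative choice.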
Reference configuration
When using reference input mode, the reference_config object controls how URIs are resolved.
| Parameter | Data type | Required/Optional | Description |
|---|---|---|---|
| region | String | Optional | The AWS region for S3 URIs. |
| anonymous | String | Optional | Set to "true" for public S3 buckets that don't require credentials (e.g., NASA open data). Default is "false". |
| s3_client | String | Optional | The named S3 client configuration to use (configured via cluster settings). Default uses the default S3 client. |
| presign_duration_minutes | String | Optional | Duration for presigned S3 URLs. Default is "60". |
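Note that anonymous and presign_duration_minutes are typed as strings in the table above, which is easy to get wrong when a pipeline body is generated from code. The sketch below shows one way to build a reference_config object that respects those types. The helper build_reference_config is hypothetical and exists only for illustration.

```python
from typing import Optional


def build_reference_config(region: Optional[str] = None,
                           anonymous: bool = False,
                           s3_client: Optional[str] = None,
                           presign_duration_minutes: Optional[int] = None) -> dict:
    """Hypothetical helper that builds a reference_config object.

    Per the table above, anonymous and presign_duration_minutes are
    string-typed, so Python values are converted to strings here.
    Unset options are omitted so that the documented defaults apply.
    """
    config = {}
    if region is not None:
        config["region"] = region
    if anonymous:
        config["anonymous"] = "true"
    if s3_client is not None:
        config["s3_client"] = s3_client
    if presign_duration_minutes is not None:
        config["presign_duration_minutes"] = str(presign_duration_minutes)
    return config


if __name__ == "__main__":
    # Public bucket in us-west-2, accessed without credentials.
    print(build_reference_config(region="us-west-2", anonymous=True))
```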
Output structure
The processor writes the following structure to the target_field:
{
  "extracted": {
    "blocks": [
      {
        "type": "text",
        "text": "Extracted text content from the document..."
      },
      {
        "type": "image",
        "image_data": "base64-encoded image bytes (if preserve_image_data is true)",
        "image_mime_type": "image/png",
        "metadata": {
          "page": 3
        }
      }
    ],
    "metadata": {
      "content_type": "application/pdf",
      "language": "en",
      "page_count": 12
    }
  }
}
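To show how a client might consume this structure after reading a document back, the following sketch separates text blocks from image blocks and decodes any image data. The helper split_blocks is hypothetical, and it assumes that image_data is absent or empty when preserve_image_data is false, which this page implies but does not spell out.

```python
import base64


def split_blocks(extracted: dict):
    """Separate text from decoded image bytes in an extracted structure."""
    texts = []
    images = []
    for block in extracted.get("blocks", []):
        block_type = block.get("type")
        if block_type == "text":
            texts.append(block.get("text", ""))
        elif block_type == "image" and block.get("image_data"):
            # Assumption: image_data is only populated when preserve_image_data is true.
            images.append(base64.b64decode(block["image_data"]))
    return texts, images


# Sample shaped like the output above; the image payload is a made-up 4-byte value.
sample = {
    "blocks": [
        {"type": "text", "text": "Extracted text content from the document..."},
        {
            "type": "image",
            "image_data": base64.b64encode(b"\x89PNG").decode("ascii"),
            "image_mime_type": "image/png",
            "metadata": {"page": 3},
        },
    ],
    "metadata": {"content_type": "application/pdf", "language": "en", "page_count": 12},
}

texts, images = split_blocks(sample)
```

Here sample stands for the value stored under the target field (extracted by default), not the whole document.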
Supported formats
The processor supports the following document formats:
| Format | MIME type | Extraction |
|---|---|---|
| PDF | application/pdf | Text extraction via Tika. Embedded images detected as image blocks for downstream OCR. |
| Word (DOCX) | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Full text extraction including tables. |
| Excel (XLSX) | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Cell values extracted as text. |
| PowerPoint (PPTX) | application/vnd.openxmlformats-officedocument.presentationml.presentation | Slide text extraction. |
| HTML | text/html | Text extracted from rendered HTML. |
| Plain text | text/plain | Passed through directly. |
| Markdown | text/markdown | Passed through directly. |
| JSON | application/json | Passed through directly. |
| Images | image/* | EXIF metadata extracted (GPS, camera info, dimensions). Optionally preserves image data for embedding. |
| GeoTIFF | image/tiff | CRS, bands, dimensions, and geo-transform metadata extracted. Use with image_tiling for tile-based processing. |
Using the processor
Example 1: Extract text from an S3 document
Step 1: Create a pipeline
PUT _ingest/pipeline/doc-extract
{
  "description": "Extract content from S3 documents",
  "processors": [
    {
      "content_extract": {
        "field": "content",
        "target_field": "extracted",
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "preserve_image_data": true,
        "reference_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}
Step 2: Ingest a document
Point at an S3 object -- the cluster streams it directly:
PUT /my-documents/_doc/1?pipeline=doc-extract
{
  "source_uri": "s3://my-bucket/reports/annual-report.pdf",
  "title": "2024 Annual Report"
}
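Example 2: Extract from a public HTTPS URL
The doc-extract pipeline from Example 1 uses reference mode, which also accepts HTTPS URIs, so no new pipeline is needed: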
PUT /my-documents/_doc/2?pipeline=doc-extract
{
  "source_uri": "https://example.com/research-paper.pdf",
  "title": "Research Paper"
}
Example 3: Extract from a public S3 bucket (anonymous access)
PUT _ingest/pipeline/nasa-extract
{
  "description": "Extract from NASA public S3 bucket",
  "processors": [
    {
      "content_extract": {
        "field": "content",
        "target_field": "extracted",
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "reference_config": {
          "region": "us-west-2",
          "anonymous": "true"
        }
      }
    }
  ]
}
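Example 4: Inline text extraction
This request assumes that a pipeline named inline-extract already exists with input_mode set to inline (the default):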
PUT /my-documents/_doc/3?pipeline=inline-extract
{
  "content": "This is inline text content that will be extracted and split into content blocks."
}
Example 5: Multi-region ingestion with per-document region override
If your documents span multiple S3 regions, use region_field to override the region per document:
PUT _ingest/pipeline/multi-region
{
  "processors": [
    {
      "content_extract": {
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "region_field": "doc_region",
        "reference_config": {
          "region": "us-east-1"
        }
      }
    }
  ]
}
Then each document can specify its own region:
PUT /docs/_doc/1?pipeline=multi-region
{
  "source_uri": "s3://east-bucket/report.pdf",
  "doc_region": "us-east-1"
}
PUT /docs/_doc/2?pipeline=multi-region
{
  "source_uri": "s3://west-bucket/report.pdf",
  "doc_region": "us-west-2"
}
Combining with other processors
The content_extract processor is typically the first step in a pipeline chain. Common patterns:
Document pipeline (extract + OCR + chunk + embed):
PUT _ingest/pipeline/full-document-pipeline
{
  "processors": [
    { "content_extract": { "input_mode": "reference", "source_uri_field": "source_uri", "preserve_image_data": true } },
    { "ocr": { "field": "extracted.blocks", "model_id": "anthropic.claude-3-haiku-20240307-v1:0", "provider": "bedrock", "block_types": ["image"] } },
    { "chunk": { "field": "extracted.blocks", "target_field": "chunks", "algorithm": "recursive", "chunk_size": 2000 } },
    { "embed": { "field": "chunks", "model_id": "amazon.titan-embed-text-v2:0", "provider": "bedrock", "dimensions": 1024 } }
  ]
}
GeoTIFF pipeline (extract + tile + embed):
PUT _ingest/pipeline/geotiff-pipeline
{
  "processors": [
    { "content_extract": { "input_mode": "reference", "source_uri_field": "source_uri" } },
    { "image_tiling": { "field": "extracted.blocks", "target_field": "chunks", "profile": "geo_search" } },
    { "embed": { "field": "chunks", "model_id": "amazon.titan-embed-image-v1", "provider": "bedrock", "dimensions": 1024 } }
  ]
}
Custom extractors
To add a custom content extractor, implement the ContentExtractor service provider interface (SPI) and register the implementation in META-INF/services. Custom extractors are selected by MIME type and priority, so you can override the default extraction behavior for specific formats. See Extending ingest-content for details.