Version: 0.11.0

OCR processor

Introduced 0.11.0

The ocr processor performs optical character recognition on image blocks using a cloud vision model (for example, Claude on Amazon Bedrock). It runs after content_extract and writes the recognized text into each image block's text field, so that downstream chunk and embed processors can index the extracted text.

Unlike traditional OCR engines such as Tesseract, the processor uses LLM vision models that can interpret charts, diagrams, tables, handwriting, and complex layouts, not just printed text.

Syntax

{
  "ocr": {
    "field": "extracted.blocks",
    "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
    "provider": "bedrock",
    "provider_config": {
      "region": "us-east-2"
    }
  }
}

How it works

content_extract output:
┌──────────────────────────────────────────────────────┐
│ blocks: [                                            │
│   { type: "text",  text: "Chapter 1..." },           │
│   { type: "image", image_data: "<base64 chart>" },  ◄── OCR processes this
│   { type: "text",  text: "As shown above..." },      │
│   { type: "image", image_data: "<base64 photo>" },  ◄── and this
│ ]                                                    │
└──────────────────────────────────────────────────────┘
                          │
                    ocr processor
                          │
                          ▼
┌──────────────────────────────────────────────────────┐
│ blocks: [                                            │
│   { type: "text",  text: "Chapter 1..." },           │
│   { type: "image", text: "Revenue: $4.2M ↑12%..." },│ ◄── text extracted
│   { type: "text",  text: "As shown above..." },      │
│   { type: "image", text: "Photo shows facility..." },│ ◄── text extracted
│ ]                                                    │
└──────────────────────────────────────────────────────┘
                          │
                    chunk processor
                          │
                          ▼
              All text now gets chunked
              and embedded together

The processor:

1. Iterates over blocks in the field array.
2. Filters blocks by block_types (default: ["image"]).
3. Validates that the image data exists and is within max_image_bytes.
4. Sends each image to the vision model with a system prompt.
5. Writes the extracted text back into the block's text field.
6. Optionally discards image_data to save storage.
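The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the processor's actual implementation: `run_ocr` and the `recognize` callback (which stands in for the vision model call) are hypothetical names, and the error handling assumes the `skip` and `fail` behaviors described for `on_failure_action`.

```python
import base64

DEFAULT_MAX_IMAGE_BYTES = 20 * 1024 * 1024  # 20971520, the documented default


def run_ocr(blocks, recognize, block_types=("image",),
            max_image_bytes=DEFAULT_MAX_IMAGE_BYTES,
            discard_image_data=True, on_failure_action="skip"):
    """Hypothetical block-mode loop; `recognize` stands in for the model call."""
    for block in blocks:
        # Steps 1-2: iterate and keep only the configured block types.
        if block.get("type") not in block_types:
            continue
        try:
            # Step 3: the image data must exist and fit the size limit.
            data = block.get("image_data")
            if not data:
                raise ValueError("block has no image_data")
            image = base64.b64decode(data, validate=True)
            if len(image) > max_image_bytes:
                raise ValueError("image exceeds max_image_bytes")
            # Steps 4-5: call the model and write the text back.
            block["text"] = recognize(image)
        except Exception:
            if on_failure_action == "fail":
                raise  # reject the document
            continue  # "skip": leave the block unchanged
        # Step 6: optionally drop the raw bytes to save storage.
        if discard_image_data:
            block.pop("image_data", None)
    return blocks
```

Passing the model call in as a function keeps the loop testable without network access; a real deployment would replace `recognize` with a call to the configured provider.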

Configuration parameters

Parameter	Data type	Required/Optional	Description
`field`	String	Optional	Source field containing the content block array. Default is `extracted.blocks`.
`model_id`	String	Required	Vision model identifier (e.g., `anthropic.claude-3-haiku-20240307-v1:0`).
`provider`	String	Required	Inference provider: `bedrock` or `http`.
`provider_config`	Object	Optional	Provider-specific configuration (region, credentials).
`block_types`	Array	Optional	Block types to process. Default is `["image"]`.
`prompt`	String	Optional	Custom system prompt. If not set, uses the default OCR prompt.
`on_failure_action`	String	Optional	`skip` (log warning, leave block unchanged) or `fail` (reject document). Default is `skip`.
`max_image_bytes`	Integer	Optional	Maximum decoded image size, in bytes. Default is `20971520` (20 MB).
`discard_image_data`	Boolean	Optional	Remove `image_data` from blocks after OCR. Saves storage but prevents re-processing. Default is `true`.
`source_uri_field`	String	Optional	Enables reference mode. Names the document field that contains the image URI; the image is fetched transiently from S3 or HTTPS.
`reference_config`	Object	Optional	Reference resolver configuration for `source_uri_field`.
`description`	String	Optional	A brief description of the processor.
`tag`	String	Optional	An identifier tag for the processor.
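To make the required fields and defaults in the table concrete, the following sketch resolves a user-supplied configuration against them. The helper name `resolve_config` is hypothetical and the validation shown is an assumption about reasonable behavior, not a description of the processor's actual checks; only the defaults and allowed values listed in the table are taken from this page.

```python
import copy

# Defaults as documented in the table above.
DEFAULTS = {
    "field": "extracted.blocks",
    "block_types": ["image"],
    "on_failure_action": "skip",
    "max_image_bytes": 20971520,
    "discard_image_data": True,
}
REQUIRED = ("model_id", "provider")


def resolve_config(user_config):
    """Hypothetical helper: check required keys, then fill in defaults."""
    missing = [key for key in REQUIRED if key not in user_config]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    if user_config["provider"] not in ("bedrock", "http"):
        raise ValueError("provider must be 'bedrock' or 'http'")
    # Copy the defaults so callers cannot mutate the shared lists.
    resolved = copy.deepcopy(DEFAULTS)
    resolved.update(user_config)
    if resolved["on_failure_action"] not in ("skip", "fail"):
        raise ValueError("on_failure_action must be 'skip' or 'fail'")
    return resolved
```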

Reference mode parameters

When source_uri_field is set, the processor fetches a single image from S3/HTTPS and writes OCR results to configurable document fields:

Parameter	Data type	Required/Optional	Description
`target_text_field`	String	Optional	Document field for extracted text. Default is `ocr_text`.
`target_confidence_field`	String	Optional	Document field for confidence score. Default is `ocr_confidence`.
`target_boxes_field`	String	Optional	Document field for bounding box array. Default is `ocr_bounding_boxes`.
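The target field parameters can be read as a simple mapping from OCR results to document fields. The sketch below illustrates that mapping under stated assumptions: `write_reference_result` is a hypothetical helper, and the shape of the `result` dictionary (keys `text`, `confidence`, `boxes`) is invented for illustration. Only the parameter names and their defaults come from the table above.

```python
def write_reference_result(doc, result, config):
    """Hypothetical helper: copy OCR results into the configured fields.

    `result` is an assumed shape: {"text": ..., "confidence": ..., "boxes": [...]}.
    """
    text_field = config.get("target_text_field", "ocr_text")
    confidence_field = config.get("target_confidence_field", "ocr_confidence")
    boxes_field = config.get("target_boxes_field", "ocr_bounding_boxes")

    doc[text_field] = result["text"]
    doc[confidence_field] = result["confidence"]
    doc[boxes_field] = result["boxes"]
    return doc
```

For example, a configuration that sets only `target_text_field` to `extracted_text` writes the text there and leaves the confidence and bounding boxes in their default fields.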

Supported models

Any Claude 3 model accessible via Bedrock:

Model	Model ID	Speed	Quality	Best for
Claude 3 Haiku	`anthropic.claude-3-haiku-20240307-v1:0`	Fast	Good	High-throughput OCR at ingest time
Claude 3 Sonnet	`anthropic.claude-3-sonnet-20240229-v1:0`	Medium	Better	Complex layouts, technical diagrams
Claude 3 Opus	`anthropic.claude-3-opus-20240229-v1:0`	Slow	Best	Critical accuracy requirements

tip

Use Claude 3 Haiku for ingest-time OCR. It offers the best throughput-to-cost ratio and handles most document images well. Reserve Sonnet or Opus for documents with complex tables, handwriting, or diagrams that require higher accuracy.

Default prompts

The processor uses task-specific system prompts when no custom prompt is provided:

OCR (default):

Extract all visible text from this image. Return only the extracted text, preserving the original layout and reading order as closely as possible. Do not add commentary or formatting instructions.

You can override this with a custom prompt tailored to your content:

{
  "ocr": {
    "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
    "provider": "bedrock",
    "prompt": "Extract all text from this engineering diagram. Include part numbers, dimensions, and annotations. Format as structured text with clear section headers.",
    "provider_config": { "region": "us-east-2" }
  }
}

Output structure

Block mode (default)

After processing, each image block gains a text field, together with ocr_confidence and ocr_bounding_boxes:

{
  "type": "image",
  "text": "Revenue Growth FY2024\nQ1: $4.2M (+12%)\nQ2: $4.8M (+14%)...",
  "ocr_confidence": 0.94,
  "ocr_bounding_boxes": [
    {
      "text": "Revenue Growth",
      "x": 0.1,
      "y": 0.05,
      "width": 0.4,
      "height": 0.08,
      "confidence": 0.98
    }
  ],
  "image_mime_type": "image/png"
}
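One way to use the per-region confidence values is to flag regions that may need manual review. The following is a minimal sketch that assumes blocks shaped like the example above; the function name `low_confidence_regions` and the `0.9` threshold are illustrative choices, not part of the processor.

```python
def low_confidence_regions(block, threshold=0.9):
    """Return the text of bounding boxes whose confidence is below threshold.

    Hypothetical helper; boxes without a confidence value are ignored.
    """
    regions = []
    for box in block.get("ocr_bounding_boxes", []):
        confidence = box.get("confidence")
        if confidence is not None and confidence < threshold:
            regions.append(box.get("text", ""))
    return regions
```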

note

When discard_image_data is true (default), the image_data field is removed after OCR to save storage. Set it to false if downstream processors (such as multimodal embedding) need the raw image bytes.

Reference mode

When using source_uri_field, results are written to document-level fields:

{
  "image_uri": "s3://bucket/scan.png",
  "ocr_text": "Extracted text from the scanned page...",
  "ocr_confidence": 0.92,
  "ocr_bounding_boxes": [...]
}

Using the processor

Example 1: OCR in a document pipeline

The most common pattern is to extract content, run OCR on the embedded images, and then chunk and embed all of the text:

PUT _ingest/pipeline/doc-with-ocr
{
  "processors": [
    {
      "content_extract": {
        "input_mode": "reference",
        "source_uri_field": "source_uri",
        "preserve_image_data": true,
        "reference_config": { "region": "us-east-2" }
      }
    },
    {
      "ocr": {
        "field": "extracted.blocks",
        "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
        "provider": "bedrock",
        "block_types": ["image"],
        "provider_config": { "region": "us-east-2" }
      }
    },
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "recursive",
        "chunk_size": 2000
      }
    },
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": { "region": "us-east-2" }
      }
    }
  ]
}

tip

Set preserve_image_data: true on the content_extract processor when you have a downstream ocr processor. Otherwise, image bytes are discarded during extraction and OCR will have nothing to process.
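This requirement is easy to miss when editing pipelines by hand, so a small pre-flight check can help. The sketch below is a hypothetical client-side helper, not a feature of the product: it scans a pipeline definition and reports block-mode ocr processors that follow a content_extract processor without `preserve_image_data: true`. It assumes an ocr processor with `source_uri_field` set runs in reference mode and does not depend on extracted image bytes.

```python
def find_ocr_without_image_data(pipeline):
    """Return indexes of block-mode ocr processors that will have no image bytes.

    Hypothetical check based on the tip above.
    """
    problems = []
    image_data_preserved = None  # unknown until a content_extract is seen
    for index, processor in enumerate(pipeline.get("processors", [])):
        for name, config in processor.items():
            if name == "content_extract":
                image_data_preserved = bool(config.get("preserve_image_data", False))
            elif name == "ocr" and "source_uri_field" not in config:
                if image_data_preserved is False:
                    problems.append(index)
    return problems
```

An empty list means no problem was detected; otherwise each entry is the position of an ocr processor in the `processors` array.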

Example 2: OCR a single image from S3

PUT _ingest/pipeline/image-ocr
{
  "processors": [
    {
      "ocr": {
        "model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
        "provider": "bedrock",
        "source_uri_field": "image_uri",
        "target_text_field": "extracted_text",
        "provider_config": { "region": "us-east-2" }
      }
    }
  ]
}

PUT /scanned-docs/_doc/1?pipeline=image-ocr
{
  "image_uri": "s3://my-bucket/scans/contract-page-1.png",
  "title": "Contract Page 1"
}

Example 3: Custom prompt for engineering drawings

PUT _ingest/pipeline/engineering-ocr
{
  "processors": [
    {
      "ocr": {
        "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
        "provider": "bedrock",
        "prompt": "Extract all text, part numbers, dimensions, tolerances, and revision notes from this engineering drawing. Preserve the hierarchical structure of the title block.",
        "discard_image_data": false,
        "provider_config": { "region": "us-east-2" }
      }
    }
  ]
}
