Skip to main content
Version: 0.11.0

OCR processor

Introduced 0.11.0

The ocr processor performs optical character recognition on image blocks using a cloud vision model (Claude on Bedrock). It runs after content_extract and converts image blocks into text blocks, so that downstream chunk and embed processors can index the extracted text.

Unlike traditional OCR (Tesseract), the processor uses LLM vision models that understand charts, diagrams, tables, handwriting, and complex layouts -- not just printed text.

Syntax

{
"ocr": {
"field": "extracted.blocks",
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"provider": "bedrock",
"provider_config": {
"region": "us-east-2"
}
}
}

How it works

content_extract output:
┌──────────────────────────────────────────────────────┐
│ blocks: [ │
│ { type: "text", text: "Chapter 1..." }, │
│ { type: "image", image_data: "<base64 chart>" }, ◄── OCR processes this
│ { type: "text", text: "As shown above..." }, │
│ { type: "image", image_data: "<base64 photo>" }, ◄── and this
│ ] │
└──────────────────────────────────────────────────────┘

ocr processor


┌──────────────────────────────────────────────────────┐
│ blocks: [ │
│ { type: "text", text: "Chapter 1..." }, │
│ { type: "image", text: "Revenue: $4.2M ↑12%..." },│ ◄── text extracted
│ { type: "text", text: "As shown above..." }, │
│ { type: "image", text: "Photo shows facility..." },│ ◄── text extracted
│ ] │
└──────────────────────────────────────────────────────┘

chunk processor


All text now gets chunked
and embedded together

The processor:

  1. Iterates over blocks in the field array
  2. Filters blocks by block_types (default: ["image"])
  3. Validates the image data exists and is within max_image_bytes
  4. Sends each image to the vision model with a system prompt
  5. Writes extracted text back into the block's text field
  6. Optionally discards image_data to save storage

Configuration parameters

ParameterData typeRequired/OptionalDescription
fieldStringOptionalSource field containing the content block array. Default is extracted.blocks.
model_idStringRequiredVision model identifier (e.g., anthropic.claude-3-haiku-20240307-v1:0).
providerStringRequiredInference provider: bedrock or http.
provider_configObjectOptionalProvider-specific configuration (region, credentials).
block_typesArrayOptionalBlock types to process. Default is ["image"].
promptStringOptionalCustom system prompt. If not set, uses the default OCR prompt.
on_failure_actionStringOptionalskip (log warning, leave block unchanged) or fail (reject document). Default is skip.
max_image_bytesIntegerOptionalMaximum decoded image size. Default is 20971520 (20 MB).
discard_image_dataBooleanOptionalRemove image_data from blocks after OCR. Saves storage but prevents re-processing. Default is true.
source_uri_fieldStringOptionalFor reference mode: document field containing image URI. Fetches image transiently from S3/HTTPS.
reference_configObjectOptionalReference resolver configuration for source_uri_field.
descriptionStringOptionalA brief description of the processor.
tagStringOptionalAn identifier tag for the processor.

Reference mode parameters

When source_uri_field is set, the processor fetches a single image from S3/HTTPS and writes OCR results to configurable document fields:

ParameterData typeRequired/OptionalDescription
target_text_fieldStringOptionalDocument field for extracted text. Default is ocr_text.
target_confidence_fieldStringOptionalDocument field for confidence score. Default is ocr_confidence.
target_boxes_fieldStringOptionalDocument field for bounding box array. Default is ocr_bounding_boxes.

Supported models

Any Claude 3 model accessible via Bedrock:

ModelModel IDSpeedQualityBest for
Claude 3 Haikuanthropic.claude-3-haiku-20240307-v1:0FastGoodHigh-throughput OCR at ingest time
Claude 3 Sonnetanthropic.claude-3-sonnet-20240229-v1:0MediumBetterComplex layouts, technical diagrams
Claude 3 Opusanthropic.claude-3-opus-20240229-v1:0SlowBestCritical accuracy requirements
tip

Use Claude 3 Haiku for ingest-time OCR. It offers the best throughput-to-cost ratio and handles most document images well. Reserve Sonnet or Opus for documents with complex tables, handwriting, or diagrams that require higher accuracy.

Default prompts

The processor uses task-specific system prompts when no custom prompt is provided:

OCR (default):

Extract all visible text from this image. Return only the extracted text, preserving the original layout and reading order as closely as possible. Do not add commentary or formatting instructions.

You can override this with a custom prompt tailored to your content:

{
"ocr": {
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"provider": "bedrock",
"prompt": "Extract all text from this engineering diagram. Include part numbers, dimensions, and annotations. Format as structured text with clear section headers.",
"provider_config": { "region": "us-east-2" }
}
}

Output structure

Block mode (default)

After processing, image blocks gain a text field:

{
"type": "image",
"text": "Revenue Growth FY2024\nQ1: $4.2M (+12%)\nQ2: $4.8M (+14%)...",
"ocr_confidence": 0.94,
"ocr_bounding_boxes": [
{
"text": "Revenue Growth",
"x": 0.1,
"y": 0.05,
"width": 0.4,
"height": 0.08,
"confidence": 0.98
}
],
"image_mime_type": "image/png"
}
note

When discard_image_data is true (default), the image_data field is removed after OCR to save storage. Set it to false if downstream processors (such as multimodal embedding) need the raw image bytes.

Reference mode

When using source_uri_field, results are written to document-level fields:

{
"image_uri": "s3://bucket/scan.png",
"ocr_text": "Extracted text from the scanned page...",
"ocr_confidence": 0.92,
"ocr_bounding_boxes": [...]
}

Using the processor

Example 1: OCR in a document pipeline

The most common pattern -- extract content, OCR embedded images, then chunk and embed all text:

PUT _ingest/pipeline/doc-with-ocr
{
"processors": [
{
"content_extract": {
"input_mode": "reference",
"source_uri_field": "source_uri",
"preserve_image_data": true,
"reference_config": { "region": "us-east-2" }
}
},
{
"ocr": {
"field": "extracted.blocks",
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"provider": "bedrock",
"block_types": ["image"],
"provider_config": { "region": "us-east-2" }
}
},
{
"chunk": {
"field": "extracted.blocks",
"target_field": "chunks",
"algorithm": "recursive",
"chunk_size": 2000
}
},
{
"embed": {
"field": "chunks",
"model_id": "amazon.titan-embed-text-v2:0",
"provider": "bedrock",
"dimensions": 1024,
"provider_config": { "region": "us-east-2" }
}
}
]
}
tip

Set preserve_image_data: true on the content_extract processor when you have a downstream ocr processor. Otherwise, image bytes are discarded during extraction and OCR will have nothing to process.

Example 2: OCR a single image from S3

PUT _ingest/pipeline/image-ocr
{
"processors": [
{
"ocr": {
"model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
"provider": "bedrock",
"source_uri_field": "image_uri",
"target_text_field": "extracted_text",
"provider_config": { "region": "us-east-2" }
}
}
]
}
PUT /scanned-docs/_doc/1?pipeline=image-ocr
{
"image_uri": "s3://my-bucket/scans/contract-page-1.png",
"title": "Contract Page 1"
}

Example 3: Custom prompt for engineering drawings

PUT _ingest/pipeline/engineering-ocr
{
"processors": [
{
"ocr": {
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"provider": "bedrock",
"prompt": "Extract all text, part numbers, dimensions, tolerances, and revision notes from this engineering drawing. Preserve the hierarchical structure of the title block.",
"discard_image_data": false,
"provider_config": { "region": "us-east-2" }
}
}
]
}