OCR processor
The ocr processor performs optical character recognition on image blocks using a cloud vision model (Claude on Bedrock). It runs after content_extract and converts image blocks into text blocks, so that downstream chunk and embed processors can index the extracted text.
Unlike traditional OCR (Tesseract), the processor uses LLM vision models that understand charts, diagrams, tables, handwriting, and complex layouts -- not just printed text.
Syntax
{
"ocr": {
"field": "extracted.blocks",
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"provider": "bedrock",
"provider_config": {
"region": "us-east-2"
}
}
}
How it works
content_extract output:
┌──────────────────────────────────────────────────────┐
│ blocks: [ │
│ { type: "text", text: "Chapter 1..." }, │
│ { type: "image", image_data: "<base64 chart>" }, ◄── OCR processes this
│ { type: "text", text: "As shown above..." }, │
│ { type: "image", image_data: "<base64 photo>" }, ◄── and this
│ ] │
└──────────────────────────────────────────────────────┘
│
ocr processor
│
▼
┌──────────────────────────────────────────────────────┐
│ blocks: [ │
│ { type: "text", text: "Chapter 1..." }, │
│ { type: "image", text: "Revenue: $4.2M ↑12%..." },│ ◄── text extracted
│ { type: "text", text: "As shown above..." }, │
│ { type: "image", text: "Photo shows facility..." },│ ◄── text extracted
│ ] │
└──────────────────────────────────────────────────────┘
│
chunk processor
│
▼
All text now gets chunked
and embedded together
The processor:
- Iterates over blocks in the
fieldarray - Filters blocks by
block_types(default:["image"]) - Validates the image data exists and is within
max_image_bytes - Sends each image to the vision model with a system prompt
- Writes extracted text back into the block's
textfield - Optionally discards
image_datato save storage
Configuration parameters
| Parameter | Data type | Required/Optional | Description |
|---|---|---|---|
field | String | Optional | Source field containing the content block array. Default is extracted.blocks. |
model_id | String | Required | Vision model identifier (e.g., anthropic.claude-3-haiku-20240307-v1:0). |
provider | String | Required | Inference provider: bedrock or http. |
provider_config | Object | Optional | Provider-specific configuration (region, credentials). |
block_types | Array | Optional | Block types to process. Default is ["image"]. |
prompt | String | Optional | Custom system prompt. If not set, uses the default OCR prompt. |
on_failure_action | String | Optional | skip (log warning, leave block unchanged) or fail (reject document). Default is skip. |
max_image_bytes | Integer | Optional | Maximum decoded image size. Default is 20971520 (20 MB). |
discard_image_data | Boolean | Optional | Remove image_data from blocks after OCR. Saves storage but prevents re-processing. Default is true. |
source_uri_field | String | Optional | For reference mode: document field containing image URI. Fetches image transiently from S3/HTTPS. |
reference_config | Object | Optional | Reference resolver configuration for source_uri_field. |
description | String | Optional | A brief description of the processor. |
tag | String | Optional | An identifier tag for the processor. |
Reference mode parameters
When source_uri_field is set, the processor fetches a single image from S3/HTTPS and writes OCR results to configurable document fields:
| Parameter | Data type | Required/Optional | Description |
|---|---|---|---|
target_text_field | String | Optional | Document field for extracted text. Default is ocr_text. |
target_confidence_field | String | Optional | Document field for confidence score. Default is ocr_confidence. |
target_boxes_field | String | Optional | Document field for bounding box array. Default is ocr_bounding_boxes. |
Supported models
Any Claude 3 model accessible via Bedrock:
| Model | Model ID | Speed | Quality | Best for |
|---|---|---|---|---|
| Claude 3 Haiku | anthropic.claude-3-haiku-20240307-v1:0 | Fast | Good | High-throughput OCR at ingest time |
| Claude 3 Sonnet | anthropic.claude-3-sonnet-20240229-v1:0 | Medium | Better | Complex layouts, technical diagrams |
| Claude 3 Opus | anthropic.claude-3-opus-20240229-v1:0 | Slow | Best | Critical accuracy requirements |
Use Claude 3 Haiku for ingest-time OCR. It offers the best throughput-to-cost ratio and handles most document images well. Reserve Sonnet or Opus for documents with complex tables, handwriting, or diagrams that require higher accuracy.
Default prompts
The processor uses task-specific system prompts when no custom prompt is provided:
OCR (default):
Extract all visible text from this image. Return only the extracted text, preserving the original layout and reading order as closely as possible. Do not add commentary or formatting instructions.
You can override this with a custom prompt tailored to your content:
{
"ocr": {
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"provider": "bedrock",
"prompt": "Extract all text from this engineering diagram. Include part numbers, dimensions, and annotations. Format as structured text with clear section headers.",
"provider_config": { "region": "us-east-2" }
}
}
Output structure
Block mode (default)
After processing, image blocks gain a text field:
{
"type": "image",
"text": "Revenue Growth FY2024\nQ1: $4.2M (+12%)\nQ2: $4.8M (+14%)...",
"ocr_confidence": 0.94,
"ocr_bounding_boxes": [
{
"text": "Revenue Growth",
"x": 0.1,
"y": 0.05,
"width": 0.4,
"height": 0.08,
"confidence": 0.98
}
],
"image_mime_type": "image/png"
}
When discard_image_data is true (default), the image_data field is removed after OCR to save storage. Set it to false if downstream processors (such as multimodal embedding) need the raw image bytes.
Reference mode
When using source_uri_field, results are written to document-level fields:
{
"image_uri": "s3://bucket/scan.png",
"ocr_text": "Extracted text from the scanned page...",
"ocr_confidence": 0.92,
"ocr_bounding_boxes": [...]
}
Using the processor
Example 1: OCR in a document pipeline
The most common pattern -- extract content, OCR embedded images, then chunk and embed all text:
PUT _ingest/pipeline/doc-with-ocr
{
"processors": [
{
"content_extract": {
"input_mode": "reference",
"source_uri_field": "source_uri",
"preserve_image_data": true,
"reference_config": { "region": "us-east-2" }
}
},
{
"ocr": {
"field": "extracted.blocks",
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"provider": "bedrock",
"block_types": ["image"],
"provider_config": { "region": "us-east-2" }
}
},
{
"chunk": {
"field": "extracted.blocks",
"target_field": "chunks",
"algorithm": "recursive",
"chunk_size": 2000
}
},
{
"embed": {
"field": "chunks",
"model_id": "amazon.titan-embed-text-v2:0",
"provider": "bedrock",
"dimensions": 1024,
"provider_config": { "region": "us-east-2" }
}
}
]
}
Set preserve_image_data: true on the content_extract processor when you have a downstream ocr processor. Otherwise, image bytes are discarded during extraction and OCR will have nothing to process.
Example 2: OCR a single image from S3
PUT _ingest/pipeline/image-ocr
{
"processors": [
{
"ocr": {
"model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
"provider": "bedrock",
"source_uri_field": "image_uri",
"target_text_field": "extracted_text",
"provider_config": { "region": "us-east-2" }
}
}
]
}
PUT /scanned-docs/_doc/1?pipeline=image-ocr
{
"image_uri": "s3://my-bucket/scans/contract-page-1.png",
"title": "Contract Page 1"
}
Example 3: Custom prompt for engineering drawings
PUT _ingest/pipeline/engineering-ocr
{
"processors": [
{
"ocr": {
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"provider": "bedrock",
"prompt": "Extract all text, part numbers, dimensions, tolerances, and revision notes from this engineering drawing. Preserve the hierarchical structure of the title block.",
"discard_image_data": false,
"provider_config": { "region": "us-east-2" }
}
}
]
}