Version: 0.11.0

Embed processor

Introduced 0.11.0

The embed processor generates vector embeddings from text, image, or multimodal content. It reads the chunk arrays produced by the chunk or image_tiling processors and writes a dense vector into each chunk, making the chunks searchable via kNN.

Three embedding providers are supported: AWS Bedrock, OpenAI, and a generic HTTP endpoint for self-hosted models.

Syntax

{
  "embed": {
    "field": "chunks",
    "model_id": "amazon.titan-embed-text-v2:0",
    "provider": "bedrock",
    "dimensions": 1024,
    "provider_config": {
      "region": "us-east-2"
    }
  }
}

Architecture

                        ┌──────────────────────┐
                        │   embed processor    │
                        │                      │
chunks[] ──────────────►│  For each chunk:     │──────────► chunks[] + embedding
 [text, image_data]     │  1. Detect content   │            [text, embedding: [...]]
                        │     type (text/img/  │
                        │     multimodal)      │
                        │  2. Build input      │
                        │  3. Call provider    │
                        │  4. Store vector     │
                        └──────────┬───────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    ▼              ▼              ▼
              ┌──────────┐  ┌──────────┐  ┌──────────┐
              │ Bedrock  │  │  OpenAI  │  │   HTTP   │
              │ Provider │  │ Provider │  │ Provider │
              └────┬─────┘  └────┬─────┘  └────┬─────┘
                   │             │             │
              Titan Text    text-embedding   Your own
              Titan Multi   -3-small/large   model
              Cohere Embed  ada-002          endpoint

Providers

AWS Bedrock

Calls Bedrock InvokeModel in your AWS account. Inference data stays within AWS.

Model	Model ID	Dimensions	Content types
Titan Text Embeddings V2	`amazon.titan-embed-text-v2:0`	256, 512, 1024	Text
Titan Multimodal Embeddings G1	`amazon.titan-embed-image-v1`	1024 (fixed)	Text, image, multimodal
Cohere Embed English v3	`cohere.embed-english-v3`	1024	Text
Cohere Embed Multilingual v3	`cohere.embed-multilingual-v3`	1024	Text

Note: Bedrock processes one input per API call (no batch API). For large document sets, consider Titan Text Embeddings V2, which is optimized for throughput.
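
Because each Bedrock call carries a single input, embedding N chunks costs N InvokeModel requests. The sketch below illustrates that loop for Titan Text Embeddings V2. It is not the processor's implementation: `embed_chunks_with_titan` is a hypothetical helper, and `client` is assumed to be a boto3 `bedrock-runtime` client or any object with a compatible `invoke_model` method.

```python
import json


def embed_chunks_with_titan(client, texts, dimensions=1024,
                            model_id="amazon.titan-embed-text-v2:0"):
    """Embed each text with its own InvokeModel call (hypothetical helper)."""
    vectors = []
    for text in texts:
        # Titan Text Embeddings V2 takes one inputText per request.
        body = json.dumps({"inputText": text, "dimensions": dimensions})
        response = client.invoke_model(
            modelId=model_id,
            body=body,
            contentType="application/json",
            accept="application/json",
        )
        # The response body is a stream; the vector sits under "embedding".
        payload = json.loads(response["body"].read())
        vectors.append(payload["embedding"])
    return vectors
```

With boto3 installed, a real client could be created with `boto3.client("bedrock-runtime", region_name="us-east-2")`. The one-request-per-chunk loop is why throughput matters more for Bedrock than for the batching providers.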

Provider config:

Key	Required/Optional	Description
`region`	Required	AWS region (e.g., `us-east-1`, `us-west-2`).
`access_key`	Optional	AWS access key from keystore. Falls back to the default credential chain (instance profile, environment variables).
`secret_key`	Optional	AWS secret key from keystore.
`session_token`	Optional	STS session token for temporary credentials.

OpenAI

Calls the OpenAI Embeddings API. Supports batch processing (up to 100 inputs per call).

Model	Model ID	Dimensions
text-embedding-3-small	`text-embedding-3-small`	1536
text-embedding-3-large	`text-embedding-3-large`	3072
text-embedding-ada-002	`text-embedding-ada-002`	1536

Note: The OpenAI provider supports text embeddings only. For multimodal content, use the Bedrock or HTTP provider.

Provider config:

Key	Required/Optional	Description
`api_key_setting`	Required	Keystore setting name (e.g., `ingest.content.openai.api_key`). Raw API keys are not allowed in pipeline configuration.
`endpoint`	Optional	Custom endpoint for Azure OpenAI or proxies. Default is `https://api.openai.com/v1/embeddings`.

HTTP (self-hosted)

Calls any HTTP embedding service. Use this for self-hosted models where data sovereignty requires all inference to stay within your network.

Provider config:

Key	Required/Optional	Description
`endpoint`	Required	POST endpoint URL (e.g., `http://embedding-svc.internal:8080/embed`).
`response_embedding_path`	Optional	JSONPath to extract embeddings from response. Default is `$.embeddings`.
`auth_header`	Optional	Authorization header value (e.g., `Bearer <token>`).
`max_batch_size`	Optional	Maximum inputs per request. Default is `50`.

Request format:

{
  "inputs": [
    {"text": "hello world"},
    {"image": "<base64>", "image_mime_type": "image/jpeg"},
    {"text": "caption", "image": "<base64>", "image_mime_type": "image/png"}
  ],
  "model": "your-model-id"
}

Expected response:

{
  "embeddings": [[0.1, 0.2, ...], [0.3, 0.4, ...]]
}
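
To make the contract concrete, here is a minimal sketch of how chunks could be turned into request bodies of this shape and how the response could be read back. The function names are hypothetical, and reading the MIME type from an `image_mime_type` key on the chunk is an assumption; only the request and response layouts come from the formats above.

```python
def build_requests(chunks, model_id, max_batch_size=50):
    """Split chunks into request bodies of at most max_batch_size inputs."""
    inputs = []
    for chunk in chunks:
        item = {}
        if chunk.get("text"):
            item["text"] = chunk["text"]
        if chunk.get("image_data"):
            item["image"] = chunk["image_data"]
            # Assumption: the chunk carries its MIME type under this key.
            if "image_mime_type" in chunk:
                item["image_mime_type"] = chunk["image_mime_type"]
        if item:  # chunks with neither text nor image are left out
            inputs.append(item)
    return [
        {"inputs": inputs[i:i + max_batch_size], "model": model_id}
        for i in range(0, len(inputs), max_batch_size)
    ]


def extract_embeddings(response, expected_count):
    """Read vectors from the default `$.embeddings` location."""
    embeddings = response["embeddings"]
    if len(embeddings) != expected_count:
        raise ValueError(
            f"expected {expected_count} vectors, got {len(embeddings)}"
        )
    return embeddings
```

A self-hosted service only needs to return one vector per input, in input order, for this pairing to work.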

Configuration parameters

Parameter	Data type	Required/Optional	Description
`field`	String	Optional	Source field containing the chunk array. Default is `chunks`.
`model_id`	String	Required	Model identifier (provider-specific).
`provider`	String	Required	Provider name: `bedrock`, `openai`, or `http`.
`dimensions`	Integer	Optional	Embedding vector dimensions. Default is `1536`. Must be a size the selected model supports; must be exactly `1024` for Titan Multimodal.
`content_type`	String	Optional	Provider validation hint: `text`, `image`, or `multimodal`. Actual content type is auto-detected per chunk. Default is `text`.
`batch_size`	Integer	Optional	Inputs per provider call. Bedrock ignores this (always 1). Default is `50`.
`on_failure_action`	String	Optional	`skip` (log a warning and continue without an embedding, which enables BM25 fallback) or `fail` (reject the document). Default is `skip`.
`block_types`	Array	Optional	Chunk types to embed. Non-matching chunks are skipped. Default embeds all types.
`source_uri_field`	String	Optional	For direct image embedding: the document field containing the image URI. The image is fetched transiently and not stored.
`max_image_embed_bytes`	Integer	Optional	Maximum image size for transient fetch. Default is `26214400` (25 MB).
`reference_config`	Object	Optional	Reference resolver configuration for `source_uri_field`.
`provider_config`	Object	Optional	Provider-specific configuration. Each provider defines its own required keys; see the provider sections above.
`description`	String	Optional	A brief description of the processor.
`tag`	String	Optional	An identifier tag for the processor.
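
The constraints in this table and in the provider sections can be checked before a pipeline is created. The following is a rough sketch of such a check, not part of the processor: `validate_embed_config` and both lookup tables are hypothetical, built only from the values documented on this page.

```python
# Required provider_config keys, taken from the provider sections.
REQUIRED_PROVIDER_KEYS = {
    "bedrock": ["region"],
    "openai": ["api_key_setting"],
    "http": ["endpoint"],
}

# Dimension values listed in the Bedrock model table.
BEDROCK_DIMENSIONS = {
    "amazon.titan-embed-text-v2:0": {256, 512, 1024},
    "amazon.titan-embed-image-v1": {1024},
    "cohere.embed-english-v3": {1024},
    "cohere.embed-multilingual-v3": {1024},
}


def validate_embed_config(config):
    """Return a list of problems found in an `embed` processor config."""
    problems = []
    for key in ("model_id", "provider"):
        if key not in config:
            problems.append(f"missing required parameter: {key}")
    provider = config.get("provider")
    if provider is not None and provider not in REQUIRED_PROVIDER_KEYS:
        problems.append(f"unknown provider: {provider}")
    provider_config = config.get("provider_config", {})
    for key in REQUIRED_PROVIDER_KEYS.get(provider, []):
        if key not in provider_config:
            problems.append(f"provider_config.{key} is required for {provider}")
    dimensions = config.get("dimensions", 1536)  # documented default
    allowed = BEDROCK_DIMENSIONS.get(config.get("model_id"))
    if provider == "bedrock" and allowed and dimensions not in allowed:
        problems.append(
            f"dimensions {dimensions} not supported by {config['model_id']}"
        )
    if config.get("on_failure_action", "skip") not in ("skip", "fail"):
        problems.append("on_failure_action must be 'skip' or 'fail'")
    return problems
```

One consequence worth noticing: if the documented default of `1536` applies when `dimensions` is omitted, a Bedrock configuration that leaves it out would not match any of the listed Bedrock sizes, so setting `dimensions` explicitly is the safer choice.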

Content type detection

The processor auto-detects the content type of each chunk:

Chunk has text + image_data?  →  multimodal embedding
Chunk has text only?          →  text embedding
Chunk has image_data only?    →  image embedding
Chunk has neither?            →  skipped

Inline image data (image_data field in chunks) is limited to 5 MB after base64 decoding. For larger images, use source_uri_field for transient fetch.
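
The detection rules and the inline size limit above can be expressed as a short sketch. The function names are hypothetical, empty strings are assumed to count as absent, and 5 MB is assumed to mean 5 × 1024 × 1024 bytes, consistent with the 25 MB default of `26214400` for `max_image_embed_bytes`.

```python
import base64

# Assumption: 5 MB means 5 * 1024 * 1024 bytes, measured after decoding.
MAX_INLINE_IMAGE_BYTES = 5 * 1024 * 1024


def detect_content_type(chunk):
    """Return 'multimodal', 'text', 'image', or None when the chunk is skipped."""
    has_text = bool(chunk.get("text"))
    has_image = bool(chunk.get("image_data"))
    if has_text and has_image:
        return "multimodal"
    if has_text:
        return "text"
    if has_image:
        return "image"
    return None


def inline_image_within_limit(image_data):
    """Check the decoded size of a base64 image against the inline limit."""
    return len(base64.b64decode(image_data)) <= MAX_INLINE_IMAGE_BYTES
```

Because the limit applies to decoded bytes, a base64 string can be roughly a third longer than 5 MB and still pass.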

Output structure

The processor adds an embedding field to each eligible chunk:

{
  "chunks": [
    {
      "text": "First chunk of text...",
      "chunk_index": 0,
      "embedding": [0.0123, -0.0456, 0.0789, ...]
    },
    {
      "text": "Second chunk...",
      "chunk_index": 1,
      "embedding": [0.0234, -0.0567, 0.0891, ...]
    }
  ]
}

When using source_uri_field, the embedding is stored at the document root:

{
  "source_uri": "s3://bucket/image.jpg",
  "embedding": [0.0123, -0.0456, 0.0789, ...]
}
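
The chunk-level output, combined with `on_failure_action`, can be pictured with the sketch below. It is an illustration under assumptions rather than the processor's code: `embed_document` is a hypothetical name, and `embed_fn` stands in for whichever provider call produces the vector.

```python
def embed_document(doc, embed_fn, field="chunks", on_failure_action="skip"):
    """Add an `embedding` to each chunk that has text or image data."""
    for chunk in doc.get(field, []):
        if not (chunk.get("text") or chunk.get("image_data")):
            continue  # nothing to embed, so the chunk is skipped
        try:
            chunk["embedding"] = embed_fn(chunk)
        except Exception:
            if on_failure_action == "fail":
                raise  # reject the whole document
            # "skip": leave the chunk without a vector; its text stays
            # available for BM25 matching.
    return doc
```

With `skip`, a document can end up with a mix of embedded and non-embedded chunks, so queries that rely on kNN alone may miss the chunks that failed.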

Security

Warning: Raw API keys are not allowed in pipeline configuration. Use the Lucenia keystore to store credentials securely.

# Store an OpenAI API key
bin/lucenia-keystore add ingest.content.openai.api_key

# Store AWS credentials (alternative to instance profile)
bin/lucenia-keystore add ingest.content.bedrock.access_key
bin/lucenia-keystore add ingest.content.bedrock.secret_key

Then reference the keystore setting in your pipeline:

{
  "embed": {
    "provider_config": {
      "api_key_setting": "ingest.content.openai.api_key"
    }
  }
}

Using the processor

Example 1: Text embeddings with Bedrock Titan

PUT _ingest/pipeline/text-embed
{
  "processors": [
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}

Example 2: Multimodal embeddings with Titan Multimodal

PUT _ingest/pipeline/multimodal-embed
{
  "processors": [
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-image-v1",
        "provider": "bedrock",
        "dimensions": 1024,
        "content_type": "multimodal",
        "provider_config": {
          "region": "us-east-1"
        }
      }
    }
  ]
}

Example 3: Self-hosted model via HTTP

PUT _ingest/pipeline/self-hosted-embed
{
  "processors": [
    {
      "embed": {
        "field": "chunks",
        "model_id": "sentence-transformers/all-MiniLM-L6-v2",
        "provider": "http",
        "dimensions": 384,
        "provider_config": {
          "endpoint": "http://embedding-service.internal:8080/embed",
          "max_batch_size": 32
        }
      }
    }
  ]
}

Example 4: Direct image embedding from S3

Embed an image without storing it in the document. The image is fetched transiently from S3, embedded, and discarded:

PUT _ingest/pipeline/image-embed
{
  "processors": [
    {
      "embed": {
        "source_uri_field": "image_uri",
        "model_id": "amazon.titan-embed-image-v1",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": {
          "region": "us-east-1"
        }
      }
    }
  ]
}

PUT /images/_doc/1?pipeline=image-embed
{
  "image_uri": "s3://my-bucket/photos/landscape.jpg",
  "title": "Mountain landscape"
}
