Version: 0.11.0

Embed processor

Introduced 0.11.0

The embed processor generates vector embeddings from text, image, or multimodal content. It reads the chunk arrays produced by the chunk or image_tiling processors and writes a dense vector into each chunk, making the chunks searchable with k-NN queries.

Three embedding providers are supported: AWS Bedrock, OpenAI, and a generic HTTP endpoint for self-hosted models.

Syntax

{
  "embed": {
    "field": "chunks",
    "model_id": "amazon.titan-embed-text-v2:0",
    "provider": "bedrock",
    "dimensions": 1024,
    "provider_config": {
      "region": "us-east-2"
    }
  }
}

Architecture

                        ┌──────────────────────┐
                        │   embed processor    │
                        │                      │
chunks[] ──────────────►│ For each chunk:      │──────────► chunks[] + embedding
[text, image_data]      │ 1. Detect content    │            [text, embedding: [...]]
                        │    type (text/img/   │
                        │    multimodal)       │
                        │ 2. Build input       │
                        │ 3. Call provider     │
                        │ 4. Store vector      │
                        └──────────┬───────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    ▼              ▼              ▼
               ┌──────────┐   ┌──────────┐   ┌──────────┐
               │ Bedrock  │   │ OpenAI   │   │ HTTP     │
               │ Provider │   │ Provider │   │ Provider │
               └────┬─────┘   └────┬─────┘   └────┬─────┘
                    │              │              │
               Titan Text     text-embedding Your own
               Titan Multi    -3-small/large model
               Cohere Embed   ada-002        endpoint

Providers

AWS Bedrock

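Calls the Bedrock InvokeModel API using your AWS account. Requests go to the Bedrock service in the configured region; to keep that traffic inside your VPC, route it through a VPC interface endpoint (AWS PrivateLink) for Bedrock.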

| Model | Model ID | Dimensions | Content types |
| --- | --- | --- | --- |
| Titan Text Embeddings V2 | amazon.titan-embed-text-v2:0 | 256, 512, 1024 | Text |
| Titan Multimodal Embeddings G1 | amazon.titan-embed-image-v1 | 1024 (fixed) | Text, image, multimodal |
| Cohere Embed English v3 | cohere.embed-english-v3 | 1024 | Text |
| Cohere Embed Multilingual v3 | cohere.embed-multilingual-v3 | 1024 | Text |

Note: Bedrock processes one input per API call (there is no batch API). For large document sets, consider using Titan Text v2, which is optimized for throughput.

Provider config:

| Key | Required/Optional | Description |
| --- | --- | --- |
| region | Required | AWS region (e.g., us-east-1, us-west-2). |
| access_key | Optional | AWS access key from keystore. Falls back to the default credential chain (instance profile, environment variables). |
| secret_key | Optional | AWS secret key from keystore. |
| session_token | Optional | STS session token for temporary credentials. |

OpenAI

Calls the OpenAI Embeddings API. Supports batch processing (up to 100 inputs per call).

| Model | Model ID | Dimensions |
| --- | --- | --- |
| text-embedding-3-small | text-embedding-3-small | 1536 |
| text-embedding-3-large | text-embedding-3-large | 3072 |
| text-embedding-ada-002 | text-embedding-ada-002 | 1536 |

Note: The OpenAI provider supports text-only embeddings. For multimodal content, use Bedrock or HTTP.

Provider config:

| Key | Required/Optional | Description |
| --- | --- | --- |
| api_key_setting | Required | Keystore setting name (e.g., ingest.content.openai.api_key). Raw API keys are not allowed in pipeline configuration. |
| endpoint | Optional | Custom endpoint for Azure OpenAI or proxies. Default is https://api.openai.com/v1/embeddings. |
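
The batching behavior described for the OpenAI provider can be pictured with a short sketch. It assumes the effective batch size is the smaller of the configured batch_size and the 100-input limit stated above; the function name batch_inputs and that capping rule are illustrative assumptions, not the processor's confirmed implementation.

```python
from typing import Iterator, List


def batch_inputs(texts: List[str], batch_size: int = 50,
                 provider_limit: int = 100) -> Iterator[List[str]]:
    """Yield successive batches no larger than min(batch_size, provider_limit)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    size = min(batch_size, provider_limit)  # assumed capping rule
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


chunk_texts = [f"chunk {i}" for i in range(120)]
print([len(batch) for batch in batch_inputs(chunk_texts, batch_size=50)])
# [50, 50, 20]
```

With the default batch_size of 50, a document with 120 text chunks would need three provider calls under these assumptions.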

HTTP (self-hosted)

Calls any HTTP embedding service. Use this for self-hosted models where data sovereignty requires all inference to stay within your network.

Provider config:

| Key | Required/Optional | Description |
| --- | --- | --- |
| endpoint | Required | POST endpoint URL (e.g., http://embedding-svc.internal:8080/embed). |
| response_embedding_path | Optional | JSONPath to extract embeddings from response. Default is $.embeddings. |
| auth_header | Optional | Authorization header value (e.g., Bearer <token>). |
| max_batch_size | Optional | Maximum inputs per request. Default is 50. |
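
To illustrate what response_embedding_path selects, the sketch below resolves simple dotted paths such as $.embeddings or $.data.vectors. The function name extract_by_path and the nested response shape are hypothetical, and only a small subset of JSONPath is handled; the processor's actual JSONPath support may be broader.

```python
from typing import Any


def extract_by_path(document: Any, path: str) -> Any:
    """Resolve a dotted JSONPath subset like '$.embeddings' (no filters or indexes)."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = document
    # Drop the leading '$' and walk one object key at a time.
    for key in filter(None, path[1:].split(".")):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"path segment not found: {key}")
        current = current[key]
    return current


# Response that matches the default path.
default_response = {"embeddings": [[0.1, 0.2], [0.3, 0.4]]}
print(extract_by_path(default_response, "$.embeddings"))

# Hypothetical service that nests its vectors one level deeper.
nested_response = {"data": {"vectors": [[0.5, 0.6]]}}
print(extract_by_path(nested_response, "$.data.vectors"))
```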

Request format:

{
  "inputs": [
    {"text": "hello world"},
    {"image": "<base64>", "image_mime_type": "image/jpeg"},
    {"text": "caption", "image": "<base64>", "image_mime_type": "image/png"}
  ],
  "model": "your-model-id"
}

Expected response:

{
  "embeddings": [[0.1, 0.2, ...], [0.3, 0.4, ...]]
}
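
For anyone building a compatible service, the request and response formats above can be exercised with a small handler. This is a sketch under stated assumptions: handle_embed_request, placeholder_vector, and DIMENSIONS are hypothetical names, the vectors are hash-derived stand-ins rather than real embeddings, and returning one vector per input in request order is an assumption about what the processor expects.

```python
import hashlib
import json
from typing import Any, Dict, List

DIMENSIONS = 4  # hypothetical; a real model determines the vector size


def placeholder_vector(item: Dict[str, Any]) -> List[float]:
    """Derive a deterministic stand-in vector from the input (not a real embedding)."""
    digest = hashlib.sha256(json.dumps(item, sort_keys=True).encode("utf-8")).digest()
    return [round(byte / 255.0, 4) for byte in digest[:DIMENSIONS]]


def handle_embed_request(body: str) -> str:
    """Turn a request body in the documented format into the expected response."""
    payload = json.loads(body)
    inputs = payload.get("inputs", [])
    for item in inputs:
        if "text" not in item and "image" not in item:
            raise ValueError("each input needs 'text', 'image', or both")
    # One vector per input, in request order (assumed ordering contract).
    return json.dumps({"embeddings": [placeholder_vector(item) for item in inputs]})


request_body = json.dumps({
    "inputs": [
        {"text": "hello world"},
        {"text": "caption", "image": "aGVsbG8=", "image_mime_type": "image/png"},
    ],
    "model": "your-model-id",
})
response = json.loads(handle_embed_request(request_body))
print(len(response["embeddings"]), len(response["embeddings"][0]))  # 2 4
```

A real service would replace placeholder_vector with a model call and wrap the handler in whatever HTTP framework it already uses.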

Configuration parameters

| Parameter | Data type | Required/Optional | Description |
| --- | --- | --- | --- |
| field | String | Optional | Source field containing the chunk array. Default is chunks. |
| model_id | String | Required | Model identifier (provider-specific). |
| provider | String | Required | Provider name: bedrock, openai, or http. |
| dimensions | Integer | Optional | Embedding vector dimensions. Default is 1536. Must be exactly 1024 for Titan Multimodal. |
| content_type | String | Optional | Provider validation hint: text, image, or multimodal. Actual content type is auto-detected per chunk. Default is text. |
| batch_size | Integer | Optional | Inputs per provider call. Bedrock ignores this (always 1). Default is 50. |
| on_failure_action | String | Optional | skip (log warning, continue without embedding; enables BM25 fallback) or fail (reject document). Default is skip. |
| block_types | Array | Optional | Chunk types to embed. Non-matching chunks are skipped. Default embeds all types. |
| source_uri_field | String | Optional | For direct image embedding: document field containing image URI. Image is fetched transiently and not stored. |
| max_image_embed_bytes | Integer | Optional | Maximum image size for transient fetch. Default is 26214400 (25 MB). |
| reference_config | Object | Optional | Reference resolver configuration for source_uri_field. |
| provider_config | Object | Optional | Provider-specific configuration. See provider sections above. |
| description | String | Optional | A brief description of the processor. |
| tag | String | Optional | An identifier tag for the processor. |
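
Several of these parameters interact, for example the default dimensions value of 1536 conflicts with the 1024 requirement for Titan Multimodal. The sketch below checks a processor config against the constraints listed on this page before it is submitted. The function check_embed_config and its messages are hypothetical, and the real processor may validate differently or more strictly.

```python
from typing import Any, Dict, List

PROVIDERS = {"bedrock", "openai", "http"}
CONTENT_TYPES = {"text", "image", "multimodal"}
FAILURE_ACTIONS = {"skip", "fail"}
TITAN_MULTIMODAL = "amazon.titan-embed-image-v1"
# Required provider_config keys, taken from the provider tables.
REQUIRED_PROVIDER_KEYS = {
    "bedrock": ["region"],
    "openai": ["api_key_setting"],
    "http": ["endpoint"],
}


def check_embed_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an embed processor config (empty if none)."""
    problems: List[str] = []
    for key in ("model_id", "provider"):
        if key not in config:
            problems.append(f"missing required parameter: {key}")

    provider = config.get("provider")
    if provider is not None and provider not in PROVIDERS:
        problems.append(f"unknown provider: {provider}")
    for key in REQUIRED_PROVIDER_KEYS.get(provider, []):
        if key not in config.get("provider_config", {}):
            problems.append(f"provider_config.{key} is required for {provider}")

    # dimensions defaults to 1536, which Titan Multimodal does not accept.
    if config.get("model_id") == TITAN_MULTIMODAL and config.get("dimensions", 1536) != 1024:
        problems.append("dimensions must be 1024 for Titan Multimodal")
    if config.get("content_type", "text") not in CONTENT_TYPES:
        problems.append("content_type must be text, image, or multimodal")
    if config.get("on_failure_action", "skip") not in FAILURE_ACTIONS:
        problems.append("on_failure_action must be skip or fail")
    return problems


print(check_embed_config({
    "model_id": "amazon.titan-embed-image-v1",
    "provider": "bedrock",
    "provider_config": {"region": "us-east-1"},
}))
# ['dimensions must be 1024 for Titan Multimodal']
```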

Content type detection

The processor auto-detects the content type of each chunk:

Chunk has text + image_data?  →  multimodal embedding
Chunk has text only?          →  text embedding
Chunk has image_data only?    →  image embedding
Chunk has neither?            →  skipped

Inline image data (image_data field in chunks) is limited to 5 MB after base64 decoding. For larger images, use source_uri_field for transient fetch.
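
The detection rules and the inline size limit can be expressed as a short sketch. It assumes that an empty or missing field counts as absent and that 5 MB means 5 * 1024 * 1024 bytes; the function names are illustrative and do not come from the processor's source.

```python
import base64
from typing import Any, Dict, Optional

# 5 MB after base64 decoding; assumes 1 MB = 1024 * 1024 bytes.
MAX_INLINE_IMAGE_BYTES = 5 * 1024 * 1024


def detect_content_type(chunk: Dict[str, Any]) -> Optional[str]:
    """Return 'multimodal', 'text', 'image', or None when the chunk is skipped."""
    has_text = bool(chunk.get("text"))
    has_image = bool(chunk.get("image_data"))
    if has_text and has_image:
        return "multimodal"
    if has_text:
        return "text"
    if has_image:
        return "image"
    return None


def inline_image_within_limit(image_data: str) -> bool:
    """Check the decoded size of a base64 image against the inline limit."""
    return len(base64.b64decode(image_data)) <= MAX_INLINE_IMAGE_BYTES


chunks = [
    {"text": "a caption", "image_data": "aGVsbG8="},
    {"text": "plain text"},
    {"image_data": "aGVsbG8="},
    {"chunk_index": 3},
]
print([detect_content_type(chunk) for chunk in chunks])
# ['multimodal', 'text', 'image', None]
print(inline_image_within_limit("aGVsbG8="))  # True
```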

Output structure

The processor adds an embedding field to each eligible chunk:

{
  "chunks": [
    {
      "text": "First chunk of text...",
      "chunk_index": 0,
      "embedding": [0.0123, -0.0456, 0.0789, ...]
    },
    {
      "text": "Second chunk...",
      "chunk_index": 1,
      "embedding": [0.0234, -0.0567, 0.0891, ...]
    }
  ]
}

When using source_uri_field, the embedding is stored at the document root:

{
  "source_uri": "s3://bucket/image.jpg",
  "embedding": [0.0123, -0.0456, 0.0789, ...]
}
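
Because chunks can be left without an embedding (for example when on_failure_action is skip or a chunk's type is excluded by block_types), it can help to inspect processed documents. The sketch below separates chunks that carry a vector of the expected size from those that do not. The helper split_by_embedding is hypothetical, and the three-element vectors are shortened purely for illustration.

```python
from typing import Any, Dict, List, Tuple


def split_by_embedding(document: Dict[str, Any], field: str = "chunks",
                       dimensions: int = 1024) -> Tuple[List[int], List[int]]:
    """Return (chunk indexes with a vector of the expected size, all other indexes)."""
    embedded: List[int] = []
    missing: List[int] = []
    for position, chunk in enumerate(document.get(field, [])):
        index = chunk.get("chunk_index", position)
        vector = chunk.get("embedding")
        if isinstance(vector, list) and len(vector) == dimensions:
            embedded.append(index)
        else:
            missing.append(index)
    return embedded, missing


document = {
    "chunks": [
        {"text": "First chunk of text...", "chunk_index": 0,
         "embedding": [0.0123, -0.0456, 0.0789]},
        {"text": "Second chunk...", "chunk_index": 1},
    ]
}
print(split_by_embedding(document, dimensions=3))  # ([0], [1])
```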

Security

Warning: Raw API keys are not allowed in pipeline configuration. Use the Lucenia keystore to store credentials securely.

# Store an OpenAI API key
bin/lucenia-keystore add ingest.content.openai.api_key

# Store AWS credentials (alternative to instance profile)
bin/lucenia-keystore add ingest.content.bedrock.access_key
bin/lucenia-keystore add ingest.content.bedrock.secret_key

Then reference the keystore setting in your pipeline:

{
  "embed": {
    "model_id": "text-embedding-3-small",
    "provider": "openai",
    "provider_config": {
      "api_key_setting": "ingest.content.openai.api_key"
    }
  }
}

Using the processor

Example 1: Text embeddings with Bedrock Titan

PUT _ingest/pipeline/text-embed
{
  "processors": [
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}

Example 2: Multimodal embeddings with Titan Multimodal

PUT _ingest/pipeline/multimodal-embed
{
  "processors": [
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-image-v1",
        "provider": "bedrock",
        "dimensions": 1024,
        "content_type": "multimodal",
        "provider_config": {
          "region": "us-east-1"
        }
      }
    }
  ]
}

Example 3: Self-hosted model via HTTP

PUT _ingest/pipeline/self-hosted-embed
{
  "processors": [
    {
      "embed": {
        "field": "chunks",
        "model_id": "sentence-transformers/all-MiniLM-L6-v2",
        "provider": "http",
        "dimensions": 384,
        "provider_config": {
          "endpoint": "http://embedding-service.internal:8080/embed",
          "max_batch_size": 32
        }
      }
    }
  ]
}

Example 4: Direct image embedding from S3

Embed an image without storing it in the document. The image is fetched transiently from S3, embedded, and discarded:

PUT _ingest/pipeline/image-embed
{
  "processors": [
    {
      "embed": {
        "source_uri_field": "image_uri",
        "model_id": "amazon.titan-embed-image-v1",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": {
          "region": "us-east-1"
        }
      }
    }
  ]
}

Then index a document through the pipeline, passing the image location in the image_uri field:

PUT /images/_doc/1?pipeline=image-embed
{
  "image_uri": "s3://my-bucket/photos/landscape.jpg",
  "title": "Mountain landscape"
}