Embed processor
The embed processor generates vector embeddings from text, image, or multimodal content. It reads the chunk arrays produced by the chunk or image_tiling processors and writes a dense vector into each chunk, which makes each chunk searchable via kNN.
Three embedding providers are supported: AWS Bedrock, OpenAI, and a generic HTTP endpoint for self-hosted models.
Syntax
{
  "embed": {
    "field": "chunks",
    "model_id": "amazon.titan-embed-text-v2:0",
    "provider": "bedrock",
    "dimensions": 1024,
    "provider_config": {
      "region": "us-east-2"
    }
  }
}
Architecture
                        ┌──────────────────────┐
                        │   embed processor    │
                        │                      │
chunks[] ──────────────►│ For each chunk:      │──────────► chunks[] + embedding
[text, image_data]      │ 1. Detect content    │            [text, embedding: [...]]
                        │    type (text/img/   │
                        │    multimodal)       │
                        │ 2. Build input       │
                        │ 3. Call provider     │
                        │ 4. Store vector      │
                        └──────────┬───────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    ▼              ▼              ▼
               ┌──────────┐   ┌──────────┐   ┌──────────┐
               │ Bedrock  │   │  OpenAI  │   │   HTTP   │
               │ Provider │   │ Provider │   │ Provider │
               └────┬─────┘   └────┬─────┘   └────┬─────┘
                    │              │              │
               Titan Text     text-embedding Your own
               Titan Multi    -3-small/large model
               Cohere Embed   ada-002        endpoint
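The four steps in the diagram can be sketched as a small loop. This is only an illustration under stated assumptions, not the processor's actual implementation: `embed_chunks`, the `Provider` callable signature, and `fake_provider` are hypothetical names, and the failure handling mirrors the `on_failure_action` parameter described later on this page.

```python
from typing import Callable, Dict, List

# Hypothetical provider signature: takes one model input, returns one vector.
Provider = Callable[[Dict[str, str]], List[float]]


def embed_chunks(chunks: List[dict], provider: Provider,
                 on_failure_action: str = "skip") -> List[dict]:
    """Walk the chunk array and attach an embedding to each eligible chunk."""
    for chunk in chunks:
        # Steps 1-2: keep whichever of text / image_data is present.
        model_input = {k: chunk[k] for k in ("text", "image_data") if chunk.get(k)}
        if not model_input:
            continue  # neither field present: nothing to embed
        try:
            # Steps 3-4: call the provider and store the vector on the chunk.
            chunk["embedding"] = provider(model_input)
        except Exception:
            if on_failure_action == "fail":
                raise  # reject the whole document
            # "skip": leave the chunk without an embedding and carry on
    return chunks


def fake_provider(model_input: Dict[str, str]) -> List[float]:
    """Stand-in provider: fails on the text 'boom', else returns a 2-d vector."""
    if model_input.get("text") == "boom":
        raise RuntimeError("provider error")
    return [float(len(model_input.get("text", ""))),
            float("image_data" in model_input)]


docs = embed_chunks(
    [{"text": "hello"}, {"text": "boom"}, {"chunk_index": 2}],
    fake_provider,
)
```

With the stand-in provider, the first chunk receives a vector, the second is left without one because the provider raised and the action is `skip`, and the third is ignored because it has neither text nor image data.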
Providers
AWS Bedrock
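Calls the Bedrock InvokeModel API in your AWS account, so content is sent only to AWS. To keep the traffic inside your VPC, route it through a Bedrock VPC interface endpoint (AWS PrivateLink).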
| Model | Model ID | Dimensions | Content types |
|---|---|---|---|
| Titan Text Embeddings V2 | amazon.titan-embed-text-v2:0 | 256, 512, 1024 | Text |
| Titan Multimodal Embeddings G1 | amazon.titan-embed-image-v1 | 1024 (fixed) | Text, image, multimodal |
| Cohere Embed English v3 | cohere.embed-english-v3 | 1024 | Text |
| Cohere Embed Multilingual v3 | cohere.embed-multilingual-v3 | 1024 | Text |
Bedrock processes one input per API call (no batch API), so batch_size is ignored for this provider. For large document sets, consider Titan Text v2, which is optimized for throughput.
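To make the one-input-per-call model concrete, the sketch below builds the JSON body for a single InvokeModel request to either Titan model. The field names (`inputText`, `dimensions`, `normalize`, `inputImage`, `embeddingConfig.outputEmbeddingLength`) follow the Titan request formats in the AWS documentation; the helper name `build_titan_body` is hypothetical, and whether the processor sets `normalize` is an assumption.

```python
import json
from typing import Optional


def build_titan_body(model_id: str, text: Optional[str] = None,
                     image_b64: Optional[str] = None,
                     dimensions: int = 1024) -> str:
    """Return the JSON body for one InvokeModel call (one input per call)."""
    if model_id.startswith("amazon.titan-embed-text-v2"):
        if text is None:
            raise ValueError("Titan Text V2 requires text")
        # normalize=True is an assumption made for this sketch.
        body = {"inputText": text, "dimensions": dimensions, "normalize": True}
    elif model_id.startswith("amazon.titan-embed-image-v1"):
        if text is None and image_b64 is None:
            raise ValueError("need text, an image, or both")
        body = {"embeddingConfig": {"outputEmbeddingLength": dimensions}}
        if text is not None:
            body["inputText"] = text
        if image_b64 is not None:
            body["inputImage"] = image_b64  # base64-encoded image bytes
    else:
        raise ValueError(f"unsupported model in this sketch: {model_id}")
    return json.dumps(body)
```

A caller using the AWS SDK would pass the returned string as the `body` argument of the `bedrock-runtime` client's `invoke_model` call and read the vector from the `embedding` field of the response. Cohere models use a different request shape and are left out of this sketch.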
Provider config:
| Key | Required/Optional | Description |
|---|---|---|
| region | Required | AWS region (e.g., us-east-1, us-west-2). |
| access_key | Optional | AWS access key from keystore. Falls back to the default credential chain (instance profile, environment variables). |
| secret_key | Optional | AWS secret key from keystore. |
| session_token | Optional | STS session token for temporary credentials. |
OpenAI
Calls the OpenAI Embeddings API. Supports batch processing (up to 100 inputs per call).
| Model | Model ID | Dimensions |
|---|---|---|
| text-embedding-3-small | text-embedding-3-small | 1536 |
| text-embedding-3-large | text-embedding-3-large | 3072 |
| text-embedding-ada-002 | text-embedding-ada-002 | 1536 |
The OpenAI provider supports text embeddings only. For image or multimodal content, use the Bedrock or HTTP provider.
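As a rough illustration of batching, the sketch below splits chunk texts into request bodies of at most 100 inputs each. `model` and `input` are the request fields of the OpenAI Embeddings API; the function name `openai_batches` and the clamping of `batch_size` to the per-call limit are assumptions made for this example.

```python
from typing import Iterator, List

MAX_INPUTS_PER_CALL = 100  # per-call limit stated above


def openai_batches(texts: List[str], model_id: str,
                   batch_size: int = 50) -> Iterator[dict]:
    """Yield one Embeddings API request body per batch of texts."""
    # Clamp the requested batch size into the range 1..100.
    size = max(1, min(batch_size, MAX_INPUTS_PER_CALL))
    for start in range(0, len(texts), size):
        yield {"model": model_id, "input": texts[start:start + size]}


bodies = list(openai_batches([f"chunk {i}" for i in range(230)],
                             "text-embedding-3-small", batch_size=100))
```

Here 230 chunks produce three request bodies holding 100, 100, and 30 inputs.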
Provider config:
| Key | Required/Optional | Description |
|---|---|---|
| api_key_setting | Required | Keystore setting name (e.g., ingest.content.openai.api_key). Raw API keys are not allowed in pipeline configuration. |
| endpoint | Optional | Custom endpoint for Azure OpenAI or proxies. Default is https://api.openai.com/v1/embeddings. |
HTTP (self-hosted)
Calls any HTTP embedding service. Use this for self-hosted models where data sovereignty requires all inference to stay within your network.
Provider config:
| Key | Required/Optional | Description |
|---|---|---|
| endpoint | Required | POST endpoint URL (e.g., http://embedding-svc.internal:8080/embed). |
| response_embedding_path | Optional | JSONPath to extract embeddings from response. Default is $.embeddings. |
| auth_header | Optional | Authorization header value (e.g., Bearer <token>). |
| max_batch_size | Optional | Maximum inputs per request. Default is 50. |
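The effect of response_embedding_path can be illustrated with a deliberately small path resolver. This sketch handles only simple dotted paths such as `$.embeddings` or `$.data.vectors`; the processor's JSONPath support may be broader, and `extract_by_path` is a hypothetical helper, not part of the product.

```python
from typing import Any


def extract_by_path(response: dict, path: str = "$.embeddings") -> Any:
    """Resolve a dotted path such as $.data.vectors (no wildcards or filters)."""
    if not path.startswith("$"):
        raise ValueError("path must start with $")
    value: Any = response
    # Drop the leading "$" and walk one key at a time.
    for key in path[1:].split("."):
        if key:
            value = value[key]
    return value
```

For a service that nests its vectors, for example under `data.vectors`, setting response_embedding_path to `$.data.vectors` would select the same list that this helper returns.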
Request format:
{
  "inputs": [
    {"text": "hello world"},
    {"image": "<base64>", "image_mime_type": "image/jpeg"},
    {"text": "caption", "image": "<base64>", "image_mime_type": "image/png"}
  ],
  "model": "your-model-id"
}
Expected response:
{
  "embeddings": [[0.1, 0.2, ...], [0.3, 0.4, ...], [0.5, 0.6, ...]]
}
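A self-hosted service only has to honor this request and response contract. The sketch below is a toy stand-in built on the Python standard library, useful for testing a pipeline before a real model is deployed. It assumes one vector is returned per input, in input order; `fake_vector` replaces model inference with a hash, and the names `handle_embed`, `EmbedHandler`, and `serve` are hypothetical.

```python
import hashlib
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List

DIMENSIONS = 8  # toy size; a real model fixes this


def fake_vector(item: dict) -> List[float]:
    """Placeholder for model inference: hash the input into DIMENSIONS floats."""
    digest = hashlib.sha256(json.dumps(item, sort_keys=True).encode()).digest()
    return [b / 255.0 for b in digest[:DIMENSIONS]]


def handle_embed(payload: dict) -> dict:
    """Map the documented request shape to the documented response shape."""
    return {"embeddings": [fake_vector(item) for item in payload["inputs"]]}


class EmbedHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read the JSON request body and answer with the embeddings JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(handle_embed(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the toy service until interrupted."""
    HTTPServer((host, port), EmbedHandler).serve_forever()
```

Calling `serve()` starts the stub on port 8080. The pipeline's dimensions parameter would then need to match `DIMENSIONS`.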
Configuration parameters
| Parameter | Data type | Required/Optional | Description |
|---|---|---|---|
| field | String | Optional | Source field containing the chunk array. Default is chunks. |
| model_id | String | Required | Model identifier (provider-specific). |
| provider | String | Required | Provider name: bedrock, openai, or http. |
| dimensions | Integer | Optional | Embedding vector dimensions. Default is 1536. Must be exactly 1024 for Titan Multimodal. |
| content_type | String | Optional | Provider validation hint: text, image, or multimodal. Actual content type is auto-detected per chunk. Default is text. |
| batch_size | Integer | Optional | Inputs per provider call. Bedrock ignores this (always 1). Default is 50. |
| on_failure_action | String | Optional | skip (log a warning and continue without an embedding, which enables BM25 fallback) or fail (reject the document). Default is skip. |
| block_types | Array | Optional | Chunk types to embed. Non-matching chunks are skipped. Default embeds all types. |
| source_uri_field | String | Optional | For direct image embedding: document field containing the image URI. The image is fetched transiently and not stored. |
| max_image_embed_bytes | Integer | Optional | Maximum image size for transient fetch. Default is 26214400 (25 MB). |
| reference_config | Object | Optional | Reference resolver configuration for source_uri_field. |
| provider_config | Object | Optional | Provider-specific configuration. See provider sections above. |
| description | String | Optional | A brief description of the processor. |
| tag | String | Optional | An identifier tag for the processor. |
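The rules in this table and in the provider sections can be checked before a pipeline is created. The sketch below is a client-side lint under stated assumptions, not the processor's own validation: `validate_embed_config` is a hypothetical helper, and it treats an unset dimensions as the documented default of 1536.

```python
from typing import List

PROVIDERS = {"bedrock", "openai", "http"}
REQUIRED_PROVIDER_KEYS = {
    "bedrock": ["region"],
    "openai": ["api_key_setting"],
    "http": ["endpoint"],
}


def validate_embed_config(cfg: dict) -> List[str]:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    for key in ("model_id", "provider"):
        if not cfg.get(key):
            problems.append(f"missing required parameter: {key}")
    provider = cfg.get("provider")
    if provider and provider not in PROVIDERS:
        problems.append(f"unknown provider: {provider}")
    # Each provider documents its own required provider_config keys.
    provider_config = cfg.get("provider_config", {})
    for key in REQUIRED_PROVIDER_KEYS.get(provider, []):
        if key not in provider_config:
            problems.append(f"provider_config.{key} is required for {provider}")
    if (cfg.get("model_id") == "amazon.titan-embed-image-v1"
            and cfg.get("dimensions", 1536) != 1024):
        problems.append("dimensions must be exactly 1024 for Titan Multimodal")
    if cfg.get("on_failure_action", "skip") not in {"skip", "fail"}:
        problems.append("on_failure_action must be skip or fail")
    return problems
```

Running it on the body of the `embed` object from the syntax example returns an empty list, while a Titan Multimodal config without dimensions or region yields two problems.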
Content type detection
The processor auto-detects the content type of each chunk:
Chunk has text + image_data? → multimodal embedding
Chunk has text only? → text embedding
Chunk has image_data only? → image embedding
Chunk has neither? → skipped
Inline image data (image_data field in chunks) is limited to 5 MB after base64 decoding. For larger images, use source_uri_field for transient fetch.
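The detection rules and the inline size limit above can be written out as two small functions. This is a sketch, assuming that an empty string counts as an absent field and that 5 MB means 5 × 1024 × 1024 bytes; neither detail is confirmed by this page, and both function names are hypothetical.

```python
import base64
from typing import Optional

MAX_INLINE_IMAGE_BYTES = 5 * 1024 * 1024  # 5 MB, assuming binary megabytes


def detect_content_type(chunk: dict) -> Optional[str]:
    """Return multimodal, text, or image; None means the chunk is skipped."""
    has_text = bool(chunk.get("text"))
    has_image = bool(chunk.get("image_data"))
    if has_text and has_image:
        return "multimodal"
    if has_text:
        return "text"
    if has_image:
        return "image"
    return None


def inline_image_ok(image_data: str) -> bool:
    """Check the size of the image after base64 decoding, as the limit requires."""
    return len(base64.b64decode(image_data)) <= MAX_INLINE_IMAGE_BYTES
```

Because the limit applies to the decoded bytes, a base64 string can be roughly a third longer than 5 MB and still pass.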
Output structure
The processor adds an embedding field to each eligible chunk:
{
  "chunks": [
    {
      "text": "First chunk of text...",
      "chunk_index": 0,
      "embedding": [0.0123, -0.0456, 0.0789, ...]
    },
    {
      "text": "Second chunk...",
      "chunk_index": 1,
      "embedding": [0.0234, -0.0567, 0.0891, ...]
    }
  ]
}
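Because the skip failure action leaves a chunk without an embedding, it can be useful to check an ingested document for gaps. The helper below is a hypothetical client-side check, not part of the processor; it reports chunks whose embedding is absent or has the wrong length.

```python
from typing import List


def chunks_missing_embedding(doc: dict, dimensions: int,
                             field: str = "chunks") -> List[int]:
    """Return chunk_index values whose embedding is absent or the wrong length."""
    missing = []
    for position, chunk in enumerate(doc.get(field, [])):
        vector = chunk.get("embedding")
        if not isinstance(vector, list) or len(vector) != dimensions:
            # Fall back to the array position when chunk_index is not set.
            missing.append(chunk.get("chunk_index", position))
    return missing


doc = {
    "chunks": [
        {"text": "First chunk", "chunk_index": 0, "embedding": [0.1, 0.2, 0.3]},
        {"text": "Second chunk", "chunk_index": 1},
        {"text": "Third chunk", "embedding": [0.1, 0.2]},
    ]
}
gaps = chunks_missing_embedding(doc, dimensions=3)
```

In this example the second chunk has no vector and the third has a vector of the wrong length, so both are reported.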
When using source_uri_field, the embedding is stored at the document root:
{
  "source_uri": "s3://bucket/image.jpg",
  "embedding": [0.0123, -0.0456, 0.0789, ...]
}
Security
Raw API keys are not allowed in pipeline configuration. Use the Lucenia keystore to store credentials securely.
# Store an OpenAI API key
bin/lucenia-keystore add ingest.content.openai.api_key
# Store AWS credentials (alternative to instance profile)
bin/lucenia-keystore add ingest.content.bedrock.access_key
bin/lucenia-keystore add ingest.content.bedrock.secret_key
Then reference the keystore setting in your pipeline:
{
  "embed": {
    "model_id": "text-embedding-3-small",
    "provider": "openai",
    "provider_config": {
      "api_key_setting": "ingest.content.openai.api_key"
    }
  }
}
Using the processor
Example 1: Text embeddings with Bedrock Titan
PUT _ingest/pipeline/text-embed
{
  "processors": [
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}
Example 2: Multimodal embeddings with Titan Multimodal
PUT _ingest/pipeline/multimodal-embed
{
  "processors": [
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-image-v1",
        "provider": "bedrock",
        "dimensions": 1024,
        "content_type": "multimodal",
        "provider_config": {
          "region": "us-east-1"
        }
      }
    }
  ]
}
Example 3: Self-hosted model via HTTP
PUT _ingest/pipeline/self-hosted-embed
{
  "processors": [
    {
      "embed": {
        "field": "chunks",
        "model_id": "sentence-transformers/all-MiniLM-L6-v2",
        "provider": "http",
        "dimensions": 384,
        "provider_config": {
          "endpoint": "http://embedding-service.internal:8080/embed",
          "max_batch_size": 32
        }
      }
    }
  ]
}
Example 4: Direct image embedding from S3
Embed an image without storing it in the document: the image is fetched transiently from S3, embedded, and then discarded.
PUT _ingest/pipeline/image-embed
{
  "processors": [
    {
      "embed": {
        "source_uri_field": "image_uri",
        "model_id": "amazon.titan-embed-image-v1",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": {
          "region": "us-east-1"
        }
      }
    }
  ]
}
PUT /images/_doc/1?pipeline=image-embed
{
  "image_uri": "s3://my-bucket/photos/landscape.jpg",
  "title": "Mountain landscape"
}