End-to-end content processing
Lucenia provides a complete content processing pipeline built directly into its ingest framework. Raw documents — PDFs, Word files, HTML pages, images, and even satellite imagery — are transformed into vector-searchable content through a series of composable processors that you configure as a single ingest pipeline.
Pipeline overview
A typical content processing pipeline chains the following processors:
Document ──► content_extract ──► chunk ──► embed ──► Index
│
├──► ocr (for images/charts)
└──► image_tiling (for GeoTIFF/satellite imagery)
Each processor reads the output of the previous one. You define the entire pipeline in a single API call, and Lucenia handles the orchestration.
Content extraction
The content_extract processor extracts structured content blocks from documents in a wide range of formats.
Supported formats:
| Category | Formats |
|---|---|
| Documents | PDF, Word (DOCX), Excel (XLSX), PowerPoint (PPTX) |
| Web | HTML |
| Text | Plain text, Markdown, JSON |
| Images | JPEG, PNG, TIFF, GeoTIFF (with geospatial metadata) |
Input modes:
| Mode | Description | Use case |
|---|---|---|
| reference | Stream from S3 or HTTPS URI | Large documents in cloud storage (recommended) |
| inline | Text embedded directly in the request | Small text payloads via API |
| stream | Multipart binary upload | Direct upload from client applications |
| attachment | Base64-encoded content | Legacy compatibility |
The reference mode is recommended for most use cases: the cluster streams content directly from S3 or HTTPS, so the document itself is never uploaded through the API. It works with both private and public S3 buckets, including buckets in different regions.
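As a rough illustration of reference mode from the client side, the sketch below builds a document body that only points at remote content. The `source_uri` field name is taken from the pipeline example in this section; the helper name `build_reference_doc` and the scheme check are assumptions made for illustration, not part of Lucenia's API.

```python
from urllib.parse import urlparse

# Schemes this section names for reference mode.
REFERENCE_SCHEMES = {"s3", "https"}


def build_reference_doc(title: str, source_uri: str) -> dict:
    """Return an ingest document body that points at remote content.

    Hypothetical client-side helper: it rejects URIs whose scheme is not
    S3 or HTTPS before any request is sent.
    """
    parsed = urlparse(source_uri)
    if parsed.scheme not in REFERENCE_SCHEMES:
        raise ValueError(f"unsupported scheme for reference mode: {parsed.scheme!r}")
    if not parsed.netloc:
        raise ValueError("source URI is missing a bucket or host")
    # Only the pointer travels through the API, not the file contents.
    return {"title": title, "source_uri": source_uri}


doc = build_reference_doc(
    "Quarterly Report Q4 2024",
    "s3://my-docs-bucket/reports/q4-2024.pdf",
)
```

Validating the URI on the client is optional; it simply surfaces a typo in the scheme before the cluster tries to fetch the content.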
Chunking
The chunk processor splits extracted content into smaller, overlapping segments optimized for vector search and RAG. Four algorithms are available:
| Algorithm | Strategy | Best for |
|---|---|---|
| recursive (default) | Splits by paragraphs, then sentences, then words | General-purpose documents |
| fixed | Splits at fixed character intervals with overlap | Uniform chunk sizes |
| semantic | Splits where embedding similarity drops below threshold | Topic-coherent chunks |
| topic_shift | Splits where vocabulary overlap between windows drops | Detecting topic boundaries |
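To make the overlap behaviour of the fixed strategy concrete, here is a minimal sketch of fixed-interval splitting. It is a generic illustration rather than Lucenia's implementation; the parameter names mirror `chunk_size` and `chunk_overlap` from the pipeline example in this section, and the function name is hypothetical.

```python
def fixed_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


small = fixed_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
# small == ['abcd', 'defg', 'ghij']
```

With the default values of 2000 and 200, each new chunk begins 1800 characters after the previous one.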
The semantic and topic_shift algorithms use embedding similarity and vocabulary analysis respectively to find natural breakpoints in the text, producing chunks that are more meaningful for retrieval.
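Both strategies can be read as the same idea: score how similar each unit of text is to the next one, then split wherever the score falls below a threshold. The sketch below illustrates that idea under simplifying assumptions: sentences stand in for windows, vocabulary overlap is measured with Jaccard similarity, and embedding similarity with cosine similarity. The function names and the threshold value are illustrative and are not taken from Lucenia.

```python
import math


def jaccard(a: set[str], b: set[str]) -> float:
    """Vocabulary overlap between two word sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def breakpoints(similarities: list[float], threshold: float) -> list[int]:
    """Return each index i where a split falls between unit i and unit i + 1."""
    return [i for i, s in enumerate(similarities) if s < threshold]


def topic_shift_breaks(sentences: list[str], threshold: float = 0.1) -> list[int]:
    """Split where vocabulary overlap between adjacent sentences drops."""
    vocab = [set(s.lower().split()) for s in sentences]
    sims = [jaccard(vocab[i], vocab[i + 1]) for i in range(len(vocab) - 1)]
    return breakpoints(sims, threshold)


sentences = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices rose sharply today",
    "stock markets rose again",
]
breaks = topic_shift_breaks(sentences)
# breaks == [1]: the only split falls between the second and third sentence
```

For the semantic variant, the same `breakpoints` helper would be applied to cosine similarities between the embeddings of adjacent units instead of Jaccard scores.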
Embedding
The embed processor generates vector embeddings from text, image, or multimodal content. Three embedding providers are supported:
| Provider | Models | Data privacy |
|---|---|---|
| AWS Bedrock | Titan Text v2, Titan Multimodal G1, Cohere Embed v3 | Data stays in your VPC |
| OpenAI | text-embedding-3-small, text-embedding-3-large, ada-002 | Sent to OpenAI API |
| HTTP (self-hosted) | Any model with a REST endpoint | Fully private |
For maximum privacy, use Bedrock (data never leaves your VPC) or a self-hosted model via the HTTP provider.
OCR
The ocr processor extracts text from images, charts, diagrams, and tables using LLM-powered inference. Unlike traditional OCR engines, it understands the semantic structure of visual content — extracting table data as structured text, reading diagram labels, and interpreting chart values.
Image tiling
The image_tiling processor splits large images into a grid of tiles for multimodal search. For geospatial imagery (Cloud Optimized GeoTIFFs), it extracts spatial metadata and indexes each tile with its geographic bounding box, enabling spatially aware image search.
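The relationship between a tile's pixel window and its geographic bounding box can be sketched as follows. This is a simplified illustration, not the processor's actual logic: it assumes a north-up, unrotated image whose bounds are given as (west, south, east, north) in a geographic coordinate system, and every name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    col: int
    row: int
    pixel_window: tuple[int, int, int, int]  # (x0, y0, x1, y1), x1 and y1 exclusive
    bbox: tuple[float, float, float, float]  # (west, south, east, north)


def tile_grid(width: int, height: int, tile_size: int,
              bounds: tuple[float, float, float, float]) -> list[Tile]:
    """Cut a width x height image into tiles and give each one a bounding box.

    Assumes pixel row 0 lies on the northern edge of the image.
    """
    if width <= 0 or height <= 0 or tile_size <= 0:
        raise ValueError("width, height and tile_size must be positive")

    west, south, east, north = bounds
    units_per_px_x = (east - west) / width
    units_per_px_y = (north - south) / height

    tiles = []
    for row, y0 in enumerate(range(0, height, tile_size)):
        y1 = min(y0 + tile_size, height)  # edge tiles may be smaller
        for col, x0 in enumerate(range(0, width, tile_size)):
            x1 = min(x0 + tile_size, width)
            bbox = (
                west + x0 * units_per_px_x,   # tile west edge
                north - y1 * units_per_px_y,  # tile south edge
                west + x1 * units_per_px_x,   # tile east edge
                north - y0 * units_per_px_y,  # tile north edge
            )
            tiles.append(Tile(col, row, (x0, y0, x1, y1), bbox))
    return tiles


tiles = tile_grid(1000, 500, 500, (-10.0, 40.0, -8.0, 41.0))
```

Storing a bounding box per tile is what lets a query filter tiles by location before, or alongside, vector similarity.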
Rerank preparation
The rerank_prepare processor annotates chunks with metadata and position scores needed by the downstream multimodal rerank search processor. This enables the search pipeline to rerank results using the full context of the original document.
Example: Full pipeline from PDF to searchable vectors
PUT _ingest/pipeline/ai-retrieval
{
  "description": "Extract, chunk, and embed PDF documents",
  "processors": [
    {
      "content_extract": {
        "field": "content",
        "target_field": "extracted",
        "input_mode": "reference",
        "source_uri_field": "source_uri"
      }
    },
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "recursive",
        "chunk_size": 2000,
        "chunk_overlap": 200
      }
    },
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}
Then ingest a document:
PUT /knowledge-base/_doc/1?pipeline=ai-retrieval
{
  "title": "Quarterly Report Q4 2024",
  "source_uri": "s3://my-docs-bucket/reports/q4-2024.pdf"
}
Lucenia streams the PDF from S3, extracts the text and images, chunks the content into overlapping segments, generates vector embeddings for each chunk, and indexes everything — all in a single request.
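If you drive the cluster from a script rather than a console, the ingest request above might be prepared as in the sketch below. It uses only the Python standard library and builds the request without sending it. The base URL `http://localhost:9200` and the helper name `put_json` are assumptions; authentication and TLS settings are omitted and will depend on your deployment.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: replace with your cluster endpoint


def put_json(path: str, body: dict) -> Request:
    """Build (but do not send) a PUT request carrying a JSON body."""
    return Request(
        url=f"{BASE_URL}/{path.lstrip('/')}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Same path and body as the ingest request shown above.
ingest = put_json(
    "/knowledge-base/_doc/1?pipeline=ai-retrieval",
    {
        "title": "Quarterly Report Q4 2024",
        "source_uri": "s3://my-docs-bucket/reports/q4-2024.pdf",
    },
)

# To send it against a running cluster:
#   from urllib.request import urlopen
#   with urlopen(ingest) as response:
#       print(response.status)
```

The pipeline definition can be registered the same way by passing `_ingest/pipeline/ai-retrieval` and the pipeline body to `put_json`.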