Version: 0.11.0

End-to-end content processing

Lucenia provides a complete content processing pipeline built directly into its ingest framework. Raw documents — PDFs, Word files, HTML pages, images, and even satellite imagery — are transformed into vector-searchable content through a series of composable processors that you configure as a single ingest pipeline.

Pipeline overview

A typical content processing pipeline chains the following processors:

Document ──► content_extract ──► chunk ──► embed ──► Index

                  ├──► ocr (for images/charts)
                  └──► image_tiling (for GeoTIFF/satellite imagery)

Each processor reads the output of the previous one. You define the entire pipeline in a single API call, and Lucenia handles the orchestration.
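
The chaining described above can be pictured with a small Python sketch. It is only an illustration of processors running in sequence, not Lucenia's orchestration code; the `extract`, `chunk`, and `embed` functions are hypothetical stand-ins, and the "embedding" is a toy value.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]
Processor = Callable[[Doc], Doc]


def run_pipeline(doc: Doc, processors: List[Processor]) -> Doc:
    """Apply each processor in order; each one sees the previous one's output."""
    for processor in processors:
        doc = processor(doc)
    return doc


# Hypothetical stand-ins for content_extract, chunk, and embed.
def extract(doc: Doc) -> Doc:
    return {**doc, "extracted": str(doc["content"]).strip()}


def chunk(doc: Doc) -> Doc:
    return {**doc, "chunks": str(doc["extracted"]).split(". ")}


def embed(doc: Doc) -> Doc:
    # Toy "embedding": the length of each chunk, purely for illustration.
    return {**doc, "vectors": [[float(len(c))] for c in doc["chunks"]]}


result = run_pipeline(
    {"content": "  First point. Second point  "},
    [extract, chunk, embed],
)
print(result["chunks"])   # ['First point', 'Second point']
print(result["vectors"])  # [[11.0], [12.0]]
```

Each stage writes a new field and leaves earlier fields in place, which mirrors the field / target_field pattern in the pipeline example later on this page.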

Content extraction

The content_extract processor extracts structured content blocks from documents in a wide range of formats.

Supported formats:

| Category | Formats |
| --- | --- |
| Documents | PDF, Word (DOCX), Excel (XLSX), PowerPoint (PPTX) |
| Web | HTML |
| Text | Plain text, Markdown, JSON |
| Images | JPEG, PNG, TIFF, GeoTIFF (with geospatial metadata) |

Input modes:

| Mode | Description | Use case |
| --- | --- | --- |
| reference | Stream from S3 or HTTPS URI | Large documents in cloud storage (recommended) |
| inline | Text embedded directly in the request | Small text payloads via API |
| stream | Multipart binary upload | Direct upload from client applications |
| attachment | Base64-encoded content | Legacy compatibility |

The reference mode is recommended for most use cases: the cluster streams content directly from S3 or an HTTPS URI, so the data never has to be uploaded through the API. It works with both private and public S3 buckets, including buckets in multiple regions.

Chunking

The chunk processor splits extracted content into smaller, overlapping segments sized for vector search and retrieval-augmented generation (RAG). Four algorithms are available:

| Algorithm | Strategy | Best for |
| --- | --- | --- |
| recursive (default) | Splits by paragraphs, then sentences, then words | General-purpose documents |
| fixed | Splits at fixed character intervals with overlap | Uniform chunk sizes |
| semantic | Splits where embedding similarity drops below threshold | Topic-coherent chunks |
| topic_shift | Splits where vocabulary overlap between windows drops | Detecting topic boundaries |
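
To make the fixed strategy concrete, here is a minimal Python sketch of fixed-size chunking with overlap. It shows the general technique only and is not Lucenia's implementation; the function name `fixed_chunks` is hypothetical, and the parameter names are borrowed from the pipeline example on this page.

```python
from typing import List


def fixed_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = fixed_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']

# With chunk_size=2000 and chunk_overlap=200, each window advances
# 1800 characters, so a 5000-character text yields 3 chunks.
print(len(fixed_chunks("x" * 5000, chunk_size=2000, chunk_overlap=200)))  # 3
```

The overlap means a sentence cut by one window boundary still appears whole in the neighbouring chunk, at the cost of indexing some text twice.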

The semantic and topic_shift algorithms use embedding similarity and vocabulary analysis, respectively, to find natural breakpoints in the text, so each chunk tends to stay on a single topic, which makes it more useful for retrieval.
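
The vocabulary-overlap idea can be sketched in a few lines of Python. This is an illustrative approximation, not Lucenia's algorithm: it assumes each "window" is a single sentence, measures overlap with Jaccard similarity, and uses a hypothetical `threshold` parameter.

```python
import re
from typing import List, Set


def vocabulary(sentence: str) -> Set[str]:
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared words divided by all distinct words across both sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def topic_shift_chunks(sentences: List[str], threshold: float) -> List[List[str]]:
    """Start a new chunk when overlap with the previous sentence drops below threshold."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        if current and jaccard(vocabulary(current[-1]), vocabulary(sentence)) < threshold:
            chunks.append(current)
            current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "The rocket engine burns liquid fuel.",
    "The engine burns fuel at high pressure.",
    "Sourdough bread needs a long proof.",
    "A long proof gives the bread more flavor.",
]
groups = topic_shift_chunks(sentences, threshold=0.2)
print(len(groups))  # 2: the rocket sentences, then the bread sentences
```

A semantic splitter would follow the same loop but compare embedding vectors of adjacent windows instead of word sets.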

Embedding

The embed processor generates vector embeddings from text, image, or multimodal content. Three embedding providers are supported:

| Provider | Models | Data privacy |
| --- | --- | --- |
| AWS Bedrock | Titan Text v2, Titan Multimodal G1, Cohere Embed v3 | Data stays in your VPC |
| OpenAI | text-embedding-3-small, text-embedding-3-large, ada-002 | Sent to OpenAI API |
| HTTP (self-hosted) | Any model with a REST endpoint | Fully private |

For maximum privacy, use Bedrock (data never leaves your VPC) or a self-hosted model via the HTTP provider.

OCR

The ocr processor extracts text from images, charts, diagrams, and tables using LLM-powered inference. Unlike traditional OCR engines, it understands the semantic structure of visual content — extracting table data as structured text, reading diagram labels, and interpreting chart values.

Image tiling

The image_tiling processor splits large images into a grid of tiles for multimodal search. For geospatial imagery (Cloud Optimized GeoTIFFs), it extracts spatial metadata and indexes each tile with its geographic bounding box, enabling spatially-aware image search.
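
As a rough picture of what indexing each tile with its geographic bounding box involves, the Python sketch below computes a tile grid and a per-tile bounding box. It is not Lucenia's code: it assumes a north-up, unrotated image whose pixels map linearly onto the overall bounds, and the names `Tile` and `tile_image` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tile:
    row: int
    col: int
    pixel_box: Tuple[int, int, int, int]        # (left, top, right, bottom)
    geo_box: Tuple[float, float, float, float]  # (west, south, east, north)


def tile_image(
    width: int,
    height: int,
    tile_size: int,
    bounds: Tuple[float, float, float, float],
) -> List[Tile]:
    """Cut a width x height image into tiles and map each tile to geographic bounds."""
    west, south, east, north = bounds
    units_per_px_x = (east - west) / width
    units_per_px_y = (north - south) / height

    tiles: List[Tile] = []
    for row, top in enumerate(range(0, height, tile_size)):
        bottom = min(top + tile_size, height)  # edge tiles may be smaller
        for col, left in enumerate(range(0, width, tile_size)):
            right = min(left + tile_size, width)
            geo_box = (
                west + left * units_per_px_x,
                north - bottom * units_per_px_y,  # pixel rows grow southward
                west + right * units_per_px_x,
                north - top * units_per_px_y,
            )
            tiles.append(Tile(row, col, (left, top, right, bottom), geo_box))
    return tiles


tiles = tile_image(width=1000, height=500, tile_size=500, bounds=(10.0, 40.0, 12.0, 41.0))
for tile in tiles:
    print(tile.row, tile.col, tile.pixel_box, tile.geo_box)
```

With a bounding box stored on every tile, a spatial query only needs to intersect the search area with those boxes to find candidate tiles.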

Rerank preparation

The rerank_prepare processor annotates chunks with metadata and position scores needed by the downstream multimodal rerank search processor. This enables the search pipeline to rerank results using the full context of the original document.
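
The exact annotations are not listed here, so the following Python sketch is purely hypothetical: the field names (`doc_id`, `chunk_index`, `chunk_count`, `position_score`) and the linear scoring formula are assumptions chosen to show the general shape of position-aware chunk metadata, not the processor's actual output.

```python
from typing import Dict, List


def annotate_chunks(doc_id: str, chunks: List[str]) -> List[Dict[str, object]]:
    """Attach document and position metadata to each chunk (hypothetical schema)."""
    total = len(chunks)
    annotated: List[Dict[str, object]] = []
    for index, text in enumerate(chunks):
        # Assumed score: 1.0 for the first chunk, falling linearly to 0.0 for the last.
        score = 1.0 if total == 1 else 1.0 - index / (total - 1)
        annotated.append(
            {
                "text": text,
                "doc_id": doc_id,
                "chunk_index": index,
                "chunk_count": total,
                "position_score": score,
            }
        )
    return annotated


annotated = annotate_chunks("report-q4", ["Summary", "Details", "Appendix"])
print([a["position_score"] for a in annotated])  # [1.0, 0.5, 0.0]
```

Keeping the document identifier and chunk position alongside the text is what lets a later reranking step relate a matched chunk back to the document it came from.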

Example: Full pipeline from PDF to searchable vectors

PUT _ingest/pipeline/ai-retrieval
{
  "description": "Extract, chunk, and embed PDF documents",
  "processors": [
    {
      "content_extract": {
        "field": "content",
        "target_field": "extracted",
        "input_mode": "reference",
        "source_uri_field": "source_uri"
      }
    },
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "recursive",
        "chunk_size": 2000,
        "chunk_overlap": 200
      }
    },
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}

Then ingest a document:

PUT /knowledge-base/_doc/1?pipeline=ai-retrieval
{
  "title": "Quarterly Report Q4 2024",
  "source_uri": "s3://my-docs-bucket/reports/q4-2024.pdf"
}

Lucenia streams the PDF from S3, extracts the text and images, chunks the content into overlapping segments, generates vector embeddings for each chunk, and indexes everything — all in a single request.
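
For client code, the ingest request above could be issued from Python roughly as follows. This is a sketch using only the standard library; the base URL `http://localhost:9200` is an assumption about where the cluster listens, `build_put` is a hypothetical helper, and authentication and TLS settings are left out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: adjust to your cluster's address


def build_put(path: str, body: dict) -> urllib.request.Request:
    """Build a JSON PUT request for the given API path."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{path.lstrip('/')}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


ingest = build_put(
    "/knowledge-base/_doc/1?pipeline=ai-retrieval",
    {
        "title": "Quarterly Report Q4 2024",
        "source_uri": "s3://my-docs-bucket/reports/q4-2024.pdf",
    },
)
print(ingest.get_method(), ingest.full_url)

# To actually send it against a running cluster:
# with urllib.request.urlopen(ingest) as response:
#     print(response.status, response.read().decode("utf-8"))
```

The request is built but not sent, so the snippet runs without a cluster; uncomment the last lines once the base URL and credentials match your environment.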