Version: 0.11.1

End-to-end content processing

Lucenia provides a complete content processing pipeline built directly into its ingest framework. Raw documents — PDFs, Word files, HTML pages, images, and even satellite imagery — are transformed into vector-searchable content through a series of composable processors that you configure as a single ingest pipeline.

Pipeline overview

A typical content processing pipeline chains the following processors:

Document ──► content_extract ──► chunk ──► embed ──► Index
                  │
                  ├──► ocr (for images/charts)
                  └──► image_tiling (for GeoTIFF/satellite imagery)

Each processor reads the output of the previous one. You define the entire pipeline in a single API call, and Lucenia handles the orchestration.

Content extraction

The content_extract processor extracts structured content blocks from documents in a wide range of formats.

Supported formats:

Category	Formats
Documents	PDF, Word (DOCX), Excel (XLSX), PowerPoint (PPTX)
Web	HTML
Text	Plain text, Markdown, JSON
Images	JPEG, PNG, TIFF, GeoTIFF (with geospatial metadata)

Input modes:

Mode	Description	Use case
`reference`	Stream from S3 or HTTPS URI	Large documents in cloud storage (recommended)
`inline`	Text embedded directly in the request	Small text payloads via API
`stream`	Multipart binary upload	Direct upload from client applications
`attachment`	Base64-encoded content	Legacy compatibility

The reference mode is recommended for most use cases: the cluster streams content directly from S3 or HTTPS, so document data is never uploaded through the API. It works with both private and public S3 buckets across multiple regions.

Chunking

The chunk processor splits extracted content into smaller, overlapping segments optimized for vector search and RAG. Four algorithms are available:

Algorithm	Strategy	Best for
recursive (default)	Splits by paragraphs, then sentences, then words	General-purpose documents
fixed	Splits at fixed character intervals with overlap	Uniform chunk sizes
semantic	Splits where embedding similarity drops below a threshold	Topic-coherent chunks
topic_shift	Splits where vocabulary overlap between windows drops	Detecting topic boundaries
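
The fixed strategy in the table above can be pictured as a sliding window over the text. The following Python sketch is only an illustration, not Lucenia's implementation: the function name fixed_chunks is hypothetical, and it assumes chunk_size and chunk_overlap are both measured in characters.

```python
def fixed_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window shares chunk_overlap characters with the previous one.
    Illustrative only: assumes sizes are counted in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 5000-character text with the sizes used later on this page.
sample = "x" * 5000
print(len(fixed_chunks(sample, chunk_size=2000, chunk_overlap=200)))
```

The overlap means that a sentence cut by one window boundary still appears whole in the neighboring chunk, at the cost of some duplicated text in the index.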

The semantic and topic_shift algorithms use embedding similarity and vocabulary analysis, respectively, to find natural breakpoints in the text, so each chunk tends to cover a single topic and is more useful for retrieval.
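
To make the vocabulary-overlap idea concrete, here is a rough Python sketch of a topic_shift-style splitter. It is an assumption-laden illustration rather than Lucenia's algorithm: the function names, the use of Jaccard similarity as the overlap measure, and the window and threshold parameters are all hypothetical.

```python
import re


def jaccard(a: set[str], b: set[str]) -> float:
    """Vocabulary overlap between two word sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0  # no words on either side: treat as no evidence of a shift
    return len(a & b) / len(a | b)


def topic_shift_chunks(
    sentences: list[str], window: int = 1, threshold: float = 0.1
) -> list[list[str]]:
    """Start a new chunk where the overlap between adjacent windows is low.

    window: number of sentences on each side of a candidate boundary.
    threshold: overlap below this value is treated as a topic boundary.
    """
    vocab = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    chunks: list[list[str]] = []
    current: list[str] = []
    for i, sentence in enumerate(sentences):
        if current:
            left = set().union(*vocab[max(0, i - window):i])
            right = set().union(*vocab[i:i + window])
            if jaccard(left, right) < threshold:
                chunks.append(current)
                current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Bond yields rose sharply today.",
    "Investors sold bond funds.",
]
for chunk in topic_shift_chunks(sentences):
    print(chunk)
```

A semantic splitter would follow the same loop but compare embedding vectors (for example with cosine similarity) instead of word sets.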

Embedding

The embed processor generates vector embeddings from text, image, or multimodal content. Three embedding providers are supported:

Provider	Models	Data privacy
AWS Bedrock	Titan Text v2, Titan Multimodal G1, Cohere Embed v3	Data stays in your VPC
OpenAI	text-embedding-3-small, text-embedding-3-large, ada-002	Sent to OpenAI API
HTTP (self-hosted)	Any model with a REST endpoint	Fully private

For maximum privacy, use Bedrock (data never leaves your VPC) or a self-hosted model via the HTTP provider.

OCR

The ocr processor extracts text from images, charts, diagrams, and tables using LLM-powered inference. Unlike traditional OCR engines, it understands the semantic structure of visual content — extracting table data as structured text, reading diagram labels, and interpreting chart values.

Image tiling

The image_tiling processor splits large images into a grid of tiles for multimodal search. For geospatial imagery (Cloud Optimized GeoTIFFs), it extracts spatial metadata and indexes each tile with its geographic bounding box, enabling spatially aware image search.
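
The relationship between a tile grid and per-tile bounding boxes can be sketched in a few lines of Python. This is a simplified illustration, not the processor's actual logic: the Tile class, the tile_image function, and the tile_size parameter are hypothetical, and the sketch assumes a north-up image with no rotation whose bounds are given as (west, south, east, north).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    row: int
    col: int
    pixel_box: tuple[int, int, int, int]        # (x0, y0, x1, y1), x1/y1 exclusive
    geo_box: tuple[float, float, float, float]  # (west, south, east, north)


def tile_image(
    width: int,
    height: int,
    tile_size: int,
    bounds: tuple[float, float, float, float],
) -> list[Tile]:
    """Cut a width x height image into square tiles and map each to a bbox.

    Assumes a north-up image: pixel row 0 is the northern edge.
    Tiles on the right and bottom edges may be smaller than tile_size.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    west, south, east, north = bounds
    x_res = (east - west) / width     # geographic units per pixel, horizontally
    y_res = (north - south) / height  # geographic units per pixel, vertically

    tiles = []
    for row, y0 in enumerate(range(0, height, tile_size)):
        for col, x0 in enumerate(range(0, width, tile_size)):
            x1 = min(x0 + tile_size, width)
            y1 = min(y0 + tile_size, height)
            geo_box = (
                west + x0 * x_res,   # west edge of the tile
                north - y1 * y_res,  # south edge (pixel y grows southward)
                west + x1 * x_res,   # east edge
                north - y0 * y_res,  # north edge
            )
            tiles.append(Tile(row, col, (x0, y0, x1, y1), geo_box))
    return tiles


for tile in tile_image(1200, 500, 500, (-10.0, 40.0, -8.0, 41.0)):
    print(tile.row, tile.col, tile.pixel_box, tile.geo_box)
```

Storing a bounding box with every tile is what lets a query restrict image results to a geographic area before any vector comparison takes place.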

Rerank preparation

The rerank_prepare processor annotates chunks with metadata and position scores needed by the downstream multimodal rerank search processor. This enables the search pipeline to rerank results using the full context of the original document.

Example: Full pipeline from PDF to searchable vectors

PUT _ingest/pipeline/ai-retrieval
{
  "description": "Extract, chunk, and embed PDF documents",
  "processors": [
    {
      "content_extract": {
        "field": "content",
        "target_field": "extracted",
        "input_mode": "reference",
        "source_uri_field": "source_uri"
      }
    },
    {
      "chunk": {
        "field": "extracted.blocks",
        "target_field": "chunks",
        "algorithm": "recursive",
        "chunk_size": 2000,
        "chunk_overlap": 200
      }
    },
    {
      "embed": {
        "field": "chunks",
        "model_id": "amazon.titan-embed-text-v2:0",
        "provider": "bedrock",
        "dimensions": 1024,
        "provider_config": {
          "region": "us-east-2"
        }
      }
    }
  ]
}

Then ingest a document:

PUT /knowledge-base/_doc/1?pipeline=ai-retrieval
{
  "title": "Quarterly Report Q4 2024",
  "source_uri": "s3://my-docs-bucket/reports/q4-2024.pdf"
}

Lucenia streams the PDF from S3, extracts the text and images, chunks the content into overlapping segments, generates vector embeddings for each chunk, and indexes everything — all in a single request.
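
The two requests above can also be issued from a script. The Python sketch below builds them with the standard library only. It assumes a cluster reachable at http://localhost:9200 without authentication, which is an assumption for illustration and not something this page specifies; the helper names put_json and send are hypothetical.

```python
import json
import urllib.request

# Assumption: a local cluster with no TLS or authentication.
BASE_URL = "http://localhost:9200"

# Same pipeline definition as the console example above.
PIPELINE = {
    "description": "Extract, chunk, and embed PDF documents",
    "processors": [
        {"content_extract": {
            "field": "content",
            "target_field": "extracted",
            "input_mode": "reference",
            "source_uri_field": "source_uri",
        }},
        {"chunk": {
            "field": "extracted.blocks",
            "target_field": "chunks",
            "algorithm": "recursive",
            "chunk_size": 2000,
            "chunk_overlap": 200,
        }},
        {"embed": {
            "field": "chunks",
            "model_id": "amazon.titan-embed-text-v2:0",
            "provider": "bedrock",
            "dimensions": 1024,
            "provider_config": {"region": "us-east-2"},
        }},
    ],
}

DOCUMENT = {
    "title": "Quarterly Report Q4 2024",
    "source_uri": "s3://my-docs-bucket/reports/q4-2024.pdf",
}


def put_json(path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request with a JSON body."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{path.lstrip('/')}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


create_pipeline = put_json("_ingest/pipeline/ai-retrieval", PIPELINE)
index_document = put_json("knowledge-base/_doc/1?pipeline=ai-retrieval", DOCUMENT)
```

Calling send(create_pipeline) followed by send(index_document) would perform the same two steps as the console requests. Building the requests separately from sending them makes it easy to inspect the payloads before anything reaches the cluster.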
