Version: 0.11.0

Extensibility and custom integrations

Lucenia's AI retrieval pipeline is built on a service provider interface (SPI) architecture that lets you extend every stage of the pipeline with custom implementations. Whether you need to extract content from a proprietary format, connect to an in-house embedding model, or process specialized imagery, you can plug into the pipeline without modifying core code.

Extension points

The following SPI interfaces are available for customization:

| Interface | Purpose | Example |
| --- | --- | --- |
| ContentExtractor | Extract content from custom document formats | NITF news format, proprietary internal formats |
| RasterSource | Read and decode custom raster/image formats | Specialized satellite imagery, medical imaging |
| EmbeddingProvider | Generate embeddings from custom model endpoints | Private fine-tuned models, on-premise GPU clusters |
| InferenceProvider | Run inference against custom LLM endpoints | Private LLMs for OCR, reranking, or classification |

How it works

Each extension point follows the same pattern:

  1. Implement the interface — Create a Java class that implements the SPI interface
  2. Register the provider — Package your implementation as a Lucenia plugin with META-INF/services registration
  3. Configure in pipelines — Reference your custom provider by name in ingest or search pipeline configurations
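The three steps above can be sketched with the JDK's standard `java.util.ServiceLoader`, which reads `META-INF/services` files. This is a minimal illustration, not Lucenia's actual API: `NamedProvider`, `NitfProvider`, and the `"nitf"` name are hypothetical stand-ins, and the real SPI interfaces may look different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

class SpiRegistrationSketch {

    // Hypothetical stand-in for a Lucenia SPI interface; the real interfaces may look different.
    public interface NamedProvider {
        String name();
    }

    // Step 1: an implementation that a plugin would ship (illustrative name).
    public static final class NitfProvider implements NamedProvider {
        @Override
        public String name() {
            return "nitf";
        }
    }

    // Step 2: collect every provider listed in a META-INF/services file, keyed by name.
    static Map<String, NamedProvider> discover() {
        Map<String, NamedProvider> byName = new HashMap<>();
        for (NamedProvider provider : ServiceLoader.load(NamedProvider.class)) {
            byName.put(provider.name(), provider);
        }
        return byName;
    }

    public static void main(String[] args) {
        Map<String, NamedProvider> providers = discover();
        // This standalone file ships no services file, so register the provider by hand.
        providers.putIfAbsent("nitf", new NitfProvider());
        // Step 3: a pipeline configuration would reference the provider by this name.
        NamedProvider selected = providers.get("nitf");
        System.out.println("Selected provider: " + selected.name());
    }
}
```

With `ServiceLoader`, the registration file is named after the interface's fully qualified class name and lists one implementation class name per line. The implementation class needs a public no-argument constructor.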

For detailed implementation guides and code examples, see Extending ingest-content.

Custom content extractors

Implement the ContentExtractor interface to add support for any document format. Your extractor receives the raw bytes and MIME type of a document and returns structured ContentBlock objects, which the downstream chunking, embedding, and OCR processors consume like the output of any built-in extractor.

Example use cases:

  • NITF (news industry text format) extraction
  • CAD file metadata extraction
  • Proprietary internal document formats
  • Custom XML/JSON schema parsing
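As a rough sketch of the bytes-in, blocks-out contract described above, the following example splits a made-up plain-text format into one block per paragraph. The `ContentExtractor` and `ContentBlock` types here are simplified, hypothetical stand-ins with assumed method names and fields, and the MIME type is invented; consult the actual SPI for the real signatures.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ContentExtractorSketch {

    // Hypothetical stand-in for ContentBlock; the real class may carry more fields.
    static final class ContentBlock {
        final String type;
        final String text;

        ContentBlock(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Hypothetical stand-in for the ContentExtractor SPI; the real signature may differ.
    interface ContentExtractor {
        boolean supports(String mimeType);

        List<ContentBlock> extract(byte[] rawBytes, String mimeType);
    }

    // Splits a made-up plain-text note format into one block per paragraph.
    static final class NotesExtractor implements ContentExtractor {
        static final String MIME = "application/x-demo-notes"; // invented MIME type

        @Override
        public boolean supports(String mimeType) {
            return MIME.equals(mimeType);
        }

        @Override
        public List<ContentBlock> extract(byte[] rawBytes, String mimeType) {
            if (!supports(mimeType)) {
                throw new IllegalArgumentException("Unsupported MIME type: " + mimeType);
            }
            String text = new String(rawBytes, StandardCharsets.UTF_8);
            List<ContentBlock> blocks = new ArrayList<>();
            // Paragraphs are separated by one or more blank lines.
            for (String paragraph : text.split("\\n\\s*\\n")) {
                String trimmed = paragraph.trim();
                if (!trimmed.isEmpty()) {
                    blocks.add(new ContentBlock("paragraph", trimmed));
                }
            }
            return blocks;
        }
    }

    public static void main(String[] args) {
        byte[] raw = "First paragraph.\n\nSecond paragraph.\n".getBytes(StandardCharsets.UTF_8);
        List<ContentBlock> blocks = new NotesExtractor().extract(raw, NotesExtractor.MIME);
        for (ContentBlock block : blocks) {
            System.out.println(block.type + ": " + block.text);
        }
    }
}
```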

Custom embedding providers

Implement the EmbeddingProvider interface to connect to any embedding model. This is ideal for organizations running fine-tuned models or models with specialized tokenization.

Your provider receives text or image content and returns dense vectors. It plugs directly into both the embed ingest processor and the query embedding search processor, so documents and queries are embedded by the same model at index time and search time.
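To illustrate why a single provider keeps index-time and search-time vectors consistent, here is a toy sketch. The `EmbeddingProvider` interface shown is a hypothetical simplification (text input only, assumed method names), and `HashingProvider` is a deterministic token-hashing stand-in for a real model call, not a meaningful embedding model.

```java
import java.util.Arrays;
import java.util.Locale;

class EmbeddingProviderSketch {

    // Hypothetical stand-in for the EmbeddingProvider SPI; the real signature may differ.
    interface EmbeddingProvider {
        int dimensions();

        float[] embed(String text);
    }

    // Toy provider: hashes tokens into a fixed-size vector, then L2-normalizes it.
    // A real provider would call a model endpoint here instead.
    static final class HashingProvider implements EmbeddingProvider {
        private final int dims;

        HashingProvider(int dims) {
            this.dims = dims;
        }

        @Override
        public int dimensions() {
            return dims;
        }

        @Override
        public float[] embed(String text) {
            float[] vector = new float[dims];
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                vector[Math.floorMod(token.hashCode(), dims)] += 1f;
            }
            double norm = 0;
            for (float x : vector) {
                norm += x * x;
            }
            norm = Math.sqrt(norm);
            if (norm > 0) {
                for (int i = 0; i < dims; i++) {
                    vector[i] /= (float) norm;
                }
            }
            return vector;
        }
    }

    public static void main(String[] args) {
        EmbeddingProvider provider = new HashingProvider(8);
        // The same provider instance serves both the ingest path and the query path.
        float[] indexed = provider.embed("satellite imagery");
        float[] queried = provider.embed("satellite imagery");
        System.out.println("Dimensions: " + indexed.length);
        System.out.println("Index and query vectors match: " + Arrays.equals(indexed, queried));
    }
}
```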

Custom inference providers

Implement the InferenceProvider interface to connect the OCR processor and multimodal rerank processor to any LLM endpoint. This enables:

  • OCR powered by your own vision models
  • Reranking with privately deployed LLMs
  • Custom classification or enrichment during ingest
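The reranking case can be sketched as follows. The `InferenceProvider` interface here is a hypothetical simplification that exposes a single assumed `score` method, and `TermOverlapProvider` counts shared terms as a placeholder for a call to a privately deployed LLM; neither reflects the actual SPI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class InferenceProviderSketch {

    // Hypothetical stand-in for the InferenceProvider SPI; the real signature may differ.
    interface InferenceProvider {
        // Returns a relevance score for a passage given a query; higher is more relevant.
        double score(String query, String passage);
    }

    // Stub that counts shared terms; a real provider would call an LLM endpoint here.
    static final class TermOverlapProvider implements InferenceProvider {
        @Override
        public double score(String query, String passage) {
            Set<String> queryTerms =
                new HashSet<>(Arrays.asList(query.toLowerCase(Locale.ROOT).split("\\s+")));
            double hits = 0;
            for (String term : passage.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (queryTerms.contains(term)) {
                    hits++;
                }
            }
            return hits;
        }
    }

    // Scores each passage once, then sorts from most to least relevant.
    static List<String> rerank(InferenceProvider provider, String query, List<String> passages) {
        Map<String, Double> scores = new HashMap<>();
        for (String passage : passages) {
            scores.put(passage, provider.score(query, passage));
        }
        List<String> sorted = new ArrayList<>(passages);
        sorted.sort(Comparator.comparingDouble((String p) -> scores.get(p)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> passages =
            Arrays.asList("weather report", "flood map of the delta", "map legend");
        for (String passage : rerank(new TermOverlapProvider(), "flood map", passages)) {
            System.out.println(passage);
        }
    }
}
```

Scoring each passage once before sorting matters more with a real endpoint, where every call to the model costs time and money.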

Custom raster sources

Implement the RasterSource interface to add support for custom raster image formats in the image tiling processor. Your implementation handles reading, decoding, and providing spatial metadata for the raster data.
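The three responsibilities named above (reading, decoding, and spatial metadata) might look like the following sketch. The `RasterSource` interface shown is a hypothetical simplification with assumed method names, and `InMemoryRaster` skips real decoding by holding pixels in an array; an actual implementation would parse the custom file format.

```java
import java.util.Arrays;

class RasterSourceSketch {

    // Hypothetical stand-in for the RasterSource SPI; the real signature may differ.
    interface RasterSource {
        int width();

        int height();

        // Spatial extent as {minX, minY, maxX, maxY} in the raster's coordinate system.
        double[] bounds();

        // Row-major pixels for one tile, clipped at the raster's right and bottom edges.
        int[] readTile(int tileX, int tileY, int tileSize);
    }

    // Single-band raster held in memory, standing in for a decoded custom format.
    static final class InMemoryRaster implements RasterSource {
        private final int width;
        private final int height;
        private final int[] pixels;
        private final double[] bounds;

        InMemoryRaster(int width, int height, int[] pixels, double[] bounds) {
            if (pixels.length != width * height) {
                throw new IllegalArgumentException("pixel count must equal width * height");
            }
            this.width = width;
            this.height = height;
            this.pixels = pixels.clone();
            this.bounds = bounds.clone();
        }

        @Override
        public int width() {
            return width;
        }

        @Override
        public int height() {
            return height;
        }

        @Override
        public double[] bounds() {
            return bounds.clone();
        }

        @Override
        public int[] readTile(int tileX, int tileY, int tileSize) {
            if (tileX < 0 || tileY < 0 || tileSize <= 0) {
                throw new IllegalArgumentException("tile coordinates and size must be valid");
            }
            int x0 = tileX * tileSize;
            int y0 = tileY * tileSize;
            int w = Math.min(tileSize, width - x0);
            int h = Math.min(tileSize, height - y0);
            if (w <= 0 || h <= 0) {
                return new int[0]; // tile lies entirely outside the raster
            }
            int[] tile = new int[w * h];
            for (int row = 0; row < h; row++) {
                System.arraycopy(pixels, (y0 + row) * width + x0, tile, row * w, w);
            }
            return tile;
        }
    }

    public static void main(String[] args) {
        int[] pixels = new int[12];
        for (int i = 0; i < pixels.length; i++) {
            pixels[i] = i;
        }
        RasterSource raster = new InMemoryRaster(4, 3, pixels, new double[] {0, 0, 4, 3});
        // The bottom-right tile is clipped to a single row because the raster is 3 pixels tall.
        System.out.println(Arrays.toString(raster.readTile(1, 1, 2)));
    }
}
```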