Extensibility and custom integrations
Lucenia's AI retrieval pipeline is built on a service provider interface (SPI) architecture that lets you extend every stage of the pipeline with custom implementations. Whether you need to extract content from a proprietary format, connect to an in-house embedding model, or process specialized imagery, you can plug into the pipeline without modifying core code.
Extension points
The following SPIs are available for customization:
| Interface | Purpose | Example |
|---|---|---|
| ContentExtractor | Extract content from custom document formats | NITF news format, proprietary internal formats |
| RasterSource | Read and decode custom raster/image formats | Specialized satellite imagery, medical imaging |
| EmbeddingProvider | Generate embeddings from custom model endpoints | Private fine-tuned models, on-premise GPU clusters |
| InferenceProvider | Run inference against custom LLM endpoints | Private LLMs for OCR, reranking, or classification |
How it works
Each extension point follows the same pattern:
- Implement the interface: Create a Java class that implements the SPI interface.
- Register the provider: Package your implementation as a Lucenia plugin with META-INF/services registration.
- Configure in pipelines: Reference your custom provider by name in ingest or search pipeline configurations.
For detailed implementation guides and code examples, see Extending ingest-content.
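As a rough illustration of the implement-and-register pattern, the sketch below uses the standard `java.util.ServiceLoader` mechanism, which is what `META-INF/services` files conventionally feed. It assumes, without confirmation from this page, that Lucenia discovers providers this way. The `NamedProvider` interface, the `AcmeProvider` class, and the `findByName` helper are hypothetical names invented for the example.

```java
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical sketch: the interface, its method, and the lookup helper are
// illustrative assumptions, not Lucenia's actual SPI.
class SpiRegistrationSketch {

    /** Stand-in for an SPI interface; the real Lucenia interfaces will differ. */
    public interface NamedProvider {
        String name();
    }

    /**
     * A custom implementation that a plugin JAR would ship. ServiceLoader
     * needs providers to be public with a public no-argument constructor.
     */
    public static final class AcmeProvider implements NamedProvider {
        public AcmeProvider() {}

        @Override
        public String name() {
            return "acme";
        }
    }

    /** Returns the first provider whose name matches, if any. */
    static Optional<NamedProvider> findByName(Iterable<NamedProvider> providers, String name) {
        for (NamedProvider provider : providers) {
            if (provider.name().equals(name)) {
                return Optional.of(provider);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // In a packaged plugin, the JAR would contain a text file named
        //   META-INF/services/<fully qualified interface name>
        // whose lines list implementation class names. ServiceLoader reads
        // that file to discover providers. This single-file sketch ships no
        // such file, so discovery is expected to find nothing here.
        ServiceLoader<NamedProvider> discovered = ServiceLoader.load(NamedProvider.class);
        System.out.println("via ServiceLoader: " + findByName(discovered, "acme").isPresent());

        // The same by-name lookup against an explicitly supplied list.
        List<NamedProvider> manual = List.of(new AcmeProvider());
        System.out.println("via manual list: " + findByName(manual, "acme").isPresent());
    }
}
```

The by-name lookup mirrors the third step: a pipeline configuration refers to a provider by its name, and the runtime resolves that name against whatever implementations were discovered.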
Custom content extractors
Implement the ContentExtractor interface to add support for any document format. Your extractor receives the raw bytes and MIME type of a document and returns structured ContentBlock objects that the downstream chunking, embedding, and OCR processors can consume.
Example use cases:
- NITF (News Industry Text Format) extraction
- CAD file metadata extraction
- Proprietary internal document formats
- Custom XML/JSON schema parsing
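To make the bytes-in, blocks-out contract concrete, here is a minimal sketch of an extractor for a made-up plain-text format. The method signatures, the fields of `ContentBlock`, the `NotesExtractor` class, and the `text/x-acme-notes` MIME type are all assumptions for illustration; the actual Lucenia interface may look different.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these type and method names are assumptions made for
// illustration and may not match Lucenia's actual ContentExtractor SPI.
class ContentExtractorSketch {

    /** Assumed minimal shape of a structured block: a kind plus its text. */
    static final class ContentBlock {
        final String kind;
        final String text;

        ContentBlock(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /** Assumed contract: raw bytes and a MIME type in, content blocks out. */
    interface ContentExtractor {
        boolean supports(String mimeType);

        List<ContentBlock> extract(byte[] content, String mimeType);
    }

    /** Toy extractor for a made-up format: UTF-8 text, blank-line separated. */
    static final class NotesExtractor implements ContentExtractor {
        static final String MIME_TYPE = "text/x-acme-notes"; // invented MIME type

        @Override
        public boolean supports(String mimeType) {
            return MIME_TYPE.equals(mimeType);
        }

        @Override
        public List<ContentBlock> extract(byte[] content, String mimeType) {
            if (!supports(mimeType)) {
                throw new IllegalArgumentException("Unsupported MIME type: " + mimeType);
            }
            String text = new String(content, StandardCharsets.UTF_8).replace("\r\n", "\n");
            List<ContentBlock> blocks = new ArrayList<>();
            // Split on blank lines and emit one block per non-empty paragraph.
            for (String paragraph : text.split("\\n\\s*\\n")) {
                String trimmed = paragraph.trim();
                if (!trimmed.isEmpty()) {
                    blocks.add(new ContentBlock("paragraph", trimmed));
                }
            }
            return blocks;
        }
    }

    public static void main(String[] args) {
        byte[] raw = "First paragraph.\n\nSecond paragraph.\n".getBytes(StandardCharsets.UTF_8);
        ContentExtractor extractor = new NotesExtractor();
        for (ContentBlock block : extractor.extract(raw, NotesExtractor.MIME_TYPE)) {
            System.out.println(block.kind + ": " + block.text);
        }
    }
}
```

A real extractor for NITF or a CAD format would replace the paragraph splitting with a proper parser, but the overall shape (check the MIME type, decode, emit blocks) would likely be similar.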
Custom embedding providers
Implement the EmbeddingProvider interface to connect to any embedding model. This is ideal for organizations running fine-tuned models or models with specialized tokenization.
Your provider receives text or image content and returns dense vectors. It plugs directly into both the embed ingest processor and the query embedding search processor, ensuring consistent embeddings at both index and search time.
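The sketch below shows one possible shape for such a provider and why sharing it between indexing and search matters. The interface methods (`name`, `dimensions`, `embedText`) are assumed, not taken from Lucenia, and the `HashingEmbeddingProvider` is a deterministic toy that stands in for a call to a real model endpoint.

```java
import java.util.Locale;

// Hypothetical sketch: the interface shape is an assumption, and the hashing
// "model" is only a stand-in for a request to a real embedding endpoint.
class EmbeddingProviderSketch {

    /** Assumed contract: text in, fixed-length dense vector out. */
    interface EmbeddingProvider {
        String name();

        int dimensions();

        float[] embedText(String text);
    }

    /** Toy provider: hashes tokens into buckets, then L2-normalizes. */
    static final class HashingEmbeddingProvider implements EmbeddingProvider {
        private final int dimensions;

        HashingEmbeddingProvider(int dimensions) {
            if (dimensions <= 0) {
                throw new IllegalArgumentException("dimensions must be positive");
            }
            this.dimensions = dimensions;
        }

        @Override
        public String name() {
            return "toy-hashing";
        }

        @Override
        public int dimensions() {
            return dimensions;
        }

        @Override
        public float[] embedText(String text) {
            float[] vector = new float[dimensions];
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    vector[Math.floorMod(token.hashCode(), dimensions)] += 1f;
                }
            }
            double sumOfSquares = 0;
            for (float value : vector) {
                sumOfSquares += value * value;
            }
            double norm = Math.sqrt(sumOfSquares);
            if (norm > 0) {
                for (int i = 0; i < vector.length; i++) {
                    vector[i] /= norm;
                }
            }
            return vector;
        }
    }

    /** Dot product; equals cosine similarity for unit-length vectors. */
    static double dot(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        EmbeddingProvider provider = new HashingEmbeddingProvider(16);
        // The same provider instance serves both the index-time and the
        // query-time call, so the two vectors live in the same space.
        float[] indexed = provider.embedText("flood map of the river delta");
        float[] query = provider.embedText("river flood map");
        System.out.println("similarity: " + dot(indexed, query));
    }
}
```

Because both calls go through one provider, a document and a query are compared in the same vector space; mixing two different models at index and search time would make the similarity scores meaningless.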
Custom inference providers
Implement the InferenceProvider interface to connect the OCR processor and multimodal rerank processor to any LLM endpoint. This enables:
- OCR powered by your own vision models
- Reranking with privately deployed LLMs
- Custom classification or enrichment during ingest
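As one illustration of the reranking case, the sketch below passes candidate documents through a provider and reorders them by the returned score. The `InferenceProvider` method shown here (`score`) and the `rerank` helper are hypothetical, and `KeywordOverlapProvider` is a trivial stand-in for a request to a privately deployed LLM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: the interface shape and the rerank helper are
// assumptions for illustration, not Lucenia's actual InferenceProvider SPI.
class InferenceProviderSketch {

    /** Assumed contract: score how relevant a document is to a query. */
    interface InferenceProvider {
        String name();

        double score(String query, String document);
    }

    /** Stand-in for an LLM call: counts query terms found in the document. */
    static final class KeywordOverlapProvider implements InferenceProvider {
        @Override
        public String name() {
            return "toy-keyword-overlap";
        }

        @Override
        public double score(String query, String document) {
            Set<String> queryTerms = tokens(query);
            queryTerms.retainAll(tokens(document));
            return queryTerms.size();
        }

        private static Set<String> tokens(String text) {
            return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
        }
    }

    /** Returns a copy of the documents ordered by descending score. */
    static List<String> rerank(InferenceProvider provider, String query, List<String> documents) {
        List<String> ordered = new ArrayList<>(documents);
        ordered.sort(Comparator.comparingDouble((String d) -> provider.score(query, d)).reversed());
        return ordered;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
            "cats and dogs",
            "flood map of the river delta",
            "river");
        for (String document : rerank(new KeywordOverlapProvider(), "river flood map", candidates)) {
            System.out.println(document);
        }
    }
}
```

Swapping in a different model would then mean supplying another implementation of the interface, with the reranking logic left unchanged.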
Custom raster sources
Implement the RasterSource interface to add support for custom raster image formats in the image tiling processor. Your implementation is responsible for reading and decoding the raster data and for supplying its spatial metadata.
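A minimal sketch of what a raster source and a tiling loop over it might look like follows. Every name here is an assumption: the `RasterSource` methods (`width`, `height`, `bounds`, `readWindow`), the `InMemoryRaster` class, and the `tileWindows` helper are invented for illustration and are not Lucenia's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: method names and the tiling helper are assumptions
// for illustration, not Lucenia's actual RasterSource SPI.
class RasterSourceSketch {

    /** Assumed contract: pixel dimensions, spatial extent, windowed reads. */
    interface RasterSource {
        int width();

        int height();

        /** Spatial extent as {minX, minY, maxX, maxY} in some coordinate system. */
        double[] bounds();

        /** Decoded pixel values for the requested window, in row-major order. */
        int[] readWindow(int x, int y, int w, int h);
    }

    /** Toy source backed by an already decoded, row-major pixel array. */
    static final class InMemoryRaster implements RasterSource {
        private final int width;
        private final int height;
        private final int[] pixels;
        private final double[] bounds;

        InMemoryRaster(int width, int height, int[] pixels, double[] bounds) {
            if (width <= 0 || height <= 0 || pixels.length != width * height || bounds.length != 4) {
                throw new IllegalArgumentException("inconsistent raster dimensions");
            }
            this.width = width;
            this.height = height;
            this.pixels = pixels.clone();
            this.bounds = bounds.clone();
        }

        @Override public int width() { return width; }

        @Override public int height() { return height; }

        @Override public double[] bounds() { return bounds.clone(); }

        @Override
        public int[] readWindow(int x, int y, int w, int h) {
            if (x < 0 || y < 0 || w <= 0 || h <= 0 || x + w > width || y + h > height) {
                throw new IllegalArgumentException("window outside raster");
            }
            int[] out = new int[w * h];
            for (int row = 0; row < h; row++) {
                System.arraycopy(pixels, (y + row) * width + x, out, row * w, w);
            }
            return out;
        }
    }

    /** Lists {x, y, w, h} windows covering the raster; edge tiles are clipped. */
    static List<int[]> tileWindows(RasterSource source, int tileSize) {
        if (tileSize <= 0) {
            throw new IllegalArgumentException("tileSize must be positive");
        }
        List<int[]> windows = new ArrayList<>();
        for (int y = 0; y < source.height(); y += tileSize) {
            for (int x = 0; x < source.width(); x += tileSize) {
                int w = Math.min(tileSize, source.width() - x);
                int h = Math.min(tileSize, source.height() - y);
                windows.add(new int[] {x, y, w, h});
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        int[] pixels = new int[5 * 3];
        for (int i = 0; i < pixels.length; i++) {
            pixels[i] = i;
        }
        RasterSource source = new InMemoryRaster(5, 3, pixels, new double[] {0, 0, 5, 3});
        for (int[] win : tileWindows(source, 2)) {
            int[] tile = source.readWindow(win[0], win[1], win[2], win[3]);
            System.out.println("tile at (" + win[0] + "," + win[1] + ") has " + tile.length + " pixels");
        }
    }
}
```

Keeping reads windowed, as assumed here, would let a tiling consumer process large rasters piece by piece instead of decoding the whole image into memory at once.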