Extending ingest-content
The ingest-content module provides four SPI (Service Provider Interface) extension points that let developers add custom capabilities without modifying the core module. Extensions are discovered at runtime via META-INF/services registration.
Extension points
┌──────────────────────────────────────────────────────────────┐
│                    ingest-content module                     │
│                                                              │
│  ┌─────────────────┐    ┌──────────────┐    ┌────────────┐   │
│  │ content_extract │───►│    chunk     │───►│   embed    │   │
│  └────────┬────────┘    └──────────────┘    └─────┬──────┘   │
│           │                                       │          │
│     ┌─────┴──────┐              ┌─────────────────┴────┐     │
│     │ Content    │              │ Embedding            │     │
│     │ Extractor  │              │ Provider             │     │
│     │ SPI ◄──────┼──── Your     │ SPI ◄────────────────┼──── Your custom
│     │            │     custom   │                      │     provider
│     └────────────┘     format   └──────────────────────┘     │
│                        parser                                │
│  ┌─────────────────┐              ┌──────────────────────┐   │
│  │  image_tiling   │              │     ocr / rerank     │   │
│  └────────┬────────┘              └──────────┬───────────┘   │
│           │                                  │               │
│     ┌─────┴──────┐              ┌────────────┴───────┐       │
│     │ Raster     │              │ Inference          │       │
│     │ Source     │              │ Provider           │       │
│     │ SPI ◄──────┼──── Your     │ SPI ◄──────────────┼────── Your custom
│     │            │     custom   │                    │       model endpoint
│     └────────────┘     decoder  └────────────────────┘       │
│                                                              │
└──────────────────────────────────────────────────────────────┘
| Extension point | Interface | Used by | Purpose |
|---|---|---|---|
| ContentExtractor | io.lucenia.ingest.content.extract.ContentExtractor | content_extract | Parse custom document formats |
| RasterSource | io.lucenia.ingest.content.enrich.raster.RasterSource | image_tiling | Decode custom image/raster formats |
| EmbeddingProvider | io.lucenia.ingest.content.enrich.provider.EmbeddingProvider | embed, chunk (semantic) | Connect to custom embedding services |
| InferenceProvider | io.lucenia.ingest.content.enrich.inference.InferenceProvider | ocr, multimodal_rerank | Connect to custom inference services |
ContentExtractor SPI
Implement this interface to add support for custom document formats (e.g., NITF, DICOM, CAD, audio transcripts).
Interface
public interface ContentExtractor {
    /**
     * Whether this extractor handles the given MIME type.
     */
    boolean supports(String mimeType, Map<String, Object> metadata);
    /**
     * Extract content from the input stream.
     * Do NOT close the stream -- the caller manages its lifecycle.
     */
    ExtractedContent extract(InputStream stream, ExtractionContext context)
            throws ExtractionException;
    /**
     * Priority for MIME type conflicts. Higher values win.
     * Default is 0.
     */
    default int priority() {
        return 0;
    }
}
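The "do not close the stream" rule is easy to break by accident, because closing a wrapping reader (for example at the end of a try-with-resources block) also closes the stream underneath it. One possible guard, sketched here, is a thin wrapper whose close() does nothing. NonClosingInputStream and StreamText are hypothetical names for illustration and are not part of ingest-content.

```java
import java.io.BufferedReader;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

/** Hypothetical helper: forwards reads to the delegate but ignores close(). */
final class NonClosingInputStream extends FilterInputStream {
    NonClosingInputStream(InputStream delegate) {
        super(delegate);
    }

    @Override
    public void close() {
        // Intentionally empty: the caller owns the underlying stream.
    }
}

/** Hypothetical helper showing a reader that can be closed safely. */
final class StreamText {
    private StreamText() {}

    static String readAll(InputStream stream) throws IOException {
        // Closing the reader closes only the shield, never the caller's stream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new NonClosingInputStream(stream), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }
}
```

Inside extract(), an implementation could call StreamText.readAll(stream) (or wrap the stream the same way before handing it to a third-party parser) and still leave the stream open for the caller.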
Built-in extractors
| Extractor | MIME types | Priority |
|---|---|---|
| TikaExtractor | PDF, DOCX, XLSX, PPTX, HTML | 0 |
| PlainTextExtractor | text/plain, text/markdown, application/json | 0 |
| ImageExtractor | image/* | 0 |
Example: Custom NITF extractor
package com.example.nitf;
import java.io.InputStream;
import java.util.Map;
import io.lucenia.ingest.content.extract.ContentExtractor;
import io.lucenia.ingest.content.extract.ExtractedContent;
import io.lucenia.ingest.content.extract.ExtractionContext;
// Assumed to live alongside ContentExtractor; adjust if the package differs.
import io.lucenia.ingest.content.extract.ExtractionException;
// NitfParser, NitfFile, and ImageSegment stand in for your own NITF parsing library.
public class NitfExtractor implements ContentExtractor {
    @Override
    public boolean supports(String mimeType, Map<String, Object> metadata) {
        return "application/vnd.nitf".equals(mimeType)
                || "image/nitf".equals(mimeType);
    }
    @Override
    public ExtractedContent extract(InputStream stream, ExtractionContext context)
            throws ExtractionException {
        // Parse NITF headers, TREs, image segments
        NitfFile nitf = NitfParser.parse(stream);
        ExtractedContent content = new ExtractedContent();
        // Add text blocks from TRE metadata
        content.addTextBlock(nitf.getFileHeader().toString());
        // Add image segments as image blocks
        for (ImageSegment seg : nitf.getImageSegments()) {
            content.addImageBlock(
                    seg.getImageData(),
                    "image/jpeg",
                    Map.of(
                            "icat", seg.getImageCategory(),
                            "irep", seg.getImageRepresentation()
                    )
            );
        }
        // Add format-specific metadata
        content.setMetadata(Map.of(
                "classification", nitf.getSecurityClassification(),
                "originating_station", nitf.getOriginatingStationId()
        ));
        return content;
    }
    @Override
    public int priority() {
        return 10; // Override any default handler for NITF
    }
}
Registration
Create META-INF/services/io.lucenia.ingest.content.extract.ContentExtractor:
com.example.nitf.NitfExtractor
RasterSource SPI
Implement this interface to add support for custom raster/image formats with efficient windowed reads (e.g., NITF via GDAL, MrSID, ECW).
Interface
public interface RasterSource {
    boolean supports(String mimeType, Map<String, Object> metadata);
    /**
     * Extract metadata (dimensions, CRS, geo-transform, bands).
     */
    RasterMetadata metadata(InputStream stream, Map<String, Object> context)
            throws IOException;
    /**
     * Read a pixel window from the raster.
     */
    BufferedImage readWindow(InputStream stream, TileWindow window)
            throws IOException;
    int priority();
    /**
     * Open a stateful session for efficient multi-tile reads.
     * Return null to use the per-tile stream approach.
     */
    default RasterReadSession openSession(
            InputStream stream, Map<String, Object> context) throws IOException {
        return null;
    }
}
RasterReadSession
For formats that benefit from keeping state between tile reads (HTTP connections, file handles, grid geometry caches), implement RasterReadSession:
public interface RasterReadSession extends Closeable {
    RasterMetadata metadata();
    BufferedImage readWindow(TileWindow window) throws IOException;
}
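To illustrate how the session and per-tile paths relate, here is a minimal caller-side sketch: it asks the source for a session, reads every tile through it when one is returned, and falls back to a fresh stream per tile when openSession returns null. The types below are local stand-ins that mirror only the methods needed here (metadata() is omitted), and TileReader is a hypothetical name. The module's actual tiling code is not shown in this document and may work differently.

```java
import java.awt.image.BufferedImage;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Stand-ins for the real types in io.lucenia.ingest.content.enrich.raster.
record TileWindow(int x, int y, int width, int height) {}

interface RasterReadSession extends Closeable {
    BufferedImage readWindow(TileWindow window) throws IOException;
}

interface RasterSource {
    BufferedImage readWindow(InputStream stream, TileWindow window) throws IOException;

    default RasterReadSession openSession(InputStream stream, Map<String, Object> context)
            throws IOException {
        return null; // no session support by default
    }
}

/** Hypothetical caller-side helper, not the module's implementation. */
final class TileReader {
    private TileReader() {}

    static List<BufferedImage> readTiles(
            RasterSource source,
            Supplier<InputStream> streams,
            Map<String, Object> context,
            List<TileWindow> windows) throws IOException {
        List<BufferedImage> tiles = new ArrayList<>(windows.size());
        // try-with-resources skips close() for a null resource, so a source
        // without session support is handled without extra null checks.
        try (InputStream sessionStream = streams.get();
             RasterReadSession session = source.openSession(sessionStream, context)) {
            if (session != null) {
                for (TileWindow window : windows) {
                    tiles.add(session.readWindow(window));
                }
                return tiles;
            }
        }
        // Fallback: one fresh stream per tile.
        for (TileWindow window : windows) {
            try (InputStream stream = streams.get()) {
                tiles.add(source.readWindow(stream, window));
            }
        }
        return tiles;
    }
}
```

The session path opens one stream and closes the session once, which is where formats with expensive setup (remote connections, index parsing) would save work.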
Built-in implementations
| Implementation | MIME types | Priority | Session support |
|---|---|---|---|
| ImageIORasterSource | JPEG, PNG, standard TIFF | 0 | No |
| SisGeoTiffRasterSource | GeoTIFF, COG (image/tiff with geo metadata) | 10 | Yes (HTTP Range reads) |
Example: Custom GDAL raster source
package com.example.gdal;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Set;
import org.gdal.gdal.Band;
import org.gdal.gdal.Dataset;
import org.gdal.gdal.gdal;
import io.lucenia.ingest.content.enrich.raster.*;
// GdalReadSession is your own RasterReadSession implementation (not shown).
public class GdalRasterSource implements RasterSource {
    @Override
    public boolean supports(String mimeType, Map<String, Object> metadata) {
        return Set.of("image/nitf", "image/x-mrsid", "image/ecw")
                .contains(mimeType);
    }
    @Override
    public RasterMetadata metadata(InputStream stream, Map<String, Object> context)
            throws IOException {
        // Use GDAL JNI bindings to read metadata
        Dataset ds = gdal.Open(context.get("uri").toString());
        try {
            double[] geoTransform = ds.GetGeoTransform();
            String crs = ds.GetProjection();
            return new RasterMetadata(
                    ds.getRasterXSize(), // width
                    ds.getRasterYSize(), // height
                    ds.getRasterCount(), // bands
                    geoTransform,
                    crs
            );
        } finally {
            ds.delete(); // release the native dataset handle
        }
    }
    @Override
    public BufferedImage readWindow(InputStream stream, TileWindow window)
            throws IOException {
        // Read pixel region via GDAL
        Dataset ds = gdal.Open(/* ... */);
        try {
            Band band = ds.GetRasterBand(1);
            int[] data = new int[window.width() * window.height()];
            band.ReadRaster(window.x(), window.y(),
                    window.width(), window.height(), data);
            // Convert to BufferedImage (this sketch assumes a single 8-bit band)
            BufferedImage image = new BufferedImage(
                    window.width(), window.height(), BufferedImage.TYPE_BYTE_GRAY);
            image.getRaster().setPixels(0, 0, window.width(), window.height(), data);
            return image;
        } finally {
            ds.delete(); // release the native dataset handle
        }
    }
    @Override
    public int priority() {
        return 20; // Override SIS for NITF/MrSID formats
    }
    @Override
    public RasterReadSession openSession(InputStream stream, Map<String, Object> context)
            throws IOException {
        // Open a persistent GDAL dataset for multi-tile reads
        return new GdalReadSession(context.get("uri").toString());
    }
}
EmbeddingProvider SPI
Implement this interface to connect the embed and chunk (semantic mode) processors to custom embedding services.
Interface
public interface EmbeddingProvider {
    String name();
    /**
     * Embed a batch of inputs (text, image, or multimodal).
     */
    List<float[]> embedInputs(
            List<EmbeddingInput> inputs,
            String modelId,
            int dimensions,
            Map<String, String> providerConfig
    ) throws EmbeddingException;
    /**
     * Maximum inputs per batch call.
     */
    int maxBatchSize();
    /**
     * Validate configuration at pipeline creation time.
     */
    void validate(
            String modelId,
            int dimensions,
            String contentType,
            Map<String, String> providerConfig
    );
}
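Because maxBatchSize() caps the number of inputs per embedInputs call, a caller presumably has to split larger input lists before invoking the provider. The sketch below shows one straightforward way to do that split. Batches is a hypothetical helper, and how the embed and chunk processors actually batch their inputs is not described in this document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the module's real batching logic may differ. */
final class Batches {
    private Batches() {}

    /** Splits inputs into consecutive batches of at most maxBatchSize items. */
    static <T> List<List<T>> partition(List<T> inputs, int maxBatchSize) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < inputs.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, inputs.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(inputs.subList(start, end)));
        }
        return batches;
    }
}
```

A caller could then loop over Batches.partition(inputs, provider.maxBatchSize()), call embedInputs once per batch, and concatenate the results in order.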
EmbeddingInput
// Text-only
EmbeddingInput.text("hello world")
// Image-only
EmbeddingInput.image(imageBytes, "image/jpeg")
// Multimodal (text + image)
EmbeddingInput.multimodal("describe this", imageBytes, "image/png")
Built-in providers
| Provider | Name | Batch support | Content types |
|---|---|---|---|
| BedrockProvider | bedrock | No (1 per call) | Text, image, multimodal |
| OpenAiProvider | openai | Yes (up to 100) | Text only |
| HttpProvider | http | Yes (configurable) | Text, image, multimodal |
Example: Custom ONNX Runtime provider
package com.example.onnx;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import io.lucenia.ingest.content.enrich.provider.*;
// loadModel, tokenize, and normalize are helper methods you supply (not shown).
public class OnnxEmbeddingProvider implements EmbeddingProvider {
    @Override
    public String name() {
        return "onnx";
    }
    @Override
    public List<float[]> embedInputs(
            List<EmbeddingInput> inputs,
            String modelId,
            int dimensions,
            Map<String, String> providerConfig) throws EmbeddingException {
        OrtSession session = loadModel(modelId, providerConfig);
        List<float[]> results = new ArrayList<>(inputs.size());
        for (EmbeddingInput input : inputs) {
            // Close the tensor and result after each run to release native memory.
            try (OnnxTensor tensor = tokenize(input.getText());
                 OrtSession.Result result = session.run(Map.of("input", tensor))) {
                float[] embedding = ((float[][]) result.get(0).getValue())[0];
                results.add(normalize(embedding, dimensions));
            } catch (OrtException e) {
                // Assumes EmbeddingException offers a (String, Throwable) constructor.
                throw new EmbeddingException("ONNX inference failed", e);
            }
        }
        return results;
    }
    @Override
    public int maxBatchSize() {
        return 32;
    }
    @Override
    public void validate(String modelId, int dimensions,
            String contentType, Map<String, String> providerConfig) {
        if (!"text".equals(contentType)) {
            throw new IllegalArgumentException("ONNX provider supports text only");
        }
    }
}
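The example calls normalize(embedding, dimensions) without defining it. One plausible reading, sketched below, is that it keeps the first `dimensions` components and scales the result to unit L2 length. That interpretation is an assumption; the source does not say what the helper does, and Vectors is a hypothetical class name.

```java
import java.util.Arrays;

/** Hypothetical helper: truncate to the requested size, then L2-normalize. */
final class Vectors {
    private Vectors() {}

    static float[] normalize(float[] embedding, int dimensions) {
        if (dimensions < 1 || embedding.length < dimensions) {
            throw new IllegalArgumentException(
                    "expected at least " + dimensions + " components, got " + embedding.length);
        }
        float[] out = Arrays.copyOf(embedding, dimensions);
        // Accumulate in double to limit rounding error on long vectors.
        double sumSquares = 0.0;
        for (float v : out) {
            sumSquares += (double) v * v;
        }
        double norm = Math.sqrt(sumSquares);
        if (norm == 0.0) {
            return out; // a zero vector has no direction; return it unchanged
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = (float) (out[i] / norm);
        }
        return out;
    }
}
```

Whether truncation is appropriate depends on the model; many models do not produce meaningful vectors when truncated, in which case rejecting a length mismatch outright would be the safer choice.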
InferenceProvider SPI
Implement this interface to connect the ocr and multimodal_rerank processors to custom inference services.
Interface
public interface InferenceProvider {
    String name();
    /**
     * Run inference on a batch of inputs.
     */
    InferenceResult infer(
            List<InferenceInput> inputs,
            String modelId,
            String taskType,
            Map<String, String> providerConfig
    ) throws InferenceException;
    void validate(
            String modelId,
            String taskType,
            Map<String, String> providerConfig
    );
}
Supported task types
| Task type | Input | Output | Used by |
|---|---|---|---|
| ocr | Image | Text + bounding boxes + confidence | ocr processor |
| caption | Image | Descriptive text | Custom pipelines |
| rerank | Text (query + passage) | Relevance score [0.0, 1.0] | multimodal_rerank |
| summarize | Text | Summary text | Custom pipelines |
| classify | Text or image | Labels + scores | Custom pipelines |
| extract_layout | Image | Layout regions + bounding boxes | Custom pipelines |
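A custom provider rarely supports every task type in the table, so its validate method is a natural place to reject the ones it cannot serve. The sketch below shows such a check in isolation. TaskTypes is a hypothetical helper, and throwing IllegalArgumentException mirrors the EmbeddingProvider example in this document rather than a documented contract for InferenceProvider.

```java
import java.util.Set;

/** Hypothetical helper for use inside an InferenceProvider.validate implementation. */
final class TaskTypes {
    private TaskTypes() {}

    /** Task types listed in the table above. */
    static final Set<String> DOCUMENTED = Set.of(
            "ocr", "caption", "rerank", "summarize", "classify", "extract_layout");

    static void requireSupported(String taskType, Set<String> providerTasks) {
        if (taskType == null || !DOCUMENTED.contains(taskType)) {
            throw new IllegalArgumentException("Unknown task type: " + taskType);
        }
        if (!providerTasks.contains(taskType)) {
            throw new IllegalArgumentException(
                    "Task type '" + taskType + "' is not supported by this provider");
        }
    }
}
```

A provider that only does OCR and captioning could call TaskTypes.requireSupported(taskType, Set.of("ocr", "caption")) as the first line of validate.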
Built-in providers
| Provider | Name | Models |
|---|---|---|
| BedrockInferenceProvider | bedrock | Claude 3 family (Haiku, Sonnet, Opus) |
| HttpInferenceProvider | http | Any HTTP endpoint |
Packaging and deployment
1. Build a plugin JAR
Your extension must be packaged as a Lucenia plugin or included in the module classpath.
2. Register via META-INF/services
Create the appropriate service file in your JAR:
META-INF/services/io.lucenia.ingest.content.extract.ContentExtractor
META-INF/services/io.lucenia.ingest.content.enrich.raster.RasterSource
META-INF/services/io.lucenia.ingest.content.enrich.provider.EmbeddingProvider
META-INF/services/io.lucenia.ingest.content.enrich.inference.InferenceProvider
Each file contains one fully-qualified class name per line.
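META-INF/services files are the registration format read by the JDK's java.util.ServiceLoader, so a quick way to confirm that a JAR is wired up correctly is to load the interface yourself. The sketch below does that generically. SpiDiscovery is a hypothetical helper, and whether ingest-content calls ServiceLoader directly or wraps it in its own plugin class loading is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical helper for checking what a class loader can discover. */
final class SpiDiscovery {
    private SpiDiscovery() {}

    /** Returns every implementation of spi registered under META-INF/services. */
    static <T> List<T> loadAll(Class<T> spi, ClassLoader loader) {
        List<T> found = new ArrayList<>();
        // ServiceLoader instantiates each listed class via its public no-arg constructor.
        for (T implementation : ServiceLoader.load(spi, loader)) {
            found.add(implementation);
        }
        return found;
    }
}
```

In a unit test for your plugin, SpiDiscovery.loadAll(ContentExtractor.class, getClass().getClassLoader()) should contain an instance of your extractor. An empty list usually points to a misnamed service file or a class without a public no-argument constructor.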
3. Priority-based selection
When multiple implementations support the same MIME type or provider name, the one with the highest priority() value wins. Use this to override built-in handlers:
| Priority | Meaning |
|---|---|
| 0 | Built-in default |
| 1-9 | Community extensions |
| 10-19 | Vendor-specific overrides |
| 20+ | Customer-specific customizations |
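The selection rule can be sketched as a filter followed by a maximum over priority(). The Prioritized interface below is a local stand-in for the two methods that ContentExtractor and RasterSource share, and PrioritySelector is a hypothetical name. The provider interfaces shown earlier do not declare priority(), so this sketch covers only the MIME-type case, and the module's behavior on equal priorities is not documented here.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Stand-in for the methods shared by ContentExtractor and RasterSource. */
interface Prioritized {
    boolean supports(String mimeType, Map<String, Object> metadata);

    int priority();
}

/** Hypothetical selector, not the module's implementation. */
final class PrioritySelector {
    private PrioritySelector() {}

    static <T extends Prioritized> Optional<T> select(
            List<T> candidates, String mimeType, Map<String, Object> metadata) {
        return candidates.stream()
                .filter(candidate -> candidate.supports(mimeType, metadata))
                // Highest priority wins; tie-breaking is left unspecified here.
                .max(Comparator.comparingInt(Prioritized::priority));
    }
}
```

With a built-in handler at priority 0 and a custom one at priority 20 that both accept image/nitf, the custom handler is chosen, while MIME types only the built-in supports still resolve to the built-in.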
4. Security constraints
- No raw API keys in pipeline configuration. Use api_key_setting to reference Lucenia keystore entries.
- Input streams passed to extractors must not be closed by the implementation (the caller manages their lifecycle).
- Credentials for custom providers should follow the same keystore pattern as built-in providers.
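A custom provider can enforce the first constraint itself by rejecting configurations that carry a literal key. The sketch below is one way to do that from a validate method. Only api_key_setting comes from this document; the raw key name api_key and the CredentialChecks class are assumptions made for illustration.

```java
import java.util.Map;

/** Hypothetical check for use inside a provider's validate method. */
final class CredentialChecks {
    private CredentialChecks() {}

    // Assumed name of a raw-key option that must never appear in pipeline config.
    static final String RAW_KEY = "api_key";
    // Documented option that references a keystore entry instead.
    static final String KEY_SETTING = "api_key_setting";

    static void rejectRawApiKey(Map<String, String> providerConfig) {
        if (providerConfig.containsKey(RAW_KEY)) {
            throw new IllegalArgumentException(
                    "Raw API keys are not allowed in pipeline configuration; use "
                            + KEY_SETTING + " to reference a keystore entry");
        }
    }
}
```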