Version: 0.11.0

Extending ingest-content

Introduced 0.11.0

The ingest-content module provides four SPI (Service Provider Interface) extension points that let developers add custom capabilities without modifying the core module. Extensions are discovered at runtime via META-INF/services registration.

Extension points

┌──────────────────────────────────────────────────────────────┐
│                    ingest-content module                     │
│                                                              │
│  ┌─────────────────┐    ┌──────────────┐    ┌────────────┐   │
│  │ content_extract │───►│    chunk     │───►│   embed    │   │
│  └────────┬────────┘    └──────────────┘    └─────┬──────┘   │
│           │                                       │          │
│     ┌─────┴──────┐              ┌─────────────────┴────┐     │
│     │ Content    │              │ Embedding            │     │
│     │ Extractor  │              │ Provider             │     │
│     │ SPI ◄──────┼──── Your     │ SPI ◄────────────────┼─────┼── Your custom
│     │            │     custom   │                      │     │   provider
│     └────────────┘     format   └──────────────────────┘     │
│                        parser                                │
│  ┌─────────────────┐            ┌──────────────────────┐     │
│  │  image_tiling   │            │     ocr / rerank     │     │
│  └────────┬────────┘            └──────────┬───────────┘     │
│           │                                │                 │
│     ┌─────┴──────┐            ┌────────────┴───────┐         │
│     │ Raster     │            │ Inference          │         │
│     │ Source     │            │ Provider           │         │
│     │ SPI ◄──────┼──── Your   │ SPI ◄──────────────┼─────────┼── Your custom
│     │            │     custom │                    │         │   model endpoint
│     └────────────┘     decoder└────────────────────┘         │
│                                                              │
└──────────────────────────────────────────────────────────────┘
| Extension point | Interface | Used by | Purpose |
|---|---|---|---|
| ContentExtractor | io.lucenia.ingest.content.extract.ContentExtractor | content_extract | Parse custom document formats |
| RasterSource | io.lucenia.ingest.content.enrich.raster.RasterSource | image_tiling | Decode custom image/raster formats |
| EmbeddingProvider | io.lucenia.ingest.content.enrich.provider.EmbeddingProvider | embed, chunk (semantic) | Connect to custom embedding services |
| InferenceProvider | io.lucenia.ingest.content.enrich.inference.InferenceProvider | ocr, multimodal_rerank | Connect to custom inference services |

ContentExtractor SPI

Implement this interface to add support for custom document formats (e.g., NITF, DICOM, CAD, audio transcripts).

Interface

public interface ContentExtractor {

    /**
     * Whether this extractor handles the given MIME type.
     */
    boolean supports(String mimeType, Map<String, Object> metadata);

    /**
     * Extract content from the input stream.
     * Do NOT close the stream -- the caller manages its lifecycle.
     */
    ExtractedContent extract(InputStream stream, ExtractionContext context)
            throws ExtractionException;

    /**
     * Priority for MIME type conflicts. Higher values win.
     * Default is 0.
     */
    default int priority() {
        return 0;
    }
}

Built-in extractors

| Extractor | MIME types | Priority |
|---|---|---|
| TikaExtractor | PDF, DOCX, XLSX, PPTX, HTML | 0 |
| PlainTextExtractor | text/plain, text/markdown, application/json | 0 |
| ImageExtractor | image/* | 0 |

Example: Custom NITF extractor

package com.example.nitf;

import io.lucenia.ingest.content.extract.ContentExtractor;
import io.lucenia.ingest.content.extract.ExtractedContent;
import io.lucenia.ingest.content.extract.ExtractionContext;
// Assumed to live in the same package as ContentExtractor.
import io.lucenia.ingest.content.extract.ExtractionException;

import java.io.InputStream;
import java.util.Map;

// NitfParser, NitfFile, and ImageSegment stand for your own NITF parsing classes.
public class NitfExtractor implements ContentExtractor {

    @Override
    public boolean supports(String mimeType, Map<String, Object> metadata) {
        return "application/vnd.nitf".equals(mimeType)
            || "image/nitf".equals(mimeType);
    }

    @Override
    public ExtractedContent extract(InputStream stream, ExtractionContext context)
            throws ExtractionException {
        // Parse NITF headers, TREs, image segments.
        // The stream is read but never closed here; the caller owns it.
        NitfFile nitf = NitfParser.parse(stream);

        ExtractedContent content = new ExtractedContent();
        // Add the file header as a text block
        content.addTextBlock(nitf.getFileHeader().toString());

        // Add image segments as image blocks
        for (ImageSegment seg : nitf.getImageSegments()) {
            content.addImageBlock(
                seg.getImageData(),
                "image/jpeg",
                Map.of(
                    "icat", seg.getImageCategory(),
                    "irep", seg.getImageRepresentation()
                )
            );
        }

        // Add format-specific metadata (Map.of rejects null values)
        content.setMetadata(Map.of(
            "classification", nitf.getSecurityClassification(),
            "originating_station", nitf.getOriginatingStationId()
        ));

        return content;
    }

    @Override
    public int priority() {
        return 10; // Override any default handler for NITF
    }
}

Registration

Create META-INF/services/io.lucenia.ingest.content.extract.ContentExtractor:

com.example.nitf.NitfExtractor

RasterSource SPI

Implement this interface to add support for custom raster/image formats with efficient windowed reads (e.g., NITF via GDAL, MrSID, ECW).

Interface

public interface RasterSource {

    boolean supports(String mimeType, Map<String, Object> metadata);

    /**
     * Extract metadata (dimensions, CRS, geo-transform, bands).
     */
    RasterMetadata metadata(InputStream stream, Map<String, Object> context)
            throws IOException;

    /**
     * Read a pixel window from the raster.
     */
    BufferedImage readWindow(InputStream stream, TileWindow window)
            throws IOException;

    int priority();

    /**
     * Open a stateful session for efficient multi-tile reads.
     * Return null to use the per-tile stream approach.
     */
    default RasterReadSession openSession(
            InputStream stream, Map<String, Object> context) throws IOException {
        return null;
    }
}

RasterReadSession

For formats that benefit from keeping state between tile reads (HTTP connections, file handles, grid geometry caches), implement RasterReadSession:

public interface RasterReadSession extends Closeable {

    RasterMetadata metadata();

    BufferedImage readWindow(TileWindow window) throws IOException;
}
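To make the session idea concrete, here is a minimal sketch of a session that decodes once and then serves every tile from the cached image. `TileWindow` and `WindowReadSession` are simplified stand-ins defined inside the example (the real `RasterReadSession` also exposes `metadata()`), and `CachedImageSession` is a hypothetical name, so treat this as an illustration of the pattern rather than the module's implementation.

```java
import java.awt.image.BufferedImage;
import java.io.Closeable;
import java.io.IOException;

// Stand-in for the module's TileWindow; only the four accessors used in this document are modeled.
final class TileWindow {
    private final int x, y, width, height;

    TileWindow(int x, int y, int width, int height) {
        this.x = x; this.y = y; this.width = width; this.height = height;
    }

    int x() { return x; }
    int y() { return y; }
    int width() { return width; }
    int height() { return height; }
}

// Stand-in for RasterReadSession, without metadata() to keep the sketch short.
interface WindowReadSession extends Closeable {
    BufferedImage readWindow(TileWindow window) throws IOException;
}

// Holds one decoded image and serves every tile request from it.
final class CachedImageSession implements WindowReadSession {
    private BufferedImage decoded; // the state kept between tile reads
    private int reads;             // number of windows served so far

    CachedImageSession(BufferedImage decoded) {
        this.decoded = decoded;
    }

    @Override
    public BufferedImage readWindow(TileWindow w) throws IOException {
        if (decoded == null) {
            throw new IOException("session is closed");
        }
        // Clip the window to the image bounds so edge tiles do not overrun.
        int x = Math.max(0, w.x());
        int y = Math.max(0, w.y());
        int width = Math.min(w.width(), decoded.getWidth() - x);
        int height = Math.min(w.height(), decoded.getHeight() - y);
        if (width <= 0 || height <= 0) {
            throw new IOException("window lies outside the raster");
        }
        reads++;
        // getSubimage shares pixel data with the cached image instead of copying it.
        return decoded.getSubimage(x, y, width, height);
    }

    int reads() {
        return reads;
    }

    @Override
    public void close() {
        decoded = null; // release the cached state
    }
}
```

A real session would more likely hold a file handle or HTTP connection than a fully decoded image; the point is that expensive setup happens once, in the constructor, and `close()` releases it.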

Built-in implementations

| Implementation | MIME types | Priority | Session support |
|---|---|---|---|
| ImageIORasterSource | JPEG, PNG, standard TIFF | 0 | No |
| SisGeoTiffRasterSource | GeoTIFF, COG (image/tiff with geo metadata) | 10 | Yes (HTTP Range reads) |

Example: Custom GDAL raster source

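package com.example.gdal;

import io.lucenia.ingest.content.enrich.raster.*;

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Set;

import org.gdal.gdal.Band;
import org.gdal.gdal.Dataset;
import org.gdal.gdal.gdal;

public class GdalRasterSource implements RasterSource {

    private static final Set<String> MIME_TYPES =
        Set.of("image/nitf", "image/x-mrsid", "image/ecw");

    static {
        gdal.AllRegister(); // Register GDAL drivers once per JVM
    }

    @Override
    public boolean supports(String mimeType, Map<String, Object> metadata) {
        // Set.of(...).contains(null) throws, so guard against a missing MIME type
        return mimeType != null && MIME_TYPES.contains(mimeType);
    }

    @Override
    public RasterMetadata metadata(InputStream stream, Map<String, Object> context)
            throws IOException {
        // Use GDAL JNI bindings to read metadata
        Dataset ds = open(context);
        try {
            double[] geoTransform = ds.GetGeoTransform();
            String crs = ds.GetProjection();

            return new RasterMetadata(
                ds.getRasterXSize(), // width
                ds.getRasterYSize(), // height
                ds.getRasterCount(), // bands
                geoTransform,
                crs
            );
        } finally {
            ds.delete(); // Release the native dataset handle
        }
    }

    @Override
    public BufferedImage readWindow(InputStream stream, TileWindow window)
            throws IOException {
        // Read pixel region via GDAL
        Dataset ds = gdal.Open(/* ... */);
        Band band = ds.GetRasterBand(1);
        int[] data = new int[window.width() * window.height()];
        band.ReadRaster(window.x(), window.y(),
            window.width(), window.height(), data);
        // Convert to BufferedImage...
        return image;
    }

    @Override
    public int priority() {
        return 20; // Override SIS for NITF/MrSID formats
    }

    @Override
    public RasterReadSession openSession(InputStream stream, Map<String, Object> context)
            throws IOException {
        // Open a persistent GDAL dataset for multi-tile reads.
        // GdalReadSession is your own RasterReadSession implementation.
        return new GdalReadSession(context.get("uri").toString());
    }

    // Opens the dataset named by the "uri" context entry, failing with a clear message.
    private static Dataset open(Map<String, Object> context) throws IOException {
        Object uri = context.get("uri");
        if (uri == null) {
            throw new IOException("context has no \"uri\" entry");
        }
        Dataset ds = gdal.Open(uri.toString());
        if (ds == null) {
            throw new IOException("GDAL could not open " + uri);
        }
        return ds;
    }
}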
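The GDAL example leaves the conversion from raw samples to a `BufferedImage` elided. One possible shape for that step, for a single band whose samples already fit in 8 bits, is sketched below using only `java.awt.image`. `BandImages` and `toGrayImage` are hypothetical names, and real rasters (16-bit, multi-band, or with a no-data value) would need scaling and band mapping that this sketch does not attempt.

```java
import java.awt.image.BufferedImage;

final class BandImages {
    private BandImages() {}

    /**
     * Packs one band of row-major samples into an 8-bit grayscale image.
     * Samples outside 0..255 are clamped rather than rescaled.
     */
    static BufferedImage toGrayImage(int[] samples, int width, int height) {
        if (samples.length != width * height) {
            throw new IllegalArgumentException(
                "expected " + (width * height) + " samples, got " + samples.length);
        }
        int[] clamped = new int[samples.length];
        for (int i = 0; i < samples.length; i++) {
            clamped[i] = Math.max(0, Math.min(255, samples[i]));
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        // Band 0 is the only band of a TYPE_BYTE_GRAY raster.
        image.getRaster().setSamples(0, 0, width, height, 0, clamped);
        return image;
    }
}
```

With a window read into `int[] data`, the call would look like `BandImages.toGrayImage(data, window.width(), window.height())`.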

EmbeddingProvider SPI

Implement this interface to connect the embed and chunk (semantic mode) processors to custom embedding services.

Interface

public interface EmbeddingProvider {

    String name();

    /**
     * Embed a batch of inputs (text, image, or multimodal).
     */
    List<float[]> embedInputs(
        List<EmbeddingInput> inputs,
        String modelId,
        int dimensions,
        Map<String, String> providerConfig
    ) throws EmbeddingException;

    /**
     * Maximum inputs per batch call.
     */
    int maxBatchSize();

    /**
     * Validate configuration at pipeline creation time.
     */
    void validate(
        String modelId,
        int dimensions,
        String contentType,
        Map<String, String> providerConfig
    );
}

EmbeddingInput

// Text-only
EmbeddingInput.text("hello world")

// Image-only
EmbeddingInput.image(imageBytes, "image/jpeg")

// Multimodal (text + image)
EmbeddingInput.multimodal("describe this", imageBytes, "image/png")

Built-in providers

| Provider | Name | Batch support | Content types |
|---|---|---|---|
| BedrockProvider | bedrock | No (1 per call) | Text, image, multimodal |
| OpenAiProvider | openai | Yes (up to 100) | Text only |
| HttpProvider | http | Yes (configurable) | Text, image, multimodal |
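The batch limits in the table imply that a caller has to split a larger input list into chunks no bigger than `maxBatchSize()`. This page does not say whether the module does that splitting itself or expects each provider to, so the helper below is only a sketch of the arithmetic; `Batches.partition` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class Batches {
    private Batches() {}

    /** Splits items into consecutive sublists of at most maxBatchSize elements, preserving order. */
    static <T> List<List<T>> partition(List<T> items, int maxBatchSize) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be at least 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, items.size());
            // Copy the view so each batch stays valid if the source list changes later.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

With a limit of 1 (as listed for bedrock), every input becomes its own batch; with a limit of 100, a list of 250 inputs yields batches of 100, 100, and 50.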

Example: Custom ONNX Runtime provider

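package com.example.onnx;

import io.lucenia.ingest.content.enrich.provider.*;

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OnnxEmbeddingProvider implements EmbeddingProvider {

    @Override
    public String name() {
        return "onnx";
    }

    @Override
    public List<float[]> embedInputs(
            List<EmbeddingInput> inputs,
            String modelId,
            int dimensions,
            Map<String, String> providerConfig) throws EmbeddingException {

        List<float[]> results = new ArrayList<>(inputs.size());
        try {
            // loadModel, tokenize, and normalize are helpers elided from this example.
            OrtSession session = loadModel(modelId, providerConfig);

            for (EmbeddingInput input : inputs) {
                // Close the tensor and the result to release native memory.
                try (OnnxTensor tensor = tokenize(input.getText());
                     OrtSession.Result result = session.run(Map.of("input", tensor))) {
                    float[] embedding = ((float[][]) result.get(0).getValue())[0];
                    results.add(normalize(embedding, dimensions));
                }
            }
        } catch (OrtException e) {
            // Assumes EmbeddingException offers a (String, Throwable) constructor.
            throw new EmbeddingException("ONNX inference failed for model " + modelId, e);
        }

        return results;
    }

    @Override
    public int maxBatchSize() {
        return 32;
    }

    @Override
    public void validate(String modelId, int dimensions,
            String contentType, Map<String, String> providerConfig) {
        if (!"text".equals(contentType)) {
            throw new IllegalArgumentException("ONNX provider supports text only");
        }
    }
}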

InferenceProvider SPI

Implement this interface to connect the ocr and multimodal_rerank processors to custom inference services.

Interface

public interface InferenceProvider {

    String name();

    /**
     * Run inference on a batch of inputs.
     */
    InferenceResult infer(
        List<InferenceInput> inputs,
        String modelId,
        String taskType,
        Map<String, String> providerConfig
    ) throws InferenceException;

    void validate(
        String modelId,
        String taskType,
        Map<String, String> providerConfig
    );
}

Supported task types

| Task type | Input | Output | Used by |
|---|---|---|---|
| ocr | Image | Text + bounding boxes + confidence | ocr processor |
| caption | Image | Descriptive text | Custom pipelines |
| rerank | Text (query + passage) | Relevance score [0.0, 1.0] | multimodal_rerank |
| summarize | Text | Summary text | Custom pipelines |
| classify | Text or image | Labels + scores | Custom pipelines |
| extract_layout | Image | Layout regions + bounding boxes | Custom pipelines |

Built-in providers

| Provider | Name | Models |
|---|---|---|
| BedrockInferenceProvider | bedrock | Claude 3 family (Haiku, Sonnet, Opus) |
| HttpInferenceProvider | http | Any HTTP endpoint |
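This section shows no custom InferenceProvider, so here is a small sketch of what the `validate` half of one might check, written against plain `String` and `Map` types so it runs on its own. The class name, the choice of supported tasks, and the `endpoint` configuration key are all assumptions made for illustration; the page does not define which keys `providerConfig` carries.

```java
import java.util.Map;
import java.util.Set;

final class InferenceConfigCheck {
    // Task types this hypothetical provider serves, picked from the task type table.
    private static final Set<String> SUPPORTED_TASKS = Set.of("ocr", "caption");

    private InferenceConfigCheck() {}

    /** Mirrors the validate(modelId, taskType, providerConfig) signature; throws on bad input. */
    static void validate(String modelId, String taskType, Map<String, String> providerConfig) {
        if (modelId == null || modelId.trim().isEmpty()) {
            throw new IllegalArgumentException("modelId is required");
        }
        // Null check first: contains(null) throws on sets built with Set.of.
        if (taskType == null || !SUPPORTED_TASKS.contains(taskType)) {
            throw new IllegalArgumentException("unsupported task type: " + taskType);
        }
        // "endpoint" is a made-up key for this sketch, not a documented setting.
        String endpoint = providerConfig == null ? null : providerConfig.get("endpoint");
        if (endpoint == null || !endpoint.startsWith("https://")) {
            throw new IllegalArgumentException("providerConfig.endpoint must be an https:// URL");
        }
    }
}
```

Failing here, at pipeline creation time, surfaces a misconfigured task type or endpoint before any document reaches `infer`.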

Packaging and deployment

1. Build a plugin JAR

Your extension must be packaged as a Lucenia plugin or included in the module classpath.

2. Register via META-INF/services

Create the appropriate service file in your JAR:

META-INF/services/io.lucenia.ingest.content.extract.ContentExtractor
META-INF/services/io.lucenia.ingest.content.enrich.raster.RasterSource
META-INF/services/io.lucenia.ingest.content.enrich.provider.EmbeddingProvider
META-INF/services/io.lucenia.ingest.content.enrich.inference.InferenceProvider

Each file contains one fully-qualified class name per line.

3. Priority-based selection

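When multiple implementations support the same MIME type or provider name, the one with the highest priority() value wins. Of the interfaces shown on this page, only ContentExtractor and RasterSource declare priority(). Return a higher value to override built-in handlers: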

| Priority | Meaning |
|---|---|
| 0 | Built-in default |
| 1-9 | Community extensions |
| 10-19 | Vendor-specific overrides |
| 20+ | Customer-specific customizations |
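The selection rule can be sketched as "filter by supports, then take the maximum priority". The example below assumes discovery goes through `java.util.ServiceLoader`, which is what META-INF/services files conventionally feed; the page does not confirm the module's actual loader. `FormatHandler`, `FixedHandler`, and `HandlerSelection` are stand-ins invented for this sketch.

```java
import java.util.Comparator;
import java.util.Optional;
import java.util.ServiceLoader;
import java.util.stream.StreamSupport;

// Stand-in for an SPI such as ContentExtractor, reduced to what selection needs.
interface FormatHandler {
    boolean supports(String mimeType);

    default int priority() {
        return 0;
    }
}

// Trivial handler used to exercise the selection logic.
final class FixedHandler implements FormatHandler {
    private final String mimeType;
    private final int priority;

    FixedHandler(String mimeType, int priority) {
        this.mimeType = mimeType;
        this.priority = priority;
    }

    @Override
    public boolean supports(String candidate) {
        return mimeType.equals(candidate);
    }

    @Override
    public int priority() {
        return priority;
    }
}

final class HandlerSelection {
    private HandlerSelection() {}

    /** Returns the supporting handler with the highest priority, or empty if none supports the type. */
    static Optional<FormatHandler> select(Iterable<FormatHandler> candidates, String mimeType) {
        return StreamSupport.stream(candidates.spliterator(), false)
            .filter(h -> h.supports(mimeType))
            .max(Comparator.comparingInt(FormatHandler::priority));
    }

    /** Same rule, applied to whatever implementations are registered under META-INF/services. */
    static Optional<FormatHandler> discoverAndSelect(String mimeType) {
        return select(ServiceLoader.load(FormatHandler.class), mimeType);
    }
}
```

The page does not say how ties are broken, so it is safer to pick a priority that no other installed handler uses than to rely on registration order.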

4. Security constraints

  • No raw API keys in pipeline configuration. Use api_key_setting to reference Lucenia keystore entries.
  • Input streams passed to extractors must not be closed by the implementation (the caller manages lifecycle).
  • Credentials for custom providers should follow the same keystore pattern as built-in providers.
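
The stream rule is easy to break by accident, because many parsing libraries close whatever stream they are handed. One common defensive pattern, sketched here with only `java.io` types, is to pass the parser a wrapper whose `close()` does nothing. `NonClosingInputStream` and `ShieldedRead` are hypothetical names for this illustration and are not part of the module.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Wraps a caller-owned stream so that close() on the wrapper leaves the underlying stream open. */
final class NonClosingInputStream extends FilterInputStream {
    NonClosingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public void close() {
        // Intentionally empty: whoever created the underlying stream closes it.
    }
}

final class ShieldedRead {
    private ShieldedRead() {}

    /** Reads one byte through code that closes its input, without closing the caller's stream. */
    static int firstByte(InputStream callerStream) throws IOException {
        // try-with-resources closes the wrapper only; callerStream stays open.
        try (InputStream shielded = new NonClosingInputStream(callerStream)) {
            return shielded.read();
        }
    }
}
```

Inside an extractor, the same wrapper would go around the `stream` argument before it is handed to a third-party parser.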