Version: 0.11.0

Extending ingest-content

Introduced 0.11.0

The ingest-content module provides four SPI (Service Provider Interface) extension points that let developers add custom capabilities without modifying the core module. Extensions are discovered at runtime via META-INF/services registration.

Extension points

┌──────────────────────────────────────────────────────────────┐
│                     ingest-content module                    │
│                                                              │
│  ┌─────────────────┐    ┌──────────────┐    ┌────────────┐  │
│  │ content_extract  │───►│    chunk     │───►│   embed    │  │
│  └────────┬────────┘    └──────────────┘    └─────┬──────┘  │
│           │                                       │          │
│     ┌─────┴──────┐              ┌─────────────────┴────┐     │
│     │ Content    │              │ Embedding            │     │
│     │ Extractor  │              │ Provider             │     │
│     │ SPI ◄──────┼──── Your    │ SPI ◄────────────────┼──── Your custom
│     │            │     custom   │                      │     provider
│     └────────────┘     format   └──────────────────────┘     │
│                        parser                                │
│  ┌─────────────────┐              ┌──────────────────────┐   │
│  │ image_tiling    │              │ ocr / rerank         │   │
│  └────────┬────────┘              └──────────┬───────────┘   │
│           │                                  │               │
│     ┌─────┴──────┐              ┌────────────┴───────┐       │
│     │ Raster     │              │ Inference          │       │
│     │ Source     │              │ Provider           │       │
│     │ SPI ◄──────┼──── Your    │ SPI ◄──────────────┼────── Your custom
│     │            │     custom   │                    │       model endpoint
│     └────────────┘     decoder  └────────────────────┘       │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Extension point	Interface	Used by	Purpose
ContentExtractor	`io.lucenia.ingest.content.extract.ContentExtractor`	`content_extract`	Parse custom document formats
RasterSource	`io.lucenia.ingest.content.enrich.raster.RasterSource`	`image_tiling`	Decode custom image/raster formats
EmbeddingProvider	`io.lucenia.ingest.content.enrich.provider.EmbeddingProvider`	`embed`, `chunk` (semantic)	Connect to custom embedding services
InferenceProvider	`io.lucenia.ingest.content.enrich.inference.InferenceProvider`	`ocr`, `multimodal_rerank`	Connect to custom inference services

ContentExtractor SPI

Implement this interface to add support for custom document formats (e.g., NITF, DICOM, CAD, audio transcripts).

Interface

public interface ContentExtractor {

    /**
     * Whether this extractor handles the given MIME type.
     */
    boolean supports(String mimeType, Map<String, Object> metadata);

    /**
     * Extract content from the input stream.
     * Do NOT close the stream -- the caller manages its lifecycle.
     */
    ExtractedContent extract(InputStream stream, ExtractionContext context)
        throws ExtractionException;

    /**
     * Priority for MIME type conflicts. Higher values win.
     * Default is 0.
     */
    default int priority() {
        return 0;
    }
}
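The stream-lifecycle rule in the `extract` Javadoc is easy to break by accident, because many parsing libraries close whatever stream they are handed. A minimal sketch of one way to guard against that, assuming a small wrapper class (`NonClosingInputStream` is a hypothetical name, not part of the module) whose `close()` does nothing; `TrackingStream` and `parseAndClose` exist only to demonstrate the effect:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: forwards reads but turns close() into a no-op. */
class NonClosingInputStream extends FilterInputStream {

    NonClosingInputStream(InputStream delegate) {
        super(delegate);
    }

    @Override
    public void close() {
        // Intentionally empty: the caller that supplied the stream owns its lifecycle.
    }
}

class NonClosingDemo {

    /** Test double that records whether close() ever reached the underlying stream. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Stands in for a third-party parser that closes whatever it is given. */
    static int parseAndClose(InputStream in) {
        try (InputStream s = in) {
            return s.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside `extract`, you would pass `new NonClosingInputStream(stream)` to the parser instead of `stream` itself. If Apache Commons IO is already on your classpath, its `CloseShieldInputStream` serves the same purpose.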

Built-in extractors

Extractor	MIME types	Priority
`TikaExtractor`	PDF, DOCX, XLSX, PPTX, HTML	0
`PlainTextExtractor`	`text/plain`, `text/markdown`, `application/json`	0
`ImageExtractor`	`image/*`	0

Example: Custom NITF extractor

package com.example.nitf;

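import io.lucenia.ingest.content.extract.ContentExtractor;
import io.lucenia.ingest.content.extract.ExtractedContent;
import io.lucenia.ingest.content.extract.ExtractionContext;
// Assumed to live alongside ContentExtractor; adjust if the package differs
import io.lucenia.ingest.content.extract.ExtractionException;

import java.io.InputStream;
import java.util.Map;

// NitfParser, NitfFile, and ImageSegment stand for whatever NITF library you use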

public class NitfExtractor implements ContentExtractor {

    @Override
    public boolean supports(String mimeType, Map<String, Object> metadata) {
        return "application/vnd.nitf".equals(mimeType)
            || "image/nitf".equals(mimeType);
    }

    @Override
    public ExtractedContent extract(InputStream stream, ExtractionContext context)
            throws ExtractionException {
        // Parse NITF headers, TREs, image segments
        NitfFile nitf = NitfParser.parse(stream);

        ExtractedContent content = new ExtractedContent();
        // Add text blocks from TRE metadata
        content.addTextBlock(nitf.getFileHeader().toString());

        // Add image segments as image blocks
        for (ImageSegment seg : nitf.getImageSegments()) {
            content.addImageBlock(
                seg.getImageData(),
                "image/jpeg",
                Map.of(
                    "icat", seg.getImageCategory(),
                    "irep", seg.getImageRepresentation()
                )
            );
        }

        // Add format-specific metadata
        content.setMetadata(Map.of(
            "classification", nitf.getSecurityClassification(),
            "originating_station", nitf.getOriginatingStationId()
        ));

        return content;
    }

    @Override
    public int priority() {
        return 10; // Override any default handler for NITF
    }
}

Registration

Create META-INF/services/io.lucenia.ingest.content.extract.ContentExtractor:

com.example.nitf.NitfExtractor

RasterSource SPI

Implement this interface to add support for custom raster/image formats with efficient windowed reads (e.g., NITF via GDAL, MrSID, ECW).

Interface

public interface RasterSource {

    boolean supports(String mimeType, Map<String, Object> metadata);

    /**
     * Extract metadata (dimensions, CRS, geo-transform, bands).
     */
    RasterMetadata metadata(InputStream stream, Map<String, Object> context)
        throws IOException;

    /**
     * Read a pixel window from the raster.
     */
    BufferedImage readWindow(InputStream stream, TileWindow window)
        throws IOException;

    int priority();

    /**
     * Open a stateful session for efficient multi-tile reads.
     * Return null to use the per-tile stream approach.
     */
    default RasterReadSession openSession(
        InputStream stream, Map<String, Object> context) throws IOException {
        return null;
    }
}

RasterReadSession

For formats that benefit from keeping state between tile reads (HTTP connections, file handles, grid geometry caches), implement RasterReadSession:

public interface RasterReadSession extends Closeable {

    RasterMetadata metadata();

    BufferedImage readWindow(TileWindow window) throws IOException;
}
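To make the null-return contract of `openSession` concrete, here is a sketch of how a caller might choose between the two read paths. It is not the module's actual tiling code: the interfaces are trimmed stand-ins reduced to the methods involved, `TileWindow` is modeled as a record (Java 16+), and `TileReader` and the `Supplier<InputStream>` parameter are assumptions made for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Trimmed stand-ins for the real types, reduced to what this sketch needs.
record TileWindow(int x, int y, int width, int height) {}

interface RasterReadSession extends Closeable {
    BufferedImage readWindow(TileWindow window) throws IOException;
}

interface RasterSource {
    BufferedImage readWindow(InputStream stream, TileWindow window) throws IOException;

    default RasterReadSession openSession(InputStream stream, Map<String, Object> context)
            throws IOException {
        return null;
    }
}

class TileReader {

    /**
     * Reads every window, preferring one shared session and falling back to
     * a fresh stream per tile when openSession returns null.
     */
    static List<BufferedImage> readAll(RasterSource source,
                                       Supplier<InputStream> streams,
                                       Map<String, Object> context,
                                       List<TileWindow> windows) throws IOException {
        List<BufferedImage> tiles = new ArrayList<>();

        // A null resource is legal in try-with-resources and is simply not closed.
        try (InputStream in = streams.get();
             RasterReadSession session = source.openSession(in, context)) {
            if (session != null) {
                for (TileWindow w : windows) {
                    tiles.add(session.readWindow(w));
                }
                return tiles;
            }
        }

        // No session support: open and close one stream per tile.
        for (TileWindow w : windows) {
            try (InputStream in = streams.get()) {
                tiles.add(source.readWindow(in, w));
            }
        }
        return tiles;
    }
}
```

The session path pays the setup cost once (opening a connection, parsing headers) and closes the session after the last tile, which is the benefit the interface is designed to offer.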

Built-in implementations

Implementation	MIME types	Priority	Session support
`ImageIORasterSource`	JPEG, PNG, standard TIFF	0	No
`SisGeoTiffRasterSource`	GeoTIFF, COG (`image/tiff` with geo metadata)	10	Yes (HTTP Range reads)

Example: Custom GDAL raster source

package com.example.gdal;

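import io.lucenia.ingest.content.enrich.raster.*;

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Set;

import org.gdal.gdal.Band;
import org.gdal.gdal.Dataset;
import org.gdal.gdal.gdal;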

public class GdalRasterSource implements RasterSource {

    @Override
    public boolean supports(String mimeType, Map<String, Object> metadata) {
        return Set.of("image/nitf", "image/x-mrsid", "image/ecw")
            .contains(mimeType);
    }

    @Override
    public RasterMetadata metadata(InputStream stream, Map<String, Object> context)
            throws IOException {
        // Use GDAL JNI bindings to read metadata
        Dataset ds = gdal.Open(context.get("uri").toString());
        double[] geoTransform = ds.GetGeoTransform();
        String crs = ds.GetProjection();

        return new RasterMetadata(
            ds.getRasterXSize(),    // width
            ds.getRasterYSize(),    // height
            ds.getRasterCount(),    // bands
            geoTransform,
            crs
        );
    }

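    @Override
    public BufferedImage readWindow(InputStream stream, TileWindow window)
            throws IOException {
        // Read pixel region via GDAL
        Dataset ds = gdal.Open(/* ... */);
        Band band = ds.GetRasterBand(1);
        int[] data = new int[window.width() * window.height()];
        band.ReadRaster(window.x(), window.y(),
            window.width(), window.height(), data);
        ds.delete(); // Release the native dataset handle

        // Convert to BufferedImage (single band, assuming 8-bit samples)
        BufferedImage image = new BufferedImage(
            window.width(), window.height(), BufferedImage.TYPE_BYTE_GRAY);
        image.getRaster().setPixels(0, 0, window.width(), window.height(), data);
        return image;
    }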

    @Override
    public int priority() {
        return 20; // Override SIS for NITF/MrSID formats
    }

    @Override
    public RasterReadSession openSession(InputStream stream, Map<String, Object> context)
            throws IOException {
        // Open a persistent GDAL dataset for multi-tile reads
        return new GdalReadSession(context.get("uri").toString());
    }
}

EmbeddingProvider SPI

Implement this interface to connect the embed and chunk (semantic mode) processors to custom embedding services.

Interface

public interface EmbeddingProvider {

    String name();

    /**
     * Embed a batch of inputs (text, image, or multimodal).
     */
    List<float[]> embedInputs(
        List<EmbeddingInput> inputs,
        String modelId,
        int dimensions,
        Map<String, String> providerConfig
    ) throws EmbeddingException;

    /**
     * Maximum inputs per batch call.
     */
    int maxBatchSize();

    /**
     * Validate configuration at pipeline creation time.
     */
    void validate(
        String modelId,
        int dimensions,
        String contentType,
        Map<String, String> providerConfig
    );
}

EmbeddingInput

// Text-only
EmbeddingInput.text("hello world")

// Image-only
EmbeddingInput.image(imageBytes, "image/jpeg")

// Multimodal (text + image)
EmbeddingInput.multimodal("describe this", imageBytes, "image/png")

Built-in providers

Provider	Name	Batch support	Content types
`BedrockProvider`	`bedrock`	No (1 per call)	Text, image, multimodal
`OpenAiProvider`	`openai`	Yes (up to 100)	Text only
`HttpProvider`	`http`	Yes (configurable)	Text, image, multimodal
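Because batch limits differ so much between providers (1 for Bedrock, up to 100 for OpenAI), whatever calls `embedInputs` has to split its inputs according to `maxBatchSize()`. The sketch below shows one way that splitting could work. `TextEmbedder` is a hypothetical, text-only simplification of `EmbeddingProvider`, and `BatchingCaller` is not a class in the module.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, text-only simplification of EmbeddingProvider for this sketch.
interface TextEmbedder {
    List<float[]> embed(List<String> batch);

    int maxBatchSize();
}

class BatchingCaller {

    /** Splits inputs into consecutive batches of at most maxBatchSize elements. */
    static <T> List<List<T>> partition(List<T> inputs, int maxBatchSize) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < inputs.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, inputs.size());
            batches.add(new ArrayList<>(inputs.subList(start, end)));
        }
        return batches;
    }

    /** Calls the provider once per batch and concatenates results in input order. */
    static List<float[]> embedAll(TextEmbedder provider, List<String> inputs) {
        List<float[]> vectors = new ArrayList<>(inputs.size());
        for (List<String> batch : partition(inputs, provider.maxBatchSize())) {
            vectors.addAll(provider.embed(batch));
        }
        return vectors;
    }
}
```

With this shape, a provider that returns 1 from `maxBatchSize()` is called once per input, while a provider that returns 100 handles the same inputs in far fewer round trips.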

Example: Custom ONNX Runtime provider

package com.example.onnx;

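import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import io.lucenia.ingest.content.enrich.provider.*;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OnnxEmbeddingProvider implements EmbeddingProvider {

    @Override
    public String name() {
        return "onnx";
    }

    @Override
    public List<float[]> embedInputs(
            List<EmbeddingInput> inputs,
            String modelId,
            int dimensions,
            Map<String, String> providerConfig) throws EmbeddingException {

        // loadModel, tokenize, and normalize are helper methods omitted from this example
        OrtSession session = loadModel(modelId, providerConfig);
        List<float[]> results = new ArrayList<>();

        try {
            for (EmbeddingInput input : inputs) {
                // Tensors and results hold native memory, so close them after each run
                try (OnnxTensor tensor = tokenize(input.getText());
                     OrtSession.Result result = session.run(Map.of("input", tensor))) {
                    float[] embedding = ((float[][]) result.get(0).getValue())[0];
                    results.add(normalize(embedding, dimensions));
                }
            }
        } catch (OrtException e) {
            // Assumes EmbeddingException offers a (String, Throwable) constructor
            throw new EmbeddingException("ONNX inference failed for model " + modelId, e);
        }

        return results;
    }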

    @Override
    public int maxBatchSize() {
        return 32;
    }

    @Override
    public void validate(String modelId, int dimensions,
            String contentType, Map<String, String> providerConfig) {
        if (!"text".equals(contentType)) {
            throw new IllegalArgumentException("ONNX provider supports text only");
        }
    }
}
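The ONNX example calls a `normalize(embedding, dimensions)` helper without showing it. One plausible reading, offered here as an assumption rather than the intended implementation, is that it checks the vector length against the configured dimension count and scales the vector to unit L2 length. `VectorMath` is a hypothetical class name.

```java
class VectorMath {

    /**
     * Returns a unit-length (L2) copy of the embedding.
     * Rejects vectors whose length does not match the configured dimensions.
     */
    static float[] normalize(float[] embedding, int dimensions) {
        if (embedding.length != dimensions) {
            throw new IllegalArgumentException(
                "Expected " + dimensions + " dimensions but model returned " + embedding.length);
        }

        // Accumulate in double to limit rounding error on long vectors.
        double sumSquares = 0.0;
        for (float v : embedding) {
            sumSquares += (double) v * v;
        }
        double norm = Math.sqrt(sumSquares);

        float[] unit = new float[dimensions];
        if (norm == 0.0) {
            return unit; // A zero vector has no direction; leave it as zeros.
        }
        for (int i = 0; i < dimensions; i++) {
            unit[i] = (float) (embedding[i] / norm);
        }
        return unit;
    }
}
```

Failing fast on a dimension mismatch surfaces a misconfigured model at embedding time instead of letting vectors of the wrong size reach the index.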

InferenceProvider SPI

Implement this interface to connect the ocr and multimodal_rerank processors to custom inference services.

Interface

public interface InferenceProvider {

    String name();

    /**
     * Run inference on a batch of inputs.
     */
    InferenceResult infer(
        List<InferenceInput> inputs,
        String modelId,
        String taskType,
        Map<String, String> providerConfig
    ) throws InferenceException;

    void validate(
        String modelId,
        String taskType,
        Map<String, String> providerConfig
    );
}

Supported task types

Task type	Input	Output	Used by
`ocr`	Image	Text + bounding boxes + confidence	`ocr` processor
`caption`	Image	Descriptive text	Custom pipelines
`rerank`	Text (query + passage)	Relevance score [0.0, 1.0]	`multimodal_rerank`
`summarize`	Text	Summary text	Custom pipelines
`classify`	Text or image	Labels + scores	Custom pipelines
`extract_layout`	Image	Layout regions + bounding boxes	Custom pipelines
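A custom provider rarely implements every task type in the table, so its `validate` method is a natural place to reject the ones it does not handle. The following sketch assumes a hypothetical provider that supports only `ocr` and `caption`; the class name and the choice of `IllegalArgumentException` (borrowed from the ONNX example on this page) are illustrative.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TaskTypeValidator {

    // Task types this hypothetical provider implements; a subset of the table above.
    private static final Set<String> SUPPORTED = Set.of("ocr", "caption");

    /** Mirrors the parameter list of InferenceProvider.validate. */
    static void validate(String modelId, String taskType, Map<String, String> providerConfig) {
        if (modelId == null || modelId.isBlank()) {
            throw new IllegalArgumentException("modelId is required");
        }
        // Set.of(...) throws on contains(null), so check for null first.
        if (taskType == null || !SUPPORTED.contains(taskType)) {
            throw new IllegalArgumentException(
                "Unsupported task type '" + taskType + "'; supported: " + new TreeSet<>(SUPPORTED));
        }
    }
}
```

Listing the supported values in the error message saves pipeline authors a trip back to the documentation.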

Built-in providers

Provider	Name	Models
`BedrockInferenceProvider`	`bedrock`	Claude 3 family (Haiku, Sonnet, Opus)
`HttpInferenceProvider`	`http`	Any HTTP endpoint

Packaging and deployment

1. Build a plugin JAR

Your extension must be packaged as a Lucenia plugin or included in the module classpath.

2. Register via META-INF/services

Create the appropriate service file in your JAR:

META-INF/services/io.lucenia.ingest.content.extract.ContentExtractor
META-INF/services/io.lucenia.ingest.content.enrich.raster.RasterSource
META-INF/services/io.lucenia.ingest.content.enrich.provider.EmbeddingProvider
META-INF/services/io.lucenia.ingest.content.enrich.inference.InferenceProvider

Each file contains one fully-qualified class name per line.

3. Priority-based selection

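When multiple implementations support the same MIME type or provider name, the one with the highest priority() value wins. Of the interfaces shown on this page, only ContentExtractor and RasterSource declare priority(). Use a higher priority to override built-in handlers: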

Priority	Meaning
0	Built-in default
1-9	Community extensions
10-19	Vendor-specific overrides
20+	Customer-specific customizations
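The discovery and selection rules described in steps 2 and 3 can be sketched together. `META-INF/services` files are the convention read by `java.util.ServiceLoader`, so the sketch assumes discovery works that way; the module's real lookup code is not shown in this guide and may differ. `Extractor` is a trimmed stand-in for `ContentExtractor`, and `ExtractorRegistry` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.ServiceLoader;

// Trimmed stand-in for ContentExtractor: only the two methods selection needs.
interface Extractor {
    boolean supports(String mimeType, Map<String, Object> metadata);

    default int priority() {
        return 0;
    }
}

class ExtractorRegistry {

    /** Loads every implementation listed in META-INF/services files on the classpath. */
    static List<Extractor> discover() {
        List<Extractor> found = new ArrayList<>();
        ServiceLoader.load(Extractor.class).forEach(found::add);
        return found;
    }

    /** Picks the candidate that supports the MIME type and has the highest priority. */
    static Optional<Extractor> select(List<Extractor> candidates,
                                      String mimeType,
                                      Map<String, Object> metadata) {
        return candidates.stream()
            .filter(e -> e.supports(mimeType, metadata))
            .max(Comparator.comparingInt(Extractor::priority));
    }
}
```

This guide does not say how ties between equal priorities are resolved, so it is safest to give an overriding extension a priority strictly higher than the handler it replaces.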

4. Security constraints

No raw API keys in pipeline configuration. Use api_key_setting to reference Lucenia keystore entries.
Input streams passed to extractors must not be closed by the implementation (the caller manages lifecycle).
Credentials for custom providers should follow the same keystore pattern as built-in providers.
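A custom provider that needs credentials can enforce the first constraint in its own `validate` method. The sketch below is an assumption-heavy illustration: `api_key_setting` comes from the text above, but the raw-secret key name `api_key` and the class name `CredentialConfigCheck` are invented for the example.

```java
import java.util.Map;

class CredentialConfigCheck {

    /**
     * Rejects configurations that embed a secret directly and requires a
     * keystore reference instead. The "api_key" field name is an assumption.
     */
    static void requireKeystoreReference(Map<String, String> providerConfig) {
        if (providerConfig.containsKey("api_key")) {
            throw new IllegalArgumentException(
                "Raw API keys are not allowed; reference a keystore entry via api_key_setting");
        }
        String setting = providerConfig.get("api_key_setting");
        if (setting == null || setting.isBlank()) {
            throw new IllegalArgumentException("api_key_setting is required");
        }
    }
}
```

Running this check at pipeline creation time means a misconfigured secret is reported before any document is processed.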
