
Document Enrichment Processors #460

@medcl

🧠 Document Enrichment Processors

Implement a set of processors that can transform, enrich, or normalize documents after they are fetched by a connector and before indexing.

These processors operate independently of the data source and can be chained in a pipeline to apply transformations such as field renaming, extraction, normalization, tagging, and more.

✅ Objectives

  • Define a standard processor interface
  • Support common field-level transformations
  • Allow flexible and composable enrichment pipelines
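
To make the objectives above concrete, here is a minimal sketch of what a standard processor interface and a composable pipeline might look like. It is written in Python purely for illustration and is not the project's actual API: the names `Processor`, `Pipeline`, `Lowercase`, and `SetValue`, and the `process(doc)` signature, are all assumptions.

```python
from typing import Any, Dict, List, Protocol

Document = Dict[str, Any]


class Processor(Protocol):
    """Hypothetical interface: take a document, return the transformed document."""

    def process(self, doc: Document) -> Document: ...


class Lowercase:
    """Lowercase one string field; non-string values are left untouched."""

    def __init__(self, field: str) -> None:
        self.field = field

    def process(self, doc: Document) -> Document:
        value = doc.get(self.field)
        if isinstance(value, str):
            # Return a shallow copy so the caller's document is not mutated.
            return {**doc, self.field: value.lower()}
        return doc


class SetValue:
    """Set or override a field with a constant value."""

    def __init__(self, field: str, value: Any) -> None:
        self.field = field
        self.value = value

    def process(self, doc: Document) -> Document:
        return {**doc, self.field: self.value}


class Pipeline:
    """Chain processors; each one receives the previous one's output."""

    def __init__(self, processors: List[Processor]) -> None:
        self.processors = list(processors)

    def process(self, doc: Document) -> Document:
        for processor in self.processors:
            doc = processor.process(doc)
        return doc


def run_example() -> Document:
    # Hypothetical usage: normalize the title, then stamp a constant source.
    pipeline = Pipeline([Lowercase("title"), SetValue("source", "crawler")])
    return pipeline.process({"title": "Hello World"})
```

Because `Pipeline` exposes the same `process` method as any single processor, a pipeline could itself be nested inside another pipeline, which is one way to keep the composition flexible.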

🛠️ Example Processor Types

| Processor Type | Description |
| --- | --- |
| rename_field | Rename fields in the document (e.g., title → doc.title) |
| extract_regex | Extract substrings using regex from text fields |
| set_value | Set or override a field with a constant value |
| timestamp_parser | Convert string timestamps to a unified format |
| truncate | Limit string or array field lengths |
| add_tags | Append static or dynamic tags to a document |
| lowercase | Normalize text to lowercase |
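
As a rough illustration of the field-level behavior described above, several of these processor types can be sketched as small functions over a dictionary. This is a Python sketch under stated assumptions, not the proposed implementation: the function signatures, the flat (non-nested) key `doc.title`, the `tags` field name, the treatment of timezone-naive timestamps as UTC, and ISO 8601 as the "unified format" are all choices made here for the example.

```python
import re
from datetime import datetime, timezone
from typing import Any, Dict, List

Document = Dict[str, Any]


def rename_field(doc: Document, old: str, new: str) -> Document:
    """Move the value stored under `old` to `new` (flat keys assumed)."""
    out = dict(doc)
    if old in out:
        out[new] = out.pop(old)
    return out


def extract_regex(doc: Document, source: str, pattern: str, target: str) -> Document:
    """Store the first match of `pattern` in `source` under `target`."""
    out = dict(doc)
    value = out.get(source)
    if isinstance(value, str):
        match = re.search(pattern, value)
        if match:
            # Use the first capture group if the pattern has one, else the whole match.
            out[target] = match.group(1) if match.groups() else match.group(0)
    return out


def timestamp_parser(doc: Document, field: str, fmt: str) -> Document:
    """Parse a string timestamp with `fmt` and rewrite it as ISO 8601 in UTC."""
    out = dict(doc)
    value = out.get(field)
    if isinstance(value, str):
        parsed = datetime.strptime(value, fmt)
        if parsed.tzinfo is None:
            # Assumption: timestamps without an offset are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        out[field] = parsed.astimezone(timezone.utc).isoformat()
    return out


def truncate(doc: Document, field: str, max_length: int) -> Document:
    """Limit a string or list field to at most `max_length` items."""
    out = dict(doc)
    value = out.get(field)
    if isinstance(value, (str, list)):
        out[field] = value[:max_length]
    return out


def add_tags(doc: Document, tags: List[str]) -> Document:
    """Append tags to the `tags` list, skipping any that are already present."""
    out = dict(doc)
    merged = list(out.get("tags", []))
    for tag in tags:
        if tag not in merged:
            merged.append(tag)
    out["tags"] = merged
    return out
```

Each function returns a new dictionary instead of mutating its input, so the steps can be applied in any order without side effects on the original document.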

🔧 Configuration Example

pipeline:
  - name: enrich_documents
    auto_start: false
    keep_running: true
    processor:
      - consumer:
          auto_commit_offset: true
          queue_selector:
            keys:
              - indexing_documents
          consumer:
            group: enriched_documents
            fetch_max_messages: 10
          processor:
            - document_summarization:
                model: $[[env.ENRICHMENT_MODEL]]
                input_queue: "indexing_documents"
                min_input_document_length: 500
                output_queue:
                  name: "enriched_documents"
                  label:
                    tag: "enriched"

📁 Reference

Some existing legacy code can be migrated to processors: https://github.com/infinilabs/crawler/tree/master/pipeline/joints
