🧠 Document Enrichment Processors
Implement a set of processors that can transform, enrich, or normalize documents after they are fetched by a connector and before indexing.
These processors operate independently of the data source and can be chained in a pipeline to apply transformations such as field renaming, extraction, normalization, tagging, and more.
✅ Objectives
- Define a standard processor interface
- Support common field-level transformations
- Allow flexible and composable enrichment pipelines
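One possible shape for such an interface, shown as a minimal Python sketch rather than the project's actual API: a processor is any function that takes a document and returns a document, and a pipeline is just an ordered composition of processors. The names `Document`, `Processor`, `chain`, `set_value`, and `lowercase` here are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List

Document = Dict[str, Any]
# Hypothetical interface: a processor maps one document to a transformed document.
Processor = Callable[[Document], Document]


def chain(processors: Iterable[Processor]) -> Processor:
    """Compose processors into one; they run in the order given."""
    steps: List[Processor] = list(processors)

    def run(doc: Document) -> Document:
        for step in steps:
            doc = step(doc)
        return doc

    return run


def set_value(field: str, value: Any) -> Processor:
    """Return a processor that sets or overrides `field` with a constant."""
    def process(doc: Document) -> Document:
        return {**doc, field: value}  # copy, so the input document is not mutated
    return process


def lowercase(field: str) -> Processor:
    """Return a processor that lowercases `field` when it holds a string."""
    def process(doc: Document) -> Document:
        value = doc.get(field)
        if isinstance(value, str):
            return {**doc, field: value.lower()}
        return doc
    return process


# Example: build a two-step pipeline and apply it to one document.
enrich = chain([set_value("source", "crawler"), lowercase("title")])
result = enrich({"title": "Hello World"})
```

Because every processor has the same signature and knows nothing about where the document came from, the same chain can be reused behind any connector.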
🛠️ Example Processor Types
Processor Type | Description |
---|---|
rename_field | Rename fields in the document (e.g., title → doc.title) |
extract_regex | Extract substrings using regex from text fields |
set_value | Set or override a field with a constant value |
timestamp_parser | Convert string timestamps to a unified format |
truncate | Limit string or array field lengths |
add_tags | Append static or dynamic tags to a document |
lowercase | Normalize text to lowercase |
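To make the table concrete, here is a rough Python sketch of how several of these processors might behave. It is an illustration under stated assumptions, not the intended implementation: the function signatures are hypothetical, `doc.title` is treated as a flat key rather than a nested path, timestamps without a zone are assumed to be UTC, the "unified format" is assumed to be ISO 8601, and tags are assumed to live in a `tags` list.

```python
import re
from datetime import datetime, timezone
from typing import Any, Dict, List

Document = Dict[str, Any]


def rename_field(doc: Document, old: str, new: str) -> Document:
    """Move the value stored under `old` to `new` (flat keys only)."""
    out = dict(doc)
    if old in out:
        out[new] = out.pop(old)
    return out


def extract_regex(doc: Document, source: str, pattern: str, target: str) -> Document:
    """Store the first match (group 1 if the pattern has groups) in `target`."""
    out = dict(doc)
    value = out.get(source)
    if isinstance(value, str):
        match = re.search(pattern, value)
        if match:
            out[target] = match.group(1) if match.groups() else match.group(0)
    return out


def timestamp_parser(doc: Document, field: str, input_format: str) -> Document:
    """Rewrite a string timestamp as ISO 8601 in UTC."""
    out = dict(doc)
    value = out.get(field)
    if isinstance(value, str):
        parsed = datetime.strptime(value, input_format)
        if parsed.tzinfo is None:
            # Assumption: timestamps without a zone are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        out[field] = parsed.astimezone(timezone.utc).isoformat()
    return out


def truncate(doc: Document, field: str, max_length: int) -> Document:
    """Cut a string or list field down to at most `max_length` items."""
    out = dict(doc)
    value = out.get(field)
    if isinstance(value, (str, list)):
        out[field] = value[:max_length]
    return out


def add_tags(doc: Document, tags: List[str]) -> Document:
    """Append tags that are not already present, preserving order."""
    out = dict(doc)
    existing = list(out.get("tags", []))
    out["tags"] = existing + [t for t in tags if t not in existing]
    return out


doc: Document = {"title": "Release v2.4.1 Notes", "created": "2024/03/05 14:30", "tags": ["docs"]}
doc = rename_field(doc, "title", "doc.title")
doc = extract_regex(doc, "doc.title", r"v(\d+\.\d+\.\d+)", "version")
doc = timestamp_parser(doc, "created", "%Y/%m/%d %H:%M")
doc = truncate(doc, "doc.title", 7)
doc = add_tags(doc, ["enriched", "docs"])
```

Each function returns a new dictionary instead of mutating its input, which keeps the steps safe to reorder or retry.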
🔧 Configuration Example
pipeline:
  - name: enrich_documents
    auto_start: false
    keep_running: true
    processor:
      - consumer:
          auto_commit_offset: true
          queue_selector:
            keys:
              - indexing_documents
          consumer:
            group: enriched_documents
            fetch_max_messages: 10
          processor:
            - document_summarization:
                model: $[[env.ENRICHMENT_MODEL]]
                input_queue: "indexing_documents"
                min_input_document_length: 500
                output_queue:
                  name: "enriched_documents"
                  label:
                    tag: "enriched"
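The data flow this configuration describes can be approximated with in-memory queues. The Python sketch below is only an illustration of the idea and does not use the framework's real API. It assumes that `fetch_max_messages` caps the batch size, that documents shorter than `min_input_document_length` pass through without a summary, and that text lives in a hypothetical `content` field with the result written to a hypothetical `summary` field. The summarizer is a placeholder standing in for the configured model.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List

Document = Dict[str, Any]


def consume_batch(queue: Deque[Document], fetch_max_messages: int) -> List[Document]:
    """Take up to `fetch_max_messages` documents from the front of the queue."""
    batch: List[Document] = []
    while queue and len(batch) < fetch_max_messages:
        batch.append(queue.popleft())
    return batch


def run_once(
    input_queue: Deque[Document],
    output_queue: Deque[Document],
    summarize: Callable[[str], str],
    min_length: int,
    fetch_max_messages: int = 10,
) -> int:
    """Process one batch and return how many documents were handled."""
    batch = consume_batch(input_queue, fetch_max_messages)
    for doc in batch:
        text = doc.get("content", "")
        if len(text) >= min_length:
            # Only long enough documents are summarized (assumed semantics).
            doc = {**doc, "summary": summarize(text)}
        output_queue.append(doc)
    return len(batch)


# Placeholder summarizer: keeps the first 20 characters.
indexing_documents: Deque[Document] = deque(
    [{"id": 1, "content": "x" * 600}, {"id": 2, "content": "short"}]
)
enriched_documents: Deque[Document] = deque()
processed = run_once(
    indexing_documents, enriched_documents, lambda text: text[:20], min_length=500
)
```

Keeping the queue handling separate from the enrichment step mirrors the nesting in the configuration, where the consumer wraps an inner list of processors.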
📁 Reference
Some existing legacy code can be migrated to processors: https://github.com/infinilabs/crawler/tree/master/pipeline/joints