Description
Summary
OpenContracts has a consistency issue when the DEFAULT_EMBEDDER setting changes. The system creates embeddings with whichever embedder is configured at the time, so content embedded before and after a change ends up under different embedders. Because vector search filters by embedder, results become inconsistent and older content can become invisible to search. There is currently no mechanism to re-embed existing content when the embedder changes.
Current Embedder Selection Logic
Document Creation Pipeline
# opencontractserver/documents/signals.py:53-54
# Removed embedding calculation from document creation
# Embeddings will now be calculated only when document is linked to a corpus
Structural Annotation Creation
# opencontractserver/annotations/signals.py:41-43
calculate_embedding_for_annotation_text.si(
    annotation_id=instance.id  # No embedder_path specified - uses DEFAULT_EMBEDDER
).apply_async()
Document-to-Corpus Addition
# opencontractserver/documents/signals.py:88-90
embedder_path = instance.preferred_embedder or getattr(
    settings, "DEFAULT_EMBEDDER", None
)
The Problem: Orphaned Embeddings
When DEFAULT_EMBEDDER changes in config/settings/base.py:604, the system exhibits inconsistent behavior:
- Existing structural annotations retain embeddings created with the old embedder
- New structural annotations get embeddings created with the new embedder
- Vector searches filter by embedder_path, so they miss content embedded with a different embedder
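To make the failure mode concrete, here is a minimal in-memory sketch of an exact-match filter on embedder_path. It does not use the real models or vector store; `StoredEmbedding` and `visible_to_search` are hypothetical names invented for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StoredEmbedding:
    """Hypothetical stand-in for one stored embedding row."""
    annotation_id: int
    embedder_path: str


def visible_to_search(
    embeddings: List[StoredEmbedding], active_embedder_path: str
) -> List[StoredEmbedding]:
    """Keep only rows whose embedder_path exactly matches the active embedder."""
    return [e for e in embeddings if e.embedder_path == active_embedder_path]


embeddings = [
    StoredEmbedding(annotation_id=1, embedder_path="old-embedder"),
    StoredEmbedding(annotation_id=2, embedder_path="old-embedder"),
    StoredEmbedding(annotation_id=3, embedder_path="new-embedder"),
]

# After the default switches to "new-embedder", only annotation 3 is searchable.
hits = visible_to_search(embeddings, "new-embedder")
hidden = [e.annotation_id for e in embeddings if e not in hits]
```

Annotations 1 and 2 still have embeddings in storage, but nothing that filters on the new path will ever return them.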
Embedder Tracking
# opencontractserver/annotations/models.py:330-335
embedder_path: str = django.db.models.CharField(
    max_length=256,
    null=True,
    blank=True,
    help_text="Identifier for the embedding model or pipeline used (e.g. 'openai/text-embedding-ada-002').",
)
Vector Store Filtering
# opencontractserver/llms/vector_stores/core_vector_stores.py:120-125
embedder_class, detected_embedder_path = get_embedder(
    corpus_id=corpus_id,
    embedder_path=embedder_path,
)
self.embedder_path = detected_embedder_path
Corner Cases & Impact Scenarios
1. Mixed-Era Documents
Scenario: A document with structural annotations created before and after an embedder change.
- Annotation A: embedder_path="old-embedder"
- Annotation B: embedder_path="new-embedder"
- Impact: Vector searches only find annotations matching the current embedder
2. Standalone Document Chat
Scenario: Document uploaded before embedder change, accessed via standalone consumer.
# Proposed workaround in standalone consumer
vector_store = CoreAnnotationVectorStore(
    document_id=self.document_id,
    embedder_path=settings.DEFAULT_EMBEDDER,  # May not match document's existing embeddings
    corpus_id=None,
)
- Impact: Vector search returns empty results despite document having embeddings
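One possible mitigation for the standalone case, sketched under assumptions: rather than always passing settings.DEFAULT_EMBEDDER, the consumer could look at which embedder paths the document's existing embeddings actually carry and fall back to the default only when there are none. `pick_embedder_for_document` is a hypothetical helper, and the policy it encodes (prefer the default if present, otherwise the most common existing path) is just one option.

```python
from collections import Counter
from typing import Iterable, Optional


def pick_embedder_for_document(
    existing_paths: Iterable[Optional[str]],
    default_embedder: Optional[str],
) -> Optional[str]:
    """Choose an embedder path that the document's stored embeddings can satisfy.

    existing_paths: embedder_path values found on the document's embeddings
    (hypothetical input; how they are queried is left out of this sketch).
    """
    counts = Counter(path for path in existing_paths if path)
    if not counts:
        # Nothing embedded yet: the default is the only sensible choice.
        return default_embedder
    if default_embedder in counts:
        # The document already has embeddings for the current default.
        return default_embedder
    # Otherwise search with the embedder that covers the most annotations.
    return counts.most_common(1)[0][0]
```

For a document embedded entirely with "old-embedder", this returns "old-embedder" even when the default is now "new-embedder", so the search at least matches what is stored. It is a stopgap, not a substitute for re-embedding.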
3. Cross-Corpus Document Movement
Scenario: Document moves between corpuses with different preferred_embedder settings.
# opencontractserver/annotations/signals.py:94-100
for corpus_id, preferred_embedder in corpus_embedders:
    embedder_path = preferred_embedder or getattr(
        settings, "DEFAULT_EMBEDDER", None
    )
- Impact: Same annotation gets multiple embeddings with different embedders, causing search inconsistency
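If multiple embeddings per annotation are intended here, the set of embedders could at least be resolved explicitly and deduplicated, so that each (annotation, embedder) pair is computed once and it is clear which ones exist. A rough sketch follows, reusing the `preferred_embedder or default` fallback from the loop above; `embedders_needed` and its `already_embedded` argument are hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple


def embedders_needed(
    corpus_embedders: Iterable[Tuple[int, Optional[str]]],
    default_embedder: Optional[str],
    already_embedded: Set[str],
) -> List[str]:
    """Return the distinct embedder paths an annotation still needs.

    corpus_embedders: (corpus_id, preferred_embedder) pairs, as in the loop above.
    already_embedded: embedder paths the annotation already has embeddings for.
    """
    needed: List[str] = []
    for _corpus_id, preferred_embedder in corpus_embedders:
        embedder_path = preferred_embedder or default_embedder
        if not embedder_path:
            continue  # No embedder configured anywhere: nothing to compute.
        if embedder_path in already_embedded or embedder_path in needed:
            continue  # Skip work that is done or already planned.
        needed.append(embedder_path)
    return needed
```

For example, an annotation that already has a "b" embedding and sits in corpuses preferring "a", nothing (default "b"), and "a" again only needs one new embedding, for "a".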
4. System Migration/Upgrade
Scenario: Deploying with updated DEFAULT_EMBEDDER after system has been running.
- Current behavior: No automatic re-embedding occurs
- Impact: All existing structural annotations become invisible to new vector searches
Proposed Solution
- Embedder metadata tracking in database
- Automatic migration triggers when embedders change
- Background re-embedding tasks for large datasets
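As a starting point for the background re-embedding piece, here is a minimal sketch of the planning step: find annotations whose recorded embedder differs from the target, then split them into fixed-size batches. Both helpers are hypothetical and work on plain dicts and lists; wiring each batch to an actual task (for instance the existing embedding task, assuming it accepts an explicit embedder_path) is left out.

```python
from typing import Dict, Iterator, List, Optional


def stale_annotation_ids(
    embedder_by_annotation: Dict[int, Optional[str]],
    target_embedder: str,
) -> List[int]:
    """Return IDs whose recorded embedder_path is missing or not the target."""
    return sorted(
        annotation_id
        for annotation_id, path in embedder_by_annotation.items()
        if path != target_embedder
    )


def batched(ids: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most batch_size IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


recorded = {3: "old-embedder", 1: None, 2: "new-embedder", 4: "old-embedder"}
stale = stale_annotation_ids(recorded, "new-embedder")
batches = list(batched(stale, batch_size=2))
```

Batching keeps each background job small and makes the migration resumable: a failed batch can be retried without redoing the ones that already succeeded.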