
Embedder Consistency Issue: Need for Re-embedding Support #437

@JSv4

Summary

OpenContracts has a consistency issue when the DEFAULT_EMBEDDER setting changes. Embeddings created before the change keep the old embedder, while new content is embedded with the new one. Because vector searches filter by embedder_path, results become inconsistent and older content can drop out of search entirely. There is currently no mechanism to re-embed existing content when the embedder changes.

Current Embedder Selection Logic

Document Creation Pipeline

# opencontractserver/documents/signals.py:53-54
# Removed embedding calculation from document creation
# Embeddings will now be calculated only when document is linked to a corpus

Structural Annotation Creation

# opencontractserver/annotations/signals.py:41-43
calculate_embedding_for_annotation_text.si(
    annotation_id=instance.id  # No embedder_path specified - uses DEFAULT_EMBEDDER
).apply_async()

Document-to-Corpus Addition

# opencontractserver/documents/signals.py:88-90
embedder_path = instance.preferred_embedder or getattr(
    settings, "DEFAULT_EMBEDDER", None
)
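The fallback above amounts to a precedence rule: a corpus-level preferred_embedder wins, otherwise DEFAULT_EMBEDDER applies. A minimal standalone sketch of that rule follows; `resolve_embedder_path` is a hypothetical helper for illustration, not a function in OpenContracts.

```python
from typing import Optional


def resolve_embedder_path(
    preferred_embedder: Optional[str],
    default_embedder: Optional[str],
) -> Optional[str]:
    """Hypothetical helper mirroring `preferred or default`.

    Because `or` is used, an empty string is treated the same as None
    and falls through to the default.
    """
    return preferred_embedder or default_embedder
```

Note that the result depends on the value of the default at call time, so the same corpus can resolve to different embedders before and after a settings change.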

The Problem: Orphaned Embeddings

When DEFAULT_EMBEDDER changes in config/settings/base.py:604, the system exhibits inconsistent behavior:

  1. Existing structural annotations retain embeddings created with the old embedder
  2. New structural annotations get embeddings created with the new embedder
  3. Vector searches filter by embedder_path, missing content embedded with different embedders
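The effect of the three points above can be reproduced without a database. The sketch below is a simplified model, assuming search keeps only rows whose embedder_path equals the active embedder; `StoredEmbedding` and `visible_to_search` are illustrative names, not OpenContracts classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredEmbedding:
    """Illustrative stand-in for a stored embedding row."""
    annotation_id: int
    embedder_path: str


def visible_to_search(
    embeddings: List[StoredEmbedding], active_embedder_path: str
) -> List[int]:
    """Return annotation ids whose embedding matches the active embedder."""
    return [
        e.annotation_id
        for e in embeddings
        if e.embedder_path == active_embedder_path
    ]


# Annotations 1 and 2 were embedded before the settings change, 3 after it.
embeddings = [
    StoredEmbedding(1, "old-embedder"),
    StoredEmbedding(2, "old-embedder"),
    StoredEmbedding(3, "new-embedder"),
]

visible = visible_to_search(embeddings, "new-embedder")
orphaned = [e.annotation_id for e in embeddings if e.annotation_id not in visible]
```

Here `orphaned` holds the annotations that still have embeddings but can no longer be reached by a search using the new embedder.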

Embedder Tracking

# opencontractserver/annotations/models.py:330-335
embedder_path: str = django.db.models.CharField(
    max_length=256,
    null=True,
    blank=True,
    help_text="Identifier for the embedding model or pipeline used (e.g. 'openai/text-embedding-ada-002').",
)

Vector Store Filtering

# opencontractserver/llms/vector_stores/core_vector_stores.py:120-125
embedder_class, detected_embedder_path = get_embedder(
    corpus_id=corpus_id,
    embedder_path=embedder_path,
)
self.embedder_path = detected_embedder_path

Corner Cases & Impact Scenarios

1. Mixed-Era Documents

Scenario: A document with structural annotations created before and after an embedder change.

  • Annotation A: embedder_path="old-embedder"
  • Annotation B: embedder_path="new-embedder"
  • Impact: Vector searches only find annotations matching the current embedder

2. Standalone Document Chat

Scenario: Document uploaded before embedder change, accessed via standalone consumer.

# Proposed workaround in standalone consumer
vector_store = CoreAnnotationVectorStore(
    document_id=self.document_id,
    embedder_path=settings.DEFAULT_EMBEDDER,  # May not match document's existing embeddings
    corpus_id=None
)
  • Impact: Vector search returns empty results despite document having embeddings
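To tell whether a given document is affected, it is enough to compare the embedder used for the query with the embedder paths already stored for that document. This is a diagnostic sketch only; `search_would_be_empty` is a hypothetical helper, and how the stored paths are fetched is left out.

```python
from typing import Iterable


def search_would_be_empty(
    existing_embedder_paths: Iterable[str], query_embedder_path: str
) -> bool:
    """True when the document has embeddings, but none from the query embedder.

    A document with no embeddings at all returns False, since that is a
    different problem (nothing was ever embedded).
    """
    paths = set(existing_embedder_paths)
    return bool(paths) and query_embedder_path not in paths
```

A True result corresponds to the scenario above: the document has embeddings, but a search with the current DEFAULT_EMBEDDER cannot see them.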

3. Cross-Corpus Document Movement

Scenario: Document moves between corpuses with different preferred_embedder settings.

# opencontractserver/annotations/signals.py:94-100
for corpus_id, preferred_embedder in corpus_embedders:
    embedder_path = preferred_embedder or getattr(
        settings, "DEFAULT_EMBEDDER", None
    )
  • Impact: Same annotation gets multiple embeddings with different embedders, causing search inconsistency
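The loop above yields one embedder per corpus, so a document that belongs to several corpuses can need several embeddings per annotation. The sketch below collects the distinct embedders involved, assuming `corpus_embedders` is a sequence of `(corpus_id, preferred_embedder)` pairs as the loop suggests; `required_embedder_paths` is a hypothetical name.

```python
from typing import Iterable, Optional, Set, Tuple


def required_embedder_paths(
    corpus_embedders: Iterable[Tuple[int, Optional[str]]],
    default_embedder: Optional[str],
) -> Set[str]:
    """Collect the distinct embedder paths needed across all corpuses.

    Corpuses without a preferred embedder fall back to the default;
    entries that resolve to nothing are skipped.
    """
    resolved = (preferred or default_embedder for _, preferred in corpus_embedders)
    return {path for path in resolved if path}
```

If the returned set has more than one element, the same annotation ends up with embeddings from different embedders.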

4. System Migration/Upgrade

Scenario: Deploying with updated DEFAULT_EMBEDDER after system has been running.

  • Current behavior: No automatic re-embedding occurs
  • Impact: All existing structural annotations become invisible to new vector searches

Proposed Solution

  1. Embedder metadata tracking in database
  2. Automatic migration triggers when embedders change
  3. Background re-embedding tasks for large datasets
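As a rough illustration of point 3, the sketch below selects annotations that lack an embedding from the target embedder and splits them into batches. It is not a proposed implementation: both function names are hypothetical, the input mapping stands in for a database query, and the batch size is arbitrary.

```python
from typing import Dict, Iterator, List, Set


def annotations_needing_reembedding(
    embedders_by_annotation: Dict[int, Set[str]],
    target_embedder_path: str,
) -> List[int]:
    """Return ids of annotations with no embedding from the target embedder."""
    return sorted(
        annotation_id
        for annotation_id, paths in embedders_by_annotation.items()
        if target_embedder_path not in paths
    )


def in_batches(ids: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `batch_size` ids."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Stand-in for a query result: annotation id -> embedder paths already stored.
existing = {
    1: {"old-embedder"},
    2: {"old-embedder", "new-embedder"},
    3: {"old-embedder"},
    4: set(),
}

pending = annotations_needing_reembedding(existing, "new-embedder")
batches = list(in_batches(pending, batch_size=2))
```

Each batch could then be handed to a background task such as calculate_embedding_for_annotation_text with an explicit embedder. That the task accepts an embedder_path argument is an assumption based on the comment in the signal snippet earlier in this issue.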
