Embedder Consistency Issue: Need for Re-embedding Support #437

## Summary

OpenContracts has a consistency issue when the `DEFAULT_EMBEDDER` setting changes. Embeddings created before the change keep the old embedder's vectors, while new content is embedded with the new one. Because vector searches filter by `embedder_path`, results become inconsistent and older content can silently drop out of search. There is currently no mechanism to re-embed existing content when the embedder changes.

## Current Embedder Selection Logic

### Document Creation Pipeline
```python
# opencontractserver/documents/signals.py:53-54
# Removed embedding calculation from document creation
# Embeddings will now be calculated only when document is linked to a corpus
```

### Structural Annotation Creation
```python
# opencontractserver/annotations/signals.py:41-43
calculate_embedding_for_annotation_text.si(
    annotation_id=instance.id  # No embedder_path specified - uses DEFAULT_EMBEDDER
).apply_async()
```

### Document-to-Corpus Addition
```python
# opencontractserver/documents/signals.py:88-90
embedder_path = instance.preferred_embedder or getattr(
    settings, "DEFAULT_EMBEDDER", None
)
```

## The Problem: Orphaned Embeddings

When `DEFAULT_EMBEDDER` changes in `config/settings/base.py:604`, the system exhibits inconsistent behavior:

1. **Existing structural annotations** retain embeddings created with the old embedder
2. **New structural annotations** get embeddings created with the new embedder
3. **Vector searches** filter by `embedder_path`, so they miss content embedded with a different embedder
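The effect can be reproduced without a database. The following is a minimal sketch that uses plain Python objects in place of the real models; `StoredEmbedding` and `search_candidates` are hypothetical names chosen for illustration, not OpenContracts APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredEmbedding:
    """Hypothetical stand-in for one stored embedding row."""

    annotation_id: int
    embedder_path: str


def search_candidates(rows, embedder_path):
    """Return annotation ids whose embedding was produced by `embedder_path`.

    Rows written by any other embedder are never considered,
    regardless of what text they represent.
    """
    return [r.annotation_id for r in rows if r.embedder_path == embedder_path]


rows = [
    StoredEmbedding(1, "old-embedder"),  # created before the settings change
    StoredEmbedding(2, "old-embedder"),
    StoredEmbedding(3, "new-embedder"),  # created after the settings change
]

visible = search_candidates(rows, "new-embedder")
hidden = [r.annotation_id for r in rows if r.annotation_id not in visible]
```

Here annotations 1 and 2 still have embeddings, but a search scoped to `"new-embedder"` cannot reach them.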

### Embedder Tracking
```python
# opencontractserver/annotations/models.py:330-335
embedder_path: str = django.db.models.CharField(
    max_length=256,
    null=True,
    blank=True,
    help_text="Identifier for the embedding model or pipeline used (e.g. 'openai/text-embedding-ada-002').",
)
```

### Vector Store Filtering
```python
# opencontractserver/llms/vector_stores/core_vector_stores.py:120-125
embedder_class, detected_embedder_path = get_embedder(
    corpus_id=corpus_id,
    embedder_path=embedder_path,
)
self.embedder_path = detected_embedder_path
```

## Corner Cases & Impact Scenarios

### 1. **Mixed-Era Documents**
**Scenario**: A document with structural annotations created before and after an embedder change.
- Annotation A: `embedder_path="old-embedder"`
- Annotation B: `embedder_path="new-embedder"`
- **Impact**: Vector searches only find annotations whose `embedder_path` matches the current embedder, so Annotation A is never returned

### 2. **Standalone Document Chat**
**Scenario**: Document uploaded before embedder change, accessed via standalone consumer.
```python
# Proposed workaround in standalone consumer
vector_store = CoreAnnotationVectorStore(
    document_id=self.document_id,
    embedder_path=settings.DEFAULT_EMBEDDER,  # May not match document's existing embeddings
    corpus_id=None
)
```
- **Impact**: Vector search returns empty results despite document having embeddings
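One possible mitigation, sketched here under assumptions, is to choose the embedder from what the document already has instead of always passing `settings.DEFAULT_EMBEDDER`. `pick_embedder_for_document` is a hypothetical helper, and the list of existing paths is assumed to have been fetched from the document's stored embeddings by the caller.

```python
from collections import Counter


def pick_embedder_for_document(existing_paths, default_path):
    """Pick the embedder path that covers the most existing embeddings.

    `existing_paths` holds one embedder_path per stored embedding.
    Missing values (None or "") are ignored. If the document has no
    usable embeddings, the default is returned.
    """
    counts = Counter(p for p in existing_paths if p)
    if not counts:
        return default_path
    # most_common(1) returns a one-element list of (path, count) pairs.
    return counts.most_common(1)[0][0]


legacy_doc = pick_embedder_for_document(
    ["old-embedder", "old-embedder", "new-embedder"], "new-embedder"
)
empty_doc = pick_embedder_for_document([], "new-embedder")
unlabeled_doc = pick_embedder_for_document([None, ""], "new-embedder")
```

This keeps legacy documents searchable, at the cost of querying them with an embedder that is no longer the system default.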

### 3. **Cross-Corpus Document Movement**
**Scenario**: Document moves between corpuses with different `preferred_embedder` settings.
```python
# opencontractserver/annotations/signals.py:94-100
for corpus_id, preferred_embedder in corpus_embedders:
    embedder_path = preferred_embedder or getattr(
        settings, "DEFAULT_EMBEDDER", None
    )
```
- **Impact**: Same annotation gets multiple embeddings with different embedders, causing search inconsistency
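To make the multiplication explicit, the sketch below computes which distinct embedder paths an annotation would need, given the `(corpus_id, preferred_embedder)` pairs iterated in the loop above. `embedders_needed` is a hypothetical helper written for illustration.

```python
def embedders_needed(corpus_embedders, default_embedder):
    """Return the distinct embedder paths required, in first-seen order.

    `corpus_embedders` is an iterable of (corpus_id, preferred_embedder)
    pairs. A corpus without a preference falls back to the default.
    Corpora that resolve to no embedder at all are skipped.
    """
    needed = []
    for _corpus_id, preferred in corpus_embedders:
        path = preferred or default_embedder
        if path and path not in needed:
            needed.append(path)
    return needed


paths = embedders_needed(
    [(1, "legal-embedder"), (2, None), (3, "legal-embedder")],
    "default-embedder",
)
no_config = embedders_needed([(1, None)], None)
```

In this example three corpora resolve to two embedders, so the annotation ends up with two embeddings, and each search only sees the one that matches its own `embedder_path`.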

### 4. **System Migration/Upgrade**
**Scenario**: Deploying with updated `DEFAULT_EMBEDDER` after system has been running.
- **Current behavior**: No automatic re-embedding occurs
- **Impact**: All existing structural annotations become invisible to new vector searches
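Before deploying a changed `DEFAULT_EMBEDDER`, it may help to measure how much content would be affected. The following is a rough sketch of such an audit over a list of stored `embedder_path` values; `stale_report` is a hypothetical name, and loading the paths from the database is left to the caller.

```python
from collections import Counter


def stale_report(embedder_paths, current_embedder):
    """Summarize how many embeddings do not match `current_embedder`."""
    counts = Counter(embedder_paths)
    total = sum(counts.values())
    stale = total - counts.get(current_embedder, 0)
    return {"total": total, "stale": stale, "by_embedder": dict(counts)}


report = stale_report(
    ["old-embedder", "old-embedder", "old-embedder", "new-embedder"],
    "new-embedder",
)
```

A non-zero `stale` count indicates embeddings that searches using the current embedder will not match.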

## Proposed Solution
1. **Embedder metadata tracking** in database
2. **Automatic migration triggers** when embedders change
3. **Background re-embedding tasks** for large datasets
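
As a starting point for discussion, the background re-embedding step could look roughly like the sketch below. It is not an implementation proposal for the actual task code: annotations are modeled as plain dicts, `reembed_missing` and `batched` are hypothetical helpers, and `embed_fn` stands in for whatever embedder call the real task would make.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def reembed_missing(annotations, target_path, embed_fn, batch_size=100):
    """Add a `target_path` embedding to every annotation that lacks one.

    Each annotation is a dict with "text" and "embeddings" (a mapping of
    embedder_path to vector). Existing embeddings are kept, so the
    operation is idempotent and can be re-run after an interruption.
    Returns the number of annotations that were embedded.
    """
    stale = [a for a in annotations if target_path not in a["embeddings"]]
    updated = 0
    for chunk in batched(stale, batch_size):
        # In a real system each chunk could be dispatched as one task.
        for annotation in chunk:
            annotation["embeddings"][target_path] = embed_fn(annotation["text"])
            updated += 1
    return updated


def fake_embed(text):
    """Placeholder embedder: a one-dimensional vector of the text length."""
    return [float(len(text))]


annotations = [
    {"id": 1, "text": "abc", "embeddings": {"old-embedder": [1.0]}},
    {"id": 2, "text": "de", "embeddings": {"new-embedder": [2.0]}},
]
first_run = reembed_missing(annotations, "new-embedder", fake_embed, batch_size=1)
second_run = reembed_missing(annotations, "new-embedder", fake_embed, batch_size=1)
```

Keeping the old embedding in place until the new one exists means searches against either embedder keep working during the migration.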
