[ENH]: Globalize dead letter queue in SysDB #6003
base: main
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving.
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Persist dead-letter state for compaction jobs in SysDB

Introduces a global DLQ by persisting compaction failure counts per collection in SysDB and wiring the compactor to use that state. The change removes in-memory failure tracking and adds a `compaction_failure_count` column to the `collections` table.

Key Changes
• Replace the scheduler's in-memory `failing_jobs` map with the failure count persisted in SysDB.

Affected Areas
• SysDB and the compactor scheduler.

This summary was automatically generated by @propel-code-bot
    lineage_file_path: None,
    updated_at: SystemTime::UNIX_EPOCH,
    database_id,
    compaction_failure_count: 0,
The `compaction_failure_count` is hardcoded to 0 here. It should be fetched from the database row to correctly reflect the persisted failure count for the collection. This will cause the DLQ logic to not work as intended for the SQLite backend, as it will always report 0 failures.
To fix this, you'll need to:
1. Update the `sea_query` builder in `get_collections_with_conn` to select the `compaction_failure_count` column.
2. Parse the value from the `SqliteRow` and use it when constructing the `Collection` struct.
This will likely also require changes in `chroma-sqlite/src/table.rs` to add `CompactionFailureCount` to the `Collections` enum, and an update to the SQLite schema creation logic to include the new column.
Example of what needs to change in `get_collections_with_conn`:
```rust
// In the query builder
// ...
.column((table::Collections::Table, table::Collections::SchemaStr))
.column((table::Collections::Table, table::Collections::CompactionFailureCount)) // Add this
// ...
// In the row processing loop
// ...
let compaction_failure_count: i32 = first_row.get("compaction_failure_count"); // Get the value
// ...
Some(Ok(Collection {
// ...
compaction_failure_count, // Use the fetched value
}))
// ...
```
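To make the `table.rs` side of that suggestion concrete, here is a minimal sketch, assuming the table identifiers in `chroma-sqlite/src/table.rs` are modeled as a sea_query `Iden` enum; the variants other than the new one are placeholders, not the crate's actual contents:

```rust
// Hypothetical sketch, not the actual contents of chroma-sqlite/src/table.rs.
// With sea_query's derive, `Table` renders as the table name ("collections")
// and every other variant renders as its snake_case column name.
use sea_query::Iden;

#[derive(Iden)]
pub enum Collections {
    Table,                  // "collections"
    Id,                     // placeholder for an existing column
    SchemaStr,              // "schema_str"
    CompactionFailureCount, // "compaction_failure_count" (the new column)
}
```

The schema-creation change would then amount to adding a `compaction_failure_count` column with a `NOT NULL DEFAULT 0` constraint to the `collections` table, matching the default described in the migration plan below.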
File: rust/sysdb/src/sqlite.rs
Line: 864

Description of changes
Summarize the changes made by this PR.
- Adds a `compaction_failure_count` column on the `collections` table to implement a global DLQ.
- The in-memory `failing_jobs` map has been removed.
- The `get_dead_jobs()` endpoint has been removed, as this information should be derived from the SysDB now (a sketch follows this list).
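To make "derived from the SysDB" concrete, the following is a hypothetical sketch, not the PR's actual scheduler code; `CollectionRecord`, `schedulable`, and `max_failure_count` are illustrative names:

```rust
// Hypothetical sketch: with the failure count persisted in SysDB, "dead" jobs
// can be derived from the fetched collections instead of an in-memory map.
struct CollectionRecord {
    id: String,
    compaction_failure_count: u64,
}

/// Keep only collections whose persisted failure count is below the threshold;
/// everything filtered out is, by definition, the dead letter queue.
fn schedulable(
    collections: Vec<CollectionRecord>,
    max_failure_count: u64,
) -> Vec<CollectionRecord> {
    collections
        .into_iter()
        .filter(|c| c.compaction_failure_count < max_failure_count)
        .collect()
}

fn main() {
    let collections = vec![
        CollectionRecord { id: "healthy".into(), compaction_failure_count: 0 },
        CollectionRecord { id: "dead".into(), compaction_failure_count: 12 },
    ];
    // With a threshold of 10, only "healthy" is scheduled; "dead" stays in the DLQ.
    for c in schedulable(collections, 10) {
        println!("scheduling compaction for {}", c.id);
    }
}
```

Because the count lives in SysDB rather than in any one process, every compactor sees the same dead-letter set, which is what makes the DLQ global.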
Test plan
How are these changes tested?
`test_k8s_integration_scheduler` already tests this feature.
- Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust
Migration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
The new column in `collections` is defaulted to 0 in the database, Go model, and proto definitions.
Observability plan
What is the plan to instrument and monitor this change?
The DLQ is more easily observable from the SysDB now. `compactor_job_failure_count` replaces the `compactor_dead_jobs_count` metric, and we increment it on every failure instead of every time a job is "killed".
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the _docs_ section?