Description
A note for the community
No response
Use Cases
When someone needs to debug which offset has been read for a specific file by looking at the checkpoint file, they have to, depending on the fingerprint strategy, either resolve the device-inode pair or compute the checksum of each file, and then take extra steps to see which file corresponds to which fingerprint. This causes immense trouble when debugging.
Another use case is deleting or moving files based on checkpoint information: you look at the checkpoint and, if the checkpoint offset points to the end of the file, move the file to S3 and carry on.
In these scenarios, the whole fingerprinting logic has to be ported to whatever language the surrounding services are written in.
We propose including the file path alongside the fingerprint in the checkpoint file, which would be highly beneficial.
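To illustrate the second use case, here is a minimal sketch of what an external tool could do once each checkpoint entry carries a path. `TrackedFile`, `position_for`, and `fully_read` are hypothetical names, not part of Vector, and the sketch assumes the checkpoint JSON has already been parsed into these values.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical, simplified view of one checkpoint entry that carries a path.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TrackedFile {
    pub path: PathBuf,
    pub position: u64,
}

/// Debugging use case: look up the recorded offset directly by path,
/// without recomputing any fingerprint.
pub fn position_for(entries: &[TrackedFile], path: &Path) -> Option<u64> {
    entries.iter().find(|e| e.path == path).map(|e| e.position)
}

/// Cleanup use case: return the paths whose checkpoint offset has reached
/// the current file length. `file_len` is injected so the logic can be
/// exercised without touching the disk.
pub fn fully_read<'a, F>(entries: &'a [TrackedFile], file_len: F) -> Vec<&'a Path>
where
    F: Fn(&Path) -> Option<u64>,
{
    entries
        .iter()
        // Files whose length cannot be determined are never reported as done.
        .filter(|e| file_len(&e.path).map_or(false, |len| e.position >= len))
        .map(|e| e.path.as_path())
        .collect()
}
```

A real tool might pass `|p| std::fs::metadata(p).ok().map(|m| m.len())` as `file_len` and then move the returned files to S3.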
Attempted Solutions
No response
Proposal
Even though I have no proper Rust experience, looking at the codebase, we can add a new field to this mapping:
vector/lib/file-source-common/src/checkpointer.rs
Lines 57 to 65 in be2dde4
/// A thread-safe handle for reading and writing checkpoints in-memory across
/// multiple threads.
#[derive(Debug, Default)]
pub struct CheckpointsView {
    checkpoints: DashMap<FileFingerprint, FilePosition>,
    modified_times: DashMap<FileFingerprint, DateTime<Utc>>,
    removed_times: DashMap<FileFingerprint, DateTime<Utc>>,
}
Say, something like filepaths: DashMap<FileFingerprint, Path> (the lifetime of a Path will cause some problems, I believe), and then add a field path: Path to this struct:
vector/lib/file-source-common/src/checkpointer.rs
Lines 38 to 47 in be2dde4
/// A simple JSON-friendly struct of the fingerprint/position pair, since
/// fingerprints as objects cannot be keys in a plain JSON map.
#[derive(Debug, Clone, Serialize, Deserialize, Eq, PartialEq, Ord, PartialOrd)]
#[serde(rename_all = "snake_case")]
struct Checkpoint {
    fingerprint: FileFingerprint,
    position: FilePosition,
    modified: DateTime<Utc>,
}
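On the lifetime concern: `Path` is an unsized, borrowed type, so it cannot be stored by value in a map or struct field, whereas `PathBuf` owns its data and needs no lifetime parameter. The following is only a sketch of that idea, using stand-in types: `Fingerprint` and `View` are hypothetical simplifications of `FileFingerprint` and `CheckpointsView`, and `std::collections::HashMap` replaces `DashMap` so the example depends on nothing outside the standard library.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Stand-in for Vector's `FileFingerprint`; only a device/inode variant is modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Fingerprint {
    DevInode(u64, u64),
}

/// Stand-in for `CheckpointsView`, with `HashMap` where the real code uses `DashMap`.
#[derive(Debug, Default)]
pub struct View {
    checkpoints: HashMap<Fingerprint, u64>,
    // Owned `PathBuf` values: the struct needs no lifetime parameter.
    filepaths: HashMap<Fingerprint, PathBuf>,
}

impl View {
    /// Records the read position and the path last seen for a fingerprint.
    pub fn update(&mut self, fp: Fingerprint, position: u64, path: &Path) {
        self.checkpoints.insert(fp, position);
        self.filepaths.insert(fp, path.to_path_buf());
    }

    /// Returns the path recorded for a fingerprint, if any.
    pub fn path_of(&self, fp: &Fingerprint) -> Option<&Path> {
        self.filepaths.get(fp).map(|p| p.as_path())
    }

    /// Returns the read position recorded for a fingerprint, if any.
    pub fn position_of(&self, fp: &Fingerprint) -> Option<u64> {
        self.checkpoints.get(fp).copied()
    }
}
```

Copying the path into a `PathBuf` costs one allocation per update, which is the price of not tying the view to a borrowed path.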
Resolving the path in this method:
vector/lib/file-source-common/src/checkpointer.rs
Lines 148 to 169 in be2dde4
    State::V1 {
        checkpoints: self
            .checkpoints
            .iter()
            .map(|entry| {
                let fingerprint = entry.key();
                let position = entry.value();
                Checkpoint {
                    fingerprint: *fingerprint,
                    position: *position,
                    modified: self
                        .modified_times
                        .get(fingerprint)
                        .map(|r| *r.value())
                        .unwrap_or_else(Utc::now),
                }
            })
            .collect(),
    }
}
async fn maybe_upgrade(
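As a rough sketch of the path-resolution step in that mapping, the snippet below builds the list of serializable entries and looks up each path by fingerprint. `Fp`, `Entry`, and `build_entries` are hypothetical stand-ins, not Vector's types; plain `HashMap`s replace the `DashMap`s, and the `modified` timestamp is left out for brevity. Making the path optional is an assumption of this sketch, chosen so that fingerprints with no recorded path still produce an entry.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Stand-in for `FileFingerprint`: a (device, inode) pair.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct Fp(pub u64, pub u64);

/// Stand-in for the serializable `Checkpoint` struct, minus `modified`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub fingerprint: Fp,
    pub position: u64,
    /// `None` when no path has been recorded for this fingerprint.
    pub path: Option<PathBuf>,
}

/// Builds one entry per known position, resolving the path by fingerprint.
pub fn build_entries(
    positions: &HashMap<Fp, u64>,
    paths: &HashMap<Fp, PathBuf>,
) -> Vec<Entry> {
    let mut entries: Vec<Entry> = positions
        .iter()
        .map(|(fp, pos)| Entry {
            fingerprint: *fp,
            position: *pos,
            path: paths.get(fp).cloned(),
        })
        .collect();
    // Sorted only to make this sketch's output deterministic.
    entries.sort_by_key(|e| e.fingerprint);
    entries
}
```

In the real struct, an optional field would presumably also let checkpoint files written before the change, which have no path key, continue to load; that would need checking against the existing deserialization code.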
Writing the path to the checkpoint shouldn't break anything. The new checkpoint would look like this:
{
  "version": "1",
  "checkpoints": [
    {
      "fingerprint": {
        "dev_inode": [
          66305,
          1073741955
        ]
      },
      "path": "/home/developer/tracked_1.log",
      "position": 51413,
      "modified": "2025-09-23T06:59:42.422470050Z"
    },
    {
      "fingerprint": {
        "dev_inode": [
          66305,
          1073741956
        ]
      },
      "path": "/home/developer/tracked_2.log",
      "position": 10737750,
      "modified": "2025-09-23T12:59:59.914721134Z"
    }
  ]
}
References
No response
Version
vector 0.50.0 (aarch64-apple-darwin 9053198 2025-09-23 14:18:50.944442940)