Adding paths to the checkpoint file #23840

@denizsurmeli

A note for the community

No response

Use Cases

When you need to debug which offset has been read for a specific file, you have to look in the checkpoint file. Depending on the fingerprint strategy, you must then either resolve the device-inode pair back to a path or compute the checksum of every candidate file to see which file corresponds to which fingerprint. This makes debugging far more painful than it needs to be.

Another use case is deleting or moving files based on checkpoint information: you look at the checkpoint and, if its offset points to the end of the file, move the file to S3 and carry on.

In both scenarios, whatever language the surrounding services are written in, the whole fingerprinting logic has to be ported to it.

We propose including the file path alongside the fingerprint in the checkpoint file, which we believe would be highly beneficial.

Attempted Solutions

No response

Proposal

I have no proper Rust experience, but looking at the codebase, we could add a new field to this mapping:

/// A thread-safe handle for reading and writing checkpoints in-memory across
/// multiple threads.
#[derive(Debug, Default)]
pub struct CheckpointsView {
    checkpoints: DashMap<FileFingerprint, FilePosition>,
    modified_times: DashMap<FileFingerprint, DateTime<Utc>>,
    removed_times: DashMap<FileFingerprint, DateTime<Utc>>,
}

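Say, something like filepaths: DashMap<FileFingerprint, PathBuf>. (I believe a bare Path would cause problems here because it is unsized and has to live behind a reference with a lifetime, so an owned PathBuf looks like the safer choice.)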
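
To make the idea concrete, here is a minimal sketch of such a mapping. It is not Vector's code: `FileFingerprint` is reduced to a hypothetical device/inode pair, `RwLock<HashMap<..>>` stands in for `DashMap` so the example needs no external crates, and `update` and `path_of` are illustrative names, not existing methods.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::RwLock;

/// Hypothetical stand-in for Vector's `FileFingerprint` (device/inode only).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct FileFingerprint {
    dev: u64,
    inode: u64,
}

/// Simplified view. `RwLock<HashMap>` is an assumption made for this sketch;
/// the real struct uses `DashMap`.
#[derive(Debug, Default)]
struct CheckpointsView {
    checkpoints: RwLock<HashMap<FileFingerprint, u64>>,
    filepaths: RwLock<HashMap<FileFingerprint, PathBuf>>,
}

impl CheckpointsView {
    /// Record the read offset together with the path the file was seen at.
    fn update(&self, fp: FileFingerprint, position: u64, path: &Path) {
        self.checkpoints.write().unwrap().insert(fp, position);
        // `to_path_buf` copies the path, so the map owns its values and no
        // lifetime has to be threaded through the struct.
        self.filepaths
            .write()
            .unwrap()
            .insert(fp, path.to_path_buf());
    }

    /// Last known path for a fingerprint, if one was recorded.
    fn path_of(&self, fp: &FileFingerprint) -> Option<PathBuf> {
        self.filepaths.read().unwrap().get(fp).cloned()
    }
}
```

Storing an owned `PathBuf` costs one allocation per tracked file, which seems acceptable next to the other per-file maps already kept in the view.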

Then add a field path: PathBuf to this struct:

/// A simple JSON-friendly struct of the fingerprint/position pair, since
/// fingerprints as objects cannot be keys in a plain JSON map.
#[derive(Debug, Clone, Serialize, Deserialize, Eq, PartialEq, Ord, PartialOrd)]
#[serde(rename_all = "snake_case")]
struct Checkpoint {
    fingerprint: FileFingerprint,
    position: FilePosition,
    modified: DateTime<Utc>,
}

The path would then be resolved in this method:

State::V1 {
    checkpoints: self
        .checkpoints
        .iter()
        .map(|entry| {
            let fingerprint = entry.key();
            let position = entry.value();
            Checkpoint {
                fingerprint: *fingerprint,
                position: *position,
                modified: self
                    .modified_times
                    .get(fingerprint)
                    .map(|r| *r.value())
                    .unwrap_or_else(Utc::now),
            }
        })
        .collect(),
}
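
As a rough illustration of what resolving the path in that mapping step could look like, the sketch below builds the serializable entries from two maps. Everything here is an assumption for illustration: `Fingerprint` is a plain (device, inode) tuple, `HashMap` replaces `DashMap`, the `modified` field is left out, and `snapshot` is a made-up function name. The path is an `Option` so that a fingerprint with no recorded path still produces an entry.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical stand-in for the real fingerprint type: (device, inode).
type Fingerprint = (u64, u64);

/// Simplified checkpoint entry with the proposed `path` field.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Checkpoint {
    fingerprint: Fingerprint,
    position: u64,
    /// `None` when no path is known for this fingerprint.
    path: Option<PathBuf>,
}

/// Build the list of entries to be written out, attaching the path where
/// one has been recorded for the fingerprint.
fn snapshot(
    checkpoints: &HashMap<Fingerprint, u64>,
    filepaths: &HashMap<Fingerprint, PathBuf>,
) -> Vec<Checkpoint> {
    let mut entries: Vec<Checkpoint> = checkpoints
        .iter()
        .map(|(fingerprint, position)| Checkpoint {
            fingerprint: *fingerprint,
            position: *position,
            path: filepaths.get(fingerprint).cloned(),
        })
        .collect();
    // HashMap iteration order is unspecified, so sort for a stable output.
    entries.sort_by_key(|c| c.fingerprint);
    entries
}
```

Falling back to `None` instead of failing keeps the behavior close to how the existing code falls back to `Utc::now` when no modification time is stored.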

Writing the path to the checkpoint file shouldn't break anything. The new checkpoint would look like this:

{
  "version": "1",
  "checkpoints": [
    {
      "fingerprint": {
        "dev_inode": [
          66305,
          1073741955
        ]
      },
      "path": "/home/developer/tracked_1.log",
      "position": 51413,
      "modified": "2025-09-23T06:59:42.422470050Z"
    },
    {
      "fingerprint": {
        "dev_inode": [
          66305,
          1073741956
        ]
      },
      "path": "/home/developer/tracked_2.log",
      "position": 10737750,
      "modified": "2025-09-23T12:59:59.914721134Z"
    }
  ]
}
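
With the path available in each entry, the archiving use case described above no longer needs any fingerprinting logic. The following is a minimal sketch of that check under stated assumptions: `CheckpointEntry` and `is_fully_read` are hypothetical names, the entry is assumed to be already parsed from the JSON, and the check is only safe if nothing is still appending to the file.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical, already-parsed view of one checkpoint entry.
struct CheckpointEntry {
    path: PathBuf,
    position: u64,
}

/// Returns `Ok(true)` when the checkpoint offset has reached the current end
/// of the file, meaning every byte written so far has been read.
/// Returns an error if the file cannot be inspected (for example, if it was
/// rotated away and the recorded path is stale).
fn is_fully_read(entry: &CheckpointEntry) -> io::Result<bool> {
    let len = fs::metadata(&entry.path)?.len();
    Ok(entry.position >= len)
}
```

A service would call `is_fully_read` for each entry and only move files for which it returns `Ok(true)`. Since the recorded path can go stale after a rename, treating an error as "skip this file" is probably the safer default.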

References

No response

Version

vector 0.50.0 (aarch64-apple-darwin 9053198 2025-09-23 14:18:50.944442940)

Metadata

Assignees

No one assigned

    Labels

    source: file (Anything `file` source related)
    type: feature (A value-adding code addition that introduce new functionality.)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
