Improved: clean-nfs-cache performance #1572
base: main
Conversation
Also, do not ignore .terraform.lock.hcl; move it to iac/provider-gcp.
- remove some obsolete settings
- upgrade golangci-lint to v2
- auto configure extensions using mise
- remove gopls from alternate tools
…nto lev-improve-clean-nfs
…nto lev-improve-clean-nfs
```go
Logger:    log,
mu:        sync.RWMutex{},
basePath:  filepath.Dir(opts.Path),
cacheRoot: Dir{
```
Memory-unbounded cache: the PR description mentions "not memory-bounded yet" for the global caching. The cacheRoot Dir struct recursively stores all directories and files in memory. For large filesystems with millions of files, this could consume excessive memory and potentially cause an OOM. Consider implementing a memory limit or an LRU eviction policy.
Understood, will be discussed in the review.
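For illustration only, here is a minimal sketch of what a simple bound could look like; `boundedCache`, `maxCachedEntries`, and `evictColdest` are hypothetical names, not a proposal for the PR's actual design:

```go
package cache

import "sync"

// Hypothetical sketch, not part of this PR: cap how many entries the
// in-memory tree may hold and let the caller evict when over the limit.
type boundedCache struct {
	mu               sync.Mutex
	entryCount       int
	maxCachedEntries int // 0 means unbounded
}

// noteEntry records one newly cached file or directory; evictColdest is a
// caller-supplied callback that drops, e.g., the least recently scanned
// subtree when the limit is exceeded.
func (b *boundedCache) noteEntry(evictColdest func()) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.entryCount++
	if b.maxCachedEntries > 0 && b.entryCount > b.maxCachedEntries {
		evictColdest()
		b.entryCount--
	}
}
```

A true LRU would also need per-entry recency tracking; this only caps the total entry count.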
```go
// reinsert the "younger" candidates back into the directory tree
c.timeit(ctx, "reinserting candidates", func() {
	c.reinsertCandidates(batch[c.DeleteN:])
```
Incomplete bounds check: When n reaches BatchN and the batch is processed, the code slices batch[c.DeleteN:] and batch[:c.DeleteN] without verifying that c.DeleteN <= len(batch). If c.DeleteN > c.BatchN due to misconfiguration, this will panic. Add validation in validateOptions() to ensure DeleteN <= BatchN.
There is already a check for this:

```go
if c.BatchN < c.DeleteN {
	return errors.New("files-per-loop must be >= deletions-per-loop")
}
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
```go
}
d.state = scanning
d.mu.Unlock()

absPath := c.abs(path, "")
```
Directory state not cleared on scan errors
When scanDir sets d.state = scanning, any subsequent failure while opening or reading the directory (e.g., os.Open errors at line 52 or stat errors later in the function) returns without resetting the state. Later attempts to scan the same directory will always see ErrBusy, so that subtree is never reconsidered and the cleaner can stall or stop short of the target when a single stat/open error happens.
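One common pattern for this class of bug is to defer the state reset right after setting it, so every error return path restores it. Sketched below with assumed types (`dirState`, `idle`, `ErrBusy`) that mirror the comment above; this is not the PR's actual code:

```go
package cache

import (
	"errors"
	"os"
	"sync"
)

// Assumed types and values mirroring the comment above; not the PR's code.
type dirState int

const (
	idle dirState = iota
	scanning
)

var ErrBusy = errors.New("directory is being scanned")

type Dir struct {
	mu    sync.Mutex
	state dirState
}

// scanDir marks the directory as scanning, then defers the reset so every
// error return (open, readdir, stat) leaves the directory scannable again.
func (d *Dir) scanDir(path string) error {
	d.mu.Lock()
	if d.state == scanning {
		d.mu.Unlock()
		return ErrBusy
	}
	d.state = scanning
	d.mu.Unlock()

	defer func() {
		d.mu.Lock()
		d.state = idle
		d.mu.Unlock()
	}()

	f, err := os.Open(path)
	if err != nil {
		return err // the deferred reset still runs
	}
	defer f.Close()

	_, err = f.Readdirnames(-1)
	return err
}
```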
```go
// file needs to be unlinked before it's returned
f := d.Files[len(d.Files)-1]
d.Files = d.Files[:len(d.Files)-1]

return &f, nil, nil
```
Newest files picked for deletion instead of oldest
Directories are sorted by access time ascending (d.sort), but randomSubdirOrOldestFile removes the last element from d.Files and returns it. Because the last element is the most recently accessed file after the sort, the cleaner will preferentially delete the newest files in each directory, leaving the oldest files untouched until all newer ones are gone, which inverts the intended eviction order.
The comment was not exactly true, but I added tests and refactored to make it cleaner.
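For context on the ordering point Codex raised, here is a minimal illustration of popping the oldest element when a slice is sorted by ascending access time; the `File` type and `popOldest` helper are hypothetical, and the PR's refactor may handle this differently:

```go
package cache

import "time"

// File is an assumed minimal stand-in for the PR's file entry type.
type File struct {
	Name  string
	ATime time.Time
}

// popOldest assumes files is non-empty and sorted by ascending access time,
// so the oldest entry sits at index 0 and is taken from the front rather
// than the back of the slice.
func popOldest(files []File) (File, []File) {
	return files[0], files[1:]
}
```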
```go
SeenFileC atomic.Uint64

DeletedBytes atomic.Uint64
DeletedAges  []time.Duration
```
@djeebus We are not doing much with this; we could just maintain a running average without the memory cost. Or implement a log-scale histogram to keep track of the deleted ages?
(ditto for btime, not currently using it at all)
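As a sketch of the log-scale histogram idea (bucket sizing and names are assumptions, not an agreed design), ages can be bucketed by the power of two of their value in seconds, keeping memory constant no matter how many deletions are recorded:

```go
package cache

import (
	"math/bits"
	"time"
)

// ageHistogram is an illustrative log-scale histogram: bucket i counts ages
// in the range [2^i, 2^(i+1)) seconds, so 40 buckets cover sub-second ages
// up to many centuries at constant memory.
type ageHistogram struct {
	buckets [40]uint64
}

// Observe records one deleted file's age.
func (h *ageHistogram) Observe(age time.Duration) {
	secs := uint64(age / time.Second)
	if secs == 0 {
		secs = 1
	}
	i := bits.Len64(secs) - 1 // floor(log2(secs))
	if i >= len(h.buckets) {
		i = len(h.buckets) - 1
	}
	h.buckets[i]++
}
```

The running-average alternative mentioned above would be even cheaper: keep an atomic sum and count and divide on read.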
Results from a run in my dev environment on a snapshot-restored, identical filesystem:
Old version:
New version with (randomly picked) concurrency settings `"max_concurrent_stat": 32, "max_concurrent_scan": 16, "max_concurrent_delete": 4`: