Feat: add kmesh-daemon upgrades traffic without disruption #1503
Summary of Changes
Hello @072020127, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant feature: seamless upgrades of the kmesh-daemon that do not disrupt traffic. It achieves this by persisting BPF map specifications, comparing them when the daemon restarts, and migrating map data to the new versions where the specifications differ. This ensures that ongoing network traffic is not interrupted when the daemon is updated.
Highlights
- Persistent BPF MapSpecs: Implemented logic for storing and reading persistent BPF MapSpecs, which are crucial for maintaining state across kmesh-daemon upgrades.
- MapSpec Comparison Logic: Added retrieval of the current MapSpec and implemented robust comparison logic for BPF Map structures to detect changes between versions.
- Seamless Daemon Upgrades: Introduced an 'Update' mode for kmesh-daemon, allowing BPF programs and maps to be updated without traffic disruption by migrating existing map data.
- Comprehensive Testing: Included new unit tests for MapSpec loading and comparison, as well as an end-to-end test to validate the traffic-undisrupted upgrade process.
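As a rough illustration of the "persistent MapSpecs" idea in the highlights above, the sketch below serializes a snapshot of map shapes to JSON and reads it back. It is not the PR's implementation: the trimmed `PersistedMapSpec` fields, the `Snapshot` type, and the `EncodeSnapshot`, `DecodeSnapshot`, `SaveSnapshot`, and `LoadSnapshot` helpers are assumptions made for this example, and JSON is only one possible on-disk format.

```go
package main

import (
	"encoding/json"
	"os"
)

// PersistedMapSpec is a trimmed, hypothetical on-disk record of a BPF map's
// shape. The struct in the PR also carries key/value struct layout details.
type PersistedMapSpec struct {
	Type       string `json:"type"`
	KeySize    uint32 `json:"key_size"`
	ValueSize  uint32 `json:"value_size"`
	MaxEntries uint32 `json:"max_entries"`
}

// Snapshot maps package name -> map name -> persisted spec.
type Snapshot map[string]map[string]PersistedMapSpec

// EncodeSnapshot renders the snapshot as indented JSON.
func EncodeSnapshot(snap Snapshot) ([]byte, error) {
	return json.MarshalIndent(snap, "", "  ")
}

// DecodeSnapshot parses JSON produced by EncodeSnapshot.
func DecodeSnapshot(data []byte) (Snapshot, error) {
	var snap Snapshot
	if err := json.Unmarshal(data, &snap); err != nil {
		return nil, err
	}
	return snap, nil
}

// SaveSnapshot writes to a temporary file and renames it into place, so a
// crash in the middle of the write does not leave a truncated snapshot.
func SaveSnapshot(path string, snap Snapshot) error {
	data, err := EncodeSnapshot(snap)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// LoadSnapshot reads a snapshot previously written by SaveSnapshot.
func LoadSnapshot(path string) (Snapshot, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	return DecodeSnapshot(data)
}
```

On restart, the daemon would load the previous snapshot, compare it against the specs embedded in the new binary, and then save a fresh snapshot for the next upgrade.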
Code Review
This pull request introduces a significant feature for non-disruptive upgrades of the kmesh-daemon. The core logic involves persisting BPF map specifications (MapSpecs), comparing them across versions, and migrating maps when their structure changes. The implementation adds new files for handling BPF updates and loading MapSpecs, along with extensive unit and e2e tests.
My review focuses on the new update logic. I've found a few critical issues related to error handling that could lead to panics, and a bug where changes in struct bitfield sizes are not detected, potentially causing data corruption. I've also identified areas with unhandled errors and code duplication that should be addressed to improve robustness and maintainability. The suggestions provided aim to fix these issues and refactor the code for better clarity.
pkg/bpf/restart/bpf_update.go
Outdated
| if map_old.Offset != uint32(map_new.Offset) { | ||
| log.Info(map_old.BitfieldSize, uint32(map_new.BitfieldSize)) | ||
| diff.OffsetChanged = true | ||
| break | ||
| } |
The BitfieldSize of a struct member is not being compared. This is a critical bug, as a change in bitfield size alters the struct layout and should trigger a map migration to avoid data corruption. The existing log.Info call should be replaced with a proper comparison.
if map_old.Offset != uint32(map_new.Offset) || map_old.BitfieldSize != uint32(map_new.BitfieldSize) {
	diff.OffsetChanged = true
	break
}
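To show why offsets alone are not enough, here is a small self-contained sketch. It uses a hypothetical `MemberInfo` record and `LayoutChanged` helper rather than the PR's types: two adjacent bitfields can keep their starting offsets while one of them changes width, and only the `BitfieldSize` comparison catches that.

```go
package main

// MemberInfo is a hypothetical, simplified description of one struct member.
type MemberInfo struct {
	Name         string
	Offset       uint32 // bit offset from the start of the struct
	BitfieldSize uint32 // width in bits; 0 for a regular (non-bitfield) member
}

// LayoutChanged reports whether any member present in both versions moved or
// changed its bitfield width. Added and removed members are assumed to be
// detected by a separate check.
func LayoutChanged(oldMembers, newMembers []MemberInfo) bool {
	newByName := make(map[string]MemberInfo, len(newMembers))
	for _, m := range newMembers {
		newByName[m.Name] = m
	}
	for _, o := range oldMembers {
		n, ok := newByName[o.Name]
		if !ok {
			continue // removal is handled elsewhere
		}
		if o.Offset != n.Offset || o.BitfieldSize != n.BitfieldSize {
			return true
		}
	}
	return false
}
```

For example, narrowing a field `b` from 5 bits to 4 bits leaves its offset unchanged, yet old values written with the 5-bit layout would be misread under the new one.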
pkg/bpf/bpf.go
Outdated
| mapspec, _ := restart.LoadCompileTimeSpecs(config) | ||
| restart.SnapshotSpecsByPkg(mapspec) |
The error returned by restart.LoadCompileTimeSpecs is ignored. This could lead to restart.SnapshotSpecsByPkg being called with a nil mapspec, which might cause a panic. The error should be checked and handled to prevent this.
- mapspec, _ := restart.LoadCompileTimeSpecs(config)
- restart.SnapshotSpecsByPkg(mapspec)
+ mapspec, err := restart.LoadCompileTimeSpecs(config)
+ if err != nil {
+ 	log.Errorf("failed to load compile time specs: %v", err)
+ 	return nil
+ }
+ restart.SnapshotSpecsByPkg(mapspec)
pkg/bpf/restart/bpf_update.go
Outdated
| specsByPkg, err := LoadCompileTimeSpecs(config) | ||
| if specsByPkg == nil { | ||
| log.Errorf("load oldSpecsByPkg failed") | ||
| return nil | ||
| } |
The error returned by LoadCompileTimeSpecs is not handled. If an error occurs, specsByPkg could be nil and the function might proceed, potentially causing a panic later. It's crucial to check and handle this error. Also, the log message on line 68 refers to oldSpecsByPkg which seems to be a typo.
- specsByPkg, err := LoadCompileTimeSpecs(config)
- if specsByPkg == nil {
- 	log.Errorf("load oldSpecsByPkg failed")
- 	return nil
- }
+ specsByPkg, err := LoadCompileTimeSpecs(config)
+ if err != nil {
+ 	log.Errorf("load compile-time map specs failed: %v", err)
+ 	return nil
+ }
| func migrateMap( | ||
| oldMapSpec *PersistedMapSpec, | ||
| newMapSpec *ebpf.MapSpec, | ||
| pkgName, mapName, pinPath string, | ||
| ) (*ebpf.Map, error) { | ||
| if oldMapSpec == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| if oldMapSpec.Type != newMapSpec.Type.String() || | ||
| oldMapSpec.KeySize != newMapSpec.KeySize || | ||
| oldMapSpec.ValueSize != newMapSpec.ValueSize || | ||
| oldMapSpec.MaxEntries != newMapSpec.MaxEntries { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
|
|
||
| oldKeyStruct := oldMapSpec.KeyInfo | ||
| newKeyStruct, okNewKey := newMapSpec.Key.(*btf.Struct) | ||
| if okNewKey { | ||
| diff := diffStructInfoAgainstBTF(oldKeyStruct, newKeyStruct, make(map[string]bool)) | ||
| if diff.Added || diff.Removed || diff.TypeChanged || | ||
| diff.OffsetChanged || diff.NestedChanged { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } else { | ||
| if newMapSpec.Key == nil { | ||
| if len(oldKeyStruct.Members) != 0 || oldKeyStruct.Name != "" { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } else { | ||
| newKeyTypeName := newMapSpec.Key.TypeName() | ||
| if len(oldKeyStruct.Members) != 0 || newKeyTypeName != oldKeyStruct.Name { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| oldValueStruct := oldMapSpec.ValueInfo | ||
| newValueStruct, okNewValue := newMapSpec.Value.(*btf.Struct) | ||
| if okNewValue { | ||
| diff := diffStructInfoAgainstBTF(oldValueStruct, newValueStruct, make(map[string]bool)) | ||
| if diff.Added || diff.Removed || diff.TypeChanged || | ||
| diff.OffsetChanged || diff.NestedChanged { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| log.Info(diff.Added ,diff.Removed,diff.TypeChanged , | ||
| diff.OffsetChanged,diff.NestedChanged) | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } else { | ||
| if newMapSpec.Value == nil { | ||
| if len(oldValueStruct.Members) != 0 || oldValueStruct.Name != "" { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } else { | ||
| newValTypeName := newMapSpec.Value.TypeName() | ||
| if len(oldValueStruct.Members) != 0 || newValTypeName != oldValueStruct.Name { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } | ||
| } | ||
| log.Info("pass all check") | ||
| return nil, nil | ||
| } |
The migrateMap function has high complexity and significant code duplication. The logic to decide whether to recreate a map is repeated in multiple branches. This makes the code hard to read and maintain. Consider refactoring this function to extract the map recreation logic into a helper function and simplify the conditional checks.
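One possible shape for that refactor is sketched below. It is only an illustration under stated assumptions: `MapShape`, `pinnedMap`, `errNotPinned`, `needsRecreate`, and `recreateMap` are hypothetical stand-ins so the example runs without the cilium/ebpf package or kernel access, and the `StructDiff` flags simply mirror the names in the snippet above. The idea is to gather every "recreate" condition into one predicate and to keep the repeated load-then-create branch in a single helper.

```go
package main

import "errors"

// MapShape is a hypothetical stand-in for the scalar fields migrateMap compares.
type MapShape struct {
	Type       string
	KeySize    uint32
	ValueSize  uint32
	MaxEntries uint32
}

// StructDiff mirrors the flag names used in the snippet under review.
type StructDiff struct {
	Added, Removed, TypeChanged, OffsetChanged, NestedChanged bool
}

// Changed reports whether any layout flag is set.
func (d StructDiff) Changed() bool {
	return d.Added || d.Removed || d.TypeChanged || d.OffsetChanged || d.NestedChanged
}

// needsRecreate gathers every "recreate the map" condition into one predicate.
func needsRecreate(oldShape, newShape MapShape, keyDiff, valueDiff StructDiff) bool {
	return oldShape != newShape || keyDiff.Changed() || valueDiff.Changed()
}

// pinnedMap is a placeholder for *ebpf.Map so the sketch needs no kernel access.
type pinnedMap struct{ name string }

// errNotPinned is a hypothetical error returned when no pinned map exists.
var errNotPinned = errors.New("map is not pinned")

// recreateMap replaces the repeated load-then-create branches: it tries to
// load the old pinned map and hands it (or nil, if loading failed) to create.
func recreateMap(
	load func() (*pinnedMap, error),
	create func(old *pinnedMap) (*pinnedMap, error),
) (*pinnedMap, error) {
	old, err := load()
	if err != nil {
		old = nil // nothing to migrate from; create an empty map
	}
	return create(old)
}
```

With helpers like these, the body of `migrateMap` could reduce to computing the key and value diffs once, then making a single `needsRecreate` check followed by one `recreateMap` call.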
Codecov Report
❌ Your patch check has failed because the patch coverage (0.43%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
... and 2 files with indirect coverage changes.
I am concerned that this is a very big change that could introduce regressions. Can you please add a feature flag, so that we can keep it off until we have enough tests and documentation?
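A minimal sketch of such a gate, assuming an environment variable is acceptable: the variable name `KMESH_ENABLE_MAP_MIGRATION` and both helper functions are hypothetical, and the project may prefer a command-line option instead. The feature stays off unless the variable is explicitly set to a true value.

```go
package main

import (
	"os"
	"strconv"
)

// migrationFlagEnv is a hypothetical variable name chosen for this sketch.
const migrationFlagEnv = "KMESH_ENABLE_MAP_MIGRATION"

// migrationEnabled reports whether the upgrade path is switched on. The
// lookup function is injected so the logic can be tested without touching
// the process environment. Unset or unparsable values mean "off".
func migrationEnabled(lookup func(string) (string, bool)) bool {
	raw, ok := lookup(migrationFlagEnv)
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(raw)
	return err == nil && enabled
}

// migrationEnabledFromEnv reads the flag from the real environment.
func migrationEnabledFromEnv() bool {
	return migrationEnabled(os.LookupEnv)
}
```

The daemon would consult `migrationEnabledFromEnv()` once at startup and fall back to the existing restart behavior when it returns false.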
|
|
||
| // LoadCompileTimeSpecs loads all compile-time MapSpecs from bpf2go-generated packages. | ||
| // Returns a nested map: packageName to mapName to *ebpf.MapSpec | ||
| func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) { |
What do you mean by CompileTime?
My intention is to convey that these specs are fixed when the binary is compiled (via bpf2go), as opposed to the specs loaded from the persisted on-disk snapshot. Maybe LoadEmbeddedSpecs would reflect its role more clearly?
| } else { | ||
| specs["KmeshCgroupSockWorkload"] = coll.Maps | ||
| } | ||
| if coll, err := dualengine.LoadKmeshCgroupSockWorkloadCompat(); err != nil { |
We should call only one of LoadKmeshCgroupSockWorkload or LoadKmeshCgroupSockWorkloadCompat, not both.
Please check this pattern:
if helper.KernelVersionLowerThan5_13() {
	spec, err = bpf2go.LoadKmeshSockopsCompat()
} else {
	spec, err = bpf2go.LoadKmeshSockops()
}
|
|
||
| // LoadCompileTimeSpecs loads all compile-time MapSpecs from bpf2go-generated packages. | ||
| // Returns a nested map: packageName to mapName to *ebpf.MapSpec | ||
| func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) { |
I am wondering how we can make sure that no map is missed, even when a new map is added later.
I wrote a Go file, gen-bpf_stpecs.go, in the/back directory, which can automatically generate LoadCompileTimeSpecs by analyzing the bpf2Go.go file. If new maps or progs are added in the future, it can easily be kept in sync.
| func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) { | ||
| specs := make(map[string]map[string]*ebpf.MapSpec) | ||
|
|
||
| if config.KernelNativeEnabled() { |
With the //go:build enhanced tag, this is for kernel-native mode, so we do not need the else branch below.
pkg/bpf/bpf.go
Outdated
| return nil | ||
| } | ||
|
|
||
| mapspec, err := restart.LoadCompileTimeSpecs(config) |
- mapspec, err := restart.LoadCompileTimeSpecs(config)
+ mapSpecs, err := restart.LoadCompileTimeSpecs(config)
| // It will migrate any BPF maps whose on‑disk pin already exists but whose | ||
| // compiled MapSpec has changed | ||
| func UpdateMapHandler(versionMap *ebpf.Map, kmBpfPath string, config *options.BpfConfig) *ebpf.Map { | ||
| persistedSpecs, err := LoadPersistedSnapshot() |
persistedSpecs can be nil, which would cause a panic later. Please check all such cases.
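A hedged sketch of one way to centralize that check, assuming the snapshot is a nested Go map of pointers (the actual return type of `LoadPersistedSnapshot` is not shown in this thread, and `lookupSpec` and the trimmed `PersistedMapSpec` are hypothetical). In Go, reading from a nil map is safe and yields the zero value; the panic risk comes from writing to a nil map or dereferencing a nil pointer entry, so a small lookup helper can guard every caller.

```go
package main

// PersistedMapSpec is a trimmed, hypothetical version of the persisted record.
type PersistedMapSpec struct {
	Type       string
	KeySize    uint32
	ValueSize  uint32
	MaxEntries uint32
}

// lookupSpec returns the persisted spec for pkg/name. It is safe to call with
// a nil snapshot or a missing package, because indexing a nil map in Go
// returns the zero value. The boolean is false when there is no usable spec,
// including the case where the entry exists but holds a nil pointer.
func lookupSpec(snap map[string]map[string]*PersistedMapSpec, pkg, name string) (*PersistedMapSpec, bool) {
	spec := snap[pkg][name]
	return spec, spec != nil
}
```

Callers then branch on the boolean instead of dereferencing the result directly.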
pkg/bpf/restart/bpf_update.go
Outdated
| log.Warnf("failed to remove old map pinpath: %v (continuing)", err) | ||
| } | ||
| case hasNew && !hasOld: | ||
| if _, err := migrateMap(nil, newSpec, pkgName, mapName, pinPath); err != nil { |
I would prefer calling the map create function directly here.
| return versionMap | ||
| } | ||
|
|
||
| func updateVersionInfo(versionMap *ebpf.Map) { |
We can share the existing storeVersionInfo in pkg/bpf/bpf.go.
I need to import package restart in pkg/bpf/bpf.go. If I call the storeVersionInfo defined in pkg/bpf/bpf.go from package restart, it will result in an import cycle.
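Go rejects import cycles at compile time, so sharing the function needs some indirection. One common option is to pass the function in from the caller; another is to move it into a third package that both can import. The sketch below shows the first option in a single file. The `VersionStorer` type and `finishUpdate` function are hypothetical, and the real signature of storeVersionInfo is not shown in this thread, so the `uint32` parameter is an assumption.

```go
package main

import "fmt"

// VersionStorer is a hypothetical function type. In a real layout it would be
// declared in package restart, which must not import package bpf.
type VersionStorer func(version uint32) error

// finishUpdate stands in for code in package restart that needs to record the
// version. It calls the injected function instead of importing package bpf.
func finishUpdate(store VersionStorer, version uint32) error {
	if store == nil {
		return fmt.Errorf("no version storer provided")
	}
	if err := store(version); err != nil {
		return fmt.Errorf("store version info: %w", err)
	}
	return nil
}
```

Package bpf, which already imports restart, would pass its own function as the `store` argument, so the dependency keeps pointing in one direction.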
test/e2e/upgarde_test.go
Outdated
| } | ||
|
|
||
| // patchKmesh applies a strategic merge patch to the Kmesh DaemonSet and waits for rollout completion. | ||
| func patchKmesh_upgarde(t framework.TestContext, patchData string) { |
Typo: this should be upgrade.
| func upgradeKmesh(t framework.TestContext) { | ||
| newImage := os.Getenv("KMESH_UPGRADE_IMAGE") | ||
| if newImage == "" { | ||
| newImage = "localhost:5000/kmesh" |
Isn't this the same image as the one used by the old DaemonSet?
I set KMESH_UPGRADE_IMAGE to a new image, which is built by set_daemonupgarde_testcase_image() in run_test.sh. However, the E2E tests for Istio 1.23~1.25 report "no space left on device". I think the reason is that two images now exist: localhost:5000/kmesh and localhost:5000/kmesh/test-upgrade-map-change.
a21a29c to aee4427
ba67a19 to 5ffbc87
We updated the e2e configuration to fix these errors. Please rebase and then resubmit the code.
Signed-off-by: 072020127 <mhy200253@gmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #1409
Special notes for your reviewer:
Does this PR introduce a user-facing change?: