Skip to content

Conversation

@uniqueni
Copy link
Contributor

@uniqueni uniqueni commented Oct 21, 2025

问题描述

当前的 getBestGidIndex() 函数会选择第一个符合条件的 GID(IPv4-mapped + RoCE v2 或 IB 类型)。但在有多个符合条件的 GID 时,可能会选到不属于本机网络接口的 GID。

实际场景:

GID[3]: ::ffff:26.208.8.250, RoCE v2 (不是本机 IP)
GID[7]: ::ffff:26.208.8.252, RoCE v2 (本机 eth1 的 IP)
当前逻辑会选择 GID[3] 并立即返回,但实际应该选择 GID[7]。

###解决方案
我考虑了两种修复方式:
方式一:从 sysfs 读取网络设备信息(已采用)

方式二:匹配本机 IP 地址

  • 提取 GID 中的 IPv4 地址
  • 通过 getifaddrs() 获取本机所有 IP
  • 选择 IP 匹配的 GID
  • 更直接但需要额外系统调用

为什么选择方式一

  • 与内核子系统对齐,使用与标准工具相同的数据源
  • 更通用,不受 IP 配置变化影响
  • 实现简单,改动最小
  • 性能更好,只需一次文件读取

请求 Review
请帮忙看看这个方案是否有问题,或者有更好的解决方式?

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @uniqueni, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

此拉取请求旨在修复一个错误(issue 946),通过改进 RDMA 上下文中最佳 GID 索引的选择机制来提高系统的鲁棒性。核心改动是引入了一种从系统文件读取网络设备信息的方法,以确保所选的 GID 索引与一个活跃且可用的网络接口相关联,从而避免潜在的连接问题。

Highlights

  • 新增辅助函数 readGidNdev: 引入了一个新的静态辅助函数 readGidNdev,用于从 /sys/class/infiniband 文件系统路径读取与给定 GID 索引关联的网络设备名称。
  • 优化 getBestGidIndex 逻辑: 修改了 RdmaContext::getBestGidIndex 函数,使其在选择最佳 GID 索引时,优先选择那些具有实际网络设备关联的 GID。如果找到符合类型要求且关联了网络设备的 GID,则立即返回;否则,将符合类型要求但无网络设备关联的 GID 作为备选,并继续查找。
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

你好,这个 PR 通过读取 sysfs 来检查 GID 是否关联了网络设备,从而改进了 getBestGidIndex 的逻辑,优先选择有网络接口的 GID。这是一个合理的修复方向。我发现一个小问题,即在没有 GID 关联网络设备时的回退逻辑发生了变化,可能会从选择“第一个”符合条件的 GID 变为“最后一个”。我在代码中留下了具体的建议,希望能帮助你完善这个修复。

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug in getBestGidIndex() where the function would select the first matching GID (IPv4-mapped + RoCE v2 or IB type) without verifying if it belongs to a local network interface. The fix ensures that GIDs associated with actual network devices are preferred.

Key changes:

  • Added readGidNdev() helper function to read network device information from sysfs
  • Modified getBestGidIndex() to prioritize GIDs with associated network devices
  • Implemented fallback logic to retain original behavior when no network device is found

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Collaborator

@alogfans alogfans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, will merge after fixing building problems

@uniqueni
Copy link
Contributor Author

Looks good, will merge after fixing building problems

thank you! It's fixed

Copy link
Collaborator

@ShangmingCai ShangmingCai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uniqueni please fix format, thank you.

@ShangmingCai ShangmingCai merged commit 5116017 into kvcache-ai:main Oct 23, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants