This repository was archived by the owner on Dec 9, 2025. It is now read-only.

Conversation

@vsoch

@vsoch vsoch commented Jul 5, 2025

Problem: we only see rdma devices associated with a netlink.
Solution: do a discovery loop that also lists all rdma devices.

This pull request adds support for seeing "raw" RDMA devices, meaning those that aren't exposed by netlink. I have confirmation that this works to see our IB devices that don't have IPoIB:

image

Here is the entire discovery:
[sochat1@corona188:usernetes]$ kubectl get resourceslices -o yaml
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1beta2
  kind: ResourceSlice
  metadata:
    creationTimestamp: "2025-07-05T23:05:52Z"
    generateName: u7s-corona188-dra.net-
    generation: 2
    name: u7s-corona188-dra.net-jmbn7
    ownerReferences:
    - apiVersion: v1
      controller: true
      kind: Node
      name: u7s-corona188
      uid: 0af8c760-5b17-468d-9fb0-da9c3df1c490
    resourceVersion: "501"
    uid: 9b5e048e-7636-49b5-aafc-3a8e24976bf5
  spec:
    devices:
    - attributes:
        dra.net/alias:
          string: ""
        dra.net/ebpf:
          bool: false
        dra.net/encapsulation:
          string: ether
        dra.net/ifName:
          string: eno1
        dra.net/ipv4:
          string: 192.168.64.188
        dra.net/mac:
          string: ac:1f:6b:48:8d:9c
        dra.net/mtu:
          int: 1500
        dra.net/numaNode:
          int: 1
        dra.net/pciVendor:
          string: Intel Corporation
        dra.net/rdma:
          bool: false
        dra.net/sriov:
          bool: true
        dra.net/sriovVfs:
          int: 0
        dra.net/state:
          string: up
        dra.net/type:
          string: device
        dra.net/virtual:
          bool: false
        resource.kubernetes.io/pcieRoot:
          string: pci0000:e0
      name: eno1
    - attributes:
        dra.net/alias:
          string: ""
        dra.net/ebpf:
          bool: false
        dra.net/encapsulation:
          string: ether
        dra.net/ifName:
          string: eno2
        dra.net/numaNode:
          int: 1
        dra.net/pciVendor:
          string: Intel Corporation
        dra.net/rdma:
          bool: false
        dra.net/sriov:
          bool: true
        dra.net/sriovVfs:
          int: 0
        dra.net/state:
          string: down
        dra.net/type:
          string: device
        dra.net/virtual:
          bool: false
        resource.kubernetes.io/pcieRoot:
          string: pci0000:e0
      name: eno2
    - attributes:
        dra.net/alias:
          string: ""
        dra.net/ebpf:
          bool: false
        dra.net/encapsulation:
          string: ether
        dra.net/ifName:
          string: enp41s0f3u1u2c2
        dra.net/rdma:
          bool: false
        dra.net/sriov:
          bool: false
        dra.net/state:
          string: unknown
        dra.net/type:
          string: device
        dra.net/virtual:
          bool: false
        resource.kubernetes.io/pcieRoot:
          string: pci0000:20
      name: enp41s0f3u1u2c2
    - attributes:
        dra.net/ifName:
          string: mlx5_0
        dra.net/numaNode:
          int: 1
        dra.net/pciVendor:
          string: Mellanox Technologies
        dra.net/rdma:
          bool: true
        dra.net/type:
          string: infiniband
        dra.net/virtual:
          bool: false
        resource.kubernetes.io/pcieRoot:
          string: pci0000:c0
      name: normalized-nvwhqnk7ga
    driver: dra.net
    nodeName: u7s-corona188
    pool:
      generation: 1
      name: u7s-corona188
      resourceSliceCount: 1
kind: List
metadata:
  resourceVersion: ""

I'm new to working in this space, so there are a few points I think we should discuss.

Design of Device Discovery

I wanted to break the Run function into separate functions for discovering different device types, which could eventually be useful for testing or for use in other contexts. The challenge I ran into is that there doesn't seem to be a strong way to derive uniqueness - we would not want to discover an RDMA device via netlink and then again as raw RDMA and expose it twice. I thought the PCI address could serve as the unique ID, but then I saw the logic (and edge cases) for missing addresses. This is a flaw in the current design that I don't like - we need a definitive way to get a unique ID that we can compare between the two means of discovery.

This is likely incorrect, but I made the assumption that if a PCI address is missing, it might be missing from both discovery paths, so I allow netlink devices that are missing an address, but not RDMA devices. Then we only allow a new "raw" RDMA device if we haven't seen its PCI address yet. If this doesn't work (or there is a more solid way to derive uniqueness), let's talk about other strategies for this check. There is also the addition of the normalized name (I didn't normalize when I first deployed and got the RFC error). We would want to make sure that we are using the correct name when we check whether a device was already found.

Other things I want to check for my changes are:

  • I didn't mess up the notification channel logic
  • Cases of returning an error (that bubbles up, and we don't continue) vs. just logging it. I tried to maintain the current logic but a second pair of eyes would be helpful.
  • The correct logging level (klog) to use for new content

Question(s) I Have

  1. Why are there so many different names? I think this is a source of error. There are the labels, the attribute name, plus the many discovered names.
  2. Should we care whether RDMA devices are on the same NUMA nodes where we have compute running? I realize that a device is probably associated with just one node, and I also realize our typical apps will span, e.g., NUMA ids 0 and 1. I never thought about what that meant practically.
  3. Should the type be "ib" or "infiniband" or something else? I found "infiniband" here, but I've seen "ib" in the wild.
  4. Was there a reason for doing iface.Attrs().Name instead of getting the attributes once and then referencing them? If that result can change, my inclination is to get the metadata at one point in time to represent a common state.

Let's discuss some of the above - I just tested this with a control plane to see the devices. I'm going to test deployment across worker nodes, and @aojea I'm wondering how this would help us in Usernetes, where we are using flannel and then bypassing slirp4netns with InfiniBand and UCX. If all the nodes have these devices and they are found by UCX, does dranet add anything? My sense is that it's going to help with device discovery and with scheduling devices to pods, but it wouldn't help in our use case? I wanted to check, because it would be nice to better optimize the Usernetes network setup. Even with the InfiniBand bypass, we still have a little bit of added overhead, and I haven't figured out where it comes from yet.

@michaelasp
Collaborator

Thanks for the PR, this looks awesome! I'll let @aojea answer some more of the design questions above but I approved the workflow and will review 😄

Collaborator

@michaelasp michaelasp left a comment


Initial review, thanks for the work done here!

"dra.net/ifName": {StringValue: &rdmaName},
// https://github.com/vishvananda/netlink/blob/master/nl/nl_linux.go#L143
// This could also be ib, but "infiniband" is more clear
"dra.net/type": {StringValue: ptr.To("infiniband")},

@aojea to comment on what the type of this device should be. Since they are dra.net specific it should be less important than standard attributes.

Problem: we only see rdma devices associated with a netlink.
Solution: do a discovery loop that also lists all rdma devices.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the add/rdma-raw-devices branch from 244fa79 to da16351 on July 8, 2025 05:04
@vsoch
Author

vsoch commented Jul 8, 2025

Thank you for the speedy review! I've fixed the small issues above, and we can discuss some of the design details before the next round.

@michaelasp @aojea to set your expectations - I'm on a road trip this week, and will be on the road 12-13 hours and likely only briefly at my computer in the late evening. I'm excited to work on this (and discuss details above) so likely I will check my email in case there is discussion. Looking forward to it!

@aojea
Collaborator

aojea commented Jul 8, 2025

Thanks @vsoch - the previous logic listed both RDMA and net devices, but I changed that because devices were duplicated; now I see there are cases where they are not.
This seems to fix the discovery logic in an elegant way, but I wonder how this translates to the attachment logic, since there are a lot of assumptions in the code that everything depends on the NIC interfaces.

@michaelasp
Collaborator

Thanks @vsoch - the previous logic listed both RDMA and net devices, but I changed that because devices were duplicated; now I see there are cases where they are not. This seems to fix the discovery logic in an elegant way, but I wonder how this translates to the attachment logic, since there are a lot of assumptions in the code that everything depends on the NIC interfaces.

+1 on this, also from your comment in the initial PR. The major point is to remove the hassle of an end user having to manually allocate these devices, and instead be able to plug and play different device types with the DRA API.

I'd like to see an example of this at work, i.e. spin up a pod with a resource request and see that the RDMA device gets properly mounted to the pod. Since a lot of that logic depends on NIC devices, there may need to be some work done there for proper allocation/release of the device from the host to the pod.

@vsoch
Author

vsoch commented Jul 12, 2025

I'd like to see an example of this at work, i.e spin up a pod with a resource request and see that the RDMA device gets properly mounted to the pod. Since a lot of that logic depends on NIC devices there may need to be some work done there for proper allocation/release of the device from the host to the pod.

I arrived after my trip! I likely will still be offline for a bit, but I wanted to follow up on the above. What specifically would you like me to do here?

@michaelasp
Collaborator

I arrived after my trip! I likely will still be offline for a bit, but I wanted to follow up on the above. What specifically would you like me to do here?

Welcome back! Something similar to this IMO, but with the raw RDMA devices: https://dranet.dev/docs/quick-start/#how-to-use-it - just to validate that it mounts the device to the pod properly and moves it back when the pod is deleted. We don't have these devices (@aojea correct me if I'm wrong), so it would be good to have evidence this works as expected. Creating a DeviceClass, ResourceClaim, and Pod should be enough to validate the functionality.
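
A hedged sketch of the three objects involved, modeled on the dranet quick start. The names, the CEL selector, and the exact resource.k8s.io schema (shown here as v1beta2, matching the ResourceSlice dump above) are assumptions, untested on real hardware:

```yaml
# Hypothetical DeviceClass selecting dranet-published RDMA devices.
apiVersion: resource.k8s.io/v1beta2
kind: DeviceClass
metadata:
  name: rdma-native
spec:
  selectors:
  - cel:
      expression: device.driver == "dra.net" && device.attributes["dra.net"].rdma
---
# A claim requesting one device of that class.
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaim
metadata:
  name: rdma-claim
spec:
  devices:
    requests:
    - name: rdma
      exactly:
        deviceClassName: rdma-native
---
# A pod referencing the claim; success would be seeing the char devices
# under /dev/infiniband inside the container.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test
spec:
  containers:
  - name: test
    image: busybox
    command: ["sh", "-c", "ls -l /dev/infiniband && sleep infinity"]
    resources:
      claims:
      - name: rdma
  resourceClaims:
  - name: rdma
    resourceClaimName: rdma-claim
```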

@vsoch
Author

vsoch commented Jul 14, 2025

I already did that in my testing environment, Usernetes (see "Here is the entire discovery"). Are you looking for something to run in CI, documentation, or something else?

@michaelasp
Collaborator

I already did that in my testing environment, Usernetes (see "Here is the entire discovery") are you looking for something to run in CI, documentation, or something else?

That is the discovery part; after discovery and publishing the resource slice, we should be able to claim that resource and pass it to the pod. Without that portion we are advertising devices without the ability of the driver to actually handle them, right? What I'd like to see is a resource claim made and referenced by a pod, which then has the device mounted to it.

If we just do the discovery portion, my concern is that we will be showing these devices as valid to claim when they might not be. This would leave any pod that tries to claim the resource stuck on creation, and probably cause errors to pop up in the dranet pods. I'd like to see what the current behavior of creating the pods is, so we can see if there's any new logic that needs to be written for raw devices to hand them off properly.

@vsoch
Author

vsoch commented Jul 14, 2025

Ah gotcha - understood 👍

@michaelasp
Collaborator

Ah gotcha - understood 👍

It may just run out of the box but since netlink doesn't expose it I have my hesitations.

At minimum we should add a message about this being unimplemented and pass some readable error to the user from the DraNet logs. I wouldn't mind running this myself, but I don't have a testbed with devices like those to run it 🤣

@vsoch
Author

vsoch commented Jul 16, 2025

@michaelasp I have an equivalent example started - but I have a catch-22. The environment I'm working in is user-space Kubernetes. We can't see any devices without a privileged pod, and when we add that, we see everything - which means the selector is irrelevant. I need an environment with production Kubernetes that I can run without rootless. I had Google Cloud credits but they expired, and I'd need to pay out of pocket. @aojea can you help? It shouldn't take many credits to test this out. Once I have that, I can add the full analogous example to the docs.

@BenTheElder
Collaborator

Can we test it against the CI?

I don't think googlers have a good way to hand out credits, at a very high level it exists but it's not something done easily.

There is however a permanent free tier which can run a very small cluster.
https://cloud.google.com/free/docs/free-cloud-features#free-tier

@vsoch
Author

vsoch commented Jul 28, 2025

@BenTheElder I have credits now!

Does Google Cloud have any InfiniBand devices that would work for this case? I have typically just used Tier 1 networking.

@michaelasp
Collaborator

@BenTheElder I have credits now!

Does Google Cloud have any InfiniBand devices that would work for this case? I have typically just used Tier 1 networking.

Doesn't seem like it, from at least one source

Google Cloud does not currently support InfiniBand. Instead, Google Cloud relies on its high-performance network stack, including custom Titanium networking infrastructure and GPU-optimized instances like the A3 VMs, which use NVIDIA's H100 GPUs with NVLink and NVSwitch. Google Cloud emphasizes its proprietary networking solutions for AI and HPC workloads rather than InfiniBand.

I've only used Mellanox RDMA links in my experience.

@vsoch
Author

vsoch commented Jul 29, 2025

I have an idea - I think this could be scoped into some of our Azure experiments and development (with InfiniBand). I need to check our budget and chat with my team on Wednesday, and I'll post an update by next Wednesday, August 6th. I will report back!

@aojea
Collaborator

aojea commented Jul 29, 2025

@gauravkghildiyal also has to modify the discovery logic to use the physical devices to make it work with the autoscaler (#178), so let us know if we should schedule a sync meeting to better coordinate the efforts

@vsoch
Author

vsoch commented Jul 29, 2025

@aojea that is good thinking. I am not sure I can give direct access to the cluster, but if we did a hackathon I could drive whatever orchestration @gauravkghildiyal needs. I am in Pacific time now - could you give me a window of hours/days that would work for the meeting (August 6th or after)? I will want to test the AKS setup first and spec out the costs.

@gauravkghildiyal
Member

@vsoch That sounds good to me. Please continue your testing and we can meet once you've had an opportunity to verify things. In the meantime, I'll also find some time to prepare some changes and think things through.

@vsoch
Author

vsoch commented Jul 30, 2025

Some quick updates: Azure doesn't seem to expose any ability to customize feature gates - their default version is 1.32.5 and their newest is 1.33.1, so I don't see a way to enable DynamicResourceAllocation. We can only get to beta (still disabled) in a 2-node cluster of HBv3 120rs instances.

image

From a quick read, they don't allow customization, but if anyone knows otherwise, please speak up. I'm reading that DRA is expected to be available with 1.34 (end of August), so at worst we could wait for that. I could also deploy rootful Usernetes across VM nodes, but the last time I tried Usernetes rootful it didn't work. That said, it might be worth another shot! I can try again later this week.

@aojea
Collaborator

aojea commented Aug 1, 2025

Some quick updates. Azure doesn't seem to expose any ability to customize features - their default version is 1.32.5 and their newest is 1.33.1, so I don't see a way to enable DynamicResourceAllocation. We can only get to beta (disabled still) in a 2 node cluster HBv3 120rs.

maybe @ritazh or @keithmattix can help with this to better integrate with azure?

@keithmattix

AKS itself doesn't support DRA yet, since beta APIs aren't on by default. You could use something like Cluster API Provider Azure to build your own Kubernetes cluster and enable DRA on that.

@vsoch
Author

vsoch commented Aug 1, 2025

@keithmattix will it be turned on for the 1.34 release at the end of August?

@keithmattix

Indeed it will!

@vsoch
Author

vsoch commented Aug 3, 2025

Thanks @keithmattix. I think the easiest thing would be to wait for that release to continue testing.

@aojea aojea changed the title from "feat: discovery of rdma raw devices" to "feat: discovery of native rdma devices" on Aug 16, 2025
@aojea
Collaborator

aojea commented Aug 16, 2025

I think the key differentiator here is the link_layer attribute

Native InfiniBand: For a traditional InfiniBand device, the link_layer attribute will be set to "InfiniBand".

RoCE: For a RoCE-capable Ethernet adapter, the link_layer will be set to "Ethernet".

We are testing with RoCE only, but we want to support native InfiniBand too


@stevapple stevapple left a comment


Great work! Just some typos found in the comments.


devices := []resourceapi.Device{}
ifaces, err := nlHandle.LinkList()
// Device lookup map is used to prevent duplicated.


Suggested change
// Device lookup map is used to prevent duplicated.
// Device lookup map is used to prevent duplication.

device.Name = names.SetDeviceName(ifName)
// expose the real interface name as an attribute in case it is normalized.

// expose the real interface name as an attribute in case it is normalized.


Suggested change
// expose the real interface name as an attribute in case it is normalized.
// Expose the real interface name as an attribute in case it is normalized.

}
// Resources are published periodically or if there is a netlink notification
// indicating a new interfaces was added or changed
// indicating a new interfaces was added or changed.


Suggested change
// indicating a new interfaces was added or changed.
// indicating a new interface was added or changed.


if ignoredInterfaceNames.Has(iface.Attrs().Name) {
klog.V(4).Infof("iface %s is in the list of ignored interfaces", iface.Attrs().Name)
// Do not add un-named netowrk device interfaces.


Suggested change
// Do not add un-named netowrk device interfaces.
// Do not add unnamed network device interfaces.

// Wait for the next event or timeout.
select {
// trigger a reconcile
// Trigger a reconcile.


Suggested change
// Trigger a reconcile.
// Trigger a reconciliation.

Note: reconcile may be acceptable since it's an established term.


### Development

To build your own image for testing, here is an example with a custom registry `ghcr.io/converged-computing`:
Collaborator


This is only necessary when developing on "real" clusters, right? We might clarify that.

A lot of this is developed locally, where the image name is ~arbitrary, because we don't push it to a remote host anyhow.

@aojea
Collaborator

aojea commented Aug 26, 2025

Once we get this, we have everything necessary to solve the discovery problem for InfiniBand devices (#194). What is missing is the attachment logic; IIUIC those are char devices and should be added in:

func (np *NetworkDriver) CreateContainer(_ context.Context, pod *api.PodSandbox, ctr *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
	klog.V(2).Infof("CreateContainer Pod %s/%s UID %s Container %s", pod.Namespace, pod.Name, pod.Uid, ctr.Name)
	podConfig, ok := np.podConfigStore.GetPodConfigs(types.UID(pod.GetUid()))
	if !ok {
		return nil, nil, nil
	}
	// Containers only care about the RDMA char devices.
	devPaths := set.Set[string]{}
	adjust := &api.ContainerAdjustment{}
	for _, config := range podConfig {
		for _, dev := range config.RDMADevice.DevChars {
			// do not insert the same path multiple times
			if devPaths.Has(dev.Path) {
				continue
			}
			devPaths.Insert(dev.Path)
			// TODO check the file permissions and uid and gid fields
			adjust.AddDevice(&api.LinuxDevice{
				Path:  dev.Path,
				Type:  dev.Type,
				Major: dev.Major,
				Minor: dev.Minor,
			})
		}
	}
	return adjust, nil, nil
}

Those are discovered and populated in

// Get RDMA configuration: link and char devices
if rdmaDev, _ := rdmamap.GetRdmaDeviceForNetdevice(ifName); rdmaDev != "" {
	klog.V(2).Infof("RunPodSandbox processing RDMA device: %s", rdmaDev)
	podCfg.RDMADevice.LinkDev = rdmaDev
	// Obtain the char devices associated to the rdma device
	charDevices.Insert(rdmaCmPath)
	charDevices.Insert(rdmamap.GetRdmaCharDevices(rdmaDev)...)
	for _, devpath := range charDevices.UnsortedList() {
		dev, err := GetDeviceInfo(devpath)
		if err != nil {
			klog.Infof("fail to get device info for %s : %v", devpath, err)
		} else {
			podCfg.RDMADevice.DevChars = append(podCfg.RDMADevice.DevChars, dev)
		}
	}
}

@gauravkghildiyal
Member

@vsoch -- As Antonio pointed out in the previous comment, I think we've merged the discovery part for the devices. Now what requires further investigation on actual hardware is the correct way to assign them to the pods. I don't have access to the specific hardware (non-RoCE).

@vsoch
Author

vsoch commented Aug 30, 2025

Azure intended to release Kubernetes 1.34 on the 27th, so I can take a look this weekend. I'll investigate in the context of those code snippets and report back! If you want to prepare some basic checks or tests that you'd like to see, just post here and I'll make sure to do them.

@vsoch
Author

vsoch commented Aug 31, 2025

@aojea @gauravkghildiyal Azure still has not released Kubernetes 1.34 - it's not an option for me when I try to deploy a cluster. Not sure if their docs are wrong or misleading. The highest version I can select is 1.33.2.

image

@keithmattix

I will check in on this on Tuesday (after US holiday) and see what I can find out

@vsoch
Author

vsoch commented Aug 31, 2025

Thank you @keithmattix much appreciated. Enjoy the holiday!

@keithmattix

Current ETA is a couple of weeks!

@vsoch
Author

vsoch commented Sep 2, 2025

Thanks for checking @keithmattix - do you think you'd be able to post here when it's out?

@keithmattix

Absolutely - will do 👍🏾

@BenTheElder
Collaborator

Looks like early November:
https://github.com/Azure/AKS/releases/tag/2025-10-12

AKS Version 1.34 Preview is being rolled out to multiple regions and is expected to complete by early November.

@keithmattix

Looks like early november: https://github.com/Azure/AKS/releases/tag/2025-10-12

AKS Version 1.34 Preview is being rolled out to multiple regions and is expected to complete by early November.

Yep was just coming here to say this
