-
Notifications
You must be signed in to change notification settings - Fork 24
feat: discovery of native rdma devices #151
base: main
Are you sure you want to change the base?
Conversation
|
Thanks for the PR, this looks awesome! I'll let @aojea answer some more of the design questions above but I approved the workflow and will review 😄 |
michaelasp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review, thanks for the work done here!
pkg/inventory/db.go
Outdated
| "dra.net/ifName": {StringValue: &rdmaName}, | ||
| // https://github.com/vishvananda/netlink/blob/master/nl/nl_linux.go#L143 | ||
| // This could also be ib, but "infiniband" is more clear | ||
| "dra.net/type": {StringValue: ptr.To("infiniband")}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea to comment on what the type of this device should be. Since they are dra.net specific it should be less important than standard attributes.
Problem: we only see rdma devices associated with a netlink. Solution: do a discovery loop that also lists all rdma devices. Signed-off-by: vsoch <vsoch@users.noreply.github.com>
244fa79 to
da16351
Compare
|
Thank you for the speedy review! I've fixed the small issues above, and we can discuss some of the design details before the next round. @michaelasp @aojea to set your expectations - I'm on a road trip this week, and will be on the road 12-13 hours and likely only briefly at my computer in the late evening. I'm excited to work on this (and discuss details above) so likely I will check my email in case there is discussion. Looking forward to it! |
|
Thanks @vsoch , previous logic listed both rdma and netdevices but I changed that because devices were duplicated, but now I see there are cases they are not. |
+1 on this, also from your comment in the initial PR. The major reason point is to remove the hassle of an end user from having to manually allocate these devices and instead be able to plug and play different device types with the DRA api. I'd like to see an example of this at work, i.e spin up a pod with a resource request and see that the RDMA device gets properly mounted to the pod. Since a lot of that logic depends on NIC devices there may need to be some work done there for proper allocation/release of the device from the host to the pod. |
I arrived after my trip! I likely will still be offline for a bit, but I wanted to follow up on the above. What specifically would you like me to do here? |
Welcome back! Something similar to this IMO but with the RDMA raw devices, https://dranet.dev/docs/quick-start/#how-to-use-it just to validate that it mounts the device to pod properly and moves it back when the pod is deleted. We don't have these devices, @aojea correct me if I'm wrong, so it would be good to have evidence this works as expected. Creation of a |
|
I already did that in my testing environment, Usernetes (see "Here is the entire discovery") are you looking for something to run in CI, documentation, or something else? |
That is the discovery part, after discovery and publishing the resource slice we should be able to claim that resource and pass it to the pod. Without that portion we are advertising devices but without the ability of the driver to actually handle them right? What I'd like to see is the a resource claim made and referenced by a pod which then has the device mounted to it. If we just do the discovery portion, my concern is that we will be showing these devices as valid to claim which they might not be. This would leave any pod that tries to claim the resource to get stuck on creating and probably errors to pop up in the dranet pods. I'd like to see what the current behavior of creating the pods are so we can see if there's any new logic that needs to be written for raw devices to pass them off properly. |
|
Ah gotcha - understood 👍 |
It may just run out of the box but since netlink doesn't expose it I have my hesitations. At minimum we should add a message about this being unimplemented and pass some readable error to the user from the DraNet logs. I wouldn't mind running this myself but I don't have a testbed with devices like those myself to run it 🤣 |
|
@michaelasp I have an equivalent example started - but I have a catch22. The environment I'm working in is user space kubernetes. We can't see any devices without a privileged pod. When we add that, we see everything. This means the selector is irrelevant. I need an environment with production Kubernetes that I can run without rootless. I had Google Cloud credits but they expired, and I'd need to pay out of pocket. @aojea can you help? It shouldn't be too many to test this out. Once I have that, I can add the full analogous example to the docs. |
|
Can we test it against the CI? I don't think googlers have a good way to hand out credits, at a very high level it exists but it's not something done easily. There is however a permanent free tier which can run a very small cluster. |
|
@BenTheElder I have credits now! Does Google Cloud have any Inifiniband devices that would work for this case? I have typically just used Tier 1 networking. |
Doesn't seem like it, from at least one source I've only used Mellanox RDMA links from my experience. |
|
I have an idea - I think this could be scoped into some of our Azure experiments and development (with Infiniband). I need to check our budget and chat with my team on Wednesday update next Wednesday August 6th. I will report back! |
|
@gauravkghildiyal has to modify also the discovery logic to use the physical devices to make it work with autoscaler #178 so let us know if we better schedule a sync meeting to coordinate better the efforts |
|
@aojea that is good thinking. I am not sure I can give direct access to the cluster, but if we did a hackathon I could drive whatever orchestration that @gauravkghildiyal needs. I am in Pacific time now - could you give me the window of hours / days that would work for the meeting (August 6th or after)? I will want to test the AKS setup first and spec our the costs. |
|
@vsoch That sounds good to me. Please continue your testing and we can meet once you've had an opportunity to verify things. In the mean time, I'll also find some time to prepare some changes and think things through. |
|
Some quick updates. Azure doesn't seem to expose any ability to customize features - their default version is 1.32.5 and their newest is 1.33.1, so I don't see a way to enable DynamicResourceAllocation. We can only get to beta (disabled still) in a 2 node cluster HBv3 120rs.
From a quick read, they don't allow customization, but if anyone knows of otherwise, please speak up. I'm reading that DRA is expected to be provided with 1.34 (end of August) so we could at worst wait for that. I could also deploy rootful Usernetes across VM nodes, but the last time I tried Usernetes with rootful it didn't work. That said, it might be worth another shot! I can try again later this week. |
maybe @ritazh or @keithmattix can help with this to better integrate with azure? |
|
AKS itself doesn't support DRA yet since beta APIs aren't on by default. You could use something like Cluster API Provider Azure to build your own Kubernetes cluster and enable DRA on that |
|
@keithmattix will it be turned on for the 1.34 release at the end of August? |
|
Indeed it will! |
|
Thanks @keithmattix. I think the easiest thing would be to wait for that release to continue testing. |
|
I think the key differentiator here is the Native InfiniBand: For a traditional InfiniBand device, the link_layer attribute will be set to "InfiniBand". RoCE: For a RoCE-capable Ethernet adapter, the link_layer will be set to "Ethernet". We are testing with RoCE only but we want to support native too |
stevapple
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Just some typos found in the comments)
|
|
||
| devices := []resourceapi.Device{} | ||
| ifaces, err := nlHandle.LinkList() | ||
| // Device lookup map is used to prevent duplicated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Device lookup map is used to prevent duplicated. | |
| // Device lookup map is used to prevent duplication. |
| device.Name = names.SetDeviceName(ifName) | ||
| // expose the real interface name as an attribute in case it is normalized. | ||
|
|
||
| // expose the real interface name as an attribute in case it is normalized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // expose the real interface name as an attribute in case it is normalized. | |
| // Expose the real interface name as an attribute in case it is normalized. |
| } | ||
| // Resources are published periodically or if there is a netlink notification | ||
| // indicating a new interfaces was added or changed | ||
| // indicating a new interfaces was added or changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // indicating a new interfaces was added or changed. | |
| // indicating a new interface was added or changed. |
|
|
||
| if ignoredInterfaceNames.Has(iface.Attrs().Name) { | ||
| klog.V(4).Infof("iface %s is in the list of ignored interfaces", iface.Attrs().Name) | ||
| // Do not add un-named netowrk device interfaces. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Do not add un-named netowrk device interfaces. | |
| // Do not add unnamed network device interfaces. |
| // Wait for the next event or timeout. | ||
| select { | ||
| // trigger a reconcile | ||
| // Trigger a reconcile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Trigger a reconcile. | |
| // Trigger a reconciliation. |
Note: reconcile may be acceptable since it's an established term.
|
|
||
| ### Development | ||
|
|
||
| To build your own image for testing. Here is an example with a custom registry `ghcr.io/converged-computing`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only necessary when developing on "real" clusters, right? We might clarify that.
A lot of this is developed locally, where the image name is ~arbitrary, because we don't push it to a remote host anyhow.
|
once we get this we have all the necessary to solve the discovery problem of infiniband devices #194 , what is missing is the attachment logic, IIUIC those are char devices and should be added in dranet/pkg/driver/nri_hooks.go Lines 63 to 88 in 3b38ea6
Those are discovered and populated in dranet/pkg/driver/dra_hooks.go Lines 273 to 288 in 3b38ea6
|
|
@vsoch -- As Antonio pointed out in the previous comment, I think we've merged the discovery part for the devices. Now what requires further investigation on actual hardware is the correct way to assign them to the pods. I don't have access to the specific hardware (non-RoCE) |
|
Azure intended to release Kubernetes 1.34 on the 27th, so I can take a look this weekend. I'll investigate in the context of those code snippets and report back! If you want to prepare for me some basic checks or tests that you'd want to see, just post here and I'll make sure to do them. |
|
@aojea @gauravkghildiyal Azure still has not released Kubernetes 1.34 - it' s not an option for me when I try to deploy a cluster. Not sure if their docs are wrong or misleading. The highest version I can select is 1.33.2.
|
|
I will check in on this on Tuesday (after US holiday) and see what I can find out |
|
Thank you @keithmattix much appreciated. Enjoy the holiday! |
|
Current ETA is a couple of weeks! |
|
Thanks for checking @keithmattix - do you think you'd be able to post here when it's out? |
|
Absolutely - will do 👍🏾 |
|
Looks like early november:
|
Yep was just coming here to say this |


Problem: we only see rdma devices associated with a netlink.
Solution: do a discovery loop that also lists all rdma devices.
This pull request will add support to see "raw" rdma devices, meaning those that aren't exposed by netlink. I have confirmation that this works to see our IB devices that don't have IPOIB:
Here is the entire discovery
I'm new to working in this space, so there are a few points I think we should discuss.
Design of Device Discovery
I wanted to break the
Runfunction into separate functions for discovering different device types, which eventually can be useful for testing or using in other contexts. The challenge I ran into is that there doesn't seem to be a strong way to derive uniqueness - we would not want to discover an rdma device via netlink and then again as raw rdma and expose it twice. I thought it would be the PCI address, but then I saw the logic (or edge cases) for missing addresses. This is a current flaw in the design that I don't like - we need a definitive way to get a unique ID we can compare between the two means to discover.This is likely incorrect, but I made an assumption if a pci address is missing it might be missing for both, so I allow netlink devices that are missing an address, but not rdma. Then we only allow a new rdma "raw" device if we didn't see the pci name yet. If this doesn't work (or there is a more solid way to derive uniqueness) let's talk about other strategies for this check. There is also the addition of the normalized name (I didn't normalize when I first deployed and got the RFC error). We would want to make sure that we are using the correct name when we check if it already was found.
Other things I want to check for my changes are:
Question(s) I Have
iface.Attrs().Nameinstead of getting the attributes once and then referencing? If that result can change, my inclination is to get the metadata at one point in time to represent a common state.Let's discuss some of the above - I just tested this with a control plane to see the devices. I'm going to test deployment across worker nodes, and @aojea I'm wondering how this would help us in usernetes where we are using flannel and then bypassing slirp4netns with Infiniband and UCX. If all the nodes have these devices and they are found by UCX does dranet add anything? My sense is that it's going to help with device discovery and scheduling devices to pods but it wouldn't help in our use case? I wanted to check, because it would be nice to better optimize the usernetes network setup. Even with Infiniband bypass, we currently have still a little bit overhead added and I haven't figured out what it is yet.