Skip to content

Conversation

alimaazamat
Copy link
Contributor

@alimaazamat alimaazamat commented Aug 4, 2025

fixes #115

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 4, 2025
@k8s-ci-robot k8s-ci-robot requested review from elezar and nojnhuh August 4, 2025 21:56
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 4, 2025
@alimaazamat alimaazamat changed the title initial commit: all issue description instructions Updated DRA APIs to v1beta2 Aug 4, 2025
@alimaazamat alimaazamat changed the title Updated DRA APIs to v1beta2 WIP: Updated DRA APIs to v1beta2 Aug 15, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 15, 2025
@alimaazamat alimaazamat changed the title WIP: Updated DRA APIs to v1beta2 [WIP] Updated DRA APIs to v1beta2 Aug 15, 2025
@alimaazamat
Copy link
Contributor Author

/assign

@pohly pohly moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Aug 19, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 19, 2025
@klueska
Copy link
Contributor

klueska commented Aug 19, 2025

Please see the following for how to gt the DRA driver to work with all of v1, v1alpha1, and v1alpha2: #115 (comment)

@alimaazamat alimaazamat changed the title [WIP] Updated DRA APIs to v1beta2 [WIP] Updated DRA APIs to v1 Aug 19, 2025
@alimaazamat alimaazamat changed the title [WIP] Updated DRA APIs to v1 Updated DRA APIs to v1 Aug 19, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 19, 2025
@pohly pohly moved this from 🏗 In progress to 👀 In review in Dynamic Resource Allocation Aug 20, 2025
@lauralorenz
Copy link

I am using the example driver in some kubernetes documentation, and want to queue up a PR that references the tag for the image that will be released with this (in kubernetes/website#51979). Will the new image released be registry.k8s.io/dra-example-driver/dra-example-driver:v0.2.0?

Also for what it's worth, I did run through that tutorial with a local build of the example driver from this branch and it worked great so, I can confirm that it works with k/k 1.34 and the v1 APIs haha 😇

@nojnhuh
Copy link
Contributor

nojnhuh commented Aug 20, 2025

I am using the example driver in some kubernetes documentation, and want to queue up a PR that references the tag for the image that will be released with this (in kubernetes/website#51979). Will the new image released be registry.k8s.io/dra-example-driver/dra-example-driver:v0.2.0?

Yes, I'd like to publish a v0.2.0 image and chart soon after this merges.

Also for what it's worth, I did run through that tutorial with a local build of the example driver from this branch and it worked great so, I can confirm that it works with k/k 1.34 and the v1 APIs haha 😇

Thanks for checking!

@klueska
Copy link
Contributor

klueska commented Aug 20, 2025

I know I saw the changes from NVIDIA/k8s-dra-driver-gpu@920a287 integrated in the PR at some point, but they seem to have been dropped...

@alimaazamat
Copy link
Contributor Author

alimaazamat commented Aug 20, 2025

I know I saw the changes from NVIDIA/k8s-dra-driver-gpu@920a287 integrated in the PR at some point, but they seem to have been dropped...

I dropped them because I was thinking the multiversion compatability would be handled with runtime-config in kind-cluster-config.yaml?
DRA client is used only for controller components for Nvidia like here https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/cmd/compute-domain-controller/main.go#L178
But we only use kubelet plugin which only uses coreclient https://github.com/alimaazamat/dra-example-driver/blob/update-to-v1beta2/cmd/dra-example-kubeletplugin/driver.go#L58

@alimaazamat
Copy link
Contributor Author

alimaazamat commented Aug 20, 2025

Also having runtime-config: "resource.k8s.io/all=true" causes e2e tests to fail (regardless of including/excluding changes from NVIDIA/k8s-dra-driver-gpu@920a287)

[control-plane-check] kube-apiserver is not healthy after 4m0.00028186s
A control plane component may have crashed or exited when started by the container runtime.
error: error execution phase wait-control-plane: failed while waiting for the control plane to start: kube-apiserver check failed at https://172.18.0.2:6443/livez: client rate limiter Wait returned an error: context deadline exceeded

When excluding changes from NVIDIA/k8s-dra-driver-gpu@920a287 and with runtime-config: "resource.k8s.io/all=v1beta1" we pass

@nojnhuh
Copy link
Contributor

nojnhuh commented Aug 21, 2025

Also having runtime-config: "resource.k8s.io/all=true" causes e2e tests to fail (regardless of including/excluding changes from NVIDIA/k8s-dra-driver-gpu@920a287)

It looks like all only works as api/all, but not for a particular API group. Here we probably have to enable each one individually. Let's enable both v1beta1 and v1beta2 (leaving v1 enabled by default).
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#:~:text=%2D%2Druntime%2Dconfig%20%3Ccomma%2Dseparated%20%27key%3Dvalue%27%20pairs%3E

@alimaazamat
Copy link
Contributor Author

alimaazamat commented Aug 21, 2025

It looks like all only works as api/all, but not for a particular API group. Here we probably have to enable each one individually. Let's enable both v1beta1 and v1beta2 (leaving v1 enabled by default). https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#:~:text=%2D%2Druntime%2Dconfig%20%3Ccomma%2Dseparated%20%27key%3Dvalue%27%20pairs%3E

With runtime-config: "resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true" I tested this with:
1.32 cluster errors out

[api-check] The API server is not healthy after 4m0.001581692s

Unfortunately, an error has occurred:
       context deadline exceeded

This error is likely caused by:
       - The kubelet is not running
       - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

1.33 cluster (correctly uses v1beta2)
1.34 cluster (correctly uses v1)

To recap:
1.32 allows v1beta1
1.33 allows v1beta1, v1beta2
1.34 allows v1beta1, v1beta2, v1

So I think defining both v1beta1 and v1beta2 in runtime config doesn't work becasue v1beta2 not available for 1.32
Because with just runtime-config: "resource.k8s.io/v1beta1=true 1.32 cluster works

@nojnhuh
Copy link
Contributor

nojnhuh commented Aug 26, 2025

I ran through all the valid combinations of v1beta1/v1beta2/v1 and 1.32/1.33/1.34 and things are working as I expect, so overall LGTM.

@alimaazamat As soon as the k8s.io dependencies and kind image for the final 1.34.0 release are available let's get a clean CI run before squashing and merging this. Feel free to squash earlier too if that's easier.

This was referenced Aug 26, 2025
Copy link
Contributor

@nojnhuh nojnhuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 27, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alimaazamat, nojnhuh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2025
@k8s-ci-robot k8s-ci-robot merged commit b252c27 into kubernetes-sigs:main Aug 27, 2025
6 checks passed
@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Aug 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update DRA APIs to v1
5 participants