Description
Hello,
When installing the vgpu-driver-manager, I've noticed that the openshift driver toolkit container tries to download the lspci package which is a hard requirement for vgpu setup.
Here's the concerned line for RHEL9 based vgpu manager:
https://github.com/NVIDIA/gpu-driver-container/blob/9999eb06e44aa75d88587074adddd42bde61404c/vgpu-manager/rhel9/ocp_dtk_entrypoint#L117
Unfortunately, this causes an issue when the container is air-gapped or can only reach the internet through a proxy, since the package installation then fails.
To be able to continue, I had to patch the daemonset manually to add two things:
- The corresponding proxy variables (HTTPS_PROXY...) since the env overrides in the ClusterPolicy object are only applied to the `nvidia-vgpu-manager-ctr` container and not the `openshift-driver-toolkit-ctr` container. See gpu-operator/controllers/object_controls.go, line 3043 in 4011723: `func transformOpenShiftDriverToolkitContainer(obj *appsv1.DaemonSet, config *gpuv1.ClusterPolicySpec, n ClusterPolicyController, operandContainerName string) error {`
- A volume containing our CA certificate bundle and an associated volume mount as some proxies use TLS interception and require additional CAs.
From my point of view, this could be fixed quite easily using one of two ways:
- Dynamically. Change the `transformOpenShiftDriverToolkitContainer` function to also override the container environment variables, just like `transformVGPUManagerContainer` does, and then add another field in the ClusterPolicy object to allow arbitrary volume mounts.
- Statically. Instead, make these changes directly during the image build by modifying the Dockerfiles and Makefile in https://github.com/NVIDIA/gpu-driver-container and adding an option to specify a proxy and CA volume override.
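To make the dynamic option more concrete, here is a rough sketch of what applying the env overrides to the driver toolkit container could look like. It is only an illustration: `EnvVar` and `Container` are simplified stand-ins for the Kubernetes `corev1` types so the snippet runs on its own, `overrideEnv` is a hypothetical helper name and not an existing operator function, and the proxy values are made up.

```go
package main

import "fmt"

// EnvVar and Container are simplified stand-ins for corev1.EnvVar and
// corev1.Container, so this sketch runs without Kubernetes dependencies.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

// overrideEnv is a hypothetical helper: for each override it replaces the
// value of an existing variable with the same name, or appends a new one.
func overrideEnv(c *Container, overrides []EnvVar) {
	for _, o := range overrides {
		replaced := false
		for i := range c.Env {
			if c.Env[i].Name == o.Name {
				c.Env[i].Value = o.Value
				replaced = true
				break
			}
		}
		if !replaced {
			c.Env = append(c.Env, o)
		}
	}
}

func main() {
	// The container that currently does not receive the ClusterPolicy env.
	dtk := Container{
		Name: "openshift-driver-toolkit-ctr",
		Env:  []EnvVar{{Name: "HTTPS_PROXY", Value: ""}},
	}

	// Example values only; in the operator these would come from the
	// env list configured in the ClusterPolicy object.
	overrides := []EnvVar{
		{Name: "HTTPS_PROXY", Value: "http://proxy.example.internal:3128"},
		{Name: "NO_PROXY", Value: ".cluster.local"},
	}

	overrideEnv(&dtk, overrides)
	for _, e := range dtk.Env {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

In the real operator the same loop would run inside `transformOpenShiftDriverToolkitContainer` against the matching container of the DaemonSet; the CA bundle volume and mount would still need the separate ClusterPolicy field mentioned above.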
Let me know what you think, good day 😄