Skip to content

Conversation

@ArangoGutierrez
Copy link
Collaborator

Implement automated forward compatibility tests that validate GPU Operator
against the latest published images from NVIDIA component repositories.

Changes:

  • Add forward-compatibility.yaml workflow (weekly + manual trigger)
  • Create get-latest-images.sh to fetch latest commit-based images
  • Extend e2e-tests.yaml with optional component image inputs
  • Update install-operator.sh to support device-plugin and mig-manager overrides
  • Add Slack notifications for test failures
  • Create variables.yaml as reusable workflow for shared CI variables

Components tested:

  • ghcr.io/nvidia/container-toolkit
  • ghcr.io/nvidia/k8s-device-plugin
  • ghcr.io/nvidia/k8s-mig-manager

We could add other operands later, but I wanted to start with the core ones.

Schedule: Every Monday at 2 AM UTC

@github-advanced-security
Copy link

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@ArangoGutierrez ArangoGutierrez force-pushed the integration-tests branch 2 times, most recently from d55dd82 to 9dab00e Compare November 10, 2025 17:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements automated weekly forward compatibility testing for the GPU Operator against the latest published component images from NVIDIA repositories. The changes refactor the existing monolithic CI workflow into reusable workflow modules and add forward compatibility testing infrastructure.

Key changes:

  • Refactored CI workflows into separate, reusable workflow files (variables, golang-checks, config-checks, image-builds, e2e-tests, release)
  • Added forward-compatibility.yaml workflow with scheduled weekly runs and manual trigger support
  • Created get-latest-images.sh script to fetch latest commit-based images from component repositories
  • Extended install-operator.sh to support device-plugin and mig-manager image overrides
  • Added Slack notifications for test failures

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
.github/workflows/forward-compatibility.yaml New workflow for weekly forward compatibility testing with Slack notifications
.github/workflows/ci.yaml Refactored to use reusable workflows instead of monolithic job definitions
.github/workflows/variables.yaml New reusable workflow for shared CI variables
.github/workflows/e2e-tests.yaml New reusable e2e test workflow with optional component image inputs
.github/workflows/image-builds.yaml New reusable workflow for GPU operator image builds
.github/workflows/release.yaml New reusable workflow for release processes
.github/workflows/golang-checks.yaml New reusable workflow for Go checks and tests
.github/workflows/config-checks.yaml New reusable workflow for configuration validation
.github/workflows/code-scanning.yaml New reusable workflow for CodeQL scanning
.github/scripts/get-latest-images.sh Script to fetch latest component images from NVIDIA repositories
tests/scripts/install-operator.sh Extended to support device-plugin and mig-manager image overrides

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ArangoGutierrez ArangoGutierrez force-pushed the integration-tests branch 3 times, most recently from 84a0c31 to 54d6fc9 Compare November 11, 2025 13:27
@ArangoGutierrez
Copy link
Collaborator Author

PING @rahulait @rajathagasthya @tariq1890

env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Use workflow_dispatch inputs if provided, otherwise fetch latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: when would someone set these variables as inputs? If the scope of this workflow is to always test the top-of-tree images for all operands, then maybe we can remove the input variables and move the entire run: block into a shell script? Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is to allow manual triggering of the workflow in case we want to test a very specific image without having to wait for the "weekly" run. But I'll set it in a way that even during manual triggers, we will simply fetch all the latests images from operands.

Comment on lines 101 to 103
toolkit_image: ${{ needs.fetch-latest-images.outputs.toolkit_image }}
device_plugin_image: ${{ needs.fetch-latest-images.outputs.device_plugin_image }}
mig_manager_image: ${{ needs.fetch-latest-images.outputs.mig_manager_image }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the current tests/scripts/install-operator.sh has separate input variables for the various operand images, but would it simplify our CI definition if we passed a values override file instead? Just an idea -- the previous step fetch-latest-images can craft the values override file that points to all the latest operand images. This file is passed as input to the run-e2e-tests step which launches our e2e test script with this overrides file. Thoughts? The reason I am suggesting this is because I see a lot of boilerplate in forward-compatibility.yaml and e2e-tests.yaml for declaring all of these input variables.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a stab at this comment, please let me know what you think

Comment on lines 22 to 33
toolkit_image:
description: 'Override container-toolkit image'
required: false
type: string
device_plugin_image:
description: 'Override device-plugin image'
required: false
type: string
mig_manager_image:
description: 'Override mig-manager image'
required: false
type: string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these variables need to be inputs? Aren't we always resolving the images in this workflow itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines 54 to 56
with:
operator_image: ${{ needs.variables.outputs.operator_image_base }}
operator_version: ${{ needs.variables.outputs.operator_version }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- could we potentially move this to the variables workflow itself? As in, make operator_image and operator_version as outputs of the variables worflow?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, done!

Comment on lines +63 to +66
with:
commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }}
operator_version: ${{ needs.variables.outputs.operator_version }}
operator_image_base: ${{ needs.variables.outputs.operator_image_base }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above -- is this really needed? Can we move these variables as outputs of the variables workflow itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, done!

Comment on lines 80 to 85
with:
operator_image: ${{ inputs.operator_image }}
operator_version: ${{ inputs.operator_version }}
toolkit_image: ${{ inputs.toolkit_image }}
device_plugin_image: ${{ inputs.device_plugin_image }}
mig_manager_image: ${{ inputs.mig_manager_image }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for passing these variables as input to the variables workflow?

Split monolithic ci.yaml (448 lines) into 6 specialized workflows.
Refactored ci.yaml as orchestrator (118 lines) with centralized variable
calculation. All workflows support workflow_call, workflow_dispatch, and
standalone execution. Eliminates duplicate variable calculations.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Implement automated weekly tests validating GPU Operator against latest
container-toolkit, device-plugin, and mig-manager images from GHCR.

- Add forward-compatibility.yaml workflow with Slack alerts
- Create get-latest-images.sh for fetching latest commit-based tags
- Extend e2e-tests.yaml and install-operator.sh for component overrides
- Add variables.yaml reusable workflow for shared CI variables

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Replace individual component image environment variables with a values
override file approach. Add helper scripts for generating values files
from env vars and component images. Ensure clean separation between
Helm values file (-f) and --set flags to avoid conflicts.

This reduces workflow boilerplate by ~58% and makes adding new operands
trivial (no workflow changes needed).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants