@karthikvetrivel
Member

@karthikvetrivel karthikvetrivel commented Dec 9, 2025

Description

Adds optional extended diagnostics to must-gather.sh using a lightweight debug container. Enables complete nvidia-bug-report collection including dmidecode and lspci without adding these tools to the driver container (addressing CVE compliance concerns). This feature is opt-in only; when enabled, users are shown a warning about the external debug container and privileged access requirements before collection begins.

Usage

Standard (existing behavior)

./must-gather.sh

Extended diagnostics

ENABLE_EXTENDED_DIAGNOSTICS=true ./must-gather.sh
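The opt-in behavior described above could be gated roughly as follows. This is a minimal sketch, not the code in this PR: the function names `extended_diagnostics_enabled` and `warn_extended_diagnostics`, the warning wording, and the strict comparison against the literal string `true` are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: function names and warning text are hypothetical,
# not taken from must-gather.sh.

# Returns 0 only when the user explicitly opted in with the value "true".
# Treating every other value as "off" is an assumption of this sketch.
extended_diagnostics_enabled() {
    [ "${ENABLE_EXTENDED_DIAGNOSTICS:-false}" = "true" ]
}

# Prints the warning to stderr so it never mixes with collected output.
warn_extended_diagnostics() {
    local image="$1"
    {
        echo "WARNING: extended diagnostics are enabled."
        echo "  An external debug container (${image}) will be started"
        echo "  and requires privileged access."
    } >&2
}

if extended_diagnostics_enabled; then
    warn_extended_diagnostics "ghcr.io/nvidia/gpu-operator-debug:latest"
fi
```

With the variable unset, the script prints nothing and the standard collection path is unaffected.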

Testing

Environment: K8s v1.28+, Tesla T4 GPU, ghcr.io/nvidia/gpu-operator-debug:latest

  1. Ran ./must-gather.sh without flags and verified standard nvidia-bug-report collection works unchanged. Confirmed no extended diagnostics section is added in default mode.
  2. Ran with ENABLE_EXTENDED_DIAGNOSTICS=true and verified the debug container (ghcr.io/nvidia/gpu-operator-debug:latest) attaches successfully via kubectl debug.
  3. dmidecode output: Confirmed BIOS/system information is captured and appended to the bug report.
  4. lspci output: Confirmed verbose PCI device information is captured.
  5. Verified Makefile targets build and push the debug image to ghcr.io/nvidia/gpu-operator-debug:latest (public), and confirmed custom image override works for air-gapped environments.

Sample output verification:

$ zcat nvidia-bug-report_*.log.gz | grep -A3 "EXTENDED DIAGNOSTICS"
*** EXTENDED DIAGNOSTICS (from debug container) ***
$ zcat nvidia-bug-report_*.log.gz | grep "NVIDIA Corporation"
65:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
$ zcat nvidia-bug-report_*.log.gz | grep "BIOS Information" -A2
BIOS Information
    Vendor: American Megatrends Inc.
    Version: 3.3

**File size:** Standard 591KB → Extended 596KB (+5KB diagnostics)
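The manual checks in the sample output can be bundled into one helper. The sketch below is an assumption-laden convenience, not part of this PR: `check_extended_report` is a hypothetical name, and the three patterns are simply the strings grepped for above. It uses `gzip -dc` rather than `zcat` only for portability.

```shell
#!/usr/bin/env bash
# Sketch: check_extended_report is a hypothetical helper, not part of the PR.
# It returns 0 only if all three strings from the sample output are present
# in the gzip-compressed bug report passed as the first argument.
check_extended_report() {
    local report="$1" pattern
    for pattern in "EXTENDED DIAGNOSTICS" "NVIDIA Corporation" "BIOS Information"; do
        if ! gzip -dc "${report}" | grep -qF -e "${pattern}"; then
            echo "missing: ${pattern}" >&2
            return 1
        fi
    done
}
```

For example, `check_extended_report nvidia-bug-report.log.gz && echo "extended section present"` (the file name here is illustrative).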

…ner for dmidecode/lspci collection

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
Comment on lines +369 to +370
collect_debug_diagnostic "${pod_name}" "${pod_nodename}" "dmidecode" "" "${bug_report_file}"
collect_debug_diagnostic "${pod_name}" "${pod_nodename}" "lspci" "-vvv" "${bug_report_file}"
Contributor

Question -- would it be possible to run nvidia-bug-report.sh itself in the debug container?

Member Author

@karthikvetrivel karthikvetrivel Dec 15, 2025

Yeah, I believe so. However, it'd require adding a lot of NVIDIA utilities/libraries to the debug container. I don't think the debug container would be "lightweight" anymore.

Do you prefer running the script from the debug container itself?
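
For context on the trade-off discussed in this thread, the two reviewed calls pass a pod name, a node name, a tool, its arguments, and the bug report file. A helper with that signature might look roughly like the sketch below. This is not the implementation in the PR: the `DEBUG_IMAGE` and `KUBECTL` variable names, the per-tool header line, and the choice of a node-level `kubectl debug` session are all assumptions, and the actual script may attach to the pod instead.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real collect_debug_diagnostic may differ.
# DEBUG_IMAGE and KUBECTL are assumed variable names, not from the PR.
DEBUG_IMAGE="${DEBUG_IMAGE:-ghcr.io/nvidia/gpu-operator-debug:latest}"
KUBECTL="${KUBECTL:-kubectl}"

collect_debug_diagnostic() {
    local pod_name="$1" node_name="$2" tool="$3" tool_args="$4" out_file="$5"
    local output

    # Assumption: a node-level debug pod runs the tool and we capture stdout.
    # tool_args is intentionally unquoted so an empty string adds no argument.
    # shellcheck disable=SC2086
    if ! output="$("${KUBECTL}" debug "node/${node_name}" \
            --image="${DEBUG_IMAGE}" --attach=true --quiet \
            -- "${tool}" ${tool_args})"; then
        echo "WARNING: ${tool} collection failed for ${pod_name} on ${node_name}" >&2
        return 1
    fi

    # Append (never overwrite) so the standard bug report content is kept.
    {
        echo "*** ${tool} ${tool_args} ***"
        echo "${output}"
    } >> "${out_file}"
}
```

Because only the tool binary runs in the debug container, the image needs just `dmidecode` and `lspci`; running nvidia-bug-report.sh there would instead require the NVIDIA utilities and libraries mentioned above.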
