@cyberchip-wang
Contributor

Issue #, if available:

Description of changes:

Add a CloudFormation template for running distributed NCCL tests on P6-b200.48xlarge instances using AWS Batch Multi-Node Parallel jobs with Capacity Block support. The template configures 8 EFA interfaces per instance, stores the SSH secrets in Secrets Manager with KMS encryption, and enforces IMDSv2. Security is hardened with managed IAM policies, and all scan findings have been addressed.
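For readers unfamiliar with the launch template settings named above, here is a minimal sketch of how 8 EFA interfaces and IMDSv2 enforcement could be expressed as `LaunchTemplateData`. It is not taken from the template in this PR. The helper name `build_launch_template_data` is hypothetical, the device-index layout (0 on the first network card, 1 on the others) is an assumption borrowed from common multi-card EFA setups, and security groups and subnets are omitted.

```python
import json


def build_launch_template_data(num_efa: int = 8) -> dict:
    """Return a dict shaped like EC2 LaunchTemplateData (illustrative only)."""
    interfaces = []
    for card in range(num_efa):
        interfaces.append({
            "InterfaceType": "efa",
            # One EFA interface per network card.
            "NetworkCardIndex": card,
            # Assumption: device index 0 on the first card, 1 on the rest.
            "DeviceIndex": 0 if card == 0 else 1,
        })
    return {
        "NetworkInterfaces": interfaces,
        # HttpTokens=required rejects IMDSv1 requests, i.e. enforces IMDSv2.
        "MetadataOptions": {"HttpTokens": "required"},
    }


if __name__ == "__main__":
    print(json.dumps(build_launch_template_data(), indent=2))
```

The dict can be dumped to JSON or YAML and compared against the `LaunchTemplateData` section of the merged template to see where the two differ.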

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

cyberchip-wang and others added 17 commits October 28, 2025 20:24
- Remove jq dependency and JSON parsing
- Auto-generate EC2 SSH key pair during CloudFormation deployment
- Store private key in Secrets Manager automatically
- Replace custom Dockerfile and bootstrap.sh with inline command in Job Definition
- Use base nccl-tests image directly from public ECR
- All setup logic now in single CloudFormation template
- Remove intermediate variables, use env vars directly
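As a rough illustration of what "use env vars directly" can look like in a Multi-Node Parallel job, the sketch below reads the environment variables that AWS Batch sets on each node and decides whether the container is the main node or a worker. The inline command in this PR is not shown here, so this is an assumption about the general pattern rather than the template's logic. The function names `node_role` and `main_node_address` are hypothetical.

```python
import os
from typing import Mapping, Optional


def node_role(env: Mapping[str, str] = os.environ) -> str:
    """Return "main" or "worker" based on AWS Batch MNP environment variables."""
    node = int(env["AWS_BATCH_JOB_NODE_INDEX"])
    main = int(env["AWS_BATCH_JOB_MAIN_NODE_INDEX"])
    return "main" if node == main else "worker"


def main_node_address(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the main node's private IPv4 address, if Batch provided it.

    Batch sets this variable on child nodes only, so the main node gets None.
    """
    return env.get("AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS")


if __name__ == "__main__":
    # Example environment for the second node of a two-node job.
    sample = {
        "AWS_BATCH_JOB_NODE_INDEX": "1",
        "AWS_BATCH_JOB_MAIN_NODE_INDEX": "0",
        "AWS_BATCH_JOB_NUM_NODES": "2",
        "AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS": "10.0.0.5",
    }
    print(node_role(sample), main_node_address(sample))
```

In a typical setup the workers would start `sshd` and wait, while the main node collects worker addresses and launches `mpirun`; those steps are left out because they depend on the image and the job definition.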

Author: yusongw@
- Use hostname -i for container IP in awsvpc mode
- Set NCCL_SOCKET_IFNAME=^lo,docker,ecs to exclude bridge interfaces
- Add BatchJobRole with ecs-tasks trust for container credentials
- Simplify SSH key handling by generating keys at runtime
- Remove debug output and set NCCL_DEBUG=WARN
- Remove ECR repository (using public ECR image)
- Add KMS encryption with key rotation for Secrets Manager
- Convert inline IAM policies to managed policies
- Remove explicit resource names for auto-generation
- Enforce IMDSv2 on Launch Template
- Add suppression for SSH key rotation (not applicable)
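The `NCCL_SOCKET_IFNAME=^lo,docker,ecs` setting in the list above relies on NCCL's interface filter syntax: a leading `^` excludes the listed names, a leading `=` requests exact matches, and names are otherwise treated as prefixes. The following is a simplified model of that documented behavior, written to show which interfaces the setting would skip. It is not NCCL's own code, and `nccl_ifname_allows` is a hypothetical helper.

```python
def nccl_ifname_allows(spec: str, ifname: str) -> bool:
    """Simplified model of NCCL_SOCKET_IFNAME filtering for one interface name."""
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    names = [n for n in spec.split(",") if n]
    if exact:
        matched = ifname in names
    else:
        # Without "=", each entry is a prefix: "docker" matches "docker0".
        matched = any(ifname.startswith(n) for n in names)
    return not matched if exclude else matched


if __name__ == "__main__":
    spec = "^lo,docker,ecs"
    for name in ["lo", "docker0", "ecs-eth0", "eth0"]:
        print(name, nccl_ifname_allows(spec, name))
```

With this spec only `eth0` in the sample list passes the filter, which is the intent of excluding loopback and bridge interfaces in awsvpc mode.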
…rmance

Use public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4
- CUDA 12.8.1
- EFA 1.42.0
- OFI (libfabric) 1.16.0
- NCCL 2.27.5
- NCCL tests 2.16.4
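The component versions listed above are all encoded in the image tag, so they can be recovered programmatically, for example to log them next to benchmark results. The sketch below assumes the tag layout seen in this one image reference (`cuda…-efa…-ofiv…-ncclv…-testsv…`, with an optional numeric revision after the NCCL version); other tags in the repository may not follow it. `parse_nccl_tests_tag` is a hypothetical helper.

```python
import re

# Pattern inferred from the single tag quoted in this PR (an assumption).
TAG_RE = re.compile(
    r"cuda(?P<cuda>\d+(?:\.\d+)*)"
    r"-efa(?P<efa>\d+(?:\.\d+)*)"
    r"-ofiv(?P<ofi>\d+(?:\.\d+)*)"
    r"-ncclv(?P<nccl>\d+(?:\.\d+)*)(?:-(?P<nccl_rev>\d+))?"
    r"-testsv(?P<tests>\d+(?:\.\d+)*)"
)


def parse_nccl_tests_tag(image: str) -> dict:
    """Split an image reference into its component versions."""
    tag = image.rsplit(":", 1)[-1]
    match = TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"unrecognized tag layout: {tag!r}")
    return match.groupdict()


if __name__ == "__main__":
    image = (
        "public.ecr.aws/hpc-cloud/nccl-tests:"
        "cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4"
    )
    for name, version in parse_nccl_tests_tag(image).items():
        print(f"{name}: {version}")
```

Note that the tag carries a `-1` suffix after the NCCL version, which the parser reports separately as `nccl_rev`.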

@t-brown t-brown left a comment

LGTM
Contributor

@iankouls-aws iankouls-aws left a comment

The PR updates README.md and adds a CFN template for P6 distributed training with AWS Batch. Approved.

@iankouls-aws iankouls-aws merged commit bf21fd3 into aws-samples:main Nov 21, 2025

3 participants