AWS Batch P6-B200 Distributed Training with Multi-Node Parallel Support #893
Merged
iankouls-aws
merged 17 commits into
aws-samples:main
from
cyberchip-wang:P6-b200-AWS-Batch-MNP
Nov 21, 2025
…trap, update CFN to include secret and ECR
- Remove jq dependency and JSON parsing
- Auto-generate EC2 SSH key pair during CloudFormation deployment
- Store private key in Secrets Manager automatically
- Replace custom Dockerfile and bootstrap.sh with inline command in Job Definition
- Use base nccl-tests image directly from public ECR
- All setup logic now in single CloudFormation template
- Remove intermediate variables, use env vars directly

Author: yusongw@
…etup, manual SSH key generation
…CS image, fix IMDSv2 and PATH issues
…registration and worker IP passing
- Use `hostname -i` for container IP in awsvpc mode
- Set `NCCL_SOCKET_IFNAME=^lo,docker,ecs` to exclude bridge interfaces
- Add BatchJobRole with ecs-tasks trust for container credentials
- Simplify SSH key generation with runtime generation
- Remove debug output and set `NCCL_DEBUG=WARN`
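To illustrate what the `NCCL_SOCKET_IFNAME=^lo,docker,ecs` setting in this commit is meant to do, here is a minimal Python sketch of the exclusion rule as NCCL documents it: a leading `^` excludes any interface whose name starts with one of the listed prefixes, and a following `=` switches to exact-name matching. The function name `nccl_ifname_excludes`, the `NCCL_ENV` dictionary, and the sample interface names are hypothetical and are not taken from the PR's template; NCCL itself performs this filtering internally.

```python
# Hypothetical helper: mimics NCCL's documented exclusion syntax for
# NCCL_SOCKET_IFNAME. It is an illustration, not NCCL's own code.

def nccl_ifname_excludes(spec: str, ifname: str) -> bool:
    """Return True if `ifname` is ruled out by an exclusion-style spec."""
    if not spec.startswith("^"):
        raise ValueError("only exclusion specs (leading '^') are handled here")
    body = spec[1:]
    exact = body.startswith("=")  # '^=name' means exact-name exclusion
    if exact:
        body = body[1:]
    names = [part for part in body.split(",") if part]
    if exact:
        return ifname in names
    # Default behaviour: treat each entry as a prefix.
    return any(ifname.startswith(prefix) for prefix in names)


# Values quoted from the commit message above.
NCCL_ENV = {
    "NCCL_SOCKET_IFNAME": "^lo,docker,ecs",
    "NCCL_DEBUG": "WARN",
}

if __name__ == "__main__":
    # Sample interface names are assumptions chosen for illustration.
    for name in ("lo", "docker0", "ecs-bridge", "eth0"):
        excluded = nccl_ifname_excludes(NCCL_ENV["NCCL_SOCKET_IFNAME"], name)
        print(f"{name}: {'excluded' if excluded else 'usable'}")
```

Because the entries are prefixes, `docker` covers `docker0`, `docker1`, and so on, which is why the commit can list three short names rather than every bridge interface.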
- Remove ECR repository (using public ECR image)
- Add KMS encryption with key rotation for Secrets Manager
- Convert inline IAM policies to managed policies
- Remove explicit resource names for auto-generation
- Enforce IMDSv2 on Launch Template
- Add suppression for SSH key rotation (not applicable)
…rmance

Use public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4
- CUDA 12.8.1
- EFA 1.42.0
- OFI (libfabric) 1.16.0
- NCCL 2.27.5
- NCCL tests 2.16.4
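The image tag in this commit encodes the component versions listed beside it. A small Python sketch, assuming the tag always follows the `cuda…-efa…-ofiv…-ncclv…-testsv…` layout seen here, shows how the versions could be read back out for a sanity check. The function name `parse_nccl_tests_tag` and the regular expression are hypothetical; treating the `-1` after the NCCL version as a build suffix is an inference from the commit's own "NCCL 2.27.5" line.

```python
import re

# Hypothetical pattern for the tag layout used in this commit.
_TAG_RE = re.compile(
    r"cuda(?P<cuda>[\d.]+)"
    r"-efa(?P<efa>[\d.]+)"
    r"-ofiv(?P<ofi>[\d.]+)"
    r"-ncclv(?P<nccl>[\d.]+)(?:-\d+)?"  # optional build suffix such as "-1"
    r"-testsv(?P<tests>[\d.]+)$"
)


def parse_nccl_tests_tag(image: str) -> dict:
    """Split an image reference into its tag and return the component versions."""
    _, _, tag = image.rpartition(":")
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag layout: {tag!r}")
    return match.groupdict()


IMAGE = (
    "public.ecr.aws/hpc-cloud/nccl-tests:"
    "cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4"
)

if __name__ == "__main__":
    for component, version in parse_nccl_tests_tag(IMAGE).items():
        print(f"{component}: {version}")
```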
t-brown
approved these changes
Nov 21, 2025
t-brown
left a comment
LGTM
iankouls-aws
approved these changes
Nov 21, 2025
Contributor
iankouls-aws
left a comment
The PR updates README.md and adds a CFN template for P6 distributed training with AWS Batch. Approved.
Issue #, if available:
Description of changes:
Add a CloudFormation template for running distributed NCCL tests on p6-b200.48xlarge instances using AWS Batch multi-node parallel (MNP) jobs with Capacity Block support. The template configures 8 EFA interfaces per instance, KMS-encrypted SSH secrets, and IMDSv2 enforcement. It is security-hardened with managed IAM policies, and all scan findings have been addressed.
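For readers unfamiliar with multi-EFA launch templates, the sketch below generates the kind of `NetworkInterfaces` list that a launch template with 8 EFA interfaces might contain. It is not taken from the template in this PR, which is not shown in this conversation. The function name `efa_network_interfaces` and the placeholder security group and subnet IDs are hypothetical, and the choice of `DeviceIndex` 0 on the first network card and 1 on the rest is an assumption based on a common pattern for multi-card EFA instances.

```python
import json

# Hypothetical generator for a launch template's NetworkInterfaces section.
# The field names (NetworkCardIndex, DeviceIndex, InterfaceType, Groups,
# SubnetId) are real EC2 launch template properties; the values are
# illustrative assumptions, not the PR's actual configuration.

def efa_network_interfaces(count=8, security_group_ids=(), subnet_id=None):
    """Build one EFA interface entry per network card."""
    interfaces = []
    for card in range(count):
        entry = {
            "NetworkCardIndex": card,
            # Assumption: primary interface on card 0, secondary elsewhere.
            "DeviceIndex": 0 if card == 0 else 1,
            "InterfaceType": "efa",
            "Groups": list(security_group_ids),
        }
        if subnet_id is not None:
            entry["SubnetId"] = subnet_id
        interfaces.append(entry)
    return interfaces


if __name__ == "__main__":
    # Placeholder IDs; substitute real values for an actual deployment.
    spec = efa_network_interfaces(
        security_group_ids=["sg-EXAMPLE"], subnet_id="subnet-EXAMPLE"
    )
    print(json.dumps(spec, indent=2))
```

Generating the list in code rather than writing eight near-identical entries by hand makes the per-card index harder to get wrong, though a static YAML list in the template works equally well.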
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.