Commit bf21fd3

AWS Batch P6-B200 Distributed Training with Multi-Node Parallel Support (#893)
* P6-b200: use Secrets Manager for SSH keys, remove NCCL cmd from bootstrap, update CFN to include secret and ECR
* Simplify AWS Batch P6 deployment with inline setup script
  - Remove jq dependency and JSON parsing
  - Auto-generate EC2 SSH key pair during CloudFormation deployment
  - Store private key in Secrets Manager automatically
  - Replace custom Dockerfile and bootstrap.sh with inline command in Job Definition
  - Use base nccl-tests image directly from public ECR
  - All setup logic now in single CloudFormation template
  - Remove intermediate variables, use env vars directly

  Author: yusongw@
* removed CHANGES.md
* Simplify AWS Batch P6 setup: remove jq dependency, inline container setup, manual SSH key generation
* Auto-create resource group in P6 template, simplify deployment to 3 steps
* Fix P6 deployment: use capacity reservation ID directly, add AL2023 ECS image, fix IMDSv2 and PATH issues
* Add SSH key parameter for deployment, start sshd, fix main node self-registration and worker IP passing
* Fix MNP networking: use container IP and exclude bridge interfaces
  - Use hostname -i for container IP in awsvpc mode
  - Set NCCL_SOCKET_IFNAME=^lo,docker,ecs to exclude bridge interfaces
  - Add BatchJobRole with ecs-tasks trust for container credentials
  - Simplify SSH key generation with runtime generation
  - Remove debug output and set NCCL_DEBUG=WARN
* updated README.md to have P6 support
* Fix table of contents links in README
* fix: correct VPC template filename reference in README
* fix link
* delete backup file
* fix: address security scan findings
  - Remove ECR repository (using public ECR image)
  - Add KMS encryption with key rotation for Secrets Manager
  - Convert inline IAM policies to managed policies
  - Remove explicit resource names for auto-generation
  - Enforce IMDSv2 on Launch Template
  - Add suppression for SSH key rotation (not applicable)
* feat: update NCCL tests image to specific version for better P6 performance

  Use public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4
  - CUDA 12.8.1
  - EFA 1.42.0
  - OFI (libfabric) 1.16.0
  - NCCL 2.27.5
  - NCCL tests 2.16.4
1 parent 8d33454 commit bf21fd3

2 files changed

+620
-38
lines changed

Lines changed: 151 additions & 38 deletions
@@ -1,61 +1,174 @@
1-
# AWS Batch distributed training architectures
1+
# AWS Batch Distributed Training Architectures
22

3-
This architecture serves as an example to run distributed training jobs on p4d.24xlarge instances but can be easily be modified to accommodate other instance kinds (Trn or other P instances).
3+
This repository provides CloudFormation templates and examples for running distributed training jobs on AWS Batch using GPU instances. The architecture can be easily modified to accommodate different instance types including Trainium (Trn) and other P-series instances.
44

5-
> **Important**: it is assumed that you deployed the VPC template [`2.vpc-one-az.yaml`](../0.vpc_network/2.vpc-oneaz.yaml) as our Batch template will fetch automatically the EFA Security Group ID (SG) and Subnet ID to setup the AWS Batch Compute Environment. Both the SG and Subnet are exported values from the VPC template.
5+
## Table of Contents
66

7-
This architecture consists of the following resources:
7+
- [Prerequisites](#prerequisites)
8+
- [Architecture Overview](#architecture-overview)
9+
- [Available Templates](#available-templates)
10+
- [P4 Instance Deployment](#p4-instance-deployment)
11+
- [P5 Instance Deployment](#p5-instance-deployment)
12+
- [P6 Instance Deployment](#p6-instance-deployment)
13+
- [Important Considerations](#important-considerations)
814

9-
- [AWS Batch Compute Environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) for [Multi-node parallel jobs](https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html). It is similar to a compute cluster.
10-
- [AWS Batch Job Queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) attached to the compute environment. It is similar to a queue for job schedulers (Slurm, LSF...).
11-
- [EC2 Launch Template](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html) which used to setup 4 EFA cards on our instance.
12-
- [Job Definition](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) serves as a template for our jobs and refers to the container registry to pull containers
13-
- [ECR Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) is used to store containers.
15+
## Prerequisites
1416

15-
## Template
17+
> **⚠️ Important**: You must deploy the VPC template [`2.vpc-one-az.yaml`](../../1.architectures/1.vpc_network/2.vpc-one-az.yaml) before deploying any Batch template. The Batch templates automatically fetch the EFA Security Group ID and Subnet ID from the VPC template's exported values.
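
Before deploying a Batch template, it can help to confirm that the VPC stack exists and has finished deploying. The sketch below assumes the AWS CLI is installed and configured; the function name `vpc_stack_ready` is a hypothetical helper, not part of the templates.

```bash
# Hypothetical helper: succeeds only if the named CloudFormation stack
# reports a completed create or update.
vpc_stack_ready() {
  local stack_name="$1"
  local status
  # describe-stacks exits non-zero when the stack does not exist.
  status=$(aws cloudformation describe-stacks \
    --stack-name "${stack_name}" \
    --query 'Stacks[0].StackStatus' \
    --output text 2>/dev/null) || return 1
  case "${status}" in
    CREATE_COMPLETE|UPDATE_COMPLETE) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `vpc_stack_ready your-vpc-stack-name && echo "VPC stack is ready"` prints the message only when the stack is in a completed state.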
1618
17-
This template deploys AWS Batch and EC2 resources. It can be deployed via the console and the AWS CLI. Regardless of the deployment method it is assumed that you deployed the VPC template [`2.vpc-one-az.yaml`](../0.vpc_network/2.vpc-oneaz.yaml) prior to deploying that one.
19+
## Architecture Overview
1820

19-
- **Template file**: [`0.aws-batch-distributed-training.yaml`](./0.aws-batch-distributed-training.yaml)
21+
This architecture consists of the following AWS resources:
2022

21-
### Quick Create
23+
| Component | Purpose | Documentation |
24+
|-----------|---------|---------------|
25+
| **AWS Batch Compute Environment** | Manages compute resources for multi-node parallel jobs (similar to a compute cluster) | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) |
26+
| **AWS Batch Job Queue** | Queues jobs for execution (similar to Slurm/LSF schedulers) | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) |
27+
| **EC2 Launch Template** | Configures EFA network interfaces for high-performance networking | [AWS Docs](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html) |
28+
| **Job Definition** | Template for job execution, references container images | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) |
29+
| **ECR Container Registry** | Stores Docker container images | [AWS Docs](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) |
2230

23-
[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/0.aws-batch-distributed-training.yaml&stackName=AWS-Batch)
31+
<img src="../../0.docs/batch-arch.png" width="600" alt="AWS Batch Architecture Diagram">
2432

33+
## Available Templates
2534

26-
## List of Parameters
35+
| Template | Instance Types | Features |
36+
|----------|----------------|----------|
37+
| [`0.aws-batch-distributed-training.yaml`](./0.aws-batch-distributed-training.yaml) | `p4d.24xlarge` (default) | Standard deployment with 4 EFA interfaces |
38+
| [`0.aws-batch-distributed-training-p5.yaml`](./0.aws-batch-distributed-training-p5.yaml) | `p5.48xlarge` | Optimized for P5 instances |
39+
| [`aws-batch-distributed-training-p6.yaml`](./aws-batch-distributed-training-p6.yaml) | `p6-b200.48xlarge` | P6 deployment with a sample AWS Batch multi-node parallel (MNP) job setup |
2740

28-
The templates takes parameters that are mandatory and optional, see below for more details.
41+
## P4 Instance Deployment
2942

30-
| Name | Type | Details |
31-
|-------------------------|-------------|-----------------------------------------------------------------------|
32-
| `VPCStackParameter` | Required | Name of the VPC stack in CloudFormation. |
33-
| `AMI` | Optional | ID of the AMI if using a custom one otherwise leave blank |
34-
| `CapacityReservationId` | Optional | Use that or the ResourceGroup to refer to an EC2 reservation |
35-
| `CapacityReservationResourceGroup` | Optional | Use that or the CapacityReservationId. |
36-
| `EC2KeyPair` | Optional | EC2 key pair to use in case you want to connect through ssh for debug.|
37-
| `CreatePlacementGroup` | Optional | Create a placement group for the instances. |
43+
### Quick Deploy
3844

45+
Deploy the standard template with one click:
3946

40-
## Deploy with the AWS CLI
47+
[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/0.aws-batch-distributed-training.yaml&stackName=AWS-Batch)
4148

42-
If you'd like to deploy through the AWS CLI instead of the quick create link above, the command to deploy the template is shown below. Please edit the parameters values with your own configuration.
49+
### Parameters
50+
51+
| Parameter | Type | Description |
52+
|-----------|------|-------------|
53+
| `VPCStackParameter` | **Required** | Name of the VPC CloudFormation stack |
54+
| `AMI` | Optional | Custom AMI ID (leave blank for default) |
55+
| `CapacityReservationId` | Optional | EC2 Capacity Reservation ID |
56+
| `CapacityReservationResourceGroup` | Optional | Alternative to CapacityReservationId |
57+
| `EC2KeyPair` | Optional | EC2 key pair for SSH debugging |
58+
| `CreatePlacementGroup` | Optional | Create placement group for instances |
59+
60+
## P5 Instance Deployment
4361

4462
```bash
45-
aws cloudformation create-stack --stack-name aws-batch-p5 \
46-
--template-body file://0.aws-batch-distributed-training-p5.yaml \
47-
--parameters ParameterKey=VPCStackParameter,ParameterValue="aws-batch-vpc" \
48-
ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
49-
--capabilities CAPABILITY_NAMED_IAM
63+
aws cloudformation create-stack \
64+
--stack-name aws-batch-distributed-training \
65+
--template-body file://0.aws-batch-distributed-training-p5.yaml \
66+
--parameters \
67+
ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \
68+
ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
69+
--capabilities CAPABILITY_NAMED_IAM
5070
```
5171

52-
## Gotchas
72+
## P6 Instance Deployment
73+
74+
### Template Parameters
75+
76+
| Parameter | Type | Description |
77+
|-----------|------|-------------|
78+
| `VPCStackParameter` | **Required** | Name of the VPC CloudFormation stack |
79+
| `CapacityReservationId` | **Required** | Capacity Reservation ID (e.g., cr-1234567890) |
80+
81+
### Deployment Steps
82+
83+
#### Step 1: Deploy CloudFormation Stack
84+
85+
```bash
86+
aws cloudformation create-stack \
87+
--stack-name batch-p6 \
88+
--template-body file://aws-batch-distributed-training-p6.yaml \
89+
--parameters \
90+
ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \
91+
ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
92+
--capabilities CAPABILITY_NAMED_IAM
93+
```
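
`create-stack` returns as soon as the request is accepted, so the stack may still be building when the next step runs. A minimal sketch of waiting for completion, assuming the stack name `batch-p6` from the command above; `wait_for_stack` is a hypothetical helper name.

```bash
# Hypothetical helper: block until stack creation finishes, then print
# the final stack status. Returns non-zero if the waiter fails.
wait_for_stack() {
  local stack_name="$1"
  aws cloudformation wait stack-create-complete \
    --stack-name "${stack_name}" || return 1
  aws cloudformation describe-stacks \
    --stack-name "${stack_name}" \
    --query 'Stacks[0].StackStatus' \
    --output text
}
```

Call it as `wait_for_stack batch-p6` before moving on to Step 2.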
94+
95+
#### Step 2: Generate and Store SSH Key
96+
97+
```bash
98+
# Generate SSH key pair
99+
ssh-keygen -t rsa -b 2048 -N '' -f /tmp/batch_key
100+
101+
# Store private key in Secrets Manager
102+
aws secretsmanager put-secret-value \
103+
--secret-id batch-p6-ssh-key \
104+
--secret-string file:///tmp/batch_key
105+
106+
# Clean up temporary files
107+
rm /tmp/batch_key /tmp/batch_key.pub
108+
```
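
To confirm that the secret now holds a private key, one option is to inspect only its first line rather than printing the whole value. This is a sketch under the assumption that the secret ID is `batch-p6-ssh-key`, as in the command above; `secret_has_private_key` is a hypothetical helper name.

```bash
# Hypothetical helper: succeeds if the secret's first line looks like a
# PEM or OpenSSH private key header. The key itself is never printed.
secret_has_private_key() {
  local secret_id="$1"
  local first_line
  first_line=$(aws secretsmanager get-secret-value \
    --secret-id "${secret_id}" \
    --query 'SecretString' \
    --output text 2>/dev/null | head -n 1)
  # An empty first_line (for example, when the call fails) falls
  # through to the non-matching branch.
  case "${first_line}" in
    "-----BEGIN "*"PRIVATE KEY-----") return 0 ;;
    *) return 1 ;;
  esac
}
```

For example: `secret_has_private_key batch-p6-ssh-key && echo "SSH key stored"`.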
109+
110+
### Testing Your Deployment
111+
112+
Submit a multi-node NCCL test job to verify the setup:
113+
114+
```bash
115+
# Retrieve stack outputs
116+
JOB_DEFINITION=$(aws cloudformation describe-stacks \
117+
--stack-name batch-p6 \
118+
--query 'Stacks[0].Outputs[?OutputKey==`JobDefinitionMultiInstance`].OutputValue' \
119+
--output text)
120+
121+
JOB_QUEUE=$(aws cloudformation describe-stacks \
122+
--stack-name batch-p6 \
123+
--query 'Stacks[0].Outputs[?OutputKey==`DistributedDeepLearningJQ`].OutputValue' \
124+
--output text)
125+
126+
# Submit test job
127+
aws batch submit-job \
128+
--job-name nccl-test-2node \
129+
--job-queue ${JOB_QUEUE} \
130+
--job-definition ${JOB_DEFINITION} \
131+
--node-overrides numNodes=2
132+
133+
# Monitor job status
134+
aws batch describe-jobs --jobs <job-id>
135+
136+
# View logs
137+
aws logs tail /aws/batch/job --follow
138+
```
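
The `<job-id>` placeholder above can be filled in by capturing the `jobId` field that `aws batch submit-job` returns, for example by adding `--query 'jobId' --output text` to the submit command and storing the result in a variable. The sketch below then polls the job until it reaches a terminal state; `wait_for_batch_job` and the 30-second default interval are assumptions for illustration.

```bash
# Hypothetical helper: poll a Batch job until it succeeds or fails.
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if the job cannot be described.
wait_for_batch_job() {
  local job_id="$1"
  local interval="${2:-30}"   # seconds between polls
  local status
  while true; do
    status=$(aws batch describe-jobs \
      --jobs "${job_id}" \
      --query 'jobs[0].status' \
      --output text) || return 2
    echo "Job ${job_id}: ${status}"
    case "${status}" in
      SUCCEEDED) return 0 ;;
      FAILED) return 1 ;;
      None|"") return 2 ;;    # job ID not found
    esac
    sleep "${interval}"
  done
}
```

With the job ID stored in `JOB_ID`, `wait_for_batch_job "${JOB_ID}"` prints each observed status and returns once the job finishes.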
139+
140+
### P6 Architecture Details
141+
142+
- **Container Image**: `public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4`
143+
- **Network Configuration**: 8 EFA interfaces per instance
144+
- **SSH Setup**: Automated via inline bash script in Job Definition
145+
- **Default Test**: `all_reduce_perf` with 8 GPUs per node (16 total processes for 2-node job)
146+
- **Key Management**: SSH keys retrieved from Secrets Manager at container startup
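
The process count in the default test follows from one process per GPU, which is an assumption consistent with the "16 total processes for 2-node job" figure above. A small sketch of that arithmetic, useful when changing `numNodes`; `total_ranks` is a hypothetical helper name.

```bash
# Hypothetical helper: total process count for a multi-node job,
# assuming one process per GPU and 8 GPUs per node by default.
total_ranks() {
  local num_nodes="$1"
  local gpus_per_node="${2:-8}"
  echo $(( num_nodes * gpus_per_node ))
}
```

For example, `total_ranks 2` prints `16`, and `total_ranks 4` prints `32`.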
147+
148+
## Important Considerations
149+
150+
### EFA Network Configuration
151+
152+
- EFA interfaces must be explicitly declared in the EC2 Launch Template
153+
- The EFA security group must be provided and properly configured
154+
- Network performance is critical for distributed training workloads
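
When adapting a Launch Template to another instance type, the number of EFA interfaces it supports can be looked up with `aws ec2 describe-instance-types`. The sketch below assumes the AWS CLI is configured for the target region; `max_efa_interfaces` is a hypothetical helper name.

```bash
# Hypothetical helper: print the maximum number of EFA interfaces that
# an instance type supports, as reported by the EC2 API.
max_efa_interfaces() {
  local instance_type="$1"
  aws ec2 describe-instance-types \
    --instance-types "${instance_type}" \
    --query 'InstanceTypes[0].NetworkInfo.EfaInfo.MaximumEfaInterfaces' \
    --output text
}
```

For example, `max_efa_interfaces p4d.24xlarge` prints the value to compare against the number of interfaces declared in the Launch Template.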
155+
156+
### VPC Dependencies
157+
158+
- The Compute Environment retrieves private subnet information from the VPC template
159+
- Ensure the VPC template exports the required subnet and security group values
160+
- Both templates must be deployed in the same AWS region
161+
162+
### Capacity Management
163+
164+
- Use Capacity Reservations for guaranteed instance availability
165+
- Consider using Capacity Reservation Resource Groups for easier management
166+
- Monitor your EC2 limits and request increases if needed
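
Before submitting jobs, a Capacity Reservation can be checked for state and remaining capacity. This is a sketch that assumes a reservation ID such as the `cr-1234567890` placeholder used earlier; `capacity_reservation_available` is a hypothetical helper name.

```bash
# Hypothetical helper: succeeds if the reservation is active and has at
# least one available instance.
capacity_reservation_available() {
  local cr_id="$1"
  local out
  out=$(aws ec2 describe-capacity-reservations \
    --capacity-reservation-ids "${cr_id}" \
    --query 'CapacityReservations[0].[State,AvailableInstanceCount]' \
    --output text) || return 1
  # Text output is whitespace-separated: "<state> <available-count>".
  # Unquoted expansion splits it into positional parameters.
  set -- ${out}
  [ "${1:-}" = "active" ] && [ "${2:-0}" -gt 0 ] 2>/dev/null
}
```

For example: `capacity_reservation_available cr-1234567890 || echo "No capacity available"`.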
53167

54-
There are a few things to know as you evaluate this architecture:
55-
- EFA interfaces need to be declared explicitly in the EC2 Launch Template and you need to provide the security group used for EFA.
56-
- The Compute Environment must retrieve the list of private subnets from the VPC template. This list is exported by the VPC template.
57-
- The Batch Job Definition assumes you are pushing a container with `stress-ng` and is pre-configured as such.
168+
---
58169

59-
## Architecture Diagram
170+
## Additional Resources
60171

61-
<img src="../../0.docs/batch-arch.png" width="500">
172+
- [AWS Batch User Guide](https://docs.aws.amazon.com/batch/latest/userguide/)
173+
- [Multi-node Parallel Jobs](https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html)
174+
- [EFA Documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html)
