|
1 | | -# AWS Batch distributed training architectures |
| 1 | +# AWS Batch Distributed Training Architectures |
2 | 2 |
|
3 | | -This architecture serves as an example to run distributed training jobs on p4d.24xlarge instances but can be easily be modified to accommodate other instance kinds (Trn or other P instances). |
| 3 | +This repository provides CloudFormation templates and examples for running distributed training jobs on AWS Batch using GPU instances. The architecture can be adapted to other instance types, including Trainium (Trn) and other P-series instances. |
4 | 4 |
|
5 | | -> **Important**: it is assumed that you deployed the VPC template [`2.vpc-one-az.yaml`](../0.vpc_network/2.vpc-oneaz.yaml) as our Batch template will fetch automatically the EFA Security Group ID (SG) and Subnet ID to setup the AWS Batch Compute Environment. Both the SG and Subnet are exported values from the VPC template. |
| 5 | +## Table of Contents |
6 | 6 |
|
7 | | -This architecture consists of the following resources: |
| 7 | +- [Prerequisites](#prerequisites) |
| 8 | +- [Architecture Overview](#architecture-overview) |
| 9 | +- [Available Templates](#available-templates) |
| 10 | +- [P4 Instance Deployment](#p4-instance-deployment) |
| 11 | +- [P5 Instance Deployment](#p5-instance-deployment) |
| 12 | +- [P6 Instance Deployment](#p6-instance-deployment) |
| 13 | +- [Important Considerations](#important-considerations) |
8 | 14 |
|
9 | | -- [AWS Batch Compute Environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) for [Multi-node parallel jobs](https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html). It is similar to a compute cluster. |
10 | | -- [AWS Batch Job Queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) attached to the compute environment. It is similar to a queue for job schedulers (Slurm, LSF...). |
11 | | -- [EC2 Launch Template](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html) which used to setup 4 EFA cards on our instance. |
12 | | -- [Job Definition](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) serves as a template for our jobs and refers to the container registry to pull containers |
13 | | -- [ECR Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) is used to store containers. |
| 15 | +## Prerequisites |
14 | 16 |
|
15 | | -## Template |
| 17 | +> **⚠️ Important**: You must deploy the VPC template [`2.vpc-one-az.yaml`](../../1.architectures/1.vpc_network/2.vpc-one-az.yaml) before deploying any Batch template. The Batch templates automatically fetch the EFA Security Group ID and Subnet ID from the VPC template's exported values. |
16 | 18 |
|
17 | | -This template deploys AWS Batch and EC2 resources. It can be deployed via the console and the AWS CLI. Regardless of the deployment method it is assumed that you deployed the VPC template [`2.vpc-one-az.yaml`](../0.vpc_network/2.vpc-oneaz.yaml) prior to deploying that one. |
| 19 | +## Architecture Overview |
18 | 20 |
|
19 | | -- **Template file**: [`0.aws-batch-distributed-training.yaml`](./0.aws-batch-distributed-training.yaml) |
| 21 | +This architecture consists of the following AWS resources: |
20 | 22 |
|
21 | | -### Quick Create |
| 23 | +| Component | Purpose | Documentation | |
| 24 | +|-----------|---------|---------------| |
| 25 | +| **AWS Batch Compute Environment** | Manages compute resources for multi-node parallel jobs (similar to a compute cluster) | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) | |
| 26 | +| **AWS Batch Job Queue** | Queues jobs for execution (similar to a queue in schedulers such as Slurm or LSF) | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) |
| 27 | +| **EC2 Launch Template** | Configures EFA network interfaces for high-performance networking | [AWS Docs](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html) | |
| 28 | +| **Job Definition** | Template for job execution, references container images | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) | |
| 29 | +| **ECR Container Registry** | Stores Docker container images | [AWS Docs](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) | |
22 | 30 |
|
23 | | -[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/0.aws-batch-distributed-training.yaml&stackName=AWS-Batch) |
| 31 | +<img src="../../0.docs/batch-arch.png" width="600" alt="AWS Batch Architecture Diagram"> |
24 | 32 |
|
| 33 | +## Available Templates |
25 | 34 |
|
26 | | -## List of Parameters |
| 35 | +| Template | Instance Types | Features | |
| 36 | +|----------|----------------|----------| |
| 37 | +| [`0.aws-batch-distributed-training.yaml`](./0.aws-batch-distributed-training.yaml) | p4d.24xlarge (default) | Standard deployment with 4 EFA interfaces |
| 38 | +| [`0.aws-batch-distributed-training-p5.yaml`](./0.aws-batch-distributed-training-p5.yaml) | p5.48xlarge | Optimized for P5 instances |
| 39 | +| [`aws-batch-distributed-training-p6.yaml`](./aws-batch-distributed-training-p6.yaml) | p6-b200.48xlarge | P6 deployment with sample AWS Batch MNP job setup |
27 | 40 |
|
28 | | -The templates takes parameters that are mandatory and optional, see below for more details. |
| 41 | +## P4 Instance Deployment |
29 | 42 |
|
30 | | -| Name | Type | Details | |
31 | | -|-------------------------|-------------|-----------------------------------------------------------------------| |
32 | | -| `VPCStackParameter` | Required | Name of the VPC stack in CloudFormation. | |
33 | | -| `AMI` | Optional | ID of the AMI if using a custom one otherwise leave blank | |
34 | | -| `CapacityReservationId` | Optional | Use that or the ResourceGroup to refer to an EC2 reservation | |
35 | | -| `CapacityReservationResourceGroup` | Optional | Use that or the CapacityReservationId. | |
36 | | -| `EC2KeyPair` | Optional | EC2 key pair to use in case you want to connect through ssh for debug.| |
37 | | -| `CreatePlacementGroup` | Optional | Create a placement group for the instances. | |
| 43 | +### Quick Deploy |
38 | 44 |
|
| 45 | +Deploy the standard template with one click: |
39 | 46 |
|
40 | | -## Deploy with the AWS CLI |
| 47 | +[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/0.aws-batch-distributed-training.yaml&stackName=AWS-Batch) |
41 | 48 |
|
42 | | -If you'd like to deploy through the AWS CLI instead of the quick create link above, the command to deploy the template is shown below. Please edit the parameters values with your own configuration. |
| 49 | +### Parameters |
| 50 | + |
| 51 | +| Parameter | Type | Description | |
| 52 | +|-----------|------|-------------| |
| 53 | +| `VPCStackParameter` | **Required** | Name of the VPC CloudFormation stack | |
| 54 | +| `AMI` | Optional | Custom AMI ID (leave blank for default) | |
| 55 | +| `CapacityReservationId` | Optional | EC2 Capacity Reservation ID | |
| 56 | +| `CapacityReservationResourceGroup` | Optional | Alternative to CapacityReservationId | |
| 57 | +| `EC2KeyPair` | Optional | EC2 key pair for SSH debugging | |
| 58 | +| `CreatePlacementGroup` | Optional | Create placement group for instances | |
| 59 | + |
| 60 | +### P5 Instance Deployment |
43 | 61 |
|
44 | 62 | ```bash |
45 | | -aws cloudformation create-stack --stack-name aws-batch-p5 \ |
46 | | - --template-body file://0.aws-batch-distributed-training-p5.yaml \ |
47 | | - --parameters ParameterKey=VPCStackParameter,ParameterValue="aws-batch-vpc" \ |
48 | | - ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \ |
49 | | - --capabilities CAPABILITY_NAMED_IAM |
| 63 | +aws cloudformation create-stack \ |
| 64 | + --stack-name aws-batch-distributed-training \ |
| 65 | + --template-body file://0.aws-batch-distributed-training.yaml \ |
| 66 | + --parameters \ |
| 67 | + ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \ |
| 68 | + ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \ |
| 69 | + --capabilities CAPABILITY_NAMED_IAM |
50 | 70 | ``` |
51 | 71 |
|
52 | | -## Gotchas |
| 72 | +## P6 Instance Deployment |
| 73 | + |
| 74 | +### Template Parameters |
| 75 | + |
| 76 | +| Parameter | Type | Description | |
| 77 | +|-----------|------|-------------| |
| 78 | +| `VPCStackParameter` | **Required** | Name of the VPC CloudFormation stack | |
| 79 | +| `CapacityReservationId` | **Required** | Capacity Reservation ID (e.g., cr-1234567890) | |
| 80 | + |
| 81 | +### Deployment Steps |
| 82 | + |
| 83 | +#### Step 1: Deploy CloudFormation Stack |
| 84 | + |
| 85 | +```bash |
| 86 | +aws cloudformation create-stack \ |
| 87 | + --stack-name batch-p6 \ |
| 88 | + --template-body file://aws-batch-distributed-training-p6.yaml \ |
| 89 | + --parameters \ |
| 90 | + ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \ |
| 91 | + ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \ |
| 92 | + --capabilities CAPABILITY_NAMED_IAM |
| 93 | +``` |
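
Stack creation is asynchronous, so it may help to wait for it to finish before moving on to the next step. The following is a minimal sketch that assumes the `batch-p6` stack name used above; the `wait_for_stack` function name is hypothetical and not part of the templates.

```bash
# Hypothetical helper (not part of the templates): block until a
# CloudFormation stack finishes creating, then print its final status.
wait_for_stack() {
  local stack_name="$1"

  # `wait stack-create-complete` polls the stack and returns non-zero
  # if creation fails, rolls back, or the waiter times out.
  if ! aws cloudformation wait stack-create-complete --stack-name "${stack_name}"; then
    echo "Stack ${stack_name} did not reach CREATE_COMPLETE" >&2
    return 1
  fi

  # Print the final status for confirmation.
  aws cloudformation describe-stacks \
    --stack-name "${stack_name}" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Example usage (assumes the stack name from Step 1):
#   wait_for_stack batch-p6
```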
| 94 | + |
| 95 | +#### Step 2: Generate and Store SSH Key |
| 96 | + |
| 97 | +```bash |
| 98 | +# Generate SSH key pair |
| 99 | +ssh-keygen -t rsa -b 2048 -N '' -f /tmp/batch_key |
| 100 | + |
| 101 | +# Store private key in Secrets Manager |
| 102 | +aws secretsmanager put-secret-value \ |
| 103 | + --secret-id batch-p6-ssh-key \ |
| 104 | + --secret-string file:///tmp/batch_key |
| 105 | + |
| 106 | +# Clean up temporary files |
| 107 | +rm /tmp/batch_key /tmp/batch_key.pub |
| 108 | +``` |
| 109 | + |
| 110 | +### Testing Your Deployment |
| 111 | + |
| 112 | +Submit a multi-node NCCL test job to verify the setup: |
| 113 | + |
| 114 | +```bash |
| 115 | +# Retrieve stack outputs |
| 116 | +JOB_DEFINITION=$(aws cloudformation describe-stacks \ |
| 117 | + --stack-name batch-p6 \ |
| 118 | + --query 'Stacks[0].Outputs[?OutputKey==`JobDefinitionMultiInstance`].OutputValue' \ |
| 119 | + --output text) |
| 120 | + |
| 121 | +JOB_QUEUE=$(aws cloudformation describe-stacks \ |
| 122 | + --stack-name batch-p6 \ |
| 123 | + --query 'Stacks[0].Outputs[?OutputKey==`DistributedDeepLearningJQ`].OutputValue' \ |
| 124 | + --output text) |
| 125 | + |
| 126 | +# Submit test job |
| 127 | +aws batch submit-job \ |
| 128 | + --job-name nccl-test-2node \ |
| 129 | + --job-queue "${JOB_QUEUE}" \
| 130 | + --job-definition "${JOB_DEFINITION}" \
| 131 | + --node-overrides numNodes=2 |
| 132 | + |
| 133 | +# Monitor job status |
| 134 | +aws batch describe-jobs --jobs <job-id> |
| 135 | + |
| 136 | +# View logs |
| 137 | +aws logs tail /aws/batch/job --follow |
| 138 | +``` |
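
The job ID used by `describe-jobs` can be captured from the `submit-job` response rather than copied by hand. The sketch below shows one possible way to do that and to poll until the job finishes. The helper names `is_terminal_status` and `wait_for_job` and the 30-second default interval are assumptions for illustration, not part of the templates.

```bash
# Hypothetical helpers (not part of the templates).

# Succeeds if the given AWS Batch job status is terminal.
is_terminal_status() {
  case "$1" in
    SUCCEEDED|FAILED) return 0 ;;
    *) return 1 ;;
  esac
}

# Polls a job until it reaches a terminal status.
# Returns 0 only if the job ends in SUCCEEDED.
wait_for_job() {
  local job_id="$1"
  local interval="${2:-30}"   # seconds between polls (assumed default)
  local status

  while true; do
    status=$(aws batch describe-jobs \
      --jobs "${job_id}" \
      --query 'jobs[0].status' \
      --output text)
    echo "Job ${job_id}: ${status}"
    if is_terminal_status "${status}"; then
      break
    fi
    sleep "${interval}"
  done

  [ "${status}" = "SUCCEEDED" ]
}

# Example usage, reusing JOB_QUEUE and JOB_DEFINITION from above:
#   JOB_ID=$(aws batch submit-job \
#     --job-name nccl-test-2node \
#     --job-queue "${JOB_QUEUE}" \
#     --job-definition "${JOB_DEFINITION}" \
#     --node-overrides numNodes=2 \
#     --query 'jobId' --output text)
#   wait_for_job "${JOB_ID}"
```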
| 139 | + |
| 140 | +### P6 Architecture Details |
| 141 | + |
| 142 | +- **Container Image**: `public.ecr.aws/hpc-cloud/nccl-tests:latest` |
| 143 | +- **Network Configuration**: 8 EFA interfaces per instance |
| 144 | +- **SSH Setup**: Automated via inline bash script in Job Definition |
| 145 | +- **Default Test**: `all_reduce_perf` with 8 GPUs per node (16 total processes for a 2-node job)
| 146 | +- **Key Management**: SSH keys retrieved from Secrets Manager at container startup |
| 147 | + |
| 148 | +## Important Considerations |
| 149 | + |
| 150 | +### EFA Network Configuration |
| 151 | + |
| 152 | +- EFA interfaces must be explicitly declared in the EC2 Launch Template |
| 153 | +- The EFA security group must be provided and properly configured |
| 154 | +- Network performance is critical for distributed training workloads |
| 155 | + |
| 156 | +### VPC Dependencies |
| 157 | + |
| 158 | +- The Compute Environment retrieves private subnet information from the VPC template |
| 159 | +- Ensure the VPC template exports the required subnet and security group values |
| 160 | +- Both templates must be deployed in the same AWS account and Region, because CloudFormation exports are only visible within a single account and Region
| 161 | + |
| 162 | +### Capacity Management |
| 163 | + |
| 164 | +- Use Capacity Reservations for guaranteed instance availability |
| 165 | +- Consider using Capacity Reservation Resource Groups for easier management |
| 166 | +- Monitor your EC2 service quotas and request increases if needed
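
Before passing a Capacity Reservation ID to a template, it can be useful to confirm that the reservation is active and still has instances available. The sketch below is one way to check this; the `describe_reservation` function name is hypothetical, and `cr-1234567890` is the placeholder ID used elsewhere in this document. Because a Capacity Reservation is tied to a single Availability Zone, it may also be worth comparing the reported zone with the subnet created by the VPC template.

```bash
# Hypothetical helper (not part of the templates): print the instance type,
# Availability Zone, available and total instance counts, and state of a
# Capacity Reservation as one tab-separated line.
describe_reservation() {
  local reservation_id="$1"

  aws ec2 describe-capacity-reservations \
    --capacity-reservation-ids "${reservation_id}" \
    --query 'CapacityReservations[0].[InstanceType,AvailabilityZone,AvailableInstanceCount,TotalInstanceCount,State]' \
    --output text
}

# Example usage (placeholder ID):
#   describe_reservation cr-1234567890
```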
53 | 167 |
|
54 | | -There are a few things to know as you evaluate this architecture: |
55 | | -- EFA interfaces need to be declared explicitly in the EC2 Launch Template and you need to provide the security group used for EFA. |
56 | | -- The Compute Environment must retrieve the list of private subnets from the VPC template. This list is exported by the VPC template. |
57 | | -- The Batch Job Definition assumes you are pushing a container with `stress-ng` and is pre-configured as such. |
| 168 | +--- |
58 | 169 |
|
59 | | -## Architecture Diagram |
| 170 | +## Additional Resources |
60 | 171 |
|
61 | | -<img src="../../0.docs/batch-arch.png" width="500"> |
| 172 | +- [AWS Batch User Guide](https://docs.aws.amazon.com/batch/latest/userguide/) |
| 173 | +- [Multi-node Parallel Jobs](https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html) |
| 174 | +- [EFA Documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) |