diff --git a/1.architectures/3.aws-batch/README.md b/1.architectures/3.aws-batch/README.md index af494aee5..78b3ac563 100644 --- a/1.architectures/3.aws-batch/README.md +++ b/1.architectures/3.aws-batch/README.md @@ -1,61 +1,174 @@ -# AWS Batch distributed training architectures +# AWS Batch Distributed Training Architectures -This architecture serves as an example to run distributed training jobs on p4d.24xlarge instances but can be easily be modified to accommodate other instance kinds (Trn or other P instances). +This repository provides CloudFormation templates and examples for running distributed training jobs on AWS Batch using GPU instances. The architecture can be easily modified to accommodate different instance types including Trainium (Trn) and other P-series instances. -> **Important**: it is assumed that you deployed the VPC template [`2.vpc-one-az.yaml`](../0.vpc_network/2.vpc-oneaz.yaml) as our Batch template will fetch automatically the EFA Security Group ID (SG) and Subnet ID to setup the AWS Batch Compute Environment. Both the SG and Subnet are exported values from the VPC template. +## Table of Contents -This architecture consists of the following resources: +- [Prerequisites](#prerequisites) +- [Architecture Overview](#architecture-overview) +- [Available Templates](#available-templates) +- [P4 Instance Deployment](#p4-instance-deployment) +- [P5 Instance Deployment](#p5-instance-deployment) +- [P6 Instance Deployment](#p6-instance-deployment) +- [Important Considerations](#important-considerations) -- [AWS Batch Compute Environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) for [Multi-node parallel jobs](https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html). It is similar to a compute cluster. -- [AWS Batch Job Queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) attached to the compute environment. It is similar to a queue for job schedulers (Slurm, LSF...). -- [EC2 Launch Template](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html) which used to setup 4 EFA cards on our instance. -- [Job Definition](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) serves as a template for our jobs and refers to the container registry to pull containers -- [ECR Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) is used to store containers. +## Prerequisites -## Template +> **⚠️ Important**: You must deploy the VPC template [`2.vpc-one-az.yaml`](../../1.architectures/1.vpc_network/2.vpc-one-az.yaml) before deploying any Batch template. The Batch templates automatically fetch the EFA Security Group ID and Subnet ID from the VPC template's exported values. -This template deploys AWS Batch and EC2 resources. It can be deployed via the console and the AWS CLI. Regardless of the deployment method it is assumed that you deployed the VPC template [`2.vpc-one-az.yaml`](../0.vpc_network/2.vpc-oneaz.yaml) prior to deploying that one. 
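+
+Before deploying a Batch template, you can verify that the VPC stack's exports are in place. The check below is a minimal sketch using the AWS CLI: `your-vpc-stack-name` is the same placeholder used in the deployment examples further down, and the `-SecurityGroup` / `-PrivateSubnet` export suffixes are the names imported by the P6 template in this directory.
+
+```bash
+# List the values exported by the VPC stack. The Batch templates import these
+# by name via Fn::ImportValue, e.g. <vpc-stack>-SecurityGroup (EFA security
+# group) and <vpc-stack>-PrivateSubnet (private subnets).
+aws cloudformation list-exports \
+  --query "Exports[?starts_with(Name, 'your-vpc-stack-name')].[Name,Value]" \
+  --output table
+```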
+## Architecture Overview -- **Template file**: [`0.aws-batch-distributed-training.yaml`](./0.aws-batch-distributed-training.yaml) +This architecture consists of the following AWS resources: -### Quick Create +| Component | Purpose | Documentation | +|-----------|---------|---------------| +| **AWS Batch Compute Environment** | Manages compute resources for multi-node parallel jobs (similar to a compute cluster) | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) | +| **AWS Batch Job Queue** | Queues jobs for execution (similar to Slurm/LSF schedulers) | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) | +| **EC2 Launch Template** | Configures EFA network interfaces for high-performance networking | [AWS Docs](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html) | +| **Job Definition** | Template for job execution, references container images | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) | +| **ECR Container Registry** | Stores Docker container images | [AWS Docs](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) | -[
 1-Click Deploy 🚀 
](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/0.aws-batch-distributed-training.yaml&stackName=AWS-Batch) +AWS Batch Architecture Diagram +## Available Templates -## List of Parameters +| Template | Instance Types | Features | +|----------|----------------|----------| +| [`0.aws-batch-distributed-training.yaml`](./0.aws-batch-distributed-training.yaml) | P4d.24xlarge (default) | Standard deployment with 4 EFA interfaces | +| [`0.aws-batch-distributed-training-p5.yaml`](./0.aws-batch-distributed-training-p5.yaml) | P5.48xlarge | Optimized for P5 instances | +| [`aws-batch-distributed-training-p6.yaml`](./aws-batch-distributed-training-p6.yaml) | P6-b200.48xlarge | P6 deployment with sample AWS Batch MNP job setup | -The templates takes parameters that are mandatory and optional, see below for more details. +## P4 Instance Deployment -| Name | Type | Details | -|-------------------------|-------------|-----------------------------------------------------------------------| -| `VPCStackParameter` | Required | Name of the VPC stack in CloudFormation. | -| `AMI` | Optional | ID of the AMI if using a custom one otherwise leave blank | -| `CapacityReservationId` | Optional | Use that or the ResourceGroup to refer to an EC2 reservation | -| `CapacityReservationResourceGroup` | Optional | Use that or the CapacityReservationId. | -| `EC2KeyPair` | Optional | EC2 key pair to use in case you want to connect through ssh for debug.| -| `CreatePlacementGroup` | Optional | Create a placement group for the instances. | +### Quick Deploy +Deploy the standard template with one click: -## Deploy with the AWS CLI +[
1-Click Deploy 🚀
](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/0.aws-batch-distributed-training.yaml&stackName=AWS-Batch) -If you'd like to deploy through the AWS CLI instead of the quick create link above, the command to deploy the template is shown below. Please edit the parameters values with your own configuration. +### Parameters + +| Parameter | Type | Description | +|-----------|------|-------------| +| `VPCStackParameter` | **Required** | Name of the VPC CloudFormation stack | +| `AMI` | Optional | Custom AMI ID (leave blank for default) | +| `CapacityReservationId` | Optional | EC2 Capacity Reservation ID | +| `CapacityReservationResourceGroup` | Optional | Alternative to CapacityReservationId | +| `EC2KeyPair` | Optional | EC2 key pair for SSH debugging | +| `CreatePlacementGroup` | Optional | Create placement group for instances | + +### P5 Instance Deployment ```bash -aws cloudformation create-stack --stack-name aws-batch-p5 \ - --template-body file://0.aws-batch-distributed-training-p5.yaml \ - --parameters ParameterKey=VPCStackParameter,ParameterValue="aws-batch-vpc" \ - ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \ - --capabilities CAPABILITY_NAMED_IAM +aws cloudformation create-stack \ + --stack-name aws-batch-distributed-training \ + --template-body file://0.aws-batch-distributed-training.yaml \ + --parameters \ + ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \ + ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \ + --capabilities CAPABILITY_NAMED_IAM ``` -## Gotchas +## P6 Instance Deployment + +### Template Parameters + +| Parameter | Type | Description | +|-----------|------|-------------| +| `VPCStackParameter` | **Required** | Name of the VPC CloudFormation stack | +| `CapacityReservationId` | **Required** | Capacity Reservation ID (e.g., cr-1234567890) | + +### Deployment Steps + +#### Step 1: Deploy CloudFormation Stack + +```bash +aws cloudformation create-stack \ + --stack-name batch-p6 \ + --template-body file://aws-batch-distributed-training-p6.yaml \ + --parameters \ + ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \ + ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \ + --capabilities CAPABILITY_NAMED_IAM +``` + +#### Step 2: Generate and Store SSH Key + +```bash +# Generate SSH key pair +ssh-keygen -t rsa -b 2048 -N '' -f /tmp/batch_key + +# Store private key in Secrets Manager +aws secretsmanager put-secret-value \ + --secret-id batch-p6-ssh-key \ + --secret-string file:///tmp/batch_key + +# Clean up temporary files +rm /tmp/batch_key /tmp/batch_key.pub +``` + +### Testing Your Deployment + +Submit a multi-node NCCL test job to verify the setup: + +```bash +# Retrieve stack outputs +JOB_DEFINITION=$(aws cloudformation describe-stacks \ + --stack-name batch-p6 \ + --query 'Stacks[0].Outputs[?OutputKey==`JobDefinitionMultiInstance`].OutputValue' \ + --output text) + +JOB_QUEUE=$(aws cloudformation describe-stacks \ + --stack-name batch-p6 \ + --query 'Stacks[0].Outputs[?OutputKey==`DistributedDeepLearningJQ`].OutputValue' \ + --output text) + +# Submit test job +aws batch submit-job \ + --job-name nccl-test-2node \ + --job-queue ${JOB_QUEUE} \ + --job-definition ${JOB_DEFINITION} \ + --node-overrides numNodes=2 + +# Monitor job status +aws batch describe-jobs --jobs + +# View logs +aws logs tail /aws/batch/job --follow +``` + +### P6 Architecture Details + +- **Container Image**: 
`public.ecr.aws/hpc-cloud/nccl-tests:latest` +- **Network Configuration**: 8 EFA interfaces per instance +- **SSH Setup**: Automated via inline bash script in Job Definition +- **Default Test**: `all_reduce_perf` with 8 GPUs per node (16 total processes for 2-node job) +- **Key Management**: SSH keys retrieved from Secrets Manager at container startup + +## Important Considerations + +### EFA Network Configuration + +- EFA interfaces must be explicitly declared in the EC2 Launch Template +- The EFA security group must be provided and properly configured +- Network performance is critical for distributed training workloads + +### VPC Dependencies + +- The Compute Environment retrieves private subnet information from the VPC template +- Ensure the VPC template exports the required subnet and security group values +- Both templates must be deployed in the same AWS region + +### Capacity Management + +- Use Capacity Reservations for guaranteed instance availability +- Consider using Capacity Reservation Resource Groups for easier management +- Monitor your EC2 limits and request increases if needed -There are a few things to know as you evaluate this architecture: -- EFA interfaces need to be declared explicitly in the EC2 Launch Template and you need to provide the security group used for EFA. -- The Compute Environment must retrieve the list of private subnets from the VPC template. This list is exported by the VPC template. -- The Batch Job Definition assumes you are pushing a container with `stress-ng` and is pre-configured as such. +--- -## Architecture Diagram +## Additional Resources - +- [AWS Batch User Guide](https://docs.aws.amazon.com/batch/latest/userguide/) +- [Multi-node Parallel Jobs](https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html) +- [EFA Documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) diff --git a/1.architectures/3.aws-batch/aws-batch-distributed-training-p6.yaml b/1.architectures/3.aws-batch/aws-batch-distributed-training-p6.yaml new file mode 100644 index 000000000..f9eaf89c7 --- /dev/null +++ b/1.architectures/3.aws-batch/aws-batch-distributed-training-p6.yaml @@ -0,0 +1,469 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# AWS Batch setup for P6 distributed training with multi-node NCCL tests +# Author: yusongw@ +# See README.md for detailed usage instructions +# +AWSTemplateFormatVersion: '2010-09-09' +Description: AWS Batch setup for P6 distributed training with multi-node NCCL tests. Simplified deployment with inline container setup. 
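+#
+# Prerequisites (see README): the stack named in VPCStackParameter must export
+#   <stack-name>-SecurityGroup   - EFA-enabled security group, imported by the launch template
+#   <stack-name>-PrivateSubnet   - private subnet list, imported by the compute environment
+# CapacityReservationId should reference an active p6-b200.48xlarge reservation;
+# the launch template targets it with the capacity-block market option.
+# After stack creation, populate the <stack-name>-ssh-key secret in Secrets Manager
+# (see "Generate and Store SSH Key" in the README) before submitting jobs.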
+ +Parameters: + VPCStackParameter: + Type: String + Description: Private subnets will be retrieved for the compute environment + Default: 'aws-batch-vpc' + CapacityReservationId: + Type: String + Description: Capacity Reservation ID (e.g., cr-1234567890) + Default: '' + +Metadata: + AWS::CloudFormation::Interface: + ParameterGroups: + - Label: + default: General configuration + Parameters: + - VPCStackParameter + ParameterLabels: + VPCStackParameter: + default: Name of the VPC Stack + +Resources: + ################### + ## KMS Key for Secrets Manager ## + ################### + SecretsKMSKey: + Type: AWS::KMS::Key + Properties: + Description: KMS key for encrypting Batch SSH secrets + EnableKeyRotation: true + KeyPolicy: + Version: '2012-10-17' + Statement: + - Sid: Enable IAM User Permissions + Effect: Allow + Principal: + AWS: !Sub 'arn:aws:iam::${AWS::AccountId}:root' + Action: 'kms:*' + Resource: '*' + - Sid: Allow Secrets Manager to use the key + Effect: Allow + Principal: + Service: secretsmanager.amazonaws.com + Action: + - 'kms:Decrypt' + - 'kms:GenerateDataKey' + Resource: '*' + + SecretsKMSKeyAlias: + Type: AWS::KMS::Alias + Properties: + AliasName: !Sub 'alias/${AWS::StackName}-secrets' + TargetKeyId: !Ref SecretsKMSKey + + ################### + ## EC2 Resources ## + ################### + DistributedDeepLearningLT: + Type: AWS::EC2::LaunchTemplate + Properties: + LaunchTemplateData: + MetadataOptions: + HttpTokens: required # Enforce IMDSv2 + HttpEndpoint: enabled + HttpPutResponseHopLimit: 1 + InstanceMarketOptions: + MarketType: "capacity-block" + CapacityReservationSpecification: + CapacityReservationTarget: + CapacityReservationId: !Ref CapacityReservationId + NetworkInterfaces: + - Description: EFA Interface + Groups: !Split + - ',' + - Fn::ImportValue: !Sub ${VPCStackParameter}-SecurityGroup + NetworkCardIndex: 0 + DeviceIndex: 0 + DeleteOnTermination: true + InterfaceType: efa + - Description: EFA Interface + Groups: !Split + - ',' + - Fn::ImportValue: !Sub ${VPCStackParameter}-SecurityGroup + NetworkCardIndex: 1 + DeviceIndex: 0 + DeleteOnTermination: true + InterfaceType: efa + - Description: EFA Interface + Groups: !Split + - ',' + - Fn::ImportValue: !Sub ${VPCStackParameter}-SecurityGroup + NetworkCardIndex: 2 + DeviceIndex: 0 + DeleteOnTermination: true + InterfaceType: efa + - Description: EFA Interface + Groups: !Split + - ',' + - Fn::ImportValue: !Sub ${VPCStackParameter}-SecurityGroup + NetworkCardIndex: 3 + DeviceIndex: 0 + DeleteOnTermination: true + InterfaceType: efa + - Description: EFA Interface + Groups: !Split + - ',' + - Fn::ImportValue: !Sub ${VPCStackParameter}-SecurityGroup + NetworkCardIndex: 4 + DeviceIndex: 0 + DeleteOnTermination: true + InterfaceType: efa + - Description: EFA Interface + Groups: !Split + - ',' + - Fn::ImportValue: !Sub ${VPCStackParameter}-SecurityGroup + NetworkCardIndex: 5 + DeviceIndex: 0 + DeleteOnTermination: true + InterfaceType: efa + - Description: EFA Interface + Groups: !Split + - ',' + - Fn::ImportValue: !Sub ${VPCStackParameter}-SecurityGroup + NetworkCardIndex: 6 + DeviceIndex: 0 + DeleteOnTermination: true + InterfaceType: efa + - Description: EFA Interface + Groups: !Split + - ',' + - Fn::ImportValue: !Sub ${VPCStackParameter}-SecurityGroup + NetworkCardIndex: 7 + DeviceIndex: 0 + DeleteOnTermination: true + InterfaceType: efa + + ######################## + ## Batch Architecture ## + ######################## + + ## + ## IAM Roles for AWS Batch + ## + BatchInstanceRole: + Type: AWS::IAM::Role + Properties: + 
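+      # Service role assumed by AWS Batch itself (batch.amazonaws.com) to manage
+      # the compute environment; the EC2 instances run under the separate
+      # ECSTaskInstanceProfile defined further below.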
AssumeRolePolicyDocument: + Version: 2012-10-17 + Statement: + - + Effect: Allow + Principal: + Service: + - batch.amazonaws.com + Action: + - sts:AssumeRole + ManagedPolicyArns: + - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole + + SecretsManagerPolicy: + Type: AWS::IAM::ManagedPolicy + Properties: + Description: Policy for reading SSH secrets from Secrets Manager + PolicyDocument: + Version: '2012-10-17' + Statement: + - Effect: Allow + Action: + - secretsmanager:GetSecretValue + Resource: + - !Sub 'arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:${AWS::StackName}-ssh-key-*' + - Effect: Allow + Action: + - kms:Decrypt + Resource: + - !GetAtt SecretsKMSKey.Arn + + ECSTaskServiceRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: 2012-10-17 + Statement: + - + Effect: Allow + Principal: + Service: + - ec2.amazonaws.com + Action: + - sts:AssumeRole + ManagedPolicyArns: + - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role + - !Ref SecretsManagerPolicy + + ECSTaskInstanceProfile: + Type: AWS::IAM::InstanceProfile + Properties: + Path: / + Roles: + - !Ref ECSTaskServiceRole + InstanceProfileName: !Join [ "", [ "ECSTaskInstanceProfileIAM-", !Ref AWS::StackName ] ] + + BatchJobRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: 2012-10-17 + Statement: + - Effect: Allow + Principal: + Service: + - ecs-tasks.amazonaws.com + Action: + - sts:AssumeRole + ManagedPolicyArns: + - !Ref SecretsManagerPolicy + + ## + ## Secrets Manager - Placeholder for SSH key (populate manually after stack creation) + ## + SSHKeySecret: + Type: AWS::SecretsManager::Secret + Metadata: + cfn_nag: + rules_to_suppress: + - id: W77 + reason: "SSH keys for Batch MNP do not require automatic rotation" + guard: + SuppressedRules: + - SECRETSMANAGER_ROTATION_ENABLED_CHECK + Properties: + Name: !Sub "${AWS::StackName}-ssh-key" + Description: "SSH private key for Batch MNP jobs - populate with: ssh-keygen -t rsa -b 2048 -N '' -f /tmp/batch_key && aws secretsmanager put-secret-value --secret-id --secret-string file:///tmp/batch_key" + SecretString: "PLACEHOLDER - Run the command in Description to generate and upload SSH key" + KmsKeyId: !Ref SecretsKMSKey + + ## + ## Compute Environment and Job Definition + ## + DistributedDeepLearningCE: + Type: AWS::Batch::ComputeEnvironment + Properties: + Type: MANAGED + ServiceRole: !Ref BatchInstanceRole + ComputeResources: + AllocationStrategy: BEST_FIT + MaxvCpus: 100000 + DesiredvCpus: 0 + MinvCpus: 0 + Subnets: !Split + - ',' + - Fn::ImportValue: !Sub ${VPCStackParameter}-PrivateSubnet + Type: EC2 + InstanceRole: !Ref ECSTaskInstanceProfile + Ec2Configuration: + - ImageType: ECS_AL2023_NVIDIA + LaunchTemplate: + LaunchTemplateId: !Ref DistributedDeepLearningLT + Version: $Latest + InstanceTypes: + - p6-b200.48xlarge + State: ENABLED + Tags: + Name: Batch Deep Learning + + DistributedDeepLearningJQ: + Type: AWS::Batch::JobQueue + Properties: + JobQueueName: !Sub "${AWS::StackName}-job-queue" + ComputeEnvironmentOrder: + - ComputeEnvironment: !Ref DistributedDeepLearningCE + Order: 1 + Priority: 1 + State: "ENABLED" + + ## + ## AWS Batch Job definition + ## + NCCLTest: + Type: AWS::Batch::JobDefinition + Properties: + JobDefinitionName: !Sub "${AWS::StackName}-nccl-test" + Type: multinode + NodeProperties: + MainNode: 0 + NumNodes: 2 + NodeRangeProperties: + - TargetNodes: '0:' + Container: + # Use specific nccl-tests image with CUDA 12.8.1, EFA 1.42.0, OFI 1.16.0, NCCL 2.27.5 + 
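+          # The inline Command further below implements a simple rendezvous:
+          # every node fetches the shared SSH key from Secrets Manager and starts
+          # sshd; worker nodes ssh to the main node and record their container IP
+          # under /tmp/hosts/<node-index>; the main node waits for all entries,
+          # builds an MPI hostfile, and launches the NCCL all_reduce test with mpirun.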
Image: public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4 + JobRoleArn: !GetAtt BatchJobRole.Arn + # Inline command that sets up SSH and hostfile, then runs NCCL test + Command: + - /bin/bash + - -c + - | + set -euo pipefail + export PATH="/usr/local/bin:/usr/sbin:$PATH" + echo "Node ${AWS_BATCH_JOB_NODE_INDEX}/${AWS_BATCH_JOB_NUM_NODES} starting" + + # Fetch SSH key from Secrets Manager + mkdir -p /root/.ssh + chmod 700 /root/.ssh + /usr/local/bin/aws --region "${AWS_DEFAULT_REGION}" secretsmanager get-secret-value --secret-id "${SSH_SECRET_ARN}" --query SecretString --output text > /root/.ssh/id_rsa + chmod 600 /root/.ssh/id_rsa + ssh-keygen -y -f /root/.ssh/id_rsa > /root/.ssh/id_rsa.pub + cat /root/.ssh/id_rsa.pub > /root/.ssh/authorized_keys + chmod 600 /root/.ssh/authorized_keys + + # Start SSH daemon + mkdir -p /run/sshd + /usr/sbin/sshd || echo "ERROR: sshd failed to start" + sleep 1 + + # Get container IP address (not host IP) - in awsvpc mode, container has its own ENI + MY_IP=$(hostname -i | awk '{print $1}') + + if [ "${AWS_BATCH_JOB_NUM_NODES}" -eq 1 ]; then + # Single node job + exec /opt/amazon/openmpi/bin/mpirun --allow-run-as-root -np "${NCCL_TOTAL_PROCS}" --bind-to none -x PATH -x LD_LIBRARY_PATH -x FI_PROVIDER=efa -x FI_EFA_USE_DEVICE_RDMA=1 -x FI_EFA_FORK_SAFE=1 ${NCCL_TEST_CMD} + fi + + if [ "${AWS_BATCH_JOB_NODE_INDEX}" -eq "${AWS_BATCH_JOB_MAIN_NODE_INDEX}" ]; then + # Main node: register own IP and collect worker IPs + mkdir -p /tmp/hosts + echo "${MY_IP}" > /tmp/hosts/${AWS_BATCH_JOB_NODE_INDEX} + echo "Main node waiting for ${AWS_BATCH_JOB_NUM_NODES} nodes" + + while [ "$(ls /tmp/hosts 2>/dev/null | wc -l)" -lt "${AWS_BATCH_JOB_NUM_NODES}" ]; do + sleep 1 + done + + # Build hostfile + for idx in $(ls /tmp/hosts | sort -n); do + ip="$(cat /tmp/hosts/${idx})" + echo "${ip} slots=${NCCL_PROCS_PER_NODE}" >> /tmp/hostfile + done + + echo "Hostfile:" + cat /tmp/hostfile + + # Launch NCCL test - exclude loopback and bridge interfaces + export NCCL_SOCKET_IFNAME=^lo,docker,ecs + export NCCL_DEBUG=WARN + exec /opt/amazon/openmpi/bin/mpirun --allow-run-as-root --mca btl tcp,self -np "${NCCL_TOTAL_PROCS}" --hostfile /tmp/hostfile --bind-to none -x PATH -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME -x NCCL_NET_PLUGIN=ofi -x FI_PROVIDER=efa -x FI_EFA_USE_DEVICE_RDMA=1 -x FI_EFA_FORK_SAFE=1 -x FI_EFA_ENABLE_SHM_TRANSFER=1 -x NCCL_DEBUG ${NCCL_TEST_CMD} + else + # Worker node: register with main + for i in 1 2 3 4 5; do + if ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=5 -o BatchMode=yes root@"${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS}" "mkdir -p /tmp/hosts && echo '${MY_IP}' > /tmp/hosts/${AWS_BATCH_JOB_NODE_INDEX}"; then + echo "Registered with main node" + break + fi + sleep 1 + done + + # Wait for mpirun + tail -f /dev/null + fi + Environment: + - Name: AWS_DEFAULT_REGION + Value: !Ref AWS::Region + - Name: LD_LIBRARY_PATH + Value: /opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH + - Name: PATH + Value: $PATH:/opt/amazon/efa/bin:/usr/bin + - Name: NCCL_PROCS_PER_NODE + Value: "8" + - Name: NCCL_TOTAL_PROCS + Value: "16" + - Name: NCCL_TEST_CMD + Value: "/opt/nccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 1 -n 100" + - Name: SSH_SECRET_ARN + Value: !Sub 'arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:${AWS::StackName}-ssh-key' + ResourceRequirements: + - Type: VCPU + Value: 192 + - Type: 
GPU + Value: 8 + - Type: MEMORY + Value: 1049000 + Ulimits: + - Name: memlock + HardLimit: -1 + SoftLimit: -1 + - Name: stack + HardLimit: 67108864 + SoftLimit: 67108864 + - Name: nofile + HardLimit: 1024000 + SoftLimit: 1024000 + LinuxParameters: + SharedMemorySize: 49152 + # Expose the first 8 uverbs devices for EFA on p6-b200 + Devices: + - HostPath: /dev/infiniband/uverbs0 + ContainerPath: /dev/infiniband/uverbs0 + Permissions: + - READ + - WRITE + - MKNOD + - HostPath: /dev/infiniband/uverbs1 + ContainerPath: /dev/infiniband/uverbs1 + Permissions: + - READ + - WRITE + - MKNOD + - HostPath: /dev/infiniband/uverbs2 + ContainerPath: /dev/infiniband/uverbs2 + Permissions: + - READ + - WRITE + - MKNOD + - HostPath: /dev/infiniband/uverbs3 + ContainerPath: /dev/infiniband/uverbs3 + Permissions: + - READ + - WRITE + - MKNOD + - HostPath: /dev/infiniband/uverbs4 + ContainerPath: /dev/infiniband/uverbs4 + Permissions: + - READ + - WRITE + - MKNOD + - HostPath: /dev/infiniband/uverbs5 + ContainerPath: /dev/infiniband/uverbs5 + Permissions: + - READ + - WRITE + - MKNOD + - HostPath: /dev/infiniband/uverbs6 + ContainerPath: /dev/infiniband/uverbs6 + Permissions: + - READ + - WRITE + - MKNOD + - HostPath: /dev/infiniband/uverbs7 + ContainerPath: /dev/infiniband/uverbs7 + Permissions: + - READ + - WRITE + - MKNOD + PropagateTags: true + RetryStrategy: + Attempts: 1 + +Outputs: + SSHKeySecretArn: + Description: ARN of the SSH private key secret in Secrets Manager + Value: !Sub 'arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:${AWS::StackName}-ssh-key' + + JobDefinitionMultiInstance: + Description: Job definition for Multi-node Parallel Jobs + Value: !Ref NCCLTest + + DistributedDeepLearningJQ: + Description: Job Queue + Value: !Ref DistributedDeepLearningJQ \ No newline at end of file