Updating CF stack to allow for local zone deployments for GB200 #838
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,6 +13,7 @@ Description: > | |
| 1.2 TB storage which can be overridden by parameter. A role is also created which | ||
| helps to execute HyperPod cluster operations. | ||
|
|
||
| #TODO: DO THIS FOR EKS TOO. | ||
|
|
||
| #################### | ||
| ## Stack Metadata ## | ||
|
|
@@ -31,6 +32,7 @@ Metadata: | |
| Parameters: | ||
| - PrimarySubnetAZ | ||
| - BackupSubnetAZ | ||
| - IsLocalZone | ||
| - Label: | ||
| default: FSx Lustre configuration | ||
| Parameters: | ||
|
|
@@ -59,7 +61,7 @@ Metadata: | |
| SSMDocumentName: | ||
| default: True/False; Create SSM Session Manager Document. Only set to False if SSM-SessionManagerRunShellAsUbuntu document exists in your account. | ||
| PrimarySubnetAZ: | ||
| default: Availability zone id to deploy the primary subnets | ||
| default: Availability zone id to deploy the primary subnets (OR set this to your Local Zone ID if you set IsLocalZone to True. Example use1-dfw2-az1) | ||
| BackupSubnetAZ: | ||
| default: (Optional) Availability zone id to deploy the backup private subnet | ||
| CreateS3Endpoint: | ||
|
|
@@ -175,6 +177,14 @@ Parameters: | |
| Default: 0 | ||
| MinValue: 0 | ||
| MaxValue: 400000 | ||
|
|
||
| IsLocalZone: | ||
| Type: String | ||
| Default: 'false' | ||
| AllowedValues: | ||
| - 'true' | ||
| - 'false' | ||
| Description: Set to true if you are using a local zone for GB200 (DFW only currently). | ||
|
|
||
| ############################### | ||
| ## Conditions for Parameters ## | ||
|
|
@@ -186,6 +196,8 @@ Conditions: | |
| CreateSSMDocument: !Equals [!Ref 'SSMDocumentName', 'true'] | ||
| CreateOpenZFSCondition: !Equals [!Ref 'CreateOpenZFS', 'true'] | ||
| ConfigureCustomIops: !Not [!Equals [!Ref OpenZFSIops, 0]] | ||
| UseNATInstance: !Equals [!Ref IsLocalZone, 'true'] | ||
| UseNATGateway: !Equals [!Ref IsLocalZone, 'false'] | ||
|
|
||
|
|
||
| ########################## | ||
|
|
@@ -279,15 +291,97 @@ Resources: | |
|
|
||
| # Create a NAT GW then add it to the public subnet | ||
| NATGateway: | ||
| Condition: UseNATGateway | ||
| Type: AWS::EC2::NatGateway | ||
| Properties: | ||
| AllocationId: !GetAtt ElasticIP.AllocationId | ||
| SubnetId: !Ref PublicSubnet | ||
|
|
||
| ElasticIP: | ||
| Condition: UseNATGateway | ||
| Type: AWS::EC2::EIP | ||
| Properties: | ||
| Domain: vpc | ||
|
|
||
| # For GB200: Create a NAT Instance for the DFW Local Zone and an EIP | ||
| LocalZoneEIP: | ||
| Condition: UseNATInstance | ||
| Type: AWS::EC2::EIP | ||
| Properties: | ||
| Domain: vpc | ||
| NetworkBorderGroup: !Sub "${AWS::Region}-dfw-2" # TODO: CURRENTLY HARDCODED TO DFW. | ||
|
||
|
|
||
| NATInstance: | ||
|
||
| Condition: UseNATInstance | ||
| Type: AWS::EC2::Instance | ||
| Properties: | ||
| ImageId: ami-00ca32bbc84273381 # TODO: CURRENTLY HARDCODED TO IAD -- create map. | ||
| InstanceType: c6i.2xlarge | ||
| SubnetId: !Ref PublicSubnet | ||
| SecurityGroupIds: [!Ref NATSecurityGroup] | ||
| SourceDestCheck: false | ||
| IamInstanceProfile: !Ref NATInstanceProfile | ||
| UserData: | ||
| Fn::Base64: !Sub | | ||
| #!/bin/bash -ex | ||
| yum update -y | ||
| yum install iptables-services -y | ||
| systemctl enable iptables | ||
| systemctl start iptables | ||
| echo 'net.ipv4.ip_forward=1' > /etc/sysctl.d/custom-ip-forwarding.conf | ||
| sysctl -p /etc/sysctl.d/custom-ip-forwarding.conf | ||
| /sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE | ||
| /sbin/iptables -F FORWARD | ||
| service iptables save | ||
|
|
||
| # Associate EIP and disable source/dest check | ||
| TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") | ||
| INSTANCEID=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id) | ||
| aws --region ${AWS::Region} ec2 associate-address --instance-id $INSTANCEID --allocation-id ${LocalZoneEIP.AllocationId} | ||
|
|
||
| NATSecurityGroup: | ||
| Condition: UseNATInstance | ||
| Type: AWS::EC2::SecurityGroup | ||
| Properties: | ||
| VpcId: !Ref VPC | ||
| GroupDescription: NAT Instance Security Group | ||
| SecurityGroupIngress: | ||
| - CidrIp: 10.0.0.0/8 | ||
| IpProtocol: "-1" | ||
| SecurityGroupEgress: | ||
| - CidrIp: 0.0.0.0/0 | ||
| IpProtocol: "-1" | ||
| Tags: | ||
| - Key: Name | ||
| Value: NATSecurityGroup | ||
|
|
||
| NATInstanceProfile: | ||
| Condition: UseNATInstance | ||
| Type: AWS::IAM::InstanceProfile | ||
| Properties: | ||
| Roles: [!Ref NATInstanceRole] | ||
|
|
||
| NATInstanceRole: | ||
| Condition: UseNATInstance | ||
| Type: AWS::IAM::Role | ||
| Properties: | ||
| AssumeRolePolicyDocument: | ||
| Version: "2012-10-17" | ||
| Statement: | ||
| - Effect: Allow | ||
| Principal: | ||
| Service: ["ec2.amazonaws.com"] | ||
| Action: "sts:AssumeRole" | ||
| Policies: | ||
| - PolicyName: AssociateEIP | ||
| PolicyDocument: | ||
| Version: "2012-10-17" | ||
| Statement: | ||
| - Effect: Allow | ||
| Action: | ||
| - ec2:AssociateAddress | ||
| - ec2:ModifyInstanceAttribute | ||
| Resource: "*" | ||
|
|
||
| # NOTE: when you create additional security groups, you must ensure that every | ||
| # security group has ingress/egress from/to its own security group id. Failure | ||
|
|
@@ -392,7 +486,9 @@ Resources: | |
| Properties: | ||
| RouteTableId: !Ref PrivateRouteTable | ||
| DestinationCidrBlock: 0.0.0.0/0 | ||
| NatGatewayId: !Ref NATGateway | ||
| NatGatewayId: !If [UseNATGateway, !Ref NATGateway, !Ref 'AWS::NoValue'] | ||
| InstanceId: !If [UseNATInstance, !Ref NATInstance, !Ref 'AWS::NoValue'] | ||
|
|
||
|
|
||
| # Associate the public route table to the public subnet | ||
| PublicSubnetRouteTableAssociation: | ||
|
|
@@ -630,4 +726,4 @@ Outputs: | |
| FSxOpenZFSFileSystemDNSname: | ||
| Condition: CreateOpenZFSCondition | ||
| Description: The DNS of the FSxOpenZFS filesystem that has been created | ||
| Value: !GetAtt FSxOpenZFSFileSystem.DNSName | ||
| Value: !GetAtt FSxOpenZFSFileSystem.DNSName | ||
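One way the `ImageId` TODO ("CURRENTLY HARDCODED TO IAD -- create map") might be addressed is a region-keyed mapping. The sketch below is an assumption, not part of this PR: the mapping name `NATInstanceAMI` and the key `AMI` are hypothetical, and only the `us-east-1` (IAD) entry comes from the diff. Other regions would need their own verified AMI IDs.

```yaml
Mappings:
  # Hypothetical mapping name; only the us-east-1 (IAD) AMI ID appears in the diff.
  NATInstanceAMI:
    us-east-1:
      AMI: ami-00ca32bbc84273381
    # Add further regions here once their AMI IDs have been verified.

Resources:
  NATInstance:
    Condition: UseNATInstance
    Type: AWS::EC2::Instance
    Properties:
      # Look up the AMI for the region the stack is deployed in.
      ImageId: !FindInMap [NATInstanceAMI, !Ref 'AWS::Region', AMI]
      InstanceType: c6i.2xlarge
      # Remaining properties stay as they are in the diff.
```

With a mapping like this, deploying with `IsLocalZone` set to `true` in a region that has no entry should fail at lookup time rather than silently launching with an AMI ID that does not exist in that region.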
Remove this. The template is only for Slurm, so the context will not make sense to users.
Yes, this is still in draft mode.
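While the PR is in draft, the `NetworkBorderGroup` TODO ("CURRENTLY HARDCODED TO DFW") could perhaps be handled by passing the border group in as a parameter. This is a minimal sketch under assumptions: the parameter `LocalZoneNetworkBorderGroup` is hypothetical and does not exist in the PR. The value hardcoded in the diff, `${AWS::Region}-dfw-2`, would resolve to `us-east-1-dfw-2` when the stack is deployed in `us-east-1`.

```yaml
Parameters:
  # Hypothetical parameter; not part of the PR as submitted.
  LocalZoneNetworkBorderGroup:
    Type: String
    Default: ''
    Description: Network border group of the Local Zone. Only used when IsLocalZone is true.

Resources:
  LocalZoneEIP:
    Condition: UseNATInstance
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
      # Replaces the hardcoded "${AWS::Region}-dfw-2" value from the diff.
      NetworkBorderGroup: !Ref LocalZoneNetworkBorderGroup
```

Because `LocalZoneEIP` is only created under `UseNATInstance`, the empty default is never used for a standard Availability Zone deployment.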