102 changes: 99 additions & 3 deletions 1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod.yaml
@@ -13,6 +13,7 @@ Description: >
1.2 TB storage which can be overridden by parameter. A role is also created which
helps to execute HyperPod cluster operations.

#TODO: DO THIS FOR EKS TOO.
Contributor: Remove this. The template is only for Slurm, so the note will not make sense to users.

Contributor Author: Yes, this is still in draft mode.


####################
## Stack Metadata ##
@@ -31,6 +32,7 @@ Metadata:
Parameters:
- PrimarySubnetAZ
- BackupSubnetAZ
- IsLocalZone
- Label:
default: FSx Lustre configuration
Parameters:
@@ -59,7 +61,7 @@ Metadata:
SSMDocumentName:
default: True/False; Create SSM Session Manager Document. Only set to False if SSM-SessionManagerRunShellAsUbuntu document exists in your account.
PrimarySubnetAZ:
default: Availability zone id to deploy the primary subnets
default: Availability zone id to deploy the primary subnets (or your Local Zone ID if IsLocalZone is set to true, for example use1-dfw2-az1)
BackupSubnetAZ:
default: (Optional) Availability zone id to deploy the backup private subnet
CreateS3Endpoint:
@@ -175,6 +177,14 @@ Parameters:
Default: 0
MinValue: 0
MaxValue: 400000

IsLocalZone:
Type: String
Default: 'false'
AllowedValues:
- 'true'
- 'false'
Description: Set to true if you are using a local zone for GB200 (DFW only currently).

###############################
## Conditions for Parameters ##
@@ -186,6 +196,8 @@ Conditions:
CreateSSMDocument: !Equals [!Ref 'SSMDocumentName', 'true']
CreateOpenZFSCondition: !Equals [!Ref 'CreateOpenZFS', 'true']
ConfigureCustomIops: !Not [!Equals [!Ref OpenZFSIops, 0]]
UseNATInstance: !Equals [!Ref IsLocalZone, 'true']
UseNATGateway: !Equals [!Ref IsLocalZone, 'false']


##########################
@@ -279,15 +291,97 @@ Resources:

# Create a NAT GW then add it to the public subnet
NATGateway:
Condition: UseNATGateway
Type: AWS::EC2::NatGateway
Properties:
AllocationId: !GetAtt ElasticIP.AllocationId
SubnetId: !Ref PublicSubnet

ElasticIP:
Condition: UseNATGateway
Type: AWS::EC2::EIP
Properties:
Domain: vpc

# For GB200: Create a NAT Instance for the DFW Local Zone and an EIP
LocalZoneEIP:
Condition: UseNATInstance
Type: AWS::EC2::EIP
Properties:
Domain: vpc
NetworkBorderGroup: !Sub "${AWS::Region}-dfw-2" # TODO: CURRENTLY HARDCODED TO DFW.
Contributor: Can we fetch the Local Zone ID from somewhere in the stack, or take it as an input parameter, to make this dynamic?

Contributor Author: Yes. As I said, this is a draft at the moment. I will test with hardcoded values and make it dynamic later.
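
One way to make the border group dynamic, as asked in this thread, is to take it as a stack parameter. The fragment below is a minimal sketch, not the PR's implementation: `LocalZoneNetworkBorderGroup` is a hypothetical parameter name, and its default only mirrors the value the hardcoded `!Sub` expression produces in us-east-1.

```yaml
Parameters:
  # Hypothetical parameter (not in the PR): border group of the target Local Zone.
  LocalZoneNetworkBorderGroup:
    Type: String
    Default: us-east-1-dfw-2
    Description: Network border group of the Local Zone. Used only when IsLocalZone is true.

Resources:
  LocalZoneEIP:
    Condition: UseNATInstance
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
      # Replaces the hardcoded "${AWS::Region}-dfw-2" value.
      NetworkBorderGroup: !Ref LocalZoneNetworkBorderGroup
```

Parameter defaults cannot use intrinsic functions, so the default is a literal string; deployments in another Region or Local Zone would override it.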


NATInstance:
Contributor: Is a NAT instance required for the Local Zone? A NAT gateway is recommended for HA and resiliency. Doesn't a NAT instance introduce a single point of failure?

Contributor Author: Yes. NAT gateways are not supported in Local Zones.

Contributor: A NAT gateway can be used in IAD, with traffic routed from the Local Zone to that NAT gateway. @bluecrayon52 implemented this for the EKS VPC stack; please consult with him. An AWS-managed NAT removes the single point of failure, and also avoids finger-pointing at the CFN stack's NAT instance if a customer experiences network outages. It comes at the cost of potential latency overhead and traffic. Hrushi Gangur would also be a good SA to consult on this.

Condition: UseNATInstance
Type: AWS::EC2::Instance
Properties:
ImageId: ami-00ca32bbc84273381 # TODO: CURRENTLY HARDCODED TO IAD -- create map.
InstanceType: c6i.2xlarge
SubnetId: !Ref PublicSubnet
SecurityGroupIds: [!Ref NATSecurityGroup]
SourceDestCheck: false
IamInstanceProfile: !Ref NATInstanceProfile
UserData:
Fn::Base64: !Sub |
#!/bin/bash -ex
yum update -y
yum install iptables-services -y
systemctl enable iptables
systemctl start iptables
echo 'net.ipv4.ip_forward=1' > /etc/sysctl.d/custom-ip-forwarding.conf
sysctl -p /etc/sysctl.d/custom-ip-forwarding.conf
/sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
/sbin/iptables -F FORWARD
service iptables save

# Associate the EIP (source/dest check is already disabled by the SourceDestCheck property above)
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCEID=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
aws --region ${AWS::Region} ec2 associate-address --instance-id $INSTANCEID --allocation-id ${LocalZoneEIP.AllocationId}
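
The hardcoded `ImageId` carries a TODO to create a per-Region map. An alternative that avoids maintaining AMI IDs by hand is to resolve the AMI from a public SSM parameter. The fragment below is a sketch under assumptions: `NATInstanceAmiId` is a hypothetical parameter name, and the Amazon Linux 2 path is chosen only because the UserData uses `yum` and `eth0`; the PR does not say which OS the hardcoded AMI is.

```yaml
Parameters:
  # Hypothetical parameter (not in the PR). CloudFormation resolves the public
  # SSM parameter to the AMI ID for the Region the stack is deployed in.
  NATInstanceAmiId:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2

Resources:
  NATInstance:
    Condition: UseNATInstance
    Type: AWS::EC2::Instance
    Properties:
      # Replaces the AMI ID hardcoded for IAD.
      ImageId: !Ref NATInstanceAmiId
      # Remaining properties stay as written in the PR.
```

Because the parameter tracks the latest image, a stack update can replace the instance when a new AMI is published; pinning a specific AMI through a `Mappings` section, as the TODO suggests, avoids that at the cost of manual upkeep.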

NATSecurityGroup:
Condition: UseNATInstance
Type: AWS::EC2::SecurityGroup
Properties:
VpcId: !Ref VPC
GroupDescription: NAT Instance Security Group
SecurityGroupIngress:
- CidrIp: 10.0.0.0/8
IpProtocol: "-1"
SecurityGroupEgress:
- CidrIp: 0.0.0.0/0
IpProtocol: "-1"
Tags:
- Key: Name
Value: NATSecurityGroup

NATInstanceProfile:
Condition: UseNATInstance
Type: AWS::IAM::InstanceProfile
Properties:
Roles: [!Ref NATInstanceRole]

NATInstanceRole:
Condition: UseNATInstance
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Principal:
Service: ["ec2.amazonaws.com"]
Action: "sts:AssumeRole"
Policies:
- PolicyName: AssociateEIP
PolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action:
- ec2:AssociateAddress
- ec2:ModifyInstanceAttribute
Resource: "*"
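
The review thread on `NATInstance` suggests keeping an AWS-managed NAT gateway in the parent Region and routing Local Zone traffic to it. The fragment below is a rough sketch of that layout, not a tested design: `ParentRegionAZ`, `ParentRegionPublicSubnetCidr`, and every `ParentRegion*` logical ID are hypothetical, `PublicRouteTable` is an assumed name for the template's existing public route table, and `PrivateDefaultRoute` stands in for the existing private default route, whose logical ID is not visible in this diff.

```yaml
Parameters:
  # Hypothetical parameters (not in the PR).
  ParentRegionAZ:
    Type: AWS::EC2::AvailabilityZone::Name
    Description: Regional (non-Local-Zone) AZ that hosts the NAT gateway.
  ParentRegionPublicSubnetCidr:
    Type: String
    Description: Unused CIDR block inside the VPC for the regional public subnet.

Resources:
  # Public subnet in a regional AZ, where NAT gateways are supported.
  ParentRegionPublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Ref ParentRegionAZ
      CidrBlock: !Ref ParentRegionPublicSubnetCidr

  ParentRegionSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref ParentRegionPublicSubnet
      RouteTableId: !Ref PublicRouteTable  # assumed name of the existing public route table

  ParentRegionEIP:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc

  ParentRegionNATGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt ParentRegionEIP.AllocationId
      SubnetId: !Ref ParentRegionPublicSubnet

  # Local Zone private subnets keep using PrivateRouteTable; only the
  # default route target changes to the regional NAT gateway.
  PrivateDefaultRoute:  # hypothetical logical ID
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref ParentRegionNATGateway
```

With this layout the `UseNATInstance` resources would no longer be needed, at the cost the reviewer notes: egress traffic leaves the Local Zone for the parent Region before reaching the internet.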

# NOTE: when you create additional security groups, you must ensure that every
# security group has ingress/egress from/to its own security group id. Failure
@@ -392,7 +486,9 @@ Resources:
Properties:
RouteTableId: !Ref PrivateRouteTable
DestinationCidrBlock: 0.0.0.0/0
NatGatewayId: !Ref NATGateway
NatGatewayId: !If [UseNATGateway, !Ref NATGateway, !Ref 'AWS::NoValue']
InstanceId: !If [UseNATInstance, !Ref NATInstance, !Ref 'AWS::NoValue']


# Associate the public route table to the public subnet
PublicSubnetRouteTableAssociation:
@@ -630,4 +726,4 @@ Outputs:
FSxOpenZFSFileSystemDNSname:
Condition: CreateOpenZFSCondition
Description: The DNS of the FSxOpenZFS filesystem that has been created
Value: !GetAtt FSxOpenZFSFileSystem.DNSName
Value: !GetAtt FSxOpenZFSFileSystem.DNSName