diff --git a/hack/reduce_nat_gateway_cost/AGENTS.md b/hack/reduce_nat_gateway_cost/AGENTS.md
new file mode 100644
index 0000000000..cd0ff6176a
--- /dev/null
+++ b/hack/reduce_nat_gateway_cost/AGENTS.md
@@ -0,0 +1,356 @@
+# NAT Gateway Cost Reduction - Agent Reference
+
+This document provides comprehensive information for AI agents working with the NAT Gateway cost reduction solution.
+
+## Overview
+
+This solution replaces expensive AWS NAT Gateways with low-cost NAT instances for ephemeral OpenShift CI clusters. NAT Gateway traffic costs several cents per GB, while NAT instance traffic is nearly free (only the small EC2 instance and its public IP are charged).
+
+### How It Works
+
+1. **EventBridge** monitors EC2 API calls via CloudTrail
+2. When a `CreateNatGateway` event is detected, a **Lambda function** is triggered
+3. The Lambda creates a small EC2 instance (NAT instance) in the same AZ
+4. The NAT instance's **userData script** updates the private route table to route `0.0.0.0/0` traffic through itself instead of the NAT Gateway
+5. When the cluster is destroyed (`TerminateInstances` or `DeleteNatGateway` events), the Lambda cleans up the NAT instance
+
+## AWS Accounts
+
+The solution is deployed to these AWS accounts:
+
+| Profile Name | Account ID | Notes |
+|--------------|------------|-------|
+| `openshift-ci-1` | 460538899914 | Primary CI account |
+| `openshift-ci-2` | 892173657978 | CI account |
+| `openshift-ci-3` | 130757279292 | CI account |
+| `openshift-ci-4` | 387344700001 | CI account |
+| `openshift-ci-5` | 677273281401 | CI account |
+
+### AWS Profile Setup
+
+**Important:** The user is responsible for configuring AWS CLI profiles on their local machine. These profiles must be set up in `~/.aws/config` and `~/.aws/credentials` (or via SSO) before any deployment or monitoring operations.
+
+**AI agents should verify profiles before use:**
+
+```bash
+# Test that a profile exists and has valid credentials
+aws --profile <profile-name> sts get-caller-identity
+
+# Test all profiles at once
+for PROFILE in openshift-ci-1 openshift-ci-2 openshift-ci-3 openshift-ci-4 openshift-ci-5; do
+  echo -n "$PROFILE: "
+  aws --profile "$PROFILE" sts get-caller-identity --query 'Account' --output text 2>&1 || echo "FAILED"
+done
+```
+
+If credentials are expired or missing, ask the user to refresh their AWS authentication before proceeding.
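+
+For reference, a profile entry in `~/.aws/config` for one of these accounts might look like the sketch below. This is only an illustration of the expected shape (shown here for IAM Identity Center / SSO); the start URL and role name are placeholders, not values defined by this repository:
+
+```
+# Hypothetical ~/.aws/config entry - replace sso_start_url and sso_role_name
+# with your organization's actual values.
+[profile openshift-ci-1]
+sso_start_url  = https://example.awsapps.com/start
+sso_region     = us-east-1
+sso_account_id = 460538899914
+sso_role_name  = ExampleRole
+region         = us-east-1
+output         = json
+```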
+
+## File Structure
+
+```
+hack/reduce_nat_gateway_cost/
+├── replace_nat_with_nat_instance.py    # Main Lambda function code
+├── use-nat-instance.yaml               # CloudFormation template for us-east-1
+├── use-nat-instance-forwarders.yaml    # CloudFormation template for other regions
+├── deploy.sh                           # Deployment script
+├── monitor_resources.py                # Python monitoring script
+├── lambda.zip                          # Packaged Lambda (auto-generated)
+└── AGENTS.md                           # This file
+```
+
+## Key Resources Created
+
+### In us-east-1 (Main Region)
+
+| Resource | Name | Purpose |
+|----------|------|---------|
+| Lambda Function | `use-nat-instance-function` | Main logic - creates NAT instances, updates routes |
+| IAM Role | `use-nat-instance-function-role` | Lambda execution permissions |
+| IAM Role | `use-nat-instance-role` | NAT instance permissions (modify routes) |
+| IAM Instance Profile | `use-nat-instance-profile` | Attached to NAT instances |
+| EventBridge Rule | `use-nat-instance-event-rule` | Triggers Lambda on EC2 events |
+| IAM Role | `use-nat-instance-execution-role` | Allows EventBridge to invoke the Lambda |
+| CloudWatch Log Group | `use-nat-instance-log-group` | Lambda logs (14-day retention) |
+
+### In us-east-2, us-west-1, us-west-2 (Forwarder Regions)
+
+| Resource | Name | Purpose |
+|----------|------|---------|
+| EventBridge Rule | `use-nat-instance-forward-event-rule` | Forwards events to us-east-1 |
+| IAM Role | `use-nat-instance-forward-role-{region}` | Permission to forward events |
+
+## Deployment
+
+### Deploy to an Account
+
+```bash
+cd hack/reduce_nat_gateway_cost
+./deploy.sh <aws-profile>
+
+# Example:
+./deploy.sh openshift-ci-1
+```
+
+### Deploy to All Accounts
+
+```bash
+for PROFILE in openshift-ci-1 openshift-ci-2 openshift-ci-3 openshift-ci-4 openshift-ci-5; do
+  ./deploy.sh "$PROFILE"
+done
+```
+
+### Destroy (Disable NAT Instance Replacement)
+
+```bash
+./deploy.sh <aws-profile> destroy
+```
+
+**Important:** Resources have `DeletionPolicy: Retain` to prevent accidental deletion. The `destroy` command explicitly deletes the Lambda function (the critical resource) but retains IAM roles and other resources.
+
+### Update Lambda Code Only
+
+If you only changed `replace_nat_with_nat_instance.py`:
+
+```bash
+cd hack/reduce_nat_gateway_cost
+rm -f lambda.zip
+zip -r lambda.zip replace_nat_with_nat_instance.py
+aws --profile <aws-profile> lambda update-function-code \
+  --function-name use-nat-instance-function \
+  --zip-file fileb://lambda.zip \
+  --region us-east-1
+```
+
+## Monitoring
+
+### Run the Monitor Script
+
+```bash
+cd hack/reduce_nat_gateway_cost
+python3 monitor_resources.py --once     # Single check
+python3 monitor_resources.py            # Continuous monitoring (5 min intervals)
+python3 monitor_resources.py --alarm    # With audio alarm on issues
+```
+
+### What the Monitor Checks
+
+1. **Expected Resources Exist**: Lambda, IAM roles, instance profile, EventBridge rules
+2. **Orphaned IAM Resources**: Instance profiles/roles with `Created-` prefix (legacy)
+3. **Orphaned EC2 Resources**: Security groups and NAT instances where VPC no longer exists
+4. **Instance Profile Count**: Warns if ≥500 (AWS limit is 1000)
+5. **NAT Instance Age**: Alerts if any NAT instance is >8 hours old
+6. **Lambda Errors**: CloudWatch metrics for errors in last 8 hours
+7. **NAT Instance Effectiveness**: Percentage of NAT instances that successfully updated route tables (a quick ad-hoc version of this check is sketched below)
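+
+For the ad-hoc effectiveness check referenced in item 7, you can ask whether any route table in a NAT instance's VPC routes `0.0.0.0/0` to that instance. This is a simplified sketch of what `monitor_resources.py` does; `<aws-profile>`, `<region>`, `<vpc-id>`, and `<instance-id>` are placeholders:
+
+```bash
+# Print the InstanceId/NatGatewayId target of every 0.0.0.0/0 route in the VPC.
+# An effective NAT instance appears as the InstanceId on the private route table's default route.
+aws --profile <aws-profile> ec2 describe-route-tables \
+  --region <region> \
+  --filters "Name=vpc-id,Values=<vpc-id>" \
+  --query 'RouteTables[].Routes[?DestinationCidrBlock==`0.0.0.0/0`].[InstanceId,NatGatewayId]' \
+  --output text | grep -q '<instance-id>' && echo "effective" || echo "NOT effective"
+```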
+
+### Check Lambda Logs
+
+```bash
+aws --profile <aws-profile> logs filter-log-events \
+  --log-group-name use-nat-instance-log-group \
+  --filter-pattern "ERROR" \
+  --start-time $(( $(date +%s) - 86400 ))000 \
+  --region us-east-1
+```
+
+### Verify Lambda is Working
+
+```bash
+aws --profile <aws-profile> lambda get-function \
+  --function-name use-nat-instance-function \
+  --region us-east-1 \
+  --query '{State: Configuration.State, LastModified: Configuration.LastModified, CodeSha256: Configuration.CodeSha256}'
+```
+
+## Lambda Event Flow
+
+### CreateNatGateway Event
+
+1. Lambda receives event with NAT Gateway details
+2. Finds the public subnet where NAT Gateway was created
+3. Finds the corresponding private subnet (same VPC/AZ, name contains `-private`)
+4. Creates a security group for the NAT instance
+5. Launches a NAT instance (t4g.nano ARM64) with userData script
+6. Tags the NAT Gateway, route table, and instance for tracking
+7. The userData script (on the instance):
+   - Enables IP forwarding and NAT via iptables
+   - **Waits for 0.0.0.0/0 route to exist** (up to 5 minutes)
+   - Replaces the route to point to itself
+   - Tags the route table with the instance ID
+
+### DeleteNatGateway / TerminateInstances Events
+
+1. Lambda identifies affected VPC
+2. Finds NAT instances tagged with `ci-nat-gateway`
+3. Restores route table to point back to NAT Gateway (if available)
+4. Terminates NAT instances
+5. Deletes associated security groups
+
+## Common Issues and Fixes
+
+### Issue: "Unable to import module 'replace_nat_with_nat_instance'"
+
+**Cause:** Lambda code not uploaded after CloudFormation deploy.
+
+**Fix:** Upload the Lambda code:
+```bash
+rm -f lambda.zip
+zip -r lambda.zip replace_nat_with_nat_instance.py
+aws --profile <aws-profile> lambda update-function-code \
+  --function-name use-nat-instance-function \
+  --zip-file fileb://lambda.zip \
+  --region us-east-1
+```
+
+### Issue: "There is no route defined for '0.0.0.0/0' in the route table"
+
+**Cause:** Race condition - the NAT instance starts before the cluster installer creates the route.
+
+**Fix:** v1.3+ of the Lambda includes a retry loop that waits for the route to exist.
+
+### Issue: NAT instances not being cleaned up
+
+**Cause:** Lambda was broken during the cleanup event, or the forwarder was not working.
+
+**Fix:** Manually terminate orphaned instances:
+```bash
+aws --profile <aws-profile> ec2 terminate-instances --instance-ids <instance-id> --region <region>
+```
+
+### Issue: Stack in ROLLBACK_COMPLETE state
+
+**Cause:** Previous deployment failed.
+
+**Fix:** Delete the stack and redeploy:
+```bash
+aws --profile <aws-profile> cloudformation delete-stack --stack-name use-nat-instance --region us-east-1
+aws --profile <aws-profile> cloudformation wait stack-delete-complete --stack-name use-nat-instance --region us-east-1
+./deploy.sh <aws-profile>
+```
+
+### Issue: ResourceExistenceCheck failure during deploy
+
+**Cause:** Resources with `DeletionPolicy: Retain` exist outside CloudFormation.
+
+**Fix:** Delete the retained resources manually, then redeploy:
+```bash
+# Delete Lambda, roles, instance profile, event rules, log group
+# Then redeploy
+./deploy.sh <aws-profile>
+```
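+
+One possible command sequence for the manual cleanup above is sketched below. It is illustrative rather than part of this repository's tooling: EventBridge targets must be removed before a rule can be deleted, and inline policies must be deleted (and any managed policies detached) before the roles and instance profile can be deleted, so check `aws events list-targets-by-rule` and the IAM list commands for the exact names. `<aws-profile>` and `<target-id>` are placeholders:
+
+```bash
+PROFILE=<aws-profile>
+aws --profile "$PROFILE" lambda delete-function --function-name use-nat-instance-function --region us-east-1
+aws --profile "$PROFILE" logs delete-log-group --log-group-name use-nat-instance-log-group --region us-east-1
+aws --profile "$PROFILE" events remove-targets --rule use-nat-instance-event-rule --ids <target-id> --region us-east-1
+aws --profile "$PROFILE" events delete-rule --name use-nat-instance-event-rule --region us-east-1
+aws --profile "$PROFILE" iam remove-role-from-instance-profile \
+  --instance-profile-name use-nat-instance-profile --role-name use-nat-instance-role
+aws --profile "$PROFILE" iam delete-instance-profile --instance-profile-name use-nat-instance-profile
+for ROLE in use-nat-instance-role use-nat-instance-function-role use-nat-instance-execution-role; do
+  # Remove inline policies first; if delete-role still fails because managed policies
+  # are attached, detach them with 'aws iam detach-role-policy' before retrying.
+  for POLICY in $(aws --profile "$PROFILE" iam list-role-policies --role-name "$ROLE" --query 'PolicyNames' --output text); do
+    aws --profile "$PROFILE" iam delete-role-policy --role-name "$ROLE" --policy-name "$POLICY"
+  done
+  aws --profile "$PROFILE" iam delete-role --role-name "$ROLE"
+done
+```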
+
+### Issue: NAT instances failing with "UnauthorizedOperation" on ec2:DescribeRouteTables
+
+**Cause:** The UserData script uses AWS CLI commands that require permissions not granted to `use-nat-instance-role`.
+
+**Fix:** Update the IAM policy in all accounts:
+```bash
+for PROFILE in openshift-ci-1 openshift-ci-2 openshift-ci-3 openshift-ci-4 openshift-ci-5; do
+  aws --profile "$PROFILE" iam put-role-policy \
+    --role-name use-nat-instance-role \
+    --policy-name nat-instance-policy \
+    --policy-document '{
+      "Version": "2012-10-17",
+      "Statement": [
+        {"Effect": "Allow", "Action": ["ec2:ReplaceRoute", "ec2:DescribeRouteTables"], "Resource": "*"},
+        {"Effect": "Allow", "Action": ["ec2:CreateTags"], "Resource": "arn:aws:ec2:*:*:route-table/*"},
+        {"Effect": "Allow", "Action": ["ec2:ModifyInstanceAttribute"], "Resource": "*"}
+      ]
+    }'
+done
+```
+
+Also update `use-nat-instance.yaml` so future deployments include the fix.
+
+## Tags Used
+
+| Tag Key | Purpose |
+|---------|---------|
+| `ci-nat-gateway` | Marks NAT Gateway ID associated with a resource |
+| `ci-nat-instance` | Instance ID that updated a route table |
+| `ci-nat-vpc` | VPC ID for the resource |
+| `ci-nat-public-subnet` | Public subnet ID |
+| `ci-nat-private-route-table` | Route table the instance is configured to update |
+| `ci-nat-replace` | Marker that route replacement is enabled |
+
+## NAT Instance Details
+
+- **AMI:** Amazon Linux 2 (latest, ARM64)
+- **Instance Types (tried in order):**
+  1. `t4g.nano` (preferred - cheapest ARM)
+  2. `t4g.micro`
+  3. `t3.nano`
+  4. `t3.micro`
+- **Instance Profile:** `use-nat-instance-profile`
+- **Security Group:** Named `{subnet-name}-ci-nat-sg`, allows inbound from private subnet CIDR
+
+## Useful Commands
+
+### List NAT Instances in an Account
+
+```bash
+aws --profile <aws-profile> ec2 describe-instances \
+  --filters "Name=tag-key,Values=ci-nat-gateway" "Name=instance-state-name,Values=running" \
+  --region <region> \
+  --query 'Reservations[*].Instances[*].[InstanceId,VpcId,LaunchTime,Tags[?Key==`Name`].Value|[0]]' \
+  --output table
+```
+
+### Check Route Table for NAT Instance Route
+
+```bash
+aws --profile <aws-profile> ec2 describe-route-tables \
+  --route-table-ids <route-table-id> \
+  --region <region> \
+  --query 'RouteTables[0].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'
+```
+
+### Get NAT Instance Console Output (for debugging userData)
+
+```bash
+aws --profile <aws-profile> ec2 get-console-output \
+  --instance-id <instance-id> \
+  --region <region> \
+  --output text
+```
+
+### Count Instance Profiles
+
+```bash
+aws --profile <aws-profile> iam list-instance-profiles --no-paginate \
+  --query 'InstanceProfiles[*].InstanceProfileName' --output text | wc -w
+```
+
+## Version History
+
+- **v1.0:** Initial implementation with dynamic instance profiles
+- **v1.1:** Bug fixes for instance profile cleanup
+- **v1.2:** Static instance profile via CloudFormation, improved cleanup
+- **v1.3:** Fixed race condition - wait for 0.0.0.0/0 route before replacing
+- **v1.3.1:** Added `ec2:DescribeRouteTables` to NAT instance role (required by v1.3 UserData script)
+
+## Important Notes
+
+1. **PG&E Cloud Ops Pruner:** If any **new resources** (compared to what currently exists) are added to the CloudFormation templates, the PG&E Cloud Ops pruner will likely automatically delete them from the AWS account shortly after creation. A request must be filed with Cloud Ops to have new resource types whitelisted/preserved before deploying CloudFormation changes that create new resources.
+
+2. **DeletionPolicy: Retain:** Most resources are retained on stack deletion to prevent accidental service disruption.
+
+3. **Lambda Timeout:** 10 minutes (600 seconds) - sufficient for all operations.
+
+4. **Event Retry:** EventBridge retries once with max 5 minute event age.
+
+5. **Regions:** Lambda runs in us-east-1; forwarders in us-east-2, us-west-1, us-west-2 (a quick existence check for these rules is sketched below).
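+
+To confirm the region layout in note 5, you can check that the EventBridge rules actually exist in every account and region. This is an illustrative spot check using only names already listed in this document; the monitor script performs the same verification more thoroughly:
+
+```bash
+for PROFILE in openshift-ci-1 openshift-ci-2 openshift-ci-3 openshift-ci-4 openshift-ci-5; do
+  echo "== $PROFILE =="
+  # Main rule that triggers the Lambda in us-east-1
+  aws --profile "$PROFILE" events describe-rule --name use-nat-instance-event-rule \
+    --region us-east-1 --query 'State' --output text || echo "MISSING in us-east-1"
+  # Forwarder rules that relay events from the other regions
+  for REGION in us-east-2 us-west-1 us-west-2; do
+    aws --profile "$PROFILE" events describe-rule --name use-nat-instance-forward-event-rule \
+      --region "$REGION" --query 'State' --output text || echo "MISSING in $REGION"
+  done
+done
+```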
+
+## Lessons Learned
+
+### UserData Scripts Require Corresponding IAM Permissions
+
+**Incident (v1.3):** The UserData script was updated to call `aws ec2 describe-route-tables` to wait for the `0.0.0.0/0` route to exist before replacing it. However, the `use-nat-instance-role` IAM policy was not updated to include the `ec2:DescribeRouteTables` permission.
+
+**Result:** All NAT instances launched after the v1.3 deployment failed to update their route tables. The UserData script's retry loop ran for 5 minutes, logging `UnauthorizedOperation` errors on every attempt, then gave up.
+
+**Detection:** The monitoring script showed NAT instance effectiveness dropping from ~85% to ~23%.
+
+**Fix:** Added `ec2:DescribeRouteTables` permission to the `use-nat-instance-role` policy in `use-nat-instance.yaml` and manually updated the policy in all accounts using `aws iam put-role-policy`.
+
+**Lesson:** When modifying the UserData script to call AWS APIs, always verify that the NAT instance's IAM role (`use-nat-instance-role`) has the required permissions. The role's policy is defined in `use-nat-instance.yaml` under `NatInstanceRole.Policies`.
+
diff --git a/hack/reduce_nat_gateway_cost/deploy.sh b/hack/reduce_nat_gateway_cost/deploy.sh
old mode 100644
new mode 100755
index 492b9ba7c1..d245d1c9d3
--- a/hack/reduce_nat_gateway_cost/deploy.sh
+++ b/hack/reduce_nat_gateway_cost/deploy.sh
@@ -2,9 +2,26 @@
 set -ux
 
-if [ "${1:-}" != 'destroy' ]; then
+# Change to the directory where this script is located
+cd "$(dirname "$0")"
 
-    aws cloudformation deploy \
+usage() {
+    echo "Usage: $0 <aws-profile> [destroy]"
+    echo "  aws-profile: AWS CLI profile name to use"
+    echo "  destroy: Optional - if specified, deletes the stack instead of deploying"
+    exit 1
+}
+
+if [ -z "${1:-}" ]; then
+    usage
+fi
+
+AWS_PROFILE="$1"
+ACTION="${2:-deploy}"
+
+if [ "$ACTION" != 'destroy' ]; then
+
+    aws --profile "$AWS_PROFILE" cloudformation deploy \
         --stack-name use-nat-instance \
         --template-file use-nat-instance.yaml \
         --region us-east-1 \
@@ -13,10 +30,13 @@ if [ "${1:-}" != 'destroy' ]; then
 
     rm -f lambda.zip
     zip -r lambda.zip replace_nat_with_nat_instance.py
-    aws lambda update-function-code --function-name use-nat-instance-function --zip-file fileb://lambda.zip
+    aws --profile "$AWS_PROFILE" lambda update-function-code \
+        --function-name use-nat-instance-function \
+        --zip-file fileb://lambda.zip \
+        --region us-east-1
 
     for region in us-east-2 us-west-1 us-west-2; do
-        aws cloudformation deploy \
+        aws --profile "$AWS_PROFILE" cloudformation deploy \
             --stack-name use-nat-instance-forwarder \
             --template-file use-nat-instance-forwarders.yaml \
             --capabilities CAPABILITY_NAMED_IAM \
@@ -25,11 +45,23 @@ if [ "${1:-}" != 'destroy' ]; then
 
 else
 
-    aws cloudformation delete-stack --stack-name use-nat-instance --region us-east-1
+    # Delete the Lambda function explicitly since it has DeletionPolicy: Retain.
+    # This is the critical resource - without it, NAT instance replacement stops.
+    echo "Deleting Lambda function use-nat-instance-function..."
+ aws --profile "$AWS_PROFILE" lambda delete-function \ + --function-name use-nat-instance-function \ + --region us-east-1 2>/dev/null || echo "Lambda function not found or already deleted" + + # Delete CloudFormation stacks (other resources will be retained due to DeletionPolicy) + aws --profile "$AWS_PROFILE" cloudformation delete-stack --stack-name use-nat-instance --region us-east-1 for region in us-east-2 us-west-1 us-west-2; do - aws cloudformation delete-stack \ + aws --profile "$AWS_PROFILE" cloudformation delete-stack \ --stack-name use-nat-instance-forwarder \ --region $region done + echo "" + echo "NOTE: Other resources (IAM roles, instance profile, log group, event rule) were retained." + echo "To fully clean up, manually delete these resources or redeploy without DeletionPolicy: Retain." + fi diff --git a/hack/reduce_nat_gateway_cost/monitor_resources.py b/hack/reduce_nat_gateway_cost/monitor_resources.py new file mode 100755 index 0000000000..545e123fc4 --- /dev/null +++ b/hack/reduce_nat_gateway_cost/monitor_resources.py @@ -0,0 +1,853 @@ +#!/usr/bin/env python3 +""" +Monitor script for NAT instance resource leaks. + +Checks for orphaned security groups, NAT instances, and instance profiles +across multiple AWS accounts. Also monitors NAT instance age and Lambda errors. +""" + +import argparse +import subprocess +import sys +import time +from datetime import datetime, timezone, timedelta +from typing import Optional + +import boto3 +from botocore.exceptions import ClientError + +# Configuration +PROFILES = ["openshift-ci-1", "openshift-ci-2", "openshift-ci-3", "openshift-ci-4", "openshift-ci-5"] +REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2"] +CHECK_INTERVAL_SECONDS = 300 # 5 minutes +INSTANCE_PROFILE_THRESHOLD = 500 +NAT_INSTANCE_AGE_THRESHOLD_HOURS = 8 +NAT_INSTANCE_EFFECTIVENESS_MINUTES = 15 # Check effectiveness for instances older than this +LAMBDA_FUNCTION_NAME = "use-nat-instance-function" +LAMBDA_REGION = "us-east-1" + +# Expected resources that should exist in each account +EXPECTED_INSTANCE_PROFILE = "use-nat-instance-profile" +EXPECTED_ROLES = ["use-nat-instance-role", "use-nat-instance-function-role"] +EXPECTED_EVENTBRIDGE_RULE = "use-nat-instance-event-rule" +EXPECTED_FORWARDER_RULE = "use-nat-instance-forward-event-rule" +FORWARDER_REGIONS = ["us-east-2", "us-west-1", "us-west-2"] + +# ANSI colors +RED = "\033[0;31m" +GREEN = "\033[0;32m" +YELLOW = "\033[1;33m" +CYAN = "\033[0;36m" +NC = "\033[0m" # No Color + + +def format_duration(td: timedelta) -> str: + """Format a timedelta as a human-readable duration string like '2h3m' or '45m'.""" + total_seconds = int(td.total_seconds()) + hours, remainder = divmod(total_seconds, 3600) + minutes, _ = divmod(remainder, 60) + + if hours > 0: + return f"{hours}h{minutes}m" + else: + return f"{minutes}m" + + +def play_alarm(): + """Play an alarm sound using available system utilities.""" + try: + # Try PulseAudio (Linux) + for _ in range(3): + result = subprocess.run( + ["paplay", "/usr/share/sounds/freedesktop/stereo/phone-incoming-call.oga"], + capture_output=True, + timeout=5, + ) + if result.returncode != 0: + subprocess.run( + ["paplay", "/usr/share/sounds/gnome/default/alerts/drip.ogg"], + capture_output=True, + timeout=5, + ) + time.sleep(0.5) + except (FileNotFoundError, subprocess.TimeoutExpired): + try: + # Try macOS + for _ in range(3): + subprocess.run( + ["afplay", "/System/Library/Sounds/Ping.aiff"], + capture_output=True, + timeout=5, + ) + time.sleep(0.5) + except (FileNotFoundError, 
subprocess.TimeoutExpired): + # Terminal bell as last resort + for _ in range(5): + print("\a", end="", flush=True) + time.sleep(0.3) + + +def get_session(profile: str, region: str = "us-east-1") -> boto3.Session: + """Create a boto3 session for the given profile and region.""" + return boto3.Session(profile_name=profile, region_name=region) + + +def vpc_exists(ec2_client, vpc_id: str) -> bool: + """Check if a VPC exists.""" + try: + ec2_client.describe_vpcs(VpcIds=[vpc_id]) + return True + except ClientError as e: + if "InvalidVpcID.NotFound" in str(e): + return False + raise + + +def check_orphaned_ec2_resources(profile: str, region: str) -> tuple[list[str], list[dict]]: + """ + Check for orphaned security groups and NAT instances in a profile/region. + + Returns: + Tuple of (issues list, nat_instances list with details) + """ + issues = [] + nat_instances = [] + + session = get_session(profile, region) + ec2_client = session.client("ec2") + + # Check for security groups with ci-nat-gateway tag + try: + response = ec2_client.describe_security_groups( + Filters=[{"Name": "tag-key", "Values": ["ci-nat-gateway"]}] + ) + for sg in response.get("SecurityGroups", []): + sg_id = sg["GroupId"] + sg_name = sg.get("GroupName", "unknown") + vpc_id = sg.get("VpcId", "") + if vpc_id and not vpc_exists(ec2_client, vpc_id): + issues.append(f" ORPHANED SECURITY GROUP: {sg_id} ({sg_name}) - VPC {vpc_id} no longer exists") + except ClientError as e: + issues.append(f" ERROR checking security groups: {e}") + + # Check for NAT instances with ci-nat-gateway tag + try: + response = ec2_client.describe_instances( + Filters=[ + {"Name": "tag-key", "Values": ["ci-nat-gateway"]}, + {"Name": "instance-state-name", "Values": ["running", "pending", "stopping", "stopped"]}, + ] + ) + for reservation in response.get("Reservations", []): + for instance in reservation.get("Instances", []): + instance_id = instance["InstanceId"] + vpc_id = instance.get("VpcId", "") + launch_time = instance.get("LaunchTime") + + # Get instance name from tags + instance_name = "unknown" + for tag in instance.get("Tags", []): + if tag["Key"] == "Name": + instance_name = tag["Value"] + break + + nat_instances.append({ + "instance_id": instance_id, + "instance_name": instance_name, + "vpc_id": vpc_id, + "launch_time": launch_time, + "region": region, + "profile": profile, + }) + + # Check if VPC exists + if vpc_id and not vpc_exists(ec2_client, vpc_id): + issues.append(f" ORPHANED NAT INSTANCE: {instance_id} ({instance_name}) - VPC {vpc_id} no longer exists") + except ClientError as e: + issues.append(f" ERROR checking NAT instances: {e}") + + return issues, nat_instances + + +def check_orphaned_iam_resources(profile: str) -> list[str]: + """Check for orphaned instance profiles and roles with Created- prefix.""" + issues = [] + + session = get_session(profile) + iam_client = session.client("iam") + + # Check for instance profiles with Created- prefix + try: + paginator = iam_client.get_paginator("list_instance_profiles") + for page in paginator.paginate(): + for ip in page.get("InstanceProfiles", []): + if "Created-" in ip["InstanceProfileName"]: + issues.append(f" ORPHANED INSTANCE PROFILE: {ip['InstanceProfileName']}") + except ClientError as e: + issues.append(f" ERROR checking instance profiles: {e}") + + # Check for roles with Created- prefix + try: + paginator = iam_client.get_paginator("list_roles") + for page in paginator.paginate(): + for role in page.get("Roles", []): + if "Created-" in role["RoleName"]: + issues.append(f" ORPHANED 
ROLE: {role['RoleName']}") + except ClientError as e: + issues.append(f" ERROR checking roles: {e}") + + return issues + + +def check_instance_profile_count(profile: str) -> tuple[int, Optional[str]]: + """Check total instance profile count and return warning if over threshold.""" + session = get_session(profile) + iam_client = session.client("iam") + + count = 0 + try: + paginator = iam_client.get_paginator("list_instance_profiles") + for page in paginator.paginate(): + count += len(page.get("InstanceProfiles", [])) + except ClientError as e: + return 0, f" ERROR counting instance profiles: {e}" + + if count >= INSTANCE_PROFILE_THRESHOLD: + return count, f" WARNING: {count} instance profiles (threshold: {INSTANCE_PROFILE_THRESHOLD})" + + return count, None + + +def check_nat_instance_age(nat_instances: list[dict]) -> list[str]: + """Check if any NAT instances are older than the threshold.""" + issues = [] + now = datetime.now(timezone.utc) + threshold = timedelta(hours=NAT_INSTANCE_AGE_THRESHOLD_HOURS) + + for instance in nat_instances: + launch_time = instance.get("launch_time") + if launch_time: + age = now - launch_time + if age > threshold: + hours = age.total_seconds() / 3600 + issues.append( + f" OLD NAT INSTANCE: {instance['instance_id']} ({instance['instance_name']}) " + f"in {instance['profile']}/{instance['region']} - running for {hours:.1f} hours" + ) + + return issues + + +def get_lambda_error_count(profile: str, hours: int = 8) -> tuple[int, Optional[str]]: + """ + Get the number of Lambda errors in the last N hours using CloudWatch metrics. + + Uses the Errors metric which is automatically published by Lambda. + """ + session = get_session(profile, LAMBDA_REGION) + cloudwatch_client = session.client("cloudwatch") + + end_time = datetime.now(timezone.utc) + start_time = end_time - timedelta(hours=hours) + + try: + response = cloudwatch_client.get_metric_statistics( + Namespace="AWS/Lambda", + MetricName="Errors", + Dimensions=[ + {"Name": "FunctionName", "Value": LAMBDA_FUNCTION_NAME}, + ], + StartTime=start_time, + EndTime=end_time, + Period=3600 * hours, # One period covering the entire time range + Statistics=["Sum"], + ) + + datapoints = response.get("Datapoints", []) + if datapoints: + return int(datapoints[0].get("Sum", 0)), None + return 0, None + except ClientError as e: + return 0, f"ERROR getting Lambda metrics: {e}" + + +def get_lambda_last_modified(profile: str) -> tuple[Optional[datetime], Optional[str]]: + """ + Get the last modified time of the Lambda function. 
+ + Returns: + Tuple of (last_modified datetime in UTC, error message if any) + """ + session = get_session(profile, LAMBDA_REGION) + lambda_client = session.client("lambda") + + try: + response = lambda_client.get_function(FunctionName=LAMBDA_FUNCTION_NAME) + last_modified_str = response.get("Configuration", {}).get("LastModified", "") + if last_modified_str: + # Parse ISO format: "2026-01-06T16:11:36.000+0000" + # Handle both formats with and without milliseconds + try: + last_modified = datetime.fromisoformat(last_modified_str.replace("+0000", "+00:00")) + except ValueError: + # Try parsing without timezone offset + last_modified = datetime.strptime( + last_modified_str[:19], "%Y-%m-%dT%H:%M:%S" + ).replace(tzinfo=timezone.utc) + return last_modified, None + return None, "Lambda LastModified not found" + except ClientError as e: + return None, f"ERROR getting Lambda info: {e}" + + +def get_lambda_errors_since_update(profile: str) -> tuple[int, str, Optional[str]]: + """ + Get Lambda errors since the last update or last hour, whichever is more recent. + + This helps assess whether a recent deployment has introduced issues. + + Returns: + Tuple of (error_count, time_description, error_message if any) + """ + session = get_session(profile, LAMBDA_REGION) + cloudwatch_client = session.client("cloudwatch") + + end_time = datetime.now(timezone.utc) + one_hour_ago = end_time - timedelta(hours=1) + + # Get last modified time + last_modified, err = get_lambda_last_modified(profile) + if err: + return 0, "unknown", err + + # Use the more recent of: last_modified or one_hour_ago + if last_modified and last_modified > one_hour_ago: + start_time = last_modified + time_desc = f"since update ({last_modified.strftime('%H:%M:%S')} UTC)" + else: + start_time = one_hour_ago + time_desc = "last hour" + + # Calculate period in seconds (must be a multiple of 60 for CloudWatch) + raw_seconds = int((end_time - start_time).total_seconds()) + # Round up to nearest 60 seconds, minimum 60 + period_seconds = max(60, ((raw_seconds + 59) // 60) * 60) + + try: + response = cloudwatch_client.get_metric_statistics( + Namespace="AWS/Lambda", + MetricName="Errors", + Dimensions=[ + {"Name": "FunctionName", "Value": LAMBDA_FUNCTION_NAME}, + ], + StartTime=start_time, + EndTime=end_time, + Period=period_seconds, + Statistics=["Sum"], + ) + + datapoints = response.get("Datapoints", []) + if datapoints: + return int(datapoints[0].get("Sum", 0)), time_desc, None + return 0, time_desc, None + except ClientError as e: + return 0, time_desc, f"ERROR getting Lambda metrics: {e}" + + +def count_nat_instances_by_region(nat_instances: list[dict]) -> dict[str, int]: + """Count NAT instances by region.""" + counts = {region: 0 for region in REGIONS} + for instance in nat_instances: + region = instance.get("region", "") + if region in counts: + counts[region] += 1 + return counts + + +def get_nat_gateway_stats(profiles: list[str], regions: list[str]) -> dict: + """ + Get NAT Gateway statistics across all profiles and regions. 
+ + Returns: + Dictionary with: + - by_tag_value: dict mapping each ci-nat-replace value to its count + - untagged: count of NAT Gateways with no ci-nat-replace tag + - by_profile: breakdown by profile + """ + stats = { + "by_tag_value": {}, # Maps tag value -> count + "untagged": 0, + "by_profile": {}, + } + + for profile in profiles: + profile_stats = {"by_tag_value": {}, "untagged": 0} + + for region in regions: + try: + session = get_session(profile, region) + ec2_client = session.client("ec2") + + # Get all NAT Gateways that are available (not deleted/pending) + paginator = ec2_client.get_paginator("describe_nat_gateways") + for page in paginator.paginate( + Filters=[{"Name": "state", "Values": ["available", "pending"]}] + ): + for nat_gw in page.get("NatGateways", []): + # Check for ci-nat-replace tag + tags = {t["Key"]: t["Value"] for t in nat_gw.get("Tags", [])} + + if "ci-nat-replace" not in tags: + profile_stats["untagged"] += 1 + else: + tag_value = tags["ci-nat-replace"] + profile_stats["by_tag_value"][tag_value] = profile_stats["by_tag_value"].get(tag_value, 0) + 1 + except ClientError as e: + # Log but continue - don't fail the whole check + pass + + # Aggregate into global stats + for tag_value, count in profile_stats["by_tag_value"].items(): + stats["by_tag_value"][tag_value] = stats["by_tag_value"].get(tag_value, 0) + count + stats["untagged"] += profile_stats["untagged"] + stats["by_profile"][profile] = profile_stats + + return stats + + +def check_nat_instance_effectiveness(nat_instances: list[dict]) -> dict: + """ + Check the effectiveness of NAT instances by verifying route table updates. + + For NAT instances older than NAT_INSTANCE_EFFECTIVENESS_MINUTES, checks whether + the route table has been updated to route traffic through the NAT instance + instead of a NAT Gateway. 
+ + Returns: + Dictionary with effectiveness statistics and details + """ + now = datetime.now(timezone.utc) + min_age = timedelta(minutes=NAT_INSTANCE_EFFECTIVENESS_MINUTES) + + # Filter to instances old enough to evaluate + eligible_instances = [ + inst for inst in nat_instances + if inst.get("launch_time") and (now - inst["launch_time"]) > min_age + ] + + if not eligible_instances: + return { + "eligible_count": 0, + "effective_count": 0, + "ineffective_count": 0, + "percentage": 100.0, + "ineffective_instances": [], + } + + effective_count = 0 + skipped_count = 0 # Terminated/shutting-down instances are skipped + ineffective_instances = [] + + # Group instances by profile and region for efficient API calls + by_profile_region = {} + for inst in eligible_instances: + key = (inst["profile"], inst["region"]) + if key not in by_profile_region: + by_profile_region[key] = [] + by_profile_region[key].append(inst) + + for (profile, region), instances in by_profile_region.items(): + session = get_session(profile, region) + ec2_client = session.client("ec2") + + for inst in instances: + instance_id = inst["instance_id"] + vpc_id = inst["vpc_id"] + + try: + # Get the instance state + instance_response = ec2_client.describe_instances( + InstanceIds=[instance_id] + ) + + if not instance_response.get("Reservations"): + continue + + instance_data = instance_response["Reservations"][0]["Instances"][0] + instance_state = instance_data.get("State", {}).get("Name", "") + + # Skip terminated instances - they restore the NAT gateway route on termination + if instance_state in ("terminated", "shutting-down"): + skipped_count += 1 + continue + + # Get all route tables in the VPC + route_tables_response = ec2_client.describe_route_tables( + Filters=[{"Name": "vpc-id", "Values": [vpc_id]}] + ) + + # Check if any route table has a 0.0.0.0/0 route pointing to this instance + # Note: replace-route with --instance-id sets InstanceId, not NetworkInterfaceId + found_route = False + for rt in route_tables_response.get("RouteTables", []): + for route in rt.get("Routes", []): + if route.get("DestinationCidrBlock") == "0.0.0.0/0": + # Check if this route points to our NAT instance + if route.get("InstanceId") == instance_id: + found_route = True + break + if found_route: + break + + if found_route: + effective_count += 1 + else: + ineffective_instances.append({ + **inst, + "reason": "No route table updated to use this instance", + }) + + except ClientError as e: + ineffective_instances.append({ + **inst, + "reason": f"Error checking: {e}", + }) + + eligible_count = len(eligible_instances) - skipped_count # Exclude terminated instances + ineffective_count = len(ineffective_instances) + percentage = (effective_count / eligible_count * 100) if eligible_count > 0 else 100.0 + + return { + "eligible_count": eligible_count, + "effective_count": effective_count, + "ineffective_count": ineffective_count, + "percentage": percentage, + "ineffective_instances": ineffective_instances, + } + + +def check_expected_resources(profile: str) -> tuple[list[str], list[str]]: + """ + Verify that all expected resources exist in the account. 
+ + Returns: + Tuple of (missing_resources list, present_resources list) + """ + missing = [] + present = [] + + session = get_session(profile, LAMBDA_REGION) + + # Check Lambda function + lambda_client = session.client("lambda") + try: + lambda_client.get_function(FunctionName=LAMBDA_FUNCTION_NAME) + present.append(f"Lambda: {LAMBDA_FUNCTION_NAME}") + except ClientError as e: + if "ResourceNotFoundException" in str(e): + missing.append(f"MISSING Lambda: {LAMBDA_FUNCTION_NAME}") + else: + missing.append(f"ERROR checking Lambda: {e}") + + # Check IAM instance profile + iam_client = session.client("iam") + try: + iam_client.get_instance_profile(InstanceProfileName=EXPECTED_INSTANCE_PROFILE) + present.append(f"Instance Profile: {EXPECTED_INSTANCE_PROFILE}") + except ClientError as e: + if "NoSuchEntity" in str(e): + missing.append(f"MISSING Instance Profile: {EXPECTED_INSTANCE_PROFILE}") + else: + missing.append(f"ERROR checking Instance Profile: {e}") + + # Check IAM roles + for role_name in EXPECTED_ROLES: + try: + iam_client.get_role(RoleName=role_name) + present.append(f"Role: {role_name}") + except ClientError as e: + if "NoSuchEntity" in str(e): + missing.append(f"MISSING Role: {role_name}") + else: + missing.append(f"ERROR checking Role {role_name}: {e}") + + # Check EventBridge rule in us-east-1 + events_client = session.client("events") + try: + events_client.describe_rule(Name=EXPECTED_EVENTBRIDGE_RULE) + present.append(f"EventBridge Rule: {EXPECTED_EVENTBRIDGE_RULE} (us-east-1)") + except ClientError as e: + if "ResourceNotFoundException" in str(e): + missing.append(f"MISSING EventBridge Rule: {EXPECTED_EVENTBRIDGE_RULE} (us-east-1)") + else: + missing.append(f"ERROR checking EventBridge Rule: {e}") + + # Check forwarder rules in other regions + for region in FORWARDER_REGIONS: + region_session = get_session(profile, region) + region_events_client = region_session.client("events") + try: + region_events_client.describe_rule(Name=EXPECTED_FORWARDER_RULE) + present.append(f"Forwarder Rule: {EXPECTED_FORWARDER_RULE} ({region})") + except ClientError as e: + if "ResourceNotFoundException" in str(e): + missing.append(f"MISSING Forwarder Rule: {EXPECTED_FORWARDER_RULE} ({region})") + else: + missing.append(f"ERROR checking Forwarder Rule ({region}): {e}") + + return missing, present + + +def run_check(play_alarm_on_issues: bool = False) -> bool: + """ + Run all monitoring checks. + + Args: + play_alarm_on_issues: If True, play an alarm sound when issues are detected. + + Returns: + True if problems were found, False otherwise. 
+ """ + timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S") + all_issues = [] + all_nat_instances = [] + has_problems = False + + print("=" * 60) + print(f"Resource Monitor Check: {timestamp}") + print("=" * 60) + print() + + # Collect data from all profiles + for profile in PROFILES: + print(f"{YELLOW}Checking profile: {profile}{NC}") + profile_issues = [] + + # Check expected resources exist + missing_resources, present_resources = check_expected_resources(profile) + if missing_resources: + profile_issues.extend([f" {r}" for r in missing_resources]) + + # Check instance profile count + ip_count, ip_warning = check_instance_profile_count(profile) + if ip_warning: + profile_issues.append(ip_warning) + + # Check for orphaned IAM resources + iam_issues = check_orphaned_iam_resources(profile) + profile_issues.extend(iam_issues) + + # Check each region for orphaned EC2 resources + for region in REGIONS: + ec2_issues, nat_instances = check_orphaned_ec2_resources(profile, region) + all_nat_instances.extend(nat_instances) + if ec2_issues: + profile_issues.append(f" Region {region}:") + profile_issues.extend([f" {issue}" for issue in ec2_issues]) + + # Get Lambda error count (only for us-east-1 where Lambda runs) + error_count, error_msg = get_lambda_error_count(profile) + if error_msg: + profile_issues.append(f" {error_msg}") + elif error_count > 0: + profile_issues.append(f" LAMBDA ERRORS: {error_count} errors in the last {NAT_INSTANCE_AGE_THRESHOLD_HOURS} hours") + + if profile_issues: + print(f"{RED}ISSUES FOUND:{NC}") + for issue in profile_issues: + print(issue) + all_issues.extend([f"Profile {profile}:"] + profile_issues) + has_problems = True + else: + print(f"{GREEN} No issues found{NC}") + print() + + # Check NAT instance age across all profiles + age_issues = check_nat_instance_age(all_nat_instances) + if age_issues: + print(f"{RED}NAT INSTANCE AGE ISSUES:{NC}") + for issue in age_issues: + print(issue) + all_issues.extend(["NAT Instance Age Issues:"] + age_issues) + has_problems = True + print() + + # Display NAT instance counts by region + print(f"{CYAN}NAT Instance Counts by Region:{NC}") + region_counts = count_nat_instances_by_region(all_nat_instances) + total_instances = 0 + for region in REGIONS: + count = region_counts[region] + total_instances += count + print(f" {region}: {count}") + print(f" Total: {total_instances}") + print() + + # Display NAT Gateway replacement stats + print(f"{CYAN}NAT Gateway Replacement Status:{NC}") + nat_gw_stats = get_nat_gateway_stats(PROFILES, REGIONS) + by_tag_value = nat_gw_stats["by_tag_value"] + untagged = nat_gw_stats["untagged"] + total_tagged = sum(by_tag_value.values()) + total_nat_gw = total_tagged + untagged + + # Sort tag values: "true" first, then alphabetically + sorted_values = sorted(by_tag_value.keys(), key=lambda x: (x.lower() != "true", x.lower())) + for tag_value in sorted_values: + count = by_tag_value[tag_value] + print(f" NAT Gateways with ci-nat-replace={tag_value}: {count}") + print(f" NAT Gateways with no ci-nat-replace tag: {untagged}") + print(f" Total NAT Gateways: {total_nat_gw}") + print(f" NAT Instances launched: {total_instances}") + + # Only "true" results in replacement + true_count = by_tag_value.get("true", 0) + if true_count > 0: + ratio = total_instances / true_count + print(f" NAT Instance to Replaced Gateway ratio: {ratio:.2f}") + print() + + # Check NAT instance effectiveness + print(f"{CYAN}NAT Instance Effectiveness (instances > {NAT_INSTANCE_EFFECTIVENESS_MINUTES} min old):{NC}") + effectiveness = 
check_nat_instance_effectiveness(all_nat_instances) + if effectiveness["eligible_count"] == 0: + print(f" No NAT instances old enough to evaluate") + else: + pct = effectiveness["percentage"] + if pct >= 95: + color = GREEN + elif pct >= 80: + color = YELLOW + else: + color = RED + print(f" Eligible instances: {effectiveness['eligible_count']}") + print(f" Effective (route updated): {effectiveness['effective_count']}") + print(f" Ineffective: {effectiveness['ineffective_count']}") + print(f" Effectiveness: {color}{pct:.1f}%{NC}") + + # Report ineffective instances as issues if any + if effectiveness["ineffective_instances"]: + now = datetime.now(timezone.utc) + print(f"\n {RED}Ineffective NAT instances:{NC}") + for inst in effectiveness["ineffective_instances"]: + age_str = "" + if inst.get("launch_time"): + age_str = f" [{format_duration(now - inst['launch_time'])}]" + print(f" - {inst['instance_id']} ({inst['instance_name']}) in {inst['profile']}/{inst['region']}{age_str}") + print(f" Reason: {inst['reason']}") + # Add to issues + all_issues.append("NAT Instance Effectiveness Issues:") + for inst in effectiveness["ineffective_instances"]: + age_str = "" + if inst.get("launch_time"): + age_str = f" [{format_duration(now - inst['launch_time'])}]" + all_issues.append( + f" INEFFECTIVE: {inst['instance_id']} ({inst['instance_name']}) " + f"in {inst['profile']}/{inst['region']}{age_str} - {inst['reason']}" + ) + has_problems = True + print() + + # Display Lambda error summary + print(f"{CYAN}Lambda Error Summary (last {NAT_INSTANCE_AGE_THRESHOLD_HOURS} hours):{NC}") + total_errors = 0 + for profile in PROFILES: + error_count, _ = get_lambda_error_count(profile) + total_errors += error_count + if error_count > 0: + print(f" {profile}: {RED}{error_count} errors{NC}") + else: + print(f" {profile}: {GREEN}0 errors{NC}") + print(f" Total: {total_errors}") + print() + + # Display errors since last update (or last hour) + print(f"{CYAN}Lambda Errors Since Update (or last hour):{NC}") + total_recent_errors = 0 + for profile in PROFILES: + error_count, time_desc, err = get_lambda_errors_since_update(profile) + total_recent_errors += error_count + if err: + print(f" {profile}: {YELLOW}{err}{NC}") + elif error_count > 0: + print(f" {profile}: {RED}{error_count} errors ({time_desc}){NC}") + else: + print(f" {profile}: {GREEN}0 errors ({time_desc}){NC}") + print(f" Total: {total_recent_errors}") + print() + + # Display infrastructure health summary + print(f"{CYAN}Infrastructure Health:{NC}") + all_healthy = True + for profile in PROFILES: + missing, present = check_expected_resources(profile) + if missing: + print(f" {profile}: {RED}UNHEALTHY - {len(missing)} missing resource(s){NC}") + all_healthy = False + else: + print(f" {profile}: {GREEN}OK ({len(present)} resources){NC}") + print() + + if has_problems: + print(f"{RED}{'=' * 60}") + print("ALERT: Resource issues detected!") + print(f"{'=' * 60}{NC}") + print() + for issue in all_issues: + print(issue) + if play_alarm_on_issues: + play_alarm() + return True + else: + print(f"{GREEN}All checks passed - no issues detected{NC}") + return False + + +def main(): + parser = argparse.ArgumentParser( + description="Monitor NAT instance resources for leaks and issues." 
+ ) + parser.add_argument( + "-o", "--once", + action="store_true", + help="Run once and exit (don't loop)", + ) + parser.add_argument( + "-i", "--interval", + type=int, + default=CHECK_INTERVAL_SECONDS, + help=f"Check interval in seconds (default: {CHECK_INTERVAL_SECONDS})", + ) + parser.add_argument( + "-a", "--alarm", + action="store_true", + help="Play an alarm sound when issues are detected", + ) + args = parser.parse_args() + + print("NAT Instance Resource Monitor") + print(f"Checking profiles: {', '.join(PROFILES)}") + print(f"Checking regions: {', '.join(REGIONS)}") + print(f"Instance profile threshold: {INSTANCE_PROFILE_THRESHOLD}") + print(f"NAT instance age threshold: {NAT_INSTANCE_AGE_THRESHOLD_HOURS} hours") + print(f"Check interval: {args.interval}s") + print() + + if args.once: + has_problems = run_check(play_alarm_on_issues=args.alarm) + sys.exit(1 if has_problems else 0) + else: + while True: + try: + run_check(play_alarm_on_issues=args.alarm) + print() + print(f"Next check in {args.interval} seconds... (Ctrl+C to stop)") + time.sleep(args.interval) + except KeyboardInterrupt: + print("\nMonitoring stopped.") + break + + +if __name__ == "__main__": + main() + diff --git a/hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py b/hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py index 33df171a86..ca0647da61 100644 --- a/hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py +++ b/hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py @@ -1,6 +1,5 @@ import boto3 import botocore.exceptions -import json import logging import time from typing import Dict, List, Optional, NamedTuple @@ -116,49 +115,8 @@ # Lambda Role policies required: # - AmazonEC2FullAccess # - AWSLambdaBasicExecutionRole -# - And an inline: -# { -# "Version": "2012-10-17", -# "Statement": [ -# { -# "Effect": "Allow", -# "Action": [ -# "iam:CreateInstanceProfile", # To be able to create new EC2 instance profiles for the NAT instances -# -# # To assign privileges to the instance profiles. -# "iam:AddRoleToInstanceProfile", -# "iam:GetInstanceProfile", -# "iam:CreateRole", -# "iam:AttachRolePolicy", -# "iam:PutRolePolicy", -# "iam:ListInstanceProfiles", -# "iam:ListInstanceProfilesForRole", -# "iam:ListRoles", -# "iam:GetRole", -# "iam:ListAttachedRolePolicies", -# "iam:ListRolePolicies", -# -# # To clean up instance profile -# "iam:DeleteInstanceProfile", -# "iam:DeleteRole", -# "iam:DetachRolePolicy", -# "iam:RemoveRoleFromInstanceProfile", -# "iam:DeleteRolePolicy", -# "iam:TagInstanceProfile", -# -# # To get information on the latest AMIs -# "ssm:GetParameter" -# ], -# "Resource": "*" -# }, -# # Allow the lambda to pass a permission into a newly created instance profile. 
-# { -# "Effect": "Allow", -# "Action": "iam:PassRole", -# "Resource": "arn:aws:iam::892173657978:role/Created-*" -# } -# ] -# } +# - ssm:GetParameter (to get the latest AMI) +# - iam:PassRole for the NAT instance role (created by CloudFormation) # Used on different resources to indicate the NAT instance @@ -190,7 +148,10 @@ class NatInstanceInfo(NamedTuple): ] -VERSION = 'v1.1' +# The instance profile created by CloudFormation for NAT instances to use +NAT_INSTANCE_PROFILE_NAME = 'use-nat-instance-profile' + +VERSION = 'v1.4' class RequestInfoFilter(logging.Filter): @@ -248,14 +209,6 @@ def get_ec2_client(region): return client_cache[key] -def get_iam_client(region): - global client_cache - key = f'{region}-iam' - if key not in client_cache: - client_cache[key] = boto3.client("iam", region_name=region) - return client_cache[key] - - def get_latest_amazon_linux2_ami(region, nat_instance_idx): global client_cache key = f'{region}-ami' @@ -322,6 +275,9 @@ def lambda_handler(event, context): # because the gateway is in the process of being deleted and may no # longer be associated with the VPC. vpc_id = get_tag(nat_gateway_tags, TAG_KEY_VPC_ID) + if not vpc_id: + logger.warning(f'NAT gateway {nat_gateway_id} does not have VPC tag; skipping cleanup') + return cleanup(region, vpc_id) elif event_name == 'RunInstances': @@ -342,7 +298,7 @@ def lambda_handler(event, context): # be sufficient. instance_name = get_tag(tag_set, 'Name') cluster_role = get_tag(tag_set, 'sigs.k8s.io/cluster-api-provider-aws/role') - if (cluster_role and cluster_role not in ['master', 'control-plane']) or (not cluster_role and '-master' not in instance_name): + if (cluster_role and cluster_role not in ['master', 'control-plane']) or (not cluster_role and (not instance_name or '-master' not in instance_name)): return logger.info(f'Running tagged instance {instance_id} since it is part of the control plane') @@ -374,7 +330,7 @@ def lambda_handler(event, context): if not nat_instance_vpc_id: return - if '-bootstrap' in instance_name: + if instance_name and '-bootstrap' in instance_name: # We should not terminate the NAT instance just because the bootstrap machine is # being terminated. return @@ -418,7 +374,9 @@ def set_nat_instance_enabled(region: str, vpc_id: str, enabled: bool): nat_instance_id = get_tag(route_table_tags, TAG_KEY_NAT_INSTANCE_ID) route_table_id = route_table['RouteTableId'] - if nat_instance_id: + # Validate that nat_instance_id is actually an instance ID (starts with 'i-') + # It could be a NAT Gateway ID if the userData script failed to update the route + if nat_instance_id and nat_instance_id.startswith('i-'): # Check if the route is already correctly set routes = route_table.get('Routes', []) needs_update = True @@ -541,6 +499,10 @@ def handle_create_nat_gateway(region: str, nat_gateway_id: str, public_subnet_id private_subnet=private_subnet, key_name=key_name) + if not new_nat_instance: + logger.error(f'Failed to create NAT instance for NAT gateway {nat_gateway_id}') + return None + new_nat_instance_id = new_nat_instance["InstanceId"] set_tag(ec2_client, nat_gateway_id, TAG_KEY_NAT_INSTANCE_ID, new_nat_instance_id) return new_nat_instance @@ -577,27 +539,62 @@ def set_tag(ec2_client, resource_id, key, value): def create_nat_security_group(ec2_client, nat_gateway_id: str, public_subnet: Dict, private_subnet: Dict) -> Optional[str]: """ Attempts to create a security group for the forthcoming NAT VM instance. Returns - the security group ID. + the security group ID. 
If the security group already exists (e.g., from a retry), + returns the existing security group ID. """ vpc_id = public_subnet["VpcId"] # Both subnets belong to the same VPC private_cidr = private_subnet["CidrBlock"] # Extract private subnet CIDR public_subnet_name = get_tag(public_subnet.get('Tags', []), 'Name') + sg_name = f"{public_subnet_name}-ci-nat-sg" - # Create Security Group for the instance - sg_response = ec2_client.create_security_group( - GroupName=f"{public_subnet_name}-ci-nat-sg", - Description="Security group for NAT instance", - VpcId=vpc_id, - TagSpecifications=[{ - "ResourceType": "security-group", - "Tags": [ - {"Key": TAG_KEY_NAT_GATEWAY_ID, "Value": nat_gateway_id}, - {'Key': TAG_KEY_VPC_ID, "Value": vpc_id}, + # Check if security group already exists (e.g., from a previous attempt or retry) + try: + existing_sgs = ec2_client.describe_security_groups( + Filters=[ + {"Name": "group-name", "Values": [sg_name]}, + {"Name": "vpc-id", "Values": [vpc_id]}, ] - }] - ) - nat_sg_id = sg_response["GroupId"] - logger.info(f"Created NAT Security Group: {nat_sg_id}") + ) + if existing_sgs.get("SecurityGroups"): + nat_sg_id = existing_sgs["SecurityGroups"][0]["GroupId"] + logger.info(f"Using existing NAT Security Group: {nat_sg_id}") + return nat_sg_id + except botocore.exceptions.ClientError as e: + logger.warning(f"Error checking for existing security group: {e}") + + # Create Security Group for the instance + try: + sg_response = ec2_client.create_security_group( + GroupName=sg_name, + Description="Security group for NAT instance", + VpcId=vpc_id, + TagSpecifications=[{ + "ResourceType": "security-group", + "Tags": [ + {"Key": TAG_KEY_NAT_GATEWAY_ID, "Value": nat_gateway_id}, + {'Key': TAG_KEY_VPC_ID, "Value": vpc_id}, + ] + }] + ) + nat_sg_id = sg_response["GroupId"] + logger.info(f"Created NAT Security Group: {nat_sg_id}") + except botocore.exceptions.ClientError as e: + if 'InvalidGroup.Duplicate' in str(e): + # Race condition - another invocation created it first, fetch the existing one + existing_sgs = ec2_client.describe_security_groups( + Filters=[ + {"Name": "group-name", "Values": [sg_name]}, + {"Name": "vpc-id", "Values": [vpc_id]}, + ] + ) + if existing_sgs.get("SecurityGroups"): + nat_sg_id = existing_sgs["SecurityGroups"][0]["GroupId"] + logger.info(f"Security group already exists, using: {nat_sg_id}") + return nat_sg_id + else: + raise + else: + raise # Allow inbound traffic from the private subnet ec2_client.authorize_security_group_ingress( @@ -638,126 +635,6 @@ def create_nat_security_group(ec2_client, nat_gateway_id: str, public_subnet: Di return nat_sg_id -def create_instance_profile(region, nat_gateway_id: str) -> str: - """ - Creates a role for the NAT instance to use to manage the routing table - for the private subnet. Returns the instance profile name. 
- """ - - iam_client = get_iam_client(region) - instance_profile_name = f"Created-{nat_gateway_id}" - - try: - iam_client.get_instance_profile(InstanceProfileName=instance_profile_name) - logger.warning(f'Desired instance profile already exists: {instance_profile_name}') - except: - iam_client.create_instance_profile( - InstanceProfileName=instance_profile_name, - # Tag these instances to make it easy to find for clean up - Tags=[ - { - 'Key': TAG_KEY_NAT_GATEWAY_ID, - 'Value': nat_gateway_id - } - ] - ) - logger.info(f'Created desired instance profile: {instance_profile_name}') - - # Allow the ec2 instance to assume a role - trust_policy = { - "Version": "2012-10-17", - "Statement": [ - { - "Effect": "Allow", - "Principal": {"Service": "ec2.amazonaws.com"}, - "Action": "sts:AssumeRole" - } - ] - } - - # What the EC2 instance can do with that role - policy_document = { - "Version": "2012-10-17", - "Statement": [ - { - "Effect": "Allow", - "Action": [ - # Allow the instance to update route tables when it finishes booting. - "ec2:ReplaceRoute", - ], - "Resource": "*" - }, - { - # In order for the EC2 NAT instance to indicate that it has successfully - # run through its userData, it will tag the routetable with this it is - # associated at the end of userData script. - "Effect": "Allow", - "Action": [ - "ec2:CreateTags", - ], - "Resource": "arn:aws:ec2:*:*:route-table/*" - }, - { - "Effect": "Allow", - "Action": [ - "ec2:ModifyInstanceAttribute", - ], - "Resource": "*", - "Condition": { - "StringEquals": { - # Only allow the instance to modify its own attributes - # This is used by userData startup script to disable - # source/dest check necessary to perform as a NAT. - "ec2:InstanceID": "${ec2:InstanceID}" - } - } - } - ] - } - - try: - iam_client.get_role(RoleName=instance_profile_name) - logger.warning(f'Desired instance profile Role already exists: {instance_profile_name}') - except: - iam_client.create_role( - RoleName=instance_profile_name, - AssumeRolePolicyDocument=json.dumps(trust_policy) - ) - logger.info(f'Created desired instance profile Role: {instance_profile_name}') - - try: - # Attach the inline policy to the role - iam_client.put_role_policy( - RoleName=instance_profile_name, - PolicyName=instance_profile_name, - PolicyDocument=json.dumps(policy_document) - ) - logger.info(f'Added policy to instance profile Role: {instance_profile_name}') - except Exception as e: - logger.error(f'Error adding policy to instance profile Role: {instance_profile_name}: {e}') - - # IAM is slow in getting resources created and viable for use. - # Retry adding the role for several minutes before giving up - for attempt in reversed(range(24)): - try: - # Add the role to the instance profile - iam_client.add_role_to_instance_profile( - InstanceProfileName=instance_profile_name, - RoleName=instance_profile_name - ) - logger.info(f'Added Role to instance profile: {instance_profile_name}') - break - except botocore.exceptions.ClientError as e: - if e.response["Error"]["Code"] == "NoSuchEntity" and attempt > 0: - logger.info(f"Role '{instance_profile_name}' not ready yet, retrying... 
{attempt}") - time.sleep(10) - else: - logger.error(f'Error adding role to instance profile to {instance_profile_name}: {e}') - raise - - return instance_profile_name - - def create_nat_instance(ec2_client, vpc_id, nat_gateway_id, public_subnet: Dict, private_subnet: Dict, key_name: Optional[str] = None) -> Optional[Dict]: """ Creates the desired NAT instance and returns the description @@ -801,15 +678,14 @@ def create_nat_instance(ec2_client, vpc_id, nat_gateway_id, public_subnet: Dict, time.sleep(10) if not private_route_table_id: - print(f'Timeout waiting for route table in private subnet: {private_subnet_id}') + logger.error(f'Timeout waiting for route table in private subnet: {private_subnet_id}') + return None nat_sg_id = create_nat_security_group(ec2_client, nat_gateway_id=nat_gateway_id, public_subnet=public_subnet, private_subnet=private_subnet) - instance_profile_name = create_instance_profile(region, nat_gateway_id) - key_pair_info = {} if key_name: key_pair_info['KeyName'] = key_name @@ -817,75 +693,88 @@ def create_nat_instance(ec2_client, vpc_id, nat_gateway_id, public_subnet: Dict, nat_instance = None nat_instance_idx = 0 - # In AWS instance profiles can take several minutes to be created. - # Keep trying over 4 minutes if the exception is instance profile related. - for attempt in reversed(range(24)): - try: - instance = ec2_client.run_instances( - ImageId=get_latest_amazon_linux2_ami(region, nat_instance_idx), - InstanceType=NAT_INSTANCES_INFO[nat_instance_idx].instance_type, - NetworkInterfaces=[ - { - 'AssociatePublicIpAddress': PERMIT_IPv4_ADDRESS_POOL_USE, - 'SubnetId': public_subnet_id, - 'DeviceIndex': 0, - 'Groups': [nat_sg_id], - }, - ], - IamInstanceProfile={ - # As part of its userData, the EC2 instance will update the route table - # for the private network to point to itself. It needs a role to have permission - # to do this. 
- 'Name': instance_profile_name - }, - UserData=get_nat_instance_user_data( - region=region, - nat_gateway_id=nat_gateway_id, - route_table_id=private_route_table_id, - ), - MinCount=1, - MaxCount=1, - TagSpecifications=[{ - "ResourceType": "instance", - "Tags": [ - { - "Key": TAG_KEY_NAT_GATEWAY_ID, - "Value": nat_gateway_id - }, - { - "Key": TAG_KEY_PUBLIC_SUBNET_ID, - "Value": public_subnet_id - }, - { - "Key": TAG_KEY_PRIVATE_ROUTE_TABLE_ID, - "Value": private_route_table_id - }, + # Try different instance types if one is not supported in the region + try: + while nat_instance_idx < len(NAT_INSTANCES_INFO): + try: + instance = ec2_client.run_instances( + ImageId=get_latest_amazon_linux2_ami(region, nat_instance_idx), + InstanceType=NAT_INSTANCES_INFO[nat_instance_idx].instance_type, + NetworkInterfaces=[ { - "Key": TAG_KEY_VPC_ID, - "Value": vpc_id, + 'AssociatePublicIpAddress': PERMIT_IPv4_ADDRESS_POOL_USE, + 'SubnetId': public_subnet_id, + 'DeviceIndex': 0, + 'Groups': [nat_sg_id], }, - { - "Key": "Name", - "Value": f"{public_subnet_name}-ci-nat", - } - ] - }], - **key_pair_info - ) - nat_instance = instance["Instances"][0] - break - except botocore.exceptions.ClientError as e: - if e.response["Error"]["Code"] == "InvalidParameterValue" and attempt > 0: - error_message = e.response["Error"].get('Message', '') - error_message = error_message.lower() - if 'instanceprofile' in error_message or 'instance profile' in error_message: - logger.info(f'Waiting for IAM instance profile to be available: {instance_profile_name}') - elif e.response["Error"]["Code"] == "Unsupported": - logger.info(f'{NAT_INSTANCES_INFO[nat_instance_idx].instance_type} is not supported on this region.') - nat_instance_idx += 1 - else: - raise - time.sleep(10) + ], + IamInstanceProfile={ + # As part of its userData, the EC2 instance will update the route table + # for the private network to point to itself. It needs a role to have permission + # to do this. The instance profile is created by CloudFormation. + 'Name': NAT_INSTANCE_PROFILE_NAME + }, + UserData=get_nat_instance_user_data( + region=region, + nat_gateway_id=nat_gateway_id, + route_table_id=private_route_table_id, + ), + MinCount=1, + MaxCount=1, + TagSpecifications=[{ + "ResourceType": "instance", + "Tags": [ + { + "Key": TAG_KEY_NAT_GATEWAY_ID, + "Value": nat_gateway_id + }, + { + "Key": TAG_KEY_PUBLIC_SUBNET_ID, + "Value": public_subnet_id + }, + { + "Key": TAG_KEY_PRIVATE_ROUTE_TABLE_ID, + "Value": private_route_table_id + }, + { + "Key": TAG_KEY_VPC_ID, + "Value": vpc_id, + }, + { + "Key": "Name", + "Value": f"{public_subnet_name}-ci-nat", + } + ] + }], + **key_pair_info + ) + nat_instance = instance["Instances"][0] + break + except botocore.exceptions.ClientError as e: + if e.response["Error"]["Code"] == "Unsupported": + logger.info(f'{NAT_INSTANCES_INFO[nat_instance_idx].instance_type} is not supported in this region.') + nat_instance_idx += 1 + else: + raise + except Exception as e: + # If instance creation fails, clean up the security group we created + logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}') + try: + ec2_client.delete_security_group(GroupId=nat_sg_id) + logger.info(f'Deleted orphaned security group {nat_sg_id}') + except Exception as cleanup_error: + logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}') + raise + + if not nat_instance: + # All instance types failed - clean up the security group + logger.error(f'All instance types failed. 
Cleaning up security group {nat_sg_id}') + try: + ec2_client.delete_security_group(GroupId=nat_sg_id) + logger.info(f'Deleted orphaned security group {nat_sg_id}') + except Exception as cleanup_error: + logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}') + return None # Tag the route table with the nat gateway that we are # going to replace with the NAT instance once it starts. @@ -1005,17 +894,35 @@ def get_nat_instance_user_data(region: str, nat_gateway_id: str, route_table_id: # source/dest must be disabled for a NAT instance. This permission must be provided in the IAM instance profile. if aws ec2 modify-instance-attribute --region {region} --instance-id $INSTANCE_ID --no-source-dest-check; then - # Record the NAT gateway id to swich back to if the NAT instance needs to be shutdown + # Record the NAT gateway id to switch back to if the NAT instance needs to be shutdown # ungracefully. If shutdown gracefully, the NAT instance will restore the NAT automatically. aws ec2 create-tags --region {region} --resources {route_table_id} --tags Key={TAG_KEY_NAT_INSTANCE_ID},Value="{nat_gateway_id}" - # Set the routing table to begin routing traffic to this instance ID. - if aws ec2 replace-route --region {region} --route-table-id {route_table_id} --destination-cidr-block 0.0.0.0/0 --instance-id $INSTANCE_ID; then - # Indicate to subsequent lambda invocations that the NAT instance thinks it is ready to serve traffic - aws ec2 create-tags --region {region} --resources {route_table_id} --tags Key={TAG_KEY_NAT_INSTANCE_ID},Value=$INSTANCE_ID - echo "Routing table for private subnet updated to route 0.0.0.0/0 to NAT instance." + # Wait for the 0.0.0.0/0 route to exist in the route table before trying to replace it. + # The cluster installer may not have created it yet (race condition). + echo "Waiting for 0.0.0.0/0 route to exist in route table {route_table_id}..." + ROUTE_EXISTS=false + for attempt in $(seq 1 60); do + if aws ec2 describe-route-tables --region {region} --route-table-ids {route_table_id} --query 'RouteTables[0].Routes[?DestinationCidrBlock==`0.0.0.0/0`]' --output text | grep -q "0.0.0.0/0"; then + echo "Found 0.0.0.0/0 route after $attempt attempts" + ROUTE_EXISTS=true + break + fi + echo "Attempt $attempt: 0.0.0.0/0 route not found yet, waiting 5 seconds..." + sleep 5 + done + + if [ "$ROUTE_EXISTS" = "false" ]; then + echo "ERROR: Timeout waiting for 0.0.0.0/0 route to be created. NAT instance will not be used." else - echo "ERROR: Unable to set route table entry! NAT instance will not be used." + # Set the routing table to begin routing traffic to this instance ID. + if aws ec2 replace-route --region {region} --route-table-id {route_table_id} --destination-cidr-block 0.0.0.0/0 --instance-id $INSTANCE_ID; then + # Indicate to subsequent lambda invocations that the NAT instance thinks it is ready to serve traffic + aws ec2 create-tags --region {region} --resources {route_table_id} --tags Key={TAG_KEY_NAT_INSTANCE_ID},Value=$INSTANCE_ID + echo "Routing table for private subnet updated to route 0.0.0.0/0 to NAT instance." + else + echo "ERROR: Unable to set route table entry! NAT instance will not be used." + fi fi else echo "ERROR: Unable to disable source/dest check! NAT instance will not be used." @@ -1041,9 +948,7 @@ def cleanup(region: str, vpc_id: str): for each. 
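+    Note: NAT instances now use the shared use-nat-instance-profile created by CloudFormation,
+    so this cleanup no longer needs to delete per-NAT-gateway "Created-*" IAM roles or instance profiles.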
""" ec2_client = get_ec2_client(region) - iam_client = get_iam_client(region) - created_instance_profiles = set() while True: # Find instances that appear to have been set up as NAT instances. @@ -1063,12 +968,6 @@ def cleanup(region: str, vpc_id: str): instances_to_terminate = [] for instance in all_instances: instance_id = instance['InstanceId'] - instance_profile_name = instance.get('IamInstanceProfile', {}).get('Arn', '').split('/')[-1] - - if instance_profile_name and 'Created-' in instance_profile_name: - # The role was created for the instance, so delete it as well. - created_instance_profiles.add(instance_profile_name) - instance_state = instance['State']['Name'] logger.info(f'Found instance {instance_id} in state {instance_state}') if instance_state == 'terminated': @@ -1092,63 +991,6 @@ def cleanup(region: str, vpc_id: str): # to prevent too many redundant threads performing the cleanup. return - if created_instance_profiles: - logger.info(f"Found {len(created_instance_profiles)} instance profiles to delete: {created_instance_profiles}") - - # To delete an instance profile, you need to remove all roles from it. - # Until this is done, you can't delete the instance profile OR the roles. - for instance_profile_name in created_instance_profiles: - while True: - try: - response = iam_client.get_instance_profile(InstanceProfileName=instance_profile_name) - roles = response["InstanceProfile"]["Roles"] - - try: - for role in roles: - role_name = role["RoleName"] - policies = role.get("PolicyNames", []) - - if policies: - # Delete each inline policy - for policy_name in policies: - iam_client.delete_role_policy(RoleName=role_name, PolicyName=policy_name) - logger.info(f"Deleted inline policy: {policy_name} from role {role_name}") - - iam_client.remove_role_from_instance_profile( - InstanceProfileName=instance_profile_name, - RoleName=role_name - ) - logger.info(f"Removed role {role_name} from instance profile {instance_profile_name}") - except Exception as e: - logger.info(f'Unable to remove roles from instance profile {instance_profile_name}: {e}') - - try: - iam_client.delete_instance_profile( - InstanceProfileName=instance_profile_name - ) - logger.info(f'Deleted instance profile: {instance_profile_name}') - except Exception as e: - logger.info(f'Unable to delete instance profile {instance_profile_name}: {e}') - - try: - # There is also a role with the same name created for - # the profile. - iam_client.delete_role(RoleName=instance_profile_name) - logger.info(f'Deleted role: {instance_profile_name}') - except Exception as e: - logger.info(f'Unable to delete instance profile role: {instance_profile_name}: {e}') - - except botocore.exceptions.ClientError as e: - if e.response["Error"]["Code"] == "NoSuchEntity": - # get_instance_profile didn't find the instance profile. So we have finished - # deleting it. 
- logger.info(f'Instance profile no longer detected: {instance_profile_name}') - break - logger.info(f'Issue trying to remove instance profile: {instance_profile_name}: {e}') - time.sleep(10) - - created_instance_profiles.clear() - eips = get_eips_by_tag(ec2_client, TAG_KEY_VPC_ID, vpc_id) if eips: logger.info(f"Found {len(eips)} EIPs to release.") @@ -1173,7 +1015,7 @@ def cleanup(region: str, vpc_id: str): try: logger.info(f"Deleting security group {sg_id}...") ec2_client.delete_security_group(GroupId=sg_id) - except ec2_client.exceptions.ClientError as e: + except botocore.exceptions.ClientError as e: error_code = e.response.get('Error', {}).get('Code', '') # Some security groups cannot be deleted if they're in use if error_code == 'DependencyViolation': diff --git a/hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml b/hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml index 8f4fb67f36..04fe9fb592 100644 --- a/hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml +++ b/hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml @@ -4,6 +4,7 @@ Description: >- Resources: UseNatInstanceForwardRole: Type: AWS::IAM::Role + DeletionPolicy: Retain Properties: Description: | Executes the lambda when an expected event is detected. @@ -33,6 +34,7 @@ Resources: Resource: !Sub arn:aws:events:us-east-1:${AWS::AccountId}:event-bus/default UseNatInstanceForwardEventRule: Type: AWS::Events::Rule + DeletionPolicy: Retain Properties: Name: use-nat-instance-forward-event-rule EventPattern: >- diff --git a/hack/reduce_nat_gateway_cost/use-nat-instance.yaml b/hack/reduce_nat_gateway_cost/use-nat-instance.yaml index 1d7d86a0d9..52b9eb298a 100644 --- a/hack/reduce_nat_gateway_cost/use-nat-instance.yaml +++ b/hack/reduce_nat_gateway_cost/use-nat-instance.yaml @@ -8,8 +8,49 @@ AWSTemplateFormatVersion: '2010-09-09' Transform: AWS::Serverless-2016-10-31 Description: An AWS Serverless Application Model template describing your function. 
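+# The NatInstanceRole and NatInstanceProfile below are created once by this template and shared
+# by every NAT instance, replacing the per-NAT-gateway "Created-*" roles and instance profiles
+# the Lambda previously created at runtime.
+# Optional post-deploy sanity check (a sketch; substitute an AWS CLI profile of your choice):
+#   aws --profile <profile> iam get-instance-profile --instance-profile-name use-nat-instance-profile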
Resources: + # IAM Role for NAT instances - allows them to modify route tables and their own attributes + NatInstanceRole: + Type: AWS::IAM::Role + DeletionPolicy: Retain + Properties: + RoleName: use-nat-instance-role + Description: Role for NAT instances to modify route tables and necessary instance attributes + AssumeRolePolicyDocument: + Version: '2012-10-17' + Statement: + - Effect: Allow + Principal: + Service: ec2.amazonaws.com + Action: sts:AssumeRole + Policies: + - PolicyName: nat-instance-policy + PolicyDocument: + Version: '2012-10-17' + Statement: + - Effect: Allow + Action: + - ec2:ReplaceRoute + - ec2:DescribeRouteTables + Resource: "*" + - Effect: Allow + Action: + - ec2:CreateTags + Resource: "arn:aws:ec2:*:*:route-table/*" + - Effect: Allow + Action: + - ec2:ModifyInstanceAttribute + Resource: "*" + # Instance Profile that NAT instances will use + NatInstanceProfile: + Type: AWS::IAM::InstanceProfile + DeletionPolicy: Retain + Properties: + InstanceProfileName: use-nat-instance-profile + Roles: + - !Ref NatInstanceRole UseNatInstanceFunctionRole: Type: AWS::IAM::Role + DeletionPolicy: Retain Properties: Description: | Intercept Openshift installation calls and create a virtual machine @@ -35,37 +76,21 @@ Resources: Statement: - Effect: Allow Action: - - iam:CreateInstanceProfile - - iam:AddRoleToInstanceProfile - - iam:GetInstanceProfile - - iam:CreateRole - - iam:AttachRolePolicy - - iam:PutRolePolicy - - iam:ListInstanceProfiles - - iam:ListInstanceProfilesForRole - - iam:ListRoles - - iam:GetRole - - iam:ListAttachedRolePolicies - - iam:ListRolePolicies - - iam:DeleteInstanceProfile - - iam:DeleteRole - - iam:DetachRolePolicy - - iam:RemoveRoleFromInstanceProfile - - iam:DeleteRolePolicy - - iam:TagInstanceProfile - ssm:GetParameter Resource: "*" - Effect: Allow Action: - iam:PassRole - Resource: !Sub arn:aws:iam::${AWS::AccountId}:role/Created-* + Resource: !GetAtt NatInstanceRole.Arn UseNatInstanceLogGroup: Type: AWS::Logs::LogGroup + DeletionPolicy: Retain Properties: LogGroupName: use-nat-instance-log-group RetentionInDays: 14 UseNatInstanceFunction: Type: AWS::Lambda::Function + DeletionPolicy: Retain Properties: FunctionName: use-nat-instance-function Code: @@ -94,6 +119,7 @@ Resources: ApplyOn: None UseNatInstanceExecutionRole: Type: AWS::IAM::Role + DeletionPolicy: Retain Properties: Description: | Executes the lambda when an expected event is detected. @@ -119,6 +145,7 @@ Resources: Resource: !GetAtt UseNatInstanceFunction.Arn UseNatInstanceEventRule: Type: AWS::Events::Rule + DeletionPolicy: Retain Properties: Name: use-nat-instance-event-rule EventPattern: >-