Production-Ready AWS Reference

AWS Cloud Engineer Handbook

A practical, billing-aware guide to AWS architecture — compute, storage, networking, cost optimization, and security patterns with Terraform and Boto3 examples.

Amazon Web Services · Terraform · Boto3 · Cost-Aware Architecture · March 2026
Billing-first mindset: Every service section in this handbook calls out the primary cost drivers. AWS charges for compute hours, storage volume, data transfer (egress), and API calls. Understand these four levers before provisioning anything.

Table of Contents

Module 1: Compute & Serverless
Amazon EC2 instances, pricing models, AWS Lambda cold starts, and ECS vs EKS container orchestration.
Module 2: Storage & Databases
S3 storage classes, DynamoDB capacity modes, RDS Multi-AZ, and data lifecycle management.
Module 3: Networking & Security
VPC architecture, IAM policies and roles, Security Groups vs NACLs, and least-privilege design.
Module 4: Pricing & Cost Optimization
AWS Pricing Calculator, Cost Explorer hidden costs, and Budgets alerting to prevent bill shock.
Module 5: Common Pitfalls & Security Risks
Public S3 buckets, over-provisioned instances, hardcoded credentials, and how to fix each one.

Module 1: Compute & Serverless

AWS offers compute at every abstraction level — from bare-metal EC2 instances you fully manage, to Lambda functions where you only write the handler. Choosing the right compute model is the single highest-impact cost decision you will make.

Amazon EC2

Elastic Compute Cloud (EC2) provides resizable virtual machines in the cloud. You choose the Instance Type (CPU, memory, network), the AMI (operating system image), and the pricing model. EC2 is the foundation for most AWS workloads.

Instance Types

| Family | Use Case | Example Type | vCPUs / RAM |
|---|---|---|---|
| t3 / t3a | General purpose, burstable — dev/test, small APIs | t3.medium | 2 / 4 GiB |
| m5 / m6i | General purpose, steady state — production web apps | m6i.xlarge | 4 / 16 GiB |
| c5 / c6i | Compute-optimized — batch processing, ML inference | c6i.2xlarge | 8 / 16 GiB |
| r5 / r6i | Memory-optimized — in-memory caches, large databases | r6i.xlarge | 4 / 32 GiB |
| g5 / p4d | GPU — ML training, video rendering | g5.xlarge | 4 / 16 GiB + GPU |

AMIs (Amazon Machine Images)

An AMI is a pre-built OS snapshot. AWS provides Amazon Linux 2023, Ubuntu, and Windows Server base images. You can create custom AMIs with your dependencies pre-installed to speed up instance boot times. Custom AMIs are backed by EBS snapshots, which incur snapshot storage charges.

Pricing Models

On-Demand
Pay per second (Linux) or per hour (Windows). No commitment. Highest per-unit cost. Best for unpredictable, short-lived workloads.
Reserved Instances
1- or 3-year commitment. Up to 72% savings vs On-Demand. Choose All Upfront, Partial Upfront, or No Upfront. Best for steady-state production.
Spot Instances
Up to 90% savings. AWS can reclaim capacity with a 2-minute warning. Best for fault-tolerant batch jobs, CI/CD runners, and data processing.
💰
Cost drivers: Instance type × hours running + EBS volume storage (GB/month) + data transfer out (egress). A forgotten m5.4xlarge running 24/7 costs ~$560/month. Always tag instances and set billing alerts.
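The trade-off between the three pricing models can be sketched with simple arithmetic. The hourly rate below is an illustrative us-east-1 figure for t3.medium; the Reserved and Spot discounts are the "up to" ceilings quoted above, so treat the output as a ballpark, not a quote:

```python
# Rough monthly cost comparison across EC2 pricing models.
# Rates are illustrative us-east-1 examples; check current pricing.
HOURS_PER_MONTH = 720

def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of one instance running for the given number of hours."""
    return round(hourly_rate * hours, 2)

on_demand_rate = 0.0416                  # t3.medium On-Demand, $/hour
reserved_rate = on_demand_rate * 0.28    # ~72% savings with a 3-year RI
spot_rate = on_demand_rate * 0.10        # up to ~90% savings on Spot

print(f"On-Demand: ${monthly_cost(on_demand_rate)}/month")
print(f"Reserved:  ${monthly_cost(reserved_rate)}/month")
print(f"Spot:      ${monthly_cost(spot_rate)}/month")
```

The same function also makes the "forgotten instance" math explicit: `monthly_cost(0.768)` for an m5.4xlarge lands near the ~$560/month figure above.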

Console Navigation

AWS Console → EC2 Dashboard → Launch Instance → Choose AMI → Select Instance Type → Configure Security Group → Review & Launch

Terraform — Launch an EC2 Instance

# terraform/ec2.tf — Production-ready EC2 instance

provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "web_server" {
  ami           = "ami-0c02fb55956c7d316"  # Example AMI ID (us-east-1) — AMI IDs are region-specific; look up the current Amazon Linux AMI for your region
  instance_type = "t3.medium"
  subnet_id     = aws_subnet.public.id

  vpc_security_group_ids = [aws_security_group.web_sg.id]

  # Use IAM Role instead of hardcoded credentials
  iam_instance_profile = aws_iam_instance_profile.ec2_profile.name

  root_block_device {
    volume_size = 30
    volume_type = "gp3"
    encrypted   = true
  }

  tags = {
    Name        = "web-server-prod"
    Environment = "production"
    ManagedBy   = "terraform"
  }

  metadata_options {
    http_tokens = "required"  # Enforce IMDSv2 — prevents SSRF attacks
  }
}
💡
Always enforce IMDSv2. Setting http_tokens = "required" blocks the instance metadata endpoint from being exploited via SSRF. This is a top AWS security best practice.

AWS Lambda

Lambda is a serverless compute service. You upload a function, define a trigger (API Gateway, S3 event, SQS message, schedule), and AWS handles all infrastructure. You pay only for the compute time consumed — billed per millisecond.

Key Constraints

| Constraint | Limit | Impact |
|---|---|---|
| Max execution time | 15 minutes | Long-running jobs must use Step Functions or ECS |
| Memory range | 128 MB – 10,240 MB | CPU scales proportionally with memory allocation |
| Deployment package | 50 MB zipped / 250 MB unzipped | Use Lambda Layers or container images for large deps |
| Concurrent executions | 1,000 (default, can be raised) | Throttled requests return HTTP 429 |
| Ephemeral storage | /tmp, up to 10 GB | Not persistent across invocations |

Cold Starts Explained

A cold start happens when Lambda creates a new execution environment for your function. This includes downloading your code, starting the runtime, and running your initialization code. Cold starts typically add 100ms–2s of latency depending on runtime and package size.
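Because initialization code runs only once per execution environment, the standard pattern is to do expensive setup (creating boto3 clients, loading config, opening connection pools) at module scope rather than inside the handler. The sketch below simulates that one-time setup with a counter so the amortization is visible without real AWS calls:

```python
# Cold-start pattern: code at module scope runs ONCE per execution
# environment (during the cold start); the handler runs on every
# invocation. In a real function this is where you would create
# boto3 clients, load config, or open connection pools.
INIT_COUNT = 0

def _expensive_init() -> dict:
    """Simulates one-time setup (e.g. boto3.client(...), config load)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"db_client": "connected", "config": {"env": "production"}}

# Runs during the cold start, before the first invocation
RESOURCES = _expensive_init()

def lambda_handler(event, context):
    # Warm invocations reuse RESOURCES; nothing is re-initialized
    return {"statusCode": 200, "init_runs": INIT_COUNT}

# Two invocations against the same warm environment
first = lambda_handler({}, None)
second = lambda_handler({}, None)
print(first, second)  # init_runs stays at 1 for both
```

Anything created inside the handler instead would be rebuilt on every invocation, paying the setup cost each time.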

Cold Start Mitigation Strategies (latency-sensitive workloads)

- Provisioned Concurrency: pre-warms execution environments. Eliminates cold starts but adds cost (~$0.015/GB-hour).
- SnapStart (Java): caches a snapshot of the initialized execution environment and restores it on invoke.
- Keep-alive pings: schedule a CloudWatch Event to invoke every 5 minutes (budget-friendly but imprecise).

💰
Cost drivers: Number of invocations ($0.20 per 1M) + duration × memory allocated ($0.0000166667 per GB-second). The free tier includes 1M invocations and 400,000 GB-seconds per month. Provisioned concurrency adds a flat hourly charge per pre-warmed environment.
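The request + GB-second formula above, including the free-tier deduction, can be written as a small estimator. The workload numbers in the example are hypothetical:

```python
# Monthly Lambda cost sketch using the pricing above:
# $0.20 per 1M requests + $0.0000166667 per GB-second,
# minus the free tier (1M requests, 400,000 GB-seconds).
def lambda_monthly_cost(invocations: int, avg_ms: float, memory_mb: int) -> float:
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    billable_requests = max(invocations - 1_000_000, 0)
    billable_gb_seconds = max(gb_seconds - 400_000, 0)
    request_cost = billable_requests / 1_000_000 * 0.20
    compute_cost = billable_gb_seconds * 0.0000166667
    return round(request_cost + compute_cost, 2)

# Hypothetical workload: 10M invocations/month, 120 ms average, 256 MB
print(lambda_monthly_cost(10_000_000, 120, 256))  # compute stays in free tier
```

At this volume the compute side (300,000 GB-seconds) still fits inside the free tier, so only the request charge remains.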

Boto3 — Invoke a Lambda Function

# invoke_lambda.py — Invoke a Lambda function programmatically
import json
import boto3

# Create Lambda client (uses IAM Role credentials automatically on EC2/Lambda)
lambda_client = boto3.client("lambda", region_name="us-east-1")

# Synchronous invocation (RequestResponse)
response = lambda_client.invoke(
    FunctionName="my-data-processor",
    InvocationType="RequestResponse",  # Use "Event" for async
    Payload=json.dumps({
        "source_bucket": "raw-data-prod",
        "object_key": "uploads/2026/03/report.csv",
    }),
)

# Parse response payload
result = json.loads(response["Payload"].read())
print(f"Status: {response['StatusCode']}")
print(f"Result: {result}")

# Async invocation — Lambda queues the event and returns immediately
async_response = lambda_client.invoke(
    FunctionName="my-data-processor",
    InvocationType="Event",
    Payload=json.dumps({"mode": "batch"}),
)
print(f"Async status: {async_response['StatusCode']}")  # 202 = accepted

Terraform — Lambda Function with IAM Role

# terraform/lambda.tf — Serverless function with proper IAM role

resource "aws_iam_role" "lambda_exec" {
  name = "lambda-exec-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda_exec.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

data "archive_file" "lambda_zip" {
  type        = "zip"
  source_dir  = "${path.module}/src"
  output_path = "${path.module}/build/function.zip"
}

resource "aws_lambda_function" "processor" {
  function_name    = "my-data-processor"
  runtime          = "python3.12"
  handler          = "handler.lambda_handler"
  role             = aws_iam_role.lambda_exec.arn
  filename         = data.archive_file.lambda_zip.output_path
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256

  memory_size = 256
  timeout     = 30

  environment {
    variables = {
      ENV        = "production"
      LOG_LEVEL  = "INFO"
    }
  }

  tags = {
    ManagedBy = "terraform"
  }
}

Amazon ECS vs EKS

Both services run containerized workloads. The choice depends on your team's Kubernetes expertise and portability requirements.

| Dimension | ECS (Elastic Container Service) | EKS (Elastic Kubernetes Service) |
|---|---|---|
| Orchestrator | AWS-native (proprietary) | Kubernetes (open-source, CNCF) |
| Learning curve | Lower — simpler task definitions | Steeper — full K8s API surface |
| Portability | AWS-only | Multi-cloud (GKE, AKS compatible) |
| Launch types | EC2, Fargate | EC2, Fargate, managed node groups |
| Control plane cost | Free (you pay for compute only) | $0.10/hour ($73/month) per cluster |
| Best for | Teams new to containers; simple microservices | Teams with K8s experience; multi-cloud strategy |
💰
Cost drivers (both): Fargate pricing = vCPU/hour + GB memory/hour. EC2 launch type = underlying EC2 instance cost. EKS adds $73/month per cluster for the managed control plane. For small workloads, ECS on Fargate is the most cost-effective path.
Decision Rule: ECS vs EKS

Choose ECS if your team is AWS-only and wants simplicity. Choose EKS if you need Kubernetes API compatibility, use Helm charts extensively, or plan to run workloads across multiple cloud providers.

Module 2: Storage & Databases

AWS storage and database services range from object stores (S3) to fully managed relational (RDS) and NoSQL (DynamoDB) databases. Choosing the right storage tier and capacity mode is the most impactful cost decision after compute.

Amazon S3

Simple Storage Service (S3) is AWS's object storage. It stores data as objects inside buckets. Each object can be up to 5 TB. S3 provides 99.999999999% (11 nines) durability. It is the backbone for data lakes, backups, static website hosting, and log storage.

Storage Classes

| Class | Use Case | Retrieval | Cost (per GB/month) |
|---|---|---|---|
| S3 Standard | Frequently accessed data | Immediate | ~$0.023 |
| S3 Intelligent-Tiering | Unknown or changing access patterns | Immediate | ~$0.023 (auto-moves to lower tiers) |
| S3 Standard-IA | Infrequent access, rapid retrieval needed | Immediate | ~$0.0125 |
| S3 Glacier Instant | Archive with millisecond access | Immediate | ~$0.004 |
| S3 Glacier Flexible | Archive, retrieval in minutes to hours | 1–12 hours | ~$0.0036 |
| S3 Glacier Deep Archive | Long-term compliance archive | 12–48 hours | ~$0.00099 |

S3 Select

S3 Select lets you retrieve a subset of data from an object using SQL expressions. Instead of downloading a 1 GB CSV and filtering locally, S3 Select pushes the filter to the storage layer — AWS cites query performance improvements of up to 400% and data retrieval cost reductions of up to 80%. It works with CSV, JSON, and Parquet files.
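A hedged sketch of what an S3 Select call looks like with boto3. The bucket, key, and column names are hypothetical, and the actual call needs live AWS credentials, so it is wrapped in a function; the parameter-building part runs anywhere:

```python
# Sketch: push a SQL filter down to S3 with S3 Select instead of
# downloading the whole object. Bucket, key, and columns are examples.
def build_select_params(bucket: str, key: str, sql: str) -> dict:
    """Request parameters for s3.select_object_content()."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Expression": sql,
        "ExpressionType": "SQL",
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }

params = build_select_params(
    bucket="my-app-data-prod",
    key="reports/2026/03/monthly-sales.csv",
    sql="SELECT s.region, s.total FROM s3object s WHERE s.total > '1000'",
)

def run_select(select_params: dict) -> str:
    """Executes the query (requires AWS credentials)."""
    import boto3
    s3 = boto3.client("s3", region_name="us-east-1")
    rows = []
    # The response is an event stream; Records events carry the payload
    for event in s3.select_object_content(**select_params)["Payload"]:
        if "Records" in event:
            rows.append(event["Records"]["Payload"].decode())
    return "".join(rows)
```

Only the rows matching the `WHERE` clause leave S3, which is where the transfer savings come from.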

Bucket Policies

Bucket policies are JSON-based access control statements attached to the bucket. They define who can access which objects and under what conditions. Always deny public access by default and explicitly grant access to specific IAM principals.

// Example S3 Bucket Policy — allow read access from a specific IAM role only
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAppRoleRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/app-backend-role"
      },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-app-data-prod",
        "arn:aws:s3:::my-app-data-prod/*"
      ]
    }
  ]
}
💰
Cost drivers: Storage volume (GB/month) + PUT/GET requests ($0.005 per 1,000 PUTs) + data transfer out ($0.09/GB after first 100 GB). Glacier retrieval fees are the hidden cost — Expedited retrieval from Glacier Flexible costs $0.03/GB.

Boto3 — Upload an Object with Metadata

# s3_upload.py — Upload a file with custom metadata and server-side encryption
import boto3

s3_client = boto3.client("s3", region_name="us-east-1")

# Upload with metadata and encryption
s3_client.upload_file(
    Filename="reports/monthly-sales-2026-03.csv",
    Bucket="my-app-data-prod",
    Key="reports/2026/03/monthly-sales.csv",
    ExtraArgs={
        "Metadata": {
            "uploaded-by": "data-pipeline-v2",
            "report-type": "monthly-sales",
            "fiscal-quarter": "Q1-2026",
        },
        "ServerSideEncryption": "aws:kms",
        "ContentType": "text/csv",
    },
)

print("Upload complete with KMS encryption and custom metadata.")

# Verify by reading back the object metadata
head = s3_client.head_object(
    Bucket="my-app-data-prod",
    Key="reports/2026/03/monthly-sales.csv",
)
print(f"Size: {head['ContentLength']} bytes")
print(f"Metadata: {head['Metadata']}")

Terraform — S3 Bucket with Encryption & Versioning

# terraform/s3.tf — Secure, versioned S3 bucket

resource "aws_s3_bucket" "data" {
  bucket = "my-app-data-prod"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
    bucket_key_enabled = true
  }
}

# Block ALL public access — critical security control
resource "aws_s3_bucket_public_access_block" "data" {
  bucket                  = aws_s3_bucket.data.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Amazon DynamoDB

DynamoDB is a fully managed NoSQL key-value and document database. It delivers single-digit millisecond performance at any scale. The architecture is built on partitioning — understanding keys is essential for both performance and cost.

Key Design

Partition Key & Sort Key (critical design decision)

The Partition Key (PK) determines which physical partition stores the item. A bad PK (e.g., a boolean or low-cardinality field) creates "hot partitions" and throttling. The Sort Key (SK) enables range queries within a partition. Together they form the composite primary key.

| Pattern | Partition Key | Sort Key | Query Example |
|---|---|---|---|
| Orders by customer | CUSTOMER#123 | ORDER#2026-03-31 | All orders for customer 123 |
| IoT sensor data | DEVICE#sensor-42 | TS#1711929600 | Readings in a time range |
| User profiles | USER#alice | PROFILE | Single item lookup |
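The `ENTITY#id` convention in these patterns is usually wrapped in small helpers so key construction stays consistent across the codebase. A minimal sketch (the helper names are illustrative):

```python
# Helpers for the composite-key patterns above. The ENTITY#id
# convention keeps related items in one partition and makes
# sort-key range queries (begins_with, BETWEEN) possible.
def customer_pk(customer_id: str) -> str:
    return f"CUSTOMER#{customer_id}"

def order_sk(order_date: str) -> str:
    # ISO dates sort lexicographically, so range queries work
    return f"ORDER#{order_date}"

item = {
    "PK": customer_pk("123"),
    "SK": order_sk("2026-03-31"),
    "total": 4999,
}
print(item["PK"], item["SK"])

# "All March 2026 orders for customer 123" then becomes a Query with
#   KeyConditionExpression: PK = :pk AND begins_with(SK, :prefix)
#   :pk = "CUSTOMER#123", :prefix = "ORDER#2026-03"
```

Because ISO-8601 dates sort lexicographically, the sort key doubles as a time index with no extra attributes.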

Capacity Modes

| Mode | How It Works | Best For | Pricing |
|---|---|---|---|
| Provisioned | You set WCU (Write Capacity Units) and RCU (Read Capacity Units) manually or with auto-scaling | Predictable, steady traffic | $0.00065/WCU-hour, $0.00013/RCU-hour |
| On-Demand | Pay-per-request. No capacity planning required | Unpredictable or spiky traffic | $1.25 per 1M write units, $0.25 per 1M read units |

1 WCU = 1 write/second for items up to 1 KB. 1 RCU = 1 strongly consistent read/second for items up to 4 KB (or 2 eventually consistent reads). Items larger than these thresholds consume more capacity units proportionally.
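The rounding rules above translate directly into capacity-unit math: writes round up per 1 KB, strongly consistent reads per 4 KB, and eventually consistent reads cost half. A quick calculator:

```python
import math

# Capacity-unit math from the rules above.
def wcu_per_write(item_kb: float) -> int:
    """WCUs consumed by one write: round up per 1 KB."""
    return math.ceil(item_kb / 1)

def rcu_per_read(item_kb: float, strongly_consistent: bool = True) -> float:
    """RCUs consumed by one read: round up per 4 KB,
    halved for eventually consistent reads."""
    units = math.ceil(item_kb / 4)
    return units if strongly_consistent else units / 2

print(wcu_per_write(0.5))   # items under 1 KB still cost a full WCU
print(wcu_per_write(1.5))   # 1.5 KB rounds up to 2 WCUs
print(rcu_per_read(4))      # 4 KB strongly consistent = 1 RCU
print(rcu_per_read(4, strongly_consistent=False))  # 0.5 RCU
```

Note the rounding happens per item, so many small items cost proportionally more than a few large ones of the same total size.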

💰
Cost drivers: Capacity mode (provisioned vs on-demand) + storage ($0.25/GB/month) + data transfer out + Global Tables replication. On-Demand can be 5–7× more expensive than well-tuned Provisioned capacity for steady workloads.

Terraform — DynamoDB Table

# terraform/dynamodb.tf — DynamoDB table with composite key

resource "aws_dynamodb_table" "orders" {
  name         = "orders-prod"
  billing_mode = "PAY_PER_REQUEST"  # On-Demand pricing
  hash_key     = "PK"
  range_key    = "SK"

  attribute {
    name = "PK"
    type = "S"
  }

  attribute {
    name = "SK"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

Amazon RDS

Relational Database Service (RDS) manages relational databases — provisioning, patching, backups, and failover. Supported engines include PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Amazon Aurora (AWS's cloud-native engine).

Multi-AZ Deployments

Multi-AZ provisions a synchronous standby replica in a different Availability Zone. If the primary fails, RDS automatically fails over to the standby (typically within 60–120 seconds). This is the standard for production databases.

Multi-AZ vs Read Replicas (know the difference)

Multi-AZ: High availability — automatic failover, no read traffic served from standby. Read Replicas: Performance scaling — async replication, serves read traffic, no automatic failover to primary role. You can have both simultaneously.

💰
Cost drivers: Instance class × hours + storage type and volume (gp3/io2) + Multi-AZ doubles the instance cost + backup storage beyond the free allocation + data transfer. Aurora Serverless v2 scales to zero but has a minimum ACU charge when active.

Terraform — RDS PostgreSQL with Multi-AZ

# terraform/rds.tf — Production RDS PostgreSQL

resource "aws_db_instance" "postgres" {
  identifier     = "app-db-prod"
  engine         = "postgres"
  engine_version = "16.2"
  instance_class = "db.r6i.large"

  allocated_storage     = 100
  max_allocated_storage = 500   # Auto-scaling up to 500 GB
  storage_type          = "gp3"
  storage_encrypted     = true

  db_name  = "appdb"
  username = "app_admin"
  # Password sourced from AWS Secrets Manager — never hardcode
  manage_master_user_password = true

  multi_az               = true   # High availability
  db_subnet_group_name   = aws_db_subnet_group.private.name
  vpc_security_group_ids = [aws_security_group.db_sg.id]

  backup_retention_period = 7
  deletion_protection     = true
  skip_final_snapshot     = false
  final_snapshot_identifier = "app-db-prod-final"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
💡
Never hardcode database passwords in Terraform. Use manage_master_user_password = true to let RDS store and rotate credentials in AWS Secrets Manager automatically.

Module 3: Networking & Security

Networking is the foundation of every AWS deployment. A misconfigured VPC, an overly permissive security group, or a missing IAM policy can expose your entire infrastructure. This module covers the core networking and identity building blocks.

Amazon VPC

A Virtual Private Cloud (VPC) is your isolated network within AWS. You define the IP address range (CIDR block), create subnets, attach internet gateways, and configure route tables. Every resource you launch lives inside a VPC.

Core Components

| Component | Purpose | Key Detail |
|---|---|---|
| VPC | Isolated virtual network | Define CIDR block (e.g., 10.0.0.0/16 = 65,536 IPs) |
| Subnet | Segment of the VPC tied to one AZ | Public or private based on route table |
| Internet Gateway (IGW) | Connects VPC to the internet | One per VPC, no bandwidth limits, no cost |
| NAT Gateway | Lets private subnets reach the internet (outbound only) | $0.045/hour + $0.045/GB processed |
| Route Table | Rules determining where traffic goes | Each subnet is associated with exactly one route table |

Public vs Private Subnets

Public Subnet

Route table has a route to the Internet Gateway (0.0.0.0/0 → igw-xxx). Instances with public IPs can be reached from the internet. Use for load balancers, bastion hosts.

Private Subnet

No route to the Internet Gateway. Outbound internet access (for patches, API calls) goes through a NAT Gateway (0.0.0.0/0 → nat-xxx). Use for application servers, databases, and all backend services.

💰
NAT Gateway is the #1 hidden cost. At $0.045/hour ($32.40/month) + $0.045/GB processed, a NAT Gateway handling 1 TB/month costs $77/month. Use VPC endpoints (free for S3/DynamoDB) to avoid routing AWS service traffic through NAT.
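The two NAT charges above combine into a simple monthly formula, and the fixed hourly part means even an idle NAT Gateway bills around $32/month:

```python
# NAT Gateway monthly cost from the rates above (us-east-1):
# $0.045/hour + $0.045 per GB processed.
HOURLY_RATE = 0.045
PER_GB_RATE = 0.045
HOURS_PER_MONTH = 720

def nat_monthly_cost(gb_processed: float) -> float:
    return round(HOURLY_RATE * HOURS_PER_MONTH + PER_GB_RATE * gb_processed, 2)

print(nat_monthly_cost(0))      # an idle NAT still bills hourly
print(nat_monthly_cost(1000))   # the 1 TB/month example above
```

Traffic to S3 or DynamoDB routed through a gateway VPC endpoint instead of the NAT drops the per-GB term for that traffic to zero.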

Console Navigation

VPC Dashboard → Your VPCs → Create VPC → VPC and more (wizard)

The "VPC and more" wizard creates a VPC with public/private subnets, route tables, NAT Gateway, and an Internet Gateway in one step.

Terraform — VPC with Public & Private Subnets

# terraform/vpc.tf — Production VPC layout

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "main-vpc" }
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "main-igw" }
}

# Public subnet
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true

  tags = { Name = "public-subnet-1a" }
}

# Private subnet
resource "aws_subnet" "private" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "us-east-1a"

  tags = { Name = "private-subnet-1a" }
}

# Route table — public (routes to Internet Gateway)
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }

  tags = { Name = "public-rt" }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

# NAT Gateway for private subnet outbound access
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id  # NAT GW lives in the public subnet

  tags = { Name = "main-nat" }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }

  tags = { Name = "private-rt" }
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}

AWS IAM

Identity and Access Management (IAM) controls who can do what on which resources. It is the single most important AWS service. Every API call is authenticated and authorized through IAM. There is no cost for IAM — it is included with every AWS account.

Core Concepts

| Entity | What It Is | When to Use |
|---|---|---|
| Users | Individual identity with long-term credentials | Human operators needing console/CLI access |
| Groups | Collection of users sharing the same permissions | Team-based access (e.g., "Developers", "DBAs") |
| Roles | Temporary identity assumed by services or users | EC2 instances, Lambda functions, cross-account access |
| Policies | JSON document defining allow/deny permissions | Attached to Users, Groups, or Roles |
Rule: Always Prefer Roles Over Access Keys

IAM Roles provide temporary credentials that rotate automatically. Access Keys are long-lived and must be manually rotated. EC2 instances, Lambda functions, and ECS tasks should always use IAM Roles — never embedded Access Keys.

Policy Structure

// IAM Policy — least-privilege access to a specific S3 bucket
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ReadWrite",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-data-prod",
        "arn:aws:s3:::my-app-data-prod/*"
      ]
    },
    {
      "Sid": "DenyAllOtherS3",
      "Effect": "Deny",
      "Action": "s3:*",
      "NotResource": [
        "arn:aws:s3:::my-app-data-prod",
        "arn:aws:s3:::my-app-data-prod/*"
      ]
    }
  ]
}

Boto3 — Assume a Role Programmatically

# assume_role.py — Assume an IAM Role and use temporary credentials
import boto3

# STS client — used to assume roles
sts_client = boto3.client("sts", region_name="us-east-1")

# Assume the cross-account or service role
assumed = sts_client.assume_role(
    RoleArn="arn:aws:iam::987654321098:role/cross-account-data-reader",
    RoleSessionName="data-pipeline-session",
    DurationSeconds=3600,  # Session lifetime (capped by the role's max session duration)
)

# Extract temporary credentials
creds = assumed["Credentials"]

# Create a new session with the assumed role's credentials
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Use the session to access resources in the target account
s3 = session.client("s3")
objects = s3.list_objects_v2(
    Bucket="target-account-data-bucket",
    Prefix="exports/",
    MaxKeys=10,
)

for obj in objects.get("Contents", []):
    print(f"  {obj['Key']} — {obj['Size']} bytes")

print(f"Session expires: {creds['Expiration']}")
Never use * in Action or Resource. "Action": "s3:*", "Resource": "*" grants full S3 access to every bucket in your account. Always scope policies to specific actions and specific resource ARNs.

Security Groups vs NACLs

AWS provides two layers of network filtering. Understanding the difference between stateful and stateless processing is critical for designing secure architectures.

| Feature | Security Group (SG) | Network ACL (NACL) |
|---|---|---|
| Level | Instance (ENI) level | Subnet level |
| Statefulness | Stateful — return traffic is automatically allowed | Stateless — must explicitly allow both inbound and outbound |
| Default behavior | Denies all inbound, allows all outbound | Allows all inbound and outbound |
| Rule type | Allow rules only (no deny rules) | Both Allow and Deny rules with priority numbers |
| Evaluation | All rules evaluated together | Rules evaluated in number order — first match wins |
| Scope | Applied to specific instances | Applied to all instances in the subnet |
Stateful vs Stateless — What It Means

Stateful (SG): If you allow inbound HTTP (port 80), the response traffic is automatically allowed out — you don't need a separate outbound rule. Stateless (NACL): You must create both an inbound rule (port 80) AND an outbound rule for ephemeral ports (1024–65535) for the response to reach the client.

Terraform — Security Group Example

# terraform/security_group.tf — Web server security group

resource "aws_security_group" "web_sg" {
  name        = "web-server-sg"
  description = "Allow HTTPS inbound and all outbound"
  vpc_id      = aws_vpc.main.id

  # Inbound — HTTPS only
  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Outbound — all traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "web-server-sg" }
}

# Database SG — only accepts traffic from the web tier
resource "aws_security_group" "db_sg" {
  name        = "database-sg"
  description = "PostgreSQL from web tier only"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "PostgreSQL from web servers"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.web_sg.id]  # Reference by SG, not CIDR
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "database-sg" }
}
💡
Reference Security Groups by ID, not CIDR. In the database SG above, we reference security_groups = [web_sg.id] instead of a CIDR range. This means only instances attached to the web SG can reach the database — regardless of their IP address.

Module 4: Pricing & Cost Optimization

AWS bills by the second, by the byte, and by the API call. Without active cost management, a development account can easily reach $1,000+/month from forgotten resources. This module covers the three essential tools for cost visibility and control.

AWS Pricing Calculator

The AWS Pricing Calculator (calculator.aws) lets you model complex multi-service architectures before deploying. You add each service, configure its parameters (instance type, storage, requests/month), and get a monthly/annual estimate.

How to Estimate a Typical Architecture

Open calculator.aws → Create Estimate → Add Service (EC2, RDS, etc.) → Configure Parameters → Review Total

Real-World Example Estimate

| Service | Configuration | Monthly Cost |
|---|---|---|
| EC2 (web tier) | 2× m6i.large, On-Demand, us-east-1 | ~$140 |
| RDS PostgreSQL | db.r6i.large, Multi-AZ, 100 GB gp3 | ~$370 |
| S3 | 500 GB Standard, 10M GET, 1M PUT | ~$17 |
| NAT Gateway | 1 gateway, 500 GB processed | ~$55 |
| ALB | Application Load Balancer, 50 LCU-hours | ~$35 |
| Data Transfer | 200 GB egress to internet | ~$18 |
| Total Estimate | | ~$635/month |
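Summing the line items confirms the total and makes the networking share explicit, which is the part teams most often leave out of estimates:

```python
# Sanity-check the example estimate by summing the line items.
estimate = {
    "EC2 (web tier)": 140,
    "RDS PostgreSQL": 370,
    "S3": 17,
    "NAT Gateway": 55,
    "ALB": 35,
    "Data Transfer": 18,
}
total = sum(estimate.values())
networking = estimate["NAT Gateway"] + estimate["Data Transfer"]

print(f"Total: ~${total}/month")
print(f"Networking share: {networking / total:.0%}")
```

Here networking is roughly a tenth of the bill; omitting it is exactly how compute-only estimates end up 20–40% under the real invoice.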
Always include NAT Gateway and Data Transfer in estimates. These are the two most commonly overlooked cost components. Teams frequently estimate compute and storage but forget networking costs entirely, leading to 20–40% budget overruns.

Cost Explorer

Cost Explorer is your post-deployment cost analysis tool. It visualizes spending trends, breaks down costs by service/region/tag, and identifies anomalies. Enable it from the Billing Dashboard — it takes 24 hours to populate historical data.

Finding Hidden Costs

The most common "surprise" line items in AWS bills:

NAT Gateway Processing
$0.045/GB for all traffic that flows through the NAT. A chatty microservice downloading 2 TB/month from external APIs costs $90 in NAT fees alone — on top of data transfer charges.
Data Transfer (Egress)
$0.09/GB after the first 100 GB free. Cross-region transfer costs $0.02/GB. Inter-AZ traffic costs $0.01/GB per direction — this adds up quickly with multi-AZ deployments.
Elastic IPs (Unused)
An Elastic IP was historically free while attached to a running instance, but since February 2024 AWS charges $0.005/hour ($3.60/month) for every public IPv4 address, attached or not. Unattached EIPs deliver nothing for that cost. Check for orphaned EIPs regularly.
EBS Snapshots
Snapshots are incremental but accumulate. Hundreds of old snapshots from terminated instances can cost $50–200/month. Set lifecycle policies to auto-delete old snapshots.
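An orphaned-EIP check is a few lines of filtering. In a live account the address list would come from the EC2 `describe_addresses()` API via boto3; inline sample data stands in here so the logic is self-contained, and the IPs and IDs are made up:

```python
# Sketch: flag unattached Elastic IPs and price the waste.
# Real data would come from boto3's EC2 describe_addresses().
EIP_HOURLY = 0.005
HOURS_PER_MONTH = 720

sample_addresses = [
    {"PublicIp": "3.91.10.1", "AllocationId": "eipalloc-aaa",
     "AssociationId": "eipassoc-111"},                          # attached
    {"PublicIp": "3.91.10.2", "AllocationId": "eipalloc-bbb"},  # orphaned
    {"PublicIp": "3.91.10.3", "AllocationId": "eipalloc-ccc"},  # orphaned
]

# An address with no AssociationId is not attached to anything
orphaned = [a for a in sample_addresses if "AssociationId" not in a]
monthly_waste = round(len(orphaned) * EIP_HOURLY * HOURS_PER_MONTH, 2)

for addr in orphaned:
    print(f"Unattached EIP: {addr['PublicIp']} ({addr['AllocationId']})")
print(f"Monthly waste: ${monthly_waste}")
```

The same filter-and-price pattern works for old EBS snapshots or stopped instances with attached volumes.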

Console Navigation

AWS Console → Billing & Cost Management → Cost Explorer → Filter by Service or Tag
💡
Tag everything. Cost Explorer can group costs by tags like Environment, Team, or Project. Without tags, you cannot attribute costs — and you cannot optimize what you cannot measure.

AWS Budgets

AWS Budgets lets you set custom spending thresholds and receive alerts via email or SNS when actual or forecasted costs exceed your budget. This is the simplest and most effective way to prevent unexpected bills.

Setting Up a Budget

Billing Dashboard → Budgets → Create Budget → Cost Budget → Set Amount & Alerts

Recommended Alert Thresholds

| Threshold | Alert Type | Action |
|---|---|---|
| 50% of budget | Email to team lead | Early awareness — check for anomalies |
| 80% of budget | Email to team + SNS topic | Review and take corrective action |
| 100% of budget | Email + SNS + Lambda trigger | Automated remediation (stop non-prod instances) |
| Forecasted > 120% | Email to finance + engineering lead | Immediate investigation required |

Terraform — Budget with Alerts

# terraform/budgets.tf — Monthly budget with email alerts

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-account-budget"
  budget_type  = "COST"
  limit_amount = "500"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Alert at 80% of actual spend
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = ["cloud-team@example.com"]
  }

  # Alert when forecast exceeds budget
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_email_addresses = ["cloud-team@example.com", "finance@example.com"]
  }
}
Budgets don't stop spending. By default, AWS Budgets only sends notifications. To actually stop resources, you need to combine Budgets with an SNS topic that triggers a Lambda function to take action (e.g., stopping EC2 instances, disabling access keys).

Module 5: Common Pitfalls & Security Risks

These three patterns account for the majority of AWS security incidents and cost overruns. Understanding them is more valuable than memorizing individual service features.

Pitfall #1: Public S3 Buckets

Publicly accessible S3 buckets are the #1 cause of cloud data breaches. Misconfigured bucket policies or legacy ACLs can expose customer data, credentials, database backups, and intellectual property to anyone on the internet.

How It Happens

Prevention: S3 Block Public Access (account-level setting)

Enable S3 Block Public Access at the account level (not just the bucket level). This overrides any bucket policy or ACL that attempts to make data public. It is enabled by default for new accounts since April 2023.

Console Check

S3 Dashboard → Block Public Access settings for this account → Verify all 4 options are ON
# terraform/s3_account_block.tf — Account-level public access block

resource "aws_s3_account_public_access_block" "account" {
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
🚨
Real-world impact: In 2017, a misconfigured S3 bucket exposed 198 million US voter records. In 2019, Capital One's breach exposed 100 million credit applications via an SSRF attack that accessed S3. Always enable Block Public Access and audit bucket policies with AWS Config rules.
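The account-level setting above can also be audited programmatically via the S3 Control API, which is useful in a compliance script run across many accounts. A sketch, assuming the placeholder account ID 111122223333:

```python
# audit_public_access.py — sketch: verify all four account-level Block
# Public Access flags via the S3 Control API. The account ID is a
# placeholder; substitute your own.

REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "BlockPublicPolicy",
    "IgnorePublicAcls",
    "RestrictPublicBuckets",
)


def all_flags_on(config):
    """True only if every one of the four Block Public Access flags is set."""
    return all(config.get(flag) is True for flag in REQUIRED_FLAGS)


def check_account(account_id="111122223333"):
    import boto3  # imported here so all_flags_on() is usable without boto3

    s3control = boto3.client("s3control")
    resp = s3control.get_public_access_block(AccountId=account_id)
    return all_flags_on(resp["PublicAccessBlockConfiguration"])
```

Note that get_public_access_block raises NoSuchPublicAccessBlockConfiguration if the account has never configured the setting at all — treat that case as a failure, not a pass.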

Pitfall #2: Over-Provisioning

Developers and architects consistently choose instances that are 2–4× larger than needed. This is the most common and most expensive mistake in cloud computing — often costing organizations thousands per month in wasted compute.

Why It Happens

Teams size for imagined peak load rather than measured demand, copy production instance types into dev and staging environments, and rarely revisit sizing once a workload is running. An oversized instance never pages anyone, so the waste goes unnoticed.
How to Fix It

ToolWhat It DoesAction
AWS Compute OptimizerAnalyzes CloudWatch metrics and recommends right-sized instancesReview recommendations monthly
Trusted AdvisorFlags underutilized EC2 instances (CPU < 10% for 14 days)Downsize or terminate
CloudWatch AlarmsMonitor CPU, memory, and network utilizationSet alerts for avg CPU < 20%
Savings PlansCommit to $/hour spend (not instance type) for 1–3 yearsApply after right-sizing
Right-Sizing Workflow (Monthly Review)

Step 1: Enable Compute Optimizer (free).
Step 2: Wait 14 days for baseline data.
Step 3: Review "Over-provisioned" findings.
Step 4: Resize instances (stop → change type → start).
Step 5: Only then commit to Reserved Instances or Savings Plans.
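Step 3 of the workflow can be scripted with the Compute Optimizer API. A sketch, assuming Compute Optimizer has already been enabled and has accumulated its baseline data (filter and response key names below are my best reading of the API and should be verified against the current Boto3 reference):

```python
# list_overprovisioned.py — sketch: pull Compute Optimizer's over-provisioned
# EC2 findings and map each instance to its top recommended type.

def summarize(recommendations):
    """Map instance ARN -> first recommended instance type from a
    get_ec2_instance_recommendations response."""
    return {
        rec["instanceArn"]: rec["recommendationOptions"][0]["instanceType"]
        for rec in recommendations
        if rec.get("recommendationOptions")
    }


def fetch_overprovisioned():
    import boto3  # imported here so summarize() is usable without boto3

    co = boto3.client("compute-optimizer")
    resp = co.get_ec2_instance_recommendations(
        filters=[{"name": "Finding", "values": ["Overprovisioned"]}]
    )
    return summarize(resp["instanceRecommendations"])
```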

💰
Real savings example: Downsizing 10 EC2 instances from m5.2xlarge ($0.384/hr) to m5.large ($0.096/hr) saves $2,073/month — over $24,000/year. This is a 10-minute change in the console.
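The arithmetic behind that figure, assuming a 720-hour billing month:

```python
# savings_math.py — reproduce the downsizing example: 10 instances moved
# from m5.2xlarge ($0.384/hr) to m5.large ($0.096/hr), 720 hours/month.

def monthly_savings(old_rate, new_rate, count, hours=720):
    """Monthly savings in dollars from changing the hourly rate on `count`
    instances running `hours` per month."""
    return (old_rate - new_rate) * count * hours


saving = monthly_savings(0.384, 0.096, 10)
print(f"${saving:,.2f}/month, ${saving * 12:,.2f}/year")
# → $2,073.60/month, $24,883.20/year
```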

Pitfall #3: Hardcoded Credentials

Embedding AWS Access Key IDs and Secret Access Keys directly in source code, config files, or environment variables on EC2 instances is a severe security anti-pattern. Leaked credentials are the fastest path to a compromised AWS account.

Common Anti-Patterns

# ❌ NEVER DO THIS — hardcoded credentials in application code
import boto3

# These credentials will end up in Git history, CI logs, and error reports
client = boto3.client(
    "s3",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",        # ❌ NEVER
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxR...",  # ❌ NEVER
)

The Correct Approach: IAM Roles

# ✅ CORRECT — Boto3 automatically uses IAM Role credentials
import boto3

# When running on EC2, Lambda, or ECS, boto3 automatically discovers
# temporary credentials from the instance metadata service (IMDS)
# or the ECS task role. No credentials needed in code.
client = boto3.client("s3", region_name="us-east-1")

# This "just works" because the EC2 instance / Lambda function
# has an IAM Role attached with the necessary S3 permissions
response = client.list_objects_v2(Bucket="my-app-data-prod")

How IAM Roles Work for Services

ServiceHow Role Is AttachedCredential Source
EC2Instance Profile (IAM Role)Instance Metadata Service (IMDSv2)
LambdaExecution Role (configured in function settings)Environment variables (auto-injected)
ECSTask Role (in task definition)Task metadata endpoint
EKSIAM Roles for Service Accounts (IRSA)OIDC token exchange
If You Find Leaked Credentials (Incident Response)

Step 1: Immediately deactivate the Access Key in IAM (do NOT delete it yet; audit first).
Step 2: Check CloudTrail for unauthorized API calls using those credentials.
Step 3: Rotate any secrets the compromised role had access to.
Step 4: Enable GuardDuty to detect future credential misuse.
Step 5: Delete the key after investigation.
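Steps 1 and 2 can be sketched in Boto3. The user name and key ID below are placeholders (the key ID is the AWS documentation example value, not a real credential):

```python
# revoke_and_audit.py — sketch: deactivate a leaked access key, then pull
# the recent CloudTrail events made with it.

def lookup_attributes(access_key_id):
    """CloudTrail lookup_events attribute list for a specific access key."""
    return [{"AttributeKey": "AccessKeyId", "AttributeValue": access_key_id}]


def deactivate_and_audit(user_name, access_key_id):
    import boto3  # imported here so lookup_attributes() is usable without boto3

    iam = boto3.client("iam")
    cloudtrail = boto3.client("cloudtrail")

    # Step 1: deactivate, don't delete — the key stays visible for the audit
    iam.update_access_key(
        UserName=user_name, AccessKeyId=access_key_id, Status="Inactive"
    )

    # Step 2: list recent API calls made with the compromised key
    events = cloudtrail.lookup_events(
        LookupAttributes=lookup_attributes(access_key_id), MaxResults=50
    )
    return events["Events"]


# Example (placeholder values):
# deactivate_and_audit("ci-deploy-user", "AKIAIOSFODNN7EXAMPLE")
```

CloudTrail's lookup_events only covers roughly the last 90 days of management events; for older activity, query the trail's S3 archive with Athena.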

Prevention: Use AWS Secrets Manager

For secrets that cannot be replaced by IAM Roles (third-party API keys, database passwords for non-RDS databases), use AWS Secrets Manager to store and rotate them.

# retrieve_secret.py — Fetch a secret from AWS Secrets Manager
import json
import boto3

# Uses IAM Role credentials — no hardcoded keys needed
secrets_client = boto3.client("secretsmanager", region_name="us-east-1")

response = secrets_client.get_secret_value(
    SecretId="prod/myapp/db-credentials"
)

secret = json.loads(response["SecretString"])
db_host = secret["host"]
db_user = secret["username"]
db_pass = secret["password"]

print(f"Connecting to {db_host} as {db_user}")
🚨
AWS scans GitHub for leaked keys. If AWS detects an Access Key in a public GitHub repository, they notify you and may automatically quarantine the key. But attackers scan faster — automated bots can find and exploit leaked keys within minutes. Never commit credentials to version control.