AWS Cloud Engineer Handbook
A practical, billing-aware guide to AWS architecture — compute, storage, networking, cost optimization, and security patterns with Terraform and Boto3 examples.
Module 1: Compute & Serverless
AWS offers compute at every abstraction level — from bare-metal EC2 instances you fully manage, to Lambda functions where you only write the handler. Choosing the right compute model is the single highest-impact cost decision you will make.
Amazon EC2
Elastic Compute Cloud (EC2) provides resizable virtual machines in the cloud. You choose the Instance Type (CPU, memory, network), the AMI (operating system image), and the pricing model. EC2 is the foundation for most AWS workloads.
Instance Types
| Family | Use Case | Example Type | vCPUs / RAM |
|---|---|---|---|
| t3 / t3a | General purpose, burstable — dev/test, small APIs | t3.medium | 2 / 4 GiB |
| m5 / m6i | General purpose, steady state — production web apps | m6i.xlarge | 4 / 16 GiB |
| c5 / c6i | Compute-optimized — batch processing, ML inference | c6i.2xlarge | 8 / 16 GiB |
| r5 / r6i | Memory-optimized — in-memory caches, large databases | r6i.xlarge | 4 / 32 GiB |
| g5 / p4d | GPU — ML training, video rendering | g5.xlarge | 4 / 16 GiB + GPU |
AMIs (Amazon Machine Images)
An AMI is a pre-built OS snapshot. AWS provides Amazon Linux 2023, Ubuntu, and Windows Server base images. You can create custom AMIs with your dependencies pre-installed to speed up instance boot times. Custom AMIs are backed by EBS snapshots (stored in S3 behind the scenes) and incur snapshot storage costs.
Pricing Models
EC2 offers four pricing models: On-Demand (per-second billing, no commitment), Reserved Instances and Savings Plans (1- or 3-year commitments for discounts of up to ~72%), and Spot (spare capacity at up to 90% off, interruptible with a two-minute warning). An m5.4xlarge running 24/7 On-Demand costs ~$560/month. Always tag instances and set billing alerts.
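Monthly cost math is worth internalizing: hourly rate × ~730 hours. A minimal sketch, using illustrative us-east-1 On-Demand rates (check current pricing before relying on them):

```python
# Back-of-the-envelope EC2 monthly cost, using ~730 hours/month.
# Rates are illustrative us-east-1 On-Demand prices, not authoritative.
HOURS_PER_MONTH = 730

on_demand_hourly = {
    "t3.medium": 0.0416,
    "m6i.xlarge": 0.192,
    "m5.4xlarge": 0.768,
}

def monthly_cost(instance_type: str, count: int = 1) -> float:
    """On-Demand cost for `count` instances running 24/7 for a month."""
    return on_demand_hourly[instance_type] * HOURS_PER_MONTH * count

print(f"m5.4xlarge 24/7: ${monthly_cost('m5.4xlarge'):.2f}/month")  # $560.64
```

Running the same arithmetic before launching any instance family makes the cost difference between, say, t3.medium (~$30/month) and m5.4xlarge (~$560/month) concrete.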
Terraform — Launch an EC2 Instance
```hcl
# terraform/ec2.tf — Production-ready EC2 instance

provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "web_server" {
  ami                    = "ami-0c02fb55956c7d316" # Amazon Linux 2023 (us-east-1)
  instance_type          = "t3.medium"
  subnet_id              = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web_sg.id]

  # Use IAM Role instead of hardcoded credentials
  iam_instance_profile = aws_iam_instance_profile.ec2_profile.name

  root_block_device {
    volume_size = 30
    volume_type = "gp3"
    encrypted   = true
  }

  tags = {
    Name        = "web-server-prod"
    Environment = "production"
    ManagedBy   = "terraform"
  }

  metadata_options {
    http_tokens = "required" # Enforce IMDSv2 — prevents SSRF attacks
  }
}
```
http_tokens = "required" blocks the instance metadata endpoint from being exploited via SSRF. This is a top AWS security best practice.
AWS Lambda
Lambda is a serverless compute service. You upload a function, define a trigger (API Gateway, S3 event, SQS message, schedule), and AWS handles all infrastructure. You pay only for the compute time consumed — billed per millisecond.
Key Constraints
| Constraint | Limit | Impact |
|---|---|---|
| Max execution time | 15 minutes | Long-running jobs must use Step Functions or ECS |
| Memory range | 128 MB — 10,240 MB | CPU scales proportionally with memory allocation |
| Deployment package | 50 MB zipped / 250 MB unzipped | Use Lambda Layers or container images for large deps |
| Concurrent executions | 1,000 (default, can be raised) | Throttled requests return HTTP 429 |
| Ephemeral storage | /tmp — up to 10 GB | Not persistent across invocations |
Cold Starts Explained
A cold start happens when Lambda creates a new execution environment for your function. This includes downloading your code, starting the runtime, and running your initialization code. Cold starts typically add 100ms–2s of latency depending on runtime and package size.
Common mitigations:
- Provisioned Concurrency: pre-warms execution environments. Eliminates cold starts but adds cost (~$0.015/GB-hour).
- SnapStart (Java): caches initialized snapshots.
- Keep-alive pings: schedule a CloudWatch Event to invoke every 5 minutes (budget-friendly but imprecise).
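The cheapest mitigation is structural: put heavy setup at module scope, where it runs once per execution environment, instead of inside the handler, where it runs on every invocation. A minimal sketch (the handler and counter are illustrative, not AWS APIs):

```python
# Sketch: module-scope code runs once per execution environment (the cold
# start); the handler body runs on every invocation. Heavy setup therefore
# belongs at module scope so warm invocations skip it.
import time

INIT_COUNT = 0

def _expensive_init():
    global INIT_COUNT
    INIT_COUNT += 1          # stands in for DB pools, config loads, etc.
    return {"ready_at": time.time()}

_STATE = _expensive_init()   # module scope: executed on cold start only

def lambda_handler(event, context):
    # Warm invocations reuse _STATE instead of re-initializing.
    return {"init_runs": INIT_COUNT, "mode": event.get("mode", "default")}

# Simulating two invocations in one environment: init ran exactly once.
print(lambda_handler({}, None))
print(lambda_handler({"mode": "batch"}, None))
```

In a real Lambda environment the module is imported once per cold start, so every warm invocation sees the already-built _STATE.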
Boto3 — Invoke a Lambda Function
```python
# invoke_lambda.py — Invoke a Lambda function programmatically
import json

import boto3

# Create Lambda client (uses IAM Role credentials automatically on EC2/Lambda)
lambda_client = boto3.client("lambda", region_name="us-east-1")

# Synchronous invocation (RequestResponse)
response = lambda_client.invoke(
    FunctionName="my-data-processor",
    InvocationType="RequestResponse",  # Use "Event" for async
    Payload=json.dumps({
        "source_bucket": "raw-data-prod",
        "object_key": "uploads/2026/03/report.csv",
    }),
)

# Parse response payload
result = json.loads(response["Payload"].read())
print(f"Status: {response['StatusCode']}")
print(f"Result: {result}")

# Async invocation — Lambda queues the event and returns immediately
async_response = lambda_client.invoke(
    FunctionName="my-data-processor",
    InvocationType="Event",
    Payload=json.dumps({"mode": "batch"}),
)
print(f"Async status: {async_response['StatusCode']}")  # 202 = accepted
```
Terraform — Lambda Function with IAM Role
```hcl
# terraform/lambda.tf — Serverless function with proper IAM role

resource "aws_iam_role" "lambda_exec" {
  name = "lambda-exec-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda_exec.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

data "archive_file" "lambda_zip" {
  type        = "zip"
  source_dir  = "${path.module}/src"
  output_path = "${path.module}/build/function.zip"
}

resource "aws_lambda_function" "processor" {
  function_name = "my-data-processor"
  runtime       = "python3.12"
  handler       = "handler.lambda_handler"
  role          = aws_iam_role.lambda_exec.arn

  filename         = data.archive_file.lambda_zip.output_path
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256

  memory_size = 256
  timeout     = 30

  environment {
    variables = {
      ENV       = "production"
      LOG_LEVEL = "INFO"
    }
  }

  tags = { ManagedBy = "terraform" }
}
```
Amazon ECS vs EKS
Both services run containerized workloads. The choice depends on your team's Kubernetes expertise and portability requirements.
| Dimension | ECS (Elastic Container Service) | EKS (Elastic Kubernetes Service) |
|---|---|---|
| Orchestrator | AWS-native (proprietary) | Kubernetes (open-source, CNCF) |
| Learning curve | Lower — simpler task definitions | Steeper — full K8s API surface |
| Portability | AWS-only | Multi-cloud (GKE, AKS compatible) |
| Launch types | EC2, Fargate | EC2, Fargate, managed node groups |
| Control plane cost | Free (you pay for compute only) | $0.10/hour ($73/month) per cluster |
| Best for | Teams new to containers; simple microservices | Teams with K8s experience; multi-cloud strategy |
Choose ECS if your team is AWS-only and wants simplicity. Choose EKS if you need Kubernetes API compatibility, use Helm charts extensively, or plan to run workloads across multiple cloud providers.
Module 2: Storage & Databases
AWS storage and database services range from object stores (S3) to fully managed relational (RDS) and NoSQL (DynamoDB) databases. Choosing the right storage tier and capacity mode is the most impactful cost decision after compute.
Amazon S3
Simple Storage Service (S3) is AWS's object storage. It stores data as objects inside buckets. Each object can be up to 5 TB. S3 provides 99.999999999% (11 nines) durability. It is the backbone for data lakes, backups, static website hosting, and log storage.
Storage Classes
| Class | Use Case | Retrieval | Cost (per GB/month) |
|---|---|---|---|
| S3 Standard | Frequently accessed data | Immediate | ~$0.023 |
| S3 Intelligent-Tiering | Unknown or changing access patterns | Immediate | ~$0.023 (auto-moves to lower tiers) |
| S3 Standard-IA | Infrequent access, rapid retrieval needed | Immediate | ~$0.0125 |
| S3 Glacier Instant | Archive with millisecond access | Immediate | ~$0.004 |
| S3 Glacier Flexible | Archive, retrieval in minutes to hours | Minutes–12 hours | ~$0.0036 |
| S3 Glacier Deep Archive | Long-term compliance archive | 12–48 hours | ~$0.00099 |
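The per-GB rates in the table compound quickly at scale. A quick sketch comparing classes for a fixed data volume (rates taken from the table above; actual prices vary by region and change over time):

```python
# Rough monthly storage cost per class, using the per-GB rates from the
# table above (us-east-1 ballpark figures, not authoritative).
RATES = {
    "Standard": 0.023,
    "Standard-IA": 0.0125,
    "Glacier Instant": 0.004,
    "Glacier Deep Archive": 0.00099,
}

def storage_cost(gb: int, storage_class: str) -> float:
    """Monthly storage-only cost (excludes requests and retrieval fees)."""
    return gb * RATES[storage_class]

# 10 TB of compliance archives: Standard vs Deep Archive
print(f"Standard:     ${storage_cost(10_000, 'Standard'):.2f}/month")              # $230.00
print(f"Deep Archive: ${storage_cost(10_000, 'Glacier Deep Archive'):.2f}/month")  # $9.90
```

Note the sketch ignores retrieval and request charges, which is exactly why Deep Archive only wins for data you almost never read.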
S3 Select
S3 Select lets you retrieve a subset of data from an object using SQL expressions. Instead of downloading a 1 GB CSV and filtering locally, S3 Select pushes the filter to the storage layer — cutting the data transferred and, in AWS's benchmarks, improving query performance by up to 400%. It works with CSV, JSON, and Parquet files.
Bucket Policies
Bucket policies are JSON-based access control statements attached to the bucket. They define who can access which objects and under what conditions. Always deny public access by default and explicitly grant access to specific IAM principals.
```json
// Example S3 Bucket Policy — allow read access from a specific IAM role only
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAppRoleRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/app-backend-role"
      },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-app-data-prod",
        "arn:aws:s3:::my-app-data-prod/*"
      ]
    }
  ]
}
```
Boto3 — Upload an Object with Metadata
```python
# s3_upload.py — Upload a file with custom metadata and server-side encryption
import boto3

s3_client = boto3.client("s3", region_name="us-east-1")

# Upload with metadata and encryption
s3_client.upload_file(
    Filename="reports/monthly-sales-2026-03.csv",
    Bucket="my-app-data-prod",
    Key="reports/2026/03/monthly-sales.csv",
    ExtraArgs={
        "Metadata": {
            "uploaded-by": "data-pipeline-v2",
            "report-type": "monthly-sales",
            "fiscal-quarter": "Q1-2026",
        },
        "ServerSideEncryption": "aws:kms",
        "ContentType": "text/csv",
    },
)
print("Upload complete with KMS encryption and custom metadata.")

# Verify by reading back the object metadata
head = s3_client.head_object(
    Bucket="my-app-data-prod",
    Key="reports/2026/03/monthly-sales.csv",
)
print(f"Size: {head['ContentLength']} bytes")
print(f"Metadata: {head['Metadata']}")
```
Terraform — S3 Bucket with Encryption & Versioning
```hcl
# terraform/s3.tf — Secure, versioned S3 bucket

resource "aws_s3_bucket" "data" {
  bucket = "my-app-data-prod"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
    bucket_key_enabled = true
  }
}

# Block ALL public access — critical security control
resource "aws_s3_bucket_public_access_block" "data" {
  bucket = aws_s3_bucket.data.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
Amazon DynamoDB
DynamoDB is a fully managed NoSQL key-value and document database. It delivers single-digit millisecond performance at any scale. The architecture is built on partitioning — understanding keys is essential for both performance and cost.
Key Design
The Partition Key (PK) determines which physical partition stores the item. A bad PK (e.g., a boolean or low-cardinality field) creates "hot partitions" and throttling. The Sort Key (SK) enables range queries within a partition. Together they form the composite primary key.
| Pattern | Partition Key | Sort Key | Query Example |
|---|---|---|---|
| Orders by customer | CUSTOMER#123 | ORDER#2026-03-31 | All orders for customer 123 |
| IoT sensor data | DEVICE#sensor-42 | TS#1711929600 | Readings in a time range |
| User profiles | USER#alice | PROFILE | Single item lookup |
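The patterns in the table can be simulated in plain Python to show what a well-designed composite key buys you. This is a sketch, not real DynamoDB access — a real query would use boto3's Key("PK").eq(...) & Key("SK").begins_with(...) condition; the item data below is invented for illustration:

```python
# Sketch: composite-key access patterns from the table, simulated with a
# plain list. One partition-key match plus a sort-key prefix answers
# "all orders for customer 123 in March" without a full scan.
items = [
    {"PK": "CUSTOMER#123", "SK": "ORDER#2026-03-01", "total": 40},
    {"PK": "CUSTOMER#123", "SK": "ORDER#2026-03-31", "total": 55},
    {"PK": "CUSTOMER#999", "SK": "ORDER#2026-03-05", "total": 10},
    {"PK": "USER#alice",   "SK": "PROFILE",          "email": "a@example.com"},
]

def query(pk: str, sk_prefix: str = ""):
    """All items in one partition whose sort key starts with sk_prefix."""
    return [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]

orders = query("CUSTOMER#123", "ORDER#2026-03")
print(f"Customer 123 March orders: {len(orders)}")  # 2
```

In DynamoDB the partition key lookup is a hash to one partition and the sort-key prefix is a range scan within it, which is why this shape stays fast at any table size.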
Capacity Modes
| Mode | How It Works | Best For | Pricing |
|---|---|---|---|
| Provisioned | You set WCU (Write Capacity Units) and RCU (Read Capacity Units) manually or with auto-scaling | Predictable, steady traffic | $0.00065/WCU-hour, $0.00013/RCU-hour |
| On-Demand | Pay-per-request. No capacity planning required | Unpredictable or spiky traffic | $1.25 per 1M write units, $0.25 per 1M read units |
1 WCU = 1 write/second for items up to 1 KB. 1 RCU = 1 strongly consistent read/second for items up to 4 KB (or 2 eventually consistent reads). Items larger than these thresholds consume more capacity units proportionally.
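The rounding rules above can be captured in a few lines. A minimal sketch of the capacity arithmetic (the function names are my own, not an AWS API):

```python
# Capacity-unit arithmetic: writes round up per 1 KB, reads per 4 KB,
# and eventually consistent reads cost half.
import math

def wcu_for_item(size_kb: float, writes_per_sec: float) -> float:
    """1 WCU = one 1 KB write/sec; larger items consume proportionally more."""
    return math.ceil(size_kb / 1) * writes_per_sec

def rcu_for_item(size_kb: float, reads_per_sec: float,
                 strongly_consistent: bool = True) -> float:
    """1 RCU = one strongly consistent 4 KB read/sec (or two eventual reads)."""
    units = math.ceil(size_kb / 4) * reads_per_sec
    return units if strongly_consistent else units / 2

# 3 KB items at 100 writes/sec and 100 strongly consistent reads/sec:
print(wcu_for_item(3, 100))   # 300 WCU — each 3 KB write costs 3 units
print(rcu_for_item(3, 100))   # 100 RCU — 3 KB fits in one 4 KB read
```

Note the asymmetry: the same 3 KB item costs three write units but only one read unit, which is why write-heavy workloads dominate provisioned-capacity bills.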
Terraform — DynamoDB Table
```hcl
# terraform/dynamodb.tf — DynamoDB table with composite key

resource "aws_dynamodb_table" "orders" {
  name         = "orders-prod"
  billing_mode = "PAY_PER_REQUEST" # On-Demand pricing
  hash_key     = "PK"
  range_key    = "SK"

  attribute {
    name = "PK"
    type = "S"
  }

  attribute {
    name = "SK"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```
Amazon RDS
Relational Database Service (RDS) manages relational databases — provisioning, patching, backups, and failover. Supported engines include PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Amazon Aurora (AWS's cloud-native engine).
Multi-AZ Deployments
Multi-AZ provisions a synchronous standby replica in a different Availability Zone. If the primary fails, RDS automatically fails over to the standby (typically within 60–120 seconds). This is the standard for production databases.
- Multi-AZ: high availability — automatic failover; no read traffic is served from the standby.
- Read Replicas: performance scaling — asynchronous replication; serve read traffic; no automatic failover to the primary role.

You can have both simultaneously.
Terraform — RDS PostgreSQL with Multi-AZ
```hcl
# terraform/rds.tf — Production RDS PostgreSQL

resource "aws_db_instance" "postgres" {
  identifier     = "app-db-prod"
  engine         = "postgres"
  engine_version = "16.2"
  instance_class = "db.r6i.large"

  allocated_storage     = 100
  max_allocated_storage = 500 # Auto-scaling up to 500 GB
  storage_type          = "gp3"
  storage_encrypted     = true

  db_name  = "appdb"
  username = "app_admin"
  # Password sourced from AWS Secrets Manager — never hardcode
  manage_master_user_password = true

  multi_az               = true # High availability
  db_subnet_group_name   = aws_db_subnet_group.private.name
  vpc_security_group_ids = [aws_security_group.db_sg.id]

  backup_retention_period   = 7
  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "app-db-prod-final"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```
Set manage_master_user_password = true to let RDS store and rotate credentials in AWS Secrets Manager automatically.
Module 3: Networking & Security
Networking is the foundation of every AWS deployment. A misconfigured VPC, an overly permissive security group, or a missing IAM policy can expose your entire infrastructure. This module covers the core networking and identity building blocks.
Amazon VPC
A Virtual Private Cloud (VPC) is your isolated network within AWS. You define the IP address range (CIDR block), create subnets, attach internet gateways, and configure route tables. Every resource you launch lives inside a VPC.
Core Components
| Component | Purpose | Key Detail |
|---|---|---|
| VPC | Isolated virtual network | Define CIDR block (e.g., 10.0.0.0/16 = 65,536 IPs) |
| Subnet | Segment of the VPC tied to one AZ | Public or private based on route table |
| Internet Gateway (IGW) | Connects VPC to the internet | One per VPC, no bandwidth limits, no cost |
| NAT Gateway | Lets private subnets reach the internet (outbound only) | $0.045/hour + $0.045/GB processed |
| Route Table | Rules determining where traffic goes | Each subnet is associated with exactly one route table |
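The CIDR arithmetic behind the table is easy to check with Python's standard-library ipaddress module — useful when planning how to carve a VPC into subnets:

```python
# Verify the VPC/subnet sizing from the table with the stdlib ipaddress module.
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)       # 65536 addresses in the /16

subnet = ipaddress.ip_network("10.0.1.0/24")
print(subnet.num_addresses)    # 256 (AWS reserves 5 per subnet, so 251 usable)
print(subnet.subnet_of(vpc))   # True — the subnet is carved from the VPC

# Carve the /16 into /24s (e.g. one per AZ and tier):
first_four = list(vpc.subnets(new_prefix=24))[:4]
print([str(s) for s in first_four])
```

Planning subnets this way up front avoids overlapping CIDR blocks, which are painful to fix once peering or VPN connections exist.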
Public vs Private Subnets
Route table has a route to the Internet Gateway (0.0.0.0/0 → igw-xxx). Instances with public IPs can be reached from the internet. Use for load balancers, bastion hosts.
No route to the Internet Gateway. Outbound internet access (for patches, API calls) goes through a NAT Gateway (0.0.0.0/0 → nat-xxx). Use for application servers, databases, and all backend services.
Console Navigation
The "VPC and more" wizard creates a VPC with public/private subnets, route tables, NAT Gateway, and an Internet Gateway in one step.
Terraform — VPC with Public & Private Subnets
```hcl
# terraform/vpc.tf — Production VPC layout

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
  tags                 = { Name = "main-vpc" }
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "main-igw" }
}

# Public subnet
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true
  tags                    = { Name = "public-subnet-1a" }
}

# Private subnet
resource "aws_subnet" "private" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "us-east-1a"
  tags              = { Name = "private-subnet-1a" }
}

# Route table — public (routes to Internet Gateway)
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }

  tags = { Name = "public-rt" }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

# NAT Gateway for private subnet outbound access
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id # NAT GW lives in the public subnet
  tags          = { Name = "main-nat" }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }

  tags = { Name = "private-rt" }
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}
```
AWS IAM
Identity and Access Management (IAM) controls who can do what on which resources. It is the single most important AWS service. Every API call is authenticated and authorized through IAM. There is no cost for IAM — it is included with every AWS account.
Core Concepts
| Entity | What It Is | When to Use |
|---|---|---|
| Users | Individual identity with long-term credentials | Human operators needing console/CLI access |
| Groups | Collection of users sharing the same permissions | Team-based access (e.g., "Developers", "DBAs") |
| Roles | Temporary identity assumed by services or users | EC2 instances, Lambda functions, cross-account access |
| Policies | JSON document defining allow/deny permissions | Attached to Users, Groups, or Roles |
IAM Roles provide temporary credentials that rotate automatically. Access Keys are long-lived and must be manually rotated. EC2 instances, Lambda functions, and ECS tasks should always use IAM Roles — never embedded Access Keys.
Policy Structure
```json
// IAM Policy — least-privilege access to a specific S3 bucket
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ReadWrite",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-data-prod",
        "arn:aws:s3:::my-app-data-prod/*"
      ]
    },
    {
      "Sid": "DenyAllOtherS3",
      "Effect": "Deny",
      "Action": "s3:*",
      "NotResource": [
        "arn:aws:s3:::my-app-data-prod",
        "arn:aws:s3:::my-app-data-prod/*"
      ]
    }
  ]
}
```
Boto3 — Assume a Role Programmatically
```python
# assume_role.py — Assume an IAM Role and use temporary credentials
import boto3

# STS client — used to assume roles
sts_client = boto3.client("sts", region_name="us-east-1")

# Assume the cross-account or service role
assumed = sts_client.assume_role(
    RoleArn="arn:aws:iam::987654321098:role/cross-account-data-reader",
    RoleSessionName="data-pipeline-session",
    DurationSeconds=3600,  # 1 hour max for this session
)

# Extract temporary credentials
creds = assumed["Credentials"]

# Create a new session with the assumed role's credentials
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Use the session to access resources in the target account
s3 = session.client("s3")
objects = s3.list_objects_v2(
    Bucket="target-account-data-bucket",
    Prefix="exports/",
    MaxKeys=10,
)
for obj in objects.get("Contents", []):
    print(f"  {obj['Key']} — {obj['Size']} bytes")
print(f"Session expires: {creds['Expiration']}")
```
Never use a bare wildcard * in Action or Resource. "Action": "s3:*" with "Resource": "*" grants full S3 access to every bucket in your account. Always scope policies to specific actions and specific resource ARNs.
Security Groups vs NACLs
AWS provides two layers of network filtering. Understanding the difference between stateful and stateless processing is critical for designing secure architectures.
| Feature | Security Group (SG) | Network ACL (NACL) |
|---|---|---|
| Level | Instance (ENI) level | Subnet level |
| Statefulness | Stateful — return traffic is automatically allowed | Stateless — must explicitly allow both inbound and outbound |
| Default behavior | Denies all inbound, allows all outbound | Allows all inbound and outbound |
| Rule type | Allow rules only (no deny rules) | Both Allow and Deny rules with priority numbers |
| Evaluation | All rules evaluated together | Rules evaluated in number order — first match wins |
| Scope | Applied to specific instances | Applied to all instances in the subnet |
Stateful (SG): If you allow inbound HTTP (port 80), the response traffic is automatically allowed out — you don't need a separate outbound rule. Stateless (NACL): You must create both an inbound rule (port 80) AND an outbound rule for ephemeral ports (1024–65535) for the response to reach the client.
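The first-match-wins semantics of NACLs is where most misconfigurations hide. A toy simulation of the evaluation order (the rule set is hypothetical, not an AWS default):

```python
# Sketch: NACL evaluation is first-match-wins by rule number, with an
# implicit final "*" rule that denies anything unmatched. Rules are
# (number, protocol, port-or-None-for-any, action) — a made-up example set.
RULES = [
    (100, "tcp", 80,   "allow"),  # allow HTTP
    (200, "tcp", 22,   "deny"),   # explicitly deny SSH
    (300, "tcp", None, "allow"),  # allow any other TCP port
]

def evaluate(protocol: str, port: int) -> str:
    for _num, proto, rule_port, action in sorted(RULES):
        if proto == protocol and (rule_port is None or rule_port == port):
            return action          # first match wins — later rules are ignored
    return "deny"                  # implicit catch-all deny

print(evaluate("tcp", 80))   # allow (rule 100)
print(evaluate("tcp", 22))   # deny  (rule 200 fires before the broad rule 300)
print(evaluate("udp", 53))   # deny  (no rule matches — implicit deny)
```

Swap rules 200 and 300's numbers and SSH would be allowed — the broad allow would match first. Security group rules have no such ordering: all rules are evaluated together and there are no deny rules at all.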
Terraform — Security Group Example
```hcl
# terraform/security_group.tf — Web server security group

resource "aws_security_group" "web_sg" {
  name        = "web-server-sg"
  description = "Allow HTTPS inbound and all outbound"
  vpc_id      = aws_vpc.main.id

  # Inbound — HTTPS only
  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Outbound — all traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "web-server-sg" }
}

# Database SG — only accepts traffic from the web tier
resource "aws_security_group" "db_sg" {
  name        = "database-sg"
  description = "PostgreSQL from web tier only"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "PostgreSQL from web servers"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.web_sg.id] # Reference by SG, not CIDR
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "database-sg" }
}
```
The database ingress rule uses security_groups = [aws_security_group.web_sg.id] instead of a CIDR range. This means only instances attached to the web SG can reach the database — regardless of their IP address.
Module 4: Pricing & Cost Optimization
AWS bills by the second, by the byte, and by the API call. Without active cost management, a development account can easily reach $1,000+/month from forgotten resources. This module covers the three essential tools for cost visibility and control.
AWS Pricing Calculator
The AWS Pricing Calculator (calculator.aws) lets you model complex multi-service architectures before deploying. You add each service, configure its parameters (instance type, storage, requests/month), and get a monthly/annual estimate.
How to Estimate a Typical Architecture
Real-World Example Estimate
| Service | Configuration | Monthly Cost |
|---|---|---|
| EC2 (web tier) | 2× m6i.large, On-Demand, us-east-1 | ~$140 |
| RDS PostgreSQL | db.r6i.large, Multi-AZ, 100 GB gp3 | ~$370 |
| S3 | 500 GB Standard, 10M GET, 1M PUT | ~$17 |
| NAT Gateway | 1 gateway, 500 GB processed | ~$55 |
| ALB | Application Load Balancer, 50 LCU-hours | ~$35 |
| Data Transfer | 200 GB egress to internet | ~$18 |
| Total Estimate | ~$635/month | |
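A quick way to sanity-check (and maintain) such an estimate is to keep the line items in code rather than a spreadsheet. A minimal sketch summing the table above:

```python
# Cross-check the estimate table: sum the per-service monthly figures.
estimate = {
    "EC2 web tier (2x m6i.large)": 140,
    "RDS PostgreSQL (Multi-AZ)": 370,
    "S3 (500 GB + requests)": 17,
    "NAT Gateway": 55,
    "ALB": 35,
    "Data transfer (200 GB egress)": 18,
}

total = sum(estimate.values())
print(f"Monthly total: ~${total}")  # ~$635

# Largest line item first — RDS Multi-AZ dominates this architecture.
biggest = max(estimate, key=estimate.get)
print(f"Biggest cost: {biggest} (${estimate[biggest]})")
```

Notice that the database is more than half the bill; that is typical for small architectures, and it is where right-sizing effort pays off first.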
Cost Explorer
Cost Explorer is your post-deployment cost analysis tool. It visualizes spending trends, breaks down costs by service/region/tag, and identifies anomalies. Enable it from the Billing Dashboard — it takes 24 hours to populate historical data.
Finding Hidden Costs
The most common "surprise" line items in AWS bills: NAT Gateway data processing, cross-AZ and internet data transfer, unattached EBS volumes and orphaned snapshots, idle load balancers, and CloudWatch log ingestion. Group Cost Explorer by usage type to surface them.
Tag every resource with Environment, Team, or Project. Without tags, you cannot attribute costs — and you cannot optimize what you cannot measure.
AWS Budgets
AWS Budgets lets you set custom spending thresholds and receive alerts via email or SNS when actual or forecasted costs exceed your budget. This is the simplest and most effective way to prevent unexpected bills.
Setting Up a Budget
Recommended Alert Thresholds
| Threshold | Alert Type | Action |
|---|---|---|
| 50% of budget | Email to team lead | Early awareness — check for anomalies |
| 80% of budget | Email to team + SNS topic | Review and take corrective action |
| 100% of budget | Email + SNS + Lambda trigger | Automated remediation (stop non-prod instances) |
| Forecasted > 120% | Email to finance + engineering lead | Immediate investigation required |
Terraform — Budget with Alerts
```hcl
# terraform/budgets.tf — Monthly budget with email alerts

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-account-budget"
  budget_type  = "COST"
  limit_amount = "500"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Alert at 80% of actual spend
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["cloud-team@example.com"]
  }

  # Alert when forecast exceeds budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["cloud-team@example.com", "finance@example.com"]
  }
}
```
Module 5: Common Pitfalls & Security Risks
These three patterns account for the majority of AWS security incidents and cost overruns. Understanding them is more valuable than memorizing individual service features.
Pitfall #1: Public S3 Buckets
Publicly accessible S3 buckets are the #1 cause of cloud data breaches. Misconfigured bucket policies or legacy ACLs can expose customer data, credentials, database backups, and intellectual property to anyone on the internet.
How It Happens
- Using
"Principal": "*"in a bucket policy without understanding it grants access to the entire internet - Legacy ACLs like
public-readorpublic-read-writefrom before S3 Block Public Access existed - Granting public access "temporarily" for testing and forgetting to revoke it
- Static website hosting on S3 where the data bucket is confused with the website bucket
Enable S3 Block Public Access at the account level (not just the bucket level). This overrides any bucket policy or ACL that attempts to make data public. It has been enabled by default on all new buckets since April 2023.
Terraform — Account-Level Public Access Block
```hcl
# terraform/s3_account_block.tf — Account-level public access block

resource "aws_s3_account_public_access_block" "account" {
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
Pitfall #2: Over-Provisioning
Developers and architects consistently choose instances that are 2–4× larger than needed. This is the most common and most expensive mistake in cloud computing — often costing organizations thousands per month in wasted compute.
Why It Happens
- "Just to be safe" — choosing m5.2xlarge when m5.large would suffice
- Running production-grade instances for development environments 24/7
- Not monitoring CPU/memory utilization after deployment
- Provisioned capacity on DynamoDB or RDS far exceeding actual peak load
How to Fix It
| Tool | What It Does | Action |
|---|---|---|
| AWS Compute Optimizer | Analyzes CloudWatch metrics and recommends right-sized instances | Review recommendations monthly |
| Trusted Advisor | Flags underutilized EC2 instances (CPU < 10% for 14 days) | Downsize or terminate |
| CloudWatch Alarms | Monitor CPU, memory, and network utilization | Set alerts for avg CPU < 20% |
| Savings Plans | Commit to $/hour spend (not instance type) for 1–3 years | Apply after right-sizing |
Step 1: Enable Compute Optimizer (free). Step 2: Wait 14 days for baseline data. Step 3: Review "Over-provisioned" findings. Step 4: Resize instances (stop → change type → start). Step 5: Only then commit to Reserved Instances or Savings Plans.
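The payoff of step 4 is easy to quantify. A minimal sketch of the savings arithmetic, using illustrative On-Demand rates:

```python
# Right-sizing savings: (old rate − new rate) × ~730 hours/month.
# Hourly rates below are illustrative us-east-1 On-Demand prices.
HOURS_PER_MONTH = 730

def monthly_savings(old_hourly: float, new_hourly: float, count: int = 1) -> float:
    """Monthly savings from resizing `count` instances."""
    return (old_hourly - new_hourly) * HOURS_PER_MONTH * count

# One m5.2xlarge ($0.384/hr) right-sized to m5.large ($0.096/hr):
per_instance = monthly_savings(0.384, 0.096)
print(f"~${per_instance:.0f}/month per instance")  # ~$210
print(f"~${per_instance * 12:.0f}/year")           # ~$2,523
```

Multiply by fleet size before deciding whether the change is worth scheduling: ten such instances free up roughly $25,000/year.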
Right-sizing a single instance from m5.2xlarge ($0.384/hr) to m5.large ($0.096/hr) saves ~$210/month — over $2,500/year per instance. This is a 10-minute change in the console.
Pitfall #3: Hardcoded Credentials
Embedding AWS Access Key IDs and Secret Access Keys directly in source code, config files, or environment variables on EC2 instances is a severe security anti-pattern. Leaked credentials are the fastest path to a compromised AWS account.
Common Anti-Patterns
```python
# ❌ NEVER DO THIS — hardcoded credentials in application code
import boto3

# These credentials will end up in Git history, CI logs, and error reports
client = boto3.client(
    "s3",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",              # ❌ NEVER
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxR...", # ❌ NEVER
)
```
The Correct Approach: IAM Roles
```python
# ✅ CORRECT — Boto3 automatically uses IAM Role credentials
import boto3

# When running on EC2, Lambda, or ECS, boto3 automatically discovers
# temporary credentials from the instance metadata service (IMDS)
# or the ECS task role. No credentials needed in code.
client = boto3.client("s3", region_name="us-east-1")

# This "just works" because the EC2 instance / Lambda function
# has an IAM Role attached with the necessary S3 permissions
response = client.list_objects_v2(Bucket="my-app-data-prod")
```
How IAM Roles Work for Services
| Service | How Role Is Attached | Credential Source |
|---|---|---|
| EC2 | Instance Profile (IAM Role) | Instance Metadata Service (IMDSv2) |
| Lambda | Execution Role (configured in function settings) | Environment variables (auto-injected) |
| ECS | Task Role (in task definition) | Task metadata endpoint |
| EKS | IAM Roles for Service Accounts (IRSA) | OIDC token exchange |
Step 1: Immediately deactivate the Access Key in IAM (do NOT delete yet — audit first). Step 2: Check CloudTrail for unauthorized API calls using those credentials. Step 3: Rotate any secrets the compromised role had access to. Step 4: Enable GuardDuty to detect future credential misuse. Step 5: Delete the key after investigation.
Prevention: Use AWS Secrets Manager
For secrets that cannot be replaced by IAM Roles (third-party API keys, database passwords for non-RDS databases), use AWS Secrets Manager to store and rotate them.
```python
# retrieve_secret.py — Fetch a secret from AWS Secrets Manager
import json

import boto3

# Uses IAM Role credentials — no hardcoded keys needed
secrets_client = boto3.client("secretsmanager", region_name="us-east-1")

response = secrets_client.get_secret_value(
    SecretId="prod/myapp/db-credentials"
)
secret = json.loads(response["SecretString"])

db_host = secret["host"]
db_user = secret["username"]
db_pass = secret["password"]
print(f"Connecting to {db_host} as {db_user}")
```