AWS Cloud Engineer Handbook
A practical, billing-aware guide to AWS architecture — compute, storage, networking, cost optimization, and security patterns with Terraform and Boto3 examples.
Module 1: Compute & Serverless
AWS offers compute at every abstraction level — from bare-metal EC2 instances you fully manage, to Lambda functions where you only write the handler. Choosing the right compute model is the single highest-impact cost decision you will make.
Amazon EC2
Elastic Compute Cloud (EC2) provides resizable virtual machines in the cloud. You choose the Instance Type (CPU, memory, network), the AMI (operating system image), and the pricing model. EC2 is the foundation for most AWS workloads.
Instance Types
| Family | Use Case | Example Type | vCPUs / RAM |
|---|---|---|---|
| t3 / t3a | General purpose, burstable — dev/test, small APIs | t3.medium | 2 / 4 GiB |
| m5 / m6i | General purpose, steady state — production web apps | m6i.xlarge | 4 / 16 GiB |
| c5 / c6i | Compute-optimized — batch processing, ML inference | c6i.2xlarge | 8 / 16 GiB |
| r5 / r6i | Memory-optimized — in-memory caches, large databases | r6i.xlarge | 4 / 32 GiB |
| g5 / p4d | GPU — ML training, video rendering | g5.xlarge | 4 / 16 GiB + GPU |
AMIs (Amazon Machine Images)
An AMI is a pre-built OS snapshot. AWS provides Amazon Linux 2023, Ubuntu, and Windows Server base images. You can create custom AMIs with your dependencies pre-installed to speed up instance boot times. Custom AMIs are backed by EBS snapshots (stored in S3 behind the scenes) and incur snapshot storage costs.
Pricing Models
EC2 offers four pricing models: On-Demand (per-second billing, no commitment), Reserved Instances and Savings Plans (1- or 3-year commitments for discounts of up to ~72%), and Spot (spare capacity at up to 90% off, interruptible with a two-minute warning). An m5.4xlarge running 24/7 On-Demand costs ~$560/month. Always tag instances and set billing alerts.
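Monthly cost math is worth internalizing: hourly rate × ~730 hours. A minimal sketch, using illustrative us-east-1 On-Demand rates (check current pricing before relying on them):

```python
# Back-of-the-envelope EC2 monthly cost, using ~730 hours/month.
# Rates are illustrative us-east-1 On-Demand prices, not authoritative.
HOURS_PER_MONTH = 730

on_demand_hourly = {
    "t3.medium": 0.0416,
    "m6i.xlarge": 0.192,
    "m5.4xlarge": 0.768,
}

def monthly_cost(instance_type: str, count: int = 1) -> float:
    """On-Demand cost for `count` instances running 24/7 for a month."""
    return on_demand_hourly[instance_type] * HOURS_PER_MONTH * count

print(f"m5.4xlarge 24/7: ${monthly_cost('m5.4xlarge'):.2f}/month")  # $560.64
```

Running the same arithmetic before launching any instance family makes the cost difference between, say, t3.medium (~$30/month) and m5.4xlarge (~$560/month) concrete.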
Terraform — Launch an EC2 Instance
```hcl
# terraform/ec2.tf — Production-ready EC2 instance

provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "web_server" {
  ami                    = "ami-0c02fb55956c7d316" # Amazon Linux 2023 (us-east-1)
  instance_type          = "t3.medium"
  subnet_id              = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web_sg.id]

  # Use IAM Role instead of hardcoded credentials
  iam_instance_profile = aws_iam_instance_profile.ec2_profile.name

  root_block_device {
    volume_size = 30
    volume_type = "gp3"
    encrypted   = true
  }

  tags = {
    Name        = "web-server-prod"
    Environment = "production"
    ManagedBy   = "terraform"
  }

  metadata_options {
    http_tokens = "required" # Enforce IMDSv2 — prevents SSRF attacks
  }
}
```
http_tokens = "required" blocks the instance metadata endpoint from being exploited via SSRF. This is a top AWS security best practice.
AWS Lambda
Lambda is a serverless compute service. You upload a function, define a trigger (API Gateway, S3 event, SQS message, schedule), and AWS handles all infrastructure. You pay only for the compute time consumed — billed per millisecond.
Key Constraints
| Constraint | Limit | Impact |
|---|---|---|
| Max execution time | 15 minutes | Long-running jobs must use Step Functions or ECS |
| Memory range | 128 MB — 10,240 MB | CPU scales proportionally with memory allocation |
| Deployment package | 50 MB zipped / 250 MB unzipped | Use Lambda Layers or container images for large deps |
| Concurrent executions | 1,000 (default, can be raised) | Throttled requests return HTTP 429 |
| Ephemeral storage | /tmp — up to 10 GB | Not persistent across invocations |
Cold Starts Explained
A cold start happens when Lambda creates a new execution environment for your function. This includes downloading your code, starting the runtime, and running your initialization code. Cold starts typically add 100ms–2s of latency depending on runtime and package size.
Common mitigations:
- Provisioned Concurrency: pre-warms execution environments. Eliminates cold starts but adds cost (~$0.015/GB-hour).
- SnapStart (Java): caches initialized snapshots.
- Keep-alive pings: schedule a CloudWatch Event to invoke every 5 minutes (budget-friendly but imprecise).
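The cheapest mitigation is structural: put heavy setup at module scope, where it runs once per execution environment, instead of inside the handler, where it runs on every invocation. A minimal sketch (the handler and counter are illustrative, not AWS APIs):

```python
# Sketch: module-scope code runs once per execution environment (the cold
# start); the handler body runs on every invocation. Heavy setup therefore
# belongs at module scope so warm invocations skip it.
import time

INIT_COUNT = 0

def _expensive_init():
    global INIT_COUNT
    INIT_COUNT += 1          # stands in for DB pools, config loads, etc.
    return {"ready_at": time.time()}

_STATE = _expensive_init()   # module scope: executed on cold start only

def lambda_handler(event, context):
    # Warm invocations reuse _STATE instead of re-initializing.
    return {"init_runs": INIT_COUNT, "mode": event.get("mode", "default")}

# Simulating two invocations in one environment: init ran exactly once.
print(lambda_handler({}, None))
print(lambda_handler({"mode": "batch"}, None))
```

In a real Lambda environment the module is imported once per cold start, so every warm invocation sees the already-built _STATE.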
Boto3 — Invoke a Lambda Function
```python
# invoke_lambda.py — Invoke a Lambda function programmatically
import json

import boto3

# Create Lambda client (uses IAM Role credentials automatically on EC2/Lambda)
lambda_client = boto3.client("lambda", region_name="us-east-1")

# Synchronous invocation (RequestResponse)
response = lambda_client.invoke(
    FunctionName="my-data-processor",
    InvocationType="RequestResponse",  # Use "Event" for async
    Payload=json.dumps({
        "source_bucket": "raw-data-prod",
        "object_key": "uploads/2026/03/report.csv",
    }),
)

# Parse response payload
result = json.loads(response["Payload"].read())
print(f"Status: {response['StatusCode']}")
print(f"Result: {result}")

# Async invocation — Lambda queues the event and returns immediately
async_response = lambda_client.invoke(
    FunctionName="my-data-processor",
    InvocationType="Event",
    Payload=json.dumps({"mode": "batch"}),
)
print(f"Async status: {async_response['StatusCode']}")  # 202 = accepted
```
Terraform — Lambda Function with IAM Role
```hcl
# terraform/lambda.tf — Serverless function with proper IAM role

resource "aws_iam_role" "lambda_exec" {
  name = "lambda-exec-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda_exec.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

data "archive_file" "lambda_zip" {
  type        = "zip"
  source_dir  = "${path.module}/src"
  output_path = "${path.module}/build/function.zip"
}

resource "aws_lambda_function" "processor" {
  function_name = "my-data-processor"
  runtime       = "python3.12"
  handler       = "handler.lambda_handler"
  role          = aws_iam_role.lambda_exec.arn

  filename         = data.archive_file.lambda_zip.output_path
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256

  memory_size = 256
  timeout     = 30

  environment {
    variables = {
      ENV       = "production"
      LOG_LEVEL = "INFO"
    }
  }

  tags = { ManagedBy = "terraform" }
}
```
Amazon ECS vs EKS
Both services run containerized workloads. The choice depends on your team's Kubernetes expertise and portability requirements.
| Dimension | ECS (Elastic Container Service) | EKS (Elastic Kubernetes Service) |
|---|---|---|
| Orchestrator | AWS-native (proprietary) | Kubernetes (open-source, CNCF) |
| Learning curve | Lower — simpler task definitions | Steeper — full K8s API surface |
| Portability | AWS-only | Multi-cloud (GKE, AKS compatible) |
| Launch types | EC2, Fargate | EC2, Fargate, managed node groups |
| Control plane cost | Free (you pay for compute only) | $0.10/hour ($73/month) per cluster |
| Best for | Teams new to containers; simple microservices | Teams with K8s experience; multi-cloud strategy |
Choose ECS if your team is AWS-only and wants simplicity. Choose EKS if you need Kubernetes API compatibility, use Helm charts extensively, or plan to run workloads across multiple cloud providers.
Module 2: Storage & Databases
AWS storage and database services range from object stores (S3) to fully managed relational (RDS) and NoSQL (DynamoDB) databases. Choosing the right storage tier and capacity mode is the most impactful cost decision after compute.
Amazon S3
Simple Storage Service (S3) is AWS's object storage. It stores data as objects inside buckets. Each object can be up to 5 TB. S3 provides 99.999999999% (11 nines) durability. It is the backbone for data lakes, backups, static website hosting, and log storage.
Storage Classes
| Class | Use Case | Retrieval | Cost (per GB/month) |
|---|---|---|---|
| S3 Standard | Frequently accessed data | Immediate | ~$0.023 |
| S3 Intelligent-Tiering | Unknown or changing access patterns | Immediate | ~$0.023 (auto-moves to lower tiers) |
| S3 Standard-IA | Infrequent access, rapid retrieval needed | Immediate | ~$0.0125 |
| S3 Glacier Instant | Archive with millisecond access | Immediate | ~$0.004 |
| S3 Glacier Flexible | Archive, retrieval in minutes to hours | Minutes–12 hours | ~$0.0036 |
| S3 Glacier Deep Archive | Long-term compliance archive | 12–48 hours | ~$0.00099 |
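The per-GB rates in the table compound quickly at scale. A quick sketch comparing classes for a fixed data volume (rates taken from the table above; actual prices vary by region and change over time):

```python
# Rough monthly storage cost per class, using the per-GB rates from the
# table above (us-east-1 ballpark figures, not authoritative).
RATES = {
    "Standard": 0.023,
    "Standard-IA": 0.0125,
    "Glacier Instant": 0.004,
    "Glacier Deep Archive": 0.00099,
}

def storage_cost(gb: int, storage_class: str) -> float:
    """Monthly storage-only cost (excludes requests and retrieval fees)."""
    return gb * RATES[storage_class]

# 10 TB of compliance archives: Standard vs Deep Archive
print(f"Standard:     ${storage_cost(10_000, 'Standard'):.2f}/month")              # $230.00
print(f"Deep Archive: ${storage_cost(10_000, 'Glacier Deep Archive'):.2f}/month")  # $9.90
```

Note the sketch ignores retrieval and request charges, which is exactly why Deep Archive only wins for data you almost never read.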
S3 Select
S3 Select lets you retrieve a subset of data from an object using SQL expressions. Instead of downloading a 1 GB CSV and filtering locally, S3 Select pushes the filter to the storage layer — cutting the data transferred and, in AWS's benchmarks, improving query performance by up to 400%. It works with CSV, JSON, and Parquet files.
Bucket Policies
Bucket policies are JSON-based access control statements attached to the bucket. They define who can access which objects and under what conditions. Always deny public access by default and explicitly grant access to specific IAM principals.
```json
// Example S3 Bucket Policy — allow read access from a specific IAM role only
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAppRoleRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/app-backend-role"
      },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-app-data-prod",
        "arn:aws:s3:::my-app-data-prod/*"
      ]
    }
  ]
}
```
Boto3 — Upload an Object with Metadata
```python
# s3_upload.py — Upload a file with custom metadata and server-side encryption
import boto3

s3_client = boto3.client("s3", region_name="us-east-1")

# Upload with metadata and encryption
s3_client.upload_file(
    Filename="reports/monthly-sales-2026-03.csv",
    Bucket="my-app-data-prod",
    Key="reports/2026/03/monthly-sales.csv",
    ExtraArgs={
        "Metadata": {
            "uploaded-by": "data-pipeline-v2",
            "report-type": "monthly-sales",
            "fiscal-quarter": "Q1-2026",
        },
        "ServerSideEncryption": "aws:kms",
        "ContentType": "text/csv",
    },
)
print("Upload complete with KMS encryption and custom metadata.")

# Verify by reading back the object metadata
head = s3_client.head_object(
    Bucket="my-app-data-prod",
    Key="reports/2026/03/monthly-sales.csv",
)
print(f"Size: {head['ContentLength']} bytes")
print(f"Metadata: {head['Metadata']}")
```
Terraform — S3 Bucket with Encryption & Versioning
```hcl
# terraform/s3.tf — Secure, versioned S3 bucket

resource "aws_s3_bucket" "data" {
  bucket = "my-app-data-prod"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
    bucket_key_enabled = true
  }
}

# Block ALL public access — critical security control
resource "aws_s3_bucket_public_access_block" "data" {
  bucket = aws_s3_bucket.data.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
Amazon DynamoDB
DynamoDB is a fully managed NoSQL key-value and document database. It delivers single-digit millisecond performance at any scale. The architecture is built on partitioning — understanding keys is essential for both performance and cost.
Key Design
The Partition Key (PK) determines which physical partition stores the item. A bad PK (e.g., a boolean or low-cardinality field) creates "hot partitions" and throttling. The Sort Key (SK) enables range queries within a partition. Together they form the composite primary key.
| Pattern | Partition Key | Sort Key | Query Example |
|---|---|---|---|
| Orders by customer | CUSTOMER#123 | ORDER#2026-03-31 | All orders for customer 123 |
| IoT sensor data | DEVICE#sensor-42 | TS#1711929600 | Readings in a time range |
| User profiles | USER#alice | PROFILE | Single item lookup |
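The patterns in the table can be simulated in plain Python to show what a well-designed composite key buys you. This is a sketch, not real DynamoDB access — a real query would use boto3's Key("PK").eq(...) & Key("SK").begins_with(...) condition; the item data below is invented for illustration:

```python
# Sketch: composite-key access patterns from the table, simulated with a
# plain list. One partition-key match plus a sort-key prefix answers
# "all orders for customer 123 in March" without a full scan.
items = [
    {"PK": "CUSTOMER#123", "SK": "ORDER#2026-03-01", "total": 40},
    {"PK": "CUSTOMER#123", "SK": "ORDER#2026-03-31", "total": 55},
    {"PK": "CUSTOMER#999", "SK": "ORDER#2026-03-05", "total": 10},
    {"PK": "USER#alice",   "SK": "PROFILE",          "email": "a@example.com"},
]

def query(pk: str, sk_prefix: str = ""):
    """All items in one partition whose sort key starts with sk_prefix."""
    return [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]

orders = query("CUSTOMER#123", "ORDER#2026-03")
print(f"Customer 123 March orders: {len(orders)}")  # 2
```

In DynamoDB the partition key lookup is a hash to one partition and the sort-key prefix is a range scan within it, which is why this shape stays fast at any table size.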
Capacity Modes
| Mode | How It Works | Best For | Pricing |
|---|---|---|---|
| Provisioned | You set WCU (Write Capacity Units) and RCU (Read Capacity Units) manually or with auto-scaling | Predictable, steady traffic | $0.00065/WCU-hour, $0.00013/RCU-hour |
| On-Demand | Pay-per-request. No capacity planning required | Unpredictable or spiky traffic | $1.25 per 1M write units, $0.25 per 1M read units |
1 WCU = 1 write/second for items up to 1 KB. 1 RCU = 1 strongly consistent read/second for items up to 4 KB (or 2 eventually consistent reads). Items larger than these thresholds consume more capacity units proportionally.
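The rounding rules above can be captured in a few lines. A minimal sketch of the capacity arithmetic (the function names are my own, not an AWS API):

```python
# Capacity-unit arithmetic: writes round up per 1 KB, reads per 4 KB,
# and eventually consistent reads cost half.
import math

def wcu_for_item(size_kb: float, writes_per_sec: float) -> float:
    """1 WCU = one 1 KB write/sec; larger items consume proportionally more."""
    return math.ceil(size_kb / 1) * writes_per_sec

def rcu_for_item(size_kb: float, reads_per_sec: float,
                 strongly_consistent: bool = True) -> float:
    """1 RCU = one strongly consistent 4 KB read/sec (or two eventual reads)."""
    units = math.ceil(size_kb / 4) * reads_per_sec
    return units if strongly_consistent else units / 2

# 3 KB items at 100 writes/sec and 100 strongly consistent reads/sec:
print(wcu_for_item(3, 100))   # 300 WCU — each 3 KB write costs 3 units
print(rcu_for_item(3, 100))   # 100 RCU — 3 KB fits in one 4 KB read
```

Note the asymmetry: the same 3 KB item costs three write units but only one read unit, which is why write-heavy workloads dominate provisioned-capacity bills.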
Terraform — DynamoDB Table
```hcl
# terraform/dynamodb.tf — DynamoDB table with composite key

resource "aws_dynamodb_table" "orders" {
  name         = "orders-prod"
  billing_mode = "PAY_PER_REQUEST" # On-Demand pricing
  hash_key     = "PK"
  range_key    = "SK"

  attribute {
    name = "PK"
    type = "S"
  }

  attribute {
    name = "SK"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```
Amazon RDS
Relational Database Service (RDS) manages relational databases — provisioning, patching, backups, and failover. Supported engines include PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Amazon Aurora (AWS's cloud-native engine).
Multi-AZ Deployments
Multi-AZ provisions a synchronous standby replica in a different Availability Zone. If the primary fails, RDS automatically fails over to the standby (typically within 60–120 seconds). This is the standard for production databases.
- Multi-AZ: high availability — automatic failover; no read traffic is served from the standby.
- Read Replicas: performance scaling — asynchronous replication; serve read traffic; no automatic failover to the primary role.

You can have both simultaneously.
Terraform — RDS PostgreSQL with Multi-AZ
```hcl
# terraform/rds.tf — Production RDS PostgreSQL

resource "aws_db_instance" "postgres" {
  identifier     = "app-db-prod"
  engine         = "postgres"
  engine_version = "16.2"
  instance_class = "db.r6i.large"

  allocated_storage     = 100
  max_allocated_storage = 500 # Auto-scaling up to 500 GB
  storage_type          = "gp3"
  storage_encrypted     = true

  db_name  = "appdb"
  username = "app_admin"
  # Password sourced from AWS Secrets Manager — never hardcode
  manage_master_user_password = true

  multi_az               = true # High availability
  db_subnet_group_name   = aws_db_subnet_group.private.name
  vpc_security_group_ids = [aws_security_group.db_sg.id]

  backup_retention_period   = 7
  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "app-db-prod-final"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```
Set manage_master_user_password = true to let RDS store and rotate credentials in AWS Secrets Manager automatically.
Module 3: Networking & Security
Networking is the foundation of every AWS deployment. A misconfigured VPC, an overly permissive security group, or a missing IAM policy can expose your entire infrastructure. This module covers the core networking and identity building blocks.
Amazon VPC
A Virtual Private Cloud (VPC) is your isolated network within AWS. You define the IP address range (CIDR block), create subnets, attach internet gateways, and configure route tables. Every resource you launch lives inside a VPC.
Core Components
| Component | Purpose | Key Detail |
|---|---|---|
| VPC | Isolated virtual network | Define CIDR block (e.g., 10.0.0.0/16 = 65,536 IPs) |
| Subnet | Segment of the VPC tied to one AZ | Public or private based on route table |
| Internet Gateway (IGW) | Connects VPC to the internet | One per VPC, no bandwidth limits, no cost |
| NAT Gateway | Lets private subnets reach the internet (outbound only) | $0.045/hour + $0.045/GB processed |
| Route Table | Rules determining where traffic goes | Each subnet is associated with exactly one route table |
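The CIDR arithmetic behind the table is easy to check with Python's standard-library ipaddress module — useful when planning how to carve a VPC into subnets:

```python
# Verify the VPC/subnet sizing from the table with the stdlib ipaddress module.
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)       # 65536 addresses in the /16

subnet = ipaddress.ip_network("10.0.1.0/24")
print(subnet.num_addresses)    # 256 (AWS reserves 5 per subnet, so 251 usable)
print(subnet.subnet_of(vpc))   # True — the subnet is carved from the VPC

# Carve the /16 into /24s (e.g. one per AZ and tier):
first_four = list(vpc.subnets(new_prefix=24))[:4]
print([str(s) for s in first_four])
```

Planning subnets this way up front avoids overlapping CIDR blocks, which are painful to fix once peering or VPN connections exist.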
Public vs Private Subnets
Route table has a route to the Internet Gateway (0.0.0.0/0 → igw-xxx). Instances with public IPs can be reached from the internet. Use for load balancers, bastion hosts.
No route to the Internet Gateway. Outbound internet access (for patches, API calls) goes through a NAT Gateway (0.0.0.0/0 → nat-xxx). Use for application servers, databases, and all backend services.
Console Navigation
The "VPC and more" wizard creates a VPC with public/private subnets, route tables, NAT Gateway, and an Internet Gateway in one step.
Terraform — VPC with Public & Private Subnets
```hcl
# terraform/vpc.tf — Production VPC layout

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
  tags                 = { Name = "main-vpc" }
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "main-igw" }
}

# Public subnet
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true
  tags                    = { Name = "public-subnet-1a" }
}

# Private subnet
resource "aws_subnet" "private" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "us-east-1a"
  tags              = { Name = "private-subnet-1a" }
}

# Route table — public (routes to Internet Gateway)
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }

  tags = { Name = "public-rt" }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

# NAT Gateway for private subnet outbound access
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id # NAT GW lives in the public subnet
  tags          = { Name = "main-nat" }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }

  tags = { Name = "private-rt" }
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}
```
AWS IAM
Identity and Access Management (IAM) controls who can do what on which resources. It is the single most important AWS service. Every API call is authenticated and authorized through IAM. There is no cost for IAM — it is included with every AWS account.
Core Concepts
| Entity | What It Is | When to Use |
|---|---|---|
| Users | Individual identity with long-term credentials | Human operators needing console/CLI access |
| Groups | Collection of users sharing the same permissions | Team-based access (e.g., "Developers", "DBAs") |
| Roles | Temporary identity assumed by services or users | EC2 instances, Lambda functions, cross-account access |
| Policies | JSON document defining allow/deny permissions | Attached to Users, Groups, or Roles |
IAM Roles provide temporary credentials that rotate automatically. Access Keys are long-lived and must be manually rotated. EC2 instances, Lambda functions, and ECS tasks should always use IAM Roles — never embedded Access Keys.
Policy Structure
```json
// IAM Policy — least-privilege access to a specific S3 bucket
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ReadWrite",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-data-prod",
        "arn:aws:s3:::my-app-data-prod/*"
      ]
    },
    {
      "Sid": "DenyAllOtherS3",
      "Effect": "Deny",
      "Action": "s3:*",
      "NotResource": [
        "arn:aws:s3:::my-app-data-prod",
        "arn:aws:s3:::my-app-data-prod/*"
      ]
    }
  ]
}
```
Boto3 — Assume a Role Programmatically
```python
# assume_role.py — Assume an IAM Role and use temporary credentials
import boto3

# STS client — used to assume roles
sts_client = boto3.client("sts", region_name="us-east-1")

# Assume the cross-account or service role
assumed = sts_client.assume_role(
    RoleArn="arn:aws:iam::987654321098:role/cross-account-data-reader",
    RoleSessionName="data-pipeline-session",
    DurationSeconds=3600,  # 1 hour max for this session
)

# Extract temporary credentials
creds = assumed["Credentials"]

# Create a new session with the assumed role's credentials
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Use the session to access resources in the target account
s3 = session.client("s3")
objects = s3.list_objects_v2(
    Bucket="target-account-data-bucket",
    Prefix="exports/",
    MaxKeys=10,
)
for obj in objects.get("Contents", []):
    print(f"  {obj['Key']} — {obj['Size']} bytes")
print(f"Session expires: {creds['Expiration']}")
```
Never use a bare wildcard * in Action or Resource. "Action": "s3:*" with "Resource": "*" grants full S3 access to every bucket in your account. Always scope policies to specific actions and specific resource ARNs.
Security Groups vs NACLs
AWS provides two layers of network filtering. Understanding the difference between stateful and stateless processing is critical for designing secure architectures.
| Feature | Security Group (SG) | Network ACL (NACL) |
|---|---|---|
| Level | Instance (ENI) level | Subnet level |
| Statefulness | Stateful — return traffic is automatically allowed | Stateless — must explicitly allow both inbound and outbound |
| Default behavior | Denies all inbound, allows all outbound | Allows all inbound and outbound |
| Rule type | Allow rules only (no deny rules) | Both Allow and Deny rules with priority numbers |
| Evaluation | All rules evaluated together | Rules evaluated in number order — first match wins |
| Scope | Applied to specific instances | Applied to all instances in the subnet |
Stateful (SG): If you allow inbound HTTP (port 80), the response traffic is automatically allowed out — you don't need a separate outbound rule. Stateless (NACL): You must create both an inbound rule (port 80) AND an outbound rule for ephemeral ports (1024–65535) for the response to reach the client.
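The first-match-wins semantics of NACLs is where most misconfigurations hide. A toy simulation of the evaluation order (the rule set is hypothetical, not an AWS default):

```python
# Sketch: NACL evaluation is first-match-wins by rule number, with an
# implicit final "*" rule that denies anything unmatched. Rules are
# (number, protocol, port-or-None-for-any, action) — a made-up example set.
RULES = [
    (100, "tcp", 80,   "allow"),  # allow HTTP
    (200, "tcp", 22,   "deny"),   # explicitly deny SSH
    (300, "tcp", None, "allow"),  # allow any other TCP port
]

def evaluate(protocol: str, port: int) -> str:
    for _num, proto, rule_port, action in sorted(RULES):
        if proto == protocol and (rule_port is None or rule_port == port):
            return action          # first match wins — later rules are ignored
    return "deny"                  # implicit catch-all deny

print(evaluate("tcp", 80))   # allow (rule 100)
print(evaluate("tcp", 22))   # deny  (rule 200 fires before the broad rule 300)
print(evaluate("udp", 53))   # deny  (no rule matches — implicit deny)
```

Swap rules 200 and 300's numbers and SSH would be allowed — the broad allow would match first. Security group rules have no such ordering: all rules are evaluated together and there are no deny rules at all.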
Terraform — Security Group Example
```hcl
# terraform/security_group.tf — Web server security group

resource "aws_security_group" "web_sg" {
  name        = "web-server-sg"
  description = "Allow HTTPS inbound and all outbound"
  vpc_id      = aws_vpc.main.id

  # Inbound — HTTPS only
  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Outbound — all traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "web-server-sg" }
}

# Database SG — only accepts traffic from the web tier
resource "aws_security_group" "db_sg" {
  name        = "database-sg"
  description = "PostgreSQL from web tier only"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "PostgreSQL from web servers"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.web_sg.id] # Reference by SG, not CIDR
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "database-sg" }
}
```
The database ingress rule uses security_groups = [aws_security_group.web_sg.id] instead of a CIDR range. This means only instances attached to the web SG can reach the database — regardless of their IP address.
Module 4: Pricing & Cost Optimization
AWS bills by the second, by the byte, and by the API call. Without active cost management, a development account can easily reach $1,000+/month from forgotten resources. This module covers the three essential tools for cost visibility and control.
AWS Pricing Calculator
The AWS Pricing Calculator (calculator.aws) lets you model complex multi-service architectures before deploying. You add each service, configure its parameters (instance type, storage, requests/month), and get a monthly/annual estimate.
How to Estimate a Typical Architecture
Real-World Example Estimate
| Service | Configuration | Monthly Cost |
|---|---|---|
| EC2 (web tier) | 2× m6i.large, On-Demand, us-east-1 | ~$140 |
| RDS PostgreSQL | db.r6i.large, Multi-AZ, 100 GB gp3 | ~$370 |
| S3 | 500 GB Standard, 10M GET, 1M PUT | ~$17 |
| NAT Gateway | 1 gateway, 500 GB processed | ~$55 |
| ALB | Application Load Balancer, 50 LCU-hours | ~$35 |
| Data Transfer | 200 GB egress to internet | ~$18 |
| Total Estimate | ~$635/month | |
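A quick way to sanity-check (and maintain) such an estimate is to keep the line items in code rather than a spreadsheet. A minimal sketch summing the table above:

```python
# Cross-check the estimate table: sum the per-service monthly figures.
estimate = {
    "EC2 web tier (2x m6i.large)": 140,
    "RDS PostgreSQL (Multi-AZ)": 370,
    "S3 (500 GB + requests)": 17,
    "NAT Gateway": 55,
    "ALB": 35,
    "Data transfer (200 GB egress)": 18,
}

total = sum(estimate.values())
print(f"Monthly total: ~${total}")  # ~$635

# Largest line item first — RDS Multi-AZ dominates this architecture.
biggest = max(estimate, key=estimate.get)
print(f"Biggest cost: {biggest} (${estimate[biggest]})")
```

Notice that the database is more than half the bill; that is typical for small architectures, and it is where right-sizing effort pays off first.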
Cost Explorer
Cost Explorer is your post-deployment cost analysis tool. It visualizes spending trends, breaks down costs by service/region/tag, and identifies anomalies. Enable it from the Billing Dashboard — it takes 24 hours to populate historical data.
Finding Hidden Costs
The most common "surprise" line items in AWS bills: NAT Gateway data processing, cross-AZ and internet data transfer, unattached EBS volumes and orphaned snapshots, idle load balancers, and CloudWatch log ingestion. Group Cost Explorer by usage type to surface them.
Tag every resource with Environment, Team, or Project. Without tags, you cannot attribute costs — and you cannot optimize what you cannot measure.
AWS Budgets
AWS Budgets lets you set custom spending thresholds and receive alerts via email or SNS when actual or forecasted costs exceed your budget. This is the simplest and most effective way to prevent unexpected bills.
Setting Up a Budget
Recommended Alert Thresholds
| Threshold | Alert Type | Action |
|---|---|---|
| 50% of budget | Email to team lead | Early awareness — check for anomalies |
| 80% of budget | Email to team + SNS topic | Review and take corrective action |
| 100% of budget | Email + SNS + Lambda trigger | Automated remediation (stop non-prod instances) |
| Forecasted > 120% | Email to finance + engineering lead | Immediate investigation required |
Terraform — Budget with Alerts
```hcl
# terraform/budgets.tf — Monthly budget with email alerts

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-account-budget"
  budget_type  = "COST"
  limit_amount = "500"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Alert at 80% of actual spend
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["cloud-team@example.com"]
  }

  # Alert when forecast exceeds budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["cloud-team@example.com", "finance@example.com"]
  }
}
```
Module 5: Common Pitfalls & Security Risks
These three patterns account for the majority of AWS security incidents and cost overruns. Understanding them is more valuable than memorizing individual service features.
Pitfall #1: Public S3 Buckets
Publicly accessible S3 buckets are the #1 cause of cloud data breaches. Misconfigured bucket policies or legacy ACLs can expose customer data, credentials, database backups, and intellectual property to anyone on the internet.
How It Happens
- Using
"Principal": "*"in a bucket policy without understanding it grants access to the entire internet - Legacy ACLs like
public-readorpublic-read-writefrom before S3 Block Public Access existed - Granting public access "temporarily" for testing and forgetting to revoke it
- Static website hosting on S3 where the data bucket is confused with the website bucket
Enable S3 Block Public Access at the account level (not just the bucket level). This overrides any bucket policy or ACL that attempts to make data public. It has been enabled by default on all new buckets since April 2023.
Terraform — Account-Level Public Access Block
```hcl
# terraform/s3_account_block.tf — Account-level public access block

resource "aws_s3_account_public_access_block" "account" {
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
Pitfall #2: Over-Provisioning
Developers and architects consistently choose instances that are 2–4× larger than needed. This is the most common and most expensive mistake in cloud computing — often costing organizations thousands per month in wasted compute.
Why It Happens
- "Just to be safe" — choosing m5.2xlarge when m5.large would suffice
- Running production-grade instances for development environments 24/7
- Not monitoring CPU/memory utilization after deployment
- Provisioned capacity on DynamoDB or RDS far exceeding actual peak load
How to Fix It
| Tool | What It Does | Action |
|---|---|---|
| AWS Compute Optimizer | Analyzes CloudWatch metrics and recommends right-sized instances | Review recommendations monthly |
| Trusted Advisor | Flags underutilized EC2 instances (CPU < 10% for 14 days) | Downsize or terminate |
| CloudWatch Alarms | Monitor CPU, memory, and network utilization | Set alerts for avg CPU < 20% |
| Savings Plans | Commit to $/hour spend (not instance type) for 1–3 years | Apply after right-sizing |
Step 1: Enable Compute Optimizer (free). Step 2: Wait 14 days for baseline data. Step 3: Review "Over-provisioned" findings. Step 4: Resize instances (stop → change type → start). Step 5: Only then commit to Reserved Instances or Savings Plans.
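The payoff of step 4 is easy to quantify. A minimal sketch of the savings arithmetic, using illustrative On-Demand rates:

```python
# Right-sizing savings: (old rate − new rate) × ~730 hours/month.
# Hourly rates below are illustrative us-east-1 On-Demand prices.
HOURS_PER_MONTH = 730

def monthly_savings(old_hourly: float, new_hourly: float, count: int = 1) -> float:
    """Monthly savings from resizing `count` instances."""
    return (old_hourly - new_hourly) * HOURS_PER_MONTH * count

# One m5.2xlarge ($0.384/hr) right-sized to m5.large ($0.096/hr):
per_instance = monthly_savings(0.384, 0.096)
print(f"~${per_instance:.0f}/month per instance")  # ~$210
print(f"~${per_instance * 12:.0f}/year")           # ~$2,523
```

Multiply by fleet size before deciding whether the change is worth scheduling: ten such instances free up roughly $25,000/year.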
Right-sizing a single instance from m5.2xlarge ($0.384/hr) to m5.large ($0.096/hr) saves ~$210/month — over $2,500/year per instance. This is a 10-minute change in the console.
Pitfall #3: Hardcoded Credentials
Embedding AWS Access Key IDs and Secret Access Keys directly in source code, config files, or environment variables on EC2 instances is a severe security anti-pattern. Leaked credentials are the fastest path to a compromised AWS account.
Common Anti-Patterns
```python
# ❌ NEVER DO THIS — hardcoded credentials in application code
import boto3

# These credentials will end up in Git history, CI logs, and error reports
client = boto3.client(
    "s3",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",              # ❌ NEVER
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxR...", # ❌ NEVER
)
```
The Correct Approach: IAM Roles
```python
# ✅ CORRECT — Boto3 automatically uses IAM Role credentials
import boto3

# When running on EC2, Lambda, or ECS, boto3 automatically discovers
# temporary credentials from the instance metadata service (IMDS)
# or the ECS task role. No credentials needed in code.
client = boto3.client("s3", region_name="us-east-1")

# This "just works" because the EC2 instance / Lambda function
# has an IAM Role attached with the necessary S3 permissions
response = client.list_objects_v2(Bucket="my-app-data-prod")
```
How IAM Roles Work for Services
| Service | How Role Is Attached | Credential Source |
|---|---|---|
| EC2 | Instance Profile (IAM Role) | Instance Metadata Service (IMDSv2) |
| Lambda | Execution Role (configured in function settings) | Environment variables (auto-injected) |
| ECS | Task Role (in task definition) | Task metadata endpoint |
| EKS | IAM Roles for Service Accounts (IRSA) | OIDC token exchange |
Step 1: Immediately deactivate the Access Key in IAM (do NOT delete yet — audit first). Step 2: Check CloudTrail for unauthorized API calls using those credentials. Step 3: Rotate any secrets the compromised role had access to. Step 4: Enable GuardDuty to detect future credential misuse. Step 5: Delete the key after investigation.
Prevention: Use AWS Secrets Manager
For secrets that cannot be replaced by IAM Roles (third-party API keys, database passwords for non-RDS databases), use AWS Secrets Manager to store and rotate them.
```python
# retrieve_secret.py — Fetch a secret from AWS Secrets Manager
import json

import boto3

# Uses IAM Role credentials — no hardcoded keys needed
secrets_client = boto3.client("secretsmanager", region_name="us-east-1")

response = secrets_client.get_secret_value(
    SecretId="prod/myapp/db-credentials"
)
secret = json.loads(response["SecretString"])

db_host = secret["host"]
db_user = secret["username"]
db_pass = secret["password"]
print(f"Connecting to {db_host} as {db_user}")
```