Multi-Cloud Architecture & Scaling Standards Handbook (v1)
Architecture Reference · April 2026 · AWS · Azure · GCP

Production-ready reference guidance for scalable, secure systems across AWS, Azure, and Google Cloud Platform — written for engineers and technical leads who need consistent, defensible architecture decisions.

AWS · Azure · GCP · Scalability Standards · Security Baselines · 4 Architecture Patterns
Foundation

Core Principles

These principles apply universally, regardless of which pattern you choose. They represent the minimum standard for any production system and should be enforced at the architecture review stage, not retrofitted later.

⚠️
Design for Failure

Assume every component can fail. Build retry logic, circuit breakers, graceful degradation, and health probes into every layer — not as an afterthought.
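
As a rough illustration of the retry and circuit-breaker ideas above, here is a minimal Python sketch. The names `CircuitBreaker`, `CircuitOpenError`, and `retry_with_backoff`, and all thresholds and delays, are hypothetical and chosen for illustration only. A production system would more likely use a maintained resilience library or the retry features of its cloud SDK.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected without being tried."""


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(fn, attempts=4, base_delay_s=0.1, max_delay_s=2.0, sleep=time.sleep):
    """Retry fn with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # never hammer a dependency whose circuit is open
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

A caller might combine the two as `retry_with_backoff(lambda: breaker.call(fetch_orders))`, where `fetch_orders` stands in for any remote call. The jitter spreads retries out so that many clients do not retry in lockstep.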

⚙️
Prefer Managed Services

Managed services reduce operational surface area and patch burden. Choose them when they materially reduce toil without introducing unacceptable vendor lock-in.

↔️
Scale Horizontally First

Horizontal scaling is generally more resilient and cost-effective than vertical scaling, because losing one small node removes only a fraction of capacity. Reserve vertical scaling for transactional databases where single-node performance is the bottleneck.

🔒
Separate Public and Private Planes

Internet-facing ingress must never route directly to compute or data systems. Public subnets are for edge components only: load balancers, NAT gateways, and approved bastions.

🪪
Standardise Identity & Encryption

Platform-native identities, short-lived credentials, TLS everywhere, and centralized secrets management are non-negotiable baselines across all cloud patterns.

📐
Match Architecture to Team Maturity

Choose the simplest architecture the team can operate safely under incident conditions. Fashionable complexity that cannot be debugged at 3am is an operational liability.

Decision Guide

Architecture Picker

Pattern | Best for | Operational complexity | Relative cost
3-Tier Web | Monoliths, lift-and-shift, small teams | Low | Low-Medium
Kubernetes Microservices | Multi-team, polyglot, bounded contexts | High | Medium-High
Serverless Event-Driven | Spiky traffic, async workflows, glue code | Medium | Low (variable)
Multi-Region HA | Tier-0, global users, compliance/DR | Very High | High
Module 01

The Classic 3-Tier Web Architecture

The 3-tier model separates presentation, application logic, and persistence into distinct layers with independent scaling, security boundaries, and deployment controls. It remains the most reliable foundation for enterprise systems — proven, predictable, and operationally straightforward.

[Diagram: Users (global traffic) → Public load balancer (ALB · Application Gateway · Cloud LB) → Web servers 1..N (presentation layer, auto-scaling group) → App servers 1..N (business logic) → Primary SQL database → Read replica. Tiers: Presentation, Logic, Data.]
✓ Use when
  • Deploying a monolithic or near-monolithic codebase
  • Performing lift-and-shift from on-premises environments
  • The team prefers VM-level control over containers/serverless
  • Stateful application behaviour requires OS-level configuration
  • Operations team is comfortable with traditional server management
✗ Avoid when
  • Rapid independent service deployment is the primary requirement
  • Multiple teams need to own separate release lifecycles
  • The codebase is already cleanly decomposed into bounded contexts
  • Traffic is highly spiky and elastic scaling must reach zero

Scaling Strategy

🔄
Horizontal Auto-Scaling

Place web and app tiers in separate auto-scaling pools. Scale independently on CPU, memory, request count, queue depth, or P99 latency.

🗄️
Read Replicas

Route read-heavy traffic to read replicas. Direct all writes to the primary. Scale reads horizontally before scaling the primary vertically.
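
The read/write split can be sketched as a small routing function. This is a minimal illustration, not a production router: `ReadWriteRouter` and the string endpoints are hypothetical, and many drivers, proxies, and ORMs provide this capability already. It assumes that anything other than a plain `SELECT` outside a transaction should go to the primary.

```python
import itertools


class ReadWriteRouter:
    """Sends plain reads to replicas (round-robin) and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql, in_transaction=False):
        normalized = " ".join(sql.lower().split())
        is_plain_read = normalized.startswith("select") and "for update" not in normalized
        # Reads inside a transaction stay on the primary so they see their own writes.
        if is_plain_read and not in_transaction and self._replicas is not None:
            return next(self._replicas)
        return self.primary
```

Replicas apply changes asynchronously, so a read routed this way can be slightly stale. Paths that need read-your-own-writes behaviour should pass `in_transaction=True` or read from the primary directly.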

Stateless Instances

Externalise session state to Redis or Memcached. Stateless instances scale freely without session affinity problems.

📊
Capacity Triggers

Define predictive and reactive triggers. Pre-scale for known events. Use scheduled scaling for known daily traffic curves.

Cloud-Specific Service Mappings

Layer | AWS | Azure | GCP
Public ingress | Application Load Balancer | Application Gateway | Cloud Load Balancing
Web tier compute | EC2 Auto Scaling Group | VM Scale Sets | Managed Instance Group
App tier compute | EC2 ASG | VM Scale Sets | Managed Instance Group
Managed SQL | RDS / Aurora | Azure SQL / SQL MI | Cloud SQL
Read replicas | RDS Read Replica / Aurora Replica | SQL geo-replica | Cloud SQL replica
Session cache | ElastiCache (Redis) | Azure Cache for Redis | Memorystore
Secrets & keys | KMS / Secrets Manager | Key Vault | Cloud KMS / Secret Manager
Module 02

Containerized Microservices on Kubernetes

Microservices on Kubernetes give independent teams the ability to release, scale, and own their services without coordinating monolith deployments. Each service has its own scaling policy, runtime, and often its own data boundary — connected through a shared ingress and internal service mesh.

[Diagram: Users (internet) → API gateway / ingress controller (rate limiting, auth, TLS termination, routing) → Kubernetes cluster with Service A (Order domain, HPA 2–20 pods, CPU and custom metrics), Service B (Inventory domain, HPA 3–30 pods, requests/sec trigger), and Service C (Notification domain, HPA 1–10 pods, queue depth trigger) → NoSQL DB, distributed cache, message queue. The Cluster Autoscaler expands the node pool.]
✓ Use when
  • Multiple teams need independent release cadences
  • Domain boundaries are well understood and stable
  • Polyglot runtimes (Go, Python, JVM, Node) are required
  • Team has operational maturity for distributed systems
  • Service-level ownership and SLA accountability matter
✗ Avoid when
  • Small team without Kubernetes operations experience
  • Domain boundaries are poorly defined or contested
  • Distributed tracing and service contract discipline is absent
  • Simplicity is more important than independent deployability

Scaling Strategy

🔄
Horizontal Pod Autoscaler

Scale pod counts per-service on CPU, memory, custom metrics (RPS, queue lag, latency P95). Configure per-deployment — not globally.
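
The core calculation the Horizontal Pod Autoscaler performs is documented by Kubernetes as `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, with scaling skipped while the ratio stays within a tolerance (0.1 by default). The Python sketch below reproduces only that formula for a single metric. The function name and the default bounds are illustrative, and the real controller also handles stabilisation windows, missing metrics, and multiple metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Single-metric HPA estimate, clamped to the configured replica bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: do not scale
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 pods averaging 90% CPU against a 60% target give a ratio of 1.5, so the estimate is 6 pods, provided `max_replicas` allows it.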

🖥️
Cluster Autoscaler

Adds worker nodes when the scheduler cannot place pending pods, and removes nodes that stay underutilised. Works alongside HPA so that node capacity follows pod demand during scale-out.

🛡️
Pod Disruption Budgets

Define minimum available replicas to ensure safe rolling deployments and autoscaler node drains without service interruption.

📦
Resource Requests & Limits

Set CPU and memory requests accurately, because the scheduler uses them for placement decisions. Set limits to contain noisy-neighbour impact. Memory limits set too low cause OOMKills; CPU limits set too low cause throttling.

Cloud-Specific Service Mappings

Capability | AWS | Azure | GCP
Managed Kubernetes | Amazon EKS | AKS | GKE (Autopilot)
Ingress / API entry | AWS LBC / API Gateway | AGIC / API Management | GKE Ingress / API Gateway
Container registry | ECR | Azure Container Registry | Artifact Registry
NoSQL data store | DynamoDB | Cosmos DB | Firestore / Bigtable
Distributed cache | ElastiCache | Azure Cache for Redis | Memorystore
Metrics / autoscale | CloudWatch / Prometheus | Azure Monitor | Cloud Monitoring
Service mesh | App Mesh | Open Service Mesh | Cloud Service Mesh
Module 03

Serverless & Event-Driven Architecture

Serverless event-driven systems decouple producers from consumers through a durable event bus or messaging layer. Producers emit events without knowing who processes them. Consumers scale independently according to event volume, with zero idle cost when quiet.

[Diagram: Users / clients (HTTP, WebSocket, IoT) → API gateway (throttling, auth, request routing) → Ingress function (validates, enriches, emits event) → Event bus / topic / queue (EventBridge · Event Grid · Pub/Sub · SNS · Service Bus) → Notification, Audit / Logging, Analytics, and Workflow functions. Each consumer scales independently from 0 to N concurrent executions.]
✓ Use when
  • Traffic is spiky, bursty, or highly unpredictable
  • Workflows are naturally asynchronous and tolerate delay
  • Minimising idle infrastructure cost is a primary concern
  • Integration glue, automation, and back-office processing
  • Rapid prototyping with low operational ownership
✗ Avoid when
  • Workloads are long-running (>15 min without checkpointing)
  • Ultra-low latency is a hard requirement (cold starts matter)
  • Complex distributed transactions require strong consistency
  • Debugging and tracing experience is weak on the team

Scaling Strategy

Scale to Zero

No traffic → no running instances → no idle cost. Functions spin up on demand. Provisioned Concurrency (AWS) or pre-warming reduces cold starts on critical paths, at the price of some idle cost.

📨
Queue-Backed Consumers

Event buses absorb traffic spikes. Downstream functions process at their own rate. No cascading overload — each consumer scales to its own subscription pressure.

🔁
Concurrency & Throttle Guards

Set reserved or maximum concurrency limits to protect downstream services and databases from thundering-herd effects during burst events.
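
On managed platforms this guard is normally a configuration setting rather than code. Where a consumer calls a fragile dependency from its own worker threads, the same idea can be sketched in-process. The `ConcurrencyGuard` class below is a hypothetical illustration that makes excess callers wait for a free slot, whereas a platform limit would typically throttle or queue the invocation.

```python
import threading


class ConcurrencyGuard:
    """Caps how many calls run at once; extra callers block until a slot frees up."""

    def __init__(self, max_concurrency):
        self._slots = threading.BoundedSemaphore(max_concurrency)

    def run(self, fn, *args, **kwargs):
        with self._slots:  # acquire a slot, release it even if fn raises
            return fn(*args, **kwargs)
```

Wrapping database calls as `guard.run(query_db, order_id)` keeps the number of simultaneous queries at or below the chosen limit, however many workers a burst spins up. Here `query_db` is a placeholder for any downstream call.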

☠️
Dead-Letter Queues

Failed events go to a DLQ for inspection and replay. Never silently discard events. Monitor DLQ depth as a key operational metric.
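
Managed queues implement dead-lettering for you through a maximum delivery count, but the behaviour is easy to reason about with a small in-memory model. The sketch below is illustrative only. The `drain` function, the message dictionary shape (`body`, `attempts`, `last_error`), and the default of three attempts are all assumptions, not any provider's API.

```python
from collections import deque


def drain(queue, handler, dead_letters, max_attempts=3):
    """Process every message; park messages that keep failing in dead_letters."""
    while queue:
        message = queue.popleft()
        attempts = message.get("attempts", 0) + 1
        try:
            handler(message["body"])
        except Exception as exc:
            failed = {**message, "attempts": attempts, "last_error": repr(exc)}
            if attempts >= max_attempts:
                dead_letters.append(failed)  # kept for inspection and replay
            else:
                queue.append(failed)  # redeliver later
```

The length of `dead_letters` is the DLQ depth that the standard asks you to monitor. Nothing is discarded, and each parked message records how often it was tried and why it last failed.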

Cloud-Specific Service Mappings

Capability | AWS | Azure | GCP
API entry | API Gateway | API Management | API Gateway
Serverless compute | Lambda | Azure Functions | Cloud Functions / Cloud Run
Event router / topic | EventBridge / SNS / SQS | Event Grid / Service Bus | Pub/Sub
Workflow orchestration | Step Functions | Durable Functions / Logic Apps | Workflows
Dead-letter / retry | SQS DLQ | Service Bus DLQ | Pub/Sub dead-letter topic
Module 04

Global High Availability — Multi-Region

Multi-region architecture protects against catastrophic regional failure and reduces latency for globally distributed users. It requires strong discipline around failover automation, replication lag tolerance, consistency models, and incident runbooks — the most complex pattern in this handbook.

[Diagram: Global users (distributed across regions) → Global DNS / traffic routing (Route 53 · Front Door · Cloud DNS · geo-routing · health checks) → Active-active mode: Region A (primary) and Region B (secondary), each running an application stack (LB → Web → App → Internal services) with its own regional database, linked by lag-aware asynchronous replication.]
✓ Use when
  • Application is Tier 0 — regional downtime is unacceptable
  • Users are globally distributed and latency matters materially
  • Compliance or resilience objectives require demonstrable DR
  • Business continuity requires RTO < 15 min and RPO < 1 min
  • You have the team to operate and practice failover regularly
✗ Avoid when
  • Single-region zonal redundancy is sufficient for the SLA
  • The team has never tested failover under real incident conditions
  • Replication lag and eventual consistency are unacceptable
  • Budget cannot support duplicate infrastructure running continuously
⚠️
Active-Active vs Active-Passive: Active-active distributes live traffic continuously across regions — higher steady-state cost, faster failover, but requires conflict resolution for writes. Active-passive holds a warm/hot secondary for failover only — lower cost but higher failover latency. Choose based on your RTO/RPO targets and budget.

Cloud-Specific Service Mappings

Capability | AWS | Azure | GCP
Global traffic routing | Route 53 / Global Accelerator | Front Door / Traffic Manager | Global Cloud LB / Cloud DNS
Regional compute | EC2 / ECS / EKS | VMSS / AKS / App Service | MIG / GKE / Cloud Run
Multi-region database | DynamoDB Global Tables / Aurora Global DB | Cosmos DB / SQL geo-replication | Cloud Spanner / AlloyDB
Edge WAF & DDoS | WAF + Shield | WAF + DDoS Protection | Cloud Armor
Observability | CloudWatch / X-Ray | Monitor + App Insights | Cloud Monitoring + Trace
Module 05

Enterprise Network & Security Standards

These standards apply to every architecture above. They represent the minimum acceptable security posture for any production system. Gaps here are not technical debt — they are active risk that must be tracked and remediated on a fixed timeline.

Network Isolation Model

  • Internet: public traffic, untrusted
  • Edge / DMZ: WAF, DDoS protection, TLS termination, CDN / edge LB
  • Public Subnet: load balancer, approved bastion, NAT gateway
  • Private App Subnet: compute (VMs/pods), workers, internal services, caches
  • Private Data Subnet: databases (no public IP), object storage, message brokers
🚨
Databases must never have public IP addresses. This includes RDS, Cosmos DB, Cloud SQL, and all relational and NoSQL stores. Access must flow exclusively through application services, private endpoints, or approved bastion paths. Any exception requires a formal architecture review and compensating controls.

Security Standard Areas

🔐 Identity & Access
  • Use platform-native identities: IAM Roles, Managed Identities, Service Accounts
  • Never hardcode static access keys in source code, images, or pipelines
  • Apply least privilege. Restrict to the minimum required resources and actions
  • Human access: role-based, time-bound where possible, fully auditable
  • Rotate and audit service account credentials on a scheduled cadence
🛡️ Perimeter Security
  • WAF required for every externally reachable application endpoint
  • Enable managed DDoS protection for Tier 0 and Tier 1 workloads
  • Terminate TLS at the approved ingress point only
  • Inbound rules: minimum required ports and source CIDRs only
  • Enable egress inspection for outbound data exfiltration controls
🔑 Encryption Standards
  • TLS 1.2 minimum for all client-to-service and service-to-service traffic
  • Encryption at rest: platform-managed or CMK via KMS / Key Vault / Cloud KMS
  • All secrets in a centralized secrets manager, never in environment sprawl
  • TLS certificates managed automatically (ACM, App Service Certs, CAS)
  • Audit key rotation policies quarterly
📋 Audit & Compliance
  • Enable cloud-native audit logs: CloudTrail, Activity Log, Cloud Audit Logs
  • Centralise logs to a SIEM or durable log aggregation store
  • Retain audit logs for the period required by applicable compliance frameworks
  • Enable threat detection: GuardDuty, Microsoft Defender, Security Command Center
  • Run infrastructure drift detection and policy-as-code validation in CI

Security Controls by Cloud

Control Area | AWS | Azure | GCP
Network boundary | VPC | VNet | VPC
WAF | AWS WAF | Azure WAF | Cloud Armor
DDoS protection | Shield Standard / Advanced | DDoS Protection Basic / Std | Cloud Armor + Google Edge
Identity standard | IAM Roles | Managed Identities | Service Accounts / Workload Identity
Key management | KMS | Key Vault | Cloud KMS
Secrets management | Secrets Manager | Key Vault Secrets | Secret Manager
Threat detection | GuardDuty | Microsoft Defender for Cloud | Security Command Center
Policy as code | AWS Config + SCP | Azure Policy | Org Policy Service
Module 06

Observability Standards

A system you cannot observe cannot be operated safely under incident conditions. Observability is not a "nice to have" — it is a first-class architecture concern applied at design time, not retrofitted after incidents.

📈
Metrics

Instrument all services with RED metrics: Request rate, Error rate, Duration (latency). Expose in Prometheus format or cloud-native metrics. Set alert thresholds on SLOs, not averages.

📝
Structured Logging

All logs must be structured (JSON). Include: correlation ID, service name, environment, trace ID, user context (anonymised). Never log secrets or PII.
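
One way to meet this requirement with the Python standard library is a custom `logging.Formatter` that emits one JSON object per record. This is a minimal sketch: the `JsonFormatter` class, the `context` attribute, and the `SENSITIVE_KEYS` set are hypothetical choices for illustration. Key-based redaction is only a safety net and does not replace keeping secrets and PII out of log calls in the first place.

```python
import json
import logging

SENSITIVE_KEYS = {"password", "token", "secret", "authorization"}


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object with the standard correlation fields."""

    def __init__(self, service, environment):
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record):
        # Callers attach fields via: logger.info("msg", extra={"context": {...}})
        context = getattr(record, "context", {})
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "environment": self.environment,
            "correlation_id": context.get("correlation_id"),
            "trace_id": context.get("trace_id"),
        }
        for key, value in context.items():
            if key not in payload:
                redact = key.lower() in SENSITIVE_KEYS
                payload[key] = "[REDACTED]" if redact else value
        return json.dumps(payload, sort_keys=True)
```

To use it, create a handler, call `handler.setFormatter(JsonFormatter("orders", "prod"))`, and attach the handler to the service logger. The service and environment names here are placeholders.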

🔗
Distributed Tracing

Propagate trace context (W3C Trace Context) across all service boundaries. Required for microservices and serverless. Essential for diagnosing latency in call chains.
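
W3C Trace Context carries the trace in a `traceparent` header of the form `version-traceid-parentid-flags`, for example `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`. In practice an OpenTelemetry SDK handles propagation. The sketch below only shows what happens at a service boundary. The function names are hypothetical, it handles version `00` only, and sampling a brand-new trace by default is an assumption rather than part of the specification.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the trace context from a version-00 traceparent header, or None if invalid."""
    match = _TRACEPARENT.match(header.strip()) if header else None
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the specification
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 0x01)}


def outgoing_traceparent(incoming_header):
    """Keep the caller's trace ID, mint a new span ID for the outbound call."""
    context = parse_traceparent(incoming_header)
    trace_id = context["trace_id"] if context else secrets.token_hex(16)
    sampled = context["sampled"] if context else True  # assumption: sample new traces
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

The trace ID stays constant across every hop while the span ID changes at each one, which is what lets a tracing backend stitch the call chain together.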

🏥
Health Probes

Expose /health/live (process alive) and /health/ready (dependencies healthy). Load balancers and orchestrators use readiness to gate traffic routing decisions.
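
The difference between the two probes can be shown with a framework-independent function that maps a request path to a status code and body. This is a sketch: `probe` and the shape of `dependency_checks` (a name mapped to a zero-argument callable that returns a truthy value when healthy) are assumptions, and a real service would wire this into its web framework's routing.

```python
def probe(path, dependency_checks):
    """Return (http_status, body) for the liveness and readiness endpoints."""
    if path == "/health/live":
        # Liveness only proves the process can answer; it never calls dependencies.
        return 200, {"status": "alive"}
    if path == "/health/ready":
        results = {}
        for name, check in dependency_checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False  # a failing check means "not ready", not a crash
        ready = all(results.values())
        status = "ready" if ready else "not_ready"
        return (200 if ready else 503), {"status": status, "dependencies": results}
    return 404, {"status": "unknown_path"}
```

Keeping dependency checks out of the liveness probe matters. If liveness failed whenever the database was down, the orchestrator would restart healthy processes and make the outage worse.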

🔔
Alerting on SLOs

Define SLOs for error rate and latency percentiles. Alert on error budget burn rate, not raw counts. A brief burn-rate spike need not page while the SLO is still healthy; an SLO breach always pages.
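
Burn rate is the observed error ratio divided by the error budget, where the budget is `1 - SLO target`. A burn rate of 1 spends the budget exactly over the SLO window. The sketch below uses a two-window check with a threshold of 14.4, a commonly cited value that corresponds to spending 2% of a 30-day budget in one hour. The function names, the threshold, and the window pairing are illustrative assumptions to be tuned per service.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only if both windows burn fast: sustained and still happening."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20 and pages. The same ratio in the long window with a recovered short window does not, because the problem has already stopped.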

📊
Dashboards

Maintain a service dashboard per architecture tier: request volume, error rate, P50/P95/P99 latency, saturation, and dependency health. Visible to the oncall team without digging.

Capability | AWS | Azure | GCP
Metrics & dashboards | CloudWatch | Azure Monitor | Cloud Monitoring
Distributed tracing | X-Ray | Application Insights | Cloud Trace
Log aggregation | CloudWatch Logs | Log Analytics | Cloud Logging
Managed Prometheus | Amazon Managed Service for Prometheus | Azure Monitor managed service for Prometheus | Managed Service for Prometheus
Module 07

Common Anti-Patterns

These are the most common architecture mistakes that reach production. Each one has a recognisable failure signature. Identifying them early in design review avoids incidents and costly refactors.

01
The Pinball Architecture — Long Synchronous Service Call Chains

One user request synchronously traverses five or six microservices before returning a response. Each hop adds latency, retry pressure, and a new failure point.

Why it fails
  • Latency compounds on every hop — 6 × 50ms = 300ms minimum
  • Retry storms amplify downstream instability under load
  • One degraded service cascades failure across the chain
  • Distributed ownership makes debugging extremely hard
  • Timeouts must be tuned for every pair, not just the edge
Preferred approach
  • Collapse overly chatty service boundaries into fewer, cohesive services
  • Use async events where real-time coupling is unnecessary
  • Apply circuit breakers, bulkheads, and timeouts at every boundary
  • Use the Backend-for-Frontend pattern to aggregate at the edge
  • Instrument every hop so latency attribution is visible
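
The compounding effect is simple arithmetic: in a synchronous chain, latencies add and availabilities multiply. The sketch below works through the six-hop example, assuming each hop independently offers 99.9% availability. That figure and the function names are illustrative, not measurements.

```python
import math


def chain_latency_ms(hop_latencies_ms):
    """Sequential synchronous hops: latencies add up."""
    return sum(hop_latencies_ms)


def chain_availability(hop_availabilities):
    """All hops must succeed, so independent availabilities multiply."""
    return math.prod(hop_availabilities)


latency = chain_latency_ms([50] * 6)           # 300 ms before any real work or retries
availability = chain_availability([0.999] * 6) # roughly 0.994, i.e. about 99.4%
```

Six services that each meet "three nines" therefore deliver only about 99.4% end to end, which is why collapsing hops or moving them off the synchronous path pays off.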
02
Public Databases — Databases Exposed to the Internet

A database is reachable from the public internet, either intentionally through a public IP or accidentally through a misconfigured firewall or permissive security group.

Why it fails
  • Dramatically expands the attack surface
  • Credential leakage is immediately exploitable from anywhere
  • Bypasses layered network security assumptions
  • Creates audit, compliance, and regulatory exposure
  • Breach notification requirements are triggered on compromise
Preferred approach
  • Keep all databases in private subnets — no public IP addresses
  • Access only through application services or approved bastion
  • Use private endpoints / VPC peering for cross-service access
  • Enforce with SCPs, Azure Policy, or Org Policy constraints
  • Run automated scanning to detect public resources
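
Automated scanning is best done with the provider's policy tooling, but the underlying check is straightforward. The sketch below flags inbound rules that open a database port to a source outside the RFC 1918 private ranges. The rule dictionary shape (`name`, `port`, `source_cidr`), the `DB_PORTS` set, and the function names are hypothetical; a real scanner would read rules from the cloud provider's API. IPv6 sources are treated as external here to stay conservative.

```python
import ipaddress

# Common default ports: SQL Server, MySQL, PostgreSQL, MongoDB, Redis.
DB_PORTS = {1433, 3306, 5432, 27017, 6379}
PRIVATE_RANGES = [ipaddress.ip_network(cidr)
                  for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_internal(cidr):
    """True only if the whole source range sits inside an RFC 1918 block."""
    network = ipaddress.ip_network(cidr, strict=False)
    return any(network.version == 4 and network.subnet_of(private)
               for private in PRIVATE_RANGES)


def find_public_db_exposure(rules):
    """Return inbound rules that expose a database port to a non-private source."""
    return [rule for rule in rules
            if rule["port"] in DB_PORTS and not is_internal(rule["source_cidr"])]
```

Running a check like this in CI against planned infrastructure changes catches an open `0.0.0.0/0` rule before it reaches production.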
03
Over-Provisioning — Paying for Peak Capacity Year-Round

Infrastructure is sized for Black Friday every day. Spikes that occur a few times a year drive permanent baseline capacity decisions.

Why it fails
  • Constant idle spend on underutilised capacity
  • Delays adoption of elastic design patterns
  • Often reflects missing load tests or poor traffic forecasting
  • Teams accept over-provisioning rather than solve the hard problem
Preferred approach
  • Instrument actual utilisation before sizing baseline capacity
  • Use auto-scaling groups, HPA, managed instance groups
  • Queue-based burst absorption for bursty ingest paths
  • Reserve steady-state baseline only; let burst expand elastically
  • Run load tests and define data-driven scaling triggers
04
Microservices Premature Extraction — Distributed Monolith

Services are split before domain boundaries are understood. The result is a distributed monolith — all the operational complexity of microservices with none of the autonomy.

Why it fails
  • Tight coupling between services creates shared deployment dependencies
  • Teams cannot release independently — the core problem remains unsolved
  • Operational complexity increases with no autonomy benefit
  • Data boundaries are unclear — shared database persists across "services"
Preferred approach
  • Start with a modular monolith — well-structured internal modules
  • Extract services only when a domain boundary is stable and well-understood
  • Each extracted service owns its own data store exclusively
  • Validate that teams can deploy the service independently before extraction
The maturity test: Can your team detect, diagnose, and remediate an incident in any service at 3am with confidence? If not, the architecture may be more complex than the team can safely operate. Simplify before complexity compounds.
Reference

RTO / RPO Target Matrix

Recovery Time Objective (RTO) and Recovery Point Objective (RPO) drive the architecture pattern and database replication strategy. Define these at design time with the business — not during an incident.

Tier | Target RTO | Target RPO | Typical pattern
Tier 0 (Mission Critical) | < 5 min | < 30 sec | Multi-region active-active
Tier 1 (Business Critical) | < 30 min | < 5 min | Multi-region active-passive
Tier 2 (Important) | < 4 hrs | < 1 hr | Multi-AZ + regular snapshots
Tier 3 (Standard) | < 24 hrs | < 24 hrs | Single-region, daily backups
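
The matrix above can be read as a lookup: given the RTO and RPO the business will accept, pick the least demanding tier whose targets still fit inside them. The sketch below encodes the four rows in minutes. The `pattern_for` function and its fallback to Tier 0 for requirements stricter than the matrix are assumptions made for illustration.

```python
TIERS = [
    # (tier, target RTO in minutes, target RPO in minutes, typical pattern)
    ("Tier 0", 5, 0.5, "Multi-region active-active"),
    ("Tier 1", 30, 5, "Multi-region active-passive"),
    ("Tier 2", 240, 60, "Multi-AZ + regular snapshots"),
    ("Tier 3", 1440, 1440, "Single-region, daily backups"),
]


def pattern_for(required_rto_min, required_rpo_min):
    """Least demanding tier whose RTO and RPO targets both meet the requirement."""
    for tier, rto, rpo, pattern in reversed(TIERS):
        if rto <= required_rto_min and rpo <= required_rpo_min:
            return tier, pattern
    tier, _, _, pattern = TIERS[0]  # stricter than the matrix: use the strictest tier
    return tier, pattern
```

Note that both objectives constrain the choice. A four-hour RTO combined with a five-minute RPO lands in Tier 1, not Tier 2, because Tier 2's one-hour RPO would lose too much data.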
ℹ️
Test, don't assume. Failover paths and backup restoration procedures must be tested on a regular schedule — quarterly at minimum for Tier 0/1. An untested DR plan is not a DR plan. Document the runbook and review it after every significant architecture change.
Final Standards

Architecture Checklist

Work through each item to track review progress for a given architecture.

  • Simplest viable architecture selected — chosen because of workload characteristics and team maturity, not trend pressure.
  • Data plane is private — no databases, message brokers, or internal services have public IP addresses.
  • WAF and DDoS protection — applied at every internet-facing endpoint, not only production.
  • TLS 1.2+ enforced everywhere — client-to-service and service-to-service. TLS termination only at approved ingress points.
  • Platform-native identity in use — IAM roles, managed identities, or service accounts. No static access keys in code, images, or configs.
  • All secrets in a secrets manager — AWS Secrets Manager, Azure Key Vault, or GCP Secret Manager. None in environment files or repos.
  • Horizontal elasticity in every tier — web, app, and where possible data layers scale on measured load, not assumptions.
  • Failure domain design validated — system tolerates loss of a single instance, a zone, and (for Tier 0/1) a full region.
  • RTO and RPO targets defined with business stakeholders — and the architecture demonstrably meets them.
  • DR / failover tested — not assumed. Recovery procedures are documented and have been exercised in the last 90 days (Tier 0/1).
  • Observability is complete — metrics (RED), structured logs, distributed traces, health probes, and alert policies are live in all environments.
  • Anti-pattern review done — pinball chains, public databases, over-provisioning, and premature service extraction checked and clear.

Module 1: The Classic 3-Tier Web Architecture

The Concept

The 3-tier model separates presentation, application logic, and persistence into distinct layers. This keeps web concerns, business logic, and database responsibilities isolated, which improves maintainability, security segmentation, and operational clarity.

flowchart TD
    U[Users] --> LB[Public Load Balancer]
    LB --> W1[Web Server 1]
    LB --> W2[Web Server 2]
    LB --> WN[Web Server N]
    W1 --> A1[App Server 1]
    W2 --> A2[App Server 2]
    WN --> AN[App Server N]
    A1 --> DBP[(Primary SQL Database)]
    A2 --> DBP
    AN --> DBP
    DBP --> DBR[(Read Replica)]

Detailed Explanation & When to Use It

This pattern is still highly effective for enterprise monoliths, internal portals, legacy migrations, and systems that require OS-level customization. It is easy to reason about, works well with traditional operations teams, and creates a clean path for horizontal growth at the web and app layers.

Use it for monolithic business applications, lift-and-shift migrations, and workloads that need VM-level control or middleware customization.

Cloud-Specific Mappings

Layer | AWS | Azure | GCP
Public entry | Application Load Balancer | Application Gateway / Load Balancer | Cloud Load Balancing
Web tier VMs | EC2 Auto Scaling Group | Virtual Machine Scale Sets | Managed Instance Group
App tier VMs | EC2 Auto Scaling Group | Virtual Machine Scale Sets | Managed Instance Group
Managed SQL | Amazon RDS / Aurora | Azure SQL Database / SQL Managed Instance | Cloud SQL
Read replicas | RDS Read Replica / Aurora Replica | Azure SQL read scale-out / geo-replica | Cloud SQL read replica

Module 2: Containerized Microservices (Kubernetes)

The Concept

Traffic enters a Kubernetes cluster through an ingress or API gateway, then routes to independently deployable microservices running in pods. Each service can own its runtime, scaling policy, deployment cadence, and supporting data systems.

flowchart TD
    U[Users] --> GW[API Gateway / Ingress Controller]
    GW --> SA[Service A Pods]
    GW --> SB[Service B Pods]
    SA --> NDB[(NoSQL Database)]
    SB --> CACHE[(Distributed Cache)]

Detailed Explanation & When to Use It

This model is strongest when multiple teams need independent release lifecycles, domain boundaries are well understood, and the platform must support multiple languages or frameworks. It trades simplicity for autonomy and control.

Use it for high-velocity product teams, polyglot estates, and systems where service-level ownership and independent deployment matter materially.

Cloud-Specific Mappings

Capability | AWS | Azure | GCP
Managed Kubernetes | Amazon EKS | AKS | GKE
Ingress / API entry | AWS Load Balancer Controller / API Gateway | Application Gateway Ingress Controller / API Management | GKE Ingress / API Gateway
Container registry | Amazon ECR | Azure Container Registry | Artifact Registry
NoSQL store | DynamoDB | Cosmos DB | Firestore / Bigtable
Cache | ElastiCache | Azure Cache for Redis | Memorystore

Module 3: Modern Serverless & Event-Driven Architecture

The Concept

Serverless and event-driven systems decouple producers from consumers. A synchronous edge request can trigger an initial function, which emits an event onto a bus or topic. Independent downstream consumers then process that event without direct runtime coupling.

flowchart TD
    U[Users / Clients] --> API[API Gateway]
    API --> F1[Ingress Function]
    F1 --> BUS[Event Bus / Topic]
    BUS --> F2[Notification Function]
    BUS --> F3[Audit / Processing Function]

Detailed Explanation & When to Use It

This architecture is ideal for bursty traffic, integration glue, asynchronous workflows, and cost-sensitive workloads that should not pay for idle servers. It shifts the design emphasis from host management to event contracts, idempotency, concurrency, and retry behavior.

Use it for unpredictable traffic patterns, automation workflows, glue code, and rapid prototyping with minimal idle infrastructure cost.

Cloud-Specific Mappings

Capability | AWS | Azure | GCP
API entry | Amazon API Gateway | API Management / Functions HTTP trigger | API Gateway
Serverless compute | AWS Lambda | Azure Functions | Cloud Functions
Event router / topic | EventBridge / SNS / SQS | Event Grid / Service Bus | Pub/Sub
Workflow orchestration | Step Functions | Durable Functions / Logic Apps | Workflows

Module 4: Global High Availability (Multi-Region)

The Concept

Multi-region architecture protects the system from regional failure and reduces latency for globally distributed users. Traffic is routed by a global DNS or traffic manager layer, while application stacks and databases operate across at least two regions.

flowchart TD
    U[Global Users] --> DNS[Global DNS / Traffic Router]
    DNS --> RA[Region A Application Stack]
    DNS --> RB[Region B Application Stack]
    RA --> DBA[(Regional Database A)]
    RB --> DBB[(Regional Database B)]
    DBA -. Replication .-> DBB
    DBB -. Replication .-> DBA

Detailed Explanation & When to Use It

This architecture is designed for region-level catastrophic failure, not just zone loss. It requires careful decisions around active-active versus active-passive topology, global routing policy, failover automation, and replication lag tolerance.

Use it for mission-critical applications, regulated systems, financial platforms, and globally distributed products where low downtime and low latency both matter.

Cloud-Specific Mappings

Capability | AWS | Azure | GCP
Global traffic routing | Route 53 / Global Accelerator | Azure Front Door / Traffic Manager | Global Cloud Load Balancing / Cloud DNS
Regional app platform | EC2 / ECS / EKS | VMSS / AKS / App Service | MIG / GKE / Cloud Run
Multi-region database | DynamoDB Global Tables / Aurora Global Database | Cosmos DB / Azure SQL geo-replication | Cloud Spanner
Edge security | AWS WAF + Shield | Azure WAF + DDoS Protection | Cloud Armor

Module 5: Enterprise Network & Security Standards

flowchart LR
    I[Internet] --> EDGE[WAF / DDoS / Edge LB]
    EDGE --> PUB[Public Subnet]
    PUB --> PRIVAPP[Private App Subnet]
    PRIVAPP --> PRIVDATA[Private Data Subnet]

Network Isolation

All production systems must run inside VPCs or VNets with a strict split between public ingress and private workloads. Public subnets are reserved for edge components such as load balancers and approved bastions. Compute, workers, and all databases belong in private subnets. Public database exposure is not acceptable.

Perimeter Security

Every externally reachable application must be protected by a WAF and appropriate edge DDoS controls. Terminate TLS at approved ingress points and restrict inbound traffic to explicitly required paths and ports.

Identity & Access

Use platform-native identities and least privilege everywhere. IAM roles, managed identities, and workload identities replace static access keys. Human access must be role-based, time-bound where possible, and auditable.

Data Security

Encryption in transit using TLS 1.2 or higher is mandatory. Encryption at rest must use platform key management such as AWS KMS, Azure Key Vault, or Cloud KMS. Secrets belong in centralized secrets managers, not in code, images, or ad hoc environment files.

Control Area | AWS | Azure | GCP
Private network boundary | VPC | VNet | VPC
Web perimeter | AWS WAF | Azure WAF | Cloud Armor
DDoS protection | AWS Shield | Azure DDoS Protection | Cloud Armor / Google edge protections
Identity standard | IAM Roles | Managed Identities | Service Accounts / Workload Identity
Key management | AWS KMS | Azure Key Vault | Cloud KMS

Module 6: Common Pitfalls & Anti-Patterns

1. The Pinball Architecture
Long synchronous microservice call chains compound latency and create cascading failures. Prefer simpler boundaries, asynchronous fan-out where appropriate, and explicit timeout plus retry policies.
2. Public Databases
Assigning public IPs to databases expands the attack surface dramatically. Keep all databases private and expose access only through private networking, approved bastions, and identity-aware controls.
3. Over-Provisioning for What If
Running peak-sized infrastructure all year wastes budget and discourages elastic design. Prefer auto-scaling, queue buffering, and burst capacity patterns matched to measured demand.
Architecture maturity means knowing when not to distribute. A simpler architecture with strong scaling and security controls is better than a fashionable distributed design that the team cannot operate safely.
