Software Architecture Handbook

🗺

Overview

// why architecture decisions are the most expensive ones you'll make

Architecture decisions are the ones that are hardest to reverse. Choosing how to decompose your system — how it communicates, how data flows, how teams are organised around it — shapes every engineering decision that follows. Get it wrong and you spend years fighting your own infrastructure instead of building product.

There is no universally correct architecture. Every pattern is a set of tradeoffs. The goal of this handbook is to make those tradeoffs explicit so your team can make deliberate, informed choices instead of defaulting to the industry fashion of the moment.

◆

2024 Industry Reality: After a decade of microservices evangelism, major engineering organisations, including Amazon, Shopify, Stack Overflow, and Basecamp, are publicly consolidating onto, or defending, simpler modular architectures. The lesson isn't that microservices are bad; it's that they require organisational maturity most teams don't have.

Complexity Budget

Every architectural decision spends from your team's complexity budget. Distributed systems, eventual consistency, and service meshes are expensive. Spend wisely.

Conway's Law

Your architecture will mirror your team structure — and vice versa. You can't deploy a microservices architecture with a 5-person team and expect it to work.

Reversibility

Prefer architectures you can migrate out of. A monolith that's well-structured can be extracted into services later. A distributed mess cannot easily be collapsed.

⚖️

How to Choose an Architecture

// the questions that actually drive the right decision

Question	Lean Simpler	Lean Distributed
Team size	1–25 engineers	25–100+ engineers
Domain clarity	New / exploratory product	Well-understood, stable domain
Deployment frequency	Weekly or less	Multiple times per day per team
Scale requirements	<10k RPS, single region	Global, massive, independent scale
Regulatory isolation	Shared data model is fine	Hard data boundaries required
Technology diversity	Single language/stack is fine	Legitimately need different runtimes
Operational maturity	Small or growing ops team	Mature platform / SRE capability
Time to market	Speed is #1 priority now	Long-term autonomy matters more

⚠️

The startup trap: Most early-stage teams over-engineer. If you can't clearly articulate which part of your system needs to scale independently and why, you don't need microservices yet. You need a working product.

🧱

The Monolith

// the original, still powerful

Monolithic Architecture

Traditional · Battle-Tested

All application logic (UI, business logic, data access) lives in a single deployable unit: one codebase, one process, one deployment, usually backed by a single database. Despite its reputation as "legacy," a well-written monolith outperforms a poorly written distributed system in almost every practical dimension.

✓ Use When

Small team (under 15 engineers)
Exploring product-market fit
Domain is unclear or evolving
Limited operational resources
Need fast iteration cycles
Vertical scaling is sufficient

✗ Avoid When

Teams stepping on each other daily
Parts need wildly different scale
Releases take hours to deploy
Multiple teams own same codebase
Technology diversity is required

⚠ Watch Out For

The Big Ball of Mud anti-pattern
Shared database becoming a bottleneck
Test suite growing slower over time
Long release cycles start to hurt
Cross-cutting concerns tangled

Monolith Anti-Patterns

ANTI-PATTERN: Big Ball of Mud

No discernible structure. Everything imports everything. Business logic in controllers, database calls in view templates, validation spread across the codebase. Common result of adding features rapidly without maintaining layer discipline.

Fix

Enforce clear layers (presentation → application → domain → infrastructure). Introduce module boundaries and forbidden-dependency rules. Treat internal architecture with the same care as API design.
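A forbidden-dependency rule can be checked mechanically. The sketch below is only an illustration: the `LAYERS` list, the `find_violations` helper, and the example import graph are all hypothetical, and a real project would more likely derive the graph from source code with a dedicated tool.

```python
# Layer order from outermost to innermost, following the fix described above.
LAYERS = ["presentation", "application", "domain", "infrastructure"]


def find_violations(imports: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (importer, imported) pairs that point upward in the layer order."""
    rank = {name: i for i, name in enumerate(LAYERS)}
    violations = []
    for importer, targets in imports.items():
        for target in sorted(targets):
            # A layer may depend on layers below it, never on layers above it.
            if rank[target] < rank[importer]:
                violations.append((importer, target))
    return violations


# Hypothetical import graph: the domain layer reaches up into presentation.
example = {
    "presentation": {"application"},
    "application": {"domain"},
    "domain": {"infrastructure", "presentation"},
    "infrastructure": set(),
}
```

Run as part of the test suite, a check like this fails the build as soon as someone adds an upward import, rather than leaving the rule to code review.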

ANTI-PATTERN: Shared Database as Integration Layer

Multiple applications or services talk to the same database directly, including updating each other's tables. Schema changes become impossible without coordinating every consumer.

Fix

Each logical module owns its tables. Expose data to others via API or events — never by granting direct table access. Even in a monolith, enforce this boundary to prepare for future extraction.

ANTI-PATTERN: Anemic Domain Model

Domain objects are just data holders with no behaviour. All business logic lives in service classes that orchestrate dumb models. Results in scattered, duplicated business rules that are hard to test and reason about.

Fix

Apply DDD principles: move invariants and business rules into rich domain objects. An Order should know how to calculate its total and validate its own state — not delegate that to a 500-line OrderService.
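As a rough illustration of what "rich" means here, the sketch below puts the total calculation and the state checks on the `Order` itself. The class names, fields, and specific rules are assumptions made up for the example, not a prescribed model.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class OrderLine:
    sku: str
    unit_price: Decimal
    quantity: int


@dataclass
class Order:
    lines: list[OrderLine] = field(default_factory=list)
    submitted: bool = False

    def add_line(self, line: OrderLine) -> None:
        # Invariants live on the aggregate, not in an external service class.
        if self.submitted:
            raise ValueError("cannot modify a submitted order")
        if line.quantity <= 0:
            raise ValueError("quantity must be positive")
        self.lines.append(line)

    def total(self) -> Decimal:
        return sum((l.unit_price * l.quantity for l in self.lines), Decimal("0"))

    def submit(self) -> None:
        if not self.lines:
            raise ValueError("cannot submit an empty order")
        self.submitted = True
```

Because the rules travel with the object, every caller gets the same validation, and the rules can be unit-tested without a database or a service layer.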

🏗️

Modular Monolith

// the architecture the industry is rediscovering

Industry Trend 2023–2025

The Return of the Modular Monolith

Amazon, the company whose "two-pizza team" rule helped define microservices culture, published a Prime Video engineering blog post in 2023 describing how one team moved its audio/video quality monitoring service from a distributed, serverless design back to a single process, cutting that service's infrastructure cost by 90%. Shopify, Stack Overflow, and Basecamp have long run, or written extensively about preferring, well-structured monoliths. This isn't nostalgia; it's hard-won pragmatism.

Modular Monolith

Recommended Default · 2024 Best Practice

A single deployable unit — like a monolith — but with strong, enforced internal module boundaries. Each module owns its domain, its data, and its public API surface. Modules communicate through well-defined interfaces, not by reaching into each other's internals. Deployed as one unit; structured like services. You get the simplicity of a monolith with the logical isolation of services — and the ability to extract true services when you genuinely need to.

✓ Use When

Team of 5–50 engineers
Domain reasonably well understood
Want service-like isolation without ops overhead
May need to extract services later
Single-region deployment is fine
Growth stage: 1M–50M users

Key Principles

Module boundaries enforced at build time
Each module has its own DB schema/tables
Cross-module calls via public interfaces only
No reaching into another module's internals
Modules can be tested in isolation
Use ArchUnit / Dependency Cruiser to enforce

Extraction Path

Start with well-defined module boundaries
Add messaging abstraction inside the monolith
When a module needs to scale independently — extract
The interface stays the same, delivery changes
You'll know exactly where boundaries belong
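One way to picture the "messaging abstraction inside the monolith" step is a small in-process publish/subscribe bus. This is a minimal sketch under assumptions: `InProcessBus`, the `order.placed` event name, and the modules mentioned in the comments are hypothetical.

```python
from collections import defaultdict
from typing import Callable


class InProcessBus:
    """Synchronous, in-memory publish/subscribe for use inside one process."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Handlers run inline here; a broker-backed bus would enqueue instead.
        for handler in self._handlers[event_type]:
            handler(payload)


bus = InProcessBus()
shipped: list[str] = []

# The (hypothetical) shipping module reacts to an event from the orders module
# without importing anything from it.
bus.subscribe("order.placed", lambda event: shipped.append(event["order_id"]))
bus.publish("order.placed", {"order_id": "o-1"})
```

If a module is later extracted, callers can keep using `subscribe` and `publish` while the implementation behind them switches to an external broker.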

💡

Shopify's approach: Shopify runs one of the world's largest Rails applications. Rather than splitting into microservices, they invested deeply in module boundaries within their monolith, using component boundaries that are enforced by tooling. The result: large-scale team autonomy without distributed systems overhead.

🔬

Microservices

// powerful at scale, expensive everywhere else

Microservices Architecture

For Large Organisations · High Ops Cost

The system is decomposed into small, independently deployable services — each owned by a small team, each with its own database, each communicating over a network (HTTP/gRPC/messaging). Enables independent scaling and deployment per service. Requires mature platform engineering, service discovery, distributed tracing, and organisational alignment to actually succeed.

✓ Genuine Use Cases

50+ engineers with clear team ownership
Parts truly need independent scale (e.g. payments vs. search)
Regulatory isolation requirements
Different technology stacks legitimately needed
Mature DevOps/Platform Engineering org
Netflix, Amazon, Uber scale

Real Benefits

True team independence & autonomy
Independent deployment — no coordination
Independent scaling per service
Fault isolation (one service fails, others don't)
Technology freedom per service

Real Costs

Distributed transactions are hard
Network latency is now everywhere
Observability requires investment
Service discovery and mesh overhead
Testing integration paths is complex
Operational overhead multiplied by N

Microservice Anti-Patterns

ANTI-PATTERN: Distributed Monolith

Services that are physically separate but logically coupled — they call each other synchronously in a chain, share a database, or must be deployed together. You get all the costs of distribution with none of the benefits. The worst of both worlds.

Fix

True service independence requires independent data stores and async communication. If two services must always deploy together, merge them. If they chain-call 6 deep to handle one request, your boundaries are wrong.

ANTI-PATTERN: Nano-services / Over-decomposition

Services so small they have no business logic — just CRUD wrappers around a table. 50 services for a 10-person team. Operational burden is massive; every feature requires coordinating 5 service deployments.

Fix

A service should map to a bounded context — a meaningful domain concept, not a table. If a service is just a thin wrapper around a database with no business rules, it belongs inside another service.

ANTI-PATTERN: Synchronous Chain Calls

Service A calls B, which calls C, which calls D. Latency accumulates at every hop and failures cascade: one slow service makes everything slow. Availability multiplies too, so four hops at 99.9% each leave the chain at roughly 99.6%.

Fix

Prefer async messaging for non-blocking work. Use the Saga pattern for distributed transactions. Design services to be autonomous — if Service A needs data from Service B to function, consider whether A is actually part of B's bounded context.
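The cost of a synchronous chain can be put into rough numbers. The sketch below assumes hops fail independently and run strictly one after another, which is a simplification; the function names are illustrative only.

```python
import math


def chain_availability(availabilities: list[float]) -> float:
    """Availability of a request that needs every hop to succeed."""
    return math.prod(availabilities)


def chain_mean_latency_ms(mean_latencies_ms: list[float]) -> float:
    """Mean latency of sequential hops: the per-hop means simply add up."""
    return sum(mean_latencies_ms)


# Four hops (A -> B -> C -> D), each available 99.9% of the time.
four_hops = chain_availability([0.999] * 4)
```

Under these assumptions `four_hops` comes out near 0.996, so a chain of "three nines" services delivers noticeably less than three nines end to end.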

ANTI-PATTERN: Shared Library Hell

A giant shared library ("commons" / "core") that all services depend on. Updating the library requires coordinating and re-deploying all services. The services are no longer independently deployable.

Fix

Share as little as possible. When sharing is necessary (e.g., authentication middleware, telemetry), version it explicitly and allow services to upgrade at their own pace. Treat shared libraries like external packages.

⚡

Event-Driven Architecture

// decouple producers from consumers through async messaging

Event-Driven Architecture (EDA)

Async Decoupling · High Scalability

Components communicate by producing and consuming events through a message broker (Kafka, RabbitMQ, AWS EventBridge, NATS). Producers don't know who consumes their events. Consumers subscribe to events they care about. Enables temporal decoupling, high throughput, and extensibility — add a new consumer without touching the producer. Works beautifully within a modular monolith (in-process events) or across microservices (external broker).

✓ Use When

High-throughput pipelines (analytics, logs)
Fan-out: one event, many consumers
Workflows with long-running steps
Audit logs / event sourcing
Real-time data processing
Decoupling legacy systems

Core Concepts

Domain Events — something happened
Commands — requests to do something
Sagas — distributed long-running transactions
Outbox Pattern — reliably publish events
Consumer Groups — competing consumers
Dead Letter Queues — failed message handling
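Of the concepts listed, the Outbox Pattern is perhaps the easiest to misread, so here is a minimal sketch using SQLite from the standard library. The table names, the `order.placed` event, and the `publish` callback standing in for a broker client are all assumptions for illustration.

```python
import json
import sqlite3
from typing import Callable


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT, published INTEGER DEFAULT 0)"
    )
    return conn


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    # State change and event are written in ONE transaction, so neither can
    # be committed without the other.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order.placed", "order_id": order_id}),),
        )


def relay(conn: sqlite3.Connection, publish: Callable[[dict], None]) -> int:
    """Publish pending outbox rows in order; mark each only after it is sent."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # may raise; the row then stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

If the relay crashes after publishing but before marking the row, the event is sent again on the next run. The pattern therefore gives at-least-once delivery, and consumers still have to tolerate duplicates.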

⚠ Watch Out For

Event schema evolution is hard
Debugging async flows is complex
Eventual consistency surprises users
Message ordering assumptions
At-least-once delivery = idempotency required
"Event spaghetti" — undocumented flows

Event-Driven Anti-Patterns

ANTI-PATTERN: Event Sourcing Everything

Applying Event Sourcing to every entity in the system because it sounds elegant. Event Sourcing has real operational complexity: rebuilding state from events, projections, schema migration of historical events. Most CRUD entities don't benefit.

Fix

Use Event Sourcing only for aggregates where the history of changes is a core domain requirement (financial transactions, audit logs, order lifecycle). Use standard CRUD with domain events for everything else.
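For an aggregate where history really is the requirement, the core mechanic is replaying events to rebuild state. The sketch below uses a made-up account ledger; the `Event` shape and the event names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str          # "deposited" or "withdrawn" (hypothetical event names)
    amount_cents: int


def balance_from(events: list[Event]) -> int:
    """Rebuild current state by replaying the full event history in order."""
    balance = 0
    for event in events:
        if event.kind == "deposited":
            balance += event.amount_cents
        elif event.kind == "withdrawn":
            balance -= event.amount_cents
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return balance


history = [Event("deposited", 10_000), Event("withdrawn", 2_500), Event("deposited", 500)]
```

Even this toy version hints at the operational cost: every new event kind must be handled by every replay function, and old events have to remain readable for as long as they exist.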

ANTI-PATTERN: Fat Events / God Events

Events that contain the entire entity payload: every field, every time. Consumers become dependent on the event schema in the same way as on a shared database. Renaming or removing any field breaks consumers, and even additions are risky when consumers validate strictly.

Fix

Events should carry minimal context: what happened and the ID of what changed. Consumers that need more data should query the producing service's API, or maintain their own read model built from the events they receive (event-carried state transfer, used carefully).

ANTI-PATTERN: Missing Idempotency

Most message brokers guarantee at-least-once delivery, so duplicates will arrive. If your consumer isn't idempotent (that is, if processing the same message twice does not leave the same result as processing it once), you will double-charge customers, double-send emails, or corrupt state.

Fix

Every event consumer must be idempotent. Track processed message IDs in a database (idempotency key). Use conditional writes or upsert semantics. Treat duplicate events as a first-class concern, not an edge case.
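A minimal sketch of the processed-ID approach, using SQLite from the standard library. The table layout and the message fields (`message_id`, `order_id`, `amount_cents`) are assumptions chosen for the example.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (order_id TEXT, amount_cents INTEGER)")
    return conn


def handle_payment(conn: sqlite3.Connection, message: dict) -> bool:
    """Apply the charge once; return False when the message is a duplicate."""
    try:
        with conn:  # one transaction: dedup record and side effect commit together
            conn.execute(
                "INSERT INTO processed (message_id) VALUES (?)",
                (message["message_id"],),
            )
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (message["order_id"], message["amount_cents"]),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key conflict: this message ID was already processed.
        return False
```

Recording the ID and applying the effect in the same transaction is the important design choice. If they were committed separately, a crash between the two could still lose or duplicate the charge.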

☁️

Serverless Architecture

// pay per execution, infinite scale, real constraints

Serverless / FaaS Architecture

Event-Triggered · Low Ops

Functions-as-a-Service (Lambda, Cloud Functions, Azure Functions) triggered by events. No servers to manage; scales automatically from zero up to the provider's concurrency limits; pay only for actual execution time. Ideal for irregular workloads, webhooks, and event pipelines. Cold starts, execution time limits, and vendor lock-in are real considerations.

✓ Best For

Webhooks & event processors
Scheduled/cron jobs
Image/video processing pipelines
APIs with low or spiky traffic
Glue code between services
Background job processing

✗ Poor Fit

Long-running computations (AWS Lambda caps a single invocation at 15 minutes)
Latency-sensitive APIs (cold starts)
Stateful workloads
High sustained throughput (cost)
Complex local development
When vendor lock-in is unacceptable

⚠ Anti-Patterns

Monolithic Lambda — 1 function handles everything
Lambda calling Lambda synchronously
Unpooled database connections from Lambda: every concurrent instance holds its own, which can exhaust the database
Not handling cold starts for user-facing paths
Missing Dead Letter Queues for failures

🔀

Backend for Frontend (BFF)

// tailored APIs for each client type

The BFF pattern, popularised by Sam Newman, solves a specific problem: a general-purpose API designed for all clients optimises for none of them. A mobile app needs different data shapes, payload sizes, and endpoints than a web app or a third-party partner API. The BFF introduces a dedicated API layer per client type.

Architecture Overview

🌐Web App

📱iOS App

🤖Android

🔗Partner API

↕

BFF — Web

BFF — Mobile

BFF — Partner

↕ (internal APIs / services)

User Service

Order Service

Product Service

Payment Service

Notifications

Why BFF Works

Mobile BFF returns compact payloads; Web BFF returns full data
Each BFF is owned by the client team, so no upstream negotiation is needed
Aggregates/transforms data from multiple backend services
Handles client-specific auth flows and session management
Can evolve independently of backend services
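The aggregation and shaping role can be sketched as two small functions over the same downstream calls. The `fetch_user` and `fetch_orders` stand-ins, their fields, and the returned payload shapes are hypothetical; a real BFF would make network calls here.

```python
def fetch_user(user_id: str) -> dict:
    # Stand-in for a call to the (hypothetical) User Service.
    return {"id": user_id, "name": "Ada", "email": "ada@example.com", "bio": "..."}


def fetch_orders(user_id: str) -> list[dict]:
    # Stand-in for a call to the (hypothetical) Order Service.
    return [{"id": "o-1", "total_cents": 1200, "items": ["sku-1", "sku-2"]}]


def web_profile(user_id: str) -> dict:
    """Web BFF: full user record plus full order details."""
    return {"user": fetch_user(user_id), "orders": fetch_orders(user_id)}


def mobile_profile(user_id: str) -> dict:
    """Mobile BFF: same sources, trimmed to what a small screen shows."""
    user, orders = fetch_user(user_id), fetch_orders(user_id)
    return {"name": user["name"], "order_count": len(orders)}
```

Note that neither function contains a business rule. Both only combine and trim what the downstream services already decided.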

BFF Anti-Patterns

One BFF for all clients — defeats the purpose entirely
Business logic in BFF — BFFs should aggregate, not implement rules
BFF calling BFF — creates coupling between client layers
Too many BFFs — one per team is a smell; one per client type is right
BFF becomes a God Service — pulls too much logic from downstream

💡

GraphQL as BFF: GraphQL works exceptionally well in the BFF layer — clients specify exactly what data they need, eliminating over- and under-fetching. The BFF resolves fields from downstream services. This is how companies like GitHub, Shopify, and Netflix use GraphQL — as an aggregation layer, not as their internal service communication protocol.

📋

CQRS & Event Sourcing

// separate reads from writes for scale and auditability

CQRS — Command Query Responsibility Segregation

Write / Read Split

Separate the model used for writes (Commands) from the model used for reads (Queries). Commands mutate state; Queries return data. Read models are optimised for their specific query pattern — denormalised, pre-computed, fast. Write models enforce business invariants. Often combined with Event Sourcing, where the event log is the source of truth and read models are projections.
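A compact sketch of the split, assuming a toy inventory domain. The `Inventory` write model, the `LowStockView` projection, and the tuple-shaped events are illustrative names, not part of any framework.

```python
class Inventory:
    """Write side: enforces the invariant that stock never goes negative."""

    def __init__(self) -> None:
        self._stock: dict[str, int] = {}
        self.events: list[tuple[str, str, int]] = []

    def adjust(self, sku: str, delta: int) -> None:  # a command
        new_level = self._stock.get(sku, 0) + delta
        if new_level < 0:
            raise ValueError(f"insufficient stock for {sku}")
        self._stock[sku] = new_level
        self.events.append(("stock_adjusted", sku, delta))


class LowStockView:
    """Read side: a denormalised projection shaped for one query."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._levels: dict[str, int] = {}

    def apply(self, event: tuple[str, str, int]) -> None:
        _, sku, delta = event
        self._levels[sku] = self._levels.get(sku, 0) + delta

    def low_stock_skus(self) -> list[str]:  # a query
        return sorted(s for s, n in self._levels.items() if n < self.threshold)
```

The read side never validates anything and the write side never answers list queries, which is the whole point. In a real system the projection would usually lag the write side slightly.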

✓ Use When

Read/write ratio is very asymmetric
Queries need data in complex shapes
Need full audit history (Event Sourcing)
Complex domain with many invariants
Scaling reads independently from writes

✗ Avoid When

Simple CRUD domains
Small teams — overhead is high
Users expect immediate consistency
Domain logic is not complex enough to warrant it

⚠ Anti-Patterns

Sharing a write model with reads
Returning query data from command handlers
Applying CQRS to every entity, not just complex ones
Event schema is too fine-grained, breaking consumers

📚

Layered & Hexagonal Architecture

// internal structure patterns that work at any scale

Layered Architecture

N-Tier

Classic horizontal layers: Presentation → Application → Domain → Infrastructure. Dependencies flow downward only. Simple, familiar, and effective when discipline is maintained.

Presentation: HTTP controllers, GraphQL resolvers
Application: Use cases, orchestration, no business rules
Domain: Entities, value objects, domain services
Infrastructure: DB, external APIs, email, file storage

ANTI-PATTERN: Layer skipping

Controllers calling the repository directly, bypassing the domain. One layer should never know about layers more than one step away.

Hexagonal / Ports & Adapters

Clean Architecture

The domain is at the centre with zero dependencies on infrastructure. External systems (databases, APIs, message brokers) are adapters that plug into defined ports. The domain is fully testable in isolation — no database required.

Ports: Interfaces defined by the domain
Adapters: Implementations (SQL, REST, Kafka)
Domain: No framework, no ORM, no HTTP imports
Swap database, don't touch domain code
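A minimal ports-and-adapters sketch, assuming a hypothetical `UserRepository` port and a `change_email` use case. The in-memory adapter is the kind one would use in tests; none of these names come from a specific framework.

```python
from typing import Optional, Protocol


class UserRepository(Protocol):
    """Port: defined by the domain, with no mention of SQL or HTTP."""

    def get_email(self, user_id: str) -> Optional[str]: ...
    def save_email(self, user_id: str, email: str) -> None: ...


def change_email(repo: UserRepository, user_id: str, new_email: str) -> None:
    """Domain logic depends only on the port."""
    if "@" not in new_email:
        raise ValueError("invalid email address")
    if repo.get_email(user_id) is None:
        raise LookupError(f"unknown user: {user_id}")
    repo.save_email(user_id, new_email)


class InMemoryUserRepository:
    """Adapter used in tests; a SQL-backed adapter would expose the same methods."""

    def __init__(self, initial: dict[str, str]) -> None:
        self._emails = dict(initial)

    def get_email(self, user_id: str) -> Optional[str]:
        return self._emails.get(user_id)

    def save_email(self, user_id: str, email: str) -> None:
        self._emails[user_id] = email
```

Swapping the database then means writing another class with the same two methods, while `change_email` stays untouched.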

ANTI-PATTERN: Leaking infrastructure into domain

Domain entities that import ORM annotations, HTTP status codes, or logging frameworks. The domain must be framework-agnostic.

🚫

General Architectural Anti-Patterns

// the traps every team falls into eventually

ANTI-PATTERN: Premature Optimisation / Over-Engineering

Building for 10 million users when you have 100. Designing for failure modes you haven't encountered. Choosing Kafka because it's impressive, not because you need it. Engineering time spent on infrastructure that doesn't move the product forward.

Fix

Make decisions based on current pain and near-term projections. "We might need this later" is not sufficient justification for complexity today. You can add complexity when needed; you cannot easily remove it once embedded.

ANTI-PATTERN: Resume-Driven Development (RDD)

Choosing technologies to pad CVs rather than to solve actual problems. Using Kubernetes, Kafka, GraphQL, and gRPC because they look good — when a simple REST API on a single VM would serve the actual product perfectly.

Fix

Every technology decision should answer: "What specific problem does this solve for us, today?" If the answer is "it's what companies like us use," that's not enough.

ANTI-PATTERN: The Golden Hammer

Using the same architectural pattern for every problem because it worked before. Applying microservices to a 3-person internal tool because "we used it at my last job." Every tool has a problem it's optimised for.

Fix

Map each architectural decision to a specific requirement. When evaluating patterns, be explicit about the constraints and context of the current system, not past systems.

ANTI-PATTERN: Accidental Coupling / Tight Integration

Components that should be independent are secretly entangled through shared global state, shared database tables, implicit ordering assumptions, or hard-coded service URLs. Changes ripple unexpectedly.

Fix

Explicit contracts between every module and service. Use architecture fitness functions (automated tests that verify structural rules) to catch coupling before it becomes load-bearing.

ANTI-PATTERN: No Clear Data Ownership

Multiple services or modules writing to the same entity. No single authoritative source of truth for customer data, product data, or order state. Synchronisation logic becomes the most complex part of the system.

Fix

For every entity, there is exactly one writer. Everyone else queries or subscribes to events from that owner. Data ownership maps to team ownership. If two teams both need to write the same data, that's a boundary problem to resolve first.

ANTI-PATTERN: Architecture by Consensus / Design by Committee

Every architectural decision requires sign-off from every team. Meetings to decide how to name HTTP routes. No individual is accountable for outcomes. Results in paralysis, bland compromises, and inconsistent implementation.

Fix

Appoint a decision-maker for architectural choices. Use RFC processes for broad input, but with a clear owner who makes the final call and is accountable for the outcome. Disagree and commit.

⚠️

The Microservice Complexity Tax

// what you're signing up for when you go distributed

Microservices don't eliminate complexity — they redistribute it from code to infrastructure and organisational processes. Before adopting them, your team must be ready to pay the following taxes:

Problem	Monolith	Microservices	Required Solution
Transaction	DB transaction	Distributed	Saga pattern, 2PC, eventual consistency
Debugging	Stack trace	Across services	Distributed tracing (OpenTelemetry, Jaeger)
Testing	Unit + integration	Contract + E2E	Consumer-driven contract tests (Pact)
Deployment	One pipeline	N pipelines	Platform engineering, GitOps, ArgoCD
Service Discovery	Function call	Runtime lookup	Service mesh (Istio, Linkerd), DNS
Auth	Session / in-process	Per-service	JWT propagation, service-to-service auth
Data consistency	ACID	BASE	Explicit design for eventual consistency
Local dev	One process	Dozens of containers	Docker Compose, service stubs, Telepresence

⚠️

The Fallacies of Distributed Computing: The network is reliable. Latency is zero. Bandwidth is infinite. The network is secure. Topology doesn't change. There is one administrator. Transport cost is zero. The network is homogeneous. Every one of these assumptions is false. Your architecture must account for all of them.

📈

The Great Architecture Shift

// why the industry is revisiting microservices

2023–2025 Industry Movement

From Microservices Back to Simplicity

The microservices movement, which peaked around 2017–2020, promised team autonomy, independent scaling, and technical freedom. For many organisations, it delivered complexity, operational burden, and engineering hours diverted from product to platform. The correction isn't a rejection of microservices — it's a recognition that the costs are real and context-dependent.

What the Evidence Shows

Amazon Prime Video (2023)

One Prime Video team moved its audio/video quality monitoring service from a distributed, serverless microservices design to a single-process monolith. Result: a 90% reduction in that service's infrastructure cost, reduced operational complexity, and simpler debugging. Published by the team's own engineers, the write-up generated significant industry discussion.

Shopify

One of the world's largest Rails monoliths — serving billions in commerce. Instead of splitting into services, they invested in deep module boundaries, component architecture, and tooling to enforce boundaries. Team autonomy achieved without distributed systems overhead.

Stack Overflow

Runs on a monolith serving hundreds of millions of monthly visitors with a small engineering team. Has written extensively about why they don't see the need for microservices at their scale given their team size and architecture quality.

DHH / 37signals (Basecamp)

Loud proponents of the "majestic monolith": a small number of well-structured server processes instead of a fleet of microservices. 37signals has also documented leaving the public cloud for its own hardware, reporting significant cost savings and simpler engineering.

Why the Shift is Happening

The Promise	The Reality
"Independent deployment per service"	In practice: coordinating releases across 30 services, managing schema migrations across teams
"Teams can choose their own tech stack"	In practice: 8 languages, nobody can debug each other's services, hiring becomes impossible
"Scale individual services independently"	In practice: most services have similar load profiles; only 2–3 genuinely need different scaling
"Fault isolation — one service fails, others don't"	In practice: synchronous call chains mean upstream failure propagates anyway
"Small, focused teams with clear ownership"	In practice: 40 services owned by 8 engineers, everyone is on-call for everything

✅

The nuanced conclusion: Microservices solve real problems — at the right organisational scale. The error was treating them as an aspirational default rather than a context-specific solution. The current advice: start with a well-structured modular monolith. Extract services when you have a concrete, measurable reason that the monolith cannot address. Your future self will thank you.

🧭

Timeless Guiding Principles

// what survives every architectural fashion cycle

Make It Easy to Change

The best architecture is the one you can change safely and quickly. Favour loose coupling, explicit interfaces, and clear ownership over any particular structural pattern.

Defer Decisions

Don't make irreversible decisions before you have to. A good architecture delays infrastructure and framework decisions, keeping options open until the cost of deferral exceeds the cost of deciding.

Measure, Don't Guess

Every architectural change should have measurable success criteria. "More scalable" is not a metric. "P99 <200ms at 10k RPS" is. Measure before and after.
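To make a criterion like that checkable, the percentile has to be computed the same way before and after a change. The sketch below uses the nearest-rank definition, which is one of several in common use; the function names and the 200 ms default are assumptions taken from the example above.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_target(latencies_ms: list[float], p99_target_ms: float = 200.0) -> bool:
    """True when the measured P99 is under the target."""
    return percentile(latencies_ms, 99) < p99_target_ms
```

With 100 samples, a single slow request does not move P99 under this definition, but two do. That is worth knowing when deciding how many samples a measurement needs.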

Match Team Topology

Conway's Law is real. If your team isn't structured to own a service independently, the service will become a coordination nightmare. Architecture and team design must evolve together.

Design for Failure

Every external call will fail, every database will go down, every third-party API will have an outage. Circuit breakers, retries with backoff, fallbacks, and graceful degradation are not optional.
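Retries with backoff are the simplest of these to sketch. The version below retries only on `ConnectionError` and uses exponential backoff with full jitter; the parameter names and defaults are assumptions, and the `sleep` argument exists so tests can avoid real waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a failing call with exponential backoff and full jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Delay ceiling doubles each attempt; jitter spreads out retries.
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))
    raise AssertionError("unreachable")
```

Retrying is only safe when the operation being retried is idempotent, and the attempt limit matters: unbounded retries turn a brief outage into a self-inflicted overload.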

Avoid Distributed if Possible

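The best distributed systems code is the code you didn't write. Network calls are orders of magnitude slower than in-process calls: nanoseconds for a function call versus hundreds of microseconds or more for a network round trip, even inside one data centre. Every service boundary is a potential failure point, a serialisation cost, and a debugging challenge.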

✅

Architecture Design Checklist

// step-by-step process for designing a system, with pitfalls at each stage

Architecture is not a single meeting — it's an iterative process. This checklist provides a structured order of operations for designing or evaluating a system's architecture, along with the most common mistakes at each step.

1

Understand the Requirements — Functional & Non-Functional

Before drawing any boxes, deeply understand what the system must do and the constraints it must operate within. Many architectural disasters stem from optimising for the wrong requirements.

Document core user journeys — what workflows must work, always
Define non-functional requirements: availability (99.9% vs 99.999%), latency targets (P50/P95/P99), throughput (RPS), data volume
Identify compliance and regulatory constraints (GDPR, HIPAA, SOC2)
Clarify consistency requirements — strong consistency vs. eventual consistency, per workflow
Establish team size and operational maturity — this constrains your pattern options
Identify time-to-market constraints — complexity costs delivery time
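Availability targets are easier to discuss once they are converted into a downtime budget. The helper below is a simple arithmetic sketch that ignores leap years; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down while still meeting its target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)
```

At 99.9% the budget is about 525.6 minutes a year (under nine hours); at 99.999% it is about 5.3 minutes. The gap between those two numbers is usually the gap between one deployment region and several.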

⚠ Common Pitfall

Designing for imaginary scale. "We might get 10M users" is not a requirement. Ask: what does day-one traffic look like? What's the 12-month realistic projection? Architect for that, with a clear path to the next order of magnitude.

2

Define the Domain — Identify Bounded Contexts

Use Domain-Driven Design to identify the natural seams in your problem domain. These seams — bounded contexts — are where your architectural boundaries should live. Boundaries drawn in the wrong place cause more damage than any technology choice.

Run an Event Storming session to map domain events and commands
Identify aggregates — clusters of entities that change together, owned by a single service/module
Find where concepts mean different things in different contexts (a "Customer" in billing vs. shipping)
Map team ownership to domain concepts — each bounded context should have one owning team
Document the context map: relationships between bounded contexts (upstream/downstream, partnership, anti-corruption layer)

⚠ Common Pitfall

Drawing service boundaries around technical concerns (AuthService, DatabaseService) rather than domain concepts. Technical boundaries couple teams; domain boundaries enable autonomy. A "UserService" that owns all user data is usually a monolith with an API in front of it.

3

Choose Your Decomposition Strategy

Given your requirements, team size, and domain understanding from steps 1–2, choose the structural pattern that fits. This is the most consequential single decision in the architecture process.

Apply the decision matrix: team size, scale requirements, domain clarity, operational maturity
Default to the simplest pattern that meets current requirements — simpler is almost always better
If choosing microservices: can each service be independently deployed, scaled, and developed by one team?
If choosing a monolith: enforce module boundaries — don't let it become a ball of mud
Document the reasoning for the decision — future team members need to understand why
Explicitly note what would cause you to revisit this decision (scaling signals, team growth milestones)

⚠ Common Pitfall

Choosing microservices because a senior engineer read about them or "that's how Netflix does it." Netflix also has thousands of engineers and a decade of platform engineering investment. Your context is different.

4

Design Data Architecture & Ownership

Data design is inseparable from service design. Every entity has exactly one owner. The data model is often the real source of coupling — more than APIs or code.

Assign data ownership: every table/collection has exactly one authoritative writer
Decide on consistency model per workflow: ACID (same service) vs. eventual (cross-service via events)
Design for data sharing: query APIs, read models, event subscriptions — never shared tables across modules
Plan schema evolution strategy: how will you add fields, rename columns, and remove data without downtime
Choose your database types intentionally: relational, document, time-series, search — don't default to one for everything
Design the backup, recovery, and data retention policy — architecture decision, not an afterthought

⚠ Common Pitfall

The shared database anti-pattern in disguise: microservices with separate deployment pipelines but a shared schema. Any team can modify any table. Schema migrations require coordinating every service. This is a monolith with network calls added.

5

Design Communication Patterns

How components talk to each other determines latency, resilience, and coupling. Synchronous calls couple availability; async messaging decouples it but adds eventual consistency.

Map each interaction: is it request/response (synchronous) or fire-and-forget / publish-subscribe (async)?
Use synchronous calls for user-facing reads and transactional writes within a bounded context
Use async messaging for cross-context workflows, notifications, and eventual propagation
Define your API contract strategy: REST, gRPC, GraphQL — and how schemas are versioned
Design for failure: every synchronous call needs a timeout, retry policy, and fallback
If using messaging: define message schema ownership, schema registry, and evolution policy
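The "fallback" part of a synchronous call is often implemented with a circuit breaker. The sketch below is deliberately small: `CircuitBreaker`, its thresholds, and the injectable `clock` are assumptions, and production libraries handle concurrency and half-open probing more carefully.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures, then retry after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: skip the downstream call entirely
            self._opened_at = None  # cooldown elapsed: allow a trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result
```

While the breaker is open, the struggling downstream service receives no traffic from this caller, which gives it room to recover instead of being hit with retries.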

⚠ Common Pitfall

Synchronous call chains: Service A → B → C → D to handle one user request. Latency accumulates across every hop, and the more hops a request makes, the more likely it is to hit at least one hop's tail latency. One slow or failed service brings down the whole chain. Design for autonomy: each service should be able to serve its core function even if downstream services are unavailable.

6

Design for Observability from Day One

You cannot operate what you cannot observe. Observability is not a feature to add later — it must be part of the initial design. In distributed systems, it's the difference between a 5-minute incident and a 5-hour outage.

Define the three pillars: Metrics (Prometheus/Datadog), Logs (structured, correlation IDs), Traces (OpenTelemetry)
Propagate correlation/trace IDs across every service boundary — from HTTP header to database query
Define SLIs (what you measure), SLOs (the targets), and SLAs (the commitments)
Design alerting strategy: alert on symptoms (SLO burn rate), not causes (CPU > 80%)
Build runbooks alongside the system — not after incidents
Implement health checks, readiness probes, and circuit breakers as standard
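
Alerting on SLO burn rate can be sketched with a few lines of arithmetic. The function names and the example request counts are hypothetical; the 14.4 threshold is only an assumption for illustration (sustained for one hour, it would consume 14.4 × 1/720 = 2% of a 30-day error budget).

```python
def burn_rate(failed, total, slo_target):
    """How fast the error budget is being spent.

    1.0 means the budget lasts exactly the SLO window; 5.0 means it
    would be exhausted in one fifth of the window.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target      # e.g. 0.001 for a 99.9% SLO
    observed_error_rate = failed / total
    return observed_error_rate / error_budget

def should_page(failed, total, slo_target, threshold=14.4):
    # Page on the symptom (budget burning too fast), not on a cause
    # such as CPU usage.
    return burn_rate(failed, total, slo_target) >= threshold

# 50 failures in 10,000 requests against a 99.9% SLO.
rate = burn_rate(50, 10_000, 0.999)
page_now = should_page(50, 10_000, 0.999)
page_on_spike = should_page(200, 10_000, 0.999)
```

Here the first case burns budget about five times too fast, which is worth a ticket but not a page under the assumed threshold, while the spike of 200 failures would page.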

⚠ Common Pitfall

"We'll add monitoring later." In a distributed system, "later" means "after the first major outage, while under pressure, with customers watching." The cost of instrumenting a service at build time is tiny compared to debugging a production issue with no telemetry.

7

Design Security Architecture

Security is not a layer you add on top — it's a dimension of every architectural decision. Authentication, authorisation, data encryption, network segmentation, and secrets management all have architectural implications.

Define authentication: who are the actors (users, services, third-parties) and how do they prove identity
Define authorisation model: RBAC, ABAC, or per-resource ACLs — and where enforcement happens
Design service-to-service auth: mutual TLS, JWT, or API keys — never unauthenticated internal traffic
Plan secrets management: environment variables at minimum; HashiCorp Vault / AWS Secrets Manager for production
Network segmentation: what can talk to what — apply least privilege at the network level
Data classification: identify PII, payment data, health data — apply appropriate encryption and access controls
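
As one illustration of the authorisation item, here is a minimal deny-by-default RBAC check. The roles, permission strings, and the `Actor` type are invented for the example; a real system would load them from its own policy store and enforce them at a single, well-defined point.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission mapping.
ROLE_PERMISSIONS = {
    "viewer": {"invoice:read"},
    "billing_admin": {"invoice:read", "invoice:refund"},
}

@dataclass(frozen=True)
class Actor:
    id: str
    roles: frozenset

def is_allowed(actor, permission):
    # Deny by default: unknown roles and unknown permissions grant nothing.
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in actor.roles
    )

alice = Actor(id="alice", roles=frozenset({"viewer"}))
bob = Actor(id="bob", roles=frozenset({"billing_admin"}))
svc = Actor(id="report-service", roles=frozenset({"unknown_role"}))
```

Services are modelled as actors too, so service-to-service calls go through the same check as user requests.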

⚠ Common Pitfall

Implicit trust inside the network perimeter. "Internal services don't need auth because nothing bad can get in." One compromised service, one misconfigured S3 bucket, or one insider threat breaks this assumption. Adopt zero trust as the baseline: authenticate and authorise every request, including internal traffic.

8

Plan the Deployment & Operations Model

Architecture and deployment are inseparable. How you deploy dictates what failure modes you face, how quickly you can release, and how much operational overhead your team carries.

Define deployment targets: containers (Kubernetes, ECS), VMs, PaaS, serverless — match to team operational capability
Design CI/CD pipeline: how code moves from commit to production, including automated tests and approvals
Define release strategy: blue/green, canary, feature flags — how do you roll back quickly?
Plan infrastructure as code from day one: Terraform/Pulumi, not ClickOps
Define environment strategy: how many environments, what runs in each, how are they provisioned
Plan for disaster recovery: RTO (how long to recover) and RPO (how much data can be lost)
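
A percentage-based feature flag is one way to get a fast rollback path for the release strategy item. The sketch below assumes a hypothetical flag name and user IDs, and uses a stable hash so that each user stays in the same bucket between requests.

```python
import hashlib

def in_rollout(flag, user_id, percent):
    """Return True if this user falls inside the rollout percentage.

    Hashing flag + user gives every user a stable bucket in 0..99,
    so raising the percentage only ever adds users, and setting it
    to 0 acts as an immediate rollback.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

# Hypothetical flag and user population.
FLAG = "new-checkout"
users = [f"user-{i}" for i in range(10_000)]

canary = [u for u in users if in_rollout(FLAG, u, 10)]
half = [u for u in users if in_rollout(FLAG, u, 50)]
```

Including the flag name in the hash keeps rollouts independent, so the same users are not always the first to receive every new feature.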

⚠ Common Pitfall

Kubernetes by default. K8s is a powerful operations platform with real complexity. If your team doesn't have platform engineering capability, a managed platform (Railway, Render, Fly.io, ECS Fargate) typically delivers most of the benefit for a fraction of the operational overhead.

9

Document Architecture Decisions (ADRs)

Architecture without documentation is tribal knowledge. Architecture Decision Records (ADRs) capture not just what was decided but why, including the alternatives considered and the forces at play. They are invaluable for onboarding new team members and for revisiting decisions as context changes.

Write an ADR for every significant architectural decision (template: Context → Decision → Consequences)
Store ADRs alongside code in version control — they evolve with the system
Mark replaced ADRs as superseded and link to the successor instead of deleting them; the history matters
Document the system's C4 model: Context, Containers, Components, Code diagrams
Define and publish API contracts (OpenAPI, AsyncAPI, Protobuf schemas)
Review and update documentation as the system evolves — stale docs are worse than no docs

⚠ Common Pitfall

Architecture diagrams that don't match reality. Systems drift from their documented state immediately. Build living documentation: generate diagrams from code where possible (Structurizr, C4 DSL), run architecture fitness functions in CI to catch drift automatically.
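
An architecture fitness function can be as small as a test that fails when a module imports something it should not. The sketch below is one possible shape, assuming hypothetical `billing` and `shipping` modules and a rule table maintained by the team; it ignores relative imports for brevity.

```python
import ast

# Hypothetical rule: code in "billing" must not import "shipping".
FORBIDDEN = {"billing": {"shipping"}}

def imported_modules(source):
    """Collect absolute module names imported by a piece of source code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found

def violations(module_name, source, forbidden=FORBIDDEN):
    """Return the imports in source that break the module's rules."""
    banned = forbidden.get(module_name, set())
    return sorted(
        imp for imp in imported_modules(source)
        if any(imp == b or imp.startswith(b + ".") for b in banned)
    )

bad_source = "import shipping.internal.db\nfrom orders.api import get_order\n"
good_source = "from orders.api import get_order\n"

bad = violations("billing", bad_source)
good = violations("billing", good_source)
```

Run in CI over every file in a module, a check like this turns a documented boundary into one that cannot drift silently.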

10

Define Evolution & Migration Strategy

No architecture is final. Design for change: define the signals that would trigger an architectural evolution and the path for getting there. The teams that handle growth well aren't the ones who made perfect initial decisions — they're the ones who built in reversibility.

Define explicit scaling triggers: "When we reach X RPS, we'll extract Y module as a service"
Plan for the Strangler Fig pattern: incrementally replace parts of the system without a big rewrite
Identify the riskiest architectural assumptions and build in monitoring to detect when they break
Establish regular architecture reviews — quarterly at minimum — with the engineering team
Budget for technical debt paydown: not all debt is bad, but unmanaged debt compounds
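
The Strangler Fig pattern usually starts with a routing facade in front of the old system. The sketch below is a simplified illustration: the handlers, the `/billing` prefix, and the `MIGRATED_PREFIXES` table are hypothetical stand-ins for whatever proxy or gateway configuration a team actually uses.

```python
def legacy_handler(path):
    # Stand-in for the existing system.
    return f"legacy:{path}"

def new_billing_handler(path):
    # Stand-in for the newly extracted billing service.
    return f"new-billing:{path}"

# Routes that have been migrated so far. Everything else still goes
# to the legacy system, so migration proceeds one prefix at a time.
MIGRATED_PREFIXES = {
    "/billing": new_billing_handler,
}

def route(path):
    for prefix, handler in MIGRATED_PREFIXES.items():
        # Match the prefix itself or anything beneath it, but not
        # unrelated paths that merely share the leading characters.
        if path == prefix or path.startswith(prefix + "/"):
            return handler(path)
    return legacy_handler(path)

migrated = route("/billing/invoices/7")
untouched = route("/orders/1")
lookalike = route("/billingx")
```

Removing an entry from the table sends traffic back to the legacy system, which keeps each migration step reversible.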

⚠ Common Pitfall

The Big Rewrite. "Our codebase is a mess, so let's start over with the right architecture." Rewrites routinely overrun their estimates, the new system acquires its own technical debt, and the business keeps running on the old system, and demanding changes to it, the whole time. An incremental Strangler Fig migration is almost always the safer path.


📌

Quick diagnostic: If you're evaluating your current architecture, run this thought experiment — "If I needed to change the billing logic, how many teams would I need to coordinate? How many services would I deploy? How long would it take?" The answer tells you whether your current architecture is enabling or impeding your team. There is no universal right answer — only what's right for your context.

Patterns, Anti-Patterns & the Art of Choice
