Software Architecture Handbook

Patterns, Anti-Patterns & the Art of Choice

A practitioner's guide to the major architectural styles — when to reach for each, the traps each one hides, and the evolving industry consensus on what actually works at scale.

Monolith Modular Monolith Microservices Event-Driven Serverless BFF CQRS Hexagonal
🗺

Overview

// why architecture decisions are the most expensive ones you'll make

Architecture decisions are the ones that are hardest to reverse. Choosing how to decompose your system — how it communicates, how data flows, how teams are organised around it — shapes every engineering decision that follows. Get it wrong and you spend years fighting your own infrastructure instead of building product.

There is no universally correct architecture. Every pattern is a set of tradeoffs. The goal of this handbook is to make those tradeoffs explicit so your team can make deliberate, informed choices instead of defaulting to the industry fashion of the moment.

2024 Industry Reality: After a decade of microservices evangelism, major engineering organisations — including Amazon, Shopify, Stack Overflow, and Basecamp — are publicly discussing consolidation back to simpler, modular architectures. The lesson isn't that microservices are bad; it's that they require organisational maturity most teams don't have.
Complexity Budget

Every architectural decision spends from your team's complexity budget. Distributed systems, eventual consistency, and service meshes are expensive. Spend wisely.

Conway's Law

Your architecture will mirror your team structure — and vice versa. You can't deploy a microservices architecture with a 5-person team and expect it to work.

Reversibility

Prefer architectures you can migrate out of. A monolith that's well-structured can be extracted into services later. A distributed mess cannot easily be collapsed.

⚖️

How to Choose an Architecture

// the questions that actually drive the right decision
Question | Lean Simpler | Lean Distributed
Team size | 1–25 engineers | 25–100+ engineers
Domain clarity | New / exploratory product | Well-understood, stable domain
Deployment frequency | Weekly or less | Multiple times per day per team
Scale requirements | <10k RPS, single region | Global, massive, independent scale
Regulatory isolation | Shared data model is fine | Hard data boundaries required
Technology diversity | Single language/stack is fine | Legitimately need different runtimes
Operational maturity | Small or growing ops team | Mature platform / SRE capability
Time to market | Speed is #1 priority now | Long-term autonomy matters more
⚠️
The startup trap: Most early-stage teams over-engineer. If you can't clearly articulate which part of your system needs to scale independently and why, you don't need microservices yet. You need a working product.
🧱

The Monolith

// the original, still powerful
Monolithic Architecture
Traditional Battle-Tested

All application logic (UI, business logic, data access) lives in a single deployable unit: one codebase, one process, one deployment, usually backed by a single database. Despite its reputation as "legacy," a well-written monolith outperforms a poorly-written distributed system in almost every practical dimension.

✓ Use When
  • Small team (under 15 engineers)
  • Exploring product-market fit
  • Domain is unclear or evolving
  • Limited operational resources
  • Need fast iteration cycles
  • Vertical scaling is sufficient
✗ Avoid When
  • Teams stepping on each other daily
  • Parts need wildly different scale
  • Releases take hours to deploy
  • Multiple teams own the same codebase
  • Technology diversity is required
⚠ Watch Out For
  • The Big Ball of Mud anti-pattern
  • Shared database becoming a bottleneck
  • Test suite getting slower over time
  • Release cycles getting longer
  • Cross-cutting concerns becoming tangled

Monolith Anti-Patterns

ANTI-PATTERN: Big Ball of Mud

No discernible structure. Everything imports everything. Business logic in controllers, database calls in view templates, validation spread across the codebase. Common result of adding features rapidly without maintaining layer discipline.

Fix
Enforce clear layers (presentation → application → domain → infrastructure). Introduce module boundaries and forbidden dependency rules. Treat internal architecture with the same care as API design.
ANTI-PATTERN: Shared Database as Integration Layer

Multiple applications or services talk to the same database directly, including updating each other's tables. Schema changes become impossible without coordinating every consumer.

Fix
Each logical module owns its tables. Expose data to others via API or events — never by granting direct table access. Even in a monolith, enforce this boundary to prepare for future extraction.
ANTI-PATTERN: Anemic Domain Model

Domain objects are just data holders with no behaviour. All business logic lives in service classes that orchestrate dumb models. Results in scattered, duplicated business rules that are hard to test and reason about.

Fix
Apply DDD principles: move invariants and business rules into rich domain objects. An Order should know how to calculate its total and validate its own state — not delegate that to a 500-line OrderService.
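As a minimal sketch of what a rich domain object can look like, the example below puts the total calculation and state validation on the order itself. The `Order` and `OrderLine` names and their rules are illustrative assumptions, not a prescribed model.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class OrderLine:
    sku: str
    unit_price: Decimal
    quantity: int


@dataclass
class Order:
    lines: list[OrderLine] = field(default_factory=list)
    submitted: bool = False

    def add_line(self, sku: str, unit_price: Decimal, quantity: int) -> None:
        # Invariants live next to the data they protect.
        if self.submitted:
            raise ValueError("cannot modify a submitted order")
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.lines.append(OrderLine(sku, unit_price, quantity))

    def total(self) -> Decimal:
        # The order computes its own total; no external service needed.
        return sum(
            (line.unit_price * line.quantity for line in self.lines),
            Decimal("0"),
        )

    def submit(self) -> None:
        if not self.lines:
            raise ValueError("cannot submit an empty order")
        self.submitted = True
```

An application service then only orchestrates: load the order, call `add_line` or `submit`, and persist the result.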
🏗️

Modular Monolith

// the architecture the industry is rediscovering
Industry Trend 2023–2025
The Return of the Modular Monolith
Amazon, the company whose "two-pizza team" rule helped popularise microservices culture, published an engineering blog post in 2023 describing how the Prime Video team moved one workload, its audio/video quality monitoring service, from a distributed serverless design to a single-process monolith, reporting a 90% reduction in infrastructure cost for that service. Shopify, Stack Overflow, and Basecamp have all made similar moves or written extensively about preferring well-structured monoliths. This isn't nostalgia; it's hard-won pragmatism.
Modular Monolith
Recommended Default 2024 Best Practice

A single deployable unit — like a monolith — but with strong, enforced internal module boundaries. Each module owns its domain, its data, and its public API surface. Modules communicate through well-defined interfaces, not by reaching into each other's internals. Deployed as one unit; structured like services. You get the simplicity of a monolith with the logical isolation of services — and the ability to extract true services when you genuinely need to.

✓ Use When
  • Team of 5–50 engineers
  • Domain reasonably well understood
  • Want service-like isolation without ops overhead
  • May need to extract services later
  • Single-region deployment is fine
  • Growth stage: 1M–50M users
Key Principles
  • Module boundaries enforced at build time
  • Each module has its own DB schema/tables
  • Cross-module calls via public interfaces only
  • No reaching into another module's internals
  • Modules can be tested in isolation
  • Use tools such as ArchUnit (Java) or dependency-cruiser (JavaScript/TypeScript) to enforce boundaries
Extraction Path
  • Start with well-defined module boundaries
  • Add messaging abstraction inside the monolith
  • When a module needs to scale independently — extract
  • The interface stays the same, delivery changes
  • You'll know exactly where boundaries belong
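The idea that "the interface stays the same, delivery changes" can be sketched as follows. The inventory module, its `reserve` method, and the `transport` callable are hypothetical names chosen for illustration; a real extraction would use an actual HTTP or gRPC client behind the same interface.

```python
from typing import Protocol


class InventoryApi(Protocol):
    """Public surface of the hypothetical inventory module."""

    def reserve(self, sku: str, quantity: int) -> bool: ...


class InProcessInventory:
    """Today: a plain in-process implementation inside the monolith."""

    def __init__(self, stock: dict[str, int]) -> None:
        self._stock = dict(stock)

    def reserve(self, sku: str, quantity: int) -> bool:
        if self._stock.get(sku, 0) < quantity:
            return False
        self._stock[sku] -= quantity
        return True


class RemoteInventory:
    """Later: same interface, delegating to an extracted service.

    `transport` is a stand-in for a network client (an assumption here).
    """

    def __init__(self, transport) -> None:
        self._transport = transport

    def reserve(self, sku: str, quantity: int) -> bool:
        reply = self._transport({"op": "reserve", "sku": sku, "quantity": quantity})
        return bool(reply["ok"])


def place_order(inventory: InventoryApi, sku: str, quantity: int) -> str:
    # The ordering module depends only on the interface, not the delivery.
    return "confirmed" if inventory.reserve(sku, quantity) else "rejected"
```

Because `place_order` only sees `InventoryApi`, swapping `InProcessInventory` for `RemoteInventory` is a wiring change rather than a rewrite of the caller.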
💡
Shopify's approach: Shopify runs one of the world's largest Rails applications. Rather than splitting into microservices, they invested deeply in module boundaries within their monolith, using component boundaries that are enforced by tooling. The result: large-scale team autonomy without distributed systems overhead.
🔬

Microservices

// powerful at scale, expensive everywhere else
Microservices Architecture
For Large Organisations High Ops Cost

The system is decomposed into small, independently deployable services — each owned by a small team, each with its own database, each communicating over a network (HTTP/gRPC/messaging). Enables independent scaling and deployment per service. Requires mature platform engineering, service discovery, distributed tracing, and organisational alignment to actually succeed.

✓ Genuine Use Cases
  • 50+ engineers with clear team ownership
  • Parts truly need independent scale (e.g. payments vs. search)
  • Regulatory isolation requirements
  • Different technology stacks legitimately needed
  • Mature DevOps/Platform Engineering org
  • Netflix, Amazon, Uber scale
Real Benefits
  • True team independence & autonomy
  • Independent deployment — no coordination
  • Independent scaling per service
  • Fault isolation (one service fails, others don't)
  • Technology freedom per service
Real Costs
  • Distributed transactions are hard
  • Network latency is now everywhere
  • Observability requires investment
  • Service discovery and mesh overhead
  • Testing integration paths is complex
  • Operational overhead multiplied by N

Microservice Anti-Patterns

ANTI-PATTERN: Distributed Monolith

Services that are physically separate but logically coupled — they call each other synchronously in a chain, share a database, or must be deployed together. You get all the costs of distribution with none of the benefits. The worst of both worlds.

Fix
True service independence requires independent data stores and async communication. If two services must always deploy together, merge them. If they chain-call 6 deep to handle one request, your boundaries are wrong.
ANTI-PATTERN: Nano-services / Over-decomposition

Services so small they have no business logic — just CRUD wrappers around a table. 50 services for a 10-person team. Operational burden is massive; every feature requires coordinating 5 service deployments.

Fix
A service should map to a bounded context — a meaningful domain concept, not a table. If a service is just a thin wrapper around a database with no business rules, it belongs inside another service.
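ANTI-PATTERN: Synchronous Chain Calls

Service A calls B, which calls C, which calls D. Latency adds up across every hop and failures cascade: one slow service makes everything slow. The longer the chain, the more likely a request hits at least one service's tail latency, and the chain's availability is at best the product of every service's availability.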

Fix
Prefer async messaging for non-blocking work. Use the Saga pattern for distributed transactions. Design services to be autonomous — if Service A needs data from Service B to function, consider whether A is actually part of B's bounded context.
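To put rough numbers on how chains compound, here is a small worked calculation. It assumes each service fails or slows down independently of the others, which is a simplification; the 99.9% and P99 figures are illustrative inputs, not measurements.

```python
def chain_availability(per_service: float, hops: int) -> float:
    """Availability of a synchronous chain where every hop must succeed."""
    return per_service ** hops


def chance_of_tail_hit(hops: int, percentile: float = 0.99) -> float:
    """Probability that at least one hop is slower than its own percentile."""
    return 1 - percentile ** hops


if __name__ == "__main__":
    for hops in (1, 4, 6):
        print(
            hops,
            round(chain_availability(0.999, hops), 5),
            round(chance_of_tail_hit(hops), 4),
        )
```

With four services at 99.9% each, the chain is available roughly 99.6% of the time, and about 4% of requests see at least one hop running slower than its P99.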
ANTI-PATTERN: Shared Library Hell

A giant shared library ("commons" / "core") that all services depend on. Updating the library requires coordinating and re-deploying all services. The services are no longer independently deployable.

Fix
Share as little as possible. When sharing is necessary (e.g., authentication middleware, telemetry), version it explicitly and allow services to upgrade at their own pace. Treat shared libraries like external packages.

Event-Driven Architecture

// decouple producers from consumers through async messaging
Event-Driven Architecture (EDA)
Async Decoupling High Scalability

Components communicate by producing and consuming events through a message broker (Kafka, RabbitMQ, AWS EventBridge, NATS). Producers don't know who consumes their events. Consumers subscribe to events they care about. Enables temporal decoupling, high throughput, and extensibility — add a new consumer without touching the producer. Works beautifully within a modular monolith (in-process events) or across microservices (external broker).

✓ Use When
  • High-throughput pipelines (analytics, logs)
  • Fan-out: one event, many consumers
  • Workflows with long-running steps
  • Audit logs / event sourcing
  • Real-time data processing
  • Decoupling legacy systems
Core Concepts
  • Domain Events — something happened
  • Commands — requests to do something
  • Sagas — distributed long-running transactions
  • Outbox Pattern — reliably publish events
  • Consumer Groups — competing consumers
  • Dead Letter Queues — failed message handling
⚠ Watch Out For
  • Event schema evolution is hard
  • Debugging async flows is complex
  • Eventual consistency surprises users
  • Message ordering assumptions
  • At-least-once delivery = idempotency required
  • "Event spaghetti" — undocumented flows
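For the in-process case (events inside a modular monolith), a minimal sketch of a publish/subscribe bus might look like this. `EventBus` and `OrderPlaced` are hypothetical names; an external broker would add delivery guarantees, ordering, and persistence that this sketch does not attempt.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OrderPlaced:
    """Domain event: something that already happened."""

    order_id: str


class EventBus:
    def __init__(self) -> None:
        self._handlers: dict[type, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # The producer never names its consumers; it only publishes.
        for handler in self._handlers[type(event)]:
            handler(event)
```

Adding a new consumer is one more `subscribe` call; the code that publishes `OrderPlaced` does not change.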

Event-Driven Anti-Patterns

ANTI-PATTERN: Event Sourcing Everything

Applying Event Sourcing to every entity in the system because it sounds elegant. Event Sourcing has real operational complexity: rebuilding state from events, projections, schema migration of historical events. Most CRUD entities don't benefit.

Fix
Use Event Sourcing only for aggregates where the history of changes is a core domain requirement (financial transactions, audit logs, order lifecycle). Use standard CRUD with domain events for everything else.
ANTI-PATTERN: Fat Events / God Events

Events that contain the entire entity payload, every field every time. Consumers become dependent on the event schema in the same way as a shared database: removing or renaming any field can break them.

Fix
Events should carry minimal context: what happened and the ID of what changed. Consumers that need more data can query the producing service's API or maintain their own read model. When an event does carry state (event-carried state transfer), include only the fields consumers genuinely need and version the schema.
ANTI-PATTERN: Missing Idempotency

Most message brokers provide at-least-once delivery, so the same message can arrive more than once. If your consumer isn't idempotent (that is, processing the same message twice must leave the system in the same state as processing it once), you will double-charge customers, double-send emails, or corrupt state.

Fix
Every event consumer must be idempotent. Track processed message IDs in a database (idempotency key). Use conditional writes or upsert semantics. Treat duplicate events as a first-class concern, not an edge case.
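A minimal sketch of the "track processed message IDs" approach, using an in-memory SQLite database so it runs anywhere. The table names and message shape are assumptions made for the example; the key point is that the deduplication record and the side effect commit in the same transaction.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE charges (order_id TEXT, amount_cents INTEGER)")
    return db


def handle_payment_requested(db: sqlite3.Connection, message: dict) -> bool:
    """Apply the charge once; return False if the message was a duplicate."""
    try:
        with db:  # one transaction: dedup record and side effect commit together
            db.execute("INSERT INTO processed VALUES (?)", (message["id"],))
            db.execute(
                "INSERT INTO charges VALUES (?, ?)",
                (message["order_id"], message["amount_cents"]),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key clash: this message ID was already processed.
        return False
```

Redelivering the same message is then harmless: the second attempt hits the primary key, the transaction rolls back, and no second charge is written.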
☁️

Serverless Architecture

// pay per execution, infinite scale, real constraints
Serverless / FaaS Architecture
Event-Triggered Low Ops

Functions-as-a-Service (Lambda, Cloud Functions, Azure Functions) triggered by events. No servers to manage; auto-scales from zero to millions; pay only for actual execution time. Ideal for irregular workloads, webhooks, and event pipelines. Cold starts, execution time limits, and vendor lock-in are real considerations.

✓ Best For
  • Webhooks & event processors
  • Scheduled/cron jobs
  • Image/video processing pipelines
  • APIs with low or spiky traffic
  • Glue code between services
  • Background job processing
✗ Poor Fit
  • Long-running computations (>15min)
  • Latency-sensitive APIs (cold starts)
  • Stateful workloads
  • High sustained throughput (cost)
  • Complex local development
  • When vendor lock-in is unacceptable
⚠ Anti-Patterns
  • Monolithic Lambda — 1 function handles everything
  • Lambda calling Lambda synchronously
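  • Unmanaged database connections: many concurrent function instances each holding a connection can exhaust the database's limit (use a pooler or proxy)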
  • Not handling cold starts for user-facing paths
  • Missing Dead Letter Queues for failures
🔀

Backend for Frontend (BFF)

// tailored APIs for each client type

The BFF pattern, popularised by Sam Newman, solves a specific problem: a general-purpose API designed for all clients optimises for none of them. A mobile app needs different data shapes, payload sizes, and endpoints than a web app or a third-party partner API. The BFF introduces a dedicated API layer per client type.

Architecture Overview
Clients: Web App, iOS App, Android, Partner API
BFF layer: Web BFF, Mobile BFF, Partner BFF
↕ (internal APIs / services)
Backend services: User Service, Order Service, Product Service, Payment Service, Notifications
Why BFF Works
  • Mobile BFF returns compact payloads; Web BFF returns full data
  • Each BFF is owned by the client team, so there is no upstream negotiation
  • Aggregates/transforms data from multiple backend services
  • Handles client-specific auth flows and session management
  • Can evolve independently of backend services
BFF Anti-Patterns
  • One BFF for all clients — defeats the purpose entirely
  • Business logic in BFF — BFFs should aggregate, not implement rules
  • BFF calling BFF — creates coupling between client layers
  • Too many BFFs — one per team is a smell; one per client type is right
  • BFF becomes a God Service — pulls too much logic from downstream
💡
GraphQL as BFF: GraphQL works exceptionally well in the BFF layer — clients specify exactly what data they need, eliminating over- and under-fetching. The BFF resolves fields from downstream services. This is how companies like GitHub, Shopify, and Netflix use GraphQL — as an aggregation layer, not as their internal service communication protocol.
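Leaving GraphQL aside, the core aggregation idea can be sketched with plain functions. The two `fetch_*` functions are stand-ins for calls to downstream services, and the payload shapes are invented for illustration.

```python
# Stand-ins for downstream services; in practice these would be network calls.
def fetch_user(user_id: str) -> dict:
    return {"id": user_id, "name": "Ada", "email": "ada@example.com", "tier": "gold"}


def fetch_orders(user_id: str) -> list[dict]:
    return [
        {"id": "o-1", "total_cents": 1200, "status": "shipped", "items": ["a", "b"]},
        {"id": "o-2", "total_cents": 800, "status": "pending", "items": ["c"]},
    ]


def web_dashboard(user_id: str) -> dict:
    # Web BFF: full data for a rich page.
    return {"user": fetch_user(user_id), "orders": fetch_orders(user_id)}


def mobile_dashboard(user_id: str) -> dict:
    # Mobile BFF: same sources, compact shape for small screens and slow links.
    user = fetch_user(user_id)
    orders = fetch_orders(user_id)
    return {
        "name": user["name"],
        "open_orders": sum(1 for o in orders if o["status"] != "shipped"),
        "latest_order_id": orders[-1]["id"] if orders else None,
    }
```

Both endpoints aggregate and reshape; neither implements business rules, which stay in the downstream services.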
📋

CQRS & Event Sourcing

// separate reads from writes for scale and auditability
CQRS — Command Query Responsibility Segregation
Write / Read Split

Separate the model used for writes (Commands) from the model used for reads (Queries). Commands mutate state; Queries return data. Read models are optimised for their specific query pattern — denormalised, pre-computed, fast. Write models enforce business invariants. Often combined with Event Sourcing, where the event log is the source of truth and read models are projections.

✓ Use When
  • Read/write ratio is very asymmetric
  • Queries need data in complex shapes
  • Need full audit history (Event Sourcing)
  • Complex domain with many invariants
  • Scaling reads independently from writes
✗ Avoid When
  • Simple CRUD domains
  • Small teams — overhead is high
  • Users expect immediate consistency
  • Domain logic is not complex enough to warrant it
⚠ Anti-Patterns
  • Serving queries straight from the write model, which defeats the split
  • Command handlers that return query data
  • Applying CQRS to every entity, not just complex ones
  • Event schema is too fine-grained, breaking consumers
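A minimal sketch of the write/read split, with the read model built as a projection of events. The account example and class names are assumptions for illustration; in a real system the events would typically travel through a store or broker rather than being passed by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepositMade:
    account_id: str
    amount_cents: int


class AccountWriteModel:
    """Command side: enforces invariants and records events."""

    def __init__(self) -> None:
        self.events: list[DepositMade] = []

    def deposit(self, account_id: str, amount_cents: int) -> DepositMade:
        if amount_cents <= 0:
            raise ValueError("deposit must be positive")
        event = DepositMade(account_id, amount_cents)
        self.events.append(event)
        return event


class BalanceReadModel:
    """Query side: a denormalised projection built from events."""

    def __init__(self) -> None:
        self._balances: dict[str, int] = {}

    def apply(self, event: DepositMade) -> None:
        current = self._balances.get(event.account_id, 0)
        self._balances[event.account_id] = current + event.amount_cents

    def balance(self, account_id: str) -> int:
        return self._balances.get(account_id, 0)
```

Queries never touch `AccountWriteModel`; they read the pre-computed balance, which can be rebuilt at any time by replaying `events` through `apply`.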
📚

Layered & Hexagonal Architecture

// internal structure patterns that work at any scale
Layered Architecture
N-Tier

Classic horizontal layers: Presentation → Application → Domain → Infrastructure. Dependencies flow downward only. Simple, familiar, and effective when discipline is maintained.

  • Presentation: HTTP controllers, GraphQL resolvers
  • Application: Use cases, orchestration, no business rules
  • Domain: Entities, value objects, domain services
  • Infrastructure: DB, external APIs, email, file storage
ANTI-PATTERN: Layer skipping

Controllers calling the repository directly, bypassing the domain. One layer should never know about layers more than one step away.

Hexagonal / Ports & Adapters
Clean Architecture

The domain is at the centre with zero dependencies on infrastructure. External systems (databases, APIs, message brokers) are adapters that plug into defined ports. The domain is fully testable in isolation — no database required.

  • Ports: Interfaces defined by the domain
  • Adapters: Implementations (SQL, REST, Kafka)
  • Domain: No framework, no ORM, no HTTP imports
  • Swap database, don't touch domain code
ANTI-PATTERN: Leaking infrastructure into domain

Domain entities that import ORM annotations, HTTP status codes, or logging frameworks. The domain must be framework-agnostic.
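Ports and adapters can be sketched in a few lines. The `UserDirectory` and `Mailer` ports, the `send_welcome` use case, and both adapters are hypothetical names for this example; the point is that the domain function imports nothing from a database, HTTP, or framework package.

```python
from typing import Optional, Protocol


class UserDirectory(Protocol):
    """Port: what the domain needs from user storage."""

    def email_for(self, user_id: str) -> Optional[str]: ...


class Mailer(Protocol):
    """Port: what the domain needs from an email system."""

    def send(self, to: str, subject: str) -> None: ...


def send_welcome(user_id: str, users: UserDirectory, mailer: Mailer) -> bool:
    # Pure domain logic: no SQL, HTTP, or framework imports.
    email = users.email_for(user_id)
    if email is None:
        return False
    mailer.send(email, "Welcome aboard")
    return True


class InMemoryUsers:
    """Adapter for tests; a SQL-backed adapter would expose the same method."""

    def __init__(self, emails: dict[str, str]) -> None:
        self._emails = emails

    def email_for(self, user_id: str) -> Optional[str]:
        return self._emails.get(user_id)


class RecordingMailer:
    """Adapter that records messages instead of sending them."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, to: str, subject: str) -> None:
        self.sent.append((to, subject))
```

Swapping the in-memory adapters for real ones changes the wiring at the application's edge and leaves `send_welcome` untouched.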

🚫

General Architectural Anti-Patterns

// the traps every team falls into eventually
ANTI-PATTERN: Premature Optimisation / Over-Engineering

Building for 10 million users when you have 100. Designing for failure modes you haven't encountered. Choosing Kafka because it's impressive, not because you need it. Engineering time spent on infrastructure that doesn't move the product forward.

Fix
Make decisions based on current pain and near-term projections. "We might need this later" is not sufficient justification for complexity today. You can add complexity when needed; you cannot easily remove it once embedded.
ANTI-PATTERN: Resume-Driven Development (RDD)

Choosing technologies to pad CVs rather than to solve actual problems. Using Kubernetes, Kafka, GraphQL, and gRPC because they look good — when a simple REST API on a single VM would serve the actual product perfectly.

Fix
Every technology decision should answer: "What specific problem does this solve for us, today?" If the answer is "it's what companies like us use," that's not enough.
ANTI-PATTERN: The Golden Hammer

Using the same architectural pattern for every problem because it worked before. Applying microservices to a 3-person internal tool because "we used it at my last job." Every tool has a problem it's optimised for.

Fix
Map each architectural decision to a specific requirement. When evaluating patterns, be explicit about the constraints and context of the current system, not past systems.
ANTI-PATTERN: Accidental Coupling / Tight Integration

Components that should be independent are secretly entangled through shared global state, shared database tables, implicit ordering assumptions, or hard-coded service URLs. Changes ripple unexpectedly.

Fix
Explicit contracts between every module and service. Use architecture fitness functions (automated tests that verify structural rules) to catch coupling before it becomes load-bearing.
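An architecture fitness function can be as small as a test that scans source files for forbidden imports. The sketch below uses Python's standard `ast` module; the `billing.internal` package name is a made-up example of a module whose internals should stay private, and dedicated tools cover far more cases than this.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the dotted module names imported by a piece of Python source."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


def violations(source: str, forbidden_prefixes: tuple[str, ...]) -> list[str]:
    """List imports that reach into a forbidden package or its submodules."""
    return sorted(
        name
        for name in imported_modules(source)
        if any(name == p or name.startswith(p + ".") for p in forbidden_prefixes)
    )
```

Run in CI over every file in a module, a non-empty result fails the build, so coupling is caught at review time rather than discovered during an extraction.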
ANTI-PATTERN: No Clear Data Ownership

Multiple services or modules writing to the same entity. No single authoritative source of truth for customer data, product data, or order state. Synchronisation logic becomes the most complex part of the system.

Fix
For every entity, there is exactly one writer. Everyone else queries or subscribes to events from that owner. Data ownership maps to team ownership. If two teams both need to write the same data, that's a boundary problem to resolve first.
ANTI-PATTERN: Architecture by Consensus / Design by Committee

Every architectural decision requires sign-off from every team. Meetings to decide how to name HTTP routes. No individual is accountable for outcomes. Results in paralysis, bland compromises, and inconsistent implementation.

Fix
Appoint a decision-maker for architectural choices. Use RFC processes for broad input, but with a clear owner who makes the final call and is accountable for the outcome. Disagree and commit.
⚠️

The Microservice Complexity Tax

// what you're signing up for when you go distributed

Microservices don't eliminate complexity — they redistribute it from code to infrastructure and organisational processes. Before adopting them, your team must be ready to pay the following taxes:

Problem | Monolith | Microservices | Required Solution
Transaction | DB transaction | Distributed | Saga pattern, 2PC, eventual consistency
Debugging | Stack trace | Across services | Distributed tracing (OpenTelemetry, Jaeger)
Testing | Unit + integration | Contract + E2E | Consumer-driven contract tests (Pact)
Deployment | One pipeline | N pipelines | Platform engineering, GitOps, ArgoCD
Service Discovery | Function call | Runtime lookup | Service mesh (Istio, Linkerd), DNS
Auth | Session / in-process | Per-service | JWT propagation, service-to-service auth
Data consistency | ACID | BASE | Explicit design for eventual consistency
Local dev | One process | Dozens of containers | Docker Compose, service stubs, Telepresence
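To make the transaction row concrete, here is a minimal sketch of a Saga: each step pairs an action with a compensation, and a failure undoes the completed steps in reverse order. This is an in-process illustration with hypothetical step names; a production saga also needs durable state, retries, and idempotent compensations.

```python
from typing import Callable

# Each step is (action, compensation).
Step = tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

Compare this with the monolith column: a single database transaction gives the same all-or-nothing outcome with a rollback the database performs for you.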
⚠️
The Fallacies of Distributed Computing: The network is reliable. Latency is zero. Bandwidth is infinite. The network is secure. Topology doesn't change. There is one administrator. Transport cost is zero. The network is homogeneous. Every one of these assumptions is false. Your architecture must account for all of them.
📈

The Great Architecture Shift

// why the industry is revisiting microservices
2023–2025 Industry Movement
From Microservices Back to Simplicity
The microservices movement, which peaked around 2017–2020, promised team autonomy, independent scaling, and technical freedom. For many organisations, it delivered complexity, operational burden, and engineering hours diverted from product to platform. The correction isn't a rejection of microservices — it's a recognition that the costs are real and context-dependent.

What the Evidence Shows

Amazon Prime Video (2023)

Moved its audio/video quality monitoring service from a distributed, serverless architecture to a single-process monolith. Reported result: 90% reduction in infrastructure cost for that service, reduced operational complexity, simpler debugging. Published on the team's own engineering blog, it generated significant industry discussion.

Shopify

One of the world's largest Rails monoliths — serving billions in commerce. Instead of splitting into services, they invested in deep module boundaries, component architecture, and tooling to enforce boundaries. Team autonomy achieved without distributed systems overhead.

Stack Overflow

Runs on a monolith serving hundreds of millions of monthly visitors with a small engineering team. Has written extensively about why they don't see the need for microservices at their scale given their team size and architecture quality.

DHH / 37signals (Basecamp)

Loud proponents of the "majestic monolith": a small number of well-structured applications rather than a fleet of services. They have also written about moving their applications off the public cloud and documented significant cost savings and engineering simplicity improvements.

Why the Shift is Happening

The Promise | The Reality
"Independent deployment per service" | In practice: coordinating releases across 30 services, managing schema migrations across teams
"Teams can choose their own tech stack" | In practice: 8 languages, nobody can debug each other's services, hiring becomes impossible
"Scale individual services independently" | In practice: most services have similar load profiles; only 2–3 genuinely need different scaling
"Fault isolation: one service fails, others don't" | In practice: synchronous call chains mean upstream failure propagates anyway
"Small, focused teams with clear ownership" | In practice: 40 services owned by 8 engineers, everyone is on-call for everything
The nuanced conclusion: Microservices solve real problems — at the right organisational scale. The error was treating them as an aspirational default rather than a context-specific solution. The current advice: start with a well-structured modular monolith. Extract services when you have a concrete, measurable reason that the monolith cannot address. Your future self will thank you.
🧭

Timeless Guiding Principles

// what survives every architectural fashion cycle
Make It Easy to Change

The best architecture is the one you can change safely and quickly. Favour loose coupling, explicit interfaces, and clear ownership over any particular structural pattern.

Defer Decisions

Don't make irreversible decisions before you have to. A good architecture delays infrastructure and framework decisions, keeping options open until the cost of deferral exceeds the cost of deciding.

Measure, Don't Guess

Every architectural change should have measurable success criteria. "More scalable" is not a metric. "P99 <200ms at 10k RPS" is. Measure before and after.
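A target such as "P99 <200ms" can be checked directly from latency samples. The sketch below uses the standard library's `statistics.quantiles`; the sample values and the 200 ms default are illustrative assumptions.

```python
import statistics


def p99(samples_ms: list[float]) -> float:
    # quantiles(n=100) returns 99 cut points; the last is the 99th percentile.
    return statistics.quantiles(samples_ms, n=100)[98]


def meets_target(samples_ms: list[float], target_ms: float = 200.0) -> bool:
    return p99(samples_ms) < target_ms


if __name__ == "__main__":
    # 90% fast requests, 10% slow ones: the mean looks fine, the tail does not.
    samples = [100.0] * 900 + [500.0] * 100
    print(statistics.mean(samples), p99(samples), meets_target(samples))
```

Run the same check before and after an architectural change; averages alone can hide exactly the tail behaviour the change was meant to fix.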

Match Team Topology

Conway's Law is real. If your team isn't structured to own a service independently, the service will become a coordination nightmare. Architecture and team design must evolve together.

Design for Failure

Every external call will fail, every database will go down, every third-party API will have an outage. Circuit breakers, retries with backoff, fallbacks, and graceful degradation are not optional.
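As one example of these mechanisms, a minimal circuit breaker might look like the sketch below. The thresholds, the injectable `clock`, and the class names are assumptions for illustration; established resilience libraries handle concurrency and metrics that this sketch ignores.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic) -> None:
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("failing fast; dependency marked unhealthy")
            # Half-open: let one trial call through. The failure count is kept,
            # so a single failed trial re-opens the circuit immediately.
            self._opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` and can serve a fallback instead of queueing behind a dependency that is already struggling.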

Avoid Distributed if Possible

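The best distributed systems code is the code you didn't write. Network calls are orders of magnitude slower than in-process calls: a function call takes nanoseconds, while a network round trip inside a data centre takes hundreds of microseconds or more. Every service boundary is a potential failure point, a serialisation cost, and a debugging challenge.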

Architecture Design Checklist

// step-by-step process for designing a system, with pitfalls at each stage

Architecture is not a single meeting — it's an iterative process. This checklist provides a structured order of operations for designing or evaluating a system's architecture, along with the most common mistakes at each step.

1. Understand the Requirements: Functional & Non-Functional
Before drawing any boxes, deeply understand what the system must do and the constraints it must operate within. Many architectural disasters stem from optimising for the wrong requirements.
  • Document core user journeys — what workflows must work, always
  • Define non-functional requirements: availability (99.9% vs 99.999%), latency targets (P50/P95/P99), throughput (RPS), data volume
  • Identify compliance and regulatory constraints (GDPR, HIPAA, SOC2)
  • Clarify consistency requirements — strong consistency vs. eventual consistency, per workflow
  • Establish team size and operational maturity — this constrains your pattern options
  • Identify time-to-market constraints — complexity costs delivery time
⚠ Common Pitfall

Designing for imaginary scale. "We might get 10M users" is not a requirement. Ask: what does day-one traffic look like? What's the 12-month realistic projection? Architect for that, with a clear path to the next order of magnitude.
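Availability targets are easier to weigh once translated into allowed downtime. The short calculation below assumes a 365-day year and counts all downtime against the target.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a system may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(target, round(allowed_downtime_minutes(target), 1))
```

99.9% allows roughly 8.8 hours of downtime a year; 99.999% allows a little over 5 minutes. The gap between those two targets is usually the gap between one region with good practices and a multi-region design with automated failover.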

2. Define the Domain: Identify Bounded Contexts
Use Domain-Driven Design to identify the natural seams in your problem domain. These seams — bounded contexts — are where your architectural boundaries should live. Boundaries drawn in the wrong place cause more damage than any technology choice.
  • Run an Event Storming session to map domain events and commands
  • Identify aggregates — clusters of entities that change together, owned by a single service/module
  • Find where concepts mean different things in different contexts (a "Customer" in billing vs. shipping)
  • Map team ownership to domain concepts — each bounded context should have one owning team
  • Document the context map: relationships between bounded contexts (upstream/downstream, partnership, anti-corruption layer)
⚠ Common Pitfall

Drawing service boundaries around technical concerns (AuthService, DatabaseService) rather than domain concepts. Technical boundaries couple teams; domain boundaries enable autonomy. A "UserService" that owns all user data is usually a monolith with an API in front of it.

3. Choose Your Decomposition Strategy
Given your requirements, team size, and domain understanding from steps 1–2, choose the structural pattern that fits. This is the most consequential single decision in the architecture process.
  • Apply the decision matrix: team size, scale requirements, domain clarity, operational maturity
  • Default to the simplest pattern that meets current requirements — simpler is almost always better
  • If choosing microservices: can each service be independently deployed, scaled, and developed by one team?
  • If choosing a monolith: enforce module boundaries — don't let it become a ball of mud
  • Document the reasoning for the decision — future team members need to understand why
  • Explicitly note what would cause you to revisit this decision (scaling signals, team growth milestones)
⚠ Common Pitfall

Choosing microservices because a senior engineer read about them or "that's how Netflix does it." Netflix also has thousands of engineers and a decade of platform engineering investment. Your context is different.

4. Design Data Architecture & Ownership
Data design is inseparable from service design. Every entity has exactly one owner. The data model is often the real source of coupling — more than APIs or code.
  • Assign data ownership: every table/collection has exactly one authoritative writer
  • Decide on consistency model per workflow: ACID (same service) vs. eventual (cross-service via events)
  • Design for data sharing: query APIs, read models, event subscriptions — never shared tables across modules
  • Plan schema evolution strategy: how will you add fields, rename columns, and remove data without downtime
  • Choose your database types intentionally: relational, document, time-series, search — don't default to one for everything
  • Design the backup, recovery, and data retention policy — architecture decision, not an afterthought
⚠ Common Pitfall

The shared database anti-pattern in disguise: microservices with separate deployment pipelines but a shared schema. Any team can modify any table. Schema migrations require coordinating every service. This is a monolith with network calls added.

5. Design Communication Patterns
How components talk to each other determines latency, resilience, and coupling. Synchronous calls couple availability; async messaging decouples it but adds eventual consistency.
  • Map each interaction: is it request/response (synchronous) or fire-and-forget / publish-subscribe (async)?
  • Use synchronous calls for user-facing reads and transactional writes within a bounded context
  • Use async messaging for cross-context workflows, notifications, and eventual propagation
  • Define your API contract strategy: REST, gRPC, GraphQL — and how schemas are versioned
  • Design for failure: every synchronous call needs a timeout and a fallback, plus a bounded retry policy with backoff for operations that are safe to repeat
  • If using messaging: define message schema ownership, schema registry, and evolution policy
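The "timeout, retry policy, and fallback" item can be sketched as one small wrapper. This is an illustration under stated assumptions, not a production client: `call_with_retry` and its parameters are hypothetical names, the wrapped operation is assumed to enforce the timeout it is given (for example by passing it to an HTTP client), and the operation is assumed to be safe to repeat.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[float], T],
    fallback: Callable[[], T],
    *,
    timeout_s: float = 0.5,
    attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try operation(timeout_s) up to `attempts` times, then use the fallback."""
    for attempt in range(attempts):
        try:
            return operation(timeout_s)
        except (TimeoutError, ConnectionError):
            # Back off exponentially between attempts, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    return fallback()
```

The fallback might return a cached value or a degraded response. Injecting `sleep` keeps the backoff testable without real delays.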
⚠ Common Pitfall

Synchronous call chains: Service A → B → C → D to handle one user request. End-to-end latency is the sum of every hop's latency, so tail latencies compound, and the chain's availability is roughly the product of each hop's availability. One slow or failed service degrades the whole chain. Design for autonomy: each service should be able to serve its core function even if downstream services are unavailable.
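The arithmetic behind a synchronous chain is worth making concrete. The sketch below assumes hops are called strictly in sequence and fail independently, which real systems only approximate; the figures are illustrative, not measurements.

```python
import math


def chain_availability(availabilities: list[float]) -> float:
    """Availability of a serial chain: every hop must succeed."""
    return math.prod(availabilities)


def chain_latency_ms(hop_latencies_ms: list[float]) -> float:
    """Latency of a serial chain: each hop waits for the next one."""
    return sum(hop_latencies_ms)


# Four services at 99.9% each yield roughly 99.6% for the whole request path.
four_hops = chain_availability([0.999] * 4)
total_ms = chain_latency_ms([20, 35, 50, 15])
```

Under these assumptions, going from one hop to four roughly quadruples the expected unavailability, which is why shortening the chain or making hops asynchronous pays off.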

6
Design for Observability from Day One
You cannot operate what you cannot observe. Observability is not a feature to add later — it must be part of the initial design. In distributed systems, it's the difference between a 5-minute incident and a 5-hour outage.
  • Define the three pillars: Metrics (Prometheus/Datadog), Logs (structured, correlation IDs), Traces (OpenTelemetry)
  • Propagate correlation/trace IDs across every service boundary — from HTTP header to database query
  • Define SLIs (what you measure), SLOs (the targets), and SLAs (the commitments)
  • Design alerting strategy: alert on symptoms (SLO burn rate), not causes (CPU > 80%)
  • Build runbooks alongside the system — not after incidents
  • Implement health checks, readiness probes, and circuit breakers as standard
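Correlation ID propagation from the list above can be sketched with the standard library alone. This is a minimal illustration: the `X-Correlation-ID` header name, `handle_request`, and `build_logger` are assumptions made for the example, and a real system would more likely rely on OpenTelemetry context propagation.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit structured log lines that always carry the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("handbook-example")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(headers: dict[str, str], logger: logging.Logger) -> dict[str, str]:
    """Adopt the caller's ID or mint one, log with it, and return downstream headers."""
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        return {"X-Correlation-ID": cid}
    finally:
        correlation_id.reset(token)


stream = io.StringIO()
logger = build_logger(stream)
downstream_headers = handle_request({"X-Correlation-ID": "abc123"}, logger)
```

The returned headers would be attached to every outbound call so the next service logs under the same ID.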
⚠ Common Pitfall

"We'll add monitoring later." In a distributed system, "later" means "after the first major outage, while under pressure, with customers watching." The cost of instrumenting a service at build time is tiny compared to debugging a production issue with no telemetry.

7
Design Security Architecture
Security is not a layer you add on top — it's a dimension of every architectural decision. Authentication, authorisation, data encryption, network segmentation, and secrets management all have architectural implications.
  • Define authentication: who are the actors (users, services, third-parties) and how do they prove identity
  • Define authorisation model: RBAC, ABAC, or per-resource ACLs — and where enforcement happens
  • Design service-to-service auth: mutual TLS, JWT, or API keys — never unauthenticated internal traffic
  • Plan secrets management: environment variables at minimum; HashiCorp Vault / AWS Secrets Manager for production
  • Network segmentation: what can talk to what — apply least privilege at the network level
  • Data classification: identify PII, payment data, health data — apply appropriate encryption and access controls
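Of the authorisation models listed above, RBAC is the simplest to illustrate. The sketch below shows a deny-by-default check; the role and permission names are hypothetical, and a real system would load them from a policy store rather than a module-level constant.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS: dict[str, frozenset[str]] = {
    "viewer": frozenset({"invoice:read"}),
    "billing-admin": frozenset({"invoice:read", "invoice:refund"}),
}


def is_allowed(roles: list[str], permission: str) -> bool:
    """Deny by default: allow only if a known role grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, frozenset())
        for role in roles
    )
```

Unknown roles and empty role lists fall through to a denial, which is the behaviour you want when a token carries a role the service has never heard of.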
⚠ Common Pitfall

Implicit trust inside the network perimeter. "Internal services don't need auth because nothing bad can get in." One compromised service, one misconfigured S3 bucket, or one insider threat breaks this assumption. Zero-trust networking, where every request is authenticated and authorised regardless of where it originates, is the safer baseline.

8
Plan the Deployment & Operations Model
Architecture and deployment are inseparable. How you deploy dictates what failure modes you face, how quickly you can release, and how much operational overhead your team carries.
  • Define deployment targets: containers (Kubernetes, ECS), VMs, PaaS, serverless — match to team operational capability
  • Design CI/CD pipeline: how code moves from commit to production, including automated tests and approvals
  • Define release strategy: blue/green, canary, feature flags — how do you roll back quickly?
  • Plan infrastructure as code from day one: Terraform/Pulumi, not ClickOps
  • Define environment strategy: how many environments, what runs in each, how are they provisioned
  • Plan for disaster recovery: RTO (how long to recover) and RPO (how much data can be lost)
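For the feature-flag release strategy mentioned above, a common approach is a deterministic percentage rollout. The following is a minimal sketch, assuming users are identified by a stable string ID; `is_enabled` and the flag name are hypothetical, and hosted flag services offer far more than this.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Place each user in a stable bucket from 0 to 99 and compare to the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Because the bucket depends only on the flag and the user, a user who sees the feature at 10% still sees it at 50%, and setting the percentage to 0 acts as an immediate rollback without a deploy.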
⚠ Common Pitfall

Kubernetes by default. K8s is a powerful operations platform with real complexity. If your team doesn't have platform engineering capability, a managed platform (Railway, Render, Fly.io, ECS Fargate) can deliver most of the benefit for a fraction of the operational overhead.

9
Document Architecture Decisions (ADRs)
Architecture without documentation is tribal knowledge. Architecture Decision Records (ADRs) capture not just what was decided, but why — including the alternatives considered and the forces at play. Invaluable for new team members and for revisiting decisions as context changes.
  • Write an ADR for every significant architectural decision (template: Context → Decision → Consequences)
  • Store ADRs alongside code in version control — they evolve with the system
  • Mark replaced ADRs as superseded and link to their successor rather than deleting them; the history matters
  • Document the system's C4 model: Context, Containers, Components, Code diagrams
  • Define and publish API contracts (OpenAPI, AsyncAPI, Protobuf schemas)
  • Review and update documentation as the system evolves — stale docs are worse than no docs
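The Context → Decision → Consequences template can be modelled as a small record type. This is a sketch for illustration: the `ADR` class, its field names, and the status strings are assumptions, and most teams keep ADRs as plain text files rather than objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ADR:
    number: int
    title: str
    context: str
    decision: str
    consequences: str
    status: str = "Accepted"
    superseded_by: Optional[int] = None

    def supersede(self, successor: "ADR") -> None:
        """Keep the record, but point readers at the decision that replaced it."""
        self.status = "Superseded"
        self.superseded_by = successor.number

    def render(self) -> str:
        status = self.status
        if self.superseded_by is not None:
            status += f" by ADR-{self.superseded_by:04d}"
        return "\n".join([
            f"ADR-{self.number:04d}: {self.title}",
            f"Status: {status}",
            f"Context: {self.context}",
            f"Decision: {self.decision}",
            f"Consequences: {self.consequences}",
        ])
```

Rendering to plain text keeps the record diffable in version control, and the link to the successor preserves the decision history.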
⚠ Common Pitfall

Architecture diagrams that don't match reality. Systems drift from their documented state immediately. Build living documentation: generate diagrams from code where possible (Structurizr, C4 DSL), run architecture fitness functions in CI to catch drift automatically.
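An architecture fitness function can be as small as an import check. The sketch below uses Python's `ast` module to flag imports that cross a forbidden boundary; the module names and the `FORBIDDEN` rule table are hypothetical, and dedicated tools cover more cases than this.

```python
import ast

# Hypothetical rule: code in "billing" must not reach into orders' internals.
FORBIDDEN = {"billing": ("orders.internal",)}


def forbidden_imports(module: str, source: str) -> list[str]:
    """Return imported module names in `source` that break the rules for `module`."""
    banned = FORBIDDEN.get(module, ())
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            if any(name == b or name.startswith(b + ".") for b in banned):
                found.append(name)
    return found
```

Run over every file in a module as part of CI, a non-empty result fails the build, so drift from the intended boundaries is caught at review time rather than discovered later.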

10
Define Evolution & Migration Strategy
No architecture is final. Design for change: define the signals that would trigger an architectural evolution and the path for getting there. The teams that handle growth well aren't the ones who made perfect initial decisions — they're the ones who built in reversibility.
  • Define explicit scaling triggers: "When we reach X RPS, we'll extract Y module as a service"
  • Plan for the Strangler Fig pattern: incrementally replace parts of the system without a big rewrite
  • Identify the riskiest architectural assumptions and build in monitoring to detect when they break
  • Establish regular architecture reviews — quarterly at minimum — with the engineering team
  • Budget for technical debt paydown: not all debt is bad, but unmanaged debt compounds
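The Strangler Fig pattern usually starts with a routing facade in front of the old system. The following is a minimal sketch, assuming requests are routed by path prefix; `StranglerRouter` and the handler signatures are hypothetical, and in practice this role is often played by a reverse proxy or API gateway.

```python
from typing import Callable

Handler = Callable[[str], str]


class StranglerRouter:
    """Send migrated path prefixes to new handlers and everything else to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        self._migrated[prefix] = handler

    def rollback(self, prefix: str) -> None:
        # Routing the prefix back to legacy is the escape hatch.
        self._migrated.pop(prefix, None)

    def handle(self, path: str) -> str:
        # The longest matching prefix wins, so narrower routes take priority.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path)
        return self._legacy(path)
```

Each migrated prefix is a small, reversible step: the legacy system keeps serving everything not yet moved, and `rollback` restores the old path if the new handler misbehaves.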
⚠ Common Pitfall

The Big Rewrite. "Our codebase is a mess; let's start over with the right architecture." Rewrites often take 2–3× longer than estimated, the new system acquires its own technical debt, and the business has to keep running on, and paying to maintain, the old system the whole time. The Strangler Fig almost always beats the rewrite.

📚

Reference Links

// foundational reading for every practicing architect
📌
Quick diagnostic: If you're evaluating your current architecture, run this thought experiment — "If I needed to change the billing logic, how many teams would I need to coordinate? How many services would I deploy? How long would it take?" The answer tells you whether your current architecture is enabling or impeding your team. There is no universal right answer — only what's right for your context.