Enterprise Architecture — Beginner's Guide
A practical, opinionated handbook for engineers stepping into enterprise architecture. Covers domains, technology selection, governance, common pitfalls, standards, documentation, and real-world examples.
What is Enterprise Architecture?
Enterprise Architecture (EA) is the discipline of aligning an organisation's technology landscape with its business strategy. It is not just about software — it encompasses processes, people, data, infrastructure, and governance. EA provides a blueprint for how the enterprise operates today and how it should evolve.
Common EA Frameworks
| Framework | Origin | Key Strength | When to Use |
|---|---|---|---|
| TOGAF | The Open Group | Comprehensive ADM lifecycle, widely recognised | Large regulated enterprises needing formal structure |
| Zachman | John Zachman | Classification matrix — who, what, when, where, why, how | Taxonomy and classification of artefacts |
| SAFe (Agile) | Scaled Agile | EA embedded into agile at scale | Agile enterprises, PI planning |
| Gartner EA | Gartner | Business-outcome-driven, pragmatic | Consultancy-led transformation |
| Informal / Lightweight | Org-specific | Speed, pragmatism | Startups scaling to enterprise, product companies |
EA vs Solution Architecture
| Dimension | Enterprise Architect | Solution Architect |
|---|---|---|
| Scope | Entire organisation or domain landscape | Single project or product |
| Time horizon | 3–5+ years (strategic) | 6–18 months (tactical) |
| Primary stakeholder | CTO, CIO, business leadership | Product owner, dev team, project manager |
| Output | Technology roadmaps, principles, standards | Solution design, component diagrams, ADRs |
| Level of detail | High-level patterns and constraints | Detailed enough to build from |
The Six Domains of Enterprise Architecture
Enterprise architecture spans six interconnected domains. Gaps in any one domain create risk, technical debt, or business misalignment. All six must be considered even if different architects own different layers.
Business Architecture
Business architecture maps the capabilities and processes of the organisation to technology. It ensures that every technical decision is grounded in a real business need.
Core Concepts
- Business Capability: What the organisation does (e.g., "Order Management", "Customer Onboarding"). Capabilities are stable; processes and systems change.
- Value Stream: End-to-end flow of activities that deliver value to a customer — from request to fulfilment.
- Business Process: A defined sequence of tasks within a capability. Documented in BPMN or simple flowcharts.
- Operating Model: How the organisation operates — centralised vs federated, shared services, outsourced functions.
- Capability Map: A visual inventory of all capabilities grouped by domain (e.g., Finance, HR, Operations).
Business Capability Map — Example
// Level 1 Capability Map (simplified e-commerce platform)
Customer Domain
├── Customer Acquisition (Marketing, SEO, Paid Ads)
├── Customer Management (Profile, Preferences, CRM)
└── Customer Support (Helpdesk, Returns, Escalation)
Order Domain
├── Product Catalogue (Listing, Search, Inventory)
├── Order Processing (Cart, Checkout, Payment)
└── Order Fulfilment (Warehouse, Shipping, Tracking)
Finance Domain
├── Revenue Management (Invoicing, Reconciliation)
├── Payments & Settlements (PSP Integration, Refunds)
└── Financial Reporting (P&L, Dashboards, Audit)
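A capability map like the one above can also be kept as structured data, which makes it possible to ask which capabilities have no supporting system. The following is a minimal sketch, not a prescribed format; the `SYSTEMS` mapping and the system names in it are hypothetical and only illustrate the idea.

```python
# Level 1 capability map from the example above, as nested data.
CAPABILITY_MAP = {
    "Customer": ["Customer Acquisition", "Customer Management", "Customer Support"],
    "Order": ["Product Catalogue", "Order Processing", "Order Fulfilment"],
    "Finance": ["Revenue Management", "Payments & Settlements", "Financial Reporting"],
}

# Hypothetical system-to-capability mapping (illustrative names only).
SYSTEMS = {
    "crm-suite": ["Customer Management", "Customer Support"],
    "shop-backend": ["Product Catalogue", "Order Processing"],
    "billing-service": ["Revenue Management", "Payments & Settlements"],
}


def unsupported_capabilities(capability_map, systems):
    """Return capabilities that no system currently supports, grouped by domain."""
    covered = {cap for caps in systems.values() for cap in caps}
    gaps = {}
    for domain, capabilities in capability_map.items():
        missing = [cap for cap in capabilities if cap not in covered]
        if missing:
            gaps[domain] = missing
    return gaps


if __name__ == "__main__":
    print(unsupported_capabilities(CAPABILITY_MAP, SYSTEMS))
```

Because capabilities are stable while systems change, a query like this stays useful as the portfolio evolves.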
Application Architecture
Application architecture defines the system portfolio — what software systems exist, how they are structured internally, and how they relate to each other.
Common Patterns
| Pattern | Description | Best For |
|---|---|---|
| Modular Monolith | Single deployable unit with internal module boundaries | Small-to-medium teams, early-stage products |
| Microservices | Independent, bounded-context services | Large orgs with independent teams and domains |
| Event-Driven | Services communicate via events/messages | Decoupled workflows, async processing, audit trails |
| Serverless / FaaS | Functions triggered by events, no server management | Sporadic workloads, automation, glue code |
| CQRS + Event Sourcing | Separate read/write models; state rebuilt from events | Complex domains, audit, temporal queries |
| BFF (Backend for Frontend) | Dedicated API layer per frontend client type | Mobile + web with different data needs |
Application Portfolio Principles
- Rationalise before you build. Ask: does a system already exist that can be extended? Can a SaaS product cover this?
- Define bounded contexts. Each system should have a clear, non-overlapping ownership boundary (Domain-Driven Design).
- Avoid shared databases. Services sharing a DB are distributed monoliths — the worst of both worlds.
- Track technical debt. Use a tech debt register; classify debt as: intentional, unintentional, bit rot.
- Decommission proactively. Every system costs money to operate. Dead systems still generate incidents.
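The "avoid shared databases" principle can be checked mechanically if the portfolio inventory records which data stores each service uses. A minimal sketch, assuming a hypothetical inventory shaped as a mapping from service name to data store names:

```python
from collections import defaultdict


def find_shared_datastores(service_datastores):
    """Return {datastore: sorted service names} for stores used by more than one service."""
    users = defaultdict(set)
    for service, stores in service_datastores.items():
        for store in stores:
            users[store].add(service)
    return {store: sorted(svcs) for store, svcs in users.items() if len(svcs) > 1}


if __name__ == "__main__":
    # Hypothetical inventory; service and database names are illustrative.
    inventory = {
        "orders": ["orders-db"],
        "billing": ["orders-db", "billing-db"],
        "catalogue": ["catalogue-db"],
    }
    print(find_shared_datastores(inventory))
```

Any store that appears in the result is a candidate for splitting ownership or for replacing direct access with an API or events.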
Data Architecture
Data architecture defines how data is created, stored, transformed, moved, and consumed across the enterprise. Poor data architecture is one of the most common root causes of system complexity.
Key Concepts
| Concept | Description |
|---|---|
| Master Data | The authoritative, shared definition of key entities (Customer, Product, Account). One system of record per entity. |
| Data Lineage | Traceability of data from origin to consumption. Critical for compliance and debugging. |
| Data Mesh | Domain-oriented data ownership; teams own and publish data products rather than relying on a central team. |
| Data Lake | Central store for raw data in its original format, whether structured, semi-structured, or unstructured. Enables analytics and ML without pre-modelling. |
| Data Warehouse | Structured, curated store for reporting and BI. Schema-on-write, high query performance. |
| Lakehouse | Combines data lake storage with warehouse query semantics (e.g., Delta Lake, Apache Iceberg). |
| Data Governance | Policies, ownership, quality rules, classification, and lifecycle management of data. |
Data Architecture Checklist
1. Define entities and ownership: Which system is the system of record for each key entity?
2. Data classification: PII, confidential, internal, or public. Every data element must be classified.
3. Define retention policies: How long is each category of data retained? Who approves exceptions?
4. Event vs state: Should systems share state (DB replication) or events (CDC, Kafka)? Events are almost always better for decoupling.
5. Encryption at rest and in transit: Default to encrypted. Document any exceptions and the accepted risk.
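The classification rule in the checklist lends itself to an automated check. A minimal sketch, assuming a hypothetical schema format in which each field carries an optional `classification` entry; the four allowed values come from the checklist, everything else is illustrative:

```python
# Classification levels named in the checklist above.
ALLOWED_CLASSIFICATIONS = {"PII", "confidential", "internal", "public"}


def unclassified_fields(schema):
    """Return sorted field names whose classification is missing or not an allowed value."""
    return sorted(
        name
        for name, meta in schema.items()
        if meta.get("classification") not in ALLOWED_CLASSIFICATIONS
    )


if __name__ == "__main__":
    # Hypothetical schema for a customer order record.
    schema = {
        "email": {"classification": "PII"},
        "order_total": {"classification": "internal"},
        "notes": {},  # classification missing
    }
    print(unclassified_fields(schema))
```

A check like this could run in CI against schema definitions so that unclassified fields are caught before they reach production.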
Technology Architecture
Technology architecture covers the infrastructure and platform layer — cloud, on-premises, networking, compute, storage, and the runtimes that applications run on.
Cloud Strategy Decisions
| Strategy | Description | Trade-off |
|---|---|---|
| Cloud-native | Fully managed services, PaaS/FaaS first | Speed + agility vs potential lock-in |
| Cloud-agnostic | Containers + Kubernetes, portable workloads | Portability vs complexity and overhead |
| Hybrid cloud | On-prem + cloud, connected via VPN/ExpressRoute | Compliance & latency control vs operational complexity |
| Multi-cloud | AWS + Azure + GCP, best-of-breed per service | Resilience vs significantly higher operational burden |
| On-premises | Own data centres | Full control vs CapEx, slow to provision |
Platform Standards Checklist
- Define approved runtimes — e.g., Java 21 LTS, Python 3.12, Node 22 LTS. Discourage old/unsupported versions.
- Containerise everything — Docker images as the standard deployable unit; Kubernetes or managed container platforms for orchestration.
- Infrastructure as Code — Terraform, Bicep, or Pulumi. All infrastructure changes via code, reviewed, and version-controlled.
- Golden paths — Provide opinionated, pre-approved templates for common workload types (API service, worker, static site). Reduce decisions, increase consistency.
- Observability by default — Logs, metrics, and traces must be emitted by all services. No exception for new workloads.
Security Architecture
Security architecture must be built in, not bolted on. Every design decision has a security implication. Enterprise architects and solution architects share joint responsibility for security outcomes.
Principles
Threat Modelling (STRIDE)
| Threat | Description | Mitigation Example |
|---|---|---|
| Spoofing | Impersonating another identity | Strong authentication (MFA, mTLS) |
| Tampering | Modifying data in transit or at rest | HTTPS, data integrity checks, signed tokens |
| Repudiation | Denying an action was taken | Audit logs, non-repudiation via signed events |
| Information Disclosure | Exposing sensitive data | Encryption, role-based field masking, DLP |
| Denial of Service | Making a system unavailable | Rate limiting, WAF, auto-scaling, circuit breakers |
| Elevation of Privilege | Gaining unauthorised elevated access | RBAC, just-in-time access, privilege boundaries |
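As one concrete instance of the Tampering mitigations in the table (integrity checks and signed tokens), a message can carry an HMAC that the receiver recomputes. This is a minimal sketch using Python's standard `hmac` module; the key and message below are placeholders, and real key management is out of scope here.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 signature for the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, signature: str, key: bytes) -> bool:
    """Check the signature using a constant-time comparison."""
    return hmac.compare_digest(sign(message, key), signature)


if __name__ == "__main__":
    key = b"placeholder-key"  # illustrative only; load real keys from a secret store
    sig = sign(b"amount=10", key)
    print(verify(b"amount=10", sig, key))    # unmodified message
    print(verify(b"amount=1000", sig, key))  # tampered message
```

The constant-time comparison avoids leaking how many leading characters of a forged signature were correct.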
Integration Architecture
Integration architecture defines how systems communicate, share data, and collaborate. Poor integration design is the leading cause of brittle, hard-to-change systems.
Integration Patterns
| Pattern | When to Use | Examples |
|---|---|---|
| REST API | Synchronous request-response, standard CRUD | JSON over HTTPS, OpenAPI spec |
| GraphQL | Flexible querying for multiple clients with different data needs | Apollo, Hasura |
| gRPC | High-throughput internal service-to-service, typed contracts | Protobuf, internal microservices |
| Message Queue | Async, decoupled, at-least-once delivery | RabbitMQ, Azure Service Bus, SQS |
| Event Streaming | High-volume, ordered, replayable event log | Apache Kafka, Azure Event Hubs, Kinesis |
| Webhooks | Push notifications for state changes to external parties | Stripe, GitHub, Salesforce callbacks |
| CDC (Change Data Capture) | Replicating data changes from a database to downstream systems | Debezium, AWS DMS |
| API Gateway | Single entry point; routing, auth, rate limiting, observability | Azure APIM, AWS API Gateway, Kong |
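At-least-once delivery, as listed for message queues above, means a consumer can receive the same message more than once, so handlers are usually made idempotent. A minimal sketch of one common approach, deduplicating by message ID; the in-memory set stands in for what would need to be a durable store, and the class name is hypothetical.

```python
class IdempotentConsumer:
    """Process each message ID at most once, even if the broker redelivers it."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: a durable store would replace this in practice

    def handle(self, message_id, payload):
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeds, so failures can be retried.
        self._seen.add(message_id)
        return True


if __name__ == "__main__":
    processed = []
    consumer = IdempotentConsumer(processed.append)
    consumer.handle("msg-1", {"order": 42})
    consumer.handle("msg-1", {"order": 42})  # redelivery is skipped
    print(processed)
```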
Technology Selection Framework
Selecting technology is one of the most consequential decisions in enterprise architecture. The wrong choice creates years of pain; a rushed choice creates the same. Use a structured evaluation framework every time.
Selection Criteria
| Criterion | Weight | Questions to Ask |
|---|---|---|
| Fit for purpose | High | Does it actually solve the problem? Is the use-case in its sweet spot? |
| Operational maturity | High | Is it production-proven at scale? What are the known failure modes? |
| Team capability | High | Does the team have the skills? What is the learning curve and time-to-competency? |
| Vendor / OSS health | High | Is the vendor stable? Is the project actively maintained? What is the bus factor? |
| Total Cost of Ownership | High | Licensing + ops + training + migration. Not just purchase price. |
| Integration complexity | Medium | How does it connect to existing systems? What standards does it support? |
| Security & compliance | High | Does it meet our compliance posture (SOC2, GDPR, ISO 27001)? |
| Scalability | Medium | Will it handle 10x growth without re-architecting? |
| Lock-in risk | Medium | What does exit look like? Are there open standards? Migration cost? |
| Community & ecosystem | Medium | Forums, documentation quality, third-party integrations, hiring market. |
Evaluation Process
1. Define the problem clearly: Write a one-page problem statement before looking at any tools.
2. Identify shortlist: 2–4 options maximum. More than 4 is analysis paralysis.
3. Score against criteria: Use a weighted scoring matrix. Make weights explicit and agreed upfront.
4. Run a PoC: Build the hardest integration scenario, not the happy path. Time-box to 2–4 weeks.
5. Reference check: Talk to 2–3 organisations already running it in production at similar scale.
6. Document and socialise the decision: Write an ADR. Get ARB sign-off if the decision affects multiple teams.
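The weighted scoring step can be sketched as a small calculation. The criteria and their High/Medium weights below are taken from the selection criteria table; the numeric mapping (High = 3, Medium = 2) and the 1-5 scoring scale are assumptions for illustration and should be replaced by whatever the team agrees upfront.

```python
# Assumed numeric mapping for the High/Medium weights in the criteria table.
WEIGHTS = {"High": 3, "Medium": 2}

CRITERIA = {
    "Fit for purpose": "High",
    "Operational maturity": "High",
    "Team capability": "High",
    "Vendor / OSS health": "High",
    "Total Cost of Ownership": "High",
    "Integration complexity": "Medium",
    "Security & compliance": "High",
    "Scalability": "Medium",
    "Lock-in risk": "Medium",
    "Community & ecosystem": "Medium",
}


def weighted_score(scores):
    """Return the weighted average of per-criterion scores (assumed 1-5 scale)."""
    missing = sorted(set(CRITERIA) - set(scores))
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    total_weight = sum(WEIGHTS[level] for level in CRITERIA.values())
    weighted = sum(scores[name] * WEIGHTS[level] for name, level in CRITERIA.items())
    return weighted / total_weight


if __name__ == "__main__":
    option_a = {name: 3 for name in CRITERIA}
    option_b = dict(option_a, **{"Fit for purpose": 5})
    print(weighted_score(option_a), weighted_score(option_b))
```

Refusing to compute a total when a criterion is unscored keeps an option from looking better simply because an awkward criterion was skipped.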
Build vs Buy vs Integrate
This is one of the most repeated questions in enterprise architecture, and the answer is rarely obvious. Each option has different trade-offs across cost, control, speed, and risk.
Build: Custom software developed in-house or contracted.
- Full control over features and data
- Differentiating IP stays in-house
- Highest cost and longest time-to-value
- You own the operational burden forever
Buy: Commercial off-the-shelf or SaaS product.
- Fastest time-to-value
- Vendor manages ops, updates, compliance
- Risk of lock-in and feature gaps
- Often requires process change to fit the tool
Integrate: Open-source or platform component, self-hosted or managed.
- No licensing cost; OSS community support
- You own hosting, patching, upgrades
- Higher flexibility than SaaS
- Requires internal expertise
Technology Radar
A Technology Radar is a living document that categorises technologies by their maturity and adoption stance within your organisation. It gives teams a clear signal on what to adopt, experiment with, hold, or avoid.
The Four Rings
| Ring | Meaning | Action |
|---|---|---|
| Adopt | Proven, recommended for most use-cases | Default choice. No escalation required. |
| Trial | Worth pursuing; requires evidence collection | Use on non-critical workloads with architect oversight |
| Assess | Interesting; needs investigation before commitment | PoC only. Do not use in production yet. |
| Hold | Not recommended for new work; legacy only | No new projects. Migrate off existing usage over time. |
Example Radar Entries
// Example Technology Radar: Platform Domain (2026)
ADOPT
├── Kubernetes (container orchestration)
├── Terraform (infrastructure as code)
├── OpenTelemetry (observability instrumentation)
├── PostgreSQL (relational data)
└── React / Angular 18+ (frontend)
TRIAL
├── Temporal (workflow orchestration)
├── Apache Iceberg (open table format)
└── HTMX (lightweight frontend interactivity)
ASSESS
├── Wasm (WebAssembly for compute-intensive tasks)
└── Deno (JavaScript runtime)
HOLD
├── AngularJS (EOL)
├── REST XML / SOAP (legacy integrations only)
└── On-premises Oracle DB (for new projects)
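A radar is easier to act on when teams can look up a technology and get the ring and its action directly. A minimal sketch, using a few entries from the example radar and the actions from the ring table; the lookup function and the message for technologies that are not on the radar are assumptions, not part of any radar tooling.

```python
# A subset of the example radar entries above.
RADAR = {
    "Kubernetes": "Adopt",
    "Terraform": "Adopt",
    "PostgreSQL": "Adopt",
    "Temporal": "Trial",
    "Apache Iceberg": "Trial",
    "Deno": "Assess",
    "AngularJS": "Hold",
}

# Actions per ring, as given in the ring table.
RING_ACTIONS = {
    "Adopt": "Default choice. No escalation required.",
    "Trial": "Use on non-critical workloads with architect oversight.",
    "Assess": "PoC only. Do not use in production yet.",
    "Hold": "No new projects. Migrate off existing usage over time.",
}


def guidance(technology):
    """Return the ring and action for a technology, or a prompt to escalate."""
    ring = RADAR.get(technology)
    if ring is None:
        # Assumed policy: anything not on the radar is raised for review first.
        return "Not on the radar: raise with the ARB before use."
    return f"{ring}: {RING_ACTIONS[ring]}"


if __name__ == "__main__":
    for tech in ("Kubernetes", "Deno", "SomeNewTool"):
        print(tech, "->", guidance(tech))
```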
Proof of Concept (PoC)
A PoC is a time-boxed, low-fidelity experiment to reduce uncertainty before committing to a technology or approach. It is not a prototype, not an MVP, and not the start of production code.
PoC Rules
- Time-box strictly: 1–4 weeks. If you cannot answer the key question in that time, the question is too broad.
- Define the success criteria before starting: "We will adopt X if the PoC demonstrates Y and Z." Written down. Agreed by stakeholders.
- Test the hardest scenario: The PoC must validate the most risky integration, the highest throughput requirement, or the most complex use-case — not the trivial happy path.
- PoC code is throwaway: It must never become production code. Label it explicitly. Enforce this.
- Document findings: Write a short findings report: what was tested, what worked, what failed, what is still unknown. This feeds the ADR.
Review Processes
Governance without review processes is decoration. Effective review processes catch architectural drift, ensure standards compliance, and build shared understanding — without becoming bureaucratic blockers.
Review Types and When to Trigger
| Review Type | Trigger | Participants | Output |
|---|---|---|---|
| Architecture Review | New system, major feature, or cross-team integration | EA, Solution Architects, Tech Leads | Approved / conditional / rejected design |
| Design Review | Start of any non-trivial feature (work spanning more than 2 sprints) | Solution Architect, Senior Engineers | Reviewed design doc, identified risks |
| Code Review | Every pull request | Peers + at least one senior engineer | Approved PR or change requests |
| Security Review | Before any public-facing feature or data-touching change | Security Architect or designated reviewer | Threat model sign-off, pen test scheduling |
| Post-Incident Review | After every P1/P2 incident | Incident owner, SRE, Architecture | Action items, RCA document |
| Quarterly Architecture Review | Calendar-driven, every quarter | ARB + Domain Architects | Updated roadmap, radar changes, debt register |
Architecture Review Board (ARB)
The ARB is the governance body responsible for cross-cutting architectural decisions. It reviews proposed changes that affect multiple systems, introduce new technology, or deviate from established standards.
ARB Charter — Key Elements
- Membership: Enterprise Architect (chair), Domain/Solution Architects, Principal Engineers, Security Architect, occasionally a business stakeholder.
- Cadence: Standing meeting every 2 weeks. Emergency sessions for critical escalations.
- Decision authority: The ARB approves, conditionally approves (with action items), or rejects proposals. Decisions are binding.
- Escalation path: Teams that disagree with an ARB decision escalate to CTO. The ARB is not a committee for endless debate.
- Transparency: All ARB decisions are published internally — the decision log is visible to all engineers.
What Must Go to the ARB
- Introducing a new technology not on the Adopt list
- Cross-domain integrations between bounded contexts
- New data stores or databases
- Changes to authentication or authorisation models
- Architectural patterns deviating from standards
- Third-party vendor onboarding (data access)
What Does Not Need to Go to the ARB
- Feature development within an existing service
- Library/dependency upgrades within the same major version
- Internal refactoring with no external interface changes
- Bug fixes and operational improvements
- Infrastructure scaling (within approved patterns)
- UI/UX changes
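The ARB triggers in this section can be encoded so that a change request is triaged consistently. A minimal sketch, assuming changes are described by a set of tags; the tag names are hypothetical labels for the triggers listed above, not an established vocabulary.

```python
# Hypothetical tag names for the changes that must go to the ARB.
ARB_TRIGGERS = {
    "new_technology_not_on_adopt_list",
    "cross_domain_integration",
    "new_data_store",
    "auth_model_change",
    "deviation_from_standard_pattern",
    "third_party_vendor_data_access",
}


def arb_triggers(change_tags):
    """Return the sorted ARB triggers present in a change; empty means no ARB review."""
    return sorted(ARB_TRIGGERS & set(change_tags))


if __name__ == "__main__":
    print(arb_triggers({"ui_change", "new_data_store"}))
    print(arb_triggers({"bug_fix", "dependency_minor_upgrade"}))
```

Returning the matching triggers, rather than a bare yes or no, tells the team which part of the change the ARB will want to discuss.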
Design & Code Reviews
Architecture Design Review — What to Look For
- Does the design solve the actual business problem, or is it solving an interesting engineering problem?
- Are the integration points clearly defined — contracts, error handling, back-pressure, retry behaviour?
- Is there a data model? Has the entity ownership question been answered?
- What are the failure modes? What happens when dependency X is unavailable?
- Has the security model been considered? Who can read/write this data? Is it classified?
- Is the design consistent with existing patterns, or does it introduce a new pattern that needs ARB sign-off?
- Is there an operational runbook concept? How will this be monitored, alerted on, and recovered?
Code Review Standards (Architecture Perspective)
- Code reviews enforce team standards — not personal preference. Disagreements on style must be resolved in the standards document, not in PR comments.
- Architectural concerns (boundary violations, coupling, wrong abstraction level) are blocking. Style nits are non-blocking.
- Large PRs (>500 lines) indicate a design problem, not a review problem. Push back on the PR size, not the reviewer.
- Every PR should include test coverage changes. A feature PR with no test changes is a red flag.
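Two of the standards above, the PR size limit and the expectation of test changes, are simple enough to flag automatically. A minimal sketch, assuming the changed files are available as (path, lines changed) pairs; treating any path containing "test" as a test file is a rough heuristic chosen for illustration.

```python
def review_flags(changed_files, max_lines=500):
    """Return review flags for a PR given a list of (path, lines_changed) pairs."""
    flags = []
    total = sum(lines for _, lines in changed_files)
    if total > max_lines:
        flags.append(f"large PR: {total} lines changed (limit {max_lines})")
    # Heuristic assumption: test files have "test" somewhere in their path.
    if not any("test" in path.lower() for path, _ in changed_files):
        flags.append("no test changes")
    return flags


if __name__ == "__main__":
    print(review_flags([("src/app.py", 600)]))
    print(review_flags([("src/app.py", 40), ("tests/test_app.py", 30)]))
```

Flags like these are prompts for a conversation rather than automatic rejections; a large generated file or a documentation-only PR may be perfectly fine.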
Governance Gates
Governance gates are checkpoints in the delivery lifecycle where architecture and quality standards are validated before a team proceeds. They stop technical debt from accumulating when teams move too fast.
Roles Overview
Enterprise architecture involves a hierarchy of architectural roles. Each has a distinct scope, accountability, and set of outputs.
| Role | Scope | Primary Outputs | Reports To |
|---|---|---|---|
| Chief Architect / EA Lead | Entire organisation | Technology strategy, EA principles, ARB governance | CTO / CIO |
| Enterprise Architect | Multi-domain | Technology radar, standards, capability maps, roadmaps | Chief Architect / CTO |
| Domain Architect | Single business domain | Domain-level architecture, integration patterns, standards | Enterprise Architect |
| Solution Architect | Single project or product | Solution design, component diagrams, ADRs, technical risk log | Domain Architect / Product |
| Principal Engineer | Technical domain within a team | Code standards, technical spikes, architectural input on PRs | Engineering Manager |
| Tech Lead | Single team | Sprint-level technical decisions, code review, mentoring | Engineering Manager |
Guide for Solution Architects
The solution architect is the most hands-on architectural role. You bridge strategy and execution — translating EA principles into buildable solutions, and pushing back when delivery pressures threaten architectural integrity.
Core Responsibilities
1. Understand the business problem first. Before drawing a diagram, be able to explain the problem in business terms without using any technical jargon. If you cannot, you are not ready to design a solution.
2. Design to the team's capability level. The best solution is one the team can build, operate, and evolve. A correct-but-unmaintainable design is a bad design.
3. Validate with data, not opinions. Performance estimates, scalability projections, and cost models should be backed by numbers. Run load tests. Check cloud pricing calculators.
4. Document assumptions explicitly. Every solution design rests on assumptions. Write them down. Validate them. Revisit them when they change.
5. Own the technical risk register. Identify, score (likelihood × impact), and mitigate risks proactively. Do not wait for a retrospective.
6. Be present in the team. Attend standups, participate in PR reviews, pair with developers. Architecture that is delivered only via diagrams fails.
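The likelihood × impact scoring mentioned for the risk register can be sketched in a few lines. The L/M/H levels and their 1-3 numeric values are assumptions for illustration; the risk names in the usage example are hypothetical.

```python
# Assumed numeric values for Low / Medium / High.
LEVELS = {"L": 1, "M": 2, "H": 3}


def risk_score(likelihood, impact):
    """Return likelihood x impact on the assumed 1-3 scale (range 1 to 9)."""
    return LEVELS[likelihood] * LEVELS[impact]


def rank_risks(risks):
    """Sort (name, likelihood, impact) tuples by score, highest first."""
    return sorted(risks, key=lambda r: risk_score(r[1], r[2]), reverse=True)


if __name__ == "__main__":
    register = [
        ("Vendor API rate limits", "M", "M"),
        ("Payment provider outage", "L", "H"),
        ("Schema migration data loss", "H", "H"),
    ]
    for name, likelihood, impact in rank_risks(register):
        print(risk_score(likelihood, impact), name)
```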
Solution Design Document Template
    ## Solution Design: [Feature / System Name]
    Version: 1.0 | Author: [Name] | Date: [Date]
    Status: Draft / Under Review / Approved

    ### Problem Statement
    What business problem does this solve? (2–3 sentences, no jargon)

    ### Goals & Non-Goals
    Goals:
    - [Goal 1]
    - [Goal 2]
    Non-goals:
    - [What is explicitly out of scope]

    ### Assumptions
    - [Assumption 1] - validated / unvalidated
    - [Assumption 2]

    ### Architecture Overview
    [High-level component diagram or C4 Level 2 Container diagram]

    ### Key Design Decisions
    Decision 1: [What was decided] - [Why] - [Alternatives considered]
    Decision 2: ...

    ### Data Model
    [Entity definitions, ownership, relationships]

    ### Integration Points
    [System A] → [This system]: [protocol, contract, error handling]
    [This system] → [System B]: [protocol, contract, error handling]

    ### Security Considerations
    - Authentication: [How]
    - Authorisation: [RBAC model]
    - Data classification: [PII? Confidential?]
    - Threat model summary

    ### Observability
    - Key metrics to track
    - Alert thresholds
    - Runbook location

    ### Risks
    | Risk | Likelihood | Impact | Mitigation |
    |------|-----------|--------|------------|
    | [Risk 1] | M | H | [Mitigation] |

    ### Open Questions
    - [Question] - Owner: [Name] - Due: [Date]
Collaborating with Development Teams
Architecture fails when it is done to teams rather than with them. Effective architects embed, listen, and earn trust through technical credibility.
Anti-Patterns in Architect-Team Relationships
- Ivory Tower Architecture: Designing in isolation, handing down diagrams, then disappearing. Teams who feel architecture is imposed on them will route around it.
- Governance-only presence: Being visible only in review meetings, not in day-to-day delivery. You lose context; teams lose trust.
- Rejecting without alternatives: "We can't do X" is only useful if followed by "because Y, and here's what we can do instead: Z".
What Good Collaboration Looks Like
- Architects participate in sprint planning and refinement sessions — at least fortnightly.
- Architecture input is available in the team's working tools (Confluence, Notion, GitHub Discussions) — not locked in separate EA tooling.
- Architects pair with developers on complex integrations and review spikes.
- Feedback flows both ways: developers identify problems that improve architectural decisions.
- Architects celebrate team engineering wins publicly — they are not a police force.
Common Pitfalls in Enterprise Architecture
Enterprise architecture has a well-documented set of recurring failure modes. Recognising them early is the single most valuable skill for a new practitioner.
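Resume-Driven Development (RDD)
Resume-Driven Development is choosing a technology because it benefits the careers of the people advocating for it, rather than because it fits the problem at hand.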
How to Detect It
- Technology selection conversations focus on features and hype rather than the specific problem at hand.
- The team pushing hardest for a technology has the most to gain professionally from learning it.
- TCO analysis is skipped or minimised. "We'll figure out the ops later."
- Alternatives are dismissed with vague arguments: "It doesn't scale" — without defining what scale means in context.
How to Counter It
- Require a written problem statement before any technology discussion.
- Use a weighted scoring matrix. Scores must be justified, not asserted.
- Ask "what problem does this solve that our current stack cannot?" The answer must be specific.
- Separate the technology evaluation from the person advocating for it. Can someone else run the PoC?
Other Common Pitfalls
Defining Coding Standards
Coding standards are the written agreements that define how your organisation writes, structures, and reviews code. Without them, every team reinvents conventions independently, making cross-team collaboration and maintenance costly.
What Coding Standards Cover
| Area | Examples |
|---|---|
| Naming conventions | camelCase vs snake_case, file naming, class vs interface prefixes |
| Project structure | Directory layout, module boundaries, entry point conventions |
| Code style | Indentation, line length, brace style, import ordering |
| Language features | Banned features (e.g. eval), preferred patterns (e.g. async/await over callbacks) |
| Error handling | Do not swallow exceptions, structured logging, error boundary patterns |
| Testing requirements | Minimum coverage %, naming conventions for tests, mandatory test types per PR |
| Security rules | No secrets in code, parameterised queries only, OWASP Top 10 awareness |
| Documentation | When to add comments, public API documentation requirements |
| Git conventions | Branch naming, commit message format (Conventional Commits), PR size limits |
| Dependency management | Approved dependency sources, vulnerability scanning, major version pinning rules |
How to Define Coding Standards — Process
1. Audit the current state: What are teams already doing? Codify existing good practices rather than inventing new ones. Standards imposed without buy-in will be ignored.
2. Involve engineers in authoring: Standards written by architects in isolation are resented. Run workshops. Use surveys. Crowdsource contentious decisions.
3. Prioritise automation over documentation: Every standard that can be enforced by a linter, formatter, or CI check should be. Reserve documentation for what a human must consciously decide.
4. Explain the "why": Every rule should have a rationale. "We use snake_case for Python because PEP 8 is the community standard and our linting enforces it." Rules without rationale get challenged constantly.
5. Version and review annually: Standards that never change become cargo-cult religion. Review them annually with the community. Retire rules that no longer serve their purpose.
6. Distinguish enforced from advisory: Be explicit. Is this rule enforced by CI (non-negotiable) or a recommended practice (advisory)?
Standards Enforcement & Tooling
Standards that rely entirely on human diligence in code review will erode under delivery pressure. Automate enforcement wherever possible.
Enforcement Toolchain
| Tool Type | Examples | Enforces |
|---|---|---|
| Code Formatter | Prettier, Black (Python), gofmt, dotnet-format | Style consistency — no debate in PR reviews |
| Linter | ESLint, Pylint, Checkstyle, SonarLint | Code quality rules, anti-patterns, complexity |
| Static Analysis (SAST) | SonarCloud, Semgrep, CodeQL | Security vulnerabilities, code smells, coverage |
| Dependency Scanner | Dependabot, Snyk, OWASP Dependency-Check | Known CVEs in third-party libraries |
| Secret Scanner | Gitleaks, TruffleHog, GitHub Secret Scanning | Credentials and secrets committed to repos |
| Container Scanner | Trivy, Grype, Snyk Container | Vulnerable base images, OS packages |
| Pre-commit Hooks | Husky, pre-commit framework | Local fast-fail before code reaches CI |
| Branch Protection | GitHub / GitLab branch rules | PR required, CI must pass, review required |
| Code Coverage Gate | Codecov, SonarQube coverage threshold | Minimum test coverage per PR |
    # Example: Pre-commit configuration (.pre-commit-config.yaml)
    repos:
      - repo: https://github.com/pre-commit/pre-commit-hooks
        rev: v4.5.0
        hooks:
          - id: trailing-whitespace
          - id: end-of-file-fixer
          - id: check-yaml
          - id: check-merge-conflict
          - id: detect-private-key  # Block committed secrets
      - repo: https://github.com/psf/black  # Python formatter
        rev: 24.3.0
        hooks:
          - id: black
      - repo: https://github.com/gitleaks/gitleaks
        rev: v8.18.2
        hooks:
          - id: gitleaks  # Secret scanning
Living Standards
Standards are not written once and forgotten. They must evolve with the technology landscape, team growth, and lessons learned. A "living standard" is one that has clear ownership, a review cadence, and a contribution process.
Principles for Living Standards
- Store standards as code: In a Git repository (e.g., a "standards" repo) or a well-structured Confluence space. All changes go through a PR/review process. No one person can unilaterally change a standard.
- Changelog required: Every change to a standard must include a dated changelog entry explaining what changed and why.
- Deprecation notices: Rules being removed should be deprecated first — flagged as "will be removed in 90 days" — giving teams time to adapt.
- Exception process: There must be a documented process for requesting an exception to a standard. This is healthier than teams silently ignoring rules.
- Annual review: Schedule a formal annual standards review. Use it to cull rules that are no longer relevant and add rules for patterns that have emerged.
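The 90-day deprecation notice described above implies a simple date calculation that a standards repo could automate. A minimal sketch; the function names are hypothetical and the 90-day default mirrors the example notice period in the list.

```python
from datetime import date, timedelta


def removal_date(deprecated_on, notice_days=90):
    """Return the earliest date a deprecated rule may be removed."""
    return deprecated_on + timedelta(days=notice_days)


def is_removable(deprecated_on, today, notice_days=90):
    """Return True once the notice period has fully elapsed."""
    return today >= removal_date(deprecated_on, notice_days)


if __name__ == "__main__":
    deprecated = date(2026, 1, 1)
    print(removal_date(deprecated))
    print(is_removable(deprecated, today=date(2026, 3, 1)))
```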
Documentation Types
Different audiences need different documentation. The mistake most teams make is writing everything in one format (usually developer-focused) and assuming it serves everyone.
| Documentation Type | Audience | Format | Update Frequency |
|---|---|---|---|
| Architecture Decision Records (ADRs) | Engineers, Architects | Structured markdown, version-controlled | Per decision (immutable once accepted) |
| System Context / C4 Diagrams | All stakeholders | Diagrams-as-code (Structurizr, Mermaid, PlantUML) | On significant change |
| Runbooks | SRE, Operations | Step-by-step wiki pages | After every incident or procedure change |
| API Documentation | Developers (consumers) | OpenAPI / AsyncAPI spec, generated portal | On API change (auto-generated from code) |
| Solution Design Docs | Engineers, PM, Architect | Structured document (Confluence, Notion) | Before build; updated during significant changes |
| Technology Radar | All engineers | HTML/PDF, visual quadrant | Quarterly |
| Coding Standards | Engineers | Git-hosted markdown | Annually + as needed |
| Data Dictionary | Engineers, Analysts, Business | Structured catalogue (e.g., DataHub, Collibra) | On data model change |
| Roadmaps | Business, Leadership | Visual timeline, slide deck | Quarterly |
Managing Documentation
Documentation that isn't maintained is worse than no documentation — outdated docs cause costly mistakes. Managing documentation requires process, tooling, and cultural buy-in.
Key Principles
- Documentation is part of the Definition of Done. A story is not complete if the documentation it requires has not been updated. This is non-negotiable.
- Docs-as-Code: Where possible, store architecture docs alongside code. ADRs in /docs/decisions/, runbooks in /docs/runbooks/. Changes are reviewed in PRs.
- Single source of truth: If the same information exists in three places, it will be inconsistent within a month. Pick one canonical source per document type and redirect everything else to it.
- Audit quarterly: Review all documentation for staleness each quarter. Archive or delete documents that are no longer accurate. A labelled archive is better than stale live content.
- Search must work: Engineers must be able to find documentation. Poor information architecture in Confluence or Notion is a common reason documentation goes unused.
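The quarterly audit can be partly automated. The sketch below is one possible approach, not a prescribed tool: it assumes each document carries a last-reviewed date (for example from front matter or Git history) and flags anything older than a threshold. The `DocRecord` type, the 90-day default, and the file paths are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DocRecord:
    """One documentation page and the date it was last reviewed."""
    path: str
    last_reviewed: date


def find_stale_docs(docs, today, max_age_days=90):
    """Return docs last reviewed more than max_age_days ago, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [d for d in docs if d.last_reviewed < cutoff]
    return sorted(stale, key=lambda d: d.last_reviewed)


# Hypothetical inventory; in practice this would be read from the repo or wiki.
docs = [
    DocRecord("docs/runbooks/restart-order-service.md", date(2025, 1, 10)),
    DocRecord("docs/standards/coding-standards.md", date(2025, 6, 1)),
    DocRecord("docs/runbooks/failover.md", date(2024, 11, 3)),
]

for doc in find_stale_docs(docs, today=date(2025, 6, 30)):
    print(f"STALE: {doc.path} (last reviewed {doc.last_reviewed})")
```

The output is a review list, not a delete list: each flagged page still needs a human decision to update, archive, or remove it.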
Documentation Ownership Matrix
| Document | Owner | Reviewer | Access |
|---|---|---|---|
| Architecture Decision Records | Solution Architect | Domain Architect | All engineers |
| System diagrams (C4) | Solution Architect | Tech Lead | All engineers + business |
| API Specs (OpenAPI) | Service team | API Gateway team | All engineers; published externally where applicable |
| Runbooks | SRE / Service team | Platform Engineer | Operations + On-call engineers |
| Coding Standards | Principal Engineer | ARB | All engineers |
| Technology Radar | Enterprise Architect | ARB | All engineers |
Documenting Architecture & Flows
Architecture diagrams communicate structure and behaviour to different audiences. The key principle is the right diagram for the right audience, not one diagram that tries to show everything.
Diagram Types
| Diagram Type | Shows | Audience | Tool |
|---|---|---|---|
| Context Diagram (C4 L1) | System and its external users/systems | Business, all stakeholders | Structurizr, Mermaid |
| Container Diagram (C4 L2) | Major deployable units within the system | Architects, senior engineers | Structurizr, draw.io |
| Component Diagram (C4 L3) | Internal components of a container | Development team | Structurizr, PlantUML |
| Sequence Diagram | Time-ordered message flow between actors/services | Engineers, QA, architects | Mermaid, PlantUML, Lucidchart |
| Entity-Relationship (ERD) | Data entities and their relationships | Engineers, data architects, DBAs | dbdiagram.io, draw.io |
| Data Flow Diagram (DFD) | How data moves through a system | Architects, security, compliance | draw.io, Lucidchart |
| State Diagram | States of an entity and transition triggers | Engineers, BA, QA | Mermaid, PlantUML |
| Deployment Diagram | Infrastructure topology, cloud resources | Engineers, DevOps, architects | draw.io (diagrams.net), Terraform CDK |
Architecture Decision Records (ADRs)
An ADR is a short document that captures a significant architectural decision — what was decided, why, what alternatives were considered, and what the consequences are. ADRs are one of the highest-value practices in enterprise architecture.
ADR Template (MADR Format)
# ADR-0042: Use Apache Kafka for Domain Event Streaming

Date: 2026-03-15
Status: Accepted
Deciders: [Names]
Tags: integration, messaging, event-driven

## Context and Problem Statement

We need to propagate domain events from the Order service to downstream consumers (Inventory, Fulfilment, Analytics) with guaranteed ordering within a partition and replay capability for new consumers.

## Decision Drivers

* Must support consumer replay (new consumers catching up on history)
* Ordering guarantees within an entity's event stream required
* Expected throughput: 50,000 events/day (growing to 500k in 12 months)
* Team has prior Kafka experience; internal managed Kafka cluster available

## Considered Options

1. Apache Kafka (managed internal cluster)
2. Azure Service Bus (cloud-native)
3. RabbitMQ (existing infrastructure)
4. PostgreSQL LISTEN/NOTIFY (simple, no new infrastructure)

## Decision Outcome

Chosen: Apache Kafka

Positive consequences:

* Retained, durable log lets new consumers replay history
* Partition-ordered delivery meets our ordering requirement
* Team expertise reduces ramp-up time

Negative consequences / risks:

* Operational complexity of Kafka cluster management
* Schema evolution requires Avro/Confluent Schema Registry

## Pros and Cons of Alternatives

Azure Service Bus: No replay; message TTL only. Ruled out.
RabbitMQ: No partition-ordered replay. Ruled out.
PG LISTEN/NOTIFY: Would not scale to 500k events/day. Ruled out.
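A template only helps if ADRs actually follow it. The following is a minimal sketch of a check that could run in a PR pipeline, assuming ADRs are markdown files using the `##` section headings from the template above. The `REQUIRED_SECTIONS` list is an assumption for illustration; MADR itself treats some sections as optional, so adjust it to your own policy.

```python
import re

# Assumed policy: the sections this guide's template treats as mandatory.
REQUIRED_SECTIONS = (
    "Context and Problem Statement",
    "Decision Drivers",
    "Considered Options",
    "Decision Outcome",
)


def missing_sections(adr_text):
    """Return the required section titles that have no '## ' heading in the ADR."""
    headings = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+)$", adr_text, re.MULTILINE)
    }
    return [title for title in REQUIRED_SECTIONS if title not in headings]


draft = """# ADR-0043: Example draft
## Context and Problem Statement
We need to pick a schema format.
## Considered Options
1. Avro
2. JSON Schema
"""

print(missing_sections(draft))
```

For the draft above, the check reports the two headings that are absent: Decision Drivers and Decision Outcome.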
ADR Best Practices
- One decision per ADR. Do not bundle multiple decisions into one document.
- ADRs are immutable once accepted. If the decision changes, write a new ADR that supersedes the old one. The old ADR remains as historical context.
- Number sequentially: ADR-0001, ADR-0002, and so on. A stable number makes each decision easy to reference and avoids confusion between ADRs with similar titles.
- Link from the codebase. Add ADR references in code comments or README files in the directories affected by the decision.
- Store in the repository: /docs/decisions/ alongside the code it governs.
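Sequential numbering is easy to get wrong by hand when several people add ADRs. As a rough sketch, assuming ADR files are named with a four-digit prefix and a slug (a hypothetical convention such as `0042-use-kafka-for-domain-events.md`), the next number and filename can be derived from the existing files:

```python
import re

# Assumed filename convention: four-digit number, hyphen, lowercase slug, ".md".
ADR_FILE = re.compile(r"^(\d{4})-[a-z0-9-]+\.md$")


def next_adr_number(filenames):
    """Return the next free ADR number given the files already in the folder."""
    numbers = [int(m.group(1)) for name in filenames if (m := ADR_FILE.match(name))]
    return max(numbers, default=0) + 1


def adr_filename(number, title):
    """Build a filename such as 0043-some-title.md from a number and a title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{number:04d}-{slug}.md"


existing = ["0001-record-decisions.md", "0042-use-kafka-for-domain-events.md", "README.md"]
number = next_adr_number(existing)
print(adr_filename(number, "Use Avro for Event Schemas"))
```

In a real repository the `existing` list would come from listing the decisions directory. Two branches can still claim the same number at the same time, so the PR review remains the final check.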
C4 Model
The C4 model (by Simon Brown) provides four levels of architecture diagram with increasing detail. Each level answers a different question and serves a different audience.
| Level | Name | Question Answered | Audience |
|---|---|---|---|
| L1 | System Context | What is the system and who uses it? | Everyone (non-technical) |
| L2 | Container | What are the major deployable units and how do they communicate? | Architects, senior engineers |
| L3 | Component | What are the internal components of a given container? | Development team |
| L4 | Code | How is a component implemented? | Developers (auto-generated from IDE) |
C4 L1 — System Context (Mermaid)
```mermaid
C4Context
title System Context — Order Management Platform
Person(customer, "Customer", "Places and tracks orders via web/mobile")
Person(agent, "Support Agent", "Handles escalations and refunds")
System(oms, "Order Management System", "Core order lifecycle — cart, checkout, fulfilment tracking")
System_Ext(payment, "Payment Gateway", "Stripe / PayPal — payment processing")
System_Ext(wms, "Warehouse Management System", "3PL partner — picking, packing, dispatch")
System_Ext(notif, "Notification Service", "Email / SMS via SendGrid / Twilio")
Rel(customer, oms, "Places orders", "HTTPS")
Rel(agent, oms, "Manages escalations", "HTTPS")
Rel(oms, payment, "Authorise & capture payment", "HTTPS/REST")
Rel(oms, wms, "Dispatch fulfilment instruction", "Event/Kafka")
Rel(oms, notif, "Trigger notifications", "Event/Kafka")
```
C4 L2 — Container Diagram (Mermaid)
```mermaid
C4Container
title Container Diagram — Order Management System
Person(customer, "Customer", "Web / Mobile")
Container_Boundary(oms, "Order Management System") {
Container(spa, "Web SPA", "React", "Customer-facing storefront")
Container(bff, "BFF API", "Node.js / Express", "Backend for frontend — aggregates data for web client")
Container(order_svc, "Order Service", ".NET 8", "Order lifecycle domain logic")
Container(inventory_svc, "Inventory Service", "Python / FastAPI", "Stock levels and reservations")
ContainerDb(order_db, "Order DB", "PostgreSQL", "Order, line-item, and payment state")
ContainerDb(inventory_db, "Inventory DB", "PostgreSQL", "Stock levels per SKU and warehouse")
Container(event_bus, "Event Bus", "Apache Kafka", "Domain event streaming")
}
System_Ext(payment, "Payment Gateway", "Stripe")
System_Ext(wms, "WMS", "3PL Warehouse")
Rel(customer, spa, "Uses", "HTTPS")
Rel(spa, bff, "API calls", "JSON/REST")
Rel(bff, order_svc, "Delegates", "gRPC")
Rel(bff, inventory_svc, "Queries stock", "REST")
Rel(order_svc, order_db, "Reads/writes", "SQL")
Rel(inventory_svc, inventory_db, "Reads/writes", "SQL")
Rel(order_svc, event_bus, "Publishes domain events", "Kafka")
Rel(inventory_svc, event_bus, "Subscribes to order events", "Kafka")
Rel(order_svc, payment, "Authorise payment", "HTTPS")
Rel(event_bus, wms, "Dispatch fulfilment", "Kafka")
```
Flow & Sequence Diagrams
Flow and sequence diagrams document behaviour over time. They are indispensable for communicating integration contracts, validating error-handling logic, and onboarding engineers.
Sequence Diagram — Order Checkout Flow
```mermaid
sequenceDiagram
autonumber
actor C as Customer
participant BFF as BFF API
participant OS as Order Service
participant IS as Inventory Service
participant PG as Payment Gateway
participant KB as Kafka Bus
participant NS as Notification Service
C->>BFF: POST /checkout { cart, payment_token }
BFF->>IS: Reserve stock { items }
IS-->>BFF: 200 OK { reservation_id }
BFF->>OS: Create order { cart, reservation_id, payment_token }
OS->>PG: Authorise payment { amount, token }
PG-->>OS: 200 OK { auth_code }
OS->>OS: Persist order (PENDING)
OS->>KB: Publish OrderCreated event
OS-->>BFF: 201 Created { order_id }
BFF-->>C: 201 Created { order_id, estimated_delivery }
note over KB: Downstream consumers process asynchronously
KB--)IS: OrderCreated → confirm stock deduction
KB--)NS: OrderCreated → send confirmation email
```
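To show how the synchronous part of this flow might translate into code, here is a simplified in-memory sketch. Every class and method name is hypothetical, the Kafka bus is replaced by a plain list, and amounts are integer pence. The release of the stock reservation on a declined payment is an assumption added for illustration; the diagram itself only shows the happy path.

```python
import uuid


class PaymentDeclined(Exception):
    """Raised by the stub gateway when a payment is declined."""


class InventoryService:
    def __init__(self):
        self.reservations = {}

    def reserve(self, items):
        reservation_id = str(uuid.uuid4())
        self.reservations[reservation_id] = items
        return reservation_id

    def release(self, reservation_id):
        self.reservations.pop(reservation_id, None)


class PaymentGateway:
    def authorise(self, amount_pence, token):
        # Stub rule for the example: the token "declined" always fails.
        if token == "declined":
            raise PaymentDeclined(token)
        return "AUTH-" + uuid.uuid4().hex[:8]


class OrderService:
    def __init__(self, gateway, published_events):
        self.gateway = gateway
        self.orders = {}
        self.published_events = published_events  # stands in for the Kafka bus

    def create_order(self, cart, reservation_id, payment_token):
        amount = sum(item["price_pence"] * item["qty"] for item in cart)
        auth_code = self.gateway.authorise(amount, payment_token)  # steps 5-6
        order_id = str(uuid.uuid4())
        self.orders[order_id] = {  # step 7: persist as PENDING
            "status": "PENDING",
            "reservation_id": reservation_id,
            "auth_code": auth_code,
        }
        self.published_events.append(("OrderCreated", order_id))  # step 8
        return order_id


def checkout(cart, payment_token, inventory, order_service):
    """BFF-level orchestration: reserve stock, then create the order."""
    reservation_id = inventory.reserve(cart)  # steps 2-3
    try:
        return order_service.create_order(cart, reservation_id, payment_token)
    except PaymentDeclined:
        inventory.release(reservation_id)  # assumed compensating action
        raise
```

Writing the flow out this way tends to surface the questions a sequence diagram should answer, such as what happens to the reservation when a later step fails.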
State Diagram — Order Lifecycle
```mermaid
stateDiagram-v2
[*] --> PENDING : Order created
PENDING --> PAYMENT_CAPTURED : Payment authorised
PENDING --> CANCELLED : Payment declined / timeout
PAYMENT_CAPTURED --> IN_FULFILMENT : Dispatched to WMS
IN_FULFILMENT --> SHIPPED : Carrier collected
SHIPPED --> DELIVERED : Delivery confirmed
SHIPPED --> RETURN_REQUESTED : Customer requests return
RETURN_REQUESTED --> REFUNDED : Return received + inspected
DELIVERED --> RETURN_REQUESTED : Within return window
CANCELLED --> [*]
DELIVERED --> [*]
REFUNDED --> [*]
```
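A state diagram like this maps naturally onto a transition table that code can enforce. The sketch below mirrors the transitions drawn above; the `transition` function and `InvalidTransition` exception are hypothetical names, and a production system would usually also record who or what triggered each change.

```python
# Allowed transitions, copied from the state diagram above.
TRANSITIONS = {
    "PENDING": {"PAYMENT_CAPTURED", "CANCELLED"},
    "PAYMENT_CAPTURED": {"IN_FULFILMENT"},
    "IN_FULFILMENT": {"SHIPPED"},
    "SHIPPED": {"DELIVERED", "RETURN_REQUESTED"},
    "DELIVERED": {"RETURN_REQUESTED"},
    "RETURN_REQUESTED": {"REFUNDED"},
    "CANCELLED": set(),
    "REFUNDED": set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the diagram."""


def transition(current, target):
    """Return the new state, or raise if the diagram does not allow the move."""
    if target not in TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


state = "PENDING"
for step in ("PAYMENT_CAPTURED", "IN_FULFILMENT", "SHIPPED", "DELIVERED"):
    state = transition(state, step)
print(state)  # DELIVERED
```

Keeping the table next to the diagram in the repository makes it easier to notice in review when one changes without the other.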
Enterprise Architecture Examples
The following examples illustrate how EA principles apply to common real-world scenarios.
Example 1 — E-Commerce Platform Architecture
A mid-size retailer with 2M customers scaling from a monolith to a service-oriented architecture. 50 engineers across 8 product teams.
| Domain | Approach | Key Decisions |
|---|---|---|
| Application | Modular monolith → bounded service split | Start monolith; extract services when team owns a clear domain and the monolith boundary is proven |
| Data | PostgreSQL per service; Kafka CDC to analytics lake | No shared DB. Each service owns its data. Analytics via Kafka CDC → Snowflake |
| Integration | REST APIs internally; Kafka for domain events | Synchronous for user-facing reads; async events for state propagation |
| Security | OAuth2 + OIDC via Keycloak; zero-trust internal network | mTLS between services; RBAC in API gateway |
| Platform | Kubernetes on AWS EKS; Terraform IaC | Golden path templates for service scaffolding; observability via OpenTelemetry → Grafana stack |
| Governance | ARB bi-weekly; ADRs in each service repo | Technology radar updated quarterly; new databases require ARB approval |
Example 2 — Financial Services (Regulated)
A regulated payments company operating under PCI-DSS and FCA oversight. 200 engineers. On-premises data centre + hybrid cloud.
| Domain | Approach | Key Decisions |
|---|---|---|
| Security | Zero Trust, FIDO2/MFA enforced, HSM for key management | No lateral movement; network microsegmentation; all secrets in Vault |
| Data | Strict data classification; PCI scope boundary drawn around card data systems | Card data only in PCI-scoped systems; tokenisation for all external exposure |
| Compliance | Architecture review mandatory for PCI-scoped changes | Evidence artefacts (ADRs, threat models, scan results) linked from each change ticket |
| Platform | Hybrid: on-prem for card data; Azure for non-PCI workloads | Private ExpressRoute; no card data ever crosses to public cloud |
| Integration | Internal gRPC; external REST + mutual TLS | API Gateway at boundary with certificate pinning and DLP inspection |
Example 3 — SaaS Startup Scaling
A B2B SaaS company with 15 engineers growing rapidly. Currently a Rails monolith with a single PostgreSQL database. Starting to feel pain at 500k MAU.
| Pain Point | EA Response | Outcome |
|---|---|---|
| Database slowdown on reports | Add read replica; move analytics queries to replica | 3-day effort; solved for 18 months of growth |
| Background job queue coupling | Extract async job processor as separate process (Sidekiq → separate container) | Isolated failure domain, independent scaling |
| Real-time notifications blocking web requests | Introduce WebSocket service + Redis pub/sub | Decoupled; notification latency <200ms |
| Third-party integrations causing timeouts | Wrap in async queue with idempotent retry | Resilient; timeouts no longer affect user experience |
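The last row, idempotent retry, is worth a concrete illustration. The sketch below is one minimal way to do it, assuming each outbound call carries an idempotency key and that the third party honours that key. The class, the `TransientError` exception, and the key format are hypothetical. For brevity the retry runs inline; in the setup described above it would run in a background worker fed by the queue.

```python
import time


class TransientError(Exception):
    """Stands in for a timeout or 5xx response from the third party."""


class IdempotentCaller:
    def __init__(self, send, max_attempts=3, base_delay=0.0):
        self.send = send                # callable(key, payload) -> result
        self.max_attempts = max_attempts
        self.base_delay = base_delay    # seconds; doubled after each failure
        self.completed = {}             # idempotency key -> stored result

    def call(self, key, payload):
        """Send once per key; retry transient failures with exponential backoff."""
        if key in self.completed:
            return self.completed[key]  # already done: do not send again
        for attempt in range(1, self.max_attempts + 1):
            try:
                result = self.send(key, payload)
            except TransientError:
                if attempt == self.max_attempts:
                    raise
                time.sleep(self.base_delay * 2 ** (attempt - 1))
            else:
                self.completed[key] = result
                return result


calls = []


def flaky_send(key, payload):
    """Fake third party: fails twice with a timeout, then accepts."""
    calls.append(key)
    if len(calls) < 3:
        raise TransientError("timeout")
    return {"status": "accepted", "key": key}


caller = IdempotentCaller(flaky_send)
first = caller.call("order-123-crm-sync", {"order_id": "123"})
second = caller.call("order-123-crm-sync", {"order_id": "123"})
print(len(calls), first == second)
```

The second call with the same key returns the stored result without contacting the third party again, which is what makes redelivery from the queue safe. A real implementation would keep `completed` in durable storage, not in memory.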
Reference Links
Frameworks & Standards
- The Open Group Architecture Framework (TOGAF): opengroup.org/togaf
- Zachman Framework: zachman.com
- C4 Model for Software Architecture: c4model.com
- Architecture Decision Records: adr.github.io
- Markdown Architecture Decision Records (MADR): adr.github.io/madr
Books
- Fundamentals of Software Architecture, Mark Richards & Neal Ford (O'Reilly, 2020). Best starting point for architects.
- Software Architecture: The Hard Parts, Richards, Ford, Sadalage, Dehghani (O'Reilly, 2021). Trade-off analysis in distributed systems.
- Building Evolutionary Architectures, Ford, Parsons, Kua (O'Reilly, 2017). Architecture fitness functions and evolution.
- Clean Architecture, Robert C. Martin (Prentice Hall, 2017). Dependency rules and architectural boundaries.
- Domain-Driven Design, Eric Evans (Addison-Wesley, 2003). Foundational for bounded contexts and ubiquitous language.
- Designing Data-Intensive Applications, Martin Kleppmann (O'Reilly, 2017). Essential for data architecture trade-offs.
- The Software Architect Elevator, Gregor Hohpe (O'Reilly, 2020). The EA's role bridging business and engineering.
Patterns & Design
- microservices.io: Chris Richardson's pattern catalogue
- Enterprise Integration Patterns, Hohpe & Woolf
- Martin Fowler: DDD articles
- Martin Fowler: CQRS
- Martin Fowler: Event Sourcing
Security
- OWASP Top 10: owasp.org
- OWASP Threat Modelling Process
- NIST SP 800-207: Zero Trust Architecture
Tools
- Mermaid: mermaid.js.org (diagrams as code)
- Structurizr: C4 model tooling
- Thoughtworks Technology Radar: thoughtworks.com/radar
- Terraform: developer.hashicorp.com/terraform (infrastructure as code)
- OpenAPI Specification: swagger.io/specification
- pre-commit: pre-commit.com (Git hooks)