DevOps Architecture Handbook — 20 Microservices

01

System Landscape — The 20 Services

At 20 services, you are in the "medium microservices" zone: large enough that manual deploys are impractical, small enough that you do not need a platform engineering team of 10. The right architecture needs to be automatable, self-service, and auditable. Below is a representative domain decomposition.

🌐 api-gateway

Ingress, routing, rate limiting

ingress tier-1

🔐 auth-service

OAuth2 / JWT / OIDC

security tier-1

👤 user-service

Profile, preferences, RBAC

domain

📦 product-service

Catalogue, inventory

domain

🛒 order-service

Order lifecycle, saga

domain stateful

💳 payment-service

PCI scope, external PSP

pci isolated

📧 notification-service

Email, SMS, push

async

📊 analytics-service

Event aggregation, reports

data

🔍 search-service

Elasticsearch interface

query

📁 media-service

Upload, transcode, CDN

storage

📬 queue-worker

Kafka consumer workers

async

🕒 scheduler-service

Cron jobs, delayed tasks

infra

🏥 health-service

Readiness, liveness probes

platform

📈 metrics-exporter

Prometheus custom metrics

platform

🔧 config-service

Feature flags, dynamic config

platform

🗂 audit-service

Compliance event log

compliance

🌍 geo-service

Localization, timezone

domain

🤝 partner-service

B2B integrations, webhooks

domain

🧪 test-harness

Contract tests, E2E, mocks

testing

🗺 gitops-config

Helm values, ArgoCD apps

gitops ★ special

Repository count = 20 source repos + 1 GitOps repo = 21 total

The GitOps config repo is the 21st and is intentionally separate. This is a critical architectural boundary explained in detail in §05.

02

The Full Pipeline — Every Stage Justified

Here is the complete flow a code change travels, from a developer's workstation to production. Each stage is a gate. Nothing proceeds without passing the previous one.

💻 Local Dev IDE + Docker

→

🪝 pre-commit lint + format

→

🏗 CI Build compile + test

→

🔬 CI Scan SAST + SCA

→

📦 Image Push tag + sign

→

📝 GitOps PR bump digest

→

🔄 ArgoCD Sync reconcile

→

☸ Kubernetes rolling deploy

→

🚨 Alerts PagerDuty

Stage-by-stage rationale

1

Local Dev → pre-commit

Pre-commit hooks catch the cheap problems (trailing whitespace, secrets in code, linting errors) before they ever hit CI. At 20 services, CI feedback is slower and consumes runner minutes; pre-commit runs locally in seconds at no cost. Use the pre-commit framework (Python-based, language-agnostic). Hooks to include: detect-secrets and gitleaks (credential scanning), hadolint (Dockerfile linting), and language-specific formatters (black, gofmt, prettier). Every engineer installs the hooks after cloning, via a Makefile target that is covered in the onboarding docs. Because local hooks can be skipped, CI re-runs the same hooks as a safety net.
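A minimal sketch of what such a hook set could look like in a service repo. The rev values are illustrative pins rather than recommendations, and the baseline file name is an assumption; adjust both to your own setup.

```yaml
# .pre-commit-config.yaml (sketch; rev pins are illustrative)
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        # Assumes a baseline created beforehand with `detect-secrets scan > .secrets.baseline`
        args: ["--baseline", ".secrets.baseline"]
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks
  - repo: https://github.com/hadolint/hadolint
    rev: v2.12.0
    hooks:
      - id: hadolint-docker   # runs hadolint from its Docker image, so no local install is needed
  - repo: https://github.com/psf/black
    rev: 24.2.0
    hooks:
      - id: black             # Python services only; swap for gofmt or prettier hooks elsewhere
```

Running `pre-commit install` once per clone wires these hooks into Git; `pre-commit run --all-files` executes the same set in CI.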

2

CI — Build + Test

GitHub Actions is the natural choice because your 20 repos already live in GitHub. No context-switching, no extra auth, and GitHub-hosted runners are cheap for the volume a 20-service system generates. Each service repo contains its own .github/workflows/ci.yml. Build steps: compile (or lint for interpreted), unit tests, integration tests against ephemeral containers (use services: in the workflow to spin up a Postgres or Redis for the duration of the job). Fail fast — tests before images, never the reverse.
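As a sketch of the ephemeral-container approach, the job below starts a Postgres service container for the lifetime of the test job. The `make test` target, the database password, and the `DATABASE_URL` variable name are hypothetical placeholders.

```yaml
# Fragment of .github/workflows/ci.yml (sketch)
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres   # throwaway credential, exists only for this job
        ports:
          - 5432:5432
        # Wait until Postgres accepts connections before the steps start
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - name: Run unit and integration tests
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres   # hypothetical variable name
        run: make test   # hypothetical Makefile target
```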

3

CI — Security Scanning

Two mandatory scan types in CI: (a) SAST — static code analysis using language-specific tools (Semgrep for Python/JS/Go, SpotBugs for Java, or GitHub's CodeQL which is built-in). (b) SCA — software composition analysis to catch vulnerable dependencies using Trivy (open source, fast, multi-ecosystem). Trivy also scans the built Docker image for OS-level CVEs. Thresholds: fail CI on CRITICAL, warn on HIGH, report but do not block on MEDIUM. These thresholds are encoded in the CI YAML — not in developer heads.
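One way to encode these thresholds in the workflow itself is to run the Trivy action twice: once as a blocking gate and once as a report. This is a sketch; the action version pin and the local image name are assumptions.

```yaml
# Fragment of .github/workflows/ci.yml (sketch)
jobs:
  scan-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the image locally for scanning
        run: docker build -t api-gateway:${{ github.sha }} .
      - name: Fail the job on CRITICAL vulnerabilities
        uses: aquasecurity/trivy-action@0.28.0   # illustrative version pin
        with:
          image-ref: api-gateway:${{ github.sha }}
          severity: CRITICAL
          exit-code: "1"    # non-zero exit blocks the pipeline
      - name: Report HIGH and MEDIUM without blocking
        uses: aquasecurity/trivy-action@0.28.0
        with:
          image-ref: api-gateway:${{ github.sha }}
          severity: HIGH,MEDIUM
          exit-code: "0"    # findings are printed but the job continues
```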

4

Image Push — Tag Strategy + Signing

Build the container image and push to an OCI-compliant registry (GitHub Container Registry, ghcr.io, is the default because Actions can authenticate to it with the built-in GITHUB_TOKEN). Tag with the Git commit SHA (abbreviated here as sha-a1b2c3d); never use latest in production. Also tag with the semantic version if release-triggered (v1.4.2). Sign the image using Cosign (Sigstore), which creates a tamper-evident provenance record. Signature verification then happens at deploy time in the cluster, typically through an admission controller that rejects unsigned images; ArgoCD does not check image signatures on its own.
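The push-and-sign stage might look roughly like the job below. It is a sketch under assumptions: the organisation and image names are placeholders, and the job tags with the full commit SHA.

```yaml
# Fragment of .github/workflows/ci.yml (sketch)
jobs:
  push-image:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # lets GITHUB_TOKEN push to ghcr.io
      id-token: write   # lets Cosign request an OIDC token for keyless signing
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ghcr.io/your-org/api-gateway:sha-${{ github.sha }}
      - uses: sigstore/cosign-installer@v3
      - name: Sign the image by digest (keyless)
        run: cosign sign --yes "ghcr.io/your-org/api-gateway@${{ steps.build.outputs.digest }}"
```

Signing the digest rather than the tag means the signature stays bound to the exact image content even if the tag is later moved.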

5

GitOps Update — Automated PR to Config Repo

After a successful image push, the CI pipeline opens a Pull Request in the gitops-config repo that bumps the image reference (tag or digest) in the relevant per-environment values file. This is the critical handoff between the CI world (code) and the CD world (desired state). The PR can be auto-merged for non-production environments (dev, staging) or require human approval for production. Never write directly to the GitOps repo's main branch from CI; always go through a PR for auditability.
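A sketch of that handoff using the `gh` and `yq` command-line tools, assuming both are available on the runner. The secret name `GITOPS_BOT_TOKEN`, the bot identity, and the `image.tag` key are assumptions; the token must have write access to gitops-config.

```yaml
# Fragment of .github/workflows/ci.yml (sketch)
jobs:
  update-gitops:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Check out the GitOps repo
        uses: actions/checkout@v4
        with:
          repository: your-org/gitops-config
          token: ${{ secrets.GITOPS_BOT_TOKEN }}   # hypothetical secret name
      - name: Bump the dev image tag and open a PR
        env:
          GH_TOKEN: ${{ secrets.GITOPS_BOT_TOKEN }}
          IMAGE_TAG: sha-${{ github.sha }}
        run: |
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          git checkout -b "bump/api-gateway-${IMAGE_TAG}"
          # Rewrite only the image tag; everything else in the values file is untouched
          yq -i '.image.tag = strenv(IMAGE_TAG)' environments/dev/api-gateway.yaml
          git commit -am "chore(dev): bump api-gateway to ${IMAGE_TAG}"
          git push origin HEAD
          gh pr create --fill --base main
```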

6

ArgoCD Sync — Reconciliation Loop

ArgoCD watches the gitops-config repo. When the image reference changes, it detects drift between desired state (Git) and actual state (Kubernetes) and marks the application OutOfSync. It reconciles by rendering the Helm chart with the updated values and applying the result. In production, the automated sync policy is left out entirely, so every sync is triggered by a human and selfHeal does not apply. In dev/staging, automated sync with selfHeal: true gives fully hands-free deploys. ArgoCD polls Git every 3 minutes by default.
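For illustration, a dev Application with automated sync could be declared as below. The repo URL, project name, and file paths are assumptions chosen to match the layout used in this handbook.

```yaml
# apps/dev/api-gateway.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-config.git
    targetRevision: main
    path: charts/api-gateway
    helm:
      valueFiles:
        # Path is relative to the chart directory and must stay inside the repo
        - ../../environments/dev/api-gateway.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: dev
  syncPolicy:
    automated:          # omit this whole block in prod to require a manual sync
      prune: true
      selfHeal: true
```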

7

Kubernetes — Rolling Deploy

Kubernetes applies the new manifest with a rolling update strategy, replacing pods incrementally (bounded by maxSurge and maxUnavailable) and sending traffic to new pods only once their readiness probes pass. This means zero downtime for well-configured services. Health checks must be correctly defined: livenessProbe (is the pod stuck?) and readinessProbe (is the pod ready to receive traffic?). For stateful services (order-service) consider canary deployments using Argo Rollouts.
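A minimal Deployment sketch showing the rollout bounds and both probes. The port and the probe paths (/ready, /healthz) are hypothetical; use whatever endpoints the service actually exposes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # start at most one extra pod during the rollout
      maxUnavailable: 0   # never drop below the desired replica count
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
        - name: api-gateway
          image: ghcr.io/your-org/api-gateway:sha-a1b2c3d
          ports:
            - containerPort: 8080
          readinessProbe:          # gates traffic: pod joins the Service only when this passes
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
          livenessProbe:           # restarts the container if it stops responding
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
```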

8

Observability → Alerts

Prometheus scrapes metrics from all pods. Grafana dashboards visualise the golden signals: latency, traffic, errors, saturation. Alertmanager routes to PagerDuty (on-call) for critical alerts and to Slack for warning-level. Loki ingests structured logs. Jaeger or Tempo handles distributed traces. Every service must emit the standard telemetry set — this is enforced via a shared Helm library chart (sidecar injector or SDK requirement).

03

Source Control — 20 Repos, Not a Monorepo

Polyrepo vs Monorepo decision

With 20 services across independent teams or domain boundaries, a polyrepo (one repo per service) is the right call. Here is why:

polyrepo — chosen ✓

Independent release cadence

payment-service and notification-service do not need to be versioned together. In a monorepo, every change raises the question of which services to build, test, and release, and the answer lives in shared tooling that all teams depend on. Independent repos mean independent pipelines, independent secrets, and independent access control.

✓ Independent deploy cycles per team

polyrepo — chosen ✓

Blast radius control

A misconfigured CI pipeline in user-service does not disable order-service's deploys. In a monorepo with shared CI config, a syntax error can block all 20 services simultaneously.

✓ Failure isolation per service

trade-off — managed

Shared library problem

Cross-cutting concerns (logging, auth middleware, tracing) should not be copy-pasted between repos. Solve this with an internal package registry (GitHub Packages) and a shared-libs repo with its own versioning. Services consume specific versions, so there is no implicit coupling.

⚠ Requires internal package discipline

GitHub Organisation structure

      github.com/your-org/

      ├── api-gateway          # service repo

      ├── auth-service         # service repo

      ├── user-service         # service repo

      ├── ...                  # 17 more service repos

      ├── gitops-config        # ★ THE gitops repo — no application code here

      ├── shared-libs          # internal SDK, middleware, observability

      └── infra-modules        # Terraform modules for cloud infra (optional)

GitHub repo settings: enforce for all 20 repos at the organisation level

Branch protection on main: require PR, require status checks, dismiss stale reviews, require linear history. Apply these at the organisation level (for example with organisation rulesets), not manually per repo; at 20 repos, human enforcement is not reliable.

04

Branch Strategy — Trunk-Based Development

At 20 services, long-lived feature branches are poison. They create merge conflicts, delay integration feedback, and make the concept of "done" fuzzy. The industry-standard answer at this scale is Trunk-Based Development (TBD).

main (protected): Always deployable. Direct commits blocked. Every push triggers CI + possible deploy.

feature/JIRA-123-add-oauth: Short-lived (max 2 days). Squash-merged to main. Never long-running.

fix/payment-timeout: Hotfix branch. Merged to main, then cherry-picked to release if needed.

release/v1.4.x (protected): Cut only for a release. Cherry-pick hotfixes here. Never merge back "up" to main manually; use the cherry-pick path.

chore/update-dependencies: Renovate Bot or Dependabot automated PR branches. Auto-merged on passing CI.

Why not GitFlow?

GitFlow (develop, release, hotfix branches) was designed for software shipped on scheduled monthly or quarterly release cycles. At 20 microservices deploying multiple times per day, GitFlow introduces merge complexity with no benefit. The develop branch becomes a buffer that delays integration. Trunk-Based Development with feature flags solves the "unfinished feature" problem without a long-lived branch.

Feature flags over long branches

For large, in-progress features: merge the code to main behind a feature flag (managed by config-service). The code ships but the feature is off. When ready, toggle the flag. This eliminates the "branch diverged for 3 weeks" problem entirely.

Commit convention: enforced via pre-commit

Use Conventional Commits (feat:, fix:, chore:, ci:). This enables automated changelogs and semantic versioning triggers in CI. At 20 repos, consistent convention is critical for understanding cross-service changes.

05

Should There Be a GitOps Repo? Yes — Mandatory

This is one of the most debated architectural decisions in microservice DevOps. The answer is unambiguous at 20 services: yes, you need a dedicated, separate GitOps repository.

Why separate from the service repos?

reason 1

Separation of concerns

A developer making a code change to order-service should not need access to the deployment manifests for payment-service. Keeping config separate enforces a clear boundary: application code in service repos, desired infrastructure state in the GitOps repo.

reason 2

Single source of truth for all environments

The gitops-config repo holds what is running in dev, staging, and prod — all in one place. Anyone can look at this repo and know exactly what version of every service is deployed where. No manual tracking, no wikis, no "check the CI logs."

reason 3

Audit trail and compliance

Every change to production is a Git commit with author, timestamp, and message. This is your audit log. For financial services or regulated industries, this is not optional. A merge history in the GitOps repo is your deployment log.

reason 4

Access control

Only senior engineers, plus the CI bot that opens PRs, need write access to the production paths in the GitOps repo; ArgoCD itself only needs read access. Service developers get read access. This is impossible to enforce cleanly if configs live alongside application code in 20 separate repos.

GitOps repo structure

      gitops-config/

      ├── apps/                        # ArgoCD Application manifests

      │   ├── dev/                    # one file per service

      │   │   ├── api-gateway.yaml

      │   │   ├── auth-service.yaml

      │   │   └── ... (20 files)

      │   ├── staging/

      │   └── prod/

      ├── charts/                      # Helm chart per service (or shared library chart)

      │   ├── api-gateway/

      │   │   ├── Chart.yaml

      │   │   ├── templates/

      │   │   └── values.yaml

      │   └── ... (20 charts)

      ├── environments/               # Per-env overrides

      │   ├── dev/

      │   │   └── api-gateway.yaml    # image.tag: sha-a1b2c3d

      │   ├── staging/

      │   └── prod/

      │       └── api-gateway.yaml    # image.tag: v1.4.2

      └── bootstrap/               # ArgoCD App-of-Apps root

The "App of Apps" pattern

With 20 services, create a single ArgoCD Application in bootstrap/ that points to the apps/ directory. ArgoCD discovers and manages all 20 child applications from this one parent. Adding a new service is a two-step operation: add a Helm chart, then add an Application manifest for each environment. No manual ArgoCD configuration.
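A sketch of the parent Application, assuming the directory layout shown above. The application name and repo URL are placeholders; `directory.recurse` is what lets one parent pick up manifests from the dev, staging, and prod subdirectories.

```yaml
# bootstrap/root.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-config.git
    targetRevision: main
    path: apps
    directory:
      recurse: true     # include apps/dev, apps/staging, apps/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd   # child Application objects are created in the argocd namespace
  syncPolicy:
    automated:
      selfHeal: true
```

The parent only creates the child Application objects; each child keeps its own sync policy, so prod children can still require a manual sync.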

06

CI — Build, Test, Scan

Why GitHub Actions (not Jenkins, not GitLab CI)?

GitHub Actions — chosen ✓

Native GitHub integration

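Your 20 repos are on GitHub. Within a repo, GitHub Actions needs no authentication setup to check out code and post status checks, because every run receives a short-lived GITHUB_TOKEN. That token is scoped to the repo the workflow runs in, so opening PRs in another repo (such as gitops-config) still needs a GitHub App installation token or a fine-grained personal access token. Jenkins requires SSH keys, webhooks, and credential management even for the single-repo operations.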

✓ Zero-friction GitHub integration

GitHub Actions — chosen ✓

Reusable workflows

GitHub Actions supports reusable workflows — define the CI pattern once in a .github/workflows/ shared location and call it from all 20 service repos. This means updating the scan step in one place propagates to all services. Critical for 20-repo governance.

✓ Centralised CI governance via reusable workflows

trade-off

Vendor lock-in

GitHub Actions YAML is GitHub-specific. However, the actual build logic (Docker, shell scripts, Trivy) is portable. If you ever migrate off GitHub, the scripts move; only the orchestration YAML changes. This is an acceptable trade-off for the productivity gain.

CI workflow — per service repo

      # .github/workflows/ci.yml (exists in all 20 service repos)

      Triggers:

        - push to any branch

        - pull_request to main

      Jobs (in order):

      1. lint       → pre-commit hooks run in CI (redundant safety net)

      2. test       → unit + integration tests (with service containers)

      3. sast       → Semgrep / CodeQL (parallel to test)

      4. build-image → docker build (multi-stage, pinned base images)

      5. scan-image  → trivy image scan (fails on CRITICAL CVE)

      6. push-image  → ghcr.io/your-org/service:sha-XXXX (on main only)

      7. sign-image  → cosign sign (keyless, OIDC-based)

      8. update-gitops → open PR in gitops-config repo (on main only)

Critical: steps 6–8 run ONLY on pushes to main, never on feature branches

Feature branches produce a build and test result only. They do not push images or touch the GitOps repo. This prevents staging/prod from being polluted by in-progress feature work.
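The job ordering and the main-only gate can be sketched with `needs` and `if`. Job names mirror the outline above; the `make test` target and the Semgrep invocation are illustrative assumptions.

```yaml
# Skeleton of .github/workflows/ci.yml (sketch)
on:
  push:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test                 # hypothetical Makefile target
  sast:
    runs-on: ubuntu-latest             # no `needs`, so it runs in parallel with test
    steps:
      - uses: actions/checkout@v4
      - run: pipx run semgrep scan --config auto --error
  push-image:
    needs: [test, sast]                # waits for both gates
    # Skipped on feature branches and on pull_request events
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - run: echo "build, scan, push, sign, and open the GitOps PR here"
```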

Reusable workflow pattern

Create a shared-ci-workflows repo (or use the shared-libs repo) that contains the canonical workflow YAML. Each service repo references it with:

      # In each service repo's ci.yml:

      jobs:

        ci:

          uses: your-org/shared-ci-workflows/.github/workflows/service-ci.yml@main

          with:

            service-name: 'api-gateway'

            language: 'python'

This pattern means all 20 services get security scan upgrades, new lint rules, or changed image tag strategies by merging one PR in one repo.
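On the other side of that call, the shared repo needs a workflow triggered by `workflow_call` that declares the same inputs. A minimal sketch follows; the job body is a placeholder for the real lint, test, and scan steps.

```yaml
# your-org/shared-ci-workflows/.github/workflows/service-ci.yml (sketch)
on:
  workflow_call:
    inputs:
      service-name:
        required: true
        type: string
      language:
        required: true
        type: string

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4      # checks out the calling service repo
      - name: Run tests for the calling service
        env:
          SERVICE_NAME: ${{ inputs.service-name }}
          LANGUAGE: ${{ inputs.language }}
        run: |
          echo "Testing ${SERVICE_NAME} (${LANGUAGE})"
          make test                    # hypothetical Makefile target
```

Referencing the workflow at @main gives instant propagation; pinning callers to a tag instead trades that speed for a controlled rollout of CI changes.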

07

Image Push — Registry Choice and Tagging

Registry: GitHub Container Registry (ghcr.io)

Option	Why / Why Not	Verdict
ghcr.io	Native GitHub auth, free for public, integrated with Actions OIDC, same access model as repos, no extra service to manage	✓ Default choice
ECR (AWS)	Excellent if you run on EKS, tighter IAM integration, but requires OIDC federation setup and couples you to AWS	Use if on EKS
Artifact Registry (GCP)	Best choice if running on GKE, Workload Identity makes authentication clean	Use if on GKE
Docker Hub	Rate limits on pulls (100/6h unauthenticated), privacy limitations on free tier, no OIDC auth from Actions without token management	✗ Avoid for prod
Harbor (self-hosted)	Full control, built-in vulnerability scanning, good for air-gapped or compliance-heavy environments. Adds operational overhead.	Regulated only

Image tagging convention

      # Never use :latest in staging or production manifests

      Development:   ghcr.io/org/api-gateway:sha-a1b2c3d

      Staging:       ghcr.io/org/api-gateway:sha-a1b2c3d   (same SHA, different values.yaml)

      Production:    ghcr.io/org/api-gateway:v1.4.2         (semantic version on release tag)

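      # Why a Git-SHA tag rather than semver for dev/staging?

      # A SHA tag maps to exactly one commit and is never reused, but any registry tag can technically be overwritten.

      # For strict immutability, pin by image digest (@sha256:...). Immutable references = reproducible deploys = reliable rollbacks.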

Image signing with Cosign

Every pushed image is signed using Cosign in keyless mode (using the GitHub Actions OIDC token, so there is no private key to manage). The signature is stored in the same registry alongside the image. Enforcement happens in the cluster: an admission webhook that understands Cosign signatures can reject unsigned images when ArgoCD's sync tries to create the pods. This closes the supply chain loop: you can prove that an image running in prod was built by your CI, not pushed manually by a human with registry write access.

08

Why Helm — Not Raw Manifests, Not Kustomize

This question comes up constantly. Here is the honest trade-off analysis for 20 services across multiple environments.

helm — chosen ✓

Templating is mandatory at 20 services

You have 20 services × 3 environments = 60 combinations. Without templating, you have 60 sets of YAML files, and a change to the liveness probe timeout means editing all 60. With Helm, the same change is one edit to a chart default (ideally in a shared library chart), and the per-environment values files carry only the settings that genuinely differ.

✓ DRY principle enforced via templating

helm — chosen ✓

Helm Library Charts

Create a common library Helm chart that encodes your organisational defaults: resource limits, security context, probe paths, pod disruption budgets, network policies. Each service chart depends on this library. When standards change, one PR to the library chart propagates through all 20 services on next deploy.

✓ Single place for K8s standards enforcement
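A sketch of how a service chart could declare its dependency on the library chart. The library chart name `common`, its location, and the version range are assumptions.

```yaml
# charts/api-gateway/Chart.yaml (sketch)
apiVersion: v2
name: api-gateway
description: Helm chart for api-gateway
type: application
version: 0.1.0
appVersion: "1.4.2"
dependencies:
  - name: common                  # hypothetical library chart (its own Chart.yaml sets type: library)
    version: 1.x.x                # accept any 1.x release of the library
    repository: file://../common  # assumes the library lives next to the service charts
```

With a version range like this, a new 1.x release of the library reaches each service the next time its chart dependencies are updated and deployed.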

helm — chosen ✓

ArgoCD native support

ArgoCD understands Helm natively. Its repo server renders the chart (the equivalent of helm template) and applies the output, so it tracks drift correctly and shows the diff between current and desired state in its UI. The integration is first-class and requires no shims.

✓ Zero-config ArgoCD integration

kustomize — not chosen

Why not Kustomize?

Kustomize is excellent for patching existing manifests you do not control. For your own services, Kustomize's overlay model becomes hard to reason about at scale — you end up with bases, overlays, and patches that interact in non-obvious ways. Helm's values hierarchy is more explicit for environment differentiation across 20 services.

⚠ Use Kustomize only to patch third-party charts you cannot modify

Helm chart structure per service

      charts/api-gateway/

      ├── Chart.yaml          # version, dependencies (→ common library chart)

      ├── values.yaml         # sane defaults for all envs

      └── templates/

          ├── deployment.yaml     # uses library chart's deployment helper

          ├── service.yaml

          ├── ingress.yaml

          ├── hpa.yaml            # Horizontal Pod Autoscaler

          ├── pdb.yaml            # Pod Disruption Budget (always include)

          └── networkpolicy.yaml  # default-deny, explicit allows

Helm version pin: use Helm 3 only

Helm 2 required a server-side component (Tiller) that was a security risk. Helm 3 is client-only and relies on Kubernetes RBAC. Never use Helm 2. Pin the Helm version in your CI workflow to a specific minor version (e.g. 3.14.x) to prevent surprise behaviour from upgrades.

09

Why ArgoCD — Not Flux, Not Spinnaker, Not Raw kubectl

ArgoCD — chosen ✓

GitOps-native with a real UI

ArgoCD was purpose-built for GitOps. It continuously watches a Git repo and reconciles Kubernetes state to match. The web UI shows the health of all 20 applications, drift detection, sync status, and the rendered YAML diff — all without kubectl. For a team managing 20 services, this visibility is invaluable.

✓ Operational visibility out of the box

ArgoCD — chosen ✓

Multi-cluster support

ArgoCD can deploy to multiple Kubernetes clusters from one control plane. When you add a DR cluster or a second region, you add it as a cluster target in ArgoCD — no new CD infrastructure needed. Critical for production resilience at this service count.

✓ Single pane for multi-cluster deploys

flux — not chosen

Why not Flux v2?

Flux is architecturally purer (no UI, fully CLI-driven) and has a smaller attack surface. However, at 20 services, the lack of a built-in UI is a real operational cost — you need to either build dashboards or accept reduced visibility. ArgoCD's UI pays for its complexity in this use case. If you have a platform team that prefers CLI and GitOps purity, Flux is a valid alternative.

⚠ Valid alternative for CLI-centric teams

spinnaker — not chosen

Why not Spinnaker?

Spinnaker is a full CD platform with approval gates, canary analysis, and multi-cloud deployment. It is significantly more complex to operate, typically requiring a dedicated team. For 20 microservices without dedicated platform engineering, the operational overhead of Spinnaker outweighs its benefits. ArgoCD + Argo Rollouts covers most of what Spinnaker offers for this use case at a fraction of the operational complexity.

✗ Overkill below 50 services without platform team

ArgoCD configuration decisions

Setting	Dev	Staging	Production	Reason
Sync Policy	Auto	Auto	Manual	Prod deploys need a human trigger — auto-sync in prod removes the last human gate
selfHeal	true	true	false	In prod, manual kubectl changes should drift-alert but not be auto-overwritten
Prune	true	true	Reviewed	Pruning in prod (deleting old resources) should be a conscious action
Sync Timeout	5m	10m	10m	Longer in staging and prod to handle slow database migrations or init containers

10

Kubernetes Setup for 20 Services

Namespace strategy — per-environment, not per-service

      # Recommended namespace structure:

      dev          # all 20 services in dev

      staging      # all 20 services in staging

      production   # all 20 services in prod

      monitoring   # Prometheus, Grafana, Loki, Jaeger

      argocd       # ArgoCD control plane itself

      ingress-nginx # ingress controller

      cert-manager # TLS certificate management

Do not create one namespace per service

With 20 services, per-service namespaces mean 20+ namespaces × 3 environments = 60+ namespaces. Network policy complexity, RBAC complexity, and ArgoCD application count all explode. Per-environment namespaces, with RBAC and NetworkPolicy applied within each namespace, are the right pattern until you have hundreds of services.

Resource management — mandatory for 20 services

Every pod must have resource requests and limits defined. Without them, a single runaway service (memory leak in queue-worker) can OOM-kill pods across the entire node, taking unrelated services down. Enforce via a ValidatingAdmissionWebhook or via OPA/Gatekeeper policy.

      # Typical values for a medium-sized service:

      resources:

        requests:

          cpu: 100m

          memory: 256Mi

        limits:

          cpu: 500m     # CPU limit prevents noisy neighbours

          memory: 512Mi  # Memory limit triggers OOMKill before node pressure

Autoscaling strategy

Use Horizontal Pod Autoscaler (HPA) on CPU/memory for stateless services (api-gateway, user-service)
Use KEDA (Kubernetes Event-Driven Autoscaling) for queue-worker — scale on Kafka consumer lag, not CPU
Use Vertical Pod Autoscaler (VPA) in recommendation mode to tune resource requests over time
Do NOT run HPA and VPA in auto mode on the same deployment for the same metric (CPU or memory); they will fight over replica count and pod size
Set minReplicas: 2 in production for all services; a single replica leaves no redundancy during node drains or pod restarts
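For the queue-worker case in the list above, a KEDA ScaledObject that scales on Kafka consumer lag could look like this sketch. The broker address, consumer group, topic name, and thresholds are all hypothetical.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker
  namespace: production
spec:
  scaleTargetRef:
    name: queue-worker            # the Deployment to scale
  minReplicaCount: 2              # matches the production minimum stated above
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.production.svc:9092   # hypothetical broker address
        consumerGroup: queue-worker                   # hypothetical consumer group
        topic: orders                                 # hypothetical topic
        lagThreshold: "100"       # target lag per replica before scaling out
```

KEDA creates and manages the underlying HPA itself, so do not define a separate HPA for the same Deployment.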

11

Observability & Alerts — The Three Pillars

metrics

Prometheus + Grafana

Prometheus scrapes metrics from all pods via /metrics endpoints. The kube-prometheus-stack Helm chart installs everything: Prometheus, Alertmanager, node exporters, kube-state-metrics, and a default Grafana dashboard. Golden signals per service: request rate, error rate, P95 latency, saturation.

✓ Industry standard, massive ecosystem

logs

Loki + Promtail

Loki is the cost-effective choice for log aggregation at this scale. Unlike Elasticsearch, Loki does not index log content — it indexes only metadata labels (service, pod, namespace). Logs are compressed and stored cheaply. Promtail runs as a DaemonSet and ships logs to Loki. Grafana queries both Prometheus and Loki in the same dashboard.

✓ Far cheaper than ELK at this log volume

traces

OpenTelemetry + Tempo

Instrument services with the OpenTelemetry SDK (language-specific). The OTel Collector aggregates spans and exports to Grafana Tempo. Traces allow you to follow a request across all the services it touches — essential for debugging latency in a 20-service system where a slow database call in user-service cascades to order-service.

✓ OTel is the vendor-neutral standard

Alert routing policy

      CRITICAL  → PagerDuty on-call rotation  (wakes someone up)

        - Service error rate > 5% for 5 minutes

        - P99 latency > 10s for 5 minutes

        - Pod crash-looping (>3 restarts in 15m)

        - ArgoCD sync failed in production

      WARNING   → Slack #alerts-staging              (notify, don't wake)

        - Memory usage > 80% of limit

        - HPA at max replicas

        - Certificate expiry < 14 days

      INFO      → Slack #deploys                     (FYI only)

        - Successful deploy

        - ArgoCD sync complete
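The routing tiers above could be expressed in Alertmanager roughly as follows. This is a sketch: it assumes every alert rule sets a severity label, and the receiver names and secret file paths are hypothetical.

```yaml
# alertmanager.yml (sketch)
route:
  receiver: slack-deploys              # default route: informational messages
  group_by: [alertname, service]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
    - matchers:
        - severity="warning"
      receiver: slack-alerts

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      # Read the key from a mounted file so it never appears in Git
      - routing_key_file: /etc/alertmanager/secrets/pagerduty-routing-key
  - name: slack-alerts
    slack_configs:
      - channel: "#alerts-staging"
        api_url_file: /etc/alertmanager/secrets/slack-webhook-url
  - name: slack-deploys
    slack_configs:
      - channel: "#deploys"
        api_url_file: /etc/alertmanager/secrets/slack-webhook-url
```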

12

Secrets Management

The rule: secrets never go in Git

Not in the service repos. Not in the GitOps repo. Not as base64-encoded Kubernetes Secret manifests committed to Git (base64 is encoding, not encryption). The GitOps repo contains only references to secrets, not the values.

External Secrets Operator — recommended ✓

ESO + HashiCorp Vault or AWS Secrets Manager

Install the External Secrets Operator in Kubernetes. Define ExternalSecret resources in your GitOps repo; these reference a secret path in Vault or AWS Secrets Manager, not the value itself. ESO fetches the actual value, writes it into a Kubernetes Secret, and keeps it in sync on a configurable refresh interval. Git contains the reference (safe). The value lives in Vault (safe). The K8s Secret lives only in the cluster.

✓ GitOps-compatible secrets without storing values in Git
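A sketch of an ExternalSecret as it might be committed to the GitOps repo. The store name, secret path, and key names are hypothetical, and the apiVersion depends on the ESO release installed in the cluster.

```yaml
apiVersion: external-secrets.io/v1beta1   # newer ESO releases serve external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: payment-service-psp
  namespace: production
spec:
  refreshInterval: 1h               # how often ESO re-reads the value from the backend
  secretStoreRef:
    name: vault-backend             # hypothetical ClusterSecretStore pointing at Vault
    kind: ClusterSecretStore
  target:
    name: payment-service-psp       # name of the Kubernetes Secret that ESO creates
  data:
    - secretKey: PSP_API_KEY        # key inside the Kubernetes Secret
      remoteRef:
        key: payment-service/psp    # hypothetical path in Vault
        property: api_key           # hypothetical field at that path
```

Only the path and field names are in Git; the value itself never leaves Vault and the cluster.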

Sealed Secrets — alternative

Bitnami Sealed Secrets

Encrypt secrets with a cluster-specific public key. Store the encrypted SealedSecret in Git (safe — only the cluster can decrypt). Simpler than ESO but couples secrets to the cluster — if you lose the cluster's private key, secrets are irrecoverable. Use ESO for production, Sealed Secrets for dev environments.

CI secrets (GitHub Actions)

Use GitHub Org Secrets for secrets shared across all 20 repos (registry push token, ArgoCD token, Vault token). Use Repository Secrets only for service-specific values. Never hardcode tokens in workflow YAML — even if the repo is private. Use OIDC-based authentication where possible (GitHub OIDC → AWS/GCP avoids storing cloud credentials entirely).
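As an illustration of the OIDC route to AWS, the job below assumes an IAM role that trusts GitHub's OIDC provider. The account ID, role name, and region are placeholders.

```yaml
# Fragment of a workflow (sketch)
jobs:
  cloud-auth-check:
    runs-on: ubuntu-latest
    permissions:
      id-token: write    # required so the job can request an OIDC token
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-ci   # placeholder role ARN
          aws-region: eu-west-1
      - name: Confirm which identity the job assumed
        run: aws sts get-caller-identity
```

No access key is stored in GitHub; the credentials are short-lived and issued per run.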

13

Environment Strategy

dev

Development

Auto-sync on every push to main. Shared by all developers. Frequent deploys (10–30/day across 20 services). Relaxed resource limits. Mock external services. Ephemeral — can be torn down and rebuilt.

staging

Staging

Auto-sync, but promotion into staging requires a passing integration test suite. Mirrors production config as closely as possible. Used for QA sign-off and performance testing. Real (non-prod) external service credentials where possible.

production

Production

Manual ArgoCD sync trigger only. Requires PR approval in GitOps repo before sync. PodDisruptionBudgets active. Multiple replicas. Full observability stack. Blue/green or canary for high-risk services.

DR

Disaster Recovery

Warm (or passive) standby cluster in a second region. ArgoCD applies the same prod desired state from Git to it. RTO/RPO targets drive the cluster sizing. At 20 services, active-active DR may be overkill; warm standby is often sufficient.

Ephemeral environments for PRs: optional but high-value

GitHub Actions can spin up a temporary Kubernetes namespace for each open PR using vcluster or a Helm + namespace-per-PR pattern. The PR gets a real URL to test against, and the environment is destroyed when the PR closes. This eliminates the "works on dev, fails on staging" problem. At 20 services, this may be scoped to only the most critical services (api-gateway, auth-service) rather than all 20.

14

Security & Compliance

Supply chain security: Cosign image signing + SLSA Level 2 provenance via GitHub Actions. Verify signatures at admission time in the cluster, before pods start.
Network policies: Default-deny all ingress/egress per namespace. Explicit allows only. payment-service has the most restrictive policy — only api-gateway may call it.
Pod Security Standards: Enforce restricted profile via Kubernetes Pod Security Admission. No privileged containers, no hostPath mounts, read-only root filesystem where possible.
RBAC: ArgoCD uses a dedicated ServiceAccount with minimal permissions per namespace. Developers get read-only kubectl access to prod, full access to dev.
Dependency updates: Renovate Bot opens weekly PRs for dependency bumps across all 20 repos. Auto-merge on passing CI for patch versions. Manual review for minor/major.
Runtime security: Falco as a DaemonSet. Detects anomalous container behaviour (unexpected outbound connections, privilege escalation attempts) at runtime.
payment-service isolation: If PCI-scoped, isolate to a dedicated node pool with a taint. Only payment pods tolerate this taint. Reduces PCI audit scope significantly.
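The namespace-level controls in the list above can be sketched as plain Kubernetes manifests. The pod labels (app: payment-service, app: api-gateway) and the port are assumptions.

```yaml
# Enforce the restricted Pod Security Standard on the namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
---
# Default-deny: selects every pod and allows no ingress or egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Explicit allow: only api-gateway pods may reach payment-service
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-service-allow-api-gateway
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
```

Because egress is also denied by default, api-gateway needs a matching egress rule towards payment-service, and most pods need an egress allowance for DNS.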

15

Tool Selection Summary

Layer	Tool	Why This, Not Alternatives
Source Control	GitHub (20 repos + 1 gitops)	Native Actions integration. Org-level policy enforcement. OIDC for cloud auth.
CI Platform	GitHub Actions	Zero-friction GitHub integration. Reusable workflows for 20-repo governance. Pay-per-minute, no infra to manage.
Pre-commit	pre-commit framework	Language-agnostic. Community hook ecosystem. Identical gates locally and in CI.
SAST	Semgrep / CodeQL	CodeQL is GitHub-native (free on public repos). Semgrep for custom rules. Both integrate with GitHub Security tab.
SCA / CVE	Trivy	Single tool for dependency AND image scanning. Fast. OSS. Excellent GitHub Actions support.
Container Registry	ghcr.io	Same auth as GitHub. Free for public images; check current plan allowances for private ones. No extra service.
Image Signing	Cosign (Sigstore)	Keyless signing via OIDC. No private key management. Growing standard.
Package Manager	Helm 3	Templating at env scale. Library charts for standards. ArgoCD native. Kustomize for 3rd-party patches only.
GitOps CD	ArgoCD	UI for visibility at 20 services. Multi-cluster. App-of-Apps for 20+ app management. Drift detection.
Runtime	Kubernetes	Industry standard. Helm + ArgoCD ecosystem. HPA + KEDA for autoscaling.
Metrics	Prometheus + Grafana	De facto standard. kube-prometheus-stack Helm chart sets everything up.
Logs	Grafana Loki	Much cheaper than ELK. Grafana native. Indexes labels only, not log content. Scales well at this service count.
Traces	OpenTelemetry + Tempo	OTel is the vendor-neutral standard. Tempo integrates in Grafana for unified observability.
Alerting	Alertmanager → PagerDuty/Slack	Bundled with Prometheus. Routing logic keeps it simple. PagerDuty for on-call, Slack for awareness.
Secrets	External Secrets Operator + Vault	No secrets in Git. GitOps-compatible. Supports rotation. Works with AWS/GCP/Vault.
Security Runtime	Falco	eBPF-based runtime threat detection. Minimal overhead. Alerts on anomalous container behaviour.

16

Starter Repository

All the CI/CD patterns, Docker configurations, pipeline scripts, and Docker Compose setups referenced in this handbook have a concrete starting point. Use this reference repository to bootstrap your pipeline implementation:

🚀

DevOps Playbook — CI/CD Reference Implementation

A comprehensive starter repository containing CI scripts, CD pipelines, Docker Compose configurations, GitHub Actions workflows, and Docker best practices. Covers the full pipeline from local development through to production deployment — aligned with the architecture decisions in this handbook.

Includes: Docker multi-stage build patterns · GitHub Actions CI template · CD deploy scripts · Docker Compose for local dev · Pre-commit hook configurations

→ vivek-doshi.github.io/devops-playbook/

GitHub Actions · Docker · CI Scripts · CD Scripts · Docker Compose · Pre-commit · Helm

How to use this handbook + the starter repo together

This handbook answers "what should I build and why." The starter repo answers "here is working code to start from." Apply the architectural decisions in §03–§14 to the repository patterns in the starter repo. The two are designed to complement each other: strategic decisions here, tactical implementation there.

DevOps Architecture Handbook v1.0 · 20 Microservices Pattern · Based on CNCF, Google SRE, and DORA industry standards
