🏗 Industry-Grade Architecture

DevOps Architecture for 20 Microservices

A complete decision framework — not code, just architecture. Every tool choice is justified. Every trade-off is explained. Designed for senior engineers who want to know why, not just what.

20 Services · 20 GitHub Repos
GitHub Actions · CI Platform
ArgoCD · GitOps CD
Helm + K8s · Runtime
OCI Registry · Image Store
01

System Landscape — The 20 Services

At 20 services, you are in the "medium microservices" zone. Large enough that manual deploys are impossible, small enough that you do not need a platform engineering team of 10. The right architecture needs to be automatable, self-service, and auditable. Below is a representative domain decomposition.

🌐 api-gateway
Ingress, routing, rate limiting
ingress tier-1
🔐 auth-service
OAuth2 / JWT / OIDC
security tier-1
👤 user-service
Profile, preferences, RBAC
domain
📦 product-service
Catalogue, inventory
domain
🛒 order-service
Order lifecycle, saga
domain stateful
💳 payment-service
PCI scope, external PSP
pci isolated
📧 notification-service
Email, SMS, push
async
📊 analytics-service
Event aggregation, reports
data
🔍 search-service
Elasticsearch interface
query
📁 media-service
Upload, transcode, CDN
storage
📬 queue-worker
Kafka consumer workers
async
🕒 scheduler-service
Cron jobs, delayed tasks
infra
🏥 health-service
Readiness, liveness probes
platform
📈 metrics-exporter
Prometheus custom metrics
platform
🔧 config-service
Feature flags, dynamic config
platform
🗂 audit-service
Compliance event log
compliance
🌍 geo-service
Localization, timezone
domain
🤝 partner-service
B2B integrations, webhooks
domain
🧪 test-harness
Contract tests, E2E, mocks
testing
🗺 gitops-config
Helm values, ArgoCD apps
gitops ★ special
Repository count = 20 source repos + 1 GitOps repo = 21 total
The GitOps config repo is the 21st and is intentionally separate (the cards above show 19 representative services plus gitops-config). This is a critical architectural boundary explained in detail in §05.
02

The Full Pipeline — Every Stage Justified

Here is the complete flow a code change travels, from a developer's workstation to production. Each stage is a gate. Nothing proceeds without passing the previous one.

💻 Local Dev · IDE + Docker
🪝 pre-commit · lint + format
🏗 CI Build · compile + test
🔬 CI Scan · SAST + SCA
📦 Image Push · tag + sign
📝 GitOps PR · bump digest
🔄 ArgoCD Sync · reconcile
Kubernetes · rolling deploy
🚨 Alerts · PagerDuty

Stage-by-stage rationale

1

Local Dev → pre-commit

Pre-commit hooks catch the cheap problems (trailing whitespace, secrets in code, linting errors) before they ever hit CI. At 20 services, CI minutes are expensive and slow. Pre-commit is free and instant. Use the pre-commit framework (Python-based, language-agnostic). Hooks to include: detect-secrets, hadolint (Dockerfile linting), gitleaks (credential scanning), and language-specific formatters (black, gofmt, prettier). Every engineer installs this on checkout — enforced via a Makefile target and onboarding docs.
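The hook set described above could be declared in a .pre-commit-config.yaml at the root of each service repo. This is a minimal sketch, not a definitive setup: the rev pins are illustrative assumptions and should be replaced with releases you have vetted, the .secrets.baseline file is assumed to exist, and black stands in for whichever formatter matches the service's language.

```yaml
# .pre-commit-config.yaml (sketch; rev values are illustrative pins, not recommendations)
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets
        args: ["--baseline", ".secrets.baseline"]  # baseline file is an assumption
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks
  - repo: https://github.com/hadolint/hadolint
    rev: v2.12.0
    hooks:
      - id: hadolint-docker   # runs hadolint from its container image
  - repo: https://github.com/psf/black
    rev: 24.2.0
    hooks:
      - id: black             # swap for gofmt or prettier hooks in non-Python services
```

A Makefile target that runs pre-commit install gives every engineer the same hooks after checkout.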

2

CI — Build + Test

GitHub Actions is the natural choice because your 20 repos already live in GitHub. No context-switching, no extra auth, and GitHub-hosted runners are cheap for the volume a 20-service system generates. Each service repo contains its own .github/workflows/ci.yml. Build steps: compile (or lint for interpreted), unit tests, integration tests against ephemeral containers (use services: in the workflow to spin up a Postgres or Redis for the duration of the job). Fail fast — tests before images, never the reverse.
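As an illustration of the services: mechanism mentioned above, a test job might look like the following. This is a sketch under assumptions: the Makefile targets and the DATABASE_URL variable are hypothetical names, and Postgres is only one possible dependency.

```yaml
# .github/workflows/ci.yml (sketch of the test job only)
name: ci
on:
  push:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres   # throwaway credential for the ephemeral container
        ports:
          - 5432:5432
        # Wait until Postgres accepts connections before the steps start
        options: >-
          --health-cmd "pg_isready -U postgres"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - name: Unit tests
        run: make test-unit            # hypothetical Makefile target
      - name: Integration tests
        run: make test-integration     # hypothetical Makefile target
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
```

The container lives only for the duration of the job, so no test database needs to be provisioned or cleaned up.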

3

CI — Security Scanning

Two mandatory scan types in CI: (a) SAST — static code analysis using language-specific tools (Semgrep for Python/JS/Go, SpotBugs for Java, or GitHub's CodeQL which is built-in). (b) SCA — software composition analysis to catch vulnerable dependencies using Trivy (open source, fast, multi-ecosystem). Trivy also scans the built Docker image for OS-level CVEs. Thresholds: fail CI on CRITICAL, warn on HIGH, report but do not block on MEDIUM. These thresholds are encoded in the CI YAML — not in developer heads.
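One way to encode those thresholds in the workflow is to run Trivy more than once with different severity filters and exit codes. The steps below are a sketch: they assume the trivy CLI is already on the runner PATH and that an IMAGE environment variable (a hypothetical name) holds the reference of the image built earlier in the job.

```yaml
# Steps inside a CI job (sketch; IMAGE is an assumed environment variable)
steps:
  - name: Dependency scan, block on CRITICAL
    run: trivy fs --scanners vuln --severity CRITICAL --exit-code 1 .
  - name: Image scan, block on CRITICAL
    run: trivy image --severity CRITICAL --exit-code 1 "$IMAGE"
  - name: Image scan, report HIGH and MEDIUM without failing
    run: trivy image --severity HIGH,MEDIUM --exit-code 0 "$IMAGE"
```

Only the steps with --exit-code 1 can fail the job; the last step prints findings for review and always succeeds.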

4

Image Push — Tag Strategy + Signing

Build the container image and push it to an OCI-compliant registry (GitHub Container Registry, ghcr.io, is the default: Actions can authenticate with the workflow's built-in GITHUB_TOKEN). Tag with the Git commit SHA (for example the short form sha-a1b2c3d) and never deploy latest to production. Also tag with the semantic version if the build is release-triggered (v1.4.2). Sign the image using Cosign (Sigstore), which creates a tamper-evident provenance record. The cluster can then be configured to verify signatures at admission time, for example with Kyverno or the Sigstore policy-controller, before any pod runs.

5

GitOps Update — Automated PR to Config Repo

After a successful image push, the CI pipeline opens a Pull Request in the gitops-config repo that bumps the image digest in the relevant Helm values file. This is the critical handoff between the CI world (code) and the CD world (desired state). The PR can be auto-merged for non-production environments (dev, staging) or require a human approval for production. Never write directly to the GitOps repo's main branch from CI — always via PR for auditability.
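The handoff could be implemented as a final job in the service workflow. The sketch below makes several assumptions: GITOPS_TOKEN is a hypothetical secret holding a token with write access to gitops-config, the values file keeps the image under .image.tag (as in the repo layout in §05) rather than a digest, and the yq (v4) and gh CLIs are available on the runner.

```yaml
# Job in the service repo's workflow (sketch; secret and branch names are assumptions)
jobs:
  update-gitops:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    env:
      SERVICE: api-gateway
      TAG: sha-${{ github.sha }}
      GH_TOKEN: ${{ secrets.GITOPS_TOKEN }}   # read by the gh CLI
    steps:
      - uses: actions/checkout@v4
        with:
          repository: your-org/gitops-config
          token: ${{ secrets.GITOPS_TOKEN }}
      - name: Bump the image tag on a branch and open a PR
        run: |
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          git checkout -b "bump/${SERVICE}-${TAG}"
          yq -i '.image.tag = strenv(TAG)' "environments/dev/${SERVICE}.yaml"
          git commit -am "chore(${SERVICE}): deploy ${TAG} to dev"
          git push origin "bump/${SERVICE}-${TAG}"
          gh pr create --base main --head "bump/${SERVICE}-${TAG}" \
            --title "chore(${SERVICE}): deploy ${TAG} to dev" \
            --body "Automated image bump from CI."
```

Nothing here writes to main directly: the job only pushes a branch and opens the PR, and the merge policy decides what happens next.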

6

ArgoCD Sync — Reconciliation Loop

ArgoCD watches the gitops-config repo. When the image reference changes, it detects drift between desired state (Git) and actual state (Kubernetes) and reconciles by applying the updated Helm chart. In production, the automated sync policy is left out, so a human triggers each sync and ArgoCD only reports drift (the equivalent of selfHeal: false). In dev/staging, syncPolicy.automated with selfHeal: true gives fully hands-free deploys. ArgoCD polls Git every 3 minutes by default.
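For a dev environment, the Application that drives this loop might be declared as follows. It is a sketch that assumes the repo layout from §05; the Application name and the default project are illustrative choices.

```yaml
# apps/dev/api-gateway.yaml (sketch; names and project are assumptions)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-config.git
    targetRevision: main
    path: charts/api-gateway
    helm:
      valueFiles:
        - ../../environments/dev/api-gateway.yaml   # per-env override, relative to the chart path
  destination:
    server: https://kubernetes.default.svc
    namespace: dev
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
# For the prod Application, leave out syncPolicy.automated entirely so that
# every sync is triggered by a person.
```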

7

Kubernetes — Rolling Deploy

Kubernetes applies the new manifest with a rolling update strategy: it replaces pods incrementally, within the limits set by maxSurge and maxUnavailable, and waits for readiness probes before sending traffic to new pods. This means zero downtime for well-configured services. Health checks must be correctly defined: livenessProbe (is the pod stuck?) and readinessProbe (is the pod ready to receive traffic?). For stateful services (order-service) consider canary deployments using Argo Rollouts.
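A Deployment that combines a conservative rolling update with both probes could look like this. The probe paths, port, and replica count are assumptions chosen for illustration.

```yaml
# Rendered Deployment (sketch; /readyz, /healthz and port 8080 are assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during the rollout
      maxUnavailable: 0    # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: ghcr.io/your-org/order-service:sha-a1b2c3d
          ports:
            - containerPort: 8080
          readinessProbe:          # gate traffic until the pod can serve
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 5
          livenessProbe:           # restart the container if it hangs
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
```

With maxUnavailable: 0, an old pod is removed only after its replacement passes the readiness probe.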

8

Observability → Alerts

Prometheus scrapes metrics from all pods. Grafana dashboards visualise the golden signals: latency, traffic, errors, saturation. Alertmanager routes to PagerDuty (on-call) for critical alerts and to Slack for warning-level. Loki ingests structured logs. Jaeger or Tempo handles distributed traces. Every service must emit the standard telemetry set — this is enforced via a shared Helm library chart (sidecar injector or SDK requirement).

03

Source Control — 20 Repos, Not a Monorepo

Polyrepo vs Monorepo decision

With 20 services across independent teams or domain boundaries, a polyrepo (one repo per service) is the right call. Here is why:

polyrepo — chosen ✓

Independent release cadence

payment-service and notification-service do not need to be versioned together. In a monorepo, every CI run needs path filters or build-graph tooling to work out which of the 20 services a change affects. Independent repos mean independent pipelines, independent secrets, and independent access control.

✓ Independent deploy cycles per team
polyrepo — chosen ✓

Blast radius control

A misconfigured CI pipeline in user-service does not disable order-service's deploys. In a monorepo with shared CI config, a syntax error can block all 20 services simultaneously.

✓ Failure isolation per service
trade-off — managed

Shared library problem

Cross-cutting concerns (logging, auth middleware, tracing) should not be copy-pasted into 20 repos. Solve this with an internal package registry (GitHub Packages) and a shared-libs repo with its own versioning. Services consume specific versions, so there is no implicit coupling.

⚠ Requires internal package discipline

GitHub Organisation structure

github.com/your-org/
├── api-gateway # service repo
├── auth-service # service repo
├── user-service # service repo
├── ... # 17 more service repos
├── gitops-config # ★ THE gitops repo — no application code here
├── shared-libs # internal SDK, middleware, observability
└── infra-modules # Terraform modules for cloud infra (optional)
GitHub repo settings: enforce for all 20 repos at the organisation level
Branch protection on main: require PR, require status checks, dismiss stale reviews, require linear history. Apply these rules at the organisation level (for example with organisation rulesets), not manually per repo. At 20 repos, human enforcement is not reliable.
04

Branch Strategy — Trunk-Based Development

At 20 services, long-lived feature branches are poison. They create merge conflicts, delay integration feedback, and make the concept of "done" fuzzy. The industry-standard answer at this scale is Trunk-Based Development (TBD).

main (protected): Always deployable. Direct commits blocked. Every push triggers CI + possible deploy.
feature/JIRA-123-add-oauth: Short-lived (max 2 days). Squash-merged to main. Never long-running.
fix/payment-timeout: Hotfix branch. Merged to main, then cherry-picked to release if needed.
release/v1.4.x (protected): Cut only for a release. Cherry-pick hotfixes here. Never merge back "up" to main manually; use the cherry-pick path.
chore/update-dependencies: Renovate Bot or Dependabot automated PR branches. Auto-merged on passing CI.

Why not GitFlow?

GitFlow (develop, release, hotfix branches) was designed for software released on a monthly or quarterly schedule. At 20 microservices deploying multiple times per day, GitFlow introduces merge complexity with no benefit. The develop branch becomes a buffer that delays integration. Trunk-Based Development with feature flags solves the "unfinished feature" problem without a long-lived branch.

Feature flags over long branches

For large, in-progress features: merge the code to main behind a feature flag (managed by config-service). The code ships but the feature is off. When ready, toggle the flag. This eliminates the "branch diverged for 3 weeks" problem entirely.

Commit convention: enforced via pre-commit
Use Conventional Commits (feat:, fix:, chore:, ci:). This enables automated changelogs and semantic versioning triggers in CI. At 20 repos, consistent convention is critical for understanding cross-service changes.
05

Should There Be a GitOps Repo? Yes — Mandatory

This is one of the most debated architectural decisions in microservice DevOps. The answer is unambiguous at 20 services: yes, you need a dedicated, separate GitOps repository.

Why separate from the service repos?

reason 1

Separation of concerns

A developer making a code change to order-service should not need access to the deployment manifests for payment-service. Keeping config separate enforces a clear boundary: application code in service repos, desired infrastructure state in the GitOps repo.

reason 2

Single source of truth for all environments

The gitops-config repo holds what is running in dev, staging, and prod — all in one place. Anyone can look at this repo and know exactly what version of every service is deployed where. No manual tracking, no wikis, no "check the CI logs."

reason 3

Audit trail and compliance

Every change to production is a Git commit with author, timestamp, and message. This is your audit log. For financial services or regulated industries, this is not optional. A merge history in the GitOps repo is your deployment log.

reason 4

Access control

ArgoCD needs only read access to the GitOps repo. Write access to the production paths (apps/prod/, environments/prod/) is limited to senior engineers, CI proposes changes through PRs, and service developers get read access. This is impossible to enforce cleanly if configs live alongside application code in 20 separate repos.

GitOps repo structure

gitops-config/
├── apps/ # ArgoCD Application manifests
│ ├── dev/ # one file per service
│ │ ├── api-gateway.yaml
│ │ ├── auth-service.yaml
│ │ └── ... (20 files)
│ ├── staging/
│ └── prod/
├── charts/ # Helm chart per service (or shared library chart)
│ ├── api-gateway/
│ │ ├── Chart.yaml
│ │ ├── templates/
│ │ └── values.yaml
│ └── ... (20 charts)
├── environments/ # Per-env overrides
│ ├── dev/
│ │ └── api-gateway.yaml # image.tag: sha-a1b2c3d
│ ├── staging/
│ └── prod/
│ └── api-gateway.yaml # image.tag: v1.4.2
└── bootstrap/ # ArgoCD App-of-Apps root

The "App of Apps" pattern

With 20 services, create a single ArgoCD Application in bootstrap/ that points to the apps/ directory. ArgoCD discovers and manages all 20 child applications from this one parent. Adding a new service is a small, reviewable change: add a Helm chart, then add an Application manifest for each environment. No manual ArgoCD configuration.
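The parent Application might be as small as the manifest below. It is a sketch: the file name, Application name, and project are assumptions, and directory.recurse is used so the dev/, staging/ and prod/ subdirectories under apps/ are all picked up.

```yaml
# bootstrap/root.yaml (sketch; file and Application names are assumptions)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-config.git
    targetRevision: main
    path: apps
    directory:
      recurse: true           # include apps/dev, apps/staging and apps/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd         # child Applications are created in the argocd namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Auto-syncing the parent only creates or updates the child Application objects. Each child still follows its own sync policy, so production children can remain manual.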

06

CI — Build, Test, Scan

Why GitHub Actions (not Jenkins, not GitLab CI)?

GitHub Actions — chosen ✓

Native GitHub integration

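Your 20 repos are on GitHub. Within a repo, GitHub Actions needs no extra authentication setup to check out code and post status checks, because every run receives a scoped GITHUB_TOKEN. Opening PRs in another repo (such as gitops-config) does need a GitHub App token or a fine-grained personal access token, since GITHUB_TOKEN is limited to the repo that runs the workflow. Jenkins requires SSH keys, webhooks, and credential management for all of these operations.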

✓ Zero-friction GitHub integration
GitHub Actions — chosen ✓

Reusable workflows

GitHub Actions supports reusable workflows: define the CI pattern once under .github/workflows/ in a shared repo and call it from all 20 service repos. Updating the scan step in one place then propagates to all services. Critical for 20-repo governance.

✓ Centralised CI governance via reusable workflows
trade-off

Vendor lock-in

GitHub Actions YAML is GitHub-specific. However, the actual build logic (Docker, shell scripts, Trivy) is portable. If you ever migrate off GitHub, the scripts move; only the orchestration YAML changes. This is an acceptable trade-off for the productivity gain.

CI workflow — per service repo

# .github/workflows/ci.yml (exists in all 20 service repos)

Triggers:
  - push to any branch
  - pull_request to main

Jobs (in order):

1. lint → pre-commit hooks run in CI (redundant safety net)
2. test → unit + integration tests (with service containers)
3. sast → Semgrep / CodeQL (parallel to test)
4. build-image → docker build (multi-stage, pinned base images)
5. scan-image → trivy image scan (fails on CRITICAL CVE)
6. push-image → ghcr.io/your-org/service:sha-XXXX (on main only)
7. sign-image → cosign sign (keyless, OIDC-based)
8. update-gitops → open PR in gitops-config repo (on main only)
Critical: steps 6–8 run ONLY on pushes to main, never on feature branches
Feature branches produce a build and test result only. They do not push images or touch the GitOps repo. This prevents staging/prod from being polluted by in-progress feature work.

Reusable workflow pattern

Create a shared-ci-workflows repo (or use the shared-libs repo) that contains the canonical workflow YAML. Each service repo references it with:

# In each service repo's ci.yml:
jobs:
  ci:
    uses: your-org/shared-ci-workflows/.github/workflows/service-ci.yml@main
    with:
      service-name: 'api-gateway'
      language: 'python'

This pattern means all 20 services get security scan upgrades, new lint rules, or changed image tag strategies by merging one PR in one repo.
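On the shared repo's side, the workflow has to declare a workflow_call trigger with matching inputs. The following is a sketch of that callee under assumptions: the single test job and the make test target are placeholders for the full pipeline.

```yaml
# shared-ci-workflows/.github/workflows/service-ci.yml (sketch of the callee side)
name: service-ci
on:
  workflow_call:
    inputs:
      service-name:
        required: true
        type: string
      language:
        required: true
        type: string

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4     # checks out the calling service repo
      - name: Run tests for ${{ inputs.service-name }}
        run: make test                # hypothetical target each service repo is assumed to provide
```

Referencing the workflow at @main gives instant propagation. Pinning callers to a tag or commit SHA trades that for controlled rollouts.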

07

Image Push — Registry Choice and Tagging

Registry: GitHub Container Registry (ghcr.io)

Option | Why / Why Not | Verdict
ghcr.io | Native GitHub auth, free for public, authenticates from Actions with the built-in GITHUB_TOKEN, same access model as repos, no extra service to manage | ✓ Default choice
ECR (AWS) | Excellent if you run on EKS, tighter IAM integration, but requires OIDC federation setup and couples you to AWS | Use if on EKS
Artifact Registry (GCP) | Best choice if running on GKE, Workload Identity makes authentication clean | Use if on GKE
Docker Hub | Rate limits on pulls (100 per 6 hours unauthenticated), privacy limitations on free tier, no OIDC auth from Actions without token management | ✗ Avoid for prod
Harbor (self-hosted) | Full control, built-in vulnerability scanning, good for air-gapped or compliance-heavy environments. Adds operational overhead. | Regulated only

Image tagging convention

# Never use :latest in staging or production manifests

Development: ghcr.io/org/api-gateway:sha-a1b2c3d
Staging: ghcr.io/org/api-gateway:sha-a1b2c3d (same SHA, different values.yaml)
Production: ghcr.io/org/api-gateway:v1.4.2 (semantic version on release tag)

# Why SHA not semver for dev/staging?
# A commit-SHA tag maps to exactly one commit, but any tag can still be overwritten.
# For a truly immutable reference, pin by digest (image@sha256:...).
# Immutable image references = reproducible deploys = reliable rollbacks.

Image signing with Cosign

Every pushed image is signed using Cosign in keyless mode (using the GitHub Actions OIDC token, so there is no private key to manage). The signature is stored in the same registry alongside the image. ArgoCD does not verify image signatures itself; pair it with an admission controller in the cluster (for example Kyverno or the Sigstore policy-controller) to reject unsigned images. This closes the supply chain loop: you can prove that an image running in prod was built by your CI, not pushed manually by a human with registry write access.
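A keyless signing job could be sketched as follows. It assumes a preceding push-image job (a hypothetical name) that exposes the pushed image digest as an output called digest. Signing by digest rather than by tag ties the signature to the exact image content.

```yaml
# Signing job (sketch; the push-image job and its digest output are assumptions)
jobs:
  sign-image:
    needs: push-image
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write      # signatures are pushed to the registry next to the image
      id-token: write      # lets the job request an OIDC token for keyless signing
    steps:
      - uses: sigstore/cosign-installer@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Sign by digest
        run: cosign sign --yes "ghcr.io/your-org/api-gateway@${DIGEST}"
        env:
          DIGEST: ${{ needs.push-image.outputs.digest }}
```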

08

Why Helm — Not Raw Manifests, Not Kustomize

This question comes up constantly. Here is the honest trade-off analysis for 20 services across multiple environments.

helm — chosen ✓

Templating is mandatory at 20 services

You have 20 services × 3 environments = 60 combinations. Without templating, you have 60 sets of YAML files. A change to the liveness probe timeout means editing 60 files. Helm reduces this to 1 chart change + 60 value overrides (or a shared library chart default).

✓ DRY principle enforced via templating
helm — chosen ✓

Helm Library Charts

Create a common library Helm chart that encodes your organisational defaults: resource limits, security context, probe paths, pod disruption budgets, network policies. Each service chart depends on this library. When standards change, one PR to the library chart propagates through all 20 services on next deploy.

✓ Single place for K8s standards enforcement
helm — chosen ✓

ArgoCD native support

ArgoCD understands Helm natively. It renders Helm templates server-side, tracks drift correctly, and shows the diff between current and desired state in its UI. The integration is first-class and requires no shims.

✓ Zero-config ArgoCD integration
kustomize — not chosen

Why not Kustomize?

Kustomize is excellent for patching existing manifests you do not control. For your own services, Kustomize's overlay model becomes hard to reason about at scale — you end up with bases, overlays, and patches that interact in non-obvious ways. Helm's values hierarchy is more explicit for environment differentiation across 20 services.

⚠ Use Kustomize only to patch third-party charts you cannot modify

Helm chart structure per service

charts/api-gateway/
├── Chart.yaml # version, dependencies (→ common library chart)
├── values.yaml # sane defaults for all envs
└── templates/
    ├── deployment.yaml # uses library chart's deployment helper
    ├── service.yaml
    ├── ingress.yaml
    ├── hpa.yaml # Horizontal Pod Autoscaler
    ├── pdb.yaml # Pod Disruption Budget (always include)
    └── networkpolicy.yaml # default-deny, explicit allows
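The Chart.yaml in the tree above is where the dependency on the shared library chart would be declared. In this sketch the library chart's name (common), its version range, and the OCI repository URL are all assumptions.

```yaml
# charts/api-gateway/Chart.yaml (sketch; library chart name, version and URL are assumptions)
apiVersion: v2
name: api-gateway
description: Helm chart for api-gateway
type: application
version: 0.1.0
appVersion: "1.4.2"
dependencies:
  - name: common               # the shared chart, itself declared with type: library
    version: 1.x.x             # accept any 1.x release of the library
    repository: oci://ghcr.io/your-org/charts
```

A version range lets services pick up library fixes on the next dependency update without editing every chart. Pin an exact version instead if you want each upgrade to be an explicit PR.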
Helm version pin: use Helm 3 only
Helm 2 required a server-side component (Tiller) that was a security risk. Helm 3 is client-only with Kubernetes RBAC. Never use Helm 2. Pin the Helm version in your CI workflow to a specific minor version (e.g. 3.14.x) to prevent surprise behaviour from upgrades.
09

Why ArgoCD — Not Flux, Not Spinnaker, Not Raw kubectl

ArgoCD — chosen ✓

GitOps-native with a real UI

ArgoCD was purpose-built for GitOps. It continuously watches a Git repo and reconciles Kubernetes state to match. The web UI shows the health of all 20 applications, drift detection, sync status, and the rendered YAML diff — all without kubectl. For a team managing 20 services, this visibility is invaluable.

✓ Operational visibility out of the box
ArgoCD — chosen ✓

Multi-cluster support

ArgoCD can deploy to multiple Kubernetes clusters from one control plane. When you add a DR cluster or a second region, you add it as a cluster target in ArgoCD — no new CD infrastructure needed. Critical for production resilience at this service count.

✓ Single pane for multi-cluster deploys
flux — not chosen

Why not Flux v2?

Flux is architecturally purer (no UI, fully CLI-driven) and has a smaller attack surface. However, at 20 services, the lack of a built-in UI is a real operational cost — you need to either build dashboards or accept reduced visibility. ArgoCD's UI pays for its complexity in this use case. If you have a platform team that prefers CLI and GitOps purity, Flux is a valid alternative.

⚠ Valid alternative for CLI-centric teams
spinnaker — not chosen

Why not Spinnaker?

Spinnaker is a full CD platform with approval gates, canary analysis, and multi-cloud deployment. It is significantly more complex to operate and typically requires a dedicated team. For 20 microservices without dedicated platform engineering, the operational overhead of Spinnaker outweighs its benefits. ArgoCD + Argo Rollouts covers most of what Spinnaker offers at a fraction of the operational complexity.

✗ Overkill below 50 services without platform team

ArgoCD configuration decisions

Setting | Dev | Staging | Production | Reason
Sync Policy | Auto | Auto | Manual | Prod deploys need a human trigger; auto-sync in prod removes the last human gate
selfHeal | true | true | false | In prod, manual kubectl changes should drift-alert but not be auto-overwritten
Prune | true | true | Reviewed | Pruning in prod (deleting old resources) should be a conscious action
Sync Timeout | 5m | 10m | 10m | Longer in staging and prod to handle slow database migrations or init containers
10

Kubernetes Setup for 20 Services

Namespace strategy — per-environment, not per-service

# Recommended namespace structure:

dev # all 20 services in dev
staging # all 20 services in staging
production # all 20 services in prod
monitoring # Prometheus, Grafana, Loki, Jaeger
argocd # ArgoCD control plane itself
ingress-nginx # ingress controller
cert-manager # TLS certificate management
Do not create one namespace per service
With 20 services, per-service namespaces mean 20+ namespaces × 3 environments = 60+ namespaces. Network policy complexity, RBAC complexity, and ArgoCD application count all explode. Per-environment namespaces with RBAC and NetworkPolicy within the namespace is the right pattern until you have hundreds of services.

Resource management — mandatory for 20 services

Every pod must have resource requests and limits defined. Without them, a single runaway service (memory leak in queue-worker) can OOM-kill pods across the entire node, taking unrelated services down. Enforce via a ValidatingAdmissionWebhook or via OPA/Gatekeeper policy.

# Typical values for a medium-sized service:
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m # CPU limit prevents noisy neighbours
    memory: 512Mi # Memory limit triggers OOMKill before node pressure

Autoscaling strategy

11

Observability & Alerts — The Three Pillars

metrics

Prometheus + Grafana

Prometheus scrapes metrics from all pods via /metrics endpoints. The kube-prometheus-stack Helm chart installs everything: Prometheus, Alertmanager, node exporters, kube-state-metrics, and Grafana with a set of default dashboards. Golden signals per service: request rate, error rate, P95 latency, saturation.

✓ Industry standard, massive ecosystem
logs

Loki + Promtail

Loki is the cost-effective choice for log aggregation at this scale. Unlike Elasticsearch, Loki does not index log content — it indexes only metadata labels (service, pod, namespace). Logs are compressed and stored cheaply. Promtail runs as a DaemonSet and ships logs to Loki. Grafana queries both Prometheus and Loki in the same dashboard.

✓ 10× cheaper than ELK at this log volume
traces

OpenTelemetry + Tempo

Instrument services with the OpenTelemetry SDK (language-specific). The OTel Collector aggregates spans and exports to Grafana Tempo. Traces allow you to follow a request across all the services it touches — essential for debugging latency in a 20-service system where a slow database call in user-service cascades to order-service.

✓ OTel is the vendor-neutral standard

Alert routing policy

CRITICAL → PagerDuty on-call rotation (wakes someone up)
  - Service error rate > 5% for 5 minutes
  - P99 latency > 10s for 5 minutes
  - Pod crash-looping (>3 restarts in 15m)
  - ArgoCD sync failed in production

WARNING → Slack #alerts-staging (notify, don't wake)
  - Memory usage > 80% of limit
  - HPA at max replicas
  - Certificate expiry < 14 days

INFO → Slack #deploys (FYI only)
  - Successful deploy
  - ArgoCD sync complete
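The routing policy above maps onto an Alertmanager configuration along these lines. This is a sketch: the receiver names and secret file paths are assumptions, and it presumes every alert rule sets a severity label of critical, warning, or info.

```yaml
# alertmanager.yml (sketch; receiver names and file paths are assumptions)
route:
  receiver: slack-deploys            # default route: INFO-level notifications
  group_by: [alertname, service]
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-oncall     # wakes someone up
    - matchers:
        - 'severity="warning"'
      receiver: slack-alerts         # notify, don't wake
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty-key
  - name: slack-alerts
    slack_configs:
      - channel: "#alerts-staging"
        api_url_file: /etc/alertmanager/secrets/slack-url
  - name: slack-deploys
    slack_configs:
      - channel: "#deploys"
        api_url_file: /etc/alertmanager/secrets/slack-url
```

Reading credentials from mounted files keeps the PagerDuty key and Slack webhook out of the config, in line with the secrets rule in §12.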
12

Secrets Management

The rule: secrets never go in Git

Not in the service repos. Not in the GitOps repo. Not as base64-encoded Kubernetes Secret manifests committed to Git (base64 is encoding, not encryption). The GitOps repo contains only references to secrets, not the values.

External Secrets Operator — recommended ✓

ESO + HashiCorp Vault or AWS Secrets Manager

Install the External Secrets Operator in Kubernetes. Define ExternalSecret resources in your GitOps repo. These reference a secret path in Vault or AWS Secrets Manager, not the value itself. ESO syncs the actual value into a Kubernetes Secret and refreshes it on a configurable interval. Git contains the reference (safe). The value lives in Vault (safe). The K8s Secret lives only in the cluster.

✓ GitOps-compatible secrets without storing values in Git
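An ExternalSecret committed to the GitOps repo might look like the sketch below. The store name, Vault path, and key names are assumptions, and the apiVersion depends on which ESO release is installed in the cluster.

```yaml
# ExternalSecret stored in gitops-config (sketch; names and paths are assumptions)
apiVersion: external-secrets.io/v1beta1   # check the version served by your ESO release
kind: ExternalSecret
metadata:
  name: payment-service-psp
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: payment-service-psp      # Kubernetes Secret that ESO creates and updates
  data:
    - secretKey: PSP_API_KEY       # key inside the Kubernetes Secret
      remoteRef:
        key: payments/psp          # path in the external store, not the value
        property: api_key
```

The manifest is safe to review in a PR because it names where the secret lives but never contains it.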
Sealed Secrets — alternative

Bitnami Sealed Secrets

Encrypt secrets with a cluster-specific public key. Store the encrypted SealedSecret in Git (safe — only the cluster can decrypt). Simpler than ESO but couples secrets to the cluster — if you lose the cluster's private key, secrets are irrecoverable. Use ESO for production, Sealed Secrets for dev environments.

CI secrets (GitHub Actions)

Use GitHub Org Secrets for secrets shared across all 20 repos (registry push token, ArgoCD token, Vault token). Use Repository Secrets only for service-specific values. Never hardcode tokens in workflow YAML — even if the repo is private. Use OIDC-based authentication where possible (GitHub OIDC → AWS/GCP avoids storing cloud credentials entirely).
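For the OIDC route, a job that assumes an AWS role without any stored cloud credentials could be sketched like this. The role ARN and region are placeholders, and the IAM role is assumed to already trust GitHub's OIDC provider for this repo.

```yaml
# OIDC-based cloud auth (sketch; role ARN and region are placeholders)
jobs:
  cloud-auth-check:
    runs-on: ubuntu-latest
    permissions:
      id-token: write     # required to request the OIDC token
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy   # placeholder ARN
          aws-region: eu-west-1
      - run: aws sts get-caller-identity    # confirms which role the job assumed
```

The credentials issued this way are short-lived, so there is nothing long-lived to rotate or leak from the repo's secret store.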

13

Environment Strategy

dev
Development

Auto-sync on every push to main. Shared by all developers. Frequent deploys (10–30/day across 20 services). Relaxed resource limits. Mock external services. Ephemeral — can be torn down and rebuilt.

staging
Staging

Auto-sync, but a deploy is triggered only after the integration test suite passes. Mirrors production config as closely as possible. Used for QA sign-off, performance testing. Real (non-prod) external service credentials where possible.

production
Production

Manual ArgoCD sync trigger only. Requires PR approval in GitOps repo before sync. PodDisruptionBudgets active. Multiple replicas. Full observability stack. Blue/green or canary for high-risk services.

DR
Disaster Recovery

Warm standby cluster (or passive) in second region. ArgoCD replicates prod state. RTO/RPO targets drive the cluster sizing. At 20 services, active-active DR may be overkill — warm standby is often sufficient.

Ephemeral environments for PRs: optional but high-value
GitHub Actions can spin up a temporary Kubernetes namespace for each open PR using vcluster or a Helm + namespace-per-PR pattern. The PR gets a real URL to test against. Destroyed on PR close. This eliminates the "works on dev, fails on staging" problem. At 20 services, this may be scoped to only the most critical services (api-gateway, auth-service) rather than all 20.
14

Security & Compliance

15

Tool Selection Summary

Layer | Tool | Why This, Not Alternatives
Source Control | GitHub (20 repos + 1 gitops) | Native Actions integration. Org-level policy enforcement. OIDC for cloud auth.
CI Platform | GitHub Actions | Zero-friction GitHub integration. Reusable workflows for 20-repo governance. Pay-per-minute, no infra to manage.
Pre-commit | pre-commit framework | Language-agnostic. Community hook ecosystem. Identical gates locally and in CI.
SAST | Semgrep / CodeQL | CodeQL is GitHub-native (free on public repos). Semgrep for custom rules. Both integrate with GitHub Security tab.
SCA / CVE | Trivy | Single tool for dependency AND image scanning. Fast. OSS. Excellent GitHub Actions support.
Container Registry | ghcr.io | Same auth as GitHub. Free for private at org scale. No extra service.
Image Signing | Cosign (Sigstore) | Keyless signing via OIDC. No private key management. Growing standard.
Package Manager | Helm 3 | Templating at env scale. Library charts for standards. ArgoCD native. Kustomize for 3rd-party patches only.
GitOps CD | ArgoCD | UI for visibility at 20 services. Multi-cluster. App-of-Apps for 20+ app management. Drift detection.
Runtime | Kubernetes | Industry standard. Helm + ArgoCD ecosystem. HPA + KEDA for autoscaling.
Metrics | Prometheus + Grafana | De facto standard. kube-prometheus-stack Helm chart sets everything up.
Logs | Grafana Loki | 10× cheaper than ELK. Grafana native. Indexes labels only, not log content. Scales well at this service count.
Traces | OpenTelemetry + Tempo | OTel is the vendor-neutral standard. Tempo integrates in Grafana for unified observability.
Alerting | Alertmanager → PagerDuty/Slack | Bundled with Prometheus. Routing logic keeps it simple. PagerDuty for on-call, Slack for awareness.
Secrets | External Secrets Operator + Vault | No secrets in Git. GitOps-compatible. Supports rotation. Works with AWS/GCP/Vault.
Security Runtime | Falco | eBPF-based runtime threat detection. Minimal overhead. Alerts on anomalous container behaviour.
16

Starter Repository

All the CI/CD patterns, Docker configurations, pipeline scripts, and Docker Compose setups referenced in this handbook have a concrete starting point. Use this reference repository to bootstrap your pipeline implementation:

🚀

DevOps Playbook — CI/CD Reference Implementation

A comprehensive starter repository containing CI scripts, CD pipelines, Docker Compose configurations, GitHub Actions workflows, and Docker best practices. Covers the full pipeline from local development through to production deployment — aligned with the architecture decisions in this handbook.

Includes: Docker multi-stage build patterns · GitHub Actions CI template · CD deploy scripts · Docker Compose for local dev · Pre-commit hook configurations

→ vivek-doshi.github.io/devops-playbook/
GitHub Actions · Docker · CI Scripts · CD Scripts · Docker Compose · Pre-commit · Helm
How to use this handbook + the starter repo together
This handbook answers "what should I build and why." The starter repo answers "here is working code to start from." Apply the architectural decisions in §03–§14 to the repository patterns in the starter repo. The two are designed to complement each other: strategic decisions here, tactical implementation there.
DevOps Architecture Handbook v1.0 · 20 Microservices Pattern · Based on CNCF, Google SRE, and DORA industry standards