DevOps Architecture for 20 Microservices
A complete decision framework — not code, just architecture. Every tool choice is justified. Every trade-off is explained. Designed for senior engineers who want to know why, not just what.
System Landscape — The 20 Services
At 20 services, you are in the "medium microservices" zone. Large enough that manual deploys are impossible, small enough that you do not need a platform engineering team of 10. The right architecture needs to be automatable, self-service, and auditable. Below is a representative domain decomposition.
The Full Pipeline — Every Stage Justified
Here is the complete flow a code change travels, from a developer's workstation to production. Each stage is a gate. Nothing proceeds without passing the previous one.
Stage-by-stage rationale
Local Dev → pre-commit
Pre-commit hooks catch the cheap problems (trailing whitespace, secrets in code, linting errors) before they ever hit CI. At 20 services, CI minutes add up and CI feedback takes minutes, whereas pre-commit is free and runs in seconds. Use the pre-commit framework (Python-based, language-agnostic). Hooks to include: detect-secrets, hadolint (Dockerfile linting), gitleaks (credential scanning), and language-specific formatters (black, gofmt, prettier). Every engineer installs the hooks after cloning; a Makefile target and the onboarding docs make this a one-step setup. Because local hooks can be skipped, the same hooks run again in CI as a safety net.
CI — Build + Test
GitHub Actions is the natural choice because your 20 repos already live in GitHub. No context-switching, no extra auth, and GitHub-hosted runners are cheap for the volume a 20-service system generates. Each service repo contains its own .github/workflows/ci.yml. Build steps: compile (or lint for interpreted), unit tests, integration tests against ephemeral containers (use services: in the workflow to spin up a Postgres or Redis for the duration of the job). Fail fast — tests before images, never the reverse.
CI — Security Scanning
Two mandatory scan types in CI: (a) SAST — static code analysis using language-specific tools (Semgrep for Python/JS/Go, SpotBugs for Java, or GitHub's CodeQL which is built-in). (b) SCA — software composition analysis to catch vulnerable dependencies using Trivy (open source, fast, multi-ecosystem). Trivy also scans the built Docker image for OS-level CVEs. Thresholds: fail CI on CRITICAL, warn on HIGH, report but do not block on MEDIUM. These thresholds are encoded in the CI YAML — not in developer heads.
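The threshold policy above can be sketched as a small gate function. This is a minimal illustration, not the scanner's own logic: it assumes an earlier CI step has already reduced the scan report to a flat list of severity strings, and the names `evaluate_scan`, `BLOCKING`, and `WARNING` are hypothetical.

```python
from collections import Counter

# Policy from the text: block on CRITICAL, warn on HIGH, only report the rest.
BLOCKING = {"CRITICAL"}
WARNING = {"HIGH"}


def evaluate_scan(severities):
    """Return (passed, summary) for a list of finding severities.

    `severities` is a hypothetical flat list such as ["HIGH", "MEDIUM"],
    extracted from a scanner report by an earlier pipeline step.
    """
    counts = Counter(s.upper() for s in severities)
    blocking = sum(counts[s] for s in BLOCKING)
    warnings = sum(counts[s] for s in WARNING)
    summary = {
        "blocking": blocking,
        "warnings": warnings,
        "reported": sum(counts.values()) - blocking - warnings,
    }
    # The build fails only when at least one blocking finding exists.
    return blocking == 0, summary
```

Keeping the policy in one small, versioned function (or the equivalent scanner flags) is what makes the thresholds reviewable rather than tribal knowledge.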
Image Push — Tag Strategy + Signing
Build the container image and push to an OCI-compliant registry (GitHub Container Registry, ghcr.io, is the default: Actions can push with the workflow's built-in GITHUB_TOKEN once it is granted packages: write, so no extra credentials are needed). Tag with the Git commit SHA (for example sha-a1b2c3d, abbreviated here); never deploy latest in production. Also tag with the semantic version if the build is release-triggered (v1.4.2). Sign the image using Cosign (Sigstore), which creates a tamper-evident provenance record. The cluster can then be configured to verify signatures at admission, before an image is allowed to run.
GitOps Update — Automated PR to Config Repo
After a successful image push, the CI pipeline opens a Pull Request in the gitops-config repo that bumps the image tag in the relevant Helm values file. This is the critical handoff between the CI world (code) and the CD world (desired state). The PR can be auto-merged for non-production environments (dev, staging) or require human approval for production. Never write directly to the GitOps repo's main branch from CI; always go via a PR for auditability.
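A minimal sketch of the bump step, assuming the values file carries a single `tag:` key (as in the `image.tag` comments of the GitOps repo layout) and that environments are named dev, staging, and prod. The function names are hypothetical, and a real pipeline would more likely use a YAML-aware tool than a regular expression.

```python
import re

# Matches the first "tag: <value>" line, keeping its indentation and key intact.
TAG_LINE = re.compile(r"^([ \t]*tag:[ \t]*).*$", re.MULTILINE)


def bump_image_tag(values_text, new_tag):
    """Return values_text with the first `tag:` value replaced by new_tag."""
    updated, count = TAG_LINE.subn(lambda m: m.group(1) + new_tag, values_text, count=1)
    if count != 1:
        raise ValueError("no 'tag:' key found in values file")
    return updated


def needs_human_approval(environment):
    """Production PRs wait for a reviewer; dev and staging may auto-merge."""
    return environment == "prod"
```

The CI job would commit the returned text on a branch and open the PR; whether that PR auto-merges is decided by `needs_human_approval`, mirroring the policy in the paragraph above.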
ArgoCD Sync — Reconciliation Loop
ArgoCD watches the gitops-config repo. When the image tag in Git changes, it detects drift between desired state (Git) and actual state (Kubernetes) and reconciles by applying the updated Helm chart. In production, leave the automated sync policy off so that a human triggers each sync; selfHeal only has an effect when automated sync is enabled. In dev/staging, enable automated sync with selfHeal: true for fully hands-free deploys. By default ArgoCD polls the repo roughly every 3 minutes.
Kubernetes — Rolling Deploy
Kubernetes applies the new manifest with a rolling update strategy, replacing pods in small batches and routing traffic to new pods only after their readiness probes pass. This means zero downtime for well-configured services. Health checks must be correctly defined: livenessProbe (is the pod stuck?) and readinessProbe (is the pod ready to receive traffic?). For stateful services (order-service) consider canary deployments using Argo Rollouts.
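To make "small batches" concrete, the arithmetic behind a Deployment's rolling update can be worked through as follows. The sketch assumes the Kubernetes defaults of maxSurge 25% and maxUnavailable 25%, with percentages rounded up for surge and down for unavailability; `rollout_bounds` is a hypothetical helper name.

```python
def rollout_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Return the pod-count envelope Kubernetes keeps during a rolling update."""
    # maxSurge percentages round up; maxUnavailable percentages round down.
    surge = -(-replicas * max_surge_pct // 100)
    unavailable = replicas * max_unavailable_pct // 100
    return {
        "min_available": replicas - unavailable,
        "max_total": replicas + surge,
    }
```

For 4 replicas this gives at least 3 available and at most 5 pods in total during the rollout. For a single replica the update still surges to 2 pods, but nothing protects that one pod from a crash or node drain outside a rollout.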
Observability → Alerts
Prometheus scrapes metrics from all pods. Grafana dashboards visualise the golden signals: latency, traffic, errors, saturation. Alertmanager routes to PagerDuty (on-call) for critical alerts and to Slack for warning-level. Loki ingests structured logs. Jaeger or Tempo handles distributed traces. Every service must emit the standard telemetry set — this is enforced via a shared Helm library chart (sidecar injector or SDK requirement).
Source Control — 20 Repos, Not a Monorepo
Polyrepo vs Monorepo decision
With 20 services across independent teams or domain boundaries, a polyrepo (one repo per service) is the right call. Here is why:
Independent release cadence
payment-service and notification-service do not need to be versioned together. In a monorepo, CI must first work out which services a change touches, and tooling choices are shared by every service. Independent repos mean independent pipelines, independent secrets, and independent access control.
Blast radius control
A misconfigured CI pipeline in user-service does not disable order-service's deploys. In a monorepo with shared CI config, a syntax error can block all 20 services simultaneously.
Shared library problem
Cross-cutting concerns (logging, auth middleware, tracing) should not be copy-pasted between repos. Solve this with an internal package registry (GitHub Packages) and a shared-libs repo with its own versioning. Services consume specific versions, so there is no implicit coupling.
GitHub Organisation structure
├── api-gateway # service repo
├── auth-service # service repo
├── user-service # service repo
├── ... # 17 more service repos
├── gitops-config # ★ THE gitops repo — no application code here
├── shared-libs # internal SDK, middleware, observability
└── infra-modules # Terraform modules for cloud infra (optional)
Branch Strategy — Trunk-Based Development
At 20 services, long-lived feature branches are poison. They create merge conflicts, delay integration feedback, and make the concept of "done" fuzzy. The industry-standard answer at this scale is Trunk-Based Development (TBD).
Why not GitFlow?
GitFlow (develop, release, hotfix branches) was designed for software released on a scheduled monthly or quarterly cycle. At 20 microservices deploying multiple times per day, GitFlow introduces merge complexity with no benefit. The develop branch becomes a buffer that delays integration. Trunk-Based Development with feature flags solves the "unfinished feature" problem without a long-lived branch.
Feature flags over long branches
For large, in-progress features: merge the code to main behind a feature flag (managed by config-service). The code ships but the feature is off. When ready, toggle the flag. This eliminates the "branch diverged for 3 weeks" problem entirely.
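A minimal sketch of code shipped dark behind a flag. It assumes the flags arrive as a plain dictionary (in the text they would be served by config-service); the flag name, the prices, and both function names are hypothetical.

```python
def is_enabled(flags, name):
    """Unknown flags count as off, so unfinished code stays dark by default."""
    return bool(flags.get(name, False))


def shipping_cost(order_total, flags):
    """New pricing rule is merged to main but only runs when the flag is on."""
    if is_enabled(flags, "free-shipping-over-50") and order_total >= 50:
        return 0.0
    # Existing behaviour, unchanged while the flag is off.
    return 4.99
```

Defaulting unknown flags to off is the design choice that makes merging to main safe: a missing or misspelled flag leaves the old behaviour in place.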
Commit convention: use Conventional Commits prefixes (feat:, fix:, chore:, ci:). This enables automated changelogs and semantic versioning triggers in CI. At 20 repos, a consistent convention is critical for understanding cross-service changes.
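As an illustration of a semantic versioning trigger, the sketch below maps commit messages to a version bump using the usual Conventional Commits reading: fix gives a patch, feat gives a minor, and a `!` marker or a BREAKING CHANGE footer gives a major. The function name and the exact regular expression are assumptions, not part of any specific release tool.

```python
import re

# type, optional (scope), optional "!" breaking marker, then ": description"
HEADER = re.compile(r"^(?P<type>[a-z]+)(?:\([^)]*\))?(?P<breaking>!)?: .+")
RANK = {None: 0, "patch": 1, "minor": 2, "major": 3}
TYPE_TO_BUMP = {"fix": "patch", "feat": "minor"}


def bump_for(commit_messages):
    """Return "major", "minor", "patch", or None for a list of commit messages."""
    best = None
    for message in commit_messages:
        header = message.split("\n", 1)[0]
        match = HEADER.match(header)
        if match is None:
            continue  # non-conforming commits do not trigger a release
        if match.group("breaking") or "BREAKING CHANGE:" in message:
            candidate = "major"
        else:
            candidate = TYPE_TO_BUMP.get(match.group("type"))
        if RANK[candidate] > RANK[best]:
            best = candidate
    return best
```

Commit types such as chore: and ci: return None here, meaning no release is cut for them.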
Should There Be a GitOps Repo? Yes — Mandatory
This is one of the most debated architectural decisions in microservice DevOps. The answer is unambiguous at 20 services: yes, you need a dedicated, separate GitOps repository.
Why separate from the service repos?
Separation of concerns
A developer making a code change to order-service should not need access to the deployment manifests for payment-service. Keeping config separate enforces a clear boundary: application code in service repos, desired infrastructure state in the GitOps repo.
Single source of truth for all environments
The gitops-config repo holds what is running in dev, staging, and prod — all in one place. Anyone can look at this repo and know exactly what version of every service is deployed where. No manual tracking, no wikis, no "check the CI logs."
Audit trail and compliance
Every change to production is a Git commit with author, timestamp, and message. This is your audit log. For financial services or regulated industries, this is not optional. A merge history in the GitOps repo is your deployment log.
Access control
Only senior engineers (plus the CI automation that opens PRs) need write access to the production paths in the GitOps repo; ArgoCD only needs to read the repo. Service developers get read access. This is impossible to enforce cleanly if configs live alongside application code in 20 separate repos.
GitOps repo structure
├── apps/ # ArgoCD Application manifests
│ ├── dev/ # one file per service
│ │ ├── api-gateway.yaml
│ │ ├── auth-service.yaml
│ │ └── ... (20 files)
│ ├── staging/
│ └── prod/
├── charts/ # Helm chart per service (or shared library chart)
│ ├── api-gateway/
│ │ ├── Chart.yaml
│ │ ├── templates/
│ │ └── values.yaml
│ └── ... (20 charts)
├── environments/ # Per-env overrides
│ ├── dev/
│ │ └── api-gateway.yaml # image.tag: sha-a1b2c3d
│ ├── staging/
│ └── prod/
│ └── api-gateway.yaml # image.tag: v1.4.2
└── bootstrap/ # ArgoCD App-of-Apps root
The "App of Apps" pattern
With 20 services, create a single ArgoCD Application in bootstrap/ that points to the apps/ directory. ArgoCD discovers and manages all 20 child applications from this one parent. Adding a new service is a two-step operation: add a Helm chart, then add an Application manifest per environment. No manual ArgoCD configuration.
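The child manifests under apps/ are repetitive enough to generate. The sketch below builds one ArgoCD Application as a dictionary and serialises it as JSON (which is also valid YAML). The repo URL, the `default` project, the `main` revision, the prod-to-production namespace mapping, and the relative values path are all assumptions based on the directory layout shown earlier.

```python
import json

# Directory names (dev/staging/prod) mapped to the namespaces used in the cluster.
NAMESPACES = {"dev": "dev", "staging": "staging", "prod": "production"}


def application(service, env, repo_url):
    """Build an ArgoCD Application for one service in one environment."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{service}-{env}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": "main",
                "path": f"charts/{service}",
                # Values file path is relative to the chart directory.
                "helm": {"valueFiles": [f"../../environments/{env}/{service}.yaml"]},
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": NAMESPACES[env],
            },
        },
    }


if __name__ == "__main__":
    app = application("api-gateway", "dev", "https://github.com/your-org/gitops-config.git")
    print(json.dumps(app, indent=2))
```

Writing the output to apps/dev/api-gateway.yaml is then the only ArgoCD-side step for a new service; the parent application in bootstrap/ picks it up on its next sync.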
CI — Build, Test, Scan
Why GitHub Actions (not Jenkins, not GitLab CI)?
Native GitHub integration
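Your 20 repos are on GitHub. Every workflow run receives a short-lived GITHUB_TOKEN automatically, so checking out code and posting status checks need no credential setup. That token is scoped to the repo the workflow runs in, so opening PRs in another repo (such as gitops-config) needs a GitHub App installation token or a fine-grained personal access token. Jenkins requires SSH keys, webhooks, and credential management for all of these operations.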
Reusable workflows
GitHub Actions supports reusable workflows — define the CI pattern once in a .github/workflows/ shared location and call it from all 20 service repos. This means updating the scan step in one place propagates to all services. Critical for 20-repo governance.
Vendor lock-in
GitHub Actions YAML is GitHub-specific. However, the actual build logic (Docker, shell scripts, Trivy) is portable. If you ever migrate off GitHub, the scripts move; only the orchestration YAML changes. This is an acceptable trade-off for the productivity gain.
CI workflow — per service repo
Triggers:
- push to any branch
- pull_request to main
Jobs (in order):
1. lint → pre-commit hooks run in CI (redundant safety net)
2. test → unit + integration tests (with service containers)
3. sast → Semgrep / CodeQL (parallel to test)
4. build-image → docker build (multi-stage, pinned base images)
5. scan-image → trivy image scan (fails on CRITICAL CVE)
6. push-image → ghcr.io/your-org/service:sha-XXXX (on main only)
7. sign-image → cosign sign (keyless, OIDC-based)
8. update-gitops → open PR in gitops-config repo (on main only)
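The job order above is really a dependency graph: sast can run alongside test, and everything from build-image onward is sequential. The sketch below expresses that graph and derives which jobs can run together, using the standard library's graphlib (Python 3.9+). Making sast and test both depend on lint is an assumption; the text only says they run in parallel.

```python
from graphlib import TopologicalSorter

# job -> set of jobs it needs (mirrors `needs:` in a workflow file)
JOB_NEEDS = {
    "lint": set(),
    "test": {"lint"},
    "sast": {"lint"},
    "build-image": {"test", "sast"},
    "scan-image": {"build-image"},
    "push-image": {"scan-image"},
    "sign-image": {"push-image"},
    "update-gitops": {"sign-image"},
}


def execution_waves(needs):
    """Group jobs into waves; jobs in the same wave can run in parallel."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

The second wave contains both sast and test, which is where the pipeline saves wall-clock time; no image is built until both have passed.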
Reusable workflow pattern
Create a shared-ci-workflows repo (or use the shared-libs repo) that contains the canonical workflow YAML. Each service repo references it with:
jobs:
  ci:
    uses: your-org/shared-ci-workflows/.github/workflows/service-ci.yml@main
    with:
      service-name: 'api-gateway'
      language: 'python'
This pattern means all 20 services get security scan upgrades, new lint rules, or changed image tag strategies by merging one PR in one repo.
Image Push — Registry Choice and Tagging
Registry: GitHub Container Registry (ghcr.io)
| Option | Why / Why Not | Verdict |
|---|---|---|
| ghcr.io | Native GitHub auth, free for public, integrated with Actions OIDC, same access model as repos, no extra service to manage | ✓ Default choice |
| ECR (AWS) | Excellent if you run on EKS, tighter IAM integration, but requires OIDC federation setup and couples you to AWS | Use if on EKS |
| Artifact Registry (GCP) | Best choice if running on GKE, Workload Identity makes authentication clean | Use if on GKE |
| Docker Hub | Rate limits on pulls (100/6h unauthenticated), privacy limitations on free tier, no OIDC auth from Actions without token management | ✗ Avoid for prod |
| Harbor (self-hosted) | Full control, built-in vulnerability scanning, good for air-gapped or compliance-heavy environments. Adds operational overhead. | Regulated only |
Image tagging convention
Development: ghcr.io/org/api-gateway:sha-a1b2c3d
Staging: ghcr.io/org/api-gateway:sha-a1b2c3d (same SHA, different values.yaml)
Production: ghcr.io/org/api-gateway:v1.4.2 (semantic version on release tag)
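# Why SHA not semver for dev/staging?
# A SHA tag maps to exactly one commit and, by convention, is never reused.
# Tags are still mutable in the registry; pin the image digest (@sha256:...) where strict immutability is required.
# Stable image references = reproducible deploys = reliable rollbacks.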
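The tagging convention can be sketched as a function that CI calls before pushing. It assumes the 7-character short SHA shown in the examples and a `vMAJOR.MINOR.PATCH` release tag; the function name and the validation rules are hypothetical.

```python
import re

GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")
SEMVER_TAG = re.compile(r"^v\d+\.\d+\.\d+$")


def image_tags(registry, org, service, git_sha, release_tag=None):
    """Return the full image references CI should push for this build."""
    if not GIT_SHA.match(git_sha):
        raise ValueError(f"not a git SHA: {git_sha!r}")
    repo = f"{registry}/{org}/{service}"
    # Every build gets a SHA tag; it is what dev and staging deploy.
    tags = [f"{repo}:sha-{git_sha[:7]}"]
    if release_tag is not None:
        if not SEMVER_TAG.match(release_tag):
            raise ValueError(f"not a semantic version tag: {release_tag!r}")
        # Release builds additionally get the version tag used in production.
        tags.append(f"{repo}:{release_tag}")
    return tags
```

Note that `latest` is never produced, which keeps the "never latest in production" rule out of reach of human error.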
Image signing with Cosign
Every pushed image is signed using Cosign in keyless mode (using the GitHub Actions OIDC token, so there is no private key to manage). The signature is stored in the same registry alongside the image. Enforcement typically sits at admission time in the cluster: an admission controller such as Kyverno or the Sigstore policy-controller can reject unsigned images, so anything ArgoCD syncs is checked before it runs. This closes the supply chain loop: you can prove that an image running in prod was built by your CI, not pushed manually by a human with registry write access.
Why Helm — Not Raw Manifests, Not Kustomize
This question comes up constantly. Here is the honest trade-off analysis for 20 services across multiple environments.
Templating is mandatory at 20 services
You have 20 services × 3 environments = 60 combinations. Without templating, you have 60 sets of YAML files, and a change to the liveness probe timeout means editing all 60. With Helm and a shared library chart, the same change is one edit to the chart default; per-environment values files carry only what actually differs.
Helm Library Charts
Create a common library Helm chart that encodes your organisational defaults: resource limits, security context, probe paths, pod disruption budgets, network policies. Each service chart depends on this library. When standards change, one PR to the library chart propagates through all 20 services on next deploy.
ArgoCD native support
ArgoCD understands Helm natively. It renders Helm templates server-side, tracks drift correctly, and shows the diff between current and desired state in its UI. The integration is first-class and requires no shims.
Why not Kustomize?
Kustomize is excellent for patching existing manifests you do not control. For your own services, Kustomize's overlay model becomes hard to reason about at scale — you end up with bases, overlays, and patches that interact in non-obvious ways. Helm's values hierarchy is more explicit for environment differentiation across 20 services.
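Helm's values hierarchy is easy to reason about because it is a plain layered merge: maps merge key by key, while scalars and lists from the later layer replace earlier ones. The sketch below mimics that behaviour for a chart default plus an environment override. It is an approximation for illustration (for example, it does not model Helm's null-deletes-key rule), and `merge_values` is a hypothetical name.

```python
import copy


def merge_values(base, override):
    """Layer `override` on top of `base` the way Helm layers values files."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Nested maps merge recursively.
            result[key] = merge_values(result[key], value)
        else:
            # Scalars and lists are replaced wholesale.
            result[key] = copy.deepcopy(value)
    return result
```

A prod override of `{"image": {"tag": "v1.4.2"}, "replicas": 3}` changes only those two values and leaves every other chart default untouched, which is why the per-environment files can stay a few lines long.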
Helm chart structure per service
├── Chart.yaml # version, dependencies (→ common library chart)
├── values.yaml # sane defaults for all envs
└── templates/
    ├── deployment.yaml # uses library chart's deployment helper
    ├── service.yaml
    ├── ingress.yaml
    ├── hpa.yaml # Horizontal Pod Autoscaler
    ├── pdb.yaml # Pod Disruption Budget (always include)
    └── networkpolicy.yaml # default-deny, explicit allows
Why ArgoCD — Not Flux, Not Spinnaker, Not Raw kubectl
GitOps-native with a real UI
ArgoCD was purpose-built for GitOps. It continuously watches a Git repo and reconciles Kubernetes state to match. The web UI shows the health of all 20 applications, drift detection, sync status, and the rendered YAML diff — all without kubectl. For a team managing 20 services, this visibility is invaluable.
Multi-cluster support
ArgoCD can deploy to multiple Kubernetes clusters from one control plane. When you add a DR cluster or a second region, you add it as a cluster target in ArgoCD — no new CD infrastructure needed. Critical for production resilience at this service count.
Why not Flux v2?
Flux is architecturally purer (no UI, fully CLI-driven) and has a smaller attack surface. However, at 20 services, the lack of a built-in UI is a real operational cost — you need to either build dashboards or accept reduced visibility. ArgoCD's UI pays for its complexity in this use case. If you have a platform team that prefers CLI and GitOps purity, Flux is a valid alternative.
Why not Spinnaker?
Spinnaker is a full CD platform with approval gates, canary analysis, and multi-cloud deployment. It is significantly more complex to operate, typically requiring a dedicated team. For 20 microservices without dedicated platform engineering, the operational overhead of Spinnaker outweighs its benefits. ArgoCD + Argo Rollouts covers most of what Spinnaker offers at a fraction of the operational complexity.
ArgoCD configuration decisions
| Setting | Dev | Staging | Production | Reason |
|---|---|---|---|---|
| Sync Policy | Auto | Auto | Manual | Prod deploys need a human trigger — auto-sync in prod removes the last human gate |
| selfHeal | true | true | false | In prod, manual kubectl changes should drift-alert but not be auto-overwritten |
| Prune | true | true | Reviewed | Pruning in prod (deleting old resources) should be a conscious action |
| Sync Timeout | 5m | 10m | 10m | Longer in prod to handle slow database migrations or init containers |
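The Sync Policy, selfHeal, and Prune rows of the table translate into the `syncPolicy` section of each Application. A minimal sketch, assuming environments are named dev, staging, and prod: production simply omits the `automated` block, which is how manual sync is expressed. The function name is hypothetical, and the timeout row is not modelled here.

```python
def sync_policy(environment):
    """Return the `syncPolicy` section for an ArgoCD Application."""
    if environment == "prod":
        # No `automated` block: drift is reported, but a human triggers the sync.
        return {}
    if environment in ("dev", "staging"):
        return {"automated": {"prune": True, "selfHeal": True}}
    raise ValueError(f"unknown environment: {environment!r}")
```

Because selfHeal and prune live inside `automated`, leaving that block out in production covers all three rows at once.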
Kubernetes Setup for 20 Services
Namespace strategy — per-environment, not per-service
dev # all 20 services in dev
staging # all 20 services in staging
production # all 20 services in prod
monitoring # Prometheus, Grafana, Loki, Jaeger
argocd # ArgoCD control plane itself
ingress-nginx # ingress controller
cert-manager # TLS certificate management
Resource management — mandatory for 20 services
Every pod must have resource requests and limits defined. Without them, a single runaway service (memory leak in queue-worker) can OOM-kill pods across the entire node, taking unrelated services down. Enforce via a ValidatingAdmissionWebhook or via OPA/Gatekeeper policy.
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m # CPU limit prevents noisy neighbours
    memory: 512Mi # Memory limit triggers OOMKill before node pressure
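The rule an admission policy enforces here is simple enough to sketch: every container must declare all four values. The function below models that check on a container spec represented as a dictionary; it illustrates the rule only and is not Gatekeeper or webhook code. The names are hypothetical.

```python
REQUIRED = (
    ("requests", "cpu"),
    ("requests", "memory"),
    ("limits", "cpu"),
    ("limits", "memory"),
)


def missing_resources(container):
    """Return the resource fields a container spec fails to declare."""
    resources = container.get("resources") or {}
    return [
        f"{section}.{name}"
        for section, name in REQUIRED
        if not (resources.get(section) or {}).get(name)
    ]
```

An empty result means the pod would be admitted; anything else is the rejection message the developer should see.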
Autoscaling strategy
- Use Horizontal Pod Autoscaler (HPA) on CPU/memory for stateless services (api-gateway, user-service)
- Use KEDA (Kubernetes Event-Driven Autoscaling) for queue-worker — scale on Kafka consumer lag, not CPU
- Use Vertical Pod Autoscaler (VPA) in recommendation mode to tune resource requests over time
- Do NOT let HPA and VPA (in auto mode) act on the same CPU or memory metric for one deployment, because they will fight each other
- Set minReplicas: 2 in production for all services, since a single replica leaves no redundancy when a pod crashes or its node is drained
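The scaling decision the HPA makes can be worked through by hand. The sketch below applies the documented core formula, desired = ceil(current replicas × current metric / target metric), and then clamps to the configured bounds. It deliberately leaves out the tolerance band and stabilisation window, and the default bounds are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10):
    """Core HPA formula, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas at 90% average CPU against a 60% target, the result is 6 replicas. The same formula applies to KEDA-style metrics such as consumer lag per replica, only the metric source differs.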
Observability & Alerts — The Three Pillars
Prometheus + Grafana
Prometheus scrapes metrics from all pods via /metrics endpoints. The kube-prometheus-stack Helm chart installs everything: Prometheus, Alertmanager, node exporters, kube-state-metrics, and a default Grafana dashboard. Golden signals per service: request rate, error rate, P95 latency, saturation.
Loki + Promtail
Loki is the cost-effective choice for log aggregation at this scale. Unlike Elasticsearch, Loki does not index log content — it indexes only metadata labels (service, pod, namespace). Logs are compressed and stored cheaply. Promtail runs as a DaemonSet and ships logs to Loki. Grafana queries both Prometheus and Loki in the same dashboard.
OpenTelemetry + Tempo
Instrument services with the OpenTelemetry SDK (language-specific). The OTel Collector aggregates spans and exports to Grafana Tempo. Traces allow you to follow a request across all the services it touches — essential for debugging latency in a 20-service system where a slow database call in user-service cascades to order-service.
Alert routing policy
CRITICAL → PagerDuty (on-call)
- Service error rate > 5% for 5 minutes
- P99 latency > 10s for 5 minutes
- Pod crash-looping (>3 restarts in 15m)
- ArgoCD sync failed in production
WARNING → Slack #alerts-staging (notify, don't wake)
- Memory usage > 80% of limit
- HPA at max replicas
- Certificate expiry < 14 days
INFO → Slack #deploys (FYI only)
- Successful deploy
- ArgoCD sync complete
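Two ideas in this policy are worth making explicit: a condition must hold for a sustained window before it fires (the "for 5 minutes" part), and severity alone decides the destination. The sketch below models both. The receiver strings and function names are hypothetical, and samples are assumed to arrive one per minute.

```python
ROUTES = {
    "critical": "pagerduty",
    "warning": "slack:#alerts-staging",
    "info": "slack:#deploys",
}


def sustained_breach(samples, threshold, window):
    """True when the last `window` samples all exceed `threshold`."""
    if window <= 0 or len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])


def route(severity):
    """Map an alert severity to its receiver."""
    try:
        return ROUTES[severity.lower()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

A single spike above 5% does not page anyone; five consecutive one-minute samples above it do. That window is what keeps the critical route quiet enough to be trusted.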
Secrets Management
The rule: secrets never go in Git
Not in the service repos. Not in the GitOps repo. Not as base64 in Kubernetes Secrets (base64 is encoding, not encryption). The GitOps repo contains only references to secrets, not the values.
ESO + HashiCorp Vault or AWS Secrets Manager
Install the External Secrets Operator in Kubernetes. Define ExternalSecret resources in your GitOps repo — these reference a secret path in Vault or AWS Secrets Manager, not the value itself. ESO syncs the actual value into a Kubernetes Secret at deploy time. Git contains the reference (safe). The value lives in Vault (safe). The K8s Secret lives only in the cluster.
Bitnami Sealed Secrets
Encrypt secrets with a cluster-specific public key. Store the encrypted SealedSecret in Git (safe — only the cluster can decrypt). Simpler than ESO but couples secrets to the cluster — if you lose the cluster's private key, secrets are irrecoverable. Use ESO for production, Sealed Secrets for dev environments.
CI secrets (GitHub Actions)
Use GitHub Org Secrets for secrets shared across all 20 repos (registry push token, ArgoCD token, Vault token). Use Repository Secrets only for service-specific values. Never hardcode tokens in workflow YAML — even if the repo is private. Use OIDC-based authentication where possible (GitHub OIDC → AWS/GCP avoids storing cloud credentials entirely).
Environment Strategy
Development
Auto-sync on every push to main. Shared by all developers. Frequent deploys (10–30/day across 20 services). Relaxed resource limits. Mock external services. Ephemeral — can be torn down and rebuilt.
Staging
Auto-sync, but deploy triggers require passing integration test suite. Mirrors production config as closely as possible. Used for QA sign-off, performance testing. Real (non-prod) external service credentials where possible.
Production
Manual ArgoCD sync trigger only. Requires PR approval in GitOps repo before sync. PodDisruptionBudgets active. Multiple replicas. Full observability stack. Blue/green or canary for high-risk services.
Disaster Recovery
Warm standby (or passive) cluster in a second region. ArgoCD applies the same GitOps state to it, so the standby tracks prod. RTO/RPO targets drive the cluster sizing. At 20 services, active-active DR may be overkill; warm standby is often sufficient.
Security & Compliance
- Supply chain security: Cosign image signing + SLSA Level 2 provenance via GitHub Actions. Verify signatures at admission time so that nothing ArgoCD syncs can run unsigned.
- Network policies: Default-deny all ingress/egress per namespace. Explicit allows only. payment-service has the most restrictive policy — only api-gateway may call it.
- Pod Security Standards: Enforce the restricted profile via Kubernetes Pod Security Admission. No privileged containers, no hostPath mounts, read-only root filesystem where possible.
- RBAC: ArgoCD uses a dedicated ServiceAccount with minimal permissions per namespace. Developers get read-only kubectl access to prod, full access to dev.
- Dependency updates: Renovate Bot runs weekly PRs for dependency bumps across all 20 repos. Auto-merge on passing CI for patch versions. Manual review for minor/major.
- Runtime security: Falco as a DaemonSet. Detects anomalous container behaviour (unexpected outbound connections, privilege escalation attempts) at runtime.
- payment-service isolation: If PCI-scoped, isolate to a dedicated node pool with a taint. Only payment pods tolerate this taint. Reduces PCI audit scope significantly.
Tool Selection Summary
| Layer | Tool | Why This, Not Alternatives |
|---|---|---|
| Source Control | GitHub (20 repos + 1 gitops) | Native Actions integration. Org-level policy enforcement. OIDC for cloud auth. |
| CI Platform | GitHub Actions | Zero-friction GitHub integration. Reusable workflows for 20-repo governance. Pay-per-minute, no infra to manage. |
| Pre-commit | pre-commit framework | Language-agnostic. Community hook ecosystem. Identical gates locally and in CI. |
| SAST | Semgrep / CodeQL | CodeQL is GitHub-native (free on public repos). Semgrep for custom rules. Both integrate with GitHub Security tab. |
| SCA / CVE | Trivy | Single tool for dependency AND image scanning. Fast. OSS. Excellent GitHub Actions support. |
| Container Registry | ghcr.io | Same auth as GitHub. Free for public images. No extra service to manage. |
| Image Signing | Cosign (Sigstore) | Keyless signing via OIDC. No private key management. Growing standard. |
| Package Manager | Helm 3 | Templating at env scale. Library charts for standards. ArgoCD native. Kustomize for 3rd-party patches only. |
| GitOps CD | ArgoCD | UI for visibility at 20 services. Multi-cluster. App-of-Apps for 20+ app management. Drift detection. |
| Runtime | Kubernetes | Industry standard. Helm + ArgoCD ecosystem. HPA + KEDA for autoscaling. |
| Metrics | Prometheus + Grafana | De facto standard. kube-prometheus-stack Helm chart sets everything up. |
| Logs | Grafana Loki | Typically much cheaper to run than ELK because it indexes only labels, not log content. Grafana native. Scales well at this service count. |
| Traces | OpenTelemetry + Tempo | OTel is the vendor-neutral standard. Tempo integrates in Grafana for unified observability. |
| Alerting | Alertmanager → PagerDuty/Slack | Bundled with Prometheus. Routing logic keeps it simple. PagerDuty for on-call, Slack for awareness. |
| Secrets | External Secrets Operator + Vault | No secrets in Git. GitOps-compatible. Supports rotation. Works with AWS/GCP/Vault. |
| Security Runtime | Falco | eBPF-based runtime threat detection. Minimal overhead. Alerts on anomalous container behaviour. |
Starter Repository
All the CI/CD patterns, Docker configurations, pipeline scripts, and Docker Compose setups referenced in this handbook have a concrete starting point. Use this reference repository to bootstrap your pipeline implementation:
DevOps Playbook — CI/CD Reference Implementation
A comprehensive starter repository containing CI scripts, CD pipelines, Docker Compose configurations, GitHub Actions workflows, and Docker best practices. Covers the full pipeline from local development through to production deployment — aligned with the architecture decisions in this handbook.
Includes: Docker multi-stage build patterns · GitHub Actions CI template · CD deploy scripts · Docker Compose for local dev · Pre-commit hook configurations
→ vivek-doshi.github.io/devops-playbook/