Production-Grade DevOps & Platform from Day One
A battle-tested, opinionated reference kit for cloud-native teams — covering CI/CD, Kubernetes, security, observability, FinOps, and SRE practices with copy-ready templates and clear guardrails.
Implementation reference: https://vivek-doshi.github.io/devops-playbook/
Quick Start
From zero to first PR in under an hour. Follow this sequence — no archaeology of the repo required.
Check Prerequisites
Run the environment checker to verify your local toolchain is complete before touching anything else.
bash scripts/env-checker.sh
Requires: Git 2.40+, Docker Desktop 4.x, kubectl 1.29+, kind 0.24+, Helm 3.14+, Terraform 1.7+, pre-commit 3.x
Install Git Hooks
Hooks catch secrets, IaC formatting errors, and code quality issues before they reach CI — making feedback loops faster and CI less noisy.
make hooks
# equivalent to:
pre-commit install
pre-commit install --hook-type pre-push
Start Local Kubernetes Cluster
A kind (Kubernetes in Docker) cluster that mirrors the production overlay structure — with a local registry, ingress-nginx, and a dev namespace ready.
make dev
# Creates kind cluster "devops-playbook"
# Starts local registry at localhost:5001
# Installs ingress-nginx via Helm
# Applies dev Kustomize overlay
Pick Your Golden Path
Each path is an opinionated, end-to-end workflow with exact file references, guardrails, and validation steps baked in.
Open a PR
Branch names follow type/description. Commits must follow Conventional Commits — the pre-push hook will reject non-conforming messages.
git checkout -b feat/add-service-name
make lint # run all pre-commit hooks
git commit -m "feat(api): add health endpoint"
git push origin feat/add-service-name
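To see what the branch and commit rules above look like in practice, here is a minimal Python sketch of the two checks. It is not the repo's hook implementation: the list of allowed types and both regular expressions are assumptions based on the common Conventional Commits type set, and the function names are illustrative.

```python
import re

# Assumed type list; the repo's hook configuration may allow a different set.
TYPES = ("feat", "fix", "docs", "style", "refactor", "perf",
         "test", "build", "ci", "chore", "revert")
TYPE_ALT = "|".join(TYPES)

# type/description, with a lowercase, hyphen-separated description (assumed style).
BRANCH_RE = re.compile(r"(?:" + TYPE_ALT + r")/[a-z0-9]+(?:-[a-z0-9]+)*")

# type(optional-scope)!: subject, following the Conventional Commits header shape.
COMMIT_RE = re.compile(r"(?:" + TYPE_ALT + r")(?:\([a-z0-9-]+\))?!?: \S.*")


def is_valid_branch(name: str) -> bool:
    """Return True when the branch name looks like type/description."""
    return BRANCH_RE.fullmatch(name) is not None


def is_valid_commit_subject(subject: str) -> bool:
    """Return True when the first commit line is a Conventional Commits header."""
    return COMMIT_RE.fullmatch(subject) is not None
```

With these assumptions, `feat/add-service-name` and `feat(api): add health endpoint` pass, while a subject such as `added health endpoint` is rejected.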
Repository Structure
Organized by functional concern first, then by platform and technology — so you ask "I need to deploy to AKS" and navigate directly there, not "what type of file is this".
| Directory | Purpose | Key files |
|---|---|---|
| docker/ | Multi-stage Dockerfiles for every major stack (.NET, Python, Node.js, Java, Go, Ruby, React, Angular) | Dockerfile.api, Dockerfile.worker, security-hardened.Dockerfile |
| compose/ | Docker Compose stacks for local development with realistic service topologies | microservices-example/, python-postgres-redis/ |
| ci/ | CI pipeline templates across GitHub Actions, Azure Pipelines, GitLab CI, Jenkins, covering all major stacks | _shared/reusable-*.yml, _strategies/ |
| cd/ | CD manifests, GitOps definitions, Helm charts, Kustomize overlays, deployment targets | kubernetes/_base/, gitops/argocd/, targets/ |
| terraform/ | IaC blueprints for EKS, AKS, GKE, ECS, Lambda with bootstrap modules | _bootstrap/, aws-eks/backup.tf |
| ci-security/ | Security scanning integrations: SAST, container scan, secret detection, IaC scan, dependency audit | trivy-scan.yml, gitleaks.yml, checkov.yml |
| secops/ | Runtime security, supply chain controls, compliance control libraries, incident runbooks | runbooks/, compliance/, supply-chain/ |
| policy/ | Kyverno admission policies and Conftest/OPA rules enforced at cluster level | kyverno/require-*.yaml, conftest/kubernetes/ |
| secrets/ | External Secrets Operator configs, rotation workflows, lifecycle guides | external-secrets/, rotation/, guides/ |
| observability/ | Prometheus, Loki, Tempo, OpenTelemetry: full telemetry stack with SLOs and dashboards | prometheus/slos/, prometheus/recording-rules/ |
| finops/ | Cost monitoring, Kyverno cost policies, Infracost CI, rightsizing scripts, 9 Grafana dashboards | scripts/analyze-rightsizing.py, policies/, dashboards/ |
| catalog/ | Git-native service and team registry with CI validation and CODEOWNERS generation | schema/service.yaml, scripts/validate-catalog.py |
| backup/ | Velero cluster backup and DB PITR Terraform modules | velero/schedule.yaml, terraform/aws-rds-backup.tf |
| docs/ | Golden paths, architecture guides, runbooks, ADRs, environment strategy | golden-paths/, guides/, runbooks/ |
Design Principles
Clarity Over Abstraction
Every template is readable in one pass. No hidden control flow or magic variables buried three layers deep.
Golden Paths Over Flexibility
Opinionated choices are made for you. Adapt the templates but preserve the guardrails — that's the contract.
Production-Grade Defaults
Resource limits, non-root containers, read-only filesystems, and health probes are required, not optional.
Security by Default
OIDC over static credentials. External secrets over ConfigMaps. Kyverno blocks before admission, not after.
Dev Environment
Two modes: the devcontainer (recommended for onboarding and pair programming — guarantees identical toolchain) or local machine with manual tool installation.
Open the repo in VS Code and select Reopen in Container. The devcontainer installs all required tools at pinned versions and configures the environment automatically.
# All tools are pre-installed in the container:
# kubectl, helm, kind, terraform, pre-commit, ruff, mypy
# hadolint, checkov, gitleaks, cosign, velero, yq, jq
# Config: .devcontainer/devcontainer.json
# Dockerfile: .devcontainer/Dockerfile
# Post-create: .devcontainer/scripts/post-create.sh
Run pre-commit autoupdate quarterly to refresh hook revisions.
# 1. Verify all tools
bash scripts/env-checker.sh
# 2. Start local Kubernetes cluster
make dev
# → kind cluster "devops-playbook" with registry + ingress
# 3. Install git hooks
make hooks
# 4. Run all linters
make lint
# 5. Deploy to local cluster
make deploy-dev
# 6. Check status
make k8s-status
# 7. Tear down
make teardown
For MLOps and GPU workloads, use the dedicated CUDA devcontainer:
# Open: .devcontainer/gpu/devcontainer.json
# Requires: NVIDIA drivers + Docker GPU pass-through
# Validate GPU visibility
nvidia-smi
# Start Jupyter
python3 -m jupyter lab --ip=0.0.0.0 --no-browser
See .devcontainer/gpu/README.md for full GPU cluster provisioning with Terraform GPU node groups.
Golden Paths
Golden paths are opinionated, end-to-end workflows with exact file references and enforced guardrails. They eliminate decision fatigue and encode hard-won production experience. Pick your scenario and follow the steps — the right templates, security gates, and operational hooks are already connected.
Kubernetes Microservice — End to End
The most complete path: backend API or microservice from local dev to production on any cloud-managed Kubernetes cluster. Every step names the exact file to copy or edit.
File Map by Step
| Step | What | File(s) |
|---|---|---|
| 1 | Local kind cluster | local-dev/kind/setup.sh |
| 2 | Pre-commit hooks | .pre-commit-config.yaml |
| 3 | CI pipeline (pick stack) | ci/github-actions/{stack}/build-test.yml |
| 4 | Docker build (reusable workflow) | ci/github-actions/_shared/reusable-docker-build.yml |
| 5 | Security scans (parallel) | ci-security/container-scanning/trivy-scan.yml, ci-security/secret-detection/gitleaks.yml, ci-security/sast/semgrep.yml |
| 6 | Bootstrap Terraform state (once) | terraform/_bootstrap/{cloud}/main.tf |
| 7 | Provision cluster + DB (with backup) | terraform/{aws-eks\|azure-aks\|gcp-gke}/backup.tf |
| 8 | Kubernetes manifests | cd/kubernetes/_base/deployment.yaml, hpa.yaml, pdb.yaml |
| 9 | DB migrations | cd/kubernetes/_patterns/db-migration-job.yaml |
| 10 | Secrets (ESO) | secrets/external-secrets/example-external-secret.yaml |
| 11 | GitOps deploy (ArgoCD) | cd/gitops/argocd/application.yaml |
| 12 | Observability + SLOs | observability/prometheus/slos/availability-slo.yaml |
| 13 | Backup (prod only) | backup/velero/schedule.yaml |
Kubernetes Security Baseline (Kyverno-Enforced)
The following fields are required on every pod. In production namespaces, Kyverno blocks deployments that violate a policy in Enforce mode; policies in Audit or Warn mode report the violation without blocking:
| Field | Required value | Policy |
|---|---|---|
| spec.securityContext.runAsNonRoot | true | require-non-root.yaml (Enforce) |
| containers[].securityContext.allowPrivilegeEscalation | false | require-non-root.yaml (Enforce) |
| containers[].resources.requests.cpu | any value | require-resource-limits.yaml (Enforce) |
| containers[].resources.requests.memory | any value | require-resource-limits.yaml (Enforce) |
| containers[].resources.limits.cpu | any value | require-resource-limits.yaml (Enforce) |
| containers[].resources.limits.memory | any value | require-resource-limits.yaml (Enforce) |
| metadata.labels.app | non-empty string | require-labels.yaml (Audit) |
| containers[].securityContext.readOnlyRootFilesystem | true | require-readonly-filesystem.yaml (Warn) |
| finops.org/costcenter | string matching budgets.yaml | require-cost-labels.yaml (Enforce) |
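A quick way to reason about the Enforce-mode rows in this table is to run the same checks against a parsed pod manifest before it ever reaches the cluster. The following Python sketch does that for a plain dictionary; it is an illustration of the baseline, not Kyverno's logic, and the function name and message wording are hypothetical. The Audit and Warn rows and the cost label are left out to keep it short.

```python
from typing import Any, Dict, List


def baseline_violations(pod: Dict[str, Any]) -> List[str]:
    """List missing Enforce-mode fields on a parsed pod manifest."""
    violations: List[str] = []
    spec = pod.get("spec", {})

    # Pod-level security context must explicitly opt in to non-root.
    if spec.get("securityContext", {}).get("runAsNonRoot") is not True:
        violations.append("spec.securityContext.runAsNonRoot must be true")

    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # allowPrivilegeEscalation must be set to false, not merely absent.
        security = container.get("securityContext", {})
        if security.get("allowPrivilegeEscalation") is not False:
            violations.append(name + ": allowPrivilegeEscalation must be false")

        # CPU and memory must appear under both requests and limits.
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if not resources.get(kind, {}).get(resource):
                    violations.append(
                        name + ": resources." + kind + "." + resource + " is missing"
                    )
    return violations
```

An empty result means the manifest satisfies the six Enforce-mode pod fields; each string otherwise names one missing field.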
If a container needs a writable path while readOnlyRootFilesystem is true, mount an emptyDir volume at that path. The pattern is in cd/kubernetes/_base/deployment.yaml.
FinOps Checkpoints (Pre-Production)
Three checkpoints are embedded into this path and are not optional before promoting to production:
# Checkpoint 1: Verify cost labels are present
python finops/scripts/validate-cost-tags.py --namespace <your-ns>
# Checkpoint 2: Review VPA rightsizing (after 24h in staging)
python finops/scripts/analyze-rightsizing.py --namespace <staging-ns>
# Checkpoint 3: Confirm budget headroom
# Check: Grafana → FinOps — Budget Tracking dashboard
SLO-Driven Development
Translate reliability requirements into measurable targets, automated alerts, and an engineering process that balances feature velocity with operational risk.
Core Concepts
SLI — Service Level Indicator
The metric you measure. For availability: good_requests / total_requests. For latency: requests_under_200ms / total.
SLO — Service Level Objective
The target. "99.9% of requests succeed over a 30-day window." This is an internal engineering commitment, not a customer-facing SLA.
Error Budget
The allowed unreliability. At 99.9% over a 30-day window, you have about 43.2 minutes of downtime budget (0.1% of 43,200 minutes). Spend it on risk-taking features; run out and reliability work takes priority.
Burn Rate
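How fast you're consuming the budget. Burn rate of 1x = on track to spend exactly the budget over the window. 14.4x = budget gone in about 2 days of a 30-day window (2% of the budget burned per hour). Multi-window alerts catch both fast burns and slow drains.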
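The arithmetic behind error budgets and burn rates is short enough to check by hand. This Python sketch works it through for a 30-day window; the function names are illustrative and the window length is a parameter, so adjust it if your SLO uses a different period.

```python
def error_budget_minutes(slo_target: float, window_days: float = 30.0) -> float:
    """Minutes of allowed unreliability in the window (e.g. 0.999 -> ~43.2)."""
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))   # minutes of budget at 99.9%
    print(round(burn_rate(0.0144, 0.999), 1))      # 1.44% errors against 99.9%
    print(round(hours_to_exhaustion(14.4), 1))     # hours left at that rate
```

For a 99.9% target, an observed error ratio of 1.44% corresponds to a burn rate of 14.4, which would spend a full 30-day budget in roughly 50 hours.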
Burn-Rate Alert Tiers
Four alert tiers are required. The two-window approach (fast + slow detection per tier) reduces both false positives and missed gradual degradations:
Critical — Tier 1
Burn ≥ 14.4×
Budget gone in ~2 days
Alert after 2 minutes
→ Page immediately, open incident
High — Tier 2
Burn ≥ 6×
Budget gone in ~5 days
Alert after 15 minutes
→ Page, investigate within 15 min
Medium — Tier 3
Burn ≥ 3×
Budget gone in ~10 days
Alert after 1 hour
→ Ticket, investigate next standup
Low — Tier 4
Burn ≥ 1×
Will exhaust at end of month
Alert after 3 hours
→ Review in next sprint
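The tier thresholds above can be read as a simple decision rule. The sketch below assumes the common multi-window convention in which a tier fires only when both the long and the short window exceed its threshold; the window lengths themselves are not specified here, so the function takes the two measured burn rates as inputs. Names are illustrative.

```python
from typing import List, Optional, Tuple

# (tier name, burn-rate threshold), highest severity first.
# Threshold values are taken from the four tiers listed above.
TIERS: List[Tuple[str, float]] = [
    ("critical", 14.4),
    ("high", 6.0),
    ("medium", 3.0),
    ("low", 1.0),
]


def classify(long_window_burn: float, short_window_burn: float) -> Optional[str]:
    """Return the most severe tier whose threshold both windows exceed.

    Requiring both windows (an assumption of this sketch) suppresses alerts
    for spikes that have already recovered in the short window.
    """
    for name, threshold in TIERS:
        if long_window_burn >= threshold and short_window_burn >= threshold:
            return name
    return None
```

A sustained burn of 15x in both windows maps to the critical tier, while a long-window reading of 15x with a short-window reading of 0.5x maps to no alert because the burn has already stopped.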
Implementation Steps
# 1. Define your SLO (copy + fill the schema)
cp observability/prometheus/slos/slo-schema.yaml \
observability/prometheus/slos/my-service-slo.yaml
# 2. Write recording rules (replace "api-gateway" with your service)
cp observability/prometheus/recording-rules/slo-burn-rates.yaml \
observability/prometheus/recording-rules/my-service-slo-burn-rates.yaml
kubectl apply -f observability/prometheus/recording-rules/my-service-slo-burn-rates.yaml
# 3. Deploy burn-rate alerts
cp observability/prometheus/alerts/slo-burn-rate-alerts.yaml \
observability/prometheus/alerts/my-service-slo-burn-rate-alerts.yaml
kubectl apply -f observability/prometheus/alerts/my-service-slo-burn-rate-alerts.yaml
# 4. Deploy the SLO dashboard ConfigMap (auto-discovered by Grafana)
kubectl apply -f observability/prometheus/dashboards/slo-status-configmap.yaml
# 5. Validate naming convention
make slo-validate
Recording rules follow the naming convention slo:<service>:<metric>:<window>. Example: slo:api-gateway:availability_burn_rate:1h. The make slo-validate target checks this.
SLO Target Selection Guide
Most teams default to 99.9%. That is often wrong. Use these questions first:
- What is the user impact of 1 minute of downtime?
- What is your current measured baseline? (never set target above baseline)
- Can your team sustain the on-call burden of a tighter target?
Incident Response
A 5-phase ops runbook that applies to any service using this playbook. Start here when PagerDuty fires or a user reports production degradation.
Severity Levels
| Severity | Definition | Response time |
|---|---|---|
| SEV-1 | Total service outage — no users can access | Immediate — wake on-call |
| SEV-2 | Partial outage — significant % of users affected | Within 15 minutes |
| SEV-3 | Degraded performance or non-critical feature failure | Within 1 hour |
| SEV-4 | Minor issue, workaround available | Next business day |
Phase 1 — Acknowledge & Assemble (0–5 min)
# Acknowledge PagerDuty within 5 minutes to stop escalation
# Open dedicated Slack channel: #inc-YYYY-MM-DD-<service-name>
# Post immediately:
Service: <name>
Severity: SEV-X
Started: HH:MM UTC
IC: @name
Impact: <what users see>
# Verify ownership via service catalog
cat catalog/services/<service-name>.yaml
# → spec.oncall.pagerduty_service, spec.oncall.slack_channel
Phase 2 — Triage (5–15 min)
# Layer 1: Is the service returning errors?
curl -I https://my-service.example.com/health
# Layer 2: Kubernetes health
kubectl get pods -n <namespace> -l app=<service>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Layer 3: Recent changes (most common root cause)
kubectl rollout history deployment/<name> -n <namespace>
git log --oneline --since="1 hour ago" main
# Layer 4: Logs
kubectl logs -n <namespace> -l app=<service> --previous | tail -100
Phase 3 — Mitigate Options
| Scenario | Mitigation | Command |
|---|---|---|
| Recent bad deploy | Roll back | kubectl rollout undo deployment/<name> -n <ns> |
| Capacity overload | Scale up | kubectl scale deployment/<name> --replicas=N -n <ns> |
| OOMKilled pods | Increase memory limit | kubectl set resources deployment/<name> --limits=memory=512Mi -n <ns> |
| Feature-specific failure | Disable via ConfigMap | kubectl edit configmap <name> -n <ns> |
| Data corruption | Velero restore | velero restore create --from-backup <name> --include-namespaces <ns> |
Phase 5 — Post-Incident Review
Mandatory for SEV-1 and SEV-2. Complete within 48 hours. Include: timeline, root cause, contributing factors, action items with owners and due dates. Write or update the runbook: cp docs/runbooks/template.md docs/runbooks/<service>-<symptom>.md
Platform Onboarding
New team setup — get every foundation in place before any application path. Completing this path means the application paths work on first attempt.
Workstation Setup
Each engineer runs bash scripts/env-checker.sh and make hooks independently on their machine.
Local Kind Cluster
Run bash local-dev/kind/setup.sh — idempotent, creates a 3-node cluster with ingress and local registry in one command.
GitHub Repo Setup
Branch protection on main, conventional commits CI workflow, OIDC federation (no static credentials in GitHub Secrets), release strategy.
Bootstrap Terraform State (once)
Run terraform/_bootstrap/{cloud}/ before any other Terraform. Creates the remote state backend. Without this, state is local and will be lost.
Kubernetes Namespace + RBAC
Request namespace from platform team. Gets: namespace with labels, developer read-only RBAC, CI deployer role, default-deny NetworkPolicy, DNS egress allow.
Secrets (External Secrets Operator)
Apply SecretStore for your cloud, create first ExternalSecret, verify sync with kubectl get externalsecret.
Kyverno Policies
Install Kyverno cluster-wide. Run kubectl get policyreport -n <ns> to check violations before your first deploy attempt.
Observability (Prometheus + Loki + Tempo)
Install in order: Prometheus → Loki → Tempo. Apply alert rules and SLO templates for your service.
Alert Routing
Configure Alertmanager receivers: Slack for dev/staging, PagerDuty for production. Fire a test alert to confirm routing works before going live.
Write First Runbook
cp docs/runbooks/template.md docs/runbooks/<service>-crash-loop.md. Link it from alert annotations with runbook_url.
Production Readiness Checklist
- bash scripts/env-checker.sh passes on all engineers' machines
- pre-commit install run on all engineers' clones
- Branch protection enabled with required CI checks
- OIDC federation configured — no long-lived credentials in GitHub Secrets
- Remote Terraform state backend configured
- Namespace with correct RBAC and NetworkPolicy applied
- External Secrets store configured and syncing
- Kyverno policies installed, with no failures in kubectl get policyreport
- At least one alert rule deployed and routing verified
- PagerDuty on-call rotation configured
- Runbook written for top 2 failure modes with runbook_url in alert
- Velero backup schedule applied to production namespace
Mobile Backend (BFF)
Backend for Frontend pattern for iOS/Android apps — covering the concerns that make mobile different from a standard API: versioning, PKCE auth, push notifications, and aggressive client retry protection.
Key Mobile-Specific Requirements
API Versioning (Required)
Mobile clients cannot be force-updated. Old app versions live for 12+ months. URL path versioning (/v1/, /v2/) is the default. Deprecated versions must serve a Deprecation: true + Sunset: header.
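As a rough illustration of the deprecation headers, the Python sketch below builds the header pair for a deprecated version. It assumes the Sunset value is an HTTP-date, as in RFC 8594; the function name is hypothetical and the sunset date is whatever your team announces.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def deprecation_headers(sunset_at: datetime) -> Dict[str, str]:
    """Build response headers for a deprecated API version.

    sunset_at must be timezone-aware; it is converted to UTC and rendered
    as an HTTP-date, e.g. 'Thu, 01 Jan 2026 00:00:00 GMT'.
    """
    if sunset_at.tzinfo is None:
        raise ValueError("sunset_at must be timezone-aware")
    http_date = format_datetime(sunset_at.astimezone(timezone.utc), usegmt=True)
    return {"Deprecation": "true", "Sunset": http_date}
```

A BFF handler for /v1/ routes would merge this dictionary into every response so that old app builds can surface an upgrade prompt well before the cutoff.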
OAuth2/PKCE (Required)
Authorization Code + PKCE (RFC 7636) is mandatory — mobile apps cannot safely store a client secret. The BFF validates JWTs from the JWKS endpoint on every request. Never hardcode public keys.
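To make the PKCE exchange concrete, here is a minimal Python sketch of the S256 transformation defined in RFC 7636. In a real deployment the mobile client generates the verifier and challenge in its own language and the authorization server performs the comparison; the function names here are illustrative and not part of this repo.

```python
import base64
import hashlib
import hmac
import secrets


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as RFC 7636 requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_code_verifier(num_bytes: int = 32) -> str:
    """Create a random verifier; 32 bytes encodes to 43 characters."""
    return _b64url(secrets.token_bytes(num_bytes))


def code_challenge_s256(verifier: str) -> str:
    """Derive the S256 challenge: BASE64URL(SHA256(ASCII(verifier)))."""
    return _b64url(hashlib.sha256(verifier.encode("ascii")).digest())


def verifier_matches(verifier: str, stored_challenge: str) -> bool:
    """Server-side check at token exchange, using a constant-time compare."""
    return hmac.compare_digest(code_challenge_s256(verifier), stored_challenge)
```

The client sends the challenge with the authorization request and the verifier with the token request, so an intercepted authorization code is useless without the verifier.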
Push Notifications
APNs keys and FCM server keys must be stored via External Secrets Operator — never in code or manifests. Upsert device tokens on every app launch (they rotate).
Rate Limiting (Required)
All public endpoints must have Ingress-layer IP-based limiting (nginx.ingress.kubernetes.io/limit-rps). Authenticated endpoints additionally need per-user Redis sliding-window counters. Return 429 with Retry-After.
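The per-user sliding-window counter can be sketched without Redis to show the logic. The class below keeps timestamps in memory, which only works for a single process; a production BFF would store the same data in Redis as described above. The class name and return shape are hypothetical.

```python
import math
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user in any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def check(self, user_id: str, now: Optional[float] = None) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds) and record allowed requests."""
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]

        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()

        if len(hits) < self.limit:
            hits.append(now)
            return True, 0

        # The oldest hit leaving the window frees the next slot.
        retry_after = math.ceil(hits[0] + self.window - now)
        return False, max(retry_after, 1)
```

When `check` returns `(False, n)`, the handler would respond with status 429 and `Retry-After: n`.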
Mobile-Specific Prometheus Metrics
| Metric | Alert when |
|---|---|
| http_requests_total{version="v1"} | Drops to 0 (all clients migrated, safe to sunset) |
| auth_token_validation_errors_total | Spikes (potential credential stuffing) |
| push_notification_delivery_failures_total | Failure rate > 5% |
| rate_limit_rejections_total | Sustained spike (client misbehaviour or DDoS) |
Multi-Tenant SaaS
Namespace-per-tenant isolation model. Scales to ~100 tenants per cluster before control plane overhead becomes significant. Beyond that, provision additional clusters.
Isolation Layers
| Layer | Mechanism | File |
|---|---|---|
| Network | Default-deny NetworkPolicy per namespace + explicit allow rules | cd/kubernetes/_base/network-policies/default-deny.yaml |
| Identity | RBAC scoped to namespace — no cross-tenant secret access | cd/kubernetes/_base/rbac/ |
| Secrets | ESO namespace-scoped SecretStore per tenant | secrets/external-secrets/ |
| Data | Schema-per-tenant on shared DB (scales to ~1000 tenants; add PgBouncer beyond) | cd/kubernetes/_patterns/db-migration-job.yaml |
| Billing | All Prometheus metrics carry a tenant label — enforced by Kyverno | policy/kyverno/require-labels.yaml |
Tenant Onboarding Flow
Add a YAML record to tenants/<slug>.yaml with 6 required fields. CI detects the new file and creates exactly 6 resources per tenant: namespace, RBAC, NetworkPolicy, ESO store config, database schema, and ArgoCD Application. The job is idempotent — re-running is safe.
Tenant Offboarding (Order Matters)
Never automate schema drops. The required sequence:
- Set status: offboarding in tenant YAML → merge PR
- Export billing metrics → take Velero namespace snapshot
- Delete ArgoCD Application
- Delete namespace
- Manual DBA step: Drop tenant schema (requires explicit approval)
- Remove tenant secrets from secrets store
Manual DBA step: DROP SCHEMA. This step must NEVER be automated.
Service Catalog Registration
Every production service must be registered in the Git-native catalog with owner, on-call routing, runbook link, SLO file, and cost center. CI validates metadata integrity. Kyverno can block deployments from unregistered services in production namespaces.
Required Service Fields
# catalog/services/<service-name>.yaml
metadata:
  name: api-gateway # must match Deployment app label + filename
  owner: platform-team # must exist in catalog/teams/
  tier: tier-1 # tier-1 | tier-2 | tier-3
spec:
  oncall:
    pagerduty_service: "..."
    slack_channel: "#alerts-api"
  slo:
    definition_file: "observability/prometheus/slos/api-gateway-slo.yaml"
  cost_center: engineering # must exist in finops/config/budgets.yaml
  runbook: "docs/runbooks/api-gateway-crashloop.md"
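The constraints in the comments above translate into a handful of checks on the parsed record. The Python sketch below is not the repo's validate-catalog.py; it assumes the record is already loaded into a dictionary, that cost_center and runbook sit directly under spec, and that the sets of known teams and cost centers are supplied by the caller. All names are illustrative.

```python
from typing import Any, Dict, List, Set, Tuple

REQUIRED_PATHS: List[Tuple[str, ...]] = [
    ("metadata", "name"), ("metadata", "owner"), ("metadata", "tier"),
    ("spec", "oncall", "pagerduty_service"), ("spec", "oncall", "slack_channel"),
    ("spec", "slo", "definition_file"), ("spec", "cost_center"), ("spec", "runbook"),
]
VALID_TIERS = {"tier-1", "tier-2", "tier-3"}


def lookup(record: Dict[str, Any], path: Tuple[str, ...]) -> Any:
    """Walk nested dictionaries; return None if any key is absent."""
    node: Any = record
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def validate_service(record: Dict[str, Any], file_stem: str,
                     known_teams: Set[str], known_cost_centers: Set[str]) -> List[str]:
    """Return human-readable errors; an empty list means the record is valid."""
    errors = ["missing field: " + ".".join(p)
              for p in REQUIRED_PATHS if lookup(record, p) in (None, "")]
    name = lookup(record, ("metadata", "name"))
    if name and name != file_stem:
        errors.append("metadata.name must match the filename")
    owner = lookup(record, ("metadata", "owner"))
    if owner and owner not in known_teams:
        errors.append("metadata.owner is not a known team")
    tier = lookup(record, ("metadata", "tier"))
    if tier and tier not in VALID_TIERS:
        errors.append("metadata.tier must be tier-1, tier-2, or tier-3")
    cost_center = lookup(record, ("spec", "cost_center"))
    if cost_center and cost_center not in known_cost_centers:
        errors.append("spec.cost_center is not in the budgets file")
    return errors
```

Returning a list rather than raising on the first problem lets CI report every issue with a record in a single run.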
Validation & Automation
# Validate locally (strict mode, skip URL checks)
python catalog/scripts/validate-catalog.py --strict --skip-url-check
# Regenerate .github/CODEOWNERS from team ownership metadata
python catalog/scripts/generate-codeowners.py --output .github/CODEOWNERS
make catalog-codeowners
# Export to Backstage format (when service count approaches 50)
python catalog/scripts/migrate-to-backstage.py --output-dir backstage/catalog
Treat catalog/services/<name>.yaml as the source of truth for responder routing. Keep spec.oncall.pagerduty_service and spec.oncall.slack_channel current.
CI Core Concepts
Every CI pipeline in this repo follows the same conceptual model regardless of platform — build once, test in isolation, scan in parallel, deploy immutable artifacts. Platform-specific syntax differs; the stages and contracts do not.
Universal Pipeline Stages
Reusable vs Standalone Workflows
| Use case | Pattern | File |
|---|---|---|
| Docker build shared across services | Reusable workflow | ci/github-actions/_shared/reusable-docker-build.yml |
| Slack notify shared across pipelines | Reusable workflow | ci/github-actions/_shared/reusable-notify-slack.yml |
| Trivy scan shared across pipelines | Reusable workflow | ci/github-actions/_shared/reusable-security-scan.yml |
| Supply chain (sign + SBOM + SLSA) | Reusable workflow | ci/github-actions/_shared/reusable-supply-chain.yml |
| Single-service owns its own build | Standalone workflow | Service-specific .github/workflows/build.yml |
| Monorepo — run only affected services | Strategy | ci/github-actions/_strategies/monorepo-affected.yml |
| Multi-version matrix (Node 20+22) | Strategy | ci/github-actions/_strategies/matrix-build.yml |
| Automated SemVer releases | Strategy | ci/github-actions/_strategies/release-please.yml |
OIDC vs Static Credentials
AWS_ACCESS_KEY_ID, AZURE_CLIENT_SECRET, and GCP service account JSON keys cannot be scoped to a specific repository or branch, and persist indefinitely if leaked. OIDC tokens are short-lived (≤15 minutes) and automatically scoped.
| Aspect | Long-lived secrets | OIDC tokens |
|---|---|---|
| Lifetime | Until manually rotated | Minutes (auto-expires) |
| Rotation | Manual, easy to forget | Automatic, every workflow run |
| Blast radius | Leaked secret = persistent access | Leaked token expires in minutes |
| Audit trail | Hard to attribute | Each token tied to repo/branch/workflow |
| Scope control | Global until revoked | Restricted by trust policy to specific repo + branch |
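The scope-control row is easiest to understand by looking at the token's subject claim. For GitHub Actions, a branch-triggered run carries a `sub` value of the form `repo:<owner>/<repo>:ref:refs/heads/<branch>`, and the cloud trust policy decides which values to accept. The Python sketch below imitates that decision with shell-style wildcards; it is a stand-in for the provider's own condition matching, and the organization and repository names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def subject_allowed(sub: str, allowed_patterns: Iterable[str]) -> bool:
    """Return True when the token's sub claim matches any trusted pattern."""
    return any(fnmatchcase(sub, pattern) for pattern in allowed_patterns)


# Hypothetical trust policy: only the main branch of one repository may deploy.
PROD_TRUST = ["repo:example-org/my-service:ref:refs/heads/main"]

# A looser policy for non-production: any branch of the same repository.
DEV_TRUST = ["repo:example-org/my-service:ref:refs/heads/*"]
```

Keeping the production pattern free of wildcards is what limits a compromised feature branch or fork to non-production roles.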
CI Templates
Copy the workflow file for your stack into .github/workflows/ and add security scanning as a parallel job. All pipelines emit the same stage contract regardless of language.
GitHub Actions
| Stack | Build + Test | Docker Publish |
|---|---|---|
| .NET / ASP.NET Core | ci/github-actions/dotnet/build-test.yml | ci/github-actions/dotnet/docker-publish.yml |
| Python (pytest + ruff) | ci/github-actions/python/build-test.yml | via reusable-docker-build |
| Go | ci/github-actions/go/build-test.yml | ci/github-actions/go/docker-publish.yml |
| Java (Maven/Gradle) | ci/github-actions/java/build-test.yml | via reusable-docker-build |
| Node.js / React | ci/github-actions/react/build-test.yml | via reusable-docker-build |
| Angular | ci/github-actions/angular/build-test.yml | ci/github-actions/angular/lighthouse-audit.yml |
| Ruby on Rails | ci/github-actions/ruby/build-test.yml | via reusable-docker-build |
| Terraform | ci/github-actions/terraform/plan-apply.yml | ci/github-actions/terraform/cost-estimation.yml |
# Minimal wiring example: build → security scan
jobs:
  build-test:
    uses: ./.github/workflows/reusable-docker-build.yml
    with:
      image-name: my-service
      dockerfile: docker/python/Dockerfile.fastapi
    permissions:
      id-token: write # Required for OIDC
      contents: read
  security-scan:
    needs: build-test
    uses: ./.github/workflows/reusable-security-scan.yml
    with:
      image: my-service:${{ github.sha }}
Azure Pipelines
| Stack | Pipeline file |
|---|---|
| .NET | ci/azure-pipelines/dotnet/azure-pipelines.yml |
| Angular | ci/azure-pipelines/angular/azure-pipelines.yml |
| Python | ci/azure-pipelines/python/azure-pipelines.yml |
| Terraform | ci/azure-pipelines/terraform/azure-pipelines.yml |
Azure Pipelines templates use Variable Groups + Key Vault link or Managed Identity (OIDC equivalent) for cloud credentials — never hardcoded pipeline variables.
GitLab CI
| Stack | Pipeline file |
|---|---|
| .NET | ci/gitlab-ci/dotnet/.gitlab-ci.yml |
| Python | ci/gitlab-ci/python/.gitlab-ci.yml |
| Terraform | ci/gitlab-ci/terraform/.gitlab-ci.yml |
GitLab CI uses include: to pull shared includes from ci/gitlab-ci/_includes/ — Docker build, SAST scan, and Slack notify are all shared components.
Jenkins
| Stack | Jenkinsfile |
|---|---|
| .NET | ci/jenkins/dotnet/Jenkinsfile |
| Python | ci/jenkins/python/Jenkinsfile |
Deployment Targets
Choose your deployment target based on whether you need Kubernetes features. If unsure, start with the decision tree:
| Target | Terraform | Deploy workflow | Use when |
|---|---|---|---|
| AWS EKS | terraform/aws-eks/ | cd/targets/aws-eks/github-actions-deploy.yml | Full K8s on AWS |
| Azure AKS | terraform/azure-aks/ | cd/targets/azure-aks/github-actions-deploy.yml | Full K8s on Azure |
| GCP GKE | terraform/gcp-gke/ | cd/targets/gcp-gke/github-actions-deploy.yml | Full K8s on GCP |
| Azure App Service | terraform/azure-app-service/ | cd/targets/azure-app-service/github-actions-deploy.yml | Stateless apps, no K8s |
| AWS ECS | terraform/aws-ecs/ | cd/targets/aws-ecs/github-actions-deploy.yml | Containers without K8s on AWS |
| AWS Lambda | terraform/aws-lambda/ | cd/targets/aws-lambda/serverless-deploy.yml | Event-driven, short-lived |
| GCP Cloud Run | via gcp-gke module | cd/targets/gcp-gke/cloudbuild.yaml | Serverless containers on GCP |
| OpenShift | — | cd/targets/openshift/github-actions-deploy.yml | Enterprise OpenShift environments |
| AWS CodePipeline | — | cd/targets/aws-codepipeline/codepipeline.yml | AWS-native pipeline requirement |
GitOps & ArgoCD
GitOps treats Git as the single source of truth for cluster state. A controller running inside the cluster continuously reconciles actual state with desired state from Git — no push-based deployments from CI, no manual kubectl apply in production.
GitOps Principles
Declarative
All desired state defined in YAML committed to Git. What's in Git is what's in the cluster.
Versioned
Git history is the complete audit log. Every cluster change is traceable to a PR, author, and timestamp.
Pull-Based
The cluster pulls from Git — CI never pushes directly to the cluster. This eliminates a large attack surface.
Auto-Reconciled
If someone manually edits a resource in the cluster (drift), the controller automatically reverts it to the Git state.
ArgoCD Application Patterns
| Pattern | File | Use when |
|---|---|---|
| Single app | cd/gitops/argocd/application.yaml | One service, one repo |
| Multi-environment generator | cd/gitops/argocd/applicationset.yaml | Same app across dev/staging/prod |
| App-of-apps bootstrap | cd/gitops/argocd/app-of-apps.yaml | Bootstrap many apps from one definition |
| Fleet (3–20 clusters) | cd/gitops/argocd/fleet/fleet-applicationset.yaml | Multi-cluster platform component delivery |
Promotion Flow
# 1. CI builds image, tags with Git SHA
# 2. CI updates dev overlay: image tag → new SHA
# cd/kubernetes/_overlays/dev/kustomization.yaml
# 3. ArgoCD syncs dev automatically
# 4. After testing: PR to update staging overlay
# 5. After staging approval: PR to update prod overlay
# 6. ArgoCD syncs prod (with manual sync gate)
# Verify sync status
kubectl get applications -n argocd
kubectl describe application <app-name> -n argocd | grep -A 20 "History:"
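Step 2 of the flow above, bumping the image tag in an overlay, comes down to editing the `images` list that Kustomize reads from kustomization.yaml. The Python sketch below shows that edit on an already-parsed dictionary; loading and writing the YAML file is left out, and the function name and image name are illustrative rather than taken from the repo's CI.

```python
import copy
from typing import Any, Dict


def set_image_tag(kustomization: Dict[str, Any], image_name: str,
                  new_tag: str) -> Dict[str, Any]:
    """Return a copy of the kustomization with image_name pinned to new_tag.

    Uses Kustomize's `images` entries (name / newTag). The input is not
    modified, so the caller can diff old and new before committing.
    """
    updated = copy.deepcopy(kustomization)
    images = updated.setdefault("images", [])
    for entry in images:
        if entry.get("name") == image_name:
            entry["newTag"] = new_tag
            break
    else:
        # No existing entry for this image: add one.
        images.append({"name": image_name, "newTag": new_tag})
    return updated
```

Because the change is a one-line diff in Git, promotion to staging and prod is the same edit applied to a different overlay through a reviewed PR.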
Kubernetes Concepts & Patterns
Core workload patterns and the decision logic for choosing between them. All base manifests are in cd/kubernetes/_base/; reusable patterns for specific scenarios are in cd/kubernetes/_patterns/.
Workload Type Selection
| App type | Workload | Notes |
|---|---|---|
| Stateless web API / frontend | Deployment | Default choice. Use HPA for autoscaling. PDB for HA. |
| Stateful app (database, queue, cache) | StatefulSet | Stable network identity, ordered rollout, persistent storage. |
| Node-level daemon (log collector, metrics) | DaemonSet | Runs exactly one pod per node automatically. |
| One-off task | Job | Set restartPolicy: Never and activeDeadlineSeconds. |
| Recurring scheduled task | CronJob | Wraps a Job. Set concurrencyPolicy: Forbid for batch jobs. |
Autoscaling Decision
| Scale on | Tool | File |
|---|---|---|
| CPU or memory utilization | HPA | cd/kubernetes/_base/hpa.yaml — target CPU at 70%, memory at 80% |
| Queue depth, custom metrics, scale-to-zero | KEDA | Add separately — not in this repo by default |
| Right-size resource requests automatically | VPA (Off mode) | cd/kubernetes/_base/vpa.yaml — recommendations only, no auto-apply |
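For intuition about the HPA row, the scaling rule documented by Kubernetes is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 and clamped to the configured bounds. The Python sketch below reproduces that rule; the 10% tolerance is the Kubernetes default, and the function name and example bounds are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for one utilization metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: keep the current count to avoid flapping.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go outside minReplicas..maxReplicas.
    return max(min_replicas, min(max_replicas, desired))
```

With a 70% CPU target, four replicas averaging 90% scale to six, four replicas averaging 72% stay at four, and a near-idle deployment stops shrinking at the configured minimum.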
Never set minReplicas: 1 in production; a single replica means no high availability.
Helm vs Kustomize
| Situation | Use |
|---|---|
| Packaging an app for distribution / multiple teams to install | Helm (cd/helm/) |
| Environment-specific configuration of manifests you own | Kustomize (cd/kubernetes/_overlays/) |
| Internal apps with simple env-to-env differences | Kustomize — less indirection, easier to debug |
| Complex conditional logic across many parameters | Helm — but if it feels like programming, reconsider |
Deployment Strategies
| Strategy | Pattern file | When to use |
|---|---|---|
| Rolling update (default) | cd/kubernetes/_base/deployment.yaml | Standard — Kubernetes handles pod replacement automatically |
| Blue/Green | cd/kubernetes/_patterns/blue-green.yaml | Zero-downtime cutover, instant rollback by switching Service selector |
| Canary | cd/kubernetes/_patterns/canary.yaml | Progressive traffic shift — stable + canary replica ratio |
Base Manifests & Overlays
Start from the base manifests and layer environment differences with Kustomize overlays. Never duplicate full manifests per environment — only patch what differs.
Base Manifest File Map
| File | What it configures |
|---|---|
| _base/deployment.yaml | Deployment with health checks, security context, resource limits, emptyDir for writable paths |
| _base/service.yaml | ClusterIP service with correct selector labels |
| _base/ingress.yaml | nginx Ingress with TLS (cert-manager), security headers |
| _base/hpa.yaml | HPA targeting 70% CPU, min 2 replicas |
| _base/pdb.yaml | PodDisruptionBudget ensuring at least 1 pod available during disruptions |
| _base/vpa.yaml | VPA in Off mode: generates recommendations without applying them |
| _base/networkpolicy.yaml | Allow only required ingress/egress; deny all by default |
| _base/rbac.yaml | ServiceAccount + minimal RBAC roles |
| _base/configmap.yaml | Non-secret configuration (use ExternalSecret for secrets) |
| _base/network-policies/default-deny.yaml | Deny-all baseline: apply first, then add allow rules |
| _patterns/canary.yaml | Canary rollout with stable/canary replica split |
| _patterns/blue-green.yaml | Blue/green Deployments with Service selector swap |
| _patterns/init-containers.yaml | Wait-for-dependency init container pattern |
| _patterns/dev-scale-to-zero.yaml | Scale dev/staging to zero overnight to save cost |
Kustomize Overlay Structure
cd/kubernetes/
  _base/ # Shared defaults (all environments)
  _overlays/
    dev/kustomization.yaml # replicas: 1, relaxed resources, mutable tags
    staging/kustomization.yaml
    prod/kustomization.yaml # replicas: 3, tight PDB, pinned image SHA
# Preview what Kustomize will generate (validate before applying)
kubectl kustomize cd/kubernetes/_overlays/dev
# Run policy checks locally
conftest test cd/kubernetes/_base/ --policy policy/conftest/kubernetes
Kyverno Policy Engine
Kyverno enforces governance at Kubernetes admission time — before a resource is admitted to the cluster. It complements static analysis (Checkov, Conftest) which runs earlier in CI. These operate at different lifecycle stages and are not alternatives.
Policy Modes
| Mode | Effect | Use when |
|---|---|---|
| Enforce | Blocks resource from being created/updated | Policy is well-understood; existing resources are compliant |
| Audit | Admits resource but records violation in PolicyReport | Migrating existing workloads; discovering current violations |
| Warn | Admits resource with a warning in the API response | Developer feedback without hard blocking |
Installed Policies
| Policy file | Mode | What it enforces |
|---|---|---|
| require-non-root.yaml | Enforce | runAsNonRoot: true, allowPrivilegeEscalation: false |
| require-resource-limits.yaml | Enforce | CPU and memory requests + limits on all containers |
| disallow-latest-tag.yaml | Enforce (prod) | Image tags must be pinned; :latest is blocked |
| require-labels.yaml | Audit | app and version labels on Deployments |
| require-liveness-readiness.yaml | Audit | Liveness + readiness probes on multi-replica Deployments |
| require-readonly-filesystem.yaml | Warn | readOnlyRootFilesystem: true |
| require-catalog-registration.yaml | Audit / Enforce (prod) | Service must be registered in catalog/services/ |
| enforce-finops-labels.yaml | Enforce | finops.org/costcenter and finops.org/environment labels |
| fleet-policy-propagation.yaml | Audit | Cross-cluster policy compliance tracking for fleet deployments |
# Check violations in your namespace
kubectl get policyreport -n <your-namespace>
kubectl describe policyreport -n <your-namespace> | grep -A 5 "fail"
# Check cluster-wide (requires cluster-admin)
kubectl get clusterpolicyreport
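For a feel of what two of the Enforce policies above assert, the sketch below checks a list of container specs for pinned image tags and for CPU and memory requests and limits. It is an illustration only: it is neither Kyverno nor the repo's policy YAML, the function name is hypothetical, and treating an untagged image as equivalent to `:latest` is an assumption about how strictly you want to read "pinned".

```python
def policy_violations(containers: list[dict]) -> list[str]:
    """Return human-readable violations for a list of container specs."""
    violations = []
    for c in containers:
        name = c.get("name", "<unnamed>")
        image = c.get("image", "")
        # Only look for a tag after the last '/', so a registry port such as
        # localhost:5001 is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if "@sha256:" not in image and tag in ("", "latest"):
            violations.append(f"{name}: image tag must be pinned")
        resources = c.get("resources", {})
        for section in ("requests", "limits"):
            for key in ("cpu", "memory"):
                if key not in resources.get(section, {}):
                    violations.append(f"{name}: missing resources.{section}.{key}")
    return violations


if __name__ == "__main__":
    sample = {"name": "web", "image": "nginx:latest",
              "resources": {"limits": {"cpu": "1"}}}
    for line in policy_violations([sample]):
        print(line)
```

Running a check like this locally gives faster feedback than waiting for admission, but the cluster-side policy remains the source of truth.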
Terraform
All cloud infrastructure is provisioned declaratively via Terraform modules. Each cloud target is a self-contained module with its own state key. Bootstrap the remote state backend once before using any other module.
Bootstrap Remote State (Required First)
Never commit terraform.tfstate to Git: it contains plaintext secrets.
# Run ONCE per cloud account, by a human with admin access (not CI)
# AWS — creates S3 bucket + DynamoDB lock table
cd terraform/_bootstrap/aws
terraform init && terraform plan -out=tfplan && terraform apply tfplan
# Azure — creates Storage Account + Blob container
cd terraform/_bootstrap/azure
terraform init && terraform apply
# GCP — creates GCS bucket
cd terraform/_bootstrap/gcp
terraform init && terraform apply
# Then uncomment the backend block in your workload module's main.tf
# and run: terraform init -migrate-state
Cloud Modules
| Module | Path | Provisions |
|---|---|---|
| AWS EKS | terraform/aws-eks/ | EKS cluster, managed node groups, VPC, ALB, IRSA roles, optional GPU node group |
| Azure AKS | terraform/azure-aks/ | AKS cluster, node pools, ACR, Key Vault, workload identity, optional GPU pool |
| GCP GKE | terraform/gcp-gke/ | GKE cluster, Artifact Registry, Workload Identity Federation, PITR Cloud SQL |
| AWS ECS | terraform/aws-ecs/ | ECS cluster, task definition, IAM execution role, CloudWatch log group |
| AWS Lambda | terraform/aws-lambda/ | Lambda function, least-privilege IAM role, API Gateway, CloudWatch alarms |
| Azure App Service | terraform/azure-app-service/ | App Service Plan, Web App (container), Application Insights |
Database Backup Modules (Include with Cluster)
| Cloud | File | Backup approach |
|---|---|---|
| AWS RDS | terraform/aws-eks/backup.tf | PITR enabled, cross-region replica, CloudWatch alarm on backup age |
| Azure PostgreSQL | terraform/azure-aks/backup.tf | Flexible Server with geo-redundant backup, PITR to 35 days |
| GCP Cloud SQL | terraform/gcp-gke/backup.tf | PITR enabled, cross-region read replica, point-in-time clone |
State Strategy
This repo uses separate state keys per module (e.g., prod/eks/terraform.tfstate) rather than Terraform workspaces. Workspaces share provider configuration — this makes it harder to use different cloud accounts per environment, which is the recommended security posture.
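A small sketch of the key-per-module convention follows. The layout `<environment>/<module>/terraform.tfstate` is inferred from the single example above (`prod/eks/terraform.tfstate`), and the helper name is hypothetical; treat it as an assumption rather than the repo's actual backend configuration.

```python
def state_key(environment: str, module: str) -> str:
    """Build a per-module state key such as 'prod/eks/terraform.tfstate'."""
    for part in (environment, module):
        # Reject empty or nested segments so two modules can never collide.
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return f"{environment}/{module}/terraform.tfstate"


if __name__ == "__main__":
    for env in ("dev", "staging", "prod"):
        for module in ("eks", "ecs"):
            print(state_key(env, module))
```

Because the environment is part of the key rather than a workspace name, each environment's backend block can point at a bucket in a different cloud account.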
Environment Strategy
Three environments with distinct purposes, triggers, and data policies. Never bake environment config into images — externalize via Kubernetes ConfigMaps, Helm values files, or Kustomize overlays.
| Environment | Purpose | Deploy trigger | Data |
|---|---|---|---|
| dev | Rapid feedback, feature work | Every push / PR | Synthetic / mocked — never real PII |
| staging | Pre-prod verification, QA sign-off | Merge to main | Anonymised production snapshot |
| production | Live system | Manual / scheduled after staging approval | Real production data |
Production Access Controls
- No direct kubectl exec to production pods
- All changes via Git PR (GitOps): ArgoCD applies, never CI push
- Break-glass procedures documented and audited in secops/runbooks/
- RBAC: least-privilege service accounts; use cd/kubernetes/_base/rbac/ci-deployer.yaml
- Separate Kubernetes namespaces per environment
OIDC / Keyless Cloud Authentication
OpenID Connect federation lets GitHub Actions authenticate directly to cloud providers with short-lived, auto-rotating tokens — no secrets stored in GitHub, no rotation to manage, full audit trail per workflow run.
Setup by Cloud
AWS mechanism: IAM OIDC Provider + IAM Role with trust policy
# Step 1: Create OIDC provider (once per AWS account)
aws iam create-open-id-connect-provider \
--url "https://token.actions.githubusercontent.com" \
--client-id-list "sts.amazonaws.com"
# Step 2: Create IAM role with trust policy (restrict to your repo)
# Condition: "token.actions.githubusercontent.com:sub": "repo:YOUR-ORG/YOUR-REPO:*"
# Step 3: Add GitHub secret
# AWS_DEPLOY_ROLE_ARN = arn:aws:iam::123456789012:role/github-actions-deploy
# Step 4: Use in workflow
permissions:
  id-token: write # Required for OIDC
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
      aws-region: us-east-1
Azure mechanism: Azure AD App Registration + Federated Identity Credential
# Step 1: Create app registration + service principal
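# Capture the new application's appId so the next command can reference it
APP_ID=$(az ad app create --display-name "github-actions-deploy" --query appId -o tsv)
az ad sp create --id "$APP_ID"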
# Step 2: Add federated credential (restrict to main branch)
# subject: "repo:YOUR-ORG/YOUR-REPO:ref:refs/heads/main"
# Step 3: Add GitHub secrets
# AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID
# Step 4: Use in workflow
permissions:
  id-token: write # Required for OIDC
  contents: read # Needed if the job checks out the repository
steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
GCP mechanism: Workload Identity Pool + OIDC Provider + Service Account binding
# Step 1: Create Workload Identity Pool
gcloud iam workload-identity-pools create "github-actions-pool" \
--location="global"
# Step 2: Create OIDC provider with attribute condition
# --attribute-condition="assertion.repository_owner == 'YOUR-ORG'"
# Critical: without this, ANY GitHub repo could request tokens
# Step 3: Grant Service Account access to pool
# --member="principalSet://iam.googleapis.com/${POOL_ID}/attribute.repository/YOUR-ORG/YOUR-REPO"
# Step 4: Use in workflow
permissions:
  id-token: write # Required for OIDC
  contents: read # Needed if the job checks out the repository
steps:
  - uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
      service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
Common Troubleshooting
| Cloud | Error | Fix |
|---|---|---|
| Azure | AADSTS70021: No matching federated identity record | Subject claim mismatch — check branch/environment in federated credential matches workflow trigger |
| AWS | Not authorized to perform: sts:AssumeRoleWithWebIdentity | Trust policy sub condition doesn't match workflow repo/branch/environment |
| GCP | The caller does not have permission to use this identity pool | attribute-condition rejects the org — check assertion.repository_owner matches exactly |
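Most of the errors in the table come down to the token's `sub` claim not matching what the cloud side expects. The sketch below models that comparison so you can reason about it offline. GitHub issues subjects such as `repo:ORG/REPO:ref:refs/heads/BRANCH` for branch pushes, `repo:ORG/REPO:pull_request` for pull request events, and `repo:ORG/REPO:environment:NAME` when a job targets an environment. The helper name is hypothetical, and `fnmatchcase` is only an approximation of provider-side wildcard matching.

```python
from fnmatch import fnmatchcase


def subject_matches(sub_claim: str, pattern: str, exact: bool = False) -> bool:
    """Compare a token's `sub` claim with the subject configured on the cloud side.

    exact=True models a plain string comparison; otherwise `*` and `?` act as
    wildcards. fnmatch also treats `[...]` specially, which real trust-policy
    matching may not, so keep patterns free of brackets.
    """
    if exact:
        return sub_claim == pattern
    return fnmatchcase(sub_claim, pattern)


if __name__ == "__main__":
    main_push = "repo:YOUR-ORG/YOUR-REPO:ref:refs/heads/main"
    pull_request = "repo:YOUR-ORG/YOUR-REPO:pull_request"
    # A repo-wide wildcard accepts pushes to any branch.
    print(subject_matches(main_push, "repo:YOUR-ORG/YOUR-REPO:*"))
    # A credential scoped to main rejects a pull_request-triggered run.
    print(subject_matches(pull_request, main_push, exact=True))
```

The second call shows the usual cause of a subject mismatch: the credential is scoped to `main`, but the workflow ran on a pull request.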
Shift-Left Security Scanning
Security checks run at three stages: locally via pre-commit hooks, in CI on every push, and at cluster admission via Kyverno. Each layer catches different classes of issues — they are complements, not alternatives.
Scanner Selection Guide
| Purpose | Tool | When | File |
|---|---|---|---|
| Secrets in Git history | Gitleaks | Every push + pre-commit | ci-security/secret-detection/gitleaks.yml |
| Verified live secrets in PRs | TruffleHog | PR only (slow) | ci-security/secret-detection/trufflehog.yml |
| Container image CVEs | Trivy | After every image build | ci-security/container-scanning/trivy-scan.yml |
| Container image CVEs (alternative) | Grype | After every image build | ci-security/container-scanning/grype-scan.yml |
| Terraform misconfigurations | Checkov | Push + PR on terraform/** | ci-security/iac-scanning/checkov.yml |
| Terraform cloud-specific checks | tfsec | Push + PR on terraform/** | ci-security/iac-scanning/tfsec.yml |
| SAST (fast, no account) | Semgrep | Push + PR | ci-security/sast/semgrep.yml |
| SAST (deep, quality gates) | SonarQube | Push + PR | ci-security/sast/sonarqube.yml |
| Dockerfile lint | Hadolint | Pre-commit | .pre-commit-config.yaml |
| npm vulnerabilities | npm audit | Weekly schedule | ci-security/dependency-audit/npm-audit.yml |
| Python vulnerabilities | pip-audit | Weekly schedule | ci-security/dependency-audit/pip-audit.yml |
| .NET NuGet vulnerabilities | dotnet list package --vulnerable | Weekly schedule | ci-security/dependency-audit/nuget-audit.yml |
When a Scanner Fires
False Positive — Gitleaks
Add the finding to the .gitleaks.toml allowlist with a justification comment. Do not make skipping the local hook your default workflow; CI still runs the shared detection policy.
Real Secret in Git History
Rotate immediately, then use git-filter-repo to remove from history. Assume the secret is compromised even if the repo is private.
Container CVE (Trivy)
Update base image or the vulnerable package. Check if a fixed version is available — use trivy image --ignore-unfixed to focus on actionable findings.
IaC Misconfiguration (Checkov)
Fix the Terraform resource. Do not suppress with #checkov:skip without understanding the risk and documenting the exception with a justification.
Secrets Management
Secrets must never be stored in Git — even in private repos. Every secret needs a source of truth in a cloud-managed secret store, a sync mechanism into Kubernetes, a rotation policy, and an offboarding procedure.
Where Secrets Live
| Secret type | Store here | Access via |
|---|---|---|
| Long-lived infrastructure credentials | AWS Secrets Manager / Azure Key Vault / GCP Secret Manager | External Secrets Operator |
| Short-lived cloud credentials in CI | OIDC (no secret storage needed) | GitHub Actions OIDC |
| TLS certificates | cert-manager + Let's Encrypt | Kubernetes Secret (auto-managed) |
| App configuration (non-secret) | Kubernetes ConfigMap | envFrom.configMapRef |
| App secrets | External Secrets Operator → Kubernetes Secret | envFrom.secretRef |
External Secrets Operator
ESO watches ExternalSecret resources in the cluster and syncs values from your cloud secret store into Kubernetes Secrets automatically. The source of truth stays in the cloud store — Kubernetes Secrets are ephemeral replicas.
# 1. Apply SecretStore for your cloud (edit annotations first)
kubectl apply -f secrets/external-secrets/aws-secret-store.yaml
# or: azure-secret-store.yaml | gcp-secret-store.yaml
# 2. Create your ExternalSecret
cp secrets/external-secrets/example-external-secret.yaml \
secrets/external-secrets/my-app-secrets.yaml
kubectl apply -f secrets/external-secrets/my-app-secrets.yaml
# 3. Verify sync (STATUS should be "SecretSynced")
kubectl get externalsecret my-app-secrets -n <namespace>
After a secret changes, either restart the pods (kubectl rollout restart deployment/app) or mount secrets as volumes (volume mounts pick up new versions without restart; env vars do not).
Secret Lifecycle Guides
| Need | Guide |
|---|---|
| Full lifecycle (provision → rotate → offboard) | secrets/guides/secret-lifecycle.md |
| Emergency rotation (secret exposed) | secrets/guides/emergency-rotation.md |
| Decommission a service's secrets | secrets/guides/secret-offboarding.md |
| Scheduled rotation (Terraform-managed) | secrets/rotation/aws-rotation.yml etc. |
Anti-Patterns to Avoid
- ❌ .env files committed to Git
- ❌ Secrets in docker-compose.yml
- ❌ Long-lived service account keys / personal access tokens
- ❌ Same secret in dev, staging, and prod
- ❌ Secrets in container image layers
- ❌ Secrets in Kubernetes ConfigMaps (ConfigMaps are plaintext, and the base64 in a Secret is encoding, not encryption)
Runtime Security
Shift-left scanning catches known vulnerabilities before deployment. Runtime security detects threats and anomalous behavior in live environments — attacks that only appear after a workload is running.
Falco
Falco watches kernel syscalls and Kubernetes audit events for suspicious behavior. Custom rules are in secops/runtime/falco/rules/custom-rules.yaml. Alerts route to Slack or PagerDuty via secops/runtime/falco/rules/alerts.yaml.
Audit Logging
Kubernetes API server audit logs are shipped to Loki via Promtail (secops/runtime/audit-logging/loki-shipper.yaml). The audit policy (secops/runtime/audit-logging/audit-policy.yaml) records all resource modifications and secret access at the metadata level.
SecOps Runbooks
| Incident type | Runbook |
|---|---|
| Compromised pod (unusual network or process behavior) | secops/runbooks/compromised-pod.md |
| Node compromise (node-level access detected) | secops/runbooks/node-compromise.md |
| Secret exposure (credentials in logs/repo) | secops/runbooks/secret-exposure.md |
| Supply chain incident (malicious image/dependency) | secops/runbooks/supply-chain-incident.md |
Compliance as Code
Machine-readable control libraries map specific Kyverno policies and cluster checks to compliance framework controls. Evidence is collected automatically and scored weekly — turning audits into continuous engineering practice.
Supported Frameworks
| Framework | Control library | Key controls covered |
|---|---|---|
| SOC 2 Type II | secops/compliance/control-library/soc2-controls.yaml | CC6.1 (access control), CC6.8 (supply chain), CC7.2 (monitoring), CC7.4 (incident detection), CC8.1 (resource governance) |
| CIS Kubernetes Benchmark | secops/compliance/control-library/cis-kubernetes.yaml | 5.2.x (pod security), 5.3.x (network segmentation), 5.5.1 (image verification) |
| ISO 27001 | secops/compliance/control-library/iso27001.yaml | 8.4 (source control), 8.8 (vulnerability management), 8.28 (SBOM/secure coding) |
Evidence Pipeline
# Run evidence collection manually
bash secops/compliance/scripts/collect-evidence.sh --output-dir ./evidence
# Generate compliance report (fails if score < 85%)
python secops/compliance/scripts/generate-compliance-report.py \
--control-library-dir secops/compliance/control-library \
--evidence-dir ./evidence \
--fail-below 0.85 \
--format markdown
# CI gate equivalent
make compliance-report
secops/compliance/control-library/control-to-policy-map.yaml links each compliance control to the Kyverno policy or script that provides evidence for it. This is the traceability bridge for auditors.
Observability Stack
The three pillars of observability (metrics, logs, and traces) each answer different operational questions. Instrument all three from day one, and correlate them with shared labels and trace IDs to shorten time to diagnosis.
Signal → Question Matrix
| Question | Signal | Tool |
|---|---|---|
| Is my service up and healthy? | Metrics | Prometheus + Alertmanager |
| What happened at 14:23 during the incident? | Logs | Loki + Grafana |
| Why is this specific request slow? | Traces | Tempo + Grafana |
| Is my error rate above the SLO? | Metrics + SLO rules | Prometheus recording rules |
| What did a specific user's request traverse? | Traces | Tempo — filter by trace ID |
| Which service is causing cascading failures? | Traces (service map) | Tempo + Grafana service graph |
Install Order
Install in this order — each layer builds on the previous:
Prometheus Stack
Baseline cluster visibility and alert routing first. helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -f observability/prometheus/values.yaml
Loki Stack
Log aggregation second — incidents need both metrics and logs. helm install loki grafana/loki-stack -f observability/loki/values.yaml
Tempo
Distributed tracing last — most value after metrics and logs are stable. helm install tempo grafana/tempo -f observability/tempo/values.yaml
OTel Collector Sidecar
Add to each app Deployment: kubectl patch deployment <name> --patch-file observability/opentelemetry/collector-sidecar.yaml
OpenTelemetry Language Env Vars
| Language | Env var file |
|---|---|
| .NET | observability/opentelemetry/env-vars/dotnet.env |
| Java | observability/opentelemetry/env-vars/java.env |
| Python | observability/opentelemetry/env-vars/python.env |
Trace Sampling Rates
| Environment | Rate | Rationale |
|---|---|---|
| Local / Kind | 100% | Every request traced — debugging is the priority |
| Dev | 100% | Same as local |
| Staging | 50% | Enough to catch integration issues |
| Production | 10% | Statistically representative; bounded storage cost |
Override with OTEL_TRACES_SAMPLER_ARG in your Helm values or Kustomize overlay. See observability/opentelemetry/env-vars/ for language-specific env files.
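To show why a ratio such as 0.10 yields a consistent subset of traces, here is a simplified sketch of trace-ID ratio sampling. It is modeled on the general approach used by OpenTelemetry SDKs, but the exact algorithm is SDK-specific, so read it as an illustration under that assumption rather than as any SDK's implementation.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # use the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio sampling: one trace ID always gets the same decision."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


if __name__ == "__main__":
    rng = random.Random(42)
    ids = [rng.getrandbits(128) for _ in range(100_000)]
    kept = sum(should_sample(t, 0.10) for t in ids)
    print(f"kept {kept / len(ids):.1%} of traces at ratio 0.10")
```

Because the decision depends only on the trace ID, every service that sees the same trace makes the same choice, so sampled traces stay complete across service boundaries.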
SLOs & Error Budgets
File Layout
| File | Purpose |
|---|---|
| observability/prometheus/slos/slo-schema.yaml | Schema to copy and fill for any service |
| observability/prometheus/slos/my-service-availability-slo.yaml | Worked availability SLO example |
| observability/prometheus/slos/my-service-latency-slo.yaml | Worked latency SLO example |
| observability/prometheus/recording-rules/slo-burn-rates.yaml | Multi-window burn-rate recording rules template |
| observability/prometheus/alerts/slo-burn-rate-alerts.yaml | Four-tier burn-rate alert rules template |
| observability/prometheus/dashboards/slo-status-configmap.yaml | Grafana SLO dashboard (auto-discovered via ConfigMap) |
| docs/runbooks/slo-breach-response.md | On-call runbook for SLO burn-rate alerts |
| docs/runbooks/slo-quarterly-review.md | Quarterly review agenda, decision criteria, error budget policy template |
Error Budget Policy Template
# docs/slo-policies/<service>-error-budget-policy.yaml
service: api-gateway
slo_target: 99.9%
error_budget_monthly_minutes: 43.8
policy:
  above_50_percent_remaining:
    action: normal operations, proceed with planned changes
  below_25_percent_remaining:
    action: pause non-critical feature work, prioritize reliability
    owner: engineering manager
  below_10_percent_remaining:
    action: halt all deployments except rollbacks and critical fixes
    owner: engineering director approval required
  exhausted:
    action: incident response, all hands on reliability
    escalation: VP Engineering
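The 43.8-minute figure in the template can be reproduced with a short calculation. The sketch below assumes an average month of 365.25 / 12 days (43,830 minutes), which is consistent with the value shown; the function names are hypothetical.

```python
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # average month, assumed window


def error_budget_minutes(slo_target_pct: float) -> float:
    """Minutes of allowed unavailability per month for a given SLO target."""
    return (1 - slo_target_pct / 100) * MINUTES_PER_MONTH


def budget_remaining_pct(slo_target_pct: float, downtime_minutes: float) -> float:
    """Share of the monthly error budget still unspent, floored at zero."""
    budget = error_budget_minutes(slo_target_pct)
    return max(0.0, (budget - downtime_minutes) / budget * 100)


if __name__ == "__main__":
    print(round(error_budget_minutes(99.9), 1))
    print(round(budget_remaining_pct(99.9, 35), 1))
```

With 35 minutes of downtime already spent against a 99.9% target, roughly a fifth of the budget remains, which places the service in the `below_25_percent_remaining` tier of the policy.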
Alert Routing
Alerts route to different channels based on severity. Configure receivers in the Alertmanager config in observability/prometheus/values.yaml under alertmanager.config.
| Severity | Goes to | Wake someone? | Config file |
|---|---|---|---|
| critical | PagerDuty | Yes — immediately | notifications/pagerduty-notify.yml |
| warning | Slack | No — review during business hours | notifications/slack-notify.yml |
| info | Grafana annotation / Slack | No — informational only | notifications/grafana-notify.yml |
Available Notification Integrations
| Channel | Config file | Use for |
|---|---|---|
| Slack | notifications/slack-notify.yml | Dev and staging alerts, team notifications |
| PagerDuty | notifications/pagerduty-notify.yml | Production on-call paging (critical severity) |
| Microsoft Teams | notifications/teams-notify.yml | Teams using Microsoft 365 |
| Grafana | notifications/grafana-notify.yml | Deployment markers on dashboards |
| Datadog | notifications/datadog-notify.yml | DORA metrics and APM correlation |
# Test alert routing before going live
kubectl -n monitoring port-forward svc/kube-prometheus-stack-alertmanager 9093:9093
curl -XPOST http://localhost:9093/api/v2/alerts \
-H 'Content-Type: application/json' \
-d '[{"labels":{"alertname":"TestAlert","severity":"warning","team":"my-team"}}]'
FinOps — Cost Governance & Optimization
A production-grade FinOps system that integrates cost visibility, governance, and optimization into the CI/CD lifecycle — from Terraform PRs through to monthly chargeback reports. Cost control is an engineering discipline, not a monthly finance surprise.
Required Cost Labels (Enforced at Admission)
# All pods must carry these labels — Kyverno blocks any missing them
metadata:
  labels:
    app: my-service
    finops.org/costcenter: "engineering" # Must match finops/config/budgets.yaml
    finops.org/environment: "production" # dev | staging | production
    finops.org/team: "platform" # Optional but recommended
    finops.org/project: "my-service" # Optional
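A minimal sketch of the label check follows, useful for validating manifests before they reach admission. It is not the repo's `validate-cost-tags.py` and not the Kyverno policy. The two required keys and the allowed environment values come from the block above; the function name is hypothetical, and the cross-check against finops/config/budgets.yaml is omitted because that file's schema is not shown here.

```python
REQUIRED_LABELS = ("finops.org/costcenter", "finops.org/environment")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}


def label_errors(labels: dict[str, str]) -> list[str]:
    """Return problems with a pod's cost labels; an empty list means compliant."""
    errors = [f"missing label: {key}" for key in REQUIRED_LABELS if not labels.get(key)]
    env = labels.get("finops.org/environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        errors.append(f"invalid finops.org/environment: {env}")
    return errors


if __name__ == "__main__":
    print(label_errors({"app": "my-service"}))
    print(label_errors({"finops.org/costcenter": "engineering",
                        "finops.org/environment": "production"}))
```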
FinOps Quick Start
# 1. Install cost monitoring (Kubecost or OpenCost)
./finops/scripts/install-cost-monitoring.sh --tool kubecost
# 2. Deploy Kyverno policies in audit mode first (non-blocking)
./finops/scripts/deploy-policies.sh --audit-mode
# 3. Import all 9 Grafana dashboards
export GRAFANA_API_KEY=your-key
./finops/scripts/deploy-dashboards.sh
# 4. Validate cost label compliance
python finops/scripts/validate-cost-tags.py --all-namespaces
# 5. Find rightsizing opportunities
python finops/scripts/analyze-rightsizing.py --all-namespaces
FinOps Optimization Loop
The optimization loop closes the gap between cost alert and verified savings with a repeatable, automated workflow. Every step produces artifacts — patches, PR descriptions, advisor reports — so engineers don't start from scratch each time.
Alert Response Guide
| Alert | Immediate action | Script |
|---|---|---|
| BudgetThreshold80 | Identify top spender by team | python finops/scripts/normalize-cloud-costs.py --by team |
| BudgetThreshold100 | Stop non-critical deployments | python finops/scripts/normalize-cloud-costs.py --by team |
| CostAnomalyDetected | Check for runaway workloads | python finops/scripts/detect-underutilized.py |
| VPA recommendation > 7 days old | Generate optimization PR | python finops/scripts/generate-optimization-pr.py |
| Unused volumes detected | Review and delete | python finops/scripts/detect-unused-volumes.py |
Safety Guardrails in Optimization Scripts
- Never reduce memory limits by more than 50% in one change (use --allow-aggressive to override with justification)
- Never recommend limits below the VPA lower bound
- Warn on single-replica workloads with strict PDB before reducing resources
- Never commit reserved capacity for workloads < 3 months old
- Reserved capacity commitments > $10,000/year require leadership approval (documented in finops/docs/optimization-runbook.md)
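The first two guardrails can be sketched as a small pre-flight check. This is an illustration of the rules as stated, not the logic inside the repo's optimization scripts; the function name and the MiB units are assumptions.

```python
def check_memory_change(current_mib: float, proposed_mib: float,
                        vpa_lower_bound_mib: float,
                        allow_aggressive: bool = False) -> list[str]:
    """Return guardrail violations for a proposed memory-limit change."""
    problems = []
    # The VPA lower bound is a hard floor; the override flag does not lift it.
    if proposed_mib < vpa_lower_bound_mib:
        problems.append("proposed limit is below the VPA lower bound")
    reduction = (current_mib - proposed_mib) / current_mib
    if reduction > 0.5 and not allow_aggressive:
        problems.append("reduction exceeds 50% in one change")
    return problems


if __name__ == "__main__":
    print(check_memory_change(1024, 400, 300))
    print(check_memory_change(1024, 600, 300))
```

Keeping the lower-bound check independent of the override reflects the wording of the list: only the 50% rule mentions `--allow-aggressive`.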
Scripts Reference
| Script | What it does | Make target |
|---|---|---|
| analyze-rightsizing.py | Reads VPA recommendations, calculates potential savings per workload, flags high-priority candidates | make finops-rightsizing |
| generate-optimization-pr.py | Generates Kustomize resource patches + PR description from VPA recommendations, with safety checks | make finops-optimize-pr |
| normalize-cloud-costs.py | Cross-cloud cost normalization to (vCPU_hours × 0.048) + (GiB_hours × 0.006). Supports markdown/json/csv output | make finops-normalize-costs |
| reserved-capacity-advisor.py | Risk-scored reserved instance recommendations (low/medium/high) with break-even and annual savings | make finops-reserved-capacity |
| validate-cost-tags.py | Checks all pods carry required finops.org/* labels; exits non-zero if compliance < threshold | (none) |
| detect-underutilized.py | Finds pods where actual CPU/memory usage is < 20% of requests over the past 7 days | (none) |
| detect-unused-volumes.py | Lists PVCs with no active pod binding in the last N days | (none) |
| generate-cost-report.py | Monthly chargeback CSV + JSON report by cost center, exported to S3/Blob/GCS | (none) |
| install-cost-monitoring.sh | Installs Kubecost or OpenCost via Helm with values from finops/helm/ | (none) |
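Two of the rules in the table are simple enough to write down directly: the normalization formula used by `normalize-cloud-costs.py` and the 20% utilization threshold used by `detect-underutilized.py`. The sketch below restates them with hypothetical function names; it does not reproduce either script, and the currency of the rates is not stated in the table, so the result is left unitless.

```python
VCPU_HOUR_RATE = 0.048  # per vCPU-hour, from the Scripts Reference table
GIB_HOUR_RATE = 0.006   # per GiB-hour, from the Scripts Reference table


def normalized_cost(vcpu_hours: float, gib_hours: float) -> float:
    """(vCPU_hours x 0.048) + (GiB_hours x 0.006)."""
    return vcpu_hours * VCPU_HOUR_RATE + gib_hours * GIB_HOUR_RATE


def is_underutilized(usage: float, request: float, threshold: float = 0.20) -> bool:
    """True when usage is below the threshold share of the request."""
    if request <= 0:
        raise ValueError("request must be positive")
    return usage / request < threshold


if __name__ == "__main__":
    print(round(normalized_cost(100, 200), 2))
    print(is_underutilized(usage=0.1, request=1.0))
```

Normalizing to one formula lets you compare a workload's footprint across AWS, Azure, and GCP without first reconciling three different price lists.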
Database Migration Patterns
All migrations must be backwards-compatible, idempotent, and tested locally before production. These three golden rules prevent the most common zero-downtime deployment failures.
Pattern Selection
| Pattern | File | When to use | Runs when |
|---|---|---|---|
| Init container | cd/kubernetes/_patterns/db-migration-init-container.yaml | Fast migrations (< 5 min), non-Helm workload | Every pod start — blocks rollout until complete |
| Job | cd/kubernetes/_patterns/db-migration-job.yaml | Long migrations (> 5 min), backfills, one-time jobs | Explicitly from CI before deployment rollout |
| Helm hook | cd/kubernetes/_patterns/db-migration-hook.yaml | Helm-managed apps | Automatically as pre-upgrade hook; Helm blocks on failure |
Expand / Contract Pattern
Never drop or rename a column in the same release that removes the code using it. Old pods are still running when new pods start:
| Release | Schema change | App change |
|---|---|---|
| Release N | ADD new column (nullable) | New app writes to new column; old app ignores it |
| Release N+1 | Backfill rows; add NOT NULL | Both old and new app work against new schema |
| Release N+2 | No schema change | Remove code using the old column |
| Release N+3 | DROP old column | Safe — nothing reads it |
Zero-Downtime Checklist
- Migration is backwards-compatible with the old app version
- Migration is idempotent (safe to run twice)
- Rollback migration script written and tested
- activeDeadlineSeconds set on the Job or init container
- Database backup taken within the last 24 hours
- Migration tested on staging with production-scale data volume
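The "idempotent" and "backwards-compatible" items are easier to see in code. The sketch below uses SQLite from the standard library to show an expand step that is safe to run twice and a backfill that only touches rows still needing it. The table and column names are hypothetical, and a production migration tool would manage this through its own version tracking.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str,
                          column: str, ddl_type: str) -> bool:
    """Expand step: add a nullable column only if it is not already present.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False  # already applied; running again is a no-op
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True


def backfill(conn: sqlite3.Connection) -> int:
    """Copy old values into the new column for rows not yet backfilled."""
    cur = conn.execute(
        "UPDATE users SET display_name = full_name WHERE display_name IS NULL")
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
    print(add_column_if_missing(conn, "users", "display_name", "TEXT"))
    print(add_column_if_missing(conn, "users", "display_name", "TEXT"))
    # An old app version that knows nothing about display_name still works.
    conn.execute("INSERT INTO users (full_name) VALUES ('Ada')")
    print(backfill(conn))
```

Because the new column is nullable, pods from the previous release keep inserting rows without error while the rollout is in progress.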
Versioning Strategy
Two automated release tools are included. Both require Conventional Commits format — commit messages are parsed to determine version bumps and generate changelogs automatically.
Semantic Versioning (SemVer)
Format: MAJOR.MINOR.PATCH — recommended for libraries and APIs.
| Type | SemVer bump | Example commit |
|---|---|---|
| feat | MINOR (0.x.0) | feat(auth): add OAuth2 login |
| fix | PATCH (0.0.x) | fix(db): handle null cursor on retry |
| feat! or BREAKING CHANGE: footer | MAJOR (x.0.0) | feat!: remove /v1 API endpoints |
| chore, docs, refactor, test | None | docs: update API reference |
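The mapping in the table can be sketched as a small parser. This is a simplified reading of Conventional Commits for illustration; it is not how the repo's release tools are implemented, and the function names are hypothetical.

```python
import re

# type, optional (scope), optional "!" marker, then ": description"
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<breaking>!)?: .+")
ORDER = ["none", "patch", "minor", "major"]


def bump_for(message: str) -> str:
    """Return the SemVer bump implied by one commit message."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        return "none"
    has_footer = any(line.startswith("BREAKING CHANGE:") for line in lines[1:])
    if match.group("breaking") or has_footer:
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match.group("type"), "none")


def highest_bump(messages: list[str]) -> str:
    """A release takes the largest bump among its commits."""
    return max((bump_for(m) for m in messages), key=ORDER.index, default="none")


if __name__ == "__main__":
    print(bump_for("feat(auth): add OAuth2 login"))
    print(highest_bump(["docs: update API reference",
                        "fix(db): handle null cursor on retry"]))
```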
Image Tagging Rules
# Tag with Git SHA (always unique, traceable)
docker build -t my-app:sha-$(git rev-parse --short HEAD) .
# Also tag with semver on release
docker tag my-app:sha-abc1234 my-app:1.2.3
docker tag my-app:sha-abc1234 my-app:1.2
Never use :latest in production. It is not immutable: it changes on every build, making rollbacks impossible and Dependabot drift detection unreliable. Kyverno blocks :latest in production namespaces.
Branching Strategy
| Strategy | Best for | Branches |
|---|---|---|
| Trunk-Based (Recommended for CI/CD) | Teams with good test coverage and CI discipline | Short-lived feature branches (1–3 days) → main |
| GitFlow | Scheduled release cadence, multiple versions in production | main, develop, feature/*, release/*, hotfix/* |
| GitHub Flow | Web apps with continuous deployment | Feature branches merged via PR → main = deploy |
Branch Protection (Required on main)
- Require pull request with at least 1 reviewer
- Require status checks to pass (CI workflow names)
- Require branches to be up to date before merging
- No force push; no deletion
Runbooks
Action-oriented operational procedures for incidents and routine operations. Every alert in production must link to a runbook via the runbook_url annotation on the PrometheusRule.
Included Runbooks
| Runbook | Triggers | Key content |
|---|---|---|
| docs/runbooks/podcrashloobackoff.md | PodCrashLoopBackOff alert | Exit code diagnosis (137=OOM, 1=app crash, 132/139=segfault), liveness probe misconfiguration, rollback procedures |
| docs/runbooks/slo-breach-response.md | SLOBurnRateCritical/High/Medium/Low | Severity triage by burn rate tier, mitigation options (rollback/scale/feature flag), error budget remaining check |
| docs/runbooks/slo-quarterly-review.md | Quarterly schedule | 60-min agenda, decision criteria for tightening/relaxing targets, error budget policy template |
| docs/runbooks/template.md | Write new runbooks | Standard sections: overview, user impact, immediate steps, diagnosis checklist, likely causes, Grafana queries, escalation, post-incident |
| secops/runbooks/compromised-pod.md | Falco alert / anomaly detection | Pod isolation, evidence collection, forensic preservation, remediation |
| secops/runbooks/secret-exposure.md | TruffleHog alert / security report | Immediate rotation, scope assessment, Git history remediation, audit |
Write a New Runbook
# Copy the template
cp docs/runbooks/template.md docs/runbooks/<alertname>.md
# Name the file to match the alert name exactly (lowercase, no spaces)
# e.g.: docs/runbooks/highlatencyp99.md
# Link it from the PrometheusRule
annotations:
  runbook_url: "https://github.com/your-org/repo/blob/main/docs/runbooks/<alertname>.md"
# Commit to docs/runbooks/ so it appears in PR review
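The naming rule (lowercase, no spaces, matching the alert name) is easy to automate when generating PrometheusRule annotations. A minimal sketch with hypothetical helper names follows; the default repository URL is the placeholder from the example above.

```python
def runbook_filename(alert_name: str) -> str:
    """Lowercase the alert name and drop whitespace, then add .md."""
    return "".join(alert_name.split()).lower() + ".md"


def runbook_url(alert_name: str,
                repo_url: str = "https://github.com/your-org/repo") -> str:
    """Build the value for the runbook_url annotation."""
    return f"{repo_url}/blob/main/docs/runbooks/{runbook_filename(alert_name)}"


if __name__ == "__main__":
    print(runbook_filename("HighLatencyP99"))
    print(runbook_url("HighLatencyP99"))
```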
Architecture Decision Records
ADRs capture why decisions were made — not just what was decided. They prevent repeated debates and preserve context when team members change. All ADRs live in docs/decisions/.
| ADR | Decision | Key rationale |
|---|---|---|
| ADR-001-folder-structure.md | Organize by functional concern first, then platform | Engineers ask "I need to deploy to AKS" not "I need a GitHub Actions file." Concern-first mirrors how teams reason about problems. |
| ADR-002-helm-vs-kustomize.md | Include both; Kustomize for internal apps, Helm for distributed packages | Kustomize is kubectl-native, lower learning curve, readable YAML. Helm adds value for versioned chart distribution and complex conditional logic. |
| ADR-003-gitops-strategy.md | ArgoCD as default; Flux included | ArgoCD: richer UI, intuitive Application CRD, App-of-Apps scales well. Flux: better multi-tenancy for large orgs. 2026 extension adds fleet management via ApplicationSet list generator. |
Makefile & Taskfile Targets
All common commands are abstracted behind make targets. Run make help to see the full list at any time.
| Target | What it does |
|---|---|
| make check-prereqs | Verify all required tools are installed and at correct versions |
| make dev | Start local kind cluster with registry, ingress, and dev namespace |
| make dev-compose | Start local stack via Docker Compose (no Kubernetes) |
| make hooks | Install pre-commit hooks (run once after cloning) |
| make lint | Run all pre-commit hooks against all files |
| make teardown | Destroy local kind cluster and registry |
| make logs | Follow Docker Compose logs |
| Target | What it does |
|---|---|
| make build | Build Docker image tagged IMAGE_NAME:IMAGE_TAG |
| make build-push | Build and push image to local kind registry |
| make test | Run unit tests (override with your test command) |
| make tag-release | Create and push signed semver git tag (prompts for version) |
| Target | What it does |
|---|---|
| make deploy-dev | Build, push, apply Kustomize dev overlay to local cluster |
| make deploy-staging | Apply Kustomize staging overlay (requires staging kubeconfig) |
| make k8s-status | Show pods, services, ingresses across all namespaces |
| make rollout-status | Check rollout health for all deployments |
| Target | What it does |
|---|---|
| make policy-report | Show Kyverno policy violation report across all namespaces |
| make compliance-report | Generate SOC 2/CIS/ISO 27001 compliance report, fail if score < 80% |
| make slo-validate | Lint SLO YAML files, verify recording rule naming convention |
| Target | What it does |
|---|---|
| make finops-rightsizing | Analyze CPU/memory rightsizing across all namespaces |
| make finops-optimize-pr | Generate draft optimization PR with rightsizing changes |
| make finops-normalize-costs | Normalize cross-cloud costs to standard unit for comparison |
| make finops-reserved-capacity | Evaluate reserved instance / savings plan recommendations |
| Target | What it does |
|---|---|
| make catalog-validate | Validate all service and team catalog entries (CI gate equivalent) |
| make catalog-codeowners | Regenerate .github/CODEOWNERS from catalog team definitions |
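The `make compliance-report` target is described as failing when the score drops below 80%. The repository's actual scoring logic is not shown here, so the following is only a minimal sketch of how such a gate could work, assuming the score is the share of passed controls. The function names and control IDs are hypothetical.

```python
def compliance_score(results):
    """Return the percentage of passed controls (0.0 when there are none)."""
    if not results:
        return 0.0
    passed = sum(1 for ok in results.values() if ok)
    return 100.0 * passed / len(results)


def gate(results, threshold=80.0):
    """Return a process exit code: 0 when the score meets the threshold, else 1."""
    return 0 if compliance_score(results) >= threshold else 1


# Hypothetical control results; the IDs are illustrative only.
sample = {
    "CIS-5.1.1": True,
    "CIS-5.2.2": True,
    "SOC2-CC6.1": True,
    "ISO-A.12.4": False,
    "CIS-5.7.3": True,
}
print(f"score={compliance_score(sample):.1f}% exit_code={gate(sample)}")
```

Returning an exit code rather than raising keeps the gate easy to call from a Makefile recipe or CI step, where a non-zero status is what fails the job.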
Supply Chain Security
Keyless signing, SBOM attestation, SLSA provenance, and Kyverno admission verification — wired as a standard delivery path. Zero private keys are stored anywhere.
Prerequisites
| Requirement | Why required |
|---|---|
| cosign CLI | Verify signatures and attestations locally |
| GitHub Actions OIDC federation | Required before keyless signing can mint OIDC-backed certificates |
| Kyverno installed in cluster | Admission verification and PolicyReport compliance output |
Implementation Steps
# Step 1: Configure OIDC first (prerequisite for keyless signing)
# See docs/guides/github-actions-oidc.md
# Step 2: Apply supply chain policies
kubectl apply -f secops/supply-chain/cosign-verify-policy.yaml
kubectl apply -f secops/supply-chain/sbom-policy.yaml
kubectl apply -f secops/supply-chain/slsa-verify.yaml
# Step 3: Wire the reusable workflow (copy from dotnet example)
# ci/github-actions/dotnet/supply-chain-integration.yml
# Step 4: Start in audit mode — non-blocking migration
kubectl label namespace <ns> policy.kyverno.io/supply-chain=audit --overwrite
# Step 5: Verify signature
cosign verify <image> \
--certificate-identity-regexp 'https://github.com/YOUR-ORG/.*' \
--certificate-oidc-issuer https://token.actions.githubusercontent.com
# Step 6: Verify SBOM attestation
cosign verify-attestation \
--certificate-identity-regexp 'https://github.com/YOUR-ORG/.*' \
--certificate-oidc-issuer https://token.actions.githubusercontent.com \
--type spdx <image>
# Step 7: Graduate to enforce by removing the audit label (after all violations are resolved)
kubectl label namespace <ns> policy.kyverno.io/supply-chain-
# Step 8: Monitor compliance
kubectl get policyreport -A
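Before graduating a namespace from audit to enforce (Step 7), it helps to confirm that the namespace has no outstanding failures in the output of Step 8. The sketch below assumes the JSON from `kubectl get policyreport -A -o json` follows the PolicyReport layout, with an `items` list whose entries carry `metadata.namespace` and a `results` list where each entry has a `result` field such as `pass` or `fail`. The helper names are hypothetical.

```python
from collections import Counter


def failures_by_namespace(policy_reports):
    """Count 'fail' results per namespace in a parsed PolicyReport list."""
    counts = Counter()
    for report in policy_reports.get("items", []):
        namespace = report.get("metadata", {}).get("namespace", "<cluster>")
        for result in report.get("results", []):
            if result.get("result") == "fail":
                counts[namespace] += 1
    return dict(counts)


def ready_to_enforce(policy_reports, namespace):
    """True when the namespace has no failing results left."""
    return failures_by_namespace(policy_reports).get(namespace, 0) == 0


# Illustrative data shaped like the assumed kubectl JSON output.
sample_reports = {
    "items": [
        {
            "metadata": {"namespace": "payments"},
            "results": [
                {"policy": "cosign-verify-policy", "result": "fail"},
                {"policy": "sbom-policy", "result": "pass"},
            ],
        },
        {
            "metadata": {"namespace": "web"},
            "results": [{"policy": "cosign-verify-policy", "result": "pass"}],
        },
    ]
}
print(failures_by_namespace(sample_reports))
print(ready_to_enforce(sample_reports, "web"))
```

In practice the dictionary would come from `json.load` on the kubectl output rather than from inline sample data.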
Policy Guardrails
| Rule | Enforced by |
|---|---|
| No unsigned images in production | cosign-verify-policy.yaml (Enforce) |
| No images without SBOM in production | sbom-policy.yaml (Enforce) |
| No images without SLSA provenance | slsa-verify.yaml (Enforce) |
| Supply chain tool versions pinned to SHA | Code review + reusable-supply-chain.yml |
| Keyless signing only — no private keys stored | OIDC-only workflow design |
FinOps Optimization
Close the loop from a cost alert to a merged PR to verified savings with a repeatable, automated workflow. Every step produces concrete artifacts — no starting from scratch each time.
Identify the Trigger
Confirm which alert fired or which KPI changed in the monthly report. Reference: finops/docs/optimization-runbook.md
Run Cross-Cloud Normalization
python finops/scripts/normalize-cloud-costs.py --by team
Find top movers by team and cloud before selecting an action branch.
Rightsizing Analysis
python finops/scripts/analyze-rightsizing.py --namespace <ns>
Focus on candidates with highest monthly savings and low operational risk first.
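The prioritization rule above (highest monthly savings, low operational risk first) can be sketched as a filter followed by a sort. The output format of analyze-rightsizing.py is not shown here, so the record fields `workload`, `monthly_savings_usd`, and `risk` are assumptions made for illustration.

```python
# Hypothetical risk levels, ordered from safest to riskiest.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def prioritize(candidates, max_risk="low", min_savings=0.0):
    """Keep candidates at or below max_risk, then sort by savings, largest first."""
    allowed = RISK_ORDER[max_risk]
    eligible = [
        c for c in candidates
        if RISK_ORDER[c["risk"]] <= allowed
        and c["monthly_savings_usd"] >= min_savings
    ]
    return sorted(eligible, key=lambda c: c["monthly_savings_usd"], reverse=True)


# Illustrative candidates; the values are made up.
candidates = [
    {"workload": "api", "monthly_savings_usd": 120.0, "risk": "low"},
    {"workload": "worker", "monthly_savings_usd": 400.0, "risk": "high"},
    {"workload": "cron", "monthly_savings_usd": 35.0, "risk": "low"},
    {"workload": "cache", "monthly_savings_usd": 15.0, "risk": "low"},
]
for c in prioritize(candidates, max_risk="low", min_savings=20.0):
    print(c["workload"], c["monthly_savings_usd"])
```

Filtering on risk before sorting keeps a large but risky saving, such as the `worker` entry here, from crowding out safer changes.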
Generate Optimization PR Assets
python finops/scripts/generate-optimization-pr.py \
--namespace <ns> --min-savings 20
Outputs Kustomize patches under optimization-patches/<ns>/ and a pre-filled PR-DESCRIPTION.md.
Test in Staging for 24 Hours
kubectl apply -k optimization-patches/<ns>/
Monitor for OOM kills, error-rate changes, and latency regressions before promoting to production.
Merge & Verify Savings
Merge the PR and validate savings in Grafana Optimization dashboards within 48 hours. If savings materially change the budget forecast, update finops/config/budgets.yaml.
Reserved Capacity (Quarterly Review)
# Risk-scored reserved instance recommendations
python finops/scripts/reserved-capacity-advisor.py \
--min-savings 500 --max-risk medium --term 1yr
# Safety rules:
# - Never recommend 3-year for workloads < 3 months old
# - Commitments > $10k/year require leadership approval
# - Always test in staging 24h before production changes
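The three safety rules can be read as checks applied to each recommendation before anyone acts on it. The following is a minimal sketch under that reading. It is not the logic of reserved-capacity-advisor.py, and the function name and parameters are hypothetical.

```python
def review_commitment(term_years, workload_age_months, annual_commitment_usd):
    """Apply the safety rules and return (allowed, notes) for a proposed commitment."""
    allowed = True
    notes = []
    # Rule 1: never recommend a 3-year term for workloads younger than 3 months.
    if term_years == 3 and workload_age_months < 3:
        allowed = False
        notes.append("3-year term rejected: workload is younger than 3 months")
    # Rule 2: commitments above $10k/year need leadership approval.
    if annual_commitment_usd > 10_000:
        notes.append("leadership approval required: commitment exceeds $10k/year")
    # Rule 3: staging validation applies to every change.
    notes.append("test in staging for 24h before production changes")
    return allowed, notes


allowed, notes = review_commitment(
    term_years=3, workload_age_months=2, annual_commitment_usd=12_500
)
print(allowed)
for note in notes:
    print("-", note)
```

Only the first rule blocks a recommendation outright. The other two attach required follow-up actions, which matches how the rules are phrased.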
Disaster Recovery
Recovery procedures for the three most common disaster scenarios. Use the decision tree to identify which section applies and follow the steps in order.
RTO / RPO Summary
| Component | RPO (data loss) | RTO (downtime) | Method |
|---|---|---|---|
| Kubernetes workloads | 24h (daily backup) | 2h | Velero restore to new cluster |
| AWS RDS | 5 min (PITR) | 1h | RDS PITR restore |
| Azure PostgreSQL | 5 min (PITR) | 1–2h | Flexible Server PITR / geo-restore |
| GCP Cloud SQL | 5 min (PITR) | 30 min | Cloud SQL PITR or replica promotion |
| Secrets (ESO) | 0 (live in cloud store) | 15 min | Reinstall ESO + apply ClusterSecretStore |
Section 1 — Cluster Recovery
# 1. Provision new cluster from existing Terraform
terraform -chdir=terraform/aws-eks apply \
-var="project=<project>" -var="environment=production"
# 2. Install Velero (pointing to same backup bucket)
bash backup/velero/aws-install.sh
# 3. Restore latest scheduled backup
velero backup get
velero restore create \
--from-schedule daily-full-backup \
--restore-volumes=true
velero restore describe <restore-name> --details
# 4. Verify ESO re-syncs secrets automatically
kubectl get externalsecret -A
# 5. Validate cluster health
kubectl get nodes
kubectl get pods -A
curl -f https://my-app.example.com/health
Section 2 — Database Recovery (AWS RDS PITR)
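# Check the latest restorable time; pick a restore time before the corruption event and no later than this value
aws rds describe-db-instances \
--db-instance-identifier rds-<project>-production \
--query 'DBInstances[0].LatestRestorableTime'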
# Restore to new instance
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier rds-<project>-production \
--target-db-instance-identifier rds-<project>-production-restored \
--restore-time "2026-01-15T14:30:00Z"
# Wait for availability (~10-20 minutes)
aws rds wait db-instance-available \
--db-instance-identifier rds-<project>-production-restored
# Update connection string in AWS Secrets Manager, then restart pods
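Choosing the value for `--restore-time` is the step most likely to go wrong under pressure: it must be earlier than the corruption event and no later than the instance's latest restorable time. A small sanity check, sketched below, can catch a bad timestamp before the restore is started. The function names are hypothetical, and timestamps are assumed to be UTC in the `YYYY-MM-DDTHH:MM:SSZ` form used above.

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' timestamp into an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def restore_time_problems(restore_time, corruption_start, latest_restorable):
    """Return a list of problems with the chosen restore time (empty when valid)."""
    restore = parse_utc(restore_time)
    problems = []
    if restore >= parse_utc(corruption_start):
        problems.append("restore time is not before the corruption event")
    if restore > parse_utc(latest_restorable):
        problems.append("restore time is later than the latest restorable time")
    return problems


# Illustrative values; only the restore time comes from the example above.
print(restore_time_problems(
    restore_time="2026-01-15T14:30:00Z",
    corruption_start="2026-01-15T14:42:00Z",
    latest_restorable="2026-01-15T15:05:00Z",
))
```

An empty list means the chosen timestamp satisfies both constraints. Anything else should be resolved before running the restore.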
Post-Recovery Checklist
- All pods are Running (no CrashLoopBackOff)
- Ingress endpoints respond with HTTP 200
- Database connection strings point to recovered instance
- ExternalSecrets are Synced (kubectl get externalsecret -A)
- Monitoring/alerting is active (Prometheus targets healthy)
- On-call acknowledged — incident ticket updated with timeline
- Post-mortem scheduled within 48 hours
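The first checklist item can be verified mechanically rather than by scanning `kubectl get pods -A` by eye. The sketch below assumes input shaped like `kubectl get pods -A -o json`, where each pod has `status.phase` and `status.containerStatuses[].state.waiting.reason`. The function name is hypothetical. Pods in CrashLoopBackOff usually still report phase `Running`, which is why the container states are checked as well.

```python
def unhealthy_pods(pod_list):
    """Return (namespace, name, reason) for pods that are not healthy."""
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed Job pods are not failures
        reason = None if phase == "Running" else f"phase={phase}"
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting")
            if waiting and waiting.get("reason") == "CrashLoopBackOff":
                reason = "CrashLoopBackOff"
        if reason:
            problems.append((meta.get("namespace"), meta.get("name"), reason))
    return problems


# Illustrative data shaped like the assumed kubectl JSON output.
sample_pods = {
    "items": [
        {"metadata": {"namespace": "dev", "name": "api-1"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"state": {"running": {}}}]}},
        {"metadata": {"namespace": "dev", "name": "worker-1"},
         "status": {"phase": "Running",
                    "containerStatuses": [
                        {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
        {"metadata": {"namespace": "dev", "name": "migrate-1"},
         "status": {"phase": "Succeeded"}},
        {"metadata": {"namespace": "dev", "name": "cache-1"},
         "status": {"phase": "Pending"}},
    ]
}
for problem in unhealthy_pods(sample_pods):
    print(problem)
```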
Concepts & Glossary
Canonical terminology used consistently across all templates, guides, and review language in this repository.
| Term | Definition |
|---|---|
| Golden Path | Preferred, opinionated, end-to-end implementation workflow. Not optional guidance — it encodes production experience and enforces guardrails. |
| Template | Reusable starting file intended for adaptation by teams. Copy it, change the <-- CHANGE THIS markers, keep the guardrails. |
| Guardrail | Mandatory safety or governance control embedded in templates and CI/CD. Not configurable — removing a guardrail is a deliberate exception requiring documented justification. |
| Baseline | Minimum acceptable standard for production readiness. A service not meeting the baseline is not production-ready, regardless of feature completeness. |
| Runbook | Action-oriented operational procedure for incidents or routine operations. Linked from alert annotations so on-call engineers see it immediately when an alert fires. |
| Target | Deployment destination template under cd/targets/. One target per cloud platform and deployment model. |
| Overlay | Environment-specific Kustomize customization on top of base Kubernetes manifests. Only patches what differs — never duplicates the full manifest. |
| Error Budget | The allowed unreliability calculated from the SLO target. Spend it on risky features; run out and reliability work takes priority over feature work. |
| Burn Rate | Speed of error budget consumption relative to the expected rate. 1× = on track to spend exactly the budget over the SLO window. 14.4× = a 30-day budget exhausted in about 50 hours (2% of the budget consumed per hour). |
| OIDC | OpenID Connect federation for short-lived cloud authentication. The correct alternative to static long-lived credentials in CI/CD. |
| ESO | External Secrets Operator. Syncs secrets from cloud secret managers (AWS SM, Azure KV, GCP SM) into Kubernetes Secrets automatically. |
| GitOps | Deployment model where desired cluster state is declared in Git and a controller continuously reconciles actual state to match. ArgoCD and Flux are the GitOps controllers in this repo. |
| SBOM | Software Bill of Materials. Machine-readable inventory of all components in a built artifact — enables supply chain risk assessment and CVE impact analysis. |
| SLSA | Supply-chain Levels for Software Artifacts. Framework for build provenance — proves an artifact was built from a specific source by a specific process. |
| ADR | Architecture Decision Record. Lightweight document capturing what was decided, why, what alternatives were considered, and what the consequences are. |
| FinOps | Cloud financial operations — the practice of bringing financial accountability to cloud spend as an engineering discipline, not a monthly finance task. |
| VPA | Vertical Pod Autoscaler. Generates CPU and memory right-sizing recommendations. Used in Off mode here — recommendations only, no automatic mutation of running pods. |
| HPA | Horizontal Pod Autoscaler. Scales pod count based on CPU, memory, or custom metrics. Target CPU at 70% to provide headroom before scaling lag causes user-visible latency. |
| PDB | PodDisruptionBudget. Ensures a minimum number of pods remain available during voluntary disruptions (node drains, rolling updates). Required for all production services. |
| Kyverno | Kubernetes-native policy engine. Validates, mutates, and generates Kubernetes resources at admission time. Policies are Kubernetes custom resources — no Rego required. |
| Falco | Runtime security engine. Watches kernel syscalls and Kubernetes audit events for suspicious behavior patterns in live containers. |
| Velero | Kubernetes backup and restore tool. Backs up cluster objects and persistent volume snapshots. Used with DB-native backups for complete DR coverage. |
| Infracost | Terraform PR cost estimation tool. Posts monthly cost delta as a PR comment and blocks merges when cost growth exceeds configured thresholds. |
| OTel / OpenTelemetry | Vendor-neutral telemetry collection standard. Provides a single API for metrics, logs, and traces across languages and backends. |
| SLI / SLO / SLA | SLI = the metric measured. SLO = target level for that metric (internal). SLA = external customer-facing commitment derived from the SLO with a penalty clause. |
| Cosign | Tool for signing and verifying container images and other OCI artifacts. Used in keyless mode here — signatures are tied to OIDC identity, not stored private keys. |
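The Error Budget and Burn Rate entries can be made concrete with a little arithmetic. The sketch below assumes a 30-day SLO window and a 99.9% availability target. Both are illustrative choices, not values taken from this repository, and the function names are hypothetical.

```python
def error_budget(slo_target):
    """Allowed unreliability as a fraction, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(observed_error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / error_budget(slo_target)


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate


def budget_spent(rate, hours, window_days=30):
    """Fraction of the window's budget consumed after burning at 'rate' for 'hours'."""
    return rate * hours / (window_days * 24)


slo = 0.999             # assumed 99.9% target
observed = 0.0144       # assumed 1.44% of requests currently failing
rate = burn_rate(observed, slo)
print(f"burn rate: {rate:.1f}x")
print(f"hours until budget is exhausted: {hours_to_exhaustion(rate):.0f}")
print(f"budget spent in one hour: {budget_spent(rate, 1):.1%}")
```

At a burn rate of 1 the budget lasts exactly the full window (720 hours for 30 days). At 14.4 it lasts about 50 hours, which is why that rate is commonly used as a fast-burn paging threshold.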