
Production-Grade DevOps & Platform from Day One

A battle-tested, opinionated reference kit for cloud-native teams — covering CI/CD, Kubernetes, security, observability, FinOps, and SRE practices with copy-ready templates and clear guardrails.

16 Golden Paths | Multi-Cloud · AWS / Azure / GCP | Security by Default | FinOps Governance Built-in

Implementation reference: https://vivek-doshi.github.io/devops-playbook/

Quick Start

From zero to first PR in under an hour. Follow this sequence — no archaeology of the repo required.

1

Check Prerequisites

Run the environment checker to verify your local toolchain is complete before touching anything else.

bash scripts/env-checker.sh

Requires: Git 2.40+, Docker Desktop 4.x, kubectl 1.29+, kind 0.24+, Helm 3.14+, Terraform 1.7+, pre-commit 3.x

2

Install Git Hooks

Hooks catch secrets, IaC formatting errors, and code quality issues before they reach CI — making feedback loops faster and CI less noisy.

make hooks
# equivalent to:
pre-commit install
pre-commit install --hook-type pre-push

3

Start Local Kubernetes Cluster

A kind (Kubernetes in Docker) cluster that mirrors the production overlay structure — with a local registry, ingress-nginx, and a dev namespace ready.

make dev
# Creates kind cluster "devops-playbook"
# Starts local registry at localhost:5001
# Installs ingress-nginx via Helm
# Applies dev Kustomize overlay

4

Pick Your Golden Path

Each path is an opinionated, end-to-end workflow with exact file references, guardrails, and validation steps baked in.

Kubernetes Microservice · Frontend SPA · Serverless App · Data Pipeline · MLOps

5

Open a PR

Branch names follow type/description. Commits must follow Conventional Commits — the pre-push hook will reject non-conforming messages.

git checkout -b feat/add-service-name
make lint          # run all pre-commit hooks
git commit -m "feat(api): add health endpoint"
git push origin feat/add-service-name
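
To make the naming rules concrete, here is a minimal Python sketch of the two checks described above: a type/description branch name and a Conventional Commits subject line. The list of accepted types and both regular expressions are assumptions for illustration; the repository's actual hook configuration is the authority.

```python
import re

# Assumed set of commit/branch types; the real list lives in the repo's hook config.
TYPES = ("feat", "fix", "docs", "style", "refactor", "perf",
         "test", "build", "ci", "chore", "revert")
_TYPE_ALT = "|".join(TYPES)

# type/description, e.g. feat/add-service-name
BRANCH_RE = re.compile(rf"^(?:{_TYPE_ALT})/[a-z0-9][a-z0-9._-]*$")

# type(optional-scope)!: subject, e.g. feat(api): add health endpoint
COMMIT_RE = re.compile(rf"^(?:{_TYPE_ALT})(?:\([a-z0-9_-]+\))?!?: \S.*$")


def is_valid_branch(name: str) -> bool:
    """True if the branch name follows the type/description convention."""
    return BRANCH_RE.match(name) is not None


def is_valid_commit_subject(subject: str) -> bool:
    """True if the first line of a commit message is a Conventional Commit."""
    return COMMIT_RE.match(subject) is not None
```

Running both helpers locally before pushing gives the same answer the hook would, assuming the hook uses an equivalent pattern.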

Repository Structure

Organized by functional concern first, then by platform and technology — so you ask "I need to deploy to AKS" and navigate directly there, not "what type of file is this".

Directory | Purpose | Key files
docker/ | Multi-stage Dockerfiles for every major stack (.NET, Python, Node.js, Java, Go, Ruby, React, Angular) | Dockerfile.api, Dockerfile.worker, security-hardened.Dockerfile
compose/ | Docker Compose stacks for local development with realistic service topologies | microservices-example/, python-postgres-redis/
ci/ | CI pipeline templates for GitHub Actions, Azure Pipelines, GitLab CI, and Jenkins, covering all major stacks | _shared/reusable-*.yml, _strategies/
cd/ | CD manifests, GitOps definitions, Helm charts, Kustomize overlays, deployment targets | kubernetes/_base/, gitops/argocd/, targets/
terraform/ | IaC blueprints for EKS, AKS, GKE, ECS, Lambda with bootstrap modules | _bootstrap/, aws-eks/backup.tf
ci-security/ | Security scanning integrations: SAST, container scan, secret detection, IaC scan, dependency audit | trivy-scan.yml, gitleaks.yml, checkov.yml
secops/ | Runtime security, supply chain controls, compliance control libraries, incident runbooks | runbooks/, compliance/, supply-chain/
policy/ | Kyverno admission policies and Conftest/OPA rules enforced at cluster level | kyverno/require-*.yaml, conftest/kubernetes/
secrets/ | External Secrets Operator configs, rotation workflows, lifecycle guides | external-secrets/, rotation/, guides/
observability/ | Full telemetry stack (Prometheus, Loki, Tempo, OpenTelemetry) with SLOs and dashboards | prometheus/slos/, prometheus/recording-rules/
finops/ | Cost monitoring, Kyverno cost policies, Infracost CI, rightsizing scripts, 9 Grafana dashboards | scripts/analyze-rightsizing.py, policies/, dashboards/
catalog/ | Git-native service and team registry with CI validation and CODEOWNERS generation | schema/service.yaml, scripts/validate-catalog.py
backup/ | Velero cluster backup and DB PITR Terraform modules | velero/schedule.yaml, terraform/aws-rds-backup.tf
docs/ | Golden paths, architecture guides, runbooks, ADRs, environment strategy | golden-paths/, guides/, runbooks/

Design Principles

Clarity Over Abstraction

Every template is readable in one pass. No hidden control flow or magic variables buried three layers deep.

Golden Paths Over Flexibility

Opinionated choices are made for you. Adapt the templates but preserve the guardrails — that's the contract.

Production-Grade Defaults

Resource limits, non-root containers, read-only filesystems, and health probes are required, not optional.

Security by Default

OIDC over static credentials. External secrets over ConfigMaps. Kyverno blocks at admission, before a non-compliant resource is ever persisted.

Dev Environment

Two modes: the devcontainer (recommended for onboarding and pair programming — guarantees identical toolchain) or local machine with manual tool installation.

Open the repo in VS Code and select Reopen in Container. The devcontainer installs all required tools at pinned versions and configures the environment automatically.

# All tools are pre-installed in the container:
# kubectl, helm, kind, terraform, pre-commit, ruff, mypy
# hadolint, checkov, gitleaks, cosign, velero, yq, jq

# Config: .devcontainer/devcontainer.json
# Dockerfile: .devcontainer/Dockerfile
# Post-create: .devcontainer/scripts/post-create.sh
💡
Version pinning: The devcontainer Dockerfile pins every CLI tool by exact version. Run pre-commit autoupdate quarterly to refresh hook revisions.

On a local machine with manually installed tools:

# 1. Verify all tools
bash scripts/env-checker.sh

# 2. Start local Kubernetes cluster
make dev
# → kind cluster "devops-playbook" with registry + ingress

# 3. Install git hooks
make hooks

# 4. Run all linters
make lint

# 5. Deploy to local cluster
make deploy-dev

# 6. Check status
make k8s-status

# 7. Tear down
make teardown

For MLOps and GPU workloads, use the dedicated CUDA devcontainer:

# Open: .devcontainer/gpu/devcontainer.json
# Requires: NVIDIA drivers + Docker GPU pass-through

# Validate GPU visibility
nvidia-smi

# Start Jupyter
python3 -m jupyter lab --ip=0.0.0.0 --no-browser

See .devcontainer/gpu/README.md for full GPU cluster provisioning with Terraform GPU node groups.

Golden Paths

Golden paths are opinionated, end-to-end workflows with exact file references and enforced guardrails. They eliminate decision fatigue and encode hard-won production experience. Pick your scenario and follow the steps — the right templates, security gates, and operational hooks are already connected.

🗺
How to choose: Start with your deployment target (Kubernetes, serverless, app service) then refine by workload type (API, frontend, batch, mobile). If in doubt, start with Platform Onboarding to establish foundations before any application path.
Kubernetes Microservice
Backend API on EKS/AKS/GKE. Includes GitOps, DB migrations, secrets, observability, FinOps checkpoints, and backup.
13 Steps · Most Complete
📊
SLO-Driven Development
Define reliability targets, write recording rules, configure multi-window burn-rate alerts, and establish error budget policy.
8 Steps · SRE Practice
💰
FinOps Optimization
Close the loop from cost alert to merged PR to verified savings. Rightsizing, reserved capacity, and cross-cloud normalization.
9 Steps · Cost Reduction
🔐
Supply Chain Security
Keyless image signing, SBOM attestation, SLSA provenance, and Kyverno admission verification. Zero private keys stored.
9 Steps · Security
🚨
Incident Response
5-phase ops runbook: acknowledge → triage → mitigate → resolve → PIR. Covers rollback, scaling, DB recovery, and Velero restore.
5 Phases · Operations
📋
Service Catalog
Git-native service registration with owner, on-call routing, SLO link, cost center — validated in CI and enforced at admission.
7 Steps · Governance
🏗
Platform Onboarding
New team setup: toolchain, hooks, GitHub config, OIDC, Terraform state, namespace, secrets, observability, alert routing, runbooks.
11 Steps · Foundation
📱
Mobile Backend (BFF)
API versioning, OAuth2/PKCE, APNs/FCM push notifications, per-user rate limiting. 12-month deprecation policy built in.
8 Steps · Mobile
🏢
Multi-Tenant SaaS
Namespace-per-tenant isolation, automated onboarding, per-tenant secrets/schema, billing instrumentation, safe offboarding.
9 Steps · SaaS
📜
Compliance Reporting
SOC 2, CIS Kubernetes, ISO 27001 — machine-readable control libraries, automated evidence collection, CI scoring gates.
9 Steps · Compliance

Kubernetes Microservice — End to End

The most complete path: backend API or microservice from local dev to production on any cloud-managed Kubernetes cluster. Every step names the exact file to copy or edit.

local dev → pre-commit → CI build/test/scan → image push → GitOps update → ArgoCD sync → Kubernetes → alerts

File Map by Step

Step | What | File(s)
1 | Local kind cluster | local-dev/kind/setup.sh
2 | Pre-commit hooks | .pre-commit-config.yaml
3 | CI pipeline (pick stack) | ci/github-actions/{stack}/build-test.yml
4 | Docker build (reusable workflow) | ci/github-actions/_shared/reusable-docker-build.yml
5 | Security scans (parallel) | ci-security/container-scanning/trivy-scan.yml, ci-security/secret-detection/gitleaks.yml, ci-security/sast/semgrep.yml
6 | Bootstrap Terraform state (once) | terraform/_bootstrap/{cloud}/main.tf
7 | Provision cluster + DB (with backup) | terraform/{aws-eks|azure-aks|gcp-gke}/backup.tf
8 | Kubernetes manifests | cd/kubernetes/_base/deployment.yaml, hpa.yaml, pdb.yaml
9 | DB migrations | cd/kubernetes/_patterns/db-migration-job.yaml
10 | Secrets (ESO) | secrets/external-secrets/example-external-secret.yaml
11 | GitOps deploy (ArgoCD) | cd/gitops/argocd/application.yaml
12 | Observability + SLOs | observability/prometheus/slos/availability-slo.yaml
13 | Backup (prod only) | backup/velero/schedule.yaml

Kubernetes Security Baseline (Kyverno-Enforced)

The following fields are required on every pod. In production namespaces, Kyverno blocks deployments that violate any policy in Enforce mode; policies in Audit or Warn mode report the violation without blocking:

Field | Required value | Policy | Mode
spec.securityContext.runAsNonRoot | true | require-non-root.yaml | Enforce
containers[].securityContext.allowPrivilegeEscalation | false | require-non-root.yaml | Enforce
containers[].resources.requests.cpu | any value | require-resource-limits.yaml | Enforce
containers[].resources.requests.memory | any value | require-resource-limits.yaml | Enforce
containers[].resources.limits.cpu | any value | require-resource-limits.yaml | Enforce
containers[].resources.limits.memory | any value | require-resource-limits.yaml | Enforce
metadata.labels.app | non-empty string | require-labels.yaml | Audit
containers[].securityContext.readOnlyRootFilesystem | true | require-readonly-filesystem.yaml | Warn
finops.org/costcenter | string matching budgets.yaml | require-cost-labels.yaml | Enforce
Read-only filesystem: If your app writes to the filesystem (logs, tmp files), mount an emptyDir volume at that path. The pattern is in cd/kubernetes/_base/deployment.yaml.
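
A quick local pre-check can catch baseline violations before Kyverno does. The sketch below is a hypothetical helper, not part of the repo: it takes an already-parsed pod manifest as a Python dict and checks only the pod-level and container-level fields from the table that are listed in Enforce mode (the cost label is left out because its exact location in the manifest is not stated here).

```python
def baseline_violations(pod: dict) -> list:
    """Return human-readable violations of the Enforce-mode baseline fields."""
    problems = []
    spec = pod.get("spec", {})

    # Pod-level security context
    if spec.get("securityContext", {}).get("runAsNonRoot") is not True:
        problems.append("spec.securityContext.runAsNonRoot must be true")

    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Container-level security context
        sec = container.get("securityContext", {})
        if sec.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")

        # CPU and memory must be set for both requests and limits
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if not resources.get(kind, {}).get(resource):
                    problems.append(f"{name}: resources.{kind}.{resource} is required")

    return problems
```

An empty list means the manifest satisfies these checks; Kyverno's own policies remain the source of truth.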

FinOps Checkpoints (Pre-Production)

Three checkpoints are embedded in this path. All three are mandatory before promoting to production:

# Checkpoint 1: Verify cost labels are present
python finops/scripts/validate-cost-tags.py --namespace <your-ns>

# Checkpoint 2: Review VPA rightsizing (after 24h in staging)
python finops/scripts/analyze-rightsizing.py --namespace <staging-ns>

# Checkpoint 3: Confirm budget headroom
# Check: Grafana → FinOps — Budget Tracking dashboard

SLO-Driven Development

Translate reliability requirements into measurable targets, automated alerts, and an engineering process that balances feature velocity with operational risk.

Core Concepts

SLI — Service Level Indicator

The metric you measure. For availability: good_requests / total_requests. For latency: requests_under_200ms / total.

SLO — Service Level Objective

The target. "99.9% of requests succeed over a 30-day window." This is an internal engineering commitment, not a customer-facing SLA.

Error Budget

The allowed unreliability. At 99.9% over a 30-day window, you have 43.2 minutes of downtime budget. Spend it on risk-taking features; run out and reliability work takes priority.

Burn Rate

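How fast you're consuming the budget. Burn rate of 1x = the budget lasts exactly the full window. 14.4x = 2% of a 30-day budget consumed per hour, so the whole budget is gone in about 2 days (50 hours). Multi-window alerts catch both fast burns and slow drains.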
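
The arithmetic behind these definitions is short enough to check by hand. The following sketch assumes a 30-day SLO window (matching the SLO example above) and a constant burn rate; the function names are illustrative only.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of total downtime an availability SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def hours_to_exhaustion(burn_rate: float, window_days: float = 30.0) -> float:
    """Hours until the full budget is spent at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_fraction_burned(burn_rate: float, hours: float,
                           window_days: float = 30.0) -> float:
    """Fraction of the budget consumed after `hours` at a constant burn rate."""
    return burn_rate * hours / (window_days * 24)
```

For example, `error_budget_minutes(0.999)` is about 43.2, `hours_to_exhaustion(14.4)` is about 50, and `budget_fraction_burned(14.4, 1)` is about 0.02, which is why a 14.4x burn is usually described as spending 2% of the monthly budget in one hour.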

Burn-Rate Alert Tiers

Four alert tiers are required. The two-window approach (fast + slow detection per tier) reduces both false positives and missed gradual degradations:

Critical — Tier 1

Burn ≥ 14.4×
Budget gone in ~2 days (2% of a 30-day budget per hour)
Alert after 2 minutes
→ Page immediately, open incident

High — Tier 2

Burn ≥ 6× (derived: 30-day budget ÷ ~5 days)
Budget gone in ~5 days
Alert after 15 minutes
→ Page, investigate within 15 min

Medium — Tier 3

Burn ≥ 3× (derived: 30-day budget ÷ ~10 days)
Budget gone in ~10 days
Alert after 1 hour
→ Ticket, investigate next standup

Low — Tier 4

Burn ≥ 1× (derived: budget lasts exactly the 30-day window)
Will exhaust at end of month
Alert after 3 hours
→ Review in next sprint

Implementation Steps

# 1. Define your SLO (copy + fill the schema)
cp observability/prometheus/slos/slo-schema.yaml \
   observability/prometheus/slos/my-service-slo.yaml

# 2. Write recording rules (replace "api-gateway" with your service)
cp observability/prometheus/recording-rules/slo-burn-rates.yaml \
   observability/prometheus/recording-rules/my-service-slo-burn-rates.yaml
kubectl apply -f observability/prometheus/recording-rules/my-service-slo-burn-rates.yaml

# 3. Deploy burn-rate alerts
cp observability/prometheus/alerts/slo-burn-rate-alerts.yaml \
   observability/prometheus/alerts/my-service-slo-burn-rate-alerts.yaml
kubectl apply -f observability/prometheus/alerts/my-service-slo-burn-rate-alerts.yaml

# 4. Deploy the SLO dashboard ConfigMap (auto-discovered by Grafana)
kubectl apply -f observability/prometheus/dashboards/slo-status-configmap.yaml

# 5. Validate naming convention
make slo-validate
Recording rule naming convention is enforced: All SLO recording rules must follow slo:<service>:<metric>:<window>. Example: slo:api-gateway:availability_burn_rate:1h. The make slo-validate target checks this.
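
The convention can be checked with a single regular expression. This is a sketch of what such a check might look like, not the implementation behind make slo-validate; the allowed character sets and the window suffixes (s, m, h, d, w) are assumptions. Note that the pattern accepts hyphens in the service segment because the example above uses one, although legacy Prometheus metric-name validation only permits letters, digits, underscores, and colons.

```python
import re
from typing import Dict, Optional

# slo:<service>:<metric>:<window>, e.g. slo:api-gateway:availability_burn_rate:1h
RULE_NAME_RE = re.compile(
    r"^slo:(?P<service>[a-z0-9-]+):(?P<metric>[a-z0-9_]+):(?P<window>\d+[smhdw])$"
)


def parse_slo_rule_name(name: str) -> Optional[Dict[str, str]]:
    """Split a recording rule name into its parts, or return None if it is invalid."""
    match = RULE_NAME_RE.match(name)
    return match.groupdict() if match else None
```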

SLO Target Selection Guide

Most teams default to 99.9%. That is often wrong. Use these questions first:

  • What is the user impact of 1 minute of downtime?
  • What is your current measured baseline? (never set target above baseline)
  • Can your team sustain the on-call burden of a tighter target?
🚫
99.99% allows only 4.3 minutes/month. A single poorly tuned rolling deployment can exhaust this. Only commit to 99.99% if you have automated rollback, canary releases, and mature incident response.

Incident Response

A 5-phase ops runbook that applies to any service using this playbook. Start here when PagerDuty fires or a user reports production degradation.

Alert fires → Phase 1: Acknowledge → Phase 2: Triage → Phase 3: Mitigate → Phase 4: Resolve → Phase 5: PIR

Severity Levels

Severity | Definition | Response time
SEV-1 | Total service outage: no users can access | Immediate (wake on-call)
SEV-2 | Partial outage: significant % of users affected | Within 15 minutes
SEV-3 | Degraded performance or non-critical feature failure | Within 1 hour
SEV-4 | Minor issue, workaround available | Next business day

Phase 1 — Acknowledge & Assemble (0–5 min)

# Acknowledge PagerDuty within 5 minutes to stop escalation

# Open dedicated Slack channel: #inc-YYYY-MM-DD-<service-name>

# Post immediately:
Service: <name>
Severity: SEV-X
Started: HH:MM UTC
IC: @name
Impact: <what users see>

# Verify ownership via service catalog
cat catalog/services/<service-name>.yaml
# → spec.oncall.pagerduty_service, spec.oncall.slack_channel
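
If the channel name and first post are generated rather than typed under pressure, they stay consistent across incidents. A small sketch, assuming the #inc-YYYY-MM-DD-<service-name> pattern and the five fields shown above; the function names are hypothetical and the start time must be timezone-aware.

```python
from datetime import datetime, timezone


def incident_channel(service: str, started: datetime) -> str:
    """Build the Slack channel name from the service and the UTC start date."""
    return f"#inc-{started.astimezone(timezone.utc):%Y-%m-%d}-{service}"


def incident_header(service: str, severity: int, started: datetime,
                    commander: str, impact: str) -> str:
    """Build the first message to post in the incident channel."""
    utc = started.astimezone(timezone.utc)
    return "\n".join([
        f"Service: {service}",
        f"Severity: SEV-{severity}",
        f"Started: {utc:%H:%M} UTC",
        f"IC: @{commander}",
        f"Impact: {impact}",
    ])
```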

Phase 2 — Triage (5–15 min)

# Layer 1: Is the service returning errors?
curl -I https://my-service.example.com/health

# Layer 2: Kubernetes health
kubectl get pods -n <namespace> -l app=<service>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Layer 3: Recent changes (most common root cause)
kubectl rollout history deployment/<name> -n <namespace>
git log --oneline --since="1 hour ago" main

# Layer 4: Logs
kubectl logs -n <namespace> -l app=<service> --previous | tail -100

Phase 3 — Mitigate Options

Scenario | Mitigation | Command
Recent bad deploy | Roll back | kubectl rollout undo deployment/<name> -n <ns>
Capacity overload | Scale up | kubectl scale deployment/<name> -n <ns> --replicas=N
OOMKilled pods | Increase memory limit | kubectl set resources deployment/<name> -n <ns> --limits=memory=512Mi
Feature-specific failure | Disable via ConfigMap | kubectl edit configmap <name> -n <ns>
Data corruption | Velero restore | velero restore create --from-backup <name> --include-namespaces <ns>

Phase 5 — Post-Incident Review

Mandatory for SEV-1 and SEV-2. Complete within 48 hours. Include: timeline, root cause, contributing factors, action items with owners and due dates. Write or update the runbook: cp docs/runbooks/template.md docs/runbooks/<service>-<symptom>.md

Platform Onboarding

New team setup — get every foundation in place before any application path. Completing this path means the application paths work on first attempt.

1

Workstation Setup

Each engineer runs bash scripts/env-checker.sh and make hooks independently on their machine.

2

Local Kind Cluster

Run bash local-dev/kind/setup.sh — idempotent, creates a 3-node cluster with ingress and local registry in one command.

3

GitHub Repo Setup

Branch protection on main, conventional commits CI workflow, OIDC federation (no static credentials in GitHub Secrets), release strategy.

4

Bootstrap Terraform State (once)

Run terraform/_bootstrap/{cloud}/ before any other Terraform. Creates the remote state backend. Without this, state is local and will be lost.

5

Kubernetes Namespace + RBAC

Request a namespace from the platform team. You receive: namespace with labels, developer read-only RBAC, CI deployer role, default-deny NetworkPolicy, DNS egress allow.

6

Secrets (External Secrets Operator)

Apply SecretStore for your cloud, create first ExternalSecret, verify sync with kubectl get externalsecret.

7

Kyverno Policies

Install Kyverno cluster-wide. Run kubectl get policyreport -n <ns> to check violations before your first deploy attempt.

8

Observability (Prometheus + Loki + Tempo)

Install in order: Prometheus → Loki → Tempo. Apply alert rules and SLO templates for your service.

9

Alert Routing

Configure Alertmanager receivers: Slack for dev/staging, PagerDuty for production. Fire a test alert to confirm routing works before going live.

10

Write First Runbook

cp docs/runbooks/template.md docs/runbooks/<service>-crash-loop.md. Link it from alert annotations with runbook_url.

Production Readiness Checklist

  • bash scripts/env-checker.sh passes on all engineers' machines
  • pre-commit install run on all engineers' clones
  • Branch protection enabled with required CI checks
  • OIDC federation configured — no long-lived credentials in GitHub Secrets
  • Remote Terraform state backend configured
  • Namespace with correct RBAC and NetworkPolicy applied
  • External Secrets store configured and syncing
  • Kyverno policies installed — no failures in kubectl get policyreport
  • At least one alert rule deployed and routing verified
  • PagerDuty on-call rotation configured
  • Runbook written for top 2 failure modes with runbook_url in alert
  • Velero backup schedule applied to production namespace

Mobile Backend (BFF)

Backend for Frontend pattern for iOS/Android apps — covering the concerns that make mobile different from a standard API: versioning, PKCE auth, push notifications, and aggressive client retry protection.

Key Mobile-Specific Requirements

API Versioning (Required)

Mobile clients cannot be force-updated. Old app versions live for 12+ months. URL path versioning (/v1/, /v2/) is the default. Deprecated versions must send a Deprecation: true header plus a Sunset header carrying the retirement date (RFC 8594).

OAuth2/PKCE (Required)

Authorization Code + PKCE (RFC 7636) is mandatory because mobile apps cannot safely store a client secret. The BFF validates the JWT on every request against signing keys fetched from the identity provider's JWKS endpoint. Never hardcode public keys.
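
For reference, the client side of PKCE is only a few lines. The sketch below shows the S256 method from RFC 7636: the app generates a random code verifier, sends its SHA-256 challenge with the authorization request, and later proves possession by sending the verifier with the token request. Function names are illustrative; a real mobile client would normally use its platform's OAuth library.

```python
import base64
import hashlib
import secrets


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as RFC 7636 requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_code_verifier(n_bytes: int = 32) -> str:
    """Random verifier; 32 random bytes encode to 43 characters."""
    return _b64url(secrets.token_bytes(n_bytes))


def code_challenge_s256(verifier: str) -> str:
    """S256 challenge: BASE64URL(SHA256(ASCII(code_verifier)))."""
    return _b64url(hashlib.sha256(verifier.encode("ascii")).digest())
```

The authorization server recomputes the challenge from the verifier at token exchange, so an intercepted authorization code is useless without the verifier held by the app.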

Push Notifications

APNs keys and FCM server keys must be stored via External Secrets Operator — never in code or manifests. Upsert device tokens on every app launch (they rotate).

Rate Limiting (Required)

All public endpoints must have Ingress-layer IP-based limiting (nginx.ingress.kubernetes.io/limit-rps). Authenticated endpoints additionally need per-user Redis sliding-window counters. Return 429 with Retry-After.
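
The per-user sliding-window logic can be sketched in memory to show what the Redis counters need to do. This is an illustration only: it keeps timestamps in a Python deque per user, whereas a production BFF would hold the same state in Redis so that all replicas share it. The class and method names are hypothetical.

```python
from __future__ import annotations

import math
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user in any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def check(self, user_id: str, now: float) -> tuple[bool, int]:
        """Return (allowed, retry_after_seconds) for a request at time `now`."""
        hits = self._hits[user_id]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) < self.limit:
            hits.append(now)
            return True, 0
        # The oldest hit leaving the window frees the next slot.
        retry_after = math.ceil(hits[0] + self.window - now)
        return False, max(retry_after, 1)
```

When `check` returns `(False, n)`, the handler would respond with 429 and `Retry-After: n`.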

Mobile-Specific Prometheus Metrics

Metric | Alert when
http_requests_total{version="v1"} | Drops to 0 (all clients migrated, safe to sunset)
auth_token_validation_errors_total | Spikes (potential credential stuffing)
push_notification_delivery_failures_total | Failure rate > 5%
rate_limit_rejections_total | Sustained spike (client misbehaviour or DDoS)

Multi-Tenant SaaS

Namespace-per-tenant isolation model. Scales to ~100 tenants per cluster before control plane overhead becomes significant. Beyond that, provision additional clusters.

Isolation Layers

Layer | Mechanism | File
Network | Default-deny NetworkPolicy per namespace + explicit allow rules | cd/kubernetes/_base/network-policies/default-deny.yaml
Identity | RBAC scoped to namespace, no cross-tenant secret access | cd/kubernetes/_base/rbac/
Secrets | ESO namespace-scoped SecretStore per tenant | secrets/external-secrets/
Data | Schema-per-tenant on shared DB (scales to ~1000 tenants; add PgBouncer beyond) | cd/kubernetes/_patterns/db-migration-job.yaml
Billing | All Prometheus metrics carry a tenant label (enforced by Kyverno) | policy/kyverno/require-labels.yaml

Tenant Onboarding Flow

Add a YAML record to tenants/<slug>.yaml with 6 required fields. CI detects the new file and creates exactly 6 resources per tenant: namespace, RBAC, NetworkPolicy, ESO store config, database schema, and ArgoCD Application. The job is idempotent — re-running is safe.

Tenant Offboarding (Order Matters)

Never automate schema drops. The required sequence:

  1. Set status: offboarding in tenant YAML → merge PR
  2. Export billing metrics → take Velero namespace snapshot
  3. Delete ArgoCD Application
  4. Delete namespace
  5. Manual DBA step: Drop tenant schema (requires explicit approval)
  6. Remove tenant secrets from secrets store
🚫
Schema drop is irreversible. The DBA must verify Velero backup is complete and tenant has confirmed data export before executing DROP SCHEMA. This step must NEVER be automated.

Service Catalog Registration

Every production service must be registered in the Git-native catalog with owner, on-call routing, runbook link, SLO file, and cost center. CI validates metadata integrity. Kyverno can block deployments from unregistered services in production namespaces.

Required Service Fields

# catalog/services/<service-name>.yaml
metadata:
  name: api-gateway          # must match Deployment app label + filename
  owner: platform-team       # must exist in catalog/teams/
  tier: tier-1               # tier-1 | tier-2 | tier-3
spec:
  oncall:
    pagerduty_service: "..."
    slack_channel: "#alerts-api"
  slo:
    definition_file: "observability/prometheus/slos/api-gateway-slo.yaml"
  cost_center: engineering   # must exist in finops/config/budgets.yaml
  runbook: "docs/runbooks/api-gateway-crashloop.md"
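
The constraints noted in the comments above translate directly into checks. The sketch below is a simplified stand-in for catalog/scripts/validate-catalog.py, not its actual code: it works on an already-parsed record (a plain dict), and the `teams` and `cost_centers` arguments are assumed to have been loaded from catalog/teams/ and finops/config/budgets.yaml by the caller.

```python
from __future__ import annotations

REQUIRED_PATHS = (
    "metadata.name", "metadata.owner", "metadata.tier",
    "spec.oncall.pagerduty_service", "spec.oncall.slack_channel",
    "spec.slo.definition_file", "spec.cost_center", "spec.runbook",
)
VALID_TIERS = {"tier-1", "tier-2", "tier-3"}


def _lookup(doc: dict, dotted: str):
    """Follow a dotted path through nested dicts; None if any key is missing."""
    node = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def validate_service(doc: dict, filename: str,
                     teams: set[str], cost_centers: set[str]) -> list[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = [f"missing field: {p}" for p in REQUIRED_PATHS if not _lookup(doc, p)]
    name = _lookup(doc, "metadata.name")
    if name and filename != f"{name}.yaml":
        errors.append(f"filename {filename} does not match metadata.name {name}")
    owner = _lookup(doc, "metadata.owner")
    if owner and owner not in teams:
        errors.append(f"unknown owner team: {owner}")
    tier = _lookup(doc, "metadata.tier")
    if tier and tier not in VALID_TIERS:
        errors.append(f"invalid tier: {tier}")
    cost_center = _lookup(doc, "spec.cost_center")
    if cost_center and cost_center not in cost_centers:
        errors.append(f"unknown cost center: {cost_center}")
    return errors
```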

Validation & Automation

# Validate locally (strict mode, skip URL checks)
python catalog/scripts/validate-catalog.py --strict --skip-url-check

# Regenerate .github/CODEOWNERS from team ownership metadata
python catalog/scripts/generate-codeowners.py --output .github/CODEOWNERS
make catalog-codeowners

# Export to Backstage format (when service count approaches 50)
python catalog/scripts/migrate-to-backstage.py --output-dir backstage/catalog
💡
Incident routing: The incident response path uses catalog/services/<name>.yaml as the source of truth for responder routing. Keep spec.oncall.pagerduty_service and spec.oncall.slack_channel current.

CI Core Concepts

Every CI pipeline in this repo follows the same conceptual model regardless of platform — build once, test in isolation, scan in parallel, deploy immutable artifacts. Platform-specific syntax differs; the stages and contracts do not.

Universal Pipeline Stages

Source → Build → Test → Quality Gate → Security Scan → Package → Deploy

Reusable vs Standalone Workflows

Use case | Pattern | File
Docker build shared across services | Reusable workflow | ci/github-actions/_shared/reusable-docker-build.yml
Slack notify shared across pipelines | Reusable workflow | ci/github-actions/_shared/reusable-notify-slack.yml
Trivy scan shared across pipelines | Reusable workflow | ci/github-actions/_shared/reusable-security-scan.yml
Supply chain (sign + SBOM + SLSA) | Reusable workflow | ci/github-actions/_shared/reusable-supply-chain.yml
Single-service owns its own build | Standalone workflow | Service-specific .github/workflows/build.yml
Monorepo: run only affected services | Strategy | ci/github-actions/_strategies/monorepo-affected.yml
Multi-version matrix (Node 20+22) | Strategy | ci/github-actions/_strategies/matrix-build.yml
Automated SemVer releases | Strategy | ci/github-actions/_strategies/release-please.yml

OIDC vs Static Credentials

🚫
Never use static long-lived credentials in CI. AWS_ACCESS_KEY_ID, AZURE_CLIENT_SECRET, and GCP service account JSON keys cannot be scoped to a specific repository or branch, and persist indefinitely if leaked. OIDC tokens are short-lived (≤15 minutes) and automatically scoped.
Aspect | Long-lived secrets | OIDC tokens
Lifetime | Until manually rotated | Minutes (auto-expires)
Rotation | Manual, easy to forget | Automatic, every workflow run
Blast radius | Leaked secret = persistent access | Leaked token expires in minutes
Audit trail | Hard to attribute | Each token tied to repo/branch/workflow
Scope control | Global until revoked | Restricted by trust policy to specific repo + branch
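
The "restricted by trust policy" row comes down to matching the token's subject claim. For GitHub Actions, a branch-triggered workflow's OIDC token carries a `sub` claim of the form repo:<owner>/<repo>:ref:refs/heads/<branch>, and the cloud provider's trust policy accepts or rejects the token based on it. The sketch below imitates that decision with glob patterns; the organisation and repository names are placeholders, and real enforcement happens in the cloud IAM layer, not in application code.

```python
from fnmatch import fnmatchcase
from typing import List


def subject_allowed(sub: str, allowed_patterns: List[str]) -> bool:
    """True if the OIDC subject claim matches any trusted pattern (case-sensitive)."""
    return any(fnmatchcase(sub, pattern) for pattern in allowed_patterns)


# Hypothetical trust policy: only main and release branches of one repository.
TRUSTED = [
    "repo:my-org/my-service:ref:refs/heads/main",
    "repo:my-org/my-service:ref:refs/heads/release/*",
]
```

A token minted for a feature branch or a fork carries a different subject and fails the match, which is the scoping that a static access key cannot provide.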

CI Templates

Copy the workflow file for your stack into .github/workflows/ and add security scanning as a parallel job. All pipelines emit the same stage contract regardless of language.

Stack (GitHub Actions) | Build + Test | Docker publish or companion workflow
.NET / ASP.NET Core | ci/github-actions/dotnet/build-test.yml | ci/github-actions/dotnet/docker-publish.yml
Python (pytest + ruff) | ci/github-actions/python/build-test.yml | via reusable-docker-build
Go | ci/github-actions/go/build-test.yml | ci/github-actions/go/docker-publish.yml
Java (Maven/Gradle) | ci/github-actions/java/build-test.yml | via reusable-docker-build
Node.js / React | ci/github-actions/react/build-test.yml | via reusable-docker-build
Angular | ci/github-actions/angular/build-test.yml | ci/github-actions/angular/lighthouse-audit.yml
Ruby on Rails | ci/github-actions/ruby/build-test.yml | via reusable-docker-build
Terraform | ci/github-actions/terraform/plan-apply.yml | ci/github-actions/terraform/cost-estimation.yml

# Minimal wiring example: build → security scan
jobs:
  build-test:
    uses: ./.github/workflows/reusable-docker-build.yml
    with:
      image-name: my-service
      dockerfile: docker/python/Dockerfile.fastapi
    permissions:
      id-token: write   # Required for OIDC
      contents: read

  security-scan:
    needs: build-test
    uses: ./.github/workflows/reusable-security-scan.yml
    with:
      image: my-service:${{ github.sha }}

Stack | Pipeline file (Azure Pipelines)
.NET | ci/azure-pipelines/dotnet/azure-pipelines.yml
Angular | ci/azure-pipelines/angular/azure-pipelines.yml
Python | ci/azure-pipelines/python/azure-pipelines.yml
Terraform | ci/azure-pipelines/terraform/azure-pipelines.yml

Azure Pipelines templates use Variable Groups + Key Vault link or Managed Identity (OIDC equivalent) for cloud credentials — never hardcoded pipeline variables.

Stack | Pipeline file (GitLab CI)
.NET | ci/gitlab-ci/dotnet/.gitlab-ci.yml
Python | ci/gitlab-ci/python/.gitlab-ci.yml
Terraform | ci/gitlab-ci/terraform/.gitlab-ci.yml

GitLab CI uses include: to pull shared includes from ci/gitlab-ci/_includes/ — Docker build, SAST scan, and Slack notify are all shared components.

Stack | Jenkinsfile
.NET | ci/jenkins/dotnet/Jenkinsfile
Python | ci/jenkins/python/Jenkinsfile

Jenkins is the last resort. Prefer cloud-native CI when possible. Jenkins requires self-managed infrastructure, plugin version management, and significantly more operational burden than GitHub Actions or GitLab CI.

Deployment Targets

Choose your deployment target based on whether you need Kubernetes features. If unsure, start with the decision tree:

Need Kubernetes features? (service mesh, custom autoscaling, StatefulSets, DaemonSets)
├── Yes → Which cloud?
│   ├── Azure → terraform/azure-aks/ + cd/targets/azure-aks/
│   ├── AWS → terraform/aws-eks/ + cd/targets/aws-eks/
│   └── GCP → terraform/gcp-gke/ + cd/targets/gcp-gke/
└── No → Is this a stateless web app or API?
    ├── Yes, on Azure → terraform/azure-app-service/ + cd/targets/azure-app-service/
    ├── Yes, on AWS, no K8s → terraform/aws-ecs/ + cd/targets/aws-ecs/
    └── Event-driven → terraform/aws-lambda/ + cd/targets/aws-lambda/

Target | Terraform | Deploy workflow | Use when
AWS EKS | terraform/aws-eks/ | cd/targets/aws-eks/github-actions-deploy.yml | Full K8s on AWS
Azure AKS | terraform/azure-aks/ | cd/targets/azure-aks/github-actions-deploy.yml | Full K8s on Azure
GCP GKE | terraform/gcp-gke/ | cd/targets/gcp-gke/github-actions-deploy.yml | Full K8s on GCP
Azure App Service | terraform/azure-app-service/ | cd/targets/azure-app-service/github-actions-deploy.yml | Stateless apps, no K8s
AWS ECS | terraform/aws-ecs/ | cd/targets/aws-ecs/github-actions-deploy.yml | Containers without K8s on AWS
AWS Lambda | terraform/aws-lambda/ | cd/targets/aws-lambda/serverless-deploy.yml | Event-driven, short-lived
GCP Cloud Run | via gcp-gke module | cd/targets/gcp-gke/cloudbuild.yaml | Serverless containers on GCP
OpenShift | (none listed) | cd/targets/openshift/github-actions-deploy.yml | Enterprise OpenShift environments
AWS CodePipeline | (none listed) | cd/targets/aws-codepipeline/codepipeline.yml | AWS-native pipeline requirement

GitOps & ArgoCD

GitOps treats Git as the single source of truth for cluster state. A controller running inside the cluster continuously reconciles actual state with desired state from Git — no push-based deployments from CI, no manual kubectl apply in production.

GitOps Principles

Declarative

All desired state defined in YAML committed to Git. What's in Git is what's in the cluster.

Versioned

Git history is the complete audit log. Every cluster change is traceable to a PR, author, and timestamp.

Pull-Based

The cluster pulls from Git — CI never pushes directly to the cluster. This eliminates a large attack surface.

Auto-Reconciled

If someone manually edits a resource in the cluster (drift), the controller automatically reverts it to the Git state.
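
The reconciliation idea can be reduced to a diff between two maps of resources. The sketch below is a conceptual model, not how ArgoCD or Flux is implemented: `desired` stands for manifests rendered from Git and `actual` for live cluster objects, both keyed by a resource identifier. In ArgoCD specifically, reverting drift and pruning removed resources happen automatically only when the automated sync policy has selfHeal and prune enabled.

```python
from typing import Dict, List


def plan_reconcile(desired: Dict[str, dict], actual: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compute the actions that bring the cluster back to the state in Git."""
    return {
        # In Git but not in the cluster
        "create": sorted(k for k in desired if k not in actual),
        # Present in both but drifted from the Git definition
        "update": sorted(k for k in desired if k in actual and actual[k] != desired[k]),
        # In the cluster but no longer in Git (pruning)
        "delete": sorted(k for k in actual if k not in desired),
    }
```

A controller runs this comparison continuously, so a manual edit shows up as an entry under "update" on the next loop and is reverted.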

ArgoCD Application Patterns

Pattern | File | Use when
Single app | cd/gitops/argocd/application.yaml | One service, one repo
Multi-environment generator | cd/gitops/argocd/applicationset.yaml | Same app across dev/staging/prod
App-of-apps bootstrap | cd/gitops/argocd/app-of-apps.yaml | Bootstrap many apps from one definition
Fleet (3–20 clusters) | cd/gitops/argocd/fleet/fleet-applicationset.yaml | Multi-cluster platform component delivery

Promotion Flow

# 1. CI builds image, tags with Git SHA
# 2. CI updates dev overlay: image tag → new SHA
#    cd/kubernetes/_overlays/dev/kustomization.yaml
# 3. ArgoCD syncs dev automatically
# 4. After testing: PR to update staging overlay
# 5. After staging approval: PR to update prod overlay
# 6. ArgoCD syncs prod (with manual sync gate)

# Verify sync status
kubectl get applications -n argocd
kubectl describe application <app-name> -n argocd | grep -A 20 "History:"
💡
ArgoCD vs Flux: Both are included. ArgoCD is the default — richer UI, intuitive Application CRD, App-of-Apps pattern, good for most teams. Flux is preferred for Helm-heavy stacks and large multi-tenant organizations with complex tenancy models. Pick one per cluster; don't mix.

Kubernetes Concepts & Patterns

Core workload patterns and the decision logic for choosing between them. All base manifests are in cd/kubernetes/_base/; reusable patterns for specific scenarios are in cd/kubernetes/_patterns/.

Workload Type Selection

App type | Workload | Notes
Stateless web API / frontend | Deployment | Default choice. Use HPA for autoscaling. PDB for HA.
Stateful app (database, queue, cache) | StatefulSet | Stable network identity, ordered rollout, persistent storage.
Node-level daemon (log collector, metrics) | DaemonSet | Runs exactly one pod per node automatically.
One-off task | Job | Set restartPolicy: Never and activeDeadlineSeconds.
Recurring scheduled task | CronJob | Wraps a Job. Set concurrencyPolicy: Forbid for batch jobs.

Autoscaling Decision

Scale on | Tool | File
CPU or memory utilization | HPA | cd/kubernetes/_base/hpa.yaml (target CPU at 70%, memory at 80%)
Queue depth, custom metrics, scale-to-zero | KEDA | Add separately (not in this repo by default)
Right-size resource requests automatically | VPA (Off mode) | cd/kubernetes/_base/vpa.yaml (recommendations only, no auto-apply)
HPA thresholds: Set CPU target at 70% (not 80%) to provide headroom before scaling lag causes user-visible latency. Never set minReplicas: 1 in production — a single replica means no high availability.
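
To see why the lower CPU target buys headroom, it helps to run the numbers through the scaling rule the HPA documents: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below is a simplified model of that rule under those assumptions; it ignores stabilization windows, min/max bounds, and pod readiness.

```python
import math


def desired_replicas(current_replicas: int, utilization: float,
                     target: float, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale by the utilization/target ratio."""
    ratio = utilization / target
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling action
    return math.ceil(current_replicas * ratio)


# Four replicas averaging 85% CPU, compared under both targets.
for target in (70, 80):
    print(f"target {target}% -> {desired_replicas(4, 85, target)} replicas")
```

With a 70% target the same load already triggers a scale-up, while an 80% target is still inside the tolerance band and stays put.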

Helm vs Kustomize

Situation | Use
Packaging an app for distribution / multiple teams to install | Helm (cd/helm/)
Environment-specific configuration of manifests you own | Kustomize (cd/kubernetes/_overlays/)
Internal apps with simple env-to-env differences | Kustomize: less indirection, easier to debug
Complex conditional logic across many parameters | Helm, but if it feels like programming, reconsider

Deployment Strategies

Strategy | Pattern file | When to use
Rolling update (default) | cd/kubernetes/_base/deployment.yaml | Standard: Kubernetes handles pod replacement automatically
Blue/Green | cd/kubernetes/_patterns/blue-green.yaml | Zero-downtime cutover, instant rollback by switching Service selector
Canary | cd/kubernetes/_patterns/canary.yaml | Progressive traffic shift using a stable + canary replica ratio

Base Manifests & Overlays

Start from the base manifests and layer environment differences with Kustomize overlays. Never duplicate full manifests per environment — only patch what differs.

Base Manifest File Map

File | What it configures
_base/deployment.yaml | Deployment with health checks, security context, resource limits, emptyDir for writable paths
_base/service.yaml | ClusterIP service with correct selector labels
_base/ingress.yaml | nginx Ingress with TLS (cert-manager), security headers
_base/hpa.yaml | HPA targeting 70% CPU, min 2 replicas
_base/pdb.yaml | PodDisruptionBudget ensuring at least 1 pod available during disruptions
_base/vpa.yaml | VPA in Off mode: generates recommendations without applying them
_base/networkpolicy.yaml | Allow only required ingress/egress; deny all by default
_base/rbac.yaml | ServiceAccount + minimal RBAC roles
_base/configmap.yaml | Non-secret configuration (use ExternalSecret for secrets)
_base/network-policies/default-deny.yaml | Deny-all baseline: apply first, then add allow rules
_patterns/canary.yaml | Canary rollout with stable/canary replica split
_patterns/blue-green.yaml | Blue/green Deployments with Service selector swap
_patterns/init-containers.yaml | Wait-for-dependency init container pattern
_patterns/dev-scale-to-zero.yaml | Scale dev/staging to zero overnight to save cost

Kustomize Overlay Structure

cd/kubernetes/
  _base/                    # Shared defaults (all environments)
  _overlays/
    dev/kustomization.yaml  # replicas: 1, relaxed resources, mutable tags
    staging/kustomization.yaml
    prod/kustomization.yaml # replicas: 3, tight PDB, pinned image SHA

# Preview what Kustomize will generate (validate before applying)
kubectl kustomize cd/kubernetes/_overlays/dev

# Run policy checks locally
conftest test cd/kubernetes/_base/ --policy policy/conftest/kubernetes

Kyverno Policy Engine

Kyverno enforces governance at Kubernetes admission time — before a resource is admitted to the cluster. It complements static analysis (Checkov, Conftest) which runs earlier in CI. These operate at different lifecycle stages and are not alternatives.

Policy Modes

Mode | Effect | Use when
Enforce | Blocks resource from being created/updated | Policy is well-understood; existing resources are compliant
Audit | Admits resource but records violation in PolicyReport | Migrating existing workloads; discovering current violations
Warn | Admits resource with a warning in the API response | Developer feedback without hard blocking
Migration path: Audit → fix violations → Enforce. Never go straight to Enforce on a production cluster with existing workloads.

Installed Policies

Policy file | Mode | What it enforces
require-non-root.yaml | Enforce | runAsNonRoot: true, allowPrivilegeEscalation: false
require-resource-limits.yaml | Enforce | CPU and memory requests + limits on all containers
disallow-latest-tag.yaml | Enforce (prod) | Image tags must be pinned; :latest is blocked
require-labels.yaml | Audit | app and version labels on Deployments
require-liveness-readiness.yaml | Audit | Liveness + readiness probes on multi-replica Deployments
require-readonly-filesystem.yaml | Warn | readOnlyRootFilesystem: true
require-catalog-registration.yaml | Audit / Enforce (prod) | Service must be registered in catalog/services/
enforce-finops-labels.yaml | Enforce | finops.org/costcenter and finops.org/environment labels
fleet-policy-propagation.yaml | Audit | Cross-cluster policy compliance tracking for fleet deployments

# Check violations in your namespace
kubectl get policyreport -n <your-namespace>
kubectl describe policyreport -n <your-namespace> | grep -A 5 "fail"

# Check cluster-wide (requires cluster-admin)
kubectl get clusterpolicyreport
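
For a quick local sanity check before a manifest reaches admission, the intent of require-resource-limits can be approximated in plain Python. This is only a sketch of the rule as described in the table above (CPU and memory requests + limits on all containers); it is not Kyverno's evaluation logic, and the function name and sample pod spec are hypothetical.

```python
def missing_resource_fields(pod_spec: dict) -> list[str]:
    """List every container field the resource-limits rule would flag."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(section, {}):
                    problems.append(f"{name}: missing {section}.{resource}")
    return problems


# Hypothetical pod spec with one gap: no CPU limit.
EXAMPLE_SPEC = {
    "containers": [
        {
            "name": "api",
            "resources": {
                "requests": {"cpu": "100m", "memory": "128Mi"},
                "limits": {"memory": "256Mi"},
            },
        }
    ]
}

print(missing_resource_fields(EXAMPLE_SPEC))
```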

Terraform

All cloud infrastructure is provisioned declaratively via Terraform modules. Each cloud target is a self-contained module with its own state key. Bootstrap the remote state backend once before using any other module.

Bootstrap Remote State (Required First)

🚫
Never use local Terraform state in production. Local state is lost if you lose your machine, and it cannot be locked or shared safely across a team. Never commit terraform.tfstate to Git: state files can contain secrets in plaintext.

# Run ONCE per cloud account, by a human with admin access (not CI)

# AWS — creates S3 bucket + DynamoDB lock table
cd terraform/_bootstrap/aws
terraform init && terraform plan -out=tfplan && terraform apply tfplan

# Azure — creates Storage Account + Blob container
cd terraform/_bootstrap/azure
terraform init && terraform apply

# GCP — creates GCS bucket
cd terraform/_bootstrap/gcp
terraform init && terraform apply

# Then uncomment the backend block in your workload module's main.tf
# and run: terraform init -migrate-state

Cloud Modules

Module | Path | Provisions
AWS EKS | terraform/aws-eks/ | EKS cluster, managed node groups, VPC, ALB, IRSA roles, optional GPU node group
Azure AKS | terraform/azure-aks/ | AKS cluster, node pools, ACR, Key Vault, workload identity, optional GPU pool
GCP GKE | terraform/gcp-gke/ | GKE cluster, Artifact Registry, Workload Identity Federation, PITR Cloud SQL
AWS ECS | terraform/aws-ecs/ | ECS cluster, task definition, IAM execution role, CloudWatch log group
AWS Lambda | terraform/aws-lambda/ | Lambda function, least-privilege IAM role, API Gateway, CloudWatch alarms
Azure App Service | terraform/azure-app-service/ | App Service Plan, Web App (container), Application Insights

Database Backup Modules (Include with Cluster)

Cloud | File | Backup approach
AWS RDS | terraform/aws-eks/backup.tf | PITR enabled, cross-region replica, CloudWatch alarm on backup age
Azure PostgreSQL | terraform/azure-aks/backup.tf | Flexible Server with geo-redundant backup, PITR to 35 days
GCP Cloud SQL | terraform/gcp-gke/backup.tf | PITR enabled, cross-region read replica, point-in-time clone

State Strategy

This repo uses separate state keys per module (e.g., prod/eks/terraform.tfstate) rather than Terraform workspaces. Workspaces share provider configuration — this makes it harder to use different cloud accounts per environment, which is the recommended security posture.

Environment Strategy

Three environments with distinct purposes, triggers, and data policies. Never bake environment config into images — externalize via Kubernetes ConfigMaps, Helm values files, or Kustomize overlays.

Environment | Purpose | Deploy trigger | Data
dev | Rapid feedback, feature work | Every push / PR | Synthetic / mocked; never real PII
staging | Pre-prod verification, QA sign-off | Merge to main | Anonymised production snapshot
production | Live system | Manual / scheduled after staging approval | Real production data

Production Access Controls

  • No direct kubectl exec to production pods
  • All changes via Git PR (GitOps) — ArgoCD applies, never CI push
  • Break-glass procedures documented and audited in secops/runbooks/
  • RBAC: least-privilege service accounts — use cd/kubernetes/_base/rbac/ci-deployer.yaml
  • Separate Kubernetes namespaces per environment

OIDC / Keyless Cloud Authentication

OpenID Connect federation lets GitHub Actions authenticate directly to cloud providers with short-lived, auto-rotating tokens — no secrets stored in GitHub, no rotation to manage, full audit trail per workflow run.

┌──────────────┐   JWT    ┌──────────────┐  Token   ┌──────────────┐
│   GitHub     │ ───────► │    Cloud     │ ───────► │    Cloud     │
│   Actions    │          │   Identity   │          │  Resources   │
│   Runner     │          │   Provider   │          │   (deploy)   │
└──────────────┘          └──────────────┘          └──────────────┘

1. GH mints JWT with claims (repo, branch, env)
2. Cloud validates JWT and issues short-lived scoped credentials

Setup by Cloud

AWS mechanism: IAM OIDC Provider + IAM Role with trust policy

# Step 1: Create OIDC provider (once per AWS account)
aws iam create-open-id-connect-provider \
  --url "https://token.actions.githubusercontent.com" \
  --client-id-list "sts.amazonaws.com"

# Step 2: Create IAM role with trust policy (restrict to your repo)
# Condition (StringLike, because the value contains a wildcard):
#   "token.actions.githubusercontent.com:sub": "repo:YOUR-ORG/YOUR-REPO:*"

# Step 3: Add GitHub secret
# AWS_DEPLOY_ROLE_ARN = arn:aws:iam::123456789012:role/github-actions-deploy

# Step 4: Use in workflow
permissions:
  id-token: write    # Required for OIDC
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
      aws-region: us-east-1

Azure mechanism: Azure AD App Registration + Federated Identity Credential

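# Step 1: Create app registration + service principal
# Capture the new application's appId so the next command can reference it
APP_ID=$(az ad app create --display-name "github-actions-deploy" --query appId -o tsv)
az ad sp create --id "$APP_ID"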

# Step 2: Add federated credential (restrict to main branch)
# subject: "repo:YOUR-ORG/YOUR-REPO:ref:refs/heads/main"

# Step 3: Add GitHub secrets
# AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID

# Step 4: Use in workflow
permissions:
  id-token: write
steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

GCP mechanism: Workload Identity Pool + OIDC Provider + Service Account binding

# Step 1: Create Workload Identity Pool
gcloud iam workload-identity-pools create "github-actions-pool" \
  --location="global"

# Step 2: Create OIDC provider with attribute condition
# --attribute-condition="assertion.repository_owner == 'YOUR-ORG'"
# Critical: without this, ANY GitHub repo could request tokens

# Step 3: Grant Service Account access to pool
# --member="principalSet://iam.googleapis.com/${POOL_ID}/attribute.repository/YOUR-ORG/YOUR-REPO"

# Step 4: Use in workflow
permissions:
  id-token: write
steps:
  - uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
      service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}

Common Troubleshooting

Cloud | Error | Fix
Azure | AADSTS70021: No matching federated identity record | Subject claim mismatch: check that the branch/environment in the federated credential matches the workflow trigger
AWS | Not authorized to perform: sts:AssumeRoleWithWebIdentity | Trust policy sub condition doesn't match workflow repo/branch/environment
GCP | The caller does not have permission to use this identity pool | attribute-condition rejects the org: check assertion.repository_owner matches exactly
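
Most of these failures come down to the token's sub claim not matching what the cloud side expects. The sketch below mimics that comparison with shell-style wildcards, which is a reasonable stand-in for a wildcard trust condition but not any provider's actual matcher. The subject strings follow GitHub's documented formats for branch and environment jobs; the helper name subject_allowed is an assumption.

```python
from fnmatch import fnmatchcase


def subject_allowed(sub_claim: str, trust_patterns: list[str]) -> bool:
    """True if the token's sub claim matches any configured pattern."""
    return any(fnmatchcase(sub_claim, pattern) for pattern in trust_patterns)


# Subject claims GitHub issues for a branch push and for a job bound to an environment.
BRANCH_SUB = "repo:YOUR-ORG/YOUR-REPO:ref:refs/heads/main"
ENV_SUB = "repo:YOUR-ORG/YOUR-REPO:environment:production"

# A credential pinned to the main branch rejects the environment-scoped job.
main_only = ["repo:YOUR-ORG/YOUR-REPO:ref:refs/heads/main"]
print(subject_allowed(BRANCH_SUB, main_only), subject_allowed(ENV_SUB, main_only))

# A repo-wide wildcard accepts both, at the cost of a broader trust boundary.
repo_wide = ["repo:YOUR-ORG/YOUR-REPO:*"]
print(subject_allowed(BRANCH_SUB, repo_wide), subject_allowed(ENV_SUB, repo_wide))
```

A job that declares an environment gets the environment form of the subject rather than the branch form, so a branch-pinned credential stops matching as soon as an environment is added to the workflow.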

Shift-Left Security Scanning

Security checks run at three stages: locally via pre-commit hooks, in CI on every push, and at cluster admission via Kyverno. Each layer catches different classes of issues — they are complements, not alternatives.

Stage 1: Local (pre-commit)
  Gitleaks        → catches secrets in staged files before commit
  terraform fmt   → catches IaC formatting before push

Stage 2: CI (parallel jobs after build)
  Gitleaks        → history scan across all commits
  TruffleHog      → verified live secrets (PR only; expensive)
  Trivy           → container CVE scan (HIGH/CRITICAL blocks merge)
  Grype           → alternative container scanner
  Checkov         → Terraform/K8s misconfiguration
  tfsec           → cloud-specific Terraform checks
  Semgrep         → SAST (fast, no account needed)
  SonarQube       → SAST + quality gates (deeper analysis)
  npm audit / pip-audit / dotnet list → dependency CVEs

Stage 3: Cluster admission (Kyverno)
  disallow-latest-tag          → blocks untagged images
  require-non-root             → blocks root containers
  require-resource-limits      → blocks unconstrained pods
  require-catalog-registration → blocks unregistered services

Scanner Selection Guide

Purpose | Tool | When | File
Secrets in Git history | Gitleaks | Every push + pre-commit | ci-security/secret-detection/gitleaks.yml
Verified live secrets in PRs | TruffleHog | PR only (slow) | ci-security/secret-detection/trufflehog.yml
Container image CVEs | Trivy | After every image build | ci-security/container-scanning/trivy-scan.yml
Container image CVEs (alternative) | Grype | After every image build | ci-security/container-scanning/grype-scan.yml
Terraform misconfigurations | Checkov | Push + PR on terraform/** | ci-security/iac-scanning/checkov.yml
Terraform cloud-specific checks | tfsec | Push + PR on terraform/** | ci-security/iac-scanning/tfsec.yml
SAST (fast, no account) | Semgrep | Push + PR | ci-security/sast/semgrep.yml
SAST (deep, quality gates) | SonarQube | Push + PR | ci-security/sast/sonarqube.yml
Dockerfile lint | Hadolint | Pre-commit | .pre-commit-config.yaml
npm vulnerabilities | npm audit | Weekly schedule | ci-security/dependency-audit/npm-audit.yml
Python vulnerabilities | pip-audit | Weekly schedule | ci-security/dependency-audit/pip-audit.yml
.NET NuGet vulnerabilities | dotnet list package --vulnerable | Weekly schedule | ci-security/dependency-audit/nuget-audit.yml

When a Scanner Fires

False Positive — Gitleaks

Add the finding to the .gitleaks.toml allowlist with a justification comment. Do not make skipping the hook your default workflow; CI still runs the shared detection policy, so a skipped finding resurfaces there.

Real Secret in Git History

Rotate immediately, then use git-filter-repo to remove from history. Assume the secret is compromised even if the repo is private.

Container CVE (Trivy)

Update base image or the vulnerable package. Check if a fixed version is available — use trivy image --ignore-unfixed to focus on actionable findings.

IaC Misconfiguration (Checkov)

Fix the Terraform resource. Do not suppress with #checkov:skip without understanding the risk and documenting the exception with a justification.

Secrets Management

Secrets must never be stored in Git — even in private repos. Every secret needs a source of truth in a cloud-managed secret store, a sync mechanism into Kubernetes, a rotation policy, and an offboarding procedure.

Where Secrets Live

Secret type | Store here | Access via
Long-lived infrastructure credentials | AWS Secrets Manager / Azure Key Vault / GCP Secret Manager | External Secrets Operator
Short-lived cloud credentials in CI | OIDC (no secret storage needed) | GitHub Actions OIDC
TLS certificates | cert-manager + Let's Encrypt | Kubernetes Secret (auto-managed)
App configuration (non-secret) | Kubernetes ConfigMap | envFrom.configMapRef
App secrets | External Secrets Operator → Kubernetes Secret | envFrom.secretRef

External Secrets Operator

ESO watches ExternalSecret resources in the cluster and syncs values from your cloud secret store into Kubernetes Secrets automatically. The source of truth stays in the cloud store — Kubernetes Secrets are ephemeral replicas.

# 1. Apply SecretStore for your cloud (edit annotations first)
kubectl apply -f secrets/external-secrets/aws-secret-store.yaml
# or: azure-secret-store.yaml | gcp-secret-store.yaml

# 2. Create your ExternalSecret
cp secrets/external-secrets/example-external-secret.yaml \
   secrets/external-secrets/my-app-secrets.yaml
kubectl apply -f secrets/external-secrets/my-app-secrets.yaml

# 3. Verify sync (STATUS should be "SecretSynced")
kubectl get externalsecret my-app-secrets -n <namespace>
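
Rotation gap: Rotating the secret value in the cloud store does NOT automatically update running pods. ESO refreshes the Kubernetes Secret on its refresh interval, but a pod that reads the value through env vars keeps the old value until it restarts. You must either restart the deployment (kubectl rollout restart deployment/app) or mount secrets as volumes (volume mounts pick up new versions without a restart, except subPath mounts; env vars do not).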

Secret Lifecycle Guides

Need | Guide
Full lifecycle (provision → rotate → offboard) | secrets/guides/secret-lifecycle.md
Emergency rotation (secret exposed) | secrets/guides/emergency-rotation.md
Decommission a service's secrets | secrets/guides/secret-offboarding.md
Scheduled rotation (Terraform-managed) | secrets/rotation/aws-rotation.yml etc.

Anti-Patterns to Avoid

  • ❌ .env files committed to Git
  • ❌ Secrets in docker-compose.yml
  • ❌ Long-lived service account keys / personal access tokens
  • ❌ Same secret in dev, staging, and prod
  • ❌ Secrets in container image layers
  • ❌ Secrets in Kubernetes ConfigMaps (ConfigMaps are stored as plaintext, and even the base64 encoding used by Kubernetes Secrets is not encryption)

Runtime Security

Shift-left scanning catches known vulnerabilities before deployment. Runtime security detects threats and anomalous behavior in live environments — attacks that only appear after a workload is running.

Falco

Falco watches kernel syscalls and Kubernetes audit events for suspicious behavior. Custom rules are in secops/runtime/falco/rules/custom-rules.yaml. Alerts route to Slack or PagerDuty via secops/runtime/falco/rules/alerts.yaml.

Audit Logging

Kubernetes API server audit logs are shipped to Loki via Promtail (secops/runtime/audit-logging/loki-shipper.yaml). The audit policy (secops/runtime/audit-logging/audit-policy.yaml) records all resource modifications and secret access at the metadata level.

SecOps Runbooks

Incident type | Runbook
Compromised pod (unusual network or process behavior) | secops/runbooks/compromised-pod.md
Node compromise (node-level access detected) | secops/runbooks/node-compromise.md
Secret exposure (credentials in logs/repo) | secops/runbooks/secret-exposure.md
Supply chain incident (malicious image/dependency) | secops/runbooks/supply-chain-incident.md

Compliance as Code

Machine-readable control libraries map specific Kyverno policies and cluster checks to compliance framework controls. Evidence is collected automatically and scored weekly — turning audits into continuous engineering practice.

Supported Frameworks

Framework | Control library | Key controls covered
SOC 2 Type II | secops/compliance/control-library/soc2-controls.yaml | CC6.1 (access control), CC6.8 (supply chain), CC7.2 (monitoring), CC7.4 (incident detection), CC8.1 (resource governance)
CIS Kubernetes Benchmark | secops/compliance/control-library/cis-kubernetes.yaml | 5.2.x (pod security), 5.3.x (network segmentation), 5.5.1 (image verification)
ISO 27001 | secops/compliance/control-library/iso27001.yaml | 8.4 (source control), 8.8 (vulnerability management), 8.28 (SBOM/secure coding)

Evidence Pipeline

# Run evidence collection manually
bash secops/compliance/scripts/collect-evidence.sh --output-dir ./evidence

# Generate compliance report (fails if score < 85%)
python secops/compliance/scripts/generate-compliance-report.py \
  --control-library-dir secops/compliance/control-library \
  --evidence-dir ./evidence \
  --fail-below 0.85 \
  --format markdown

# CI gate equivalent
make compliance-report
💡
Control-to-policy map: secops/compliance/control-library/control-to-policy-map.yaml links each compliance control to the Kyverno policy or script that provides evidence for it. This is the traceability bridge for auditors.
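
The --fail-below gate can be pictured as a simple ratio check. The sketch below assumes the score is the fraction of controls whose evidence passed, which is the simplest possible scoring model; the real generate-compliance-report.py may weight controls differently. Function names and the sample results are hypothetical, while the control IDs are the SOC 2 ones listed above.

```python
def compliance_score(control_results: dict[str, bool]) -> float:
    """Fraction of controls with passing evidence (unweighted)."""
    if not control_results:
        raise ValueError("no control results to score")
    return sum(control_results.values()) / len(control_results)


def passes_gate(control_results: dict[str, bool], fail_below: float = 0.85) -> bool:
    """Mirror of a --fail-below threshold: pass only at or above it."""
    return compliance_score(control_results) >= fail_below


# Hypothetical evidence results keyed by SOC 2 control ID.
RESULTS = {"CC6.1": True, "CC6.8": True, "CC7.2": True, "CC7.4": False, "CC8.1": True}

print(f"score={compliance_score(RESULTS):.2f} pass={passes_gate(RESULTS)}")
```

With only five controls, one failure drops the score to 0.80 and trips the gate, which is why small control libraries tend to need either more granular controls or a lower threshold.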

Observability Stack

The three pillars of observability (metrics, logs, and traces) each answer different operational questions. Instrument all three from day one, and correlate them with shared labels and trace IDs so that an alert can be followed straight to the matching logs and traces, which shortens mean time to diagnosis.

Metrics: Prometheus + Grafana (observability/prometheus/)
Logs: Loki + Grafana (observability/loki/)
Traces: Tempo + Grafana (observability/tempo/)
OTel: Collector Sidecar (observability/opentelemetry/)

Signal → Question Matrix

Question | Signal | Tool
Is my service up and healthy? | Metrics | Prometheus + Alertmanager
What happened at 14:23 during the incident? | Logs | Loki + Grafana
Why is this specific request slow? | Traces | Tempo + Grafana
Is my error rate above the SLO? | Metrics + SLO rules | Prometheus recording rules
What did a specific user's request traverse? | Traces | Tempo (filter by trace ID)
Which service is causing cascading failures? | Traces (service map) | Tempo + Grafana service graph

Install Order

Install in this order — each layer builds on the previous:

1

Prometheus Stack

Baseline cluster visibility and alert routing first. helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -f observability/prometheus/values.yaml

2

Loki Stack

Log aggregation second — incidents need both metrics and logs. helm install loki grafana/loki-stack -f observability/loki/values.yaml

3

Tempo

Distributed tracing last — most value after metrics and logs are stable. helm install tempo grafana/tempo -f observability/tempo/values.yaml

4

OTel Collector Sidecar

Add to each app Deployment: kubectl patch deployment <name> --patch-file observability/opentelemetry/collector-sidecar.yaml

OpenTelemetry Language Env Vars

Language | Env var file
.NET | observability/opentelemetry/env-vars/dotnet.env
Java | observability/opentelemetry/env-vars/java.env
Python | observability/opentelemetry/env-vars/python.env

Trace Sampling Rates

Environment | Rate | Rationale
Local / Kind | 100% | Every request traced; debugging is the priority
Dev | 100% | Same as local
Staging | 50% | Enough to catch integration issues
Production | 10% | Statistically representative; bounded storage cost

Override with OTEL_TRACES_SAMPLER_ARG in your Helm values or Kustomize overlay. See observability/opentelemetry/env-vars/ for language-specific env files.
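
To make the rates concrete, here is a sketch of ratio-based sampling keyed on the trace ID, in the spirit of a trace-ID-ratio sampler: the decision is a deterministic function of the ID, so every service that sees the same trace makes the same choice. It is an illustration under that assumption, not the OpenTelemetry SDK's code, and should_sample is a hypothetical name.

```python
import random


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound


# Simulate 100,000 random 128-bit trace IDs at the production rate of 10%.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept_fraction = sum(should_sample(t, 0.10) for t in trace_ids) / len(trace_ids)
print(f"kept {kept_fraction:.3f} of traces")
```

Because the decision depends only on the trace ID, a sampled request keeps all of its spans across services instead of producing partial traces.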

SLOs & Error Budgets

File Layout

File | Purpose
observability/prometheus/slos/slo-schema.yaml | Schema to copy and fill for any service
observability/prometheus/slos/my-service-availability-slo.yaml | Worked availability SLO example
observability/prometheus/slos/my-service-latency-slo.yaml | Worked latency SLO example
observability/prometheus/recording-rules/slo-burn-rates.yaml | Multi-window burn-rate recording rules template
observability/prometheus/alerts/slo-burn-rate-alerts.yaml | Four-tier burn-rate alert rules template
observability/prometheus/dashboards/slo-status-configmap.yaml | Grafana SLO dashboard (auto-discovered via ConfigMap)
docs/runbooks/slo-breach-response.md | On-call runbook for SLO burn-rate alerts
docs/runbooks/slo-quarterly-review.md | Quarterly review agenda, decision criteria, error budget policy template

Error Budget Policy Template

# docs/slo-policies/<service>-error-budget-policy.yaml
service: api-gateway
slo_target: 99.9%
error_budget_monthly_minutes: 43.8

policy:
  above_50_percent_remaining:
    action: normal operations, proceed with planned changes
  below_25_percent_remaining:
    action: pause non-critical feature work, prioritize reliability
    owner: engineering manager
  below_10_percent_remaining:
    action: halt all deployments except rollbacks and critical fixes
    owner: engineering director approval required
  exhausted:
    action: incident response, all hands on reliability
    escalation: VP Engineering
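
The 43.8-minute figure in the template follows from the SLO target and an average month of 365.25 / 12 days. The sketch below reproduces that arithmetic and maps a remaining-budget fraction onto the template's tiers. The function names are hypothetical, and the 25-50% band is reported as undefined because the template does not assign it an action.

```python
# Average month length in minutes: 365.25 days / 12 months * 24 h * 60 min.
MINUTES_PER_AVG_MONTH = 365.25 * 24 * 60 / 12


def error_budget_minutes(slo_target: float,
                         period_minutes: float = MINUTES_PER_AVG_MONTH) -> float:
    """Allowed downtime for the period: (1 - SLO) * period length."""
    return (1.0 - slo_target) * period_minutes


def policy_tier(remaining_fraction: float) -> str:
    """Map the remaining budget (0.0 to 1.0) onto the template's policy keys."""
    if remaining_fraction <= 0.0:
        return "exhausted"
    if remaining_fraction < 0.10:
        return "below_10_percent_remaining"
    if remaining_fraction < 0.25:
        return "below_25_percent_remaining"
    if remaining_fraction > 0.50:
        return "above_50_percent_remaining"
    return "undefined_in_template"  # 25-50% has no action in the template


budget = error_budget_minutes(0.999)
consumed_minutes = 35.0  # hypothetical downtime so far this month
remaining = (budget - consumed_minutes) / budget
print(f"budget={budget:.1f} min, remaining={remaining:.0%}, tier={policy_tier(remaining)}")
```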

Alert Routing

Alerts route to different channels based on severity. Configure receivers in the Alertmanager config in observability/prometheus/values.yaml under alertmanager.config.

Severity | Goes to | Wake someone? | Config file
critical | PagerDuty | Yes, immediately | notifications/pagerduty-notify.yml
warning | Slack | No, review during business hours | notifications/slack-notify.yml
info | Grafana annotation / Slack | No, informational only | notifications/grafana-notify.yml

Available Notification Integrations

Channel | Config file | Use for
Slack | notifications/slack-notify.yml | Dev and staging alerts, team notifications
PagerDuty | notifications/pagerduty-notify.yml | Production on-call paging (critical severity)
Microsoft Teams | notifications/teams-notify.yml | Teams using Microsoft 365
Grafana | notifications/grafana-notify.yml | Deployment markers on dashboards
Datadog | notifications/datadog-notify.yml | DORA metrics and APM correlation

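# Test alert routing before going live
# Run the port-forward in the background (or in a second terminal) so curl can reach it
kubectl -n monitoring port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 &
# Use the v2 API; the v1 alerts endpoint has been removed from current Alertmanager releases
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","team":"my-team"}}]'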

FinOps — Cost Governance & Optimization

A production-grade FinOps system that integrates cost visibility, governance, and optimization into the CI/CD lifecycle — from Terraform PRs through to monthly chargeback reports. Cost control is an engineering discipline, not a monthly finance surprise.

CI/CD Layer (GitHub Actions / Azure Pipelines / GitLab CI)
└── Infracost → posts monthly cost impact on every Terraform PR
                blocks merge when cost growth > $500 or > 20%

Kubernetes Admission Layer (Kyverno Policies)
├── require-cost-labels     → blocks pods missing finops.org/* labels
├── enforce-resource-limits → blocks unconstrained containers
├── gpu-approval-gate       → requires explicit GPU approval annotation
└── require-pdb-large-wkld  → PDB required for workloads > 4 CPU / 8Gi

Cost Visibility Layer (Kubernetes)
├── Kubecost / OpenCost → real-time cost per namespace/workload
├── VPA Recommender     → CPU & memory rightsizing recommendations
└── CronJob             → monthly chargeback reports → S3/Blob/GCS

Alerting Layer (Prometheus + Alertmanager)
├── budget-alerts  → 80% / 100% / 120% threshold per cost center
├── anomaly-alerts → cost spike vs 7-day baseline
└── tag-compliance → fires when > 5% pods lack required cost labels

Dashboards (Grafana, 9 dashboards)
  Cost Overview · Cost Breakdown · Budget Tracking · Anomaly Detection
  Rightsizing Opportunities · Tag Compliance · Multi-Cloud Comparison
  Reserved Capacity · Optimization Opportunities
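
The CI-layer merge gate (block when cost growth exceeds $500 or 20%) reduces to two comparisons. This sketch shows one plausible reading of that rule; it is not Infracost's policy engine, the function name is hypothetical, and treating brand-new infrastructure (zero baseline) as subject only to the dollar threshold is an assumption.

```python
def cost_gate_blocks(baseline_monthly: float, proposed_monthly: float,
                     max_increase_usd: float = 500.0,
                     max_increase_pct: float = 20.0) -> bool:
    """True when the PR's monthly cost growth should block the merge."""
    increase = proposed_monthly - baseline_monthly
    if increase <= 0:
        return False  # cost is flat or falling
    if increase > max_increase_usd:
        return True   # absolute threshold exceeded
    if baseline_monthly <= 0:
        return False  # assumption: no percentage check without a baseline
    return (increase / baseline_monthly) * 100 > max_increase_pct


# Hypothetical PRs: (baseline, proposed) monthly cost in USD.
for baseline, proposed in [(1000, 1150), (1000, 1250), (10000, 10600)]:
    print(baseline, proposed, cost_gate_blocks(baseline, proposed))
```

The two thresholds cover different failure modes: the percentage catches large relative jumps on small stacks, and the dollar cap catches modest-looking percentages on expensive ones.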

Required Cost Labels (Enforced at Admission)

# All pods must carry these labels — Kyverno blocks any missing them
metadata:
  labels:
    app: my-service
    finops.org/costcenter: "engineering"     # Must match finops/config/budgets.yaml
    finops.org/environment: "production"     # dev | staging | production
    finops.org/team: "platform"             # Optional but recommended
    finops.org/project: "my-service"        # Optional

FinOps Quick Start

# 1. Install cost monitoring (Kubecost or OpenCost)
./finops/scripts/install-cost-monitoring.sh --tool kubecost

# 2. Deploy Kyverno policies in audit mode first (non-blocking)
./finops/scripts/deploy-policies.sh --audit-mode

# 3. Import all 9 Grafana dashboards
export GRAFANA_API_KEY=your-key
./finops/scripts/deploy-dashboards.sh

# 4. Validate cost label compliance
python finops/scripts/validate-cost-tags.py --all-namespaces

# 5. Find rightsizing opportunities
python finops/scripts/analyze-rightsizing.py --all-namespaces

FinOps Optimization Loop

The optimization loop closes the gap between cost alert and verified savings with a repeatable, automated workflow. Every step produces artifacts — patches, PR descriptions, advisor reports — so engineers don't start from scratch each time.

Alert fires → Normalize costs → Analyze rightsizing → Generate PR → Stage 24h → Merge + verify

Alert Response Guide

Alert | Immediate action | Script
BudgetThreshold80 | Identify top spender by team | python finops/scripts/normalize-cloud-costs.py --by team
BudgetThreshold100 | Stop non-critical deployments | python finops/scripts/normalize-cloud-costs.py --by team
CostAnomalyDetected | Check for runaway workloads | python finops/scripts/detect-underutilized.py
VPA recommendation > 7 days old | Generate optimization PR | python finops/scripts/generate-optimization-pr.py
Unused volumes detected | Review and delete | python finops/scripts/detect-unused-volumes.py

Safety Guardrails in Optimization Scripts

  • Never reduce memory limits by more than 50% in one change (use --allow-aggressive to override with justification)
  • Never recommend limits below VPA lower bound
  • Warn on single-replica workloads with strict PDB before reducing resources
  • Never commit reserved capacity for workloads < 3 months old
  • Reserved capacity commitments > $10,000/year require leadership approval (documented in finops/docs/optimization-runbook.md)
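
The first two guardrails are easy to express as a pre-flight check on a proposed memory change. The sketch below is an illustration of those rules only, not the code in generate-optimization-pr.py; the function name and the MiB-based signature are assumptions.

```python
def memory_change_violations(current_limit_mib: float, proposed_limit_mib: float,
                             vpa_lower_bound_mib: float,
                             allow_aggressive: bool = False) -> list[str]:
    """Return the guardrails a proposed memory limit would break."""
    violations = []
    if proposed_limit_mib < vpa_lower_bound_mib:
        violations.append("proposed limit is below the VPA lower bound")
    reduction = (current_limit_mib - proposed_limit_mib) / current_limit_mib
    if reduction > 0.5 and not allow_aggressive:
        violations.append(
            f"reduction of {reduction:.0%} exceeds 50% in one change"
        )
    return violations


# Hypothetical workload: 1024 MiB today, VPA lower bound of 300 MiB.
print(memory_change_violations(1024, 400, 300))  # too large a step
print(memory_change_violations(1024, 600, 300))  # within both guardrails
```

Note that the override only relaxes the step-size rule; dropping below the VPA lower bound is still reported.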

Scripts Reference

Script | What it does | Make target
analyze-rightsizing.py | Reads VPA recommendations, calculates potential savings per workload, flags high-priority candidates | make finops-rightsizing
generate-optimization-pr.py | Generates Kustomize resource patches + PR description from VPA recommendations, with safety checks | make finops-optimize-pr
normalize-cloud-costs.py | Cross-cloud cost normalization to (vCPU_hours × 0.048) + (GiB_hours × 0.006). Supports markdown/json/csv output | make finops-normalize-costs
reserved-capacity-advisor.py | Risk-scored reserved instance recommendations (low/medium/high) with break-even and annual savings | make finops-reserved-capacity
validate-cost-tags.py | Checks all pods carry required finops.org/* labels; exits non-zero if compliance < threshold | -
detect-underutilized.py | Finds pods where actual CPU/memory usage is < 20% of requests over the past 7 days | -
detect-unused-volumes.py | Lists PVCs with no active pod binding in the last N days | -
generate-cost-report.py | Monthly chargeback CSV + JSON report by cost center, exported to S3/Blob/GCS | -
install-cost-monitoring.sh | Installs Kubecost or OpenCost via Helm with values from finops/helm/ | -
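
The normalization formula quoted for normalize-cloud-costs.py is simple enough to check by hand. The sketch below applies it to one workload; the two rates come from the table above, while the function name, the 730-hour month, and the example workload are assumptions for illustration.

```python
VCPU_HOUR_RATE = 0.048  # normalized cost per vCPU-hour (from the formula above)
GIB_HOUR_RATE = 0.006   # normalized cost per GiB-hour of memory


def normalized_cost(vcpu_hours: float, gib_hours: float) -> float:
    """(vCPU_hours x 0.048) + (GiB_hours x 0.006)."""
    return vcpu_hours * VCPU_HOUR_RATE + gib_hours * GIB_HOUR_RATE


# Hypothetical workload: 2 vCPU and 8 GiB running for a 730-hour month.
HOURS = 730
cost = normalized_cost(vcpu_hours=2 * HOURS, gib_hours=8 * HOURS)
print(f"normalized monthly cost: {cost:.2f}")
```

Pricing every cloud's usage with the same two rates is what makes the per-team totals comparable, at the cost of ignoring provider-specific discounts.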

Database Migration Patterns

All migrations must be backwards-compatible, idempotent, and tested locally before production. These three golden rules prevent the most common zero-downtime deployment failures.

Pattern Selection

Pattern | File | When to use | Runs when
Init container | cd/kubernetes/_patterns/db-migration-init-container.yaml | Fast migrations (< 5 min), non-Helm workload | Every pod start; blocks rollout until complete
Job | cd/kubernetes/_patterns/db-migration-job.yaml | Long migrations (> 5 min), backfills, one-time jobs | Explicitly from CI before deployment rollout
Helm hook | cd/kubernetes/_patterns/db-migration-hook.yaml | Helm-managed apps | Automatically as pre-upgrade hook; Helm blocks on failure

Expand / Contract Pattern

Never drop or rename a column in the same release that removes the code using it. Old pods are still running when new pods start:

Release | Schema change | App change
Release N | ADD new column (nullable) | New app writes to new column; old app ignores it
Release N+1 | Backfill rows; add NOT NULL | Both old and new app work against new schema
Release N+2 | No schema change | Remove code using the old column
Release N+3 | DROP old column | Safe: nothing reads it

Zero-Downtime Checklist

  • Migration is backwards-compatible with the old app version
  • Migration is idempotent (safe to run twice)
  • Rollback migration script written and tested
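  • activeDeadlineSeconds set on the migration Job; for the init container pattern, enforce a timeout inside the migration command, because a Deployment's pod template does not accept activeDeadlineSeconds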
  • Database backup taken within the last 24 hours
  • Migration tested on staging with production-scale data volume

Versioning Strategy

Two automated release tools are included. Both require Conventional Commits format — commit messages are parsed to determine version bumps and generate changelogs automatically.

Semantic Versioning (SemVer)

Format: MAJOR.MINOR.PATCH — recommended for libraries and APIs.

Type | SemVer bump | Example commit
feat | MINOR (0.x.0) | feat(auth): add OAuth2 login
fix | PATCH (0.0.x) | fix(db): handle null cursor on retry
feat! or BREAKING CHANGE: footer | MAJOR (x.0.0) | feat!: remove /v1 API endpoints
chore, docs, refactor, test | None | docs: update API reference
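
The bump rules in the table can be expressed as a small parser. The sketch below covers only the cases listed above and is not how either bundled release tool is implemented; the function names are hypothetical.

```python
import re

# type, optional (scope), optional "!" breaking marker, then ": description"
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<breaking>!)?: .+")


def bump_for(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for one commit message."""
    header, _, rest = message.partition("\n")
    match = HEADER.match(header)
    if match is None:
        raise ValueError("not a Conventional Commit header")
    has_footer = re.search(r"^BREAKING[ -]CHANGE: ", rest, re.MULTILINE)
    if match.group("breaking") or has_footer:
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match.group("type"), "none")


def next_version(version: str, bump: str) -> str:
    """Apply a bump to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version


for msg in ("feat(auth): add OAuth2 login", "fix(db): handle null cursor on retry",
            "feat!: remove /v1 API endpoints", "docs: update API reference"):
    print(msg, "->", next_version("1.2.3", bump_for(msg)))
```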

Image Tagging Rules

# Tag with Git SHA (always unique, traceable)
docker build -t my-app:sha-$(git rev-parse --short HEAD) .

# Also tag with semver on release
docker tag my-app:sha-abc1234 my-app:1.2.3
docker tag my-app:sha-abc1234 my-app:1.2
🚫
Never use :latest in production. The tag is mutable: it moves on every build, so you cannot tell which image a pod is actually running, rollbacks become unreliable, and Dependabot drift detection cannot be trusted. Kyverno blocks :latest in production namespaces.

Branching Strategy

Strategy | Best for | Branches
Trunk-Based (Recommended for CI/CD) | Teams with good test coverage and CI discipline | Short-lived feature branches (1–3 days) → main
GitFlow | Scheduled release cadence, multiple versions in production | main, develop, feature/*, release/*, hotfix/*
GitHub Flow | Web apps with continuous deployment | Feature branches merged via PR → main = deploy

Branch Protection (Required on main)

  • Require pull request with at least 1 reviewer
  • Require status checks to pass (CI workflow names)
  • Require branches to be up to date before merging
  • No force push; no deletion

Runbooks

Action-oriented operational procedures for incidents and routine operations. Every alert in production must link to a runbook via the runbook_url annotation on the PrometheusRule.

Included Runbooks

Runbook | Triggers | Key content
docs/runbooks/podcrashloopbackoff.md | PodCrashLoopBackOff alert | Exit code diagnosis (137=SIGKILL, usually OOM; 1=app crash; 132=illegal instruction; 139=segfault), liveness probe misconfiguration, rollback procedures
docs/runbooks/slo-breach-response.md | SLOBurnRateCritical/High/Medium/Low | Severity triage by burn rate tier, mitigation options (rollback/scale/feature flag), error budget remaining check
docs/runbooks/slo-quarterly-review.md | Quarterly schedule | 60-min agenda, decision criteria for tightening/relaxing targets, error budget policy template
docs/runbooks/template.md | Write new runbooks | Standard sections: overview, user impact, immediate steps, diagnosis checklist, likely causes, Grafana queries, escalation, post-incident
secops/runbooks/compromised-pod.md | Falco alert / anomaly detection | Pod isolation, evidence collection, forensic preservation, remediation
secops/runbooks/secret-exposure.md | TruffleHog alert / security report | Immediate rotation, scope assessment, Git history remediation, audit
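
The exit code diagnosis in the crash-loop runbook follows a general convention: a container exit code above 128 usually means the process was terminated by signal number (code - 128). A minimal sketch of that decoding, with a hypothetical function name and assuming a Linux host for the signal numbering:

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short diagnosis hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {number}"
        if name == "SIGKILL":
            # 137 is often the OOM killer; confirm via the pod's last terminated reason.
            return "killed by SIGKILL (check for OOMKilled)"
        return f"killed by {name}"
    return "application error"
```

Exit code 137 maps to SIGKILL, 139 to SIGSEGV, and 132 to SIGILL; exit code 1 is an ordinary application failure.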

Write a New Runbook

# Copy the template
cp docs/runbooks/template.md docs/runbooks/<alertname>.md

# Name the file to match the alert name exactly (lowercase, no spaces)
# e.g.: docs/runbooks/highlatencyp99.md

# Link it from the PrometheusRule
annotations:
  runbook_url: "https://github.com/your-org/repo/blob/main/docs/runbooks/<alertname>.md"

# Commit to docs/runbooks/ so it appears in PR review
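
The naming rule and the runbook_url requirement can be checked mechanically. The sketch below assumes the PrometheusRule has already been loaded into a plain dict (for example by a YAML parser); the function names are hypothetical and the base URL is the placeholder from the snippet above.

```python
RUNBOOK_BASE = "https://github.com/your-org/repo/blob/main/docs/runbooks"


def runbook_filename(alert_name):
    """Lowercase the alert name and drop spaces, as the naming rule requires."""
    return alert_name.replace(" ", "").lower() + ".md"


def runbook_url(alert_name):
    """Build the URL expected in the runbook_url annotation."""
    return f"{RUNBOOK_BASE}/{runbook_filename(alert_name)}"


def missing_runbooks(prometheus_rule):
    """Return alert names whose runbook_url is absent or points at another file."""
    problems = []
    for group in prometheus_rule.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            alert = rule.get("alert")
            if alert is None:
                # Recording rules have no alert name and need no runbook.
                continue
            url = rule.get("annotations", {}).get("runbook_url", "")
            if not url.endswith("/" + runbook_filename(alert)):
                problems.append(alert)
    return problems
```

A check like this could run in CI so that an alert without a matching runbook fails review rather than surfacing during an incident.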

Architecture Decision Records

ADRs capture why decisions were made — not just what was decided. They prevent repeated debates and preserve context when team members change. All ADRs live in docs/decisions/.

ADR | Decision | Key rationale
ADR-001-folder-structure.md | Organize by functional concern first, then platform | Engineers ask "I need to deploy to AKS" not "I need a GitHub Actions file." Concern-first mirrors how teams reason about problems.
ADR-002-helm-vs-kustomize.md | Include both; Kustomize for internal apps, Helm for distributed packages | Kustomize is kubectl-native, lower learning curve, readable YAML. Helm adds value for versioned chart distribution and complex conditional logic.
ADR-003-gitops-strategy.md | ArgoCD as default; Flux included | ArgoCD: richer UI, intuitive Application CRD, App-of-Apps scales well. Flux: better multi-tenancy for large orgs. 2026 extension adds fleet management via ApplicationSet list generator.

Makefile & Taskfile Targets

All common commands are abstracted behind make targets. Run make help to see the full list at any time.

Target | What it does
make check-prereqs | Verify all required tools are installed and at correct versions
make dev | Start local kind cluster with registry, ingress, and dev namespace
make dev-compose | Start local stack via Docker Compose (no Kubernetes)
make hooks | Install pre-commit hooks (run once after cloning)
make lint | Run all pre-commit hooks against all files
make teardown | Destroy local kind cluster and registry
make logs | Follow Docker Compose logs

Target | What it does
make build | Build Docker image tagged IMAGE_NAME:IMAGE_TAG
make build-push | Build and push image to local kind registry
make test | Run unit tests (override with your test command)
make tag-release | Create and push signed semver git tag (prompts for version)

Target | What it does
make deploy-dev | Build, push, apply Kustomize dev overlay to local cluster
make deploy-staging | Apply Kustomize staging overlay (requires staging kubeconfig)
make k8s-status | Show pods, services, ingresses across all namespaces
make rollout-status | Check rollout health for all deployments

Target | What it does
make policy-report | Show Kyverno policy violation report across all namespaces
make compliance-report | Generate SOC 2/CIS/ISO 27001 compliance report, fail if score < 80%
make slo-validate | Lint SLO YAML files, verify recording rule naming convention

Target | What it does
make finops-rightsizing | Analyze CPU/memory rightsizing across all namespaces
make finops-optimize-pr | Generate draft optimization PR with rightsizing changes
make finops-normalize-costs | Normalize cross-cloud costs to standard unit for comparison
make finops-reserved-capacity | Evaluate reserved instance / savings plan recommendations

Target | What it does
make catalog-validate | Validate all service and team catalog entries (CI gate equivalent)
make catalog-codeowners | Regenerate .github/CODEOWNERS from catalog team definitions

Supply Chain Security

Keyless signing, SBOM attestation, SLSA provenance, and Kyverno admission verification — wired as a standard delivery path. Zero private keys are stored anywhere.

code commit → build image → sign (cosign keyless) → SBOM (spdx-json) → SLSA provenance → push to registry → Kyverno verifies at admission

Prerequisites

Requirement | Why required
cosign CLI | Verify signatures and attestations locally
GitHub Actions OIDC federation | Required before keyless signing can mint OIDC-backed certificates
Kyverno installed in cluster | Admission verification and PolicyReport compliance output

Implementation Steps

# Step 1: Configure OIDC first (prerequisite for keyless signing)
# See docs/guides/github-actions-oidc.md

# Step 2: Apply supply chain policies
kubectl apply -f secops/supply-chain/cosign-verify-policy.yaml
kubectl apply -f secops/supply-chain/sbom-policy.yaml
kubectl apply -f secops/supply-chain/slsa-verify.yaml

# Step 3: Wire the reusable workflow (copy from dotnet example)
# ci/github-actions/dotnet/supply-chain-integration.yml

# Step 4: Start in audit mode — non-blocking migration
kubectl label namespace <ns> policy.kyverno.io/supply-chain=audit --overwrite

# Step 5: Verify signature
cosign verify <image> \
  --certificate-identity-regexp 'https://github.com/YOUR-ORG/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com

# Step 6: Verify SBOM attestation
cosign verify-attestation \
  --certificate-identity-regexp 'https://github.com/YOUR-ORG/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --type spdx <image>

# Step 7: Graduate to enforce (after all violations resolved)
kubectl label namespace <ns> policy.kyverno.io/supply-chain- --overwrite

# Step 8: Monitor compliance
kubectl get policyreport -A

Policy Guardrails

Rule | Enforced by
No unsigned images in production | cosign-verify-policy.yaml (Enforce)
No images without SBOM in production | sbom-policy.yaml (Enforce)
No images without SLSA provenance | slsa-verify.yaml (Enforce)
Supply chain tool versions pinned to SHA | Code review + reusable-supply-chain.yml
Keyless signing only; no private keys stored | OIDC-only workflow design

FinOps Optimization

Close the loop from a cost alert to a merged PR to verified savings with a repeatable, automated workflow. Every step produces concrete artifacts — no starting from scratch each time.

1

Identify the Trigger

Confirm which alert fired or which KPI changed in the monthly report. Reference: finops/docs/optimization-runbook.md

2

Run Cross-Cloud Normalization

python finops/scripts/normalize-cloud-costs.py --by team

Find top movers by team and cloud before selecting an action branch.

3

Rightsizing Analysis

python finops/scripts/analyze-rightsizing.py --namespace <ns>

Focus on candidates with highest monthly savings and low operational risk first.

4

Generate Optimization PR Assets

python finops/scripts/generate-optimization-pr.py \
  --namespace <ns> --min-savings 20

Outputs Kustomize patches under optimization-patches/<ns>/ and a pre-filled PR-DESCRIPTION.md.

5

Test in Staging for 24 Hours

kubectl apply -k optimization-patches/<ns>/

Monitor for OOM kills, error-rate changes, and latency regressions before promoting to production.

6

Merge & Verify Savings

Merge the PR and validate savings in Grafana Optimization dashboards within 48 hours. If savings materially change the budget forecast, update finops/config/budgets.yaml.
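
The prioritization advice in step 3 (highest monthly savings, lowest operational risk first) can be sketched as a simple filter and sort. Everything here is an assumption made for illustration: the Candidate shape, the risk labels, and the reading of --min-savings as a monthly dollar threshold are not taken from the actual scripts.

```python
from dataclasses import dataclass

# Hypothetical risk labels, ordered from safest to riskiest.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Candidate:
    workload: str
    monthly_savings_usd: float
    risk: str


def prioritize(candidates, min_savings=20.0):
    """Drop candidates below the savings threshold, then sort by risk and savings."""
    eligible = [c for c in candidates if c.monthly_savings_usd >= min_savings]
    # Lower risk first; within the same risk tier, larger savings first.
    return sorted(eligible, key=lambda c: (RISK_ORDER[c.risk], -c.monthly_savings_usd))
```

Sorting by risk before savings is one reasonable reading of the guidance; a team that weights savings more heavily could swap the two sort keys.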

Reserved Capacity (Quarterly Review)

# Risk-scored reserved instance recommendations
python finops/scripts/reserved-capacity-advisor.py \
  --min-savings 500 --max-risk medium --term 1yr

# Safety rules:
# - Never recommend 3-year for workloads < 3 months old
# - Commitments > $10k/year require leadership approval
# - Always test in staging 24h before production changes
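
The first two safety rules above are simple enough to encode as a pre-check. The sketch below is a hypothetical helper, not the advisor script's implementation; parameter names and message wording are illustrative.

```python
def review_commitment(term_years, workload_age_months, annual_commitment_usd):
    """Return (allowed, notes) for a proposed reserved-capacity commitment."""
    allowed = True
    notes = []
    # Rule: never recommend a 3-year term for workloads younger than 3 months.
    if term_years == 3 and workload_age_months < 3:
        allowed = False
        notes.append("3-year term rejected: workload is younger than 3 months")
    # Rule: commitments above $10k/year need leadership approval.
    if annual_commitment_usd > 10_000:
        notes.append("leadership approval required: commitment exceeds $10k/year")
    return allowed, notes
```

A 1-year, $12,000 commitment on a two-month-old workload is allowed but flagged for approval, while a 3-year term on the same workload is rejected outright.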

Disaster Recovery

Recovery procedures for the three most common disaster scenarios. Use the decision tree to identify which section applies and follow the steps in order.

Is the entire cluster gone?
├── Yes → Section 1: Cluster Recovery
└── No
    └── Is the database corrupted or unavailable?
        ├── Yes → Section 2: Database Recovery
        └── No
            └── Is a namespace/workload corrupted?
                └── Yes → Section 3: Namespace Restore
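
The decision tree can also be written as a short function, which makes the precedence explicit: cluster loss is checked first, then the database, then the namespace. The function name and boolean parameters are hypothetical.

```python
def recovery_section(cluster_gone, database_unavailable, namespace_corrupted):
    """Return the recovery section to follow, or None if no scenario applies."""
    if cluster_gone:
        return "Section 1: Cluster Recovery"
    if database_unavailable:
        return "Section 2: Database Recovery"
    if namespace_corrupted:
        return "Section 3: Namespace Restore"
    return None
```

If several conditions are true at once, the first match wins, mirroring the top-down order of the tree.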

RTO / RPO Summary

Component | RPO (data loss) | RTO (downtime) | Method
Kubernetes workloads | 24h (daily backup) | 2h | Velero restore to new cluster
AWS RDS | 5 min (PITR) | 1h | RDS PITR restore
Azure PostgreSQL | 5 min (PITR) | 1–2h | Flexible Server PITR / geo-restore
GCP Cloud SQL | 5 min (PITR) | 30 min | Cloud SQL PITR or replica promotion
Secrets (ESO) | 0 (live in cloud store) | 15 min | Reinstall ESO + apply ClusterSecretStore

Section 1 — Cluster Recovery

# 1. Provision new cluster from existing Terraform
terraform -chdir=terraform/aws-eks apply \
  -var="project=<project>" -var="environment=production"

# 2. Install Velero (pointing to same backup bucket)
bash backup/velero/aws-install.sh

# 3. Restore latest scheduled backup
velero backup get
velero restore create \
  --from-schedule daily-full-backup \
  --restore-volumes=true
velero restore describe <restore-name> --details

# 4. Verify ESO re-syncs secrets automatically
kubectl get externalsecret -A

# 5. Validate cluster health
kubectl get nodes
kubectl get pods -A
curl -f https://my-app.example.com/health

Section 2 — Database Recovery (AWS RDS PITR)

# Check the latest restorable time, then pick a restore time before the corruption event
aws rds describe-db-instances \
  --db-instance-identifier rds-<project>-production \
  --query 'DBInstances[0].LatestRestorableTime'

# Restore to new instance
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier rds-<project>-production \
  --target-db-instance-identifier rds-<project>-production-restored \
  --restore-time "2026-01-15T14:30:00Z"

# Wait for availability (~10-20 minutes)
aws rds wait db-instance-available \
  --db-instance-identifier rds-<project>-production-restored

# Update connection string in AWS Secrets Manager, then restart pods

Post-Recovery Checklist

  • All pods are Running (no CrashLoopBackOff)
  • Ingress endpoints respond with HTTP 200
  • Database connection strings point to recovered instance
  • ExternalSecrets are Synced (kubectl get externalsecret -A)
  • Monitoring/alerting is active (Prometheus targets healthy)
  • On-call acknowledged — incident ticket updated with timeline
  • Post-mortem scheduled within 48 hours
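
The first checklist item can be verified from the output of kubectl get pods -A -o json. The sketch below parses that JSON and lists pods that are crash-looping or not running. The function name is hypothetical, and treating Succeeded pods (for example completed Jobs) as healthy is an assumption, not something the checklist states.

```python
import json


def unhealthy_pods(pod_list_json):
    """Return (namespace/name, reason) pairs for pods that need attention."""
    findings = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        ident = f"{meta.get('namespace')}/{meta.get('name')}"
        # A crash-looping pod often still reports phase Running, so the
        # container waiting reason has to be inspected as well.
        waiting_reasons = [
            cs["state"]["waiting"].get("reason", "")
            for cs in status.get("containerStatuses", [])
            if "waiting" in cs.get("state", {})
        ]
        if "CrashLoopBackOff" in waiting_reasons:
            findings.append((ident, "CrashLoopBackOff"))
        elif status.get("phase") not in ("Running", "Succeeded"):
            findings.append((ident, f"phase={status.get('phase')}"))
    return findings
```

An empty result means every pod is either Running without a crash loop or has completed successfully.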

Concepts & Glossary

Canonical terminology used consistently across all templates, guides, and review language in this repository.

Term | Definition
Golden Path | Preferred, opinionated, end-to-end implementation workflow. Not optional guidance: it encodes production experience and enforces guardrails.
Template | Reusable starting file intended for adaptation by teams. Copy it, change the <-- CHANGE THIS markers, keep the guardrails.
Guardrail | Mandatory safety or governance control embedded in templates and CI/CD. Not configurable; removing a guardrail is a deliberate exception requiring documented justification.
Baseline | Minimum acceptable standard for production readiness. A service not meeting the baseline is not production-ready, regardless of feature completeness.
Runbook | Action-oriented operational procedure for incidents or routine operations. Linked from alert annotations so on-call engineers see it immediately when an alert fires.
Target | Deployment destination template under cd/targets/. One target per cloud platform and deployment model.
Overlay | Environment-specific Kustomize customization on top of base Kubernetes manifests. Only patches what differs; never duplicates the full manifest.
Error Budget | The allowed unreliability calculated from the SLO target. Spend it on risky features; run out and reliability work takes priority over feature work.
Burn Rate | Speed of error budget consumption relative to the rate that would spend the budget exactly over the SLO window. 1× = on track. 14.4× = a 30-day budget is exhausted in about 50 hours (2% of the budget consumed per hour).
OIDC | OpenID Connect federation for short-lived cloud authentication. The correct alternative to static long-lived credentials in CI/CD.
ESO | External Secrets Operator. Syncs secrets from cloud secret managers (AWS SM, Azure KV, GCP SM) into Kubernetes Secrets automatically.
GitOps | Deployment model where desired cluster state is declared in Git and a controller continuously reconciles actual state to match. ArgoCD and Flux are the GitOps controllers in this repo.
SBOM | Software Bill of Materials. Machine-readable inventory of all components in a built artifact; enables supply chain risk assessment and CVE impact analysis.
SLSA | Supply-chain Levels for Software Artifacts. Framework for build provenance; proves an artifact was built from a specific source by a specific process.
ADR | Architecture Decision Record. Lightweight document capturing what was decided, why, what alternatives were considered, and what the consequences are.
FinOps | Cloud financial operations: the practice of bringing financial accountability to cloud spend as an engineering discipline, not a monthly finance task.
VPA | Vertical Pod Autoscaler. Generates CPU and memory right-sizing recommendations. Used in Off mode here: recommendations only, no automatic mutation of running pods.
HPA | Horizontal Pod Autoscaler. Scales pod count based on CPU, memory, or custom metrics. Target CPU at 70% to provide headroom before scaling lag causes user-visible latency.
PDB | PodDisruptionBudget. Ensures a minimum number of pods remain available during voluntary disruptions (node drains, rolling updates). Required for all production services.
Kyverno | Kubernetes-native policy engine. Validates, mutates, and generates Kubernetes resources at admission time. Policies are Kubernetes custom resources; no Rego required.
Falco | Runtime security engine. Watches kernel syscalls and Kubernetes audit events for suspicious behavior patterns in live containers.
Velero | Kubernetes backup and restore tool. Backs up cluster objects and persistent volume snapshots. Used with DB-native backups for complete DR coverage.
Infracost | Terraform PR cost estimation tool. Posts monthly cost delta as a PR comment and blocks merges when cost growth exceeds configured thresholds.
OTel / OpenTelemetry | Vendor-neutral telemetry collection standard. Provides a single API for metrics, logs, and traces across languages and backends.
SLI / SLO / SLA | SLI = the metric measured. SLO = target level for that metric (internal). SLA = external customer-facing commitment derived from the SLO with a penalty clause.
Cosign | Tool for signing and verifying container images and other OCI artifacts. Used in keyless mode here: signatures are tied to OIDC identity, not stored private keys.
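
The Error Budget and Burn Rate entries reduce to simple arithmetic, sketched below. The 30-day SLO window is an assumption (a common default, not stated in this glossary), and the function names are hypothetical.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unreliability in the window for a given SLO target."""
    return (1.0 - slo_target) * window_days * 24 * 60


def hours_to_exhaustion(burn_rate, window_days=30):
    """Hours until a full budget is spent at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate, hours, window_days=30):
    """Fraction of the window's budget consumed after burning for some hours."""
    return burn_rate * hours / (window_days * 24)
```

Under these assumptions a 99.9% SLO allows roughly 43 minutes of unreliability per 30 days, a burn rate of 1 spends the budget in exactly 720 hours, and a burn rate of 14.4 spends it in about 50 hours, or 2% of the budget per hour.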