DevOps Playbook · Engineering Handbook

Production-Grade DevOps & Platform from Day One

A battle-tested, opinionated reference kit for cloud-native teams — covering CI/CD, Kubernetes, security, observability, FinOps, and SRE practices with copy-ready templates and clear guardrails.

16 Golden Paths · Multi-Cloud: AWS / Azure / GCP · Security by Default · FinOps Governance Built-in

Implementation reference: https://vivek-doshi.github.io/devops-playbook/

☸

Kubernetes Microservice

Full path from local dev to production on EKS/AKS/GKE with GitOps

docker → ci → terraform → argocd

⚙

CI Pipeline Templates

GitHub Actions, Azure Pipelines, GitLab CI, Jenkins — all stacks

ci/github-actions/ · ci/azure-pipelines/

🔐

Security Scanning

Trivy, Gitleaks, Checkov, Semgrep wired into every pipeline

ci-security/ · policy/ · secops/

💰

FinOps Governance

Budget alerts, rightsizing, Infracost PR gates, 9 Grafana dashboards

finops/ · policy/kyverno/

🎯

SLO-Driven Development

Error budgets, multi-window burn-rate alerts, quarterly review runbook

observability/prometheus/slos/

🔑

OIDC / Keyless Auth

GitHub Actions to AWS/Azure/GCP — no long-lived credentials

docs/guides/github-actions-oidc.md

Getting Started

Quick Start

From zero to first PR in under an hour. Follow this sequence — no archaeology of the repo required.

Check Prerequisites

Run the environment checker to verify your local toolchain is complete before touching anything else.

bash scripts/env-checker.sh

Requires: Git 2.40+, Docker Desktop 4.x, kubectl 1.29+, kind 0.24+, Helm 3.14+, Terraform 1.7+, pre-commit 3.x

Install Git Hooks

Hooks catch secrets, IaC formatting errors, and code quality issues before they reach CI — making feedback loops faster and CI less noisy.

make hooks
# equivalent to:
pre-commit install
pre-commit install --hook-type pre-push

Start Local Kubernetes Cluster

A kind (Kubernetes in Docker) cluster that mirrors the production overlay structure — with a local registry, ingress-nginx, and a dev namespace ready.

make dev
# Creates kind cluster "devops-playbook"
# Starts local registry at localhost:5001
# Installs ingress-nginx via Helm
# Applies dev Kustomize overlay

Pick Your Golden Path

Each path is an opinionated, end-to-end workflow with exact file references, guardrails, and validation steps baked in.

Kubernetes Microservice · Frontend SPA · Serverless App · Data Pipeline · MLOps

Open a PR

Branch names follow type/description. Commits must follow Conventional Commits — the pre-push hook will reject non-conforming messages.

git checkout -b feat/add-service-name
make lint          # run all pre-commit hooks
git commit -m "feat(api): add health endpoint"
git push origin feat/add-service-name
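The naming rules above can be sketched as two regular expressions. This is only an illustration of the Conventional Commits header shape and the type/description branch pattern: the list of allowed types and the character set for scopes and branch names are assumptions, and the hook shipped in the repo may accept a different set.

```python
import re

# Assumed type list (hypothetical); the playbook's hook may allow other types.
ALLOWED_TYPES = (
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "chore", "revert",
)
_TYPES = "|".join(ALLOWED_TYPES)

# <type>(<optional scope>)<optional !>: <subject>
HEADER_RE = re.compile(
    r"^(?P<type>" + _TYPES + r")"
    r"(\((?P<scope>[a-z0-9._-]+)\))?"
    r"(?P<breaking>!)?: (?P<subject>\S.*)$"
)

# <type>/<description>, lowercase with dashes, dots, or underscores
BRANCH_RE = re.compile(r"^(" + _TYPES + r")/[a-z0-9][a-z0-9._-]*$")


def is_conventional_commit(header: str) -> bool:
    """Check only the first line of a commit message."""
    return HEADER_RE.match(header) is not None


def is_valid_branch(name: str) -> bool:
    """Check a branch name against the assumed type/description pattern."""
    return BRANCH_RE.match(name) is not None
```

With these definitions, "feat(api): add health endpoint" and the branch feat/add-service-name both pass, while a message such as "added stuff" does not.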

Codebase

Repository Structure

Organized by functional concern first, then by platform and technology — so you ask "I need to deploy to AKS" and navigate directly there, not "what type of file is this".

Directory	Purpose	Key files
`docker/`	Multi-stage Dockerfiles for every major stack (.NET, Python, Node.js, Java, Go, Ruby, React, Angular)	`Dockerfile.api`, `Dockerfile.worker`, `security-hardened.Dockerfile`
`compose/`	Docker Compose stacks for local development with realistic service topologies	`microservices-example/`, `python-postgres-redis/`
`ci/`	CI pipeline templates across GitHub Actions, Azure Pipelines, GitLab CI, Jenkins — all major stacks	`_shared/reusable-*.yml`, `_strategies/`
`cd/`	CD manifests, GitOps definitions, Helm charts, Kustomize overlays, deployment targets	`kubernetes/_base/`, `gitops/argocd/`, `targets/`
`terraform/`	IaC blueprints for EKS, AKS, GKE, ECS, Lambda with bootstrap modules	`_bootstrap/`, `aws-eks/backup.tf`
`ci-security/`	Security scanning integrations: SAST, container scan, secret detection, IaC scan, dependency audit	`trivy-scan.yml`, `gitleaks.yml`, `checkov.yml`
`secops/`	Runtime security, supply chain controls, compliance control libraries, incident runbooks	`runbooks/`, `compliance/`, `supply-chain/`
`policy/`	Kyverno admission policies and Conftest/OPA rules enforced at cluster level	`kyverno/require-*.yaml`, `conftest/kubernetes/`
`secrets/`	External Secrets Operator configs, rotation workflows, lifecycle guides	`external-secrets/`, `rotation/`, `guides/`
`observability/`	Prometheus, Loki, Tempo, OpenTelemetry — full telemetry stack with SLOs and dashboards	`prometheus/slos/`, `prometheus/recording-rules/`
`finops/`	Cost monitoring, Kyverno cost policies, Infracost CI, rightsizing scripts, 9 Grafana dashboards	`scripts/analyze-rightsizing.py`, `policies/`, `dashboards/`
`catalog/`	Git-native service and team registry with CI validation and CODEOWNERS generation	`schema/service.yaml`, `scripts/validate-catalog.py`
`backup/`	Velero cluster backup and DB PITR Terraform modules	`velero/schedule.yaml`, `terraform/aws-rds-backup.tf`
`docs/`	Golden paths, architecture guides, runbooks, ADRs, environment strategy	`golden-paths/`, `guides/`, `runbooks/`

Design Principles

Clarity Over Abstraction

Every template is readable in one pass. No hidden control flow or magic variables buried three layers deep.

Golden Paths Over Flexibility

Opinionated choices are made for you. Adapt the templates but preserve the guardrails — that's the contract.

Production-Grade Defaults

Resource limits, non-root containers, read-only filesystems, and health probes are required, not optional.

Security by Default

OIDC over static credentials. External secrets over ConfigMaps. Kyverno rejects non-compliant resources at admission, not after they are running.

Setup

Dev Environment

Two modes: the devcontainer (recommended for onboarding and pair programming — guarantees identical toolchain) or local machine with manual tool installation.

Open the repo in VS Code and select Reopen in Container. The devcontainer installs all required tools at pinned versions and configures the environment automatically.

# All tools are pre-installed in the container:
# kubectl, helm, kind, terraform, pre-commit, ruff, mypy
# hadolint, checkov, gitleaks, cosign, velero, yq, jq

# Config: .devcontainer/devcontainer.json
# Dockerfile: .devcontainer/Dockerfile
# Post-create: .devcontainer/scripts/post-create.sh

💡

Version pinning: The devcontainer Dockerfile pins every CLI tool by exact version. Run pre-commit autoupdate quarterly to refresh hook revisions.

Local machine (manual tool installation):

# 1. Verify all tools
bash scripts/env-checker.sh

# 2. Start local Kubernetes cluster
make dev
# → kind cluster "devops-playbook" with registry + ingress

# 3. Install git hooks
make hooks

# 4. Run all linters
make lint

# 5. Deploy to local cluster
make deploy-dev

# 6. Check status
make k8s-status

# 7. Tear down
make teardown

For MLOps and GPU workloads, use the dedicated CUDA devcontainer:

# Open: .devcontainer/gpu/devcontainer.json
# Requires: NVIDIA drivers + Docker GPU pass-through

# Validate GPU visibility
nvidia-smi

# Start Jupyter
python3 -m jupyter lab --ip=0.0.0.0 --no-browser

See .devcontainer/gpu/README.md for full GPU cluster provisioning with Terraform GPU node groups.

Workflows

Golden Paths

Golden paths are opinionated, end-to-end workflows with exact file references and enforced guardrails. They eliminate decision fatigue and encode hard-won production experience. Pick your scenario and follow the steps — the right templates, security gates, and operational hooks are already connected.

🗺

How to choose: Start with your deployment target (Kubernetes, serverless, app service) then refine by workload type (API, frontend, batch, mobile). If in doubt, start with Platform Onboarding to establish foundations before any application path.

☸

Kubernetes Microservice

Backend API on EKS/AKS/GKE. Includes GitOps, DB migrations, secrets, observability, FinOps checkpoints, and backup.

13 Steps · Most Complete

📊

SLO-Driven Development

Define reliability targets, write recording rules, configure multi-window burn-rate alerts, and establish error budget policy.

8 Steps · SRE Practice

💰

FinOps Optimization

Close the loop from cost alert to merged PR to verified savings. Rightsizing, reserved capacity, and cross-cloud normalization.

9 Steps · Cost Reduction

🔐

Supply Chain Security

Keyless image signing, SBOM attestation, SLSA provenance, and Kyverno admission verification. Zero private keys stored.

9 Steps · Security

🚨

Incident Response

5-phase ops runbook: acknowledge → triage → mitigate → resolve → PIR. Covers rollback, scaling, DB recovery, and Velero restore.

5 Phases · Operations

📋

Service Catalog

Git-native service registration with owner, on-call routing, SLO link, cost center — validated in CI and enforced at admission.

7 Steps · Governance

🏗

Platform Onboarding

New team setup: toolchain, hooks, GitHub config, OIDC, Terraform state, namespace, secrets, observability, alert routing, runbooks.

11 Steps · Foundation

📱

Mobile Backend (BFF)

API versioning, OAuth2/PKCE, APNs/FCM push notifications, per-user rate limiting. 12-month deprecation policy built in.

8 Steps · Mobile

🏢

Multi-Tenant SaaS

Namespace-per-tenant isolation, automated onboarding, per-tenant secrets/schema, billing instrumentation, safe offboarding.

9 Steps · SaaS

📜

Compliance Reporting

SOC 2, CIS Kubernetes, ISO 27001 — machine-readable control libraries, automated evidence collection, CI scoring gates.

9 Steps · Compliance

Golden Path

Kubernetes Microservice — End to End

The most complete path: backend API or microservice from local dev to production on any cloud-managed Kubernetes cluster. Every step names the exact file to copy or edit.

local dev → pre-commit → CI build/test/scan → image push → GitOps update → ArgoCD sync → Kubernetes → alerts

File Map by Step

Step	What	File(s)
1	Local kind cluster	`local-dev/kind/setup.sh`
2	Pre-commit hooks	`.pre-commit-config.yaml`
3	CI pipeline (pick stack)	`ci/github-actions/{stack}/build-test.yml`
4	Docker build (reusable workflow)	`ci/github-actions/_shared/reusable-docker-build.yml`
5	Security scans (parallel)	`ci-security/container-scanning/trivy-scan.yml` `ci-security/secret-detection/gitleaks.yml` `ci-security/sast/semgrep.yml`
6	Bootstrap Terraform state (once)	`terraform/_bootstrap/{cloud}/main.tf`
7	Provision cluster + DB (with backup)	`terraform/{aws-eks|azure-aks|gcp-gke}/backup.tf`
8	Kubernetes manifests	`cd/kubernetes/_base/deployment.yaml`, `hpa.yaml`, `pdb.yaml`
9	DB migrations	`cd/kubernetes/_patterns/db-migration-job.yaml`
10	Secrets (ESO)	`secrets/external-secrets/example-external-secret.yaml`
11	GitOps deploy (ArgoCD)	`cd/gitops/argocd/application.yaml`
12	Observability + SLOs	`observability/prometheus/slos/availability-slo.yaml`
13	Backup (prod only)	`backup/velero/schedule.yaml`

Kubernetes Security Baseline (Kyverno-Enforced)

The following fields are expected on every pod. In production namespaces, Kyverno blocks deployments that violate a policy in Enforce mode; policies in Audit or Warn mode only report the violation:

Field	Required value	Policy
`spec.securityContext.runAsNonRoot`	`true`	`require-non-root.yaml` — Enforce
`containers[].securityContext.allowPrivilegeEscalation`	`false`	`require-non-root.yaml` — Enforce
`containers[].resources.requests.cpu`	any value	`require-resource-limits.yaml` — Enforce
`containers[].resources.requests.memory`	any value	`require-resource-limits.yaml` — Enforce
`containers[].resources.limits.cpu`	any value	`require-resource-limits.yaml` — Enforce
`containers[].resources.limits.memory`	any value	`require-resource-limits.yaml` — Enforce
`metadata.labels.app`	non-empty string	`require-labels.yaml` — Audit
`containers[].securityContext.readOnlyRootFilesystem`	`true`	`require-readonly-filesystem.yaml` — Warn
`finops.org/costcenter`	string matching budgets.yaml	`require-cost-labels.yaml` — Enforce
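To make the Enforce-level rows concrete, here is a minimal sketch of the same checks applied to a pod spec that has already been parsed into a Python dict. It is not the Kyverno policy itself, and it covers only the require-non-root and require-resource-limits rows; the function name and the message wording are hypothetical.

```python
def baseline_violations(pod_spec: dict) -> list:
    """Return human-readable violations of the Enforce-level baseline."""
    violations = []

    # Pod-level: spec.securityContext.runAsNonRoot must be exactly true.
    if pod_spec.get("securityContext", {}).get("runAsNonRoot") is not True:
        violations.append("spec.securityContext.runAsNonRoot must be true")

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Container-level: privilege escalation must be explicitly disabled.
        sec = container.get("securityContext", {})
        if sec.get("allowPrivilegeEscalation") is not False:
            violations.append(name + ": allowPrivilegeEscalation must be false")

        # Requests and limits must both set cpu and memory (any value).
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            for res in ("cpu", "memory"):
                if not resources.get(kind, {}).get(res):
                    violations.append(
                        name + ": resources." + kind + "." + res + " is missing"
                    )
    return violations
```

Running this against the pod template of a manifest before pushing gives an early signal about what the admission controller is likely to reject.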

⚠

Read-only filesystem: If your app writes to the filesystem (logs, tmp files), mount an emptyDir volume at that path. The pattern is in cd/kubernetes/_base/deployment.yaml.

FinOps Checkpoints (Pre-Production)

Three checkpoints are embedded in this path. All three are mandatory before promoting to production:

# Checkpoint 1: Verify cost labels are present
python finops/scripts/validate-cost-tags.py --namespace <your-ns>

# Checkpoint 2: Review VPA rightsizing (after 24h in staging)
python finops/scripts/analyze-rightsizing.py --namespace <staging-ns>

# Checkpoint 3: Confirm budget headroom
# Check: Grafana → FinOps — Budget Tracking dashboard

Golden Path

SLO-Driven Development

Translate reliability requirements into measurable targets, automated alerts, and an engineering process that balances feature velocity with operational risk.

Core Concepts

SLI — Service Level Indicator

The metric you measure. For availability: good_requests / total_requests. For latency: requests_under_200ms / total.

SLO — Service Level Objective

The target. "99.9% of requests succeed over a 30-day window." This is an internal engineering commitment, not a customer-facing SLA.

Error Budget

The allowed unreliability. At 99.9% over a 30-day window, you have about 43.2 minutes of downtime budget (about 43.8 minutes when averaged over a calendar month). Spend it on risk-taking features; run out and reliability work takes priority.

Burn Rate

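How fast you're consuming the budget. A burn rate of 1x uses up exactly the full budget over the SLO window. At 14.4x, a 30-day budget is gone in about 2 days (50 hours), and 2% of it is consumed every hour. Multi-window alerts catch both fast burns and slow drains.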
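The arithmetic behind error budgets and burn rates is short enough to write down. The sketch below assumes a 30-day SLO window by default, as in the SLO example above; the function names are illustrative and not part of the playbook.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a target such as 0.999."""
    return (1.0 - slo_target) * window_days * 24 * 60


def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent at a constant burn rate.

    A burn rate of 1 spends the budget exactly over the window, so the
    time to exhaustion is the window length divided by the burn rate.
    """
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days * 24 / burn_rate


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))   # about 43.2 minutes
    print(round(error_budget_minutes(0.9999), 1))  # about 4.3 minutes
    print(round(hours_to_exhaustion(14.4), 1))     # about 50 hours
    print(round(hours_to_exhaustion(6) / 24, 1))   # about 5 days
```

Because time to exhaustion scales with the window length, the same burn rate means a different deadline for a 7-day or 90-day SLO.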

Burn-Rate Alert Tiers

Four alert tiers are required. The two-window approach (fast + slow detection per tier) reduces both false positives and missed gradual degradations:

Critical — Tier 1

Burn ≥ 14.4×
Budget gone in ~2 days (~50 hours)
Alert after 2 minutes
→ Page immediately, open incident

High — Tier 2

Burn ≥ 6×
Budget gone in ~5 days
Alert after 15 minutes
→ Page, investigate within 15 min

Medium — Tier 3

Burn ≥ 3×
Budget gone in ~10 days
Alert after 1 hour
→ Ticket, investigate next standup

Low — Tier 4

Burn ≥ 1×
Will exhaust at end of month
Alert after 3 hours
→ Review in next sprint
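The tier thresholds above can be expressed as a small lookup. The sketch below assumes the common multi-window rule that a tier fires only when both a long and a short lookback window exceed its threshold, which is one way to read the two-window approach. The window lengths themselves are not specified here, so the function takes precomputed burn rates; the names are hypothetical.

```python
from typing import Optional

# (tier, burn-rate threshold), most severe first, taken from the tiers above.
TIERS = [
    ("critical", 14.4),
    ("high", 6.0),
    ("medium", 3.0),
    ("low", 1.0),
]


def firing_tier(long_window_burn: float, short_window_burn: float) -> Optional[str]:
    """Return the most severe tier whose threshold is met in BOTH windows.

    Requiring the short window as well keeps an alert from lingering after
    the burn has already stopped; requiring the long window filters blips.
    """
    for name, threshold in TIERS:
        if long_window_burn >= threshold and short_window_burn >= threshold:
            return name
    return None
```

For example, a long-window burn of 15 with a short-window burn of 2 only reaches the low tier: the fast burn has already subsided, so paging would be noise.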

Implementation Steps

# 1. Define your SLO (copy + fill the schema)
cp observability/prometheus/slos/slo-schema.yaml \
   observability/prometheus/slos/my-service-slo.yaml

# 2. Write recording rules (replace "api-gateway" with your service)
cp observability/prometheus/recording-rules/slo-burn-rates.yaml \
   observability/prometheus/recording-rules/my-service-slo-burn-rates.yaml
kubectl apply -f observability/prometheus/recording-rules/my-service-slo-burn-rates.yaml

# 3. Deploy burn-rate alerts
cp observability/prometheus/alerts/slo-burn-rate-alerts.yaml \
   observability/prometheus/alerts/my-service-slo-burn-rate-alerts.yaml
kubectl apply -f observability/prometheus/alerts/my-service-slo-burn-rate-alerts.yaml

# 4. Deploy the SLO dashboard ConfigMap (auto-discovered by Grafana)
kubectl apply -f observability/prometheus/dashboards/slo-status-configmap.yaml

# 5. Validate naming convention
make slo-validate

⚠

Recording rule naming convention is enforced: All SLO recording rules must follow slo:<service>:<metric>:<window>. Example: slo:api-gateway:availability_burn_rate:1h. The make slo-validate target checks this.
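As an illustration of the convention, the following sketch parses a rule name into its three parts. The exact character rules enforced by make slo-validate are not documented here, so the pattern below (lowercase dash-separated service, snake_case metric, Prometheus-style duration) is an assumption inferred from the single example.

```python
import re
from typing import Optional

# Assumed shape for slo:<service>:<metric>:<window>; the real validator
# behind `make slo-validate` may be stricter or looser than this.
RULE_NAME_RE = re.compile(
    r"^slo:(?P<service>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r":(?P<metric>[a-z][a-z0-9_]*)"
    r":(?P<window>\d+[smhdw])$"
)


def parse_slo_rule_name(name: str) -> Optional[dict]:
    """Split a recording rule name into service, metric, and window."""
    match = RULE_NAME_RE.match(name)
    return match.groupdict() if match else None
```

A name that omits the window, or that does not start with the slo: prefix, returns None and would fail the check under these assumptions.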

SLO Target Selection Guide

Most teams default to 99.9%. That is often wrong. Use these questions first:

What is the user impact of 1 minute of downtime?
What is your current measured baseline? (never set target above baseline)
Can your team sustain the on-call burden of a tighter target?

🚫

99.99% allows only 4.3 minutes/month. A single poorly tuned rolling deployment can exhaust this. Only commit to 99.99% if you have automated rollback, canary releases, and mature incident response.

Golden Path

Incident Response

A 5-phase ops runbook that applies to any service using this playbook. Start here when PagerDuty fires or a user reports production degradation.

Alert fires → Phase 1: Acknowledge → Phase 2: Triage → Phase 3: Mitigate → Phase 4: Resolve → Phase 5: PIR

Severity Levels

Severity	Definition	Response time
SEV-1	Total service outage — no users can access	Immediate — wake on-call
SEV-2	Partial outage — significant % of users affected	Within 15 minutes
SEV-3	Degraded performance or non-critical feature failure	Within 1 hour
SEV-4	Minor issue, workaround available	Next business day

Phase 1 — Acknowledge & Assemble (0–5 min)

# Acknowledge PagerDuty within 5 minutes to stop escalation

# Open dedicated Slack channel: #inc-YYYY-MM-DD-<service-name>

# Post immediately:
Service: <name>
Severity: SEV-X
Started: HH:MM UTC
IC: @name
Impact: <what users see>

# Verify ownership via service catalog
cat catalog/services/<service-name>.yaml
# → spec.oncall.pagerduty_service, spec.oncall.slack_channel

Phase 2 — Triage (5–15 min)

# Layer 1: Is the service returning errors?
curl -I https://my-service.example.com/health

# Layer 2: Kubernetes health
kubectl get pods -n <namespace> -l app=<service>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Layer 3: Recent changes (most common root cause)
kubectl rollout history deployment/<name> -n <namespace>
git log --oneline --since="1 hour ago" main

# Layer 4: Logs
# With a label selector, kubectl returns only the last 10 lines per pod unless --tail is set
kubectl logs -n <namespace> -l app=<service> --previous --tail=100

Phase 3 — Mitigate Options

Scenario	Mitigation	Command
Recent bad deploy	Roll back	`kubectl rollout undo deployment/<name> -n <ns>`
Capacity overload	Scale up	`kubectl scale deployment/<name> --replicas=N -n <ns>`
OOMKilled pods	Increase memory limit	`kubectl set resources deployment/<name> --limits=memory=512Mi -n <ns>`
Feature-specific failure	Disable via ConfigMap	`kubectl edit configmap <name> -n <ns>`
Data corruption	Velero restore	`velero restore create --from-backup <name> --include-namespaces <ns>`

Phase 5 — Post-Incident Review

Mandatory for SEV-1 and SEV-2. Complete within 48 hours. Include: timeline, root cause, contributing factors, action items with owners and due dates. Write or update the runbook: cp docs/runbooks/template.md docs/runbooks/<service>-<symptom>.md

Golden Path

Platform Onboarding

New team setup — get every foundation in place before any application path. Completing this path means the application paths work on first attempt.

Workstation Setup

Each engineer runs bash scripts/env-checker.sh and make hooks independently on their machine.

Local Kind Cluster

Run bash local-dev/kind/setup.sh — idempotent, creates a 3-node cluster with ingress and local registry in one command.

GitHub Repo Setup

Branch protection on main, conventional commits CI workflow, OIDC federation (no static credentials in GitHub Secrets), release strategy.

Bootstrap Terraform State (once)

Run terraform/_bootstrap/{cloud}/ before any other Terraform. Creates the remote state backend. Without this, state is local and will be lost.

Kubernetes Namespace + RBAC

Request a namespace from the platform team. You receive: a namespace with labels, developer read-only RBAC, a CI deployer role, a default-deny NetworkPolicy, and a DNS egress allow rule.

Secrets (External Secrets Operator)

Apply SecretStore for your cloud, create first ExternalSecret, verify sync with kubectl get externalsecret.

Kyverno Policies

Install Kyverno cluster-wide. Run kubectl get policyreport -n <ns> to check violations before your first deploy attempt.

Observability (Prometheus + Loki + Tempo)

Install in order: Prometheus → Loki → Tempo. Apply alert rules and SLO templates for your service.

Alert Routing

Configure Alertmanager receivers: Slack for dev/staging, PagerDuty for production. Fire a test alert to confirm routing works before going live.

Write First Runbook

cp docs/runbooks/template.md docs/runbooks/<service>-crash-loop.md. Link it from alert annotations with runbook_url.

Production Readiness Checklist

bash scripts/env-checker.sh passes on all engineers' machines
pre-commit install run on all engineers' clones
Branch protection enabled with required CI checks
OIDC federation configured — no long-lived credentials in GitHub Secrets
Remote Terraform state backend configured
Namespace with correct RBAC and NetworkPolicy applied
External Secrets store configured and syncing
Kyverno policies installed — no failures in kubectl get policyreport
At least one alert rule deployed and routing verified
PagerDuty on-call rotation configured
Runbook written for top 2 failure modes with runbook_url in alert
Velero backup schedule applied to production namespace

Golden Path

Mobile Backend (BFF)

Backend for Frontend pattern for iOS/Android apps — covering the concerns that make mobile different from a standard API: versioning, PKCE auth, push notifications, and aggressive client retry protection.

Key Mobile-Specific Requirements

API Versioning (Required)

Mobile clients cannot be force-updated. Old app versions live for 12+ months. URL path versioning (/v1/, /v2/) is the default. Deprecated versions must send a Deprecation: true header together with a Sunset header that carries the retirement date.

OAuth2/PKCE (Required)

Authorization Code + PKCE (RFC 7636) is mandatory because mobile apps cannot safely store a client secret. The BFF validates the JWT on every request against signing keys fetched from the identity provider's JWKS endpoint. Never hardcode public keys.

Push Notifications

APNs keys and FCM server keys must be stored via External Secrets Operator — never in code or manifests. Upsert device tokens on every app launch (they rotate).

Rate Limiting (Required)

All public endpoints must have Ingress-layer IP-based limiting (nginx.ingress.kubernetes.io/limit-rps). Authenticated endpoints additionally need per-user Redis sliding-window counters. Return 429 with Retry-After.
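The per-user sliding-window counter can be sketched in memory to show the logic. This is a stand-in for the Redis-backed version, not the playbook's implementation: the class name is hypothetical, state lives in a local dict, and the caller passes the current time so the behaviour is deterministic.

```python
import math
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user in any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # user_id -> timestamps, oldest first

    def check(self, user_id: str, now: float):
        """Return (allowed, retry_after_seconds)."""
        hits = self._hits[user_id]

        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()

        if len(hits) < self.limit:
            hits.append(now)
            return True, 0

        # The next slot frees up when the oldest hit leaves the window.
        retry_after = math.ceil(hits[0] + self.window - now)
        return False, max(retry_after, 1)
```

When check returns False, the second value is what a handler would place in the Retry-After header of the 429 response. A shared store is needed in production so that all replicas see the same counters.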

Mobile-Specific Prometheus Metrics

Metric	Alert when
`http_requests_total{version="v1"}`	Drops to 0 (all clients migrated — safe to sunset)
`auth_token_validation_errors_total`	Spikes (potential credential stuffing)
`push_notification_delivery_failures_total`	Failure rate > 5%
`rate_limit_rejections_total`	Sustained spike (client misbehaviour or DDoS)

Golden Path

Multi-Tenant SaaS

Namespace-per-tenant isolation model. Scales to ~100 tenants per cluster before control plane overhead becomes significant. Beyond that, provision additional clusters.

Isolation Layers

Layer	Mechanism	File
Network	Default-deny NetworkPolicy per namespace + explicit allow rules	`cd/kubernetes/_base/network-policies/default-deny.yaml`
Identity	RBAC scoped to namespace — no cross-tenant secret access	`cd/kubernetes/_base/rbac/`
Secrets	ESO namespace-scoped SecretStore per tenant	`secrets/external-secrets/`
Data	Schema-per-tenant on shared DB (scales to ~1000 tenants; add PgBouncer beyond)	`cd/kubernetes/_patterns/db-migration-job.yaml`
Billing	All Prometheus metrics carry a `tenant` label — enforced by Kyverno	`policy/kyverno/require-labels.yaml`

Tenant Onboarding Flow

Add a YAML record to tenants/<slug>.yaml with 6 required fields. CI detects the new file and creates exactly 6 resources per tenant: namespace, RBAC, NetworkPolicy, ESO store config, database schema, and ArgoCD Application. The job is idempotent — re-running is safe.

Tenant Offboarding (Order Matters)

Never automate schema drops. The required sequence:

1. Set status: offboarding in tenant YAML → merge PR
2. Export billing metrics → take Velero namespace snapshot
3. Delete ArgoCD Application
4. Delete namespace
5. Manual DBA step: Drop tenant schema (requires explicit approval)
6. Remove tenant secrets from secrets store

🚫

Schema drop is irreversible. The DBA must verify Velero backup is complete and tenant has confirmed data export before executing DROP SCHEMA. This step must NEVER be automated.

Golden Path

Service Catalog Registration

Every production service must be registered in the Git-native catalog with owner, on-call routing, runbook link, SLO file, and cost center. CI validates metadata integrity. Kyverno can block deployments from unregistered services in production namespaces.

Required Service Fields

# catalog/services/<service-name>.yaml
metadata:
  name: api-gateway          # must match Deployment app label + filename
  owner: platform-team       # must exist in catalog/teams/
  tier: tier-1               # tier-1 | tier-2 | tier-3
spec:
  oncall:
    pagerduty_service: "..."
    slack_channel: "#alerts-api"
  slo:
    definition_file: "observability/prometheus/slos/api-gateway-slo.yaml"
  cost_center: engineering   # must exist in finops/config/budgets.yaml
  runbook: "docs/runbooks/api-gateway-crashloop.md"
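The kind of check CI can run on such a record is sketched below. It is not the repo's validate-catalog.py; it only verifies that the fields shown above are present, that the filename matches metadata.name, and that the tier is one of the three listed values. The record is assumed to be already parsed into a dict, and the function names are hypothetical.

```python
from typing import Any, Optional

REQUIRED_PATHS = [
    ("metadata", "name"),
    ("metadata", "owner"),
    ("metadata", "tier"),
    ("spec", "oncall", "pagerduty_service"),
    ("spec", "oncall", "slack_channel"),
    ("spec", "slo", "definition_file"),
    ("spec", "cost_center"),
    ("spec", "runbook"),
]
VALID_TIERS = {"tier-1", "tier-2", "tier-3"}


def lookup(record: dict, path: tuple) -> Optional[Any]:
    """Walk nested dicts; return None if any key along the path is absent."""
    node: Any = record
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def catalog_errors(record: dict, filename: str) -> list:
    """Return a list of problems; an empty list means the record passes."""
    errors = [
        "missing " + ".".join(path)
        for path in REQUIRED_PATHS
        if not lookup(record, path)
    ]
    name = lookup(record, ("metadata", "name"))
    if name and filename != name + ".yaml":
        errors.append("filename " + filename + " does not match metadata.name " + name)
    tier = lookup(record, ("metadata", "tier"))
    if tier and tier not in VALID_TIERS:
        errors.append("unknown tier " + tier)
    return errors
```

Cross-file checks, such as confirming that the owner exists in catalog/teams/ or that the cost center exists in finops/config/budgets.yaml, would need those files loaded as well and are left out of this sketch.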

Validation & Automation

# Validate locally (strict mode, skip URL checks)
python catalog/scripts/validate-catalog.py --strict --skip-url-check

# Regenerate .github/CODEOWNERS from team ownership metadata
python catalog/scripts/generate-codeowners.py --output .github/CODEOWNERS
make catalog-codeowners

# Export to Backstage format (when service count approaches 50)
python catalog/scripts/migrate-to-backstage.py --output-dir backstage/catalog

💡

Incident routing: The incident response path uses catalog/services/<name>.yaml as the source of truth for responder routing. Keep spec.oncall.pagerduty_service and spec.oncall.slack_channel current.

CI / CD

CI Core Concepts

Every CI pipeline in this repo follows the same conceptual model regardless of platform — build once, test in isolation, scan in parallel, deploy immutable artifacts. Platform-specific syntax differs; the stages and contracts do not.

Universal Pipeline Stages

Source → Build → Test → Quality Gate → Security Scan → Package → Deploy

Reusable vs Standalone Workflows

Use case	Pattern	File
Docker build shared across services	Reusable workflow	`ci/github-actions/_shared/reusable-docker-build.yml`
Slack notify shared across pipelines	Reusable workflow	`ci/github-actions/_shared/reusable-notify-slack.yml`
Trivy scan shared across pipelines	Reusable workflow	`ci/github-actions/_shared/reusable-security-scan.yml`
Supply chain (sign + SBOM + SLSA)	Reusable workflow	`ci/github-actions/_shared/reusable-supply-chain.yml`
Single-service owns its own build	Standalone workflow	Service-specific `.github/workflows/build.yml`
Monorepo — run only affected services	Strategy	`ci/github-actions/_strategies/monorepo-affected.yml`
Multi-version matrix (Node 20+22)	Strategy	`ci/github-actions/_strategies/matrix-build.yml`
Automated SemVer releases	Strategy	`ci/github-actions/_strategies/release-please.yml`

OIDC vs Static Credentials

🚫

Never use static long-lived credentials in CI. AWS_ACCESS_KEY_ID, AZURE_CLIENT_SECRET, and GCP service account JSON keys cannot be scoped to a specific repository or branch, and persist indefinitely if leaked. OIDC tokens are short-lived (≤15 minutes) and automatically scoped.

Aspect	Long-lived secrets	OIDC tokens
Lifetime	Until manually rotated	Minutes (auto-expires)
Rotation	Manual, easy to forget	Automatic, every workflow run
Blast radius	Leaked secret = persistent access	Leaked token expires in minutes
Audit trail	Hard to attribute	Each token tied to repo/branch/workflow
Scope control	Global until revoked	Restricted by trust policy to specific repo + branch
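The "restricted by trust policy" row comes down to matching claims in the token. As a rough sketch, GitHub Actions tokens carry an issuer of https://token.actions.githubusercontent.com and, for branch-triggered workflows, a subject of the form repo:<owner>/<repo>:ref:refs/heads/<branch>. The helper below mimics the comparison a cloud trust policy performs. It does not verify the token signature, which the cloud provider handles, and the organisation and repository names in the usage note are placeholders.

```python
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def expected_subject(owner: str, repo: str, branch: str) -> str:
    """Subject claim GitHub issues for a workflow run on a branch."""
    return "repo:" + owner + "/" + repo + ":ref:refs/heads/" + branch


def trust_allows(claims: dict, owner: str, repo: str, branch: str) -> bool:
    """True only if the issuer and the exact repo and branch both match.

    `claims` is the already-decoded token payload; signature and expiry
    checks are deliberately out of scope for this illustration.
    """
    return (
        claims.get("iss") == GITHUB_ISSUER
        and claims.get("sub") == expected_subject(owner, repo, branch)
    )
```

A token minted for a feature branch of example-org/payments would be rejected by a policy pinned to main, which is the scoping that a static access key cannot provide.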

CI / CD

CI Templates

Copy the workflow file for your stack into .github/workflows/ and add security scanning as a parallel job. All pipelines emit the same stage contract regardless of language.

GitHub Actions

Stack	Build + Test	Docker Publish / Companion Workflow
.NET / ASP.NET Core	`ci/github-actions/dotnet/build-test.yml`	`ci/github-actions/dotnet/docker-publish.yml`
Python (pytest + ruff)	`ci/github-actions/python/build-test.yml`	via reusable-docker-build
Go	`ci/github-actions/go/build-test.yml`	`ci/github-actions/go/docker-publish.yml`
Java (Maven/Gradle)	`ci/github-actions/java/build-test.yml`	via reusable-docker-build
Node.js / React	`ci/github-actions/react/build-test.yml`	via reusable-docker-build
Angular	`ci/github-actions/angular/build-test.yml`	`ci/github-actions/angular/lighthouse-audit.yml`
Ruby on Rails	`ci/github-actions/ruby/build-test.yml`	via reusable-docker-build
Terraform	`ci/github-actions/terraform/plan-apply.yml`	`ci/github-actions/terraform/cost-estimation.yml`

# Minimal wiring example: build → security scan
jobs:
  build-test:
    uses: ./.github/workflows/reusable-docker-build.yml
    with:
      image-name: my-service
      dockerfile: docker/python/Dockerfile.fastapi
    permissions:
      id-token: write   # Required for OIDC
      contents: read

  security-scan:
    needs: build-test
    uses: ./.github/workflows/reusable-security-scan.yml
    with:
      image: my-service:${{ github.sha }}

Azure Pipelines

Stack	Pipeline file
.NET	`ci/azure-pipelines/dotnet/azure-pipelines.yml`
Angular	`ci/azure-pipelines/angular/azure-pipelines.yml`
Python	`ci/azure-pipelines/python/azure-pipelines.yml`
Terraform	`ci/azure-pipelines/terraform/azure-pipelines.yml`

Azure Pipelines templates use Variable Groups + Key Vault link or Managed Identity (OIDC equivalent) for cloud credentials — never hardcoded pipeline variables.

GitLab CI

Stack	Pipeline file
.NET	`ci/gitlab-ci/dotnet/.gitlab-ci.yml`
Python	`ci/gitlab-ci/python/.gitlab-ci.yml`
Terraform	`ci/gitlab-ci/terraform/.gitlab-ci.yml`

GitLab CI uses include: to pull shared templates from ci/gitlab-ci/_includes/. Docker build, SAST scan, and Slack notify are all shared components.

Jenkins

Stack	Jenkinsfile
.NET	`ci/jenkins/dotnet/Jenkinsfile`
Python	`ci/jenkins/python/Jenkinsfile`

⚠

Jenkins is the last resort. Prefer cloud-native CI when possible. Jenkins requires self-managed infrastructure and plugin version management, and it carries significantly more operational burden than GitHub Actions or GitLab CI.

CI / CD

Deployment Targets

Choose your deployment target based on whether you need Kubernetes features. If unsure, start with the decision tree:

Need Kubernetes features? (service mesh, custom autoscaling, StatefulSets, DaemonSets)
├── Yes → Which cloud?
│   ├── Azure → terraform/azure-aks/ + cd/targets/azure-aks/
│   ├── AWS → terraform/aws-eks/ + cd/targets/aws-eks/
│   └── GCP → terraform/gcp-gke/ + cd/targets/gcp-gke/
└── No → Is this a stateless web app or API?
    ├── Yes, on Azure → terraform/azure-app-service/ + cd/targets/azure-app-service/
    ├── Yes, on AWS, no K8s → terraform/aws-ecs/ + cd/targets/aws-ecs/
    └── Event-driven → terraform/aws-lambda/ + cd/targets/aws-lambda/

Target	Terraform	Deploy workflow	Use when
AWS EKS	`terraform/aws-eks/`	`cd/targets/aws-eks/github-actions-deploy.yml`	Full K8s on AWS
Azure AKS	`terraform/azure-aks/`	`cd/targets/azure-aks/github-actions-deploy.yml`	Full K8s on Azure
GCP GKE	`terraform/gcp-gke/`	`cd/targets/gcp-gke/github-actions-deploy.yml`	Full K8s on GCP
Azure App Service	`terraform/azure-app-service/`	`cd/targets/azure-app-service/github-actions-deploy.yml`	Stateless apps, no K8s
AWS ECS	`terraform/aws-ecs/`	`cd/targets/aws-ecs/github-actions-deploy.yml`	Containers without K8s on AWS
AWS Lambda	`terraform/aws-lambda/`	`cd/targets/aws-lambda/serverless-deploy.yml`	Event-driven, short-lived
GCP Cloud Run	via gcp-gke module	`cd/targets/gcp-gke/cloudbuild.yaml`	Serverless containers on GCP
OpenShift	—	`cd/targets/openshift/github-actions-deploy.yml`	Enterprise OpenShift environments
AWS CodePipeline	—	`cd/targets/aws-codepipeline/codepipeline.yml`	AWS-native pipeline requirement

CI / CD

GitOps & ArgoCD

GitOps treats Git as the single source of truth for cluster state. A controller running inside the cluster continuously reconciles actual state with desired state from Git — no push-based deployments from CI, no manual kubectl apply in production.

GitOps Principles

Declarative

All desired state defined in YAML committed to Git. What's in Git is what's in the cluster.

Versioned

Git history is the complete audit log. Every cluster change is traceable to a PR, author, and timestamp.

Pull-Based

The cluster pulls from Git — CI never pushes directly to the cluster. This eliminates a large attack surface.

Auto-Reconciled

If someone manually edits a resource in the cluster (drift), the controller automatically reverts it to the Git state.

ArgoCD Application Patterns

Pattern	File	Use when
Single app	`cd/gitops/argocd/application.yaml`	One service, one repo
Multi-environment generator	`cd/gitops/argocd/applicationset.yaml`	Same app across dev/staging/prod
App-of-apps bootstrap	`cd/gitops/argocd/app-of-apps.yaml`	Bootstrap many apps from one definition
Fleet (3–20 clusters)	`cd/gitops/argocd/fleet/fleet-applicationset.yaml`	Multi-cluster platform component delivery

Promotion Flow

# 1. CI builds image, tags with Git SHA
# 2. CI updates dev overlay: image tag → new SHA
#    cd/kubernetes/_overlays/dev/kustomization.yaml
# 3. ArgoCD syncs dev automatically
# 4. After testing: PR to update staging overlay
# 5. After staging approval: PR to update prod overlay
# 6. ArgoCD syncs prod (with manual sync gate)

# Verify sync status
kubectl get applications -n argocd
kubectl describe application <app-name> -n argocd | grep -A 20 "History:"

💡

ArgoCD vs Flux: Both are included. ArgoCD is the default — richer UI, intuitive Application CRD, App-of-Apps pattern, good for most teams. Flux is preferred for Helm-heavy stacks and large multi-tenant organizations with complex tenancy models. Pick one per cluster; don't mix.

Kubernetes

Kubernetes Concepts & Patterns

Core workload patterns and the decision logic for choosing between them. All base manifests are in cd/kubernetes/_base/; reusable patterns for specific scenarios are in cd/kubernetes/_patterns/.

Workload Type Selection

App type	Workload	Notes
Stateless web API / frontend	Deployment	Default choice. Use HPA for autoscaling. PDB for HA.
Stateful app (database, queue, cache)	StatefulSet	Stable network identity, ordered rollout, persistent storage.
Node-level daemon (log collector, metrics)	DaemonSet	Runs one pod on every eligible node; new nodes get a pod automatically.
One-off task	Job	Set `restartPolicy: Never` and `activeDeadlineSeconds`.
Recurring scheduled task	CronJob	Wraps a Job. Set `concurrencyPolicy: Forbid` for batch jobs.

Autoscaling Decision

Scale on	Tool	File
CPU or memory utilization	HPA	`cd/kubernetes/_base/hpa.yaml` — target CPU at 70%, memory at 80%
Queue depth, custom metrics, scale-to-zero	KEDA	Add separately — not in this repo by default
Right-size resource requests automatically	VPA (Off mode)	`cd/kubernetes/_base/vpa.yaml` — recommendations only, no auto-apply

⚠

HPA thresholds: Set CPU target at 70% (not 80%) to provide headroom before scaling lag causes user-visible latency. Never set minReplicas: 1 in production — a single replica means no high availability.
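To see why the lower target buys headroom, the replica calculation documented for the HPA, ceil(currentReplicas × currentMetric / targetMetric) with a default 10% tolerance, can be sketched as below. The function name, the replica bounds, and the sample utilization are illustrative assumptions; the real controller also handles missing metrics and stabilization windows, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas: int, current_util: float, target_util: float,
                     min_replicas: int = 2, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation: scale by the usage ratio, clamp to [min, max]."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance: the controller leaves the replica count alone.
        wanted = current_replicas
    else:
        wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 4 replicas averaging 78% CPU: a 70% target scales out, an 80% target does not.
print(desired_replicas(4, 78, 70))  # 5
print(desired_replicas(4, 78, 80))  # 4
```

With the 80% target the same load sits inside the tolerance band, so new pods only arrive after utilization has climbed further.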

Helm vs Kustomize

Situation	Use
Packaging an app for distribution / multiple teams to install	Helm (`cd/helm/`)
Environment-specific configuration of manifests you own	Kustomize (`cd/kubernetes/_overlays/`)
Internal apps with simple env-to-env differences	Kustomize — less indirection, easier to debug
Complex conditional logic across many parameters	Helm — but if it feels like programming, reconsider

Deployment Strategies

Strategy	Pattern file	When to use
Rolling update (default)	`cd/kubernetes/_base/deployment.yaml`	Standard — Kubernetes handles pod replacement automatically
Blue/Green	`cd/kubernetes/_patterns/blue-green.yaml`	Zero-downtime cutover, instant rollback by switching Service selector
Canary	`cd/kubernetes/_patterns/canary.yaml`	Progressive traffic shift — stable + canary replica ratio

Kubernetes

Base Manifests & Overlays

Start from the base manifests and layer environment differences with Kustomize overlays. Never duplicate full manifests per environment — only patch what differs.

Base Manifest File Map

File	What it configures
`_base/deployment.yaml`	Deployment with health checks, security context, resource limits, emptyDir for writable paths
`_base/service.yaml`	ClusterIP service with correct selector labels
`_base/ingress.yaml`	nginx Ingress with TLS (cert-manager), security headers
`_base/hpa.yaml`	HPA targeting 70% CPU, min 2 replicas
`_base/pdb.yaml`	PodDisruptionBudget ensuring at least 1 pod available during disruptions
`_base/vpa.yaml`	VPA in Off mode — generates recommendations without applying them
`_base/networkpolicy.yaml`	Allow only required ingress/egress; deny all by default
`_base/rbac.yaml`	ServiceAccount + minimal RBAC roles
`_base/configmap.yaml`	Non-secret configuration (use ExternalSecret for secrets)
`_base/network-policies/default-deny.yaml`	Deny-all baseline — apply first, then add allow rules
`_patterns/canary.yaml`	Canary rollout with stable/canary replica split
`_patterns/blue-green.yaml`	Blue/green Deployments with Service selector swap
`_patterns/init-containers.yaml`	Wait-for-dependency init container pattern
`_patterns/dev-scale-to-zero.yaml`	Scale dev/staging to zero overnight to save cost

Kustomize Overlay Structure

cd/kubernetes/
  _base/                    # Shared defaults (all environments)
  _overlays/
    dev/kustomization.yaml  # replicas: 1, relaxed resources, mutable tags
    staging/kustomization.yaml
    prod/kustomization.yaml # replicas: 3, tight PDB, pinned image SHA

# Preview what Kustomize will generate (validate before applying)
kubectl kustomize cd/kubernetes/_overlays/dev

# Run policy checks locally
conftest test cd/kubernetes/_base/ --policy policy/conftest/kubernetes

Kubernetes

Kyverno Policy Engine

Kyverno enforces governance at Kubernetes admission time — before a resource is admitted to the cluster. It complements static analysis (Checkov, Conftest) which runs earlier in CI. These operate at different lifecycle stages and are not alternatives.

Policy Modes

Mode	Effect	Use when
Enforce	Blocks resource from being created/updated	Policy is well-understood; existing resources are compliant
Audit	Admits resource but records violation in PolicyReport	Migrating existing workloads; discovering current violations
Warn	Admits resource with a warning in the API response	Developer feedback without hard blocking

⚠

Migration path: Audit → fix violations → Enforce. Never go straight to Enforce on a production cluster with existing workloads.

Installed Policies

Policy file	Mode	What it enforces
`require-non-root.yaml`	Enforce	`runAsNonRoot: true`, `allowPrivilegeEscalation: false`
`require-resource-limits.yaml`	Enforce	CPU and memory requests + limits on all containers
`disallow-latest-tag.yaml`	Enforce (prod)	Image tags must be pinned — `:latest` is blocked
`require-labels.yaml`	Audit	`app` and `version` labels on Deployments
`require-liveness-readiness.yaml`	Audit	Liveness + readiness probes on multi-replica Deployments
`require-readonly-filesystem.yaml`	Warn	`readOnlyRootFilesystem: true`
`require-catalog-registration.yaml`	Audit / Enforce (prod)	Service must be registered in `catalog/services/`
`enforce-finops-labels.yaml`	Enforce	`finops.org/costcenter` and `finops.org/environment` labels
`fleet-policy-propagation.yaml`	Audit	Cross-cluster policy compliance tracking for fleet deployments

# Check violations in your namespace
kubectl get policyreport -n <your-namespace>
kubectl describe policyreport -n <your-namespace> | grep -A 5 "fail"

# Check cluster-wide (requires cluster-admin)
kubectl get clusterpolicyreport
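When a namespace has many reports, grouping failures by policy is often easier to read than grepping. The sketch below assumes the JSON shape produced by `kubectl get policyreport -n <your-namespace> -o json` (a list of items, each with a `results` array carrying `policy` and `result`). The sample data and rule names are hand-written for illustration and do not come from this repo.

```python
import json
from collections import Counter


def failures_by_policy(report_list: dict) -> Counter:
    """Count results whose outcome is "fail", grouped by policy name."""
    counts: Counter = Counter()
    for report in report_list.get("items", []):
        for result in report.get("results", []):
            if result.get("result") == "fail":
                counts[result.get("policy", "<unknown>")] += 1
    return counts


# Trimmed, hypothetical sample in the shape of the kubectl JSON output.
sample = json.loads("""
{"items": [{"results": [
  {"policy": "require-labels", "rule": "check-labels", "result": "fail"},
  {"policy": "require-labels", "rule": "check-labels", "result": "pass"},
  {"policy": "require-liveness-readiness", "rule": "check-probes", "result": "fail"}
]}]}
""")
for policy, count in failures_by_policy(sample).most_common():
    print(f"{policy}: {count}")
```

In practice the input would come from `json.load(sys.stdin)` with the kubectl output piped in.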

Infrastructure as Code

Terraform

All cloud infrastructure is provisioned declaratively via Terraform modules. Each cloud target is a self-contained module with its own state key. Bootstrap the remote state backend once before using any other module.

Bootstrap Remote State (Required First)

🚫

Never use local Terraform state in production: it is lost if you lose your machine, and it gives the team no shared locking. Never commit terraform.tfstate to Git, because state files can contain secrets in plaintext.

# Run ONCE per cloud account, by a human with admin access (not CI)

# AWS — creates S3 bucket + DynamoDB lock table
cd terraform/_bootstrap/aws
terraform init && terraform plan -out=tfplan && terraform apply tfplan

# Azure — creates Storage Account + Blob container
cd terraform/_bootstrap/azure
terraform init && terraform apply

# GCP — creates GCS bucket
cd terraform/_bootstrap/gcp
terraform init && terraform apply

# Then uncomment the backend block in your workload module's main.tf
# and run: terraform init -migrate-state

Cloud Modules

Module	Path	Provisions
AWS EKS	`terraform/aws-eks/`	EKS cluster, managed node groups, VPC, ALB, IRSA roles, optional GPU node group
Azure AKS	`terraform/azure-aks/`	AKS cluster, node pools, ACR, Key Vault, workload identity, optional GPU pool
GCP GKE	`terraform/gcp-gke/`	GKE cluster, Artifact Registry, Workload Identity Federation, PITR Cloud SQL
AWS ECS	`terraform/aws-ecs/`	ECS cluster, task definition, IAM execution role, CloudWatch log group
AWS Lambda	`terraform/aws-lambda/`	Lambda function, least-privilege IAM role, API Gateway, CloudWatch alarms
Azure App Service	`terraform/azure-app-service/`	App Service Plan, Web App (container), Application Insights

Database Backup Modules (Included with Cluster)

Cloud	File	Backup approach
AWS RDS	`terraform/aws-eks/backup.tf`	PITR enabled, cross-region replica, CloudWatch alarm on backup age
Azure PostgreSQL	`terraform/azure-aks/backup.tf`	Flexible Server with geo-redundant backup, PITR to 35 days
GCP Cloud SQL	`terraform/gcp-gke/backup.tf`	PITR enabled, cross-region read replica, point-in-time clone

State Strategy

This repo uses separate state keys per module (e.g., prod/eks/terraform.tfstate) rather than Terraform workspaces. Workspaces share provider configuration — this makes it harder to use different cloud accounts per environment, which is the recommended security posture.

Infrastructure

Environment Strategy

Three environments with distinct purposes, triggers, and data policies. Never bake environment config into images — externalize via Kubernetes ConfigMaps, Helm values files, or Kustomize overlays.

Environment	Purpose	Deploy trigger	Data
dev	Rapid feedback, feature work	Every push / PR	Synthetic / mocked — never real PII
staging	Pre-prod verification, QA sign-off	Merge to main	Anonymised production snapshot
production	Live system	Manual / scheduled after staging approval	Real production data

Production Access Controls

No direct kubectl exec to production pods
All changes via Git PR (GitOps) — ArgoCD applies, never CI push
Break-glass procedures documented and audited in secops/runbooks/
RBAC: least-privilege service accounts — use cd/kubernetes/_base/rbac/ci-deployer.yaml
Separate Kubernetes namespaces per environment

Infrastructure

OIDC / Keyless Cloud Authentication

OpenID Connect federation lets GitHub Actions authenticate directly to cloud providers with short-lived, auto-rotating tokens — no secrets stored in GitHub, no rotation to manage, full audit trail per workflow run.

┌──────────────┐   JWT    ┌──────────────┐  Token   ┌──────────────┐
│  GitHub      │ ───────► │  Cloud       │ ───────► │  Cloud       │
│  Actions     │          │  Identity    │          │  Resources   │
│  Runner      │          │  Provider    │          │  (deploy)    │
└──────────────┘          └──────────────┘          └──────────────┘
     1. GH mints JWT              2. Cloud validates JWT
        with claims                  and issues short-lived
        (repo, branch, env)          scoped credentials

Setup by Cloud

AWS

Mechanism: IAM OIDC Provider + IAM Role with trust policy

# Step 1: Create OIDC provider (once per AWS account)
aws iam create-open-id-connect-provider \
  --url "https://token.actions.githubusercontent.com" \
  --client-id-list "sts.amazonaws.com"

# Step 2: Create IAM role with trust policy (restrict to your repo)
# Condition: "token.actions.githubusercontent.com:sub": "repo:YOUR-ORG/YOUR-REPO:*"

# Step 3: Add GitHub secret
# AWS_DEPLOY_ROLE_ARN = arn:aws:iam::123456789012:role/github-actions-deploy

# Step 4: Use in workflow
permissions:
  id-token: write    # Required for OIDC
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
      aws-region: us-east-1

Azure

Mechanism: Azure AD App Registration + Federated Identity Credential

# Step 1: Create app registration + service principal
az ad app create --display-name "github-actions-deploy"
az ad sp create --id $APP_ID

# Step 2: Add federated credential (restrict to main branch)
# subject: "repo:YOUR-ORG/YOUR-REPO:ref:refs/heads/main"

# Step 3: Add GitHub secrets
# AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID

# Step 4: Use in workflow
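permissions:
  id-token: write    # Required for OIDC
  contents: read     # Unlisted scopes default to none once permissions is set; checkout needs this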
steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

GCP

Mechanism: Workload Identity Pool + OIDC Provider + Service Account binding

# Step 1: Create Workload Identity Pool
gcloud iam workload-identity-pools create "github-actions-pool" \
  --location="global"

# Step 2: Create OIDC provider with attribute condition
# --attribute-condition="assertion.repository_owner == 'YOUR-ORG'"
# Critical: without this, ANY GitHub repo could request tokens

# Step 3: Grant Service Account access to pool
# --member="principalSet://iam.googleapis.com/${POOL_ID}/attribute.repository/YOUR-ORG/YOUR-REPO"

# Step 4: Use in workflow
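permissions:
  id-token: write    # Required for OIDC
  contents: read     # Unlisted scopes default to none once permissions is set; checkout needs this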
steps:
  - uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
      service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}

Common Troubleshooting

Cloud	Error	Fix
Azure	`AADSTS70021: No matching federated identity record`	Subject claim mismatch — check branch/environment in federated credential matches workflow trigger
AWS	`Not authorized to perform: sts:AssumeRoleWithWebIdentity`	Trust policy `sub` condition doesn't match workflow repo/branch/environment
GCP	`The caller does not have permission to use this identity pool`	`attribute-condition` rejects the org — check `assertion.repository_owner` matches exactly
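Most of these errors come down to the `sub` claim not matching what the cloud side expects. The sketch below builds the subject string GitHub issues for branch and environment jobs and tests it against a trust pattern. The helper names are hypothetical, and `fnmatchcase` only approximates cloud wildcard matching (it also treats `[...]` as special, which IAM does not).

```python
from fnmatch import fnmatchcase
from typing import Optional


def github_sub(org: str, repo: str, *, branch: Optional[str] = None,
               environment: Optional[str] = None) -> str:
    """Build the `sub` claim for a branch-triggered or environment-scoped job."""
    if environment is not None:
        # Jobs that reference an environment use this form instead of the ref form.
        return f"repo:{org}/{repo}:environment:{environment}"
    if branch is not None:
        return f"repo:{org}/{repo}:ref:refs/heads/{branch}"
    raise ValueError("pass branch= or environment=")


def trust_allows(pattern: str, sub: str) -> bool:
    """Approximate a wildcard subject match; exact patterns work too."""
    return fnmatchcase(sub, pattern)


sub = github_sub("YOUR-ORG", "YOUR-REPO", environment="production")
# A credential pinned to main rejects a job that runs in an environment.
print(trust_allows("repo:YOUR-ORG/YOUR-REPO:ref:refs/heads/main", sub))  # False
print(trust_allows("repo:YOUR-ORG/YOUR-REPO:*", sub))                    # True
```

The first result illustrates the typical Azure mismatch: adding `environment:` to a job changes the subject, so a federated credential scoped to the branch stops matching.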

Security

Shift-Left Security Scanning

Security checks run at three stages: locally via pre-commit hooks, in CI on every push, and at cluster admission via Kyverno. Each layer catches different classes of issues — they are complements, not alternatives.

Stage 1: Local (pre-commit)
  Gitleaks                             → catches secrets in staged files before commit
  terraform fmt                        → catches IaC formatting before push

Stage 2: CI (parallel jobs after build)
  Gitleaks                             → history scan across all commits
  TruffleHog                           → verified live secrets (PR only, expensive)
  Trivy                                → container CVE scan (HIGH/CRITICAL blocks merge)
  Grype                                → alternative container scanner
  Checkov                              → Terraform/K8s misconfiguration
  tfsec                                → cloud-specific Terraform checks
  Semgrep                              → SAST (fast, no account needed)
  SonarQube                            → SAST + quality gates (deeper analysis)
  npm audit / pip-audit / dotnet list  → dependency CVEs

Stage 3: Cluster admission (Kyverno)
  disallow-latest-tag                  → blocks untagged images
  require-non-root                     → blocks root containers
  require-resource-limits              → blocks unconstrained pods
  require-catalog-registration         → blocks unregistered services

Scanner Selection Guide

Purpose	Tool	When	File
Secrets in Git history	Gitleaks	Every push + pre-commit	`ci-security/secret-detection/gitleaks.yml`
Verified live secrets in PRs	TruffleHog	PR only (slow)	`ci-security/secret-detection/trufflehog.yml`
Container image CVEs	Trivy	After every image build	`ci-security/container-scanning/trivy-scan.yml`
Container image CVEs (alternative)	Grype	After every image build	`ci-security/container-scanning/grype-scan.yml`
Terraform misconfigurations	Checkov	Push + PR on terraform/**	`ci-security/iac-scanning/checkov.yml`
Terraform cloud-specific checks	tfsec	Push + PR on terraform/**	`ci-security/iac-scanning/tfsec.yml`
SAST (fast, no account)	Semgrep	Push + PR	`ci-security/sast/semgrep.yml`
SAST (deep, quality gates)	SonarQube	Push + PR	`ci-security/sast/sonarqube.yml`
Dockerfile lint	Hadolint	Pre-commit	`.pre-commit-config.yaml`
npm vulnerabilities	npm audit	Weekly schedule	`ci-security/dependency-audit/npm-audit.yml`
Python vulnerabilities	pip-audit	Weekly schedule	`ci-security/dependency-audit/pip-audit.yml`
.NET NuGet vulnerabilities	dotnet list package --vulnerable	Weekly schedule	`ci-security/dependency-audit/nuget-audit.yml`

When a Scanner Fires

False Positive — Gitleaks

Add the finding to the .gitleaks.toml allowlist with a justification comment. Never make skipping the check your default workflow: CI still runs the shared detection policy.

Real Secret in Git History

Rotate the secret immediately, then use git-filter-repo to remove it from history. Assume the secret is compromised even if the repo is private.

Container CVE (Trivy)

Update base image or the vulnerable package. Check if a fixed version is available — use trivy image --ignore-unfixed to focus on actionable findings.

IaC Misconfiguration (Checkov)

Fix the Terraform resource. Do not suppress with #checkov:skip without understanding the risk and documenting the exception with a justification.

Security

Secrets Management

Secrets must never be stored in Git — even in private repos. Every secret needs a source of truth in a cloud-managed secret store, a sync mechanism into Kubernetes, a rotation policy, and an offboarding procedure.

Where Secrets Live

Secret type	Store here	Access via
Long-lived infrastructure credentials	AWS Secrets Manager / Azure Key Vault / GCP Secret Manager	External Secrets Operator
Short-lived cloud credentials in CI	OIDC (no secret storage needed)	GitHub Actions OIDC
TLS certificates	cert-manager + Let's Encrypt	Kubernetes Secret (auto-managed)
App configuration (non-secret)	Kubernetes ConfigMap	`envFrom.configMapRef`
App secrets	External Secrets Operator → Kubernetes Secret	`envFrom.secretRef`

External Secrets Operator

ESO watches ExternalSecret resources in the cluster and syncs values from your cloud secret store into Kubernetes Secrets automatically. The source of truth stays in the cloud store — Kubernetes Secrets are ephemeral replicas.

# 1. Apply SecretStore for your cloud (edit annotations first)
kubectl apply -f secrets/external-secrets/aws-secret-store.yaml
# or: azure-secret-store.yaml | gcp-secret-store.yaml

# 2. Create your ExternalSecret
cp secrets/external-secrets/example-external-secret.yaml \
   secrets/external-secrets/my-app-secrets.yaml
kubectl apply -f secrets/external-secrets/my-app-secrets.yaml

# 3. Verify sync (STATUS should be "SecretSynced")
kubectl get externalsecret my-app-secrets -n <namespace>

⚠

Rotation gap: Rotating the secret value in the cloud store does NOT automatically update running pods. You must either restart the deployment (kubectl rollout restart deployment/app) or mount secrets as volumes (volume mounts pick up new versions without restart; env vars do not).

Secret Lifecycle Guides

Need	Guide
Full lifecycle (provision → rotate → offboard)	`secrets/guides/secret-lifecycle.md`
Emergency rotation (secret exposed)	`secrets/guides/emergency-rotation.md`
Decommission a service's secrets	`secrets/guides/secret-offboarding.md`
Scheduled rotation (Terraform-managed)	`secrets/rotation/aws-rotation.yml` etc.

Anti-Patterns to Avoid

❌ .env files committed to Git
❌ Secrets in docker-compose.yml
❌ Long-lived service account keys / personal access tokens
❌ Same secret in dev, staging, and prod
❌ Secrets in container image layers
❌ Secrets in Kubernetes ConfigMaps (ConfigMaps are stored in plaintext, and the base64 in a Secret is encoding, not encryption)

Security

Runtime Security

Shift-left scanning catches known vulnerabilities before deployment. Runtime security detects threats and anomalous behavior in live environments — attacks that only appear after a workload is running.

Falco

Falco watches kernel syscalls and Kubernetes audit events for suspicious behavior. Custom rules are in secops/runtime/falco/rules/custom-rules.yaml. Alerts route to Slack or PagerDuty via secops/runtime/falco/rules/alerts.yaml.

Audit Logging

Kubernetes API server audit logs are shipped to Loki via Promtail (secops/runtime/audit-logging/loki-shipper.yaml). The audit policy (secops/runtime/audit-logging/audit-policy.yaml) records all resource modifications and secret access at the metadata level.

SecOps Runbooks

Incident type	Runbook
Compromised pod (unusual network or process behavior)	`secops/runbooks/compromised-pod.md`
Node compromise (node-level access detected)	`secops/runbooks/node-compromise.md`
Secret exposure (credentials in logs/repo)	`secops/runbooks/secret-exposure.md`
Supply chain incident (malicious image/dependency)	`secops/runbooks/supply-chain-incident.md`

Security

Compliance as Code

Machine-readable control libraries map specific Kyverno policies and cluster checks to compliance framework controls. Evidence is collected automatically and scored weekly — turning audits into continuous engineering practice.

Supported Frameworks

Framework	Control library	Key controls covered
SOC 2 Type II	`secops/compliance/control-library/soc2-controls.yaml`	CC6.1 (access control), CC6.8 (supply chain), CC7.2 (monitoring), CC7.4 (incident detection), CC8.1 (resource governance)
CIS Kubernetes Benchmark	`secops/compliance/control-library/cis-kubernetes.yaml`	5.2.x (pod security), 5.3.x (network segmentation), 5.5.1 (image verification)
ISO 27001	`secops/compliance/control-library/iso27001.yaml`	8.4 (source control), 8.8 (vulnerability management), 8.28 (SBOM/secure coding)

Evidence Pipeline

# Run evidence collection manually
bash secops/compliance/scripts/collect-evidence.sh --output-dir ./evidence

# Generate compliance report (fails if score < 85%)
python secops/compliance/scripts/generate-compliance-report.py \
  --control-library-dir secops/compliance/control-library \
  --evidence-dir ./evidence \
  --fail-below 0.85 \
  --format markdown

# CI gate equivalent
make compliance-report

💡

Control-to-policy map: secops/compliance/control-library/control-to-policy-map.yaml links each compliance control to the Kyverno policy or script that provides evidence for it. This is the traceability bridge for auditors.

Observability

Observability Stack

The three pillars of observability (metrics, logs, and traces) each answer different operational questions. Instrument all three from day one, and correlate them with shared labels and trace IDs so an on-call engineer can move from an alert to the relevant logs and traces without guesswork.

Metrics

Prometheus + Grafana

observability/prometheus/

Logs

Loki + Grafana

observability/loki/

Traces

Tempo + Grafana

observability/tempo/

OTel

Collector Sidecar

observability/opentelemetry/

Signal → Question Matrix

Question	Signal	Tool
Is my service up and healthy?	Metrics	Prometheus + Alertmanager
What happened at 14:23 during the incident?	Logs	Loki + Grafana
Why is this specific request slow?	Traces	Tempo + Grafana
Is my error rate above the SLO?	Metrics + SLO rules	Prometheus recording rules
What did a specific user's request traverse?	Traces	Tempo — filter by trace ID
Which service is causing cascading failures?	Traces (service map)	Tempo + Grafana service graph

Install Order

Install in this order — each layer builds on the previous:

Prometheus Stack

Baseline cluster visibility and alert routing first. helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -f observability/prometheus/values.yaml

Loki Stack

Log aggregation second — incidents need both metrics and logs. helm install loki grafana/loki-stack -f observability/loki/values.yaml

Tempo

Distributed tracing last — most value after metrics and logs are stable. helm install tempo grafana/tempo -f observability/tempo/values.yaml

OTel Collector Sidecar

Add to each app Deployment: kubectl patch deployment <name> --patch-file observability/opentelemetry/collector-sidecar.yaml

OpenTelemetry Language Env Vars

Language	Env var file
.NET	`observability/opentelemetry/env-vars/dotnet.env`
Java	`observability/opentelemetry/env-vars/java.env`
Python	`observability/opentelemetry/env-vars/python.env`

Trace Sampling Rates

Environment	Rate	Rationale
Local / Kind	100%	Every request traced — debugging is the priority
Dev	100%	Same as local
Staging	50%	Enough to catch integration issues
Production	10%	Statistically representative; bounded storage cost

Override with OTEL_TRACES_SAMPLER_ARG in your Helm values or Kustomize overlay. See observability/opentelemetry/env-vars/ for language-specific env files.
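Ratio-based sampling is usually made deterministic on the trace ID, so every service reaches the same keep-or-drop decision for a given trace. The sketch below illustrates that idea with the rates from the table. It is not the OpenTelemetry SDK's code; `should_sample` and `RATES` are names invented for this example.

```python
import random

RATES = {"local": 1.0, "dev": 1.0, "staging": 0.5, "production": 0.1}


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below rate * 2**64."""
    bound = round(rate * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


rng = random.Random(42)  # fixed seed so the demo is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
for env, rate in RATES.items():
    kept = sum(should_sample(t, rate) for t in trace_ids)
    print(f"{env}: kept {kept / len(trace_ids):.1%}")
```

Because the decision depends only on the trace ID and the rate, a trace kept by the first service is kept by every downstream service configured with the same rate, which avoids partial traces.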

Observability

SLOs & Error Budgets

File Layout

File	Purpose
`observability/prometheus/slos/slo-schema.yaml`	Schema to copy and fill for any service
`observability/prometheus/slos/my-service-availability-slo.yaml`	Worked availability SLO example
`observability/prometheus/slos/my-service-latency-slo.yaml`	Worked latency SLO example
`observability/prometheus/recording-rules/slo-burn-rates.yaml`	Multi-window burn-rate recording rules template
`observability/prometheus/alerts/slo-burn-rate-alerts.yaml`	Four-tier burn-rate alert rules template
`observability/prometheus/dashboards/slo-status-configmap.yaml`	Grafana SLO dashboard (auto-discovered via ConfigMap)
`docs/runbooks/slo-breach-response.md`	On-call runbook for SLO burn-rate alerts
`docs/runbooks/slo-quarterly-review.md`	Quarterly review agenda, decision criteria, error budget policy template
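The burn-rate rules and alerts listed above rest on two small formulas: burn rate is the observed error ratio divided by the allowed error ratio, and a multi-window alert fires only when both a long and a short window exceed the threshold. The sketch below shows the arithmetic. The 14.4 threshold and the 30-day period are common illustrative values and are assumptions here, not values read from the repo's rule files.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float, period_days: float = 30) -> float:
    """Fraction of the period's error budget used if `rate` holds for the window."""
    return rate * window_hours / (period_days * 24)


def fires(long_window_rate: float, short_window_rate: float, threshold: float) -> bool:
    """Multi-window rule: both windows must exceed the threshold."""
    return long_window_rate >= threshold and short_window_rate >= threshold


rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
print(round(rate, 1))                      # 14.4
print(round(budget_consumed(rate, 1), 3))  # 0.02
print(fires(long_window_rate=15.0, short_window_rate=2.0, threshold=14.4))  # False
```

The last line shows why the short window matters: once the short-window rate has dropped, the incident is already recovering and the alert stops firing.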

Error Budget Policy Template

# docs/slo-policies/<service>-error-budget-policy.yaml
service: api-gateway
slo_target: 99.9%
error_budget_monthly_minutes: 43.8

policy:
  above_50_percent_remaining:
    action: normal operations, proceed with planned changes
  below_25_percent_remaining:
    action: pause non-critical feature work, prioritize reliability
    owner: engineering manager
  below_10_percent_remaining:
    action: halt all deployments except rollbacks and critical fixes
    owner: engineering director approval required
  exhausted:
    action: incident response, all hands on reliability
    escalation: VP Engineering
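The `error_budget_monthly_minutes` value follows from the SLO target, and the policy keys can be selected from the remaining budget. The sketch below assumes an average month of 365.25 / 12 days, which reproduces the 43.8 figure. The function names are hypothetical, and the `unspecified` branch exists only because the template defines no action between 25% and 50% remaining.

```python
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # average month, about 43,830 minutes


def monthly_budget_minutes(slo_target: float) -> float:
    """Allowed 'bad' minutes per average month for a given SLO target."""
    return (1.0 - slo_target) * MINUTES_PER_MONTH


def policy_tier(remaining_fraction: float) -> str:
    """Map remaining budget (0.0 to 1.0) to the policy keys in the template."""
    if remaining_fraction <= 0:
        return "exhausted"
    if remaining_fraction < 0.10:
        return "below_10_percent_remaining"
    if remaining_fraction < 0.25:
        return "below_25_percent_remaining"
    if remaining_fraction > 0.50:
        return "above_50_percent_remaining"
    return "unspecified"  # the template defines no action between 25% and 50%


budget = monthly_budget_minutes(0.999)
print(round(budget, 1))                # 43.8
print(policy_tier(1 - 35.0 / budget))  # below_25_percent_remaining
```

Here 35 bad minutes in the month leave roughly 20% of the budget, which lands in the tier that pauses non-critical feature work.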

Observability

Alert Routing

Alerts route to different channels based on severity. Configure receivers in the Alertmanager config in observability/prometheus/values.yaml under alertmanager.config.

Severity	Goes to	Wake someone?	Config file
critical	PagerDuty	Yes — immediately	`notifications/pagerduty-notify.yml`
warning	Slack	No — review during business hours	`notifications/slack-notify.yml`
info	Grafana annotation / Slack	No — informational only	`notifications/grafana-notify.yml`

Available Notification Integrations

Channel	Config file	Use for
Slack	`notifications/slack-notify.yml`	Dev and staging alerts, team notifications
PagerDuty	`notifications/pagerduty-notify.yml`	Production on-call paging (critical severity)
Microsoft Teams	`notifications/teams-notify.yml`	Teams using Microsoft 365
Grafana	`notifications/grafana-notify.yml`	Deployment markers on dashboards
Datadog	`notifications/datadog-notify.yml`	DORA metrics and APM correlation

# Test alert routing before going live
# Run the port-forward in a separate terminal (it blocks until interrupted)
kubectl -n monitoring port-forward svc/kube-prometheus-stack-alertmanager 9093:9093
# Alertmanager removed the v1 API in v0.27; post to the v2 endpoint
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","team":"my-team"}}]'

FinOps

FinOps — Cost Governance & Optimization

A production-grade FinOps system that integrates cost visibility, governance, and optimization into the CI/CD lifecycle — from Terraform PRs through to monthly chargeback reports. Cost control is an engineering discipline, not a monthly finance surprise.

CI/CD Layer (GitHub Actions / Azure Pipelines / GitLab CI)
└── Infracost                → posts monthly cost impact on every Terraform PR
                               blocks merge when cost growth > $500 or > 20%

Kubernetes Admission Layer (Kyverno Policies)
├── require-cost-labels      → blocks pods missing finops.org/* labels
├── enforce-resource-limits  → blocks unconstrained containers
├── gpu-approval-gate        → requires explicit GPU approval annotation
└── require-pdb-large-wkld   → PDB required for workloads > 4 CPU / 8Gi

Cost Visibility Layer (Kubernetes)
├── Kubecost / OpenCost      → real-time cost per namespace/workload
├── VPA Recommender          → CPU & memory rightsizing recommendations
└── CronJob                  → monthly chargeback reports → S3/Blob/GCS

Alerting Layer (Prometheus + Alertmanager)
├── budget-alerts            → 80% / 100% / 120% threshold per cost center
├── anomaly-alerts           → cost spike vs 7-day baseline
└── tag-compliance           → fires when > 5% pods lack required cost labels

Dashboards (Grafana, 9 dashboards)
    Cost Overview · Cost Breakdown · Budget Tracking · Anomaly Detection
    Rightsizing Opportunities · Tag Compliance · Multi-Cloud Comparison
    Reserved Capacity · Optimization Opportunities
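The Infracost merge gate in the CI/CD layer reduces to a two-threshold check on monthly cost growth. The sketch below shows that decision under stated assumptions: the function name and the handling of a zero baseline are hypothetical, and the repo's actual gate may be expressed as an Infracost policy rather than code like this.

```python
def blocks_merge(baseline_monthly: float, proposed_monthly: float,
                 max_delta_usd: float = 500.0, max_delta_pct: float = 20.0) -> bool:
    """True when monthly cost growth exceeds either threshold."""
    delta = proposed_monthly - baseline_monthly
    if delta <= 0:
        return False  # savings never block
    if delta > max_delta_usd:
        return True
    if baseline_monthly <= 0:
        # Percentage growth is undefined for brand-new spend; rely on the USD limit.
        return False
    return delta / baseline_monthly * 100 > max_delta_pct


print(blocks_merge(1000, 1150))    # False: +$150, +15%
print(blocks_merge(1000, 1250))    # True:  +25%
print(blocks_merge(10000, 10600))  # True:  +$600
```

Using both an absolute and a relative threshold keeps small services from slipping a large percentage increase through, and large services from hiding a large dollar increase behind a small percentage.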

Required Cost Labels (Enforced at Admission)

# All pods must carry these labels — Kyverno blocks any missing them
metadata:
  labels:
    app: my-service
    finops.org/costcenter: "engineering"     # Must match finops/config/budgets.yaml
    finops.org/environment: "production"     # dev | staging | production
    finops.org/team: "platform"             # Optional but recommended
    finops.org/project: "my-service"        # Optional

FinOps Quick Start

# 1. Install cost monitoring (Kubecost or OpenCost)
./finops/scripts/install-cost-monitoring.sh --tool kubecost

# 2. Deploy Kyverno policies in audit mode first (non-blocking)
./finops/scripts/deploy-policies.sh --audit-mode

# 3. Import all 9 Grafana dashboards
export GRAFANA_API_KEY=your-key
./finops/scripts/deploy-dashboards.sh

# 4. Validate cost label compliance
python finops/scripts/validate-cost-tags.py --all-namespaces

# 5. Find rightsizing opportunities
python finops/scripts/analyze-rightsizing.py --all-namespaces

FinOps

FinOps Optimization Loop

The optimization loop closes the gap between cost alert and verified savings with a repeatable, automated workflow. Every step produces artifacts — patches, PR descriptions, advisor reports — so engineers don't start from scratch each time.

Alert fires → Normalize costs → Analyze rightsizing → Generate PR → Stage 24h → Merge + verify

Alert Response Guide

Alert	Immediate action	Script
`BudgetThreshold80`	Identify top spender by team	`python finops/scripts/normalize-cloud-costs.py --by team`
`BudgetThreshold100`	Stop non-critical deployments	`python finops/scripts/normalize-cloud-costs.py --by team`
`CostAnomalyDetected`	Check for runaway workloads	`python finops/scripts/detect-underutilized.py`
VPA recommendation > 7 days old	Generate optimization PR	`python finops/scripts/generate-optimization-pr.py`
Unused volumes detected	Review and delete	`python finops/scripts/detect-unused-volumes.py`

Safety Guardrails in Optimization Scripts

Never reduce memory limits by more than 50% in one change (use --allow-aggressive to override with justification)
Never recommend limits below VPA lower bound
Warn on single-replica workloads with strict PDB before reducing resources
Never commit reserved capacity for workloads < 3 months old
Reserved capacity commitments > $10,000/year require leadership approval (documented in finops/docs/optimization-runbook.md)
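The first two guardrails can be pictured as a pre-flight check on each proposed change. The sketch below is illustrative only: `check_memory_change` and its parameters are hypothetical and are not taken from `generate-optimization-pr.py`.

```python
from typing import List


def check_memory_change(current_mib: int, proposed_mib: int, vpa_lower_bound_mib: int,
                        allow_aggressive: bool = False) -> List[str]:
    """Return guardrail violations for one proposed memory-limit change."""
    violations: List[str] = []
    if proposed_mib < vpa_lower_bound_mib:
        violations.append("below VPA lower bound")
    reduction = (current_mib - proposed_mib) / current_mib
    if reduction > 0.50 and not allow_aggressive:
        violations.append("reduction exceeds 50% in one change")
    return violations


print(check_memory_change(2048, 1024, vpa_lower_bound_mib=512))  # []
print(check_memory_change(2048, 900, vpa_lower_bound_mib=512))
# ['reduction exceeds 50% in one change']
print(check_memory_change(1024, 400, vpa_lower_bound_mib=512))
# ['below VPA lower bound', 'reduction exceeds 50% in one change']
```

In this sketch the override flag relaxes only the 50% rule. The VPA lower bound stays a hard floor, matching the wording of the list above.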

FinOps

Scripts Reference

Script	What it does	Make target
`analyze-rightsizing.py`	Reads VPA recommendations, calculates potential savings per workload, flags high-priority candidates	`make finops-rightsizing`
`generate-optimization-pr.py`	Generates Kustomize resource patches + PR description from VPA recommendations, with safety checks	`make finops-optimize-pr`
`normalize-cloud-costs.py`	Cross-cloud cost normalization to `(vCPU_hours × 0.048) + (GiB_hours × 0.006)`. Supports markdown/json/csv output	`make finops-normalize-costs`
`reserved-capacity-advisor.py`	Risk-scored reserved instance recommendations (low/medium/high) with break-even and annual savings	`make finops-reserved-capacity`
`validate-cost-tags.py`	Checks all pods carry required `finops.org/*` labels; exits non-zero if compliance < threshold	—
`detect-underutilized.py`	Finds pods where actual CPU/memory usage is < 20% of requests over the past 7 days	—
`detect-unused-volumes.py`	Lists PVCs with no active pod binding in the last N days	—
`generate-cost-report.py`	Monthly chargeback CSV + JSON report by cost center, exported to S3/Blob/GCS	—
`install-cost-monitoring.sh`	Installs Kubecost or OpenCost via Helm with values from `finops/helm/`	—
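The normalization formula quoted for `normalize-cloud-costs.py` is easy to check by hand. The sketch below applies the two coefficients from the table to a hypothetical workload. The function name is invented, and the unit of the result is assumed to be a currency amount such as USD, which the table does not state.

```python
VCPU_HOUR_RATE = 0.048  # coefficients from the formula in the table above
GIB_HOUR_RATE = 0.006


def normalized_cost(vcpu_hours: float, gib_hours: float) -> float:
    """(vCPU_hours x 0.048) + (GiB_hours x 0.006)."""
    return vcpu_hours * VCPU_HOUR_RATE + gib_hours * GIB_HOUR_RATE


# A 2 vCPU / 8 GiB workload running for a 730-hour month.
print(round(normalized_cost(2 * 730, 8 * 730), 2))  # 105.12
```

Because the same coefficients apply to every provider, the result compares resource consumption across clouds rather than reproducing any one provider's invoice.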

Reference

Database Migration Patterns

All migrations must be backwards-compatible, idempotent, and tested locally before production. These three golden rules prevent the most common zero-downtime deployment failures.

Pattern Selection

Pattern	File	When to use	Runs when
Init container	`cd/kubernetes/_patterns/db-migration-init-container.yaml`	Fast migrations (< 5 min), non-Helm workload	Every pod start — blocks rollout until complete
Job	`cd/kubernetes/_patterns/db-migration-job.yaml`	Long migrations (> 5 min), backfills, one-time jobs	Explicitly from CI before deployment rollout
Helm hook	`cd/kubernetes/_patterns/db-migration-hook.yaml`	Helm-managed apps	Automatically as pre-upgrade hook; Helm blocks on failure

Expand / Contract Pattern

Never drop or rename a column in the same release that removes the code using it. Old pods are still running when new pods start:

Release	Schema change	App change
Release N	`ADD` new column (nullable)	New app writes to new column; old app ignores it
Release N+1	Backfill rows; add `NOT NULL`	Both old and new app work against new schema
Release N+2	No schema change	Remove code using the old column
Release N+3	`DROP` old column	Safe — nothing reads it
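
The expand step can be tried out locally. The sketch below uses an in-memory SQLite database as a stand-in for the production database; the `users` table and `display_name` column are hypothetical. It shows that a nullable column added in Release N accepts writes from both the old and the new app version, and that the backfill then fills in the rows the old version wrote:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Release N (expand): add the new column as nullable so old writers keep working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Old app version: unaware of the new column, still inserts successfully.
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
# New app version: writes the new column as well.
conn.execute("INSERT INTO users (name, display_name) VALUES (?, ?)", ("bob", "Bob"))

# Release N+1 (backfill): populate rows that were written without the new column.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

rows = conn.execute("SELECT name, display_name FROM users ORDER BY id").fetchall()
print(rows)  # [('alice', 'alice'), ('bob', 'Bob')]
```

Only after the backfill leaves no NULL values is it safe to add the `NOT NULL` constraint.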

Zero-Downtime Checklist

Migration is backwards-compatible with the old app version
Migration is idempotent (safe to run twice)
Rollback migration script written and tested
activeDeadlineSeconds set on the Job; init containers have no such field, so bound them with a timeout inside the migration command
Database backup taken within the last 24 hours
Migration tested on staging with production-scale data volume
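
The idempotency item in the checklist can be illustrated with a guard that inspects the live schema before changing it. This is a sketch under stated assumptions: SQLite stands in for the real database, and the `orders` table, `currency` column, and helper names are hypothetical:

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Check the live schema instead of assuming the migration never ran."""
    # PRAGMA table_info returns one row per column; index 1 holds the column name.
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))


def migrate(conn: sqlite3.Connection) -> bool:
    """Add orders.currency if it is missing. Returns True when a change was applied."""
    if column_exists(conn, "orders", "currency"):
        return False  # already applied, so a second run is a no-op
    conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
first_run = migrate(conn)   # applies the change
second_run = migrate(conn)  # detects the column and does nothing
```

Running the migration twice against a staging copy, and confirming that the second run reports no changes, is a cheap way to check this item before production.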

Reference

Versioning Strategy

Two automated release tools are included. Both require Conventional Commits format — commit messages are parsed to determine version bumps and generate changelogs automatically.

Semantic Versioning (SemVer)

Format: MAJOR.MINOR.PATCH — recommended for libraries and APIs.

Type	SemVer bump	Example commit
`feat`	MINOR (0.x.0)	`feat(auth): add OAuth2 login`
`fix`	PATCH (0.0.x)	`fix(db): handle null cursor on retry`
`feat!` or `BREAKING CHANGE:` footer	MAJOR (x.0.0)	`feat!: remove /v1 API endpoints`
`chore`, `docs`, `refactor`, `test`	None	`docs: update API reference`
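
The mapping in the table can be expressed as a small parser. The sketch below is not the logic of either bundled release tool; it is a simplified illustration that assumes plain MAJOR.MINOR.PATCH versions and ignores pre-release identifiers. The function names are hypothetical:

```python
import re

# Conventional Commits header: type, optional (scope), optional "!", then ": ".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: ")


def bump_for(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for one commit message."""
    match = HEADER.match(message)
    if match is None:
        return "none"
    if match["bang"] or re.search(r"^BREAKING CHANGE: ", message, re.MULTILINE):
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match["type"], "none")


def next_version(current: str, messages: list[str]) -> str:
    """Apply the highest-ranked bump found in messages to a MAJOR.MINOR.PATCH string."""
    order = ["none", "patch", "minor", "major"]
    bump = max((bump_for(m) for m in messages), key=order.index, default="none")
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current
```

For example, `next_version("1.2.3", ["fix(db): handle null cursor on retry", "feat(auth): add OAuth2 login"])` returns `"1.3.0"`, because the highest-ranked bump in the batch wins.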

Image Tagging Rules

# Tag with Git SHA (always unique, traceable)
docker build -t my-app:sha-$(git rev-parse --short HEAD) .

# Also tag with semver on release
docker tag my-app:sha-abc1234 my-app:1.2.3
docker tag my-app:sha-abc1234 my-app:1.2

🚫

Never use :latest in production. The tag is mutable: it moves to a new image on every build, so a rollback cannot target a known image and Dependabot drift detection becomes unreliable. Kyverno blocks :latest in production namespaces.
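
The tagging rules can be captured in two helpers: one derives the tag set for a build, the other flags references that resolve to `latest`. This is an illustrative sketch with hypothetical function names; it does not describe how the Kyverno policy itself matches images:

```python
import re
from typing import Optional

SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def image_tags(image: str, short_sha: str, release: Optional[str] = None) -> list[str]:
    """Return the SHA tag, plus semver tags when this build is a release."""
    tags = [f"{image}:sha-{short_sha}"]
    if release is not None:
        match = SEMVER.match(release)
        if match is None:
            raise ValueError(f"not a MAJOR.MINOR.PATCH version: {release!r}")
        major, minor, _patch = match.groups()
        tags += [f"{image}:{release}", f"{image}:{major}.{minor}"]
    return tags


def uses_latest(image_ref: str) -> bool:
    """True when a reference is untagged or explicitly tagged :latest."""
    name = image_ref.rsplit("/", 1)[-1]  # drop the registry host:port and path
    if "@" in name:                      # pinned by digest
        return False
    return ":" not in name or name.endswith(":latest")
```

`image_tags("my-app", "abc1234", "1.2.3")` yields `my-app:sha-abc1234`, `my-app:1.2.3`, and `my-app:1.2`. The `1.2` tag moves with each patch release, so manifests that need a fixed image should pin the SHA or full semver tag.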

Reference

Branching Strategy

Strategy	Best for	Branches
Trunk-Based (Recommended for CI/CD)	Teams with good test coverage and CI discipline	Short-lived feature branches (1–3 days) → main
GitFlow	Scheduled release cadence, multiple versions in production	main, develop, feature/*, release/*, hotfix/*
GitHub Flow	Web apps with continuous deployment	Feature branches merged via PR → main = deploy

Branch Protection (Required on main)

Require pull request with at least 1 reviewer
Require status checks to pass (CI workflow names)
Require branches to be up to date before merging
No force push; no deletion

Reference

Runbooks

Action-oriented operational procedures for incidents and routine operations. Every alert in production must link to a runbook via the runbook_url annotation on the PrometheusRule.

Included Runbooks

Runbook	Triggers	Key content
`docs/runbooks/podcrashloopbackoff.md`	`PodCrashLoopBackOff` alert	Exit code diagnosis (137=OOM kill, 1=app crash, 132=illegal instruction, 139=segfault), liveness probe misconfiguration, rollback procedures
`docs/runbooks/slo-breach-response.md`	`SLOBurnRateCritical/High/Medium/Low`	Severity triage by burn rate tier, mitigation options (rollback/scale/feature flag), error budget remaining check
`docs/runbooks/slo-quarterly-review.md`	Quarterly schedule	60-min agenda, decision criteria for tightening/relaxing targets, error budget policy template
`docs/runbooks/template.md`	Write new runbooks	Standard sections: overview, user impact, immediate steps, diagnosis checklist, likely causes, Grafana queries, escalation, post-incident
`secops/runbooks/compromised-pod.md`	Falco alert / anomaly detection	Pod isolation, evidence collection, forensic preservation, remediation
`secops/runbooks/secret-exposure.md`	TruffleHog alert / security report	Immediate rotation, scope assessment, Git history remediation, audit

Write a New Runbook

# Copy the template
cp docs/runbooks/template.md docs/runbooks/<alertname>.md

# Name the file to match the alert name exactly (lowercase, no spaces)
# e.g.: docs/runbooks/highlatencyp99.md

# Link it from the PrometheusRule
annotations:
  runbook_url: "https://github.com/your-org/repo/blob/main/docs/runbooks/<alertname>.md"

# Commit to docs/runbooks/ so it appears in PR review
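
The naming rule (alert name, lowercased, no spaces) is easy to get wrong by hand, so it can help to derive the path and annotation value from the alert name. A small sketch with hypothetical helper names; `your-org/repo` is the placeholder from the annotation above:

```python
def runbook_path(alert_name: str) -> str:
    """Lowercase the alert name and drop whitespace, per the naming rule."""
    slug = "".join(alert_name.split()).lower()
    return f"docs/runbooks/{slug}.md"


def runbook_url(alert_name: str, repo: str = "your-org/repo", branch: str = "main") -> str:
    """Build the value for the runbook_url annotation on the PrometheusRule."""
    return f"https://github.com/{repo}/blob/{branch}/{runbook_path(alert_name)}"


print(runbook_path("HighLatencyP99"))  # docs/runbooks/highlatencyp99.md
```

A CI check could apply the same function to every alert name in the PrometheusRule files and fail when the resulting file is missing.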

Reference

Architecture Decision Records

ADRs capture why decisions were made — not just what was decided. They prevent repeated debates and preserve context when team members change. All ADRs live in docs/decisions/.

ADR	Decision	Key rationale
`ADR-001-folder-structure.md`	Organize by functional concern first, then platform	Engineers ask "I need to deploy to AKS" not "I need a GitHub Actions file." Concern-first mirrors how teams reason about problems.
`ADR-002-helm-vs-kustomize.md`	Include both; Kustomize for internal apps, Helm for distributed packages	Kustomize is kubectl-native, lower learning curve, readable YAML. Helm adds value for versioned chart distribution and complex conditional logic.
`ADR-003-gitops-strategy.md`	ArgoCD as default; Flux included	ArgoCD: richer UI, intuitive Application CRD, App-of-Apps scales well. Flux: better multi-tenancy for large orgs. 2026 extension adds fleet management via ApplicationSet list generator.

Reference

Makefile & Taskfile Targets

All common commands are abstracted behind make targets. Run make help to see the full list at any time.

Target	What it does
`make check-prereqs`	Verify all required tools are installed and at correct versions
`make dev`	Start local kind cluster with registry, ingress, and dev namespace
`make dev-compose`	Start local stack via Docker Compose (no Kubernetes)
`make hooks`	Install pre-commit hooks (run once after cloning)
`make lint`	Run all pre-commit hooks against all files
`make teardown`	Destroy local kind cluster and registry
`make logs`	Follow Docker Compose logs

Target	What it does
`make build`	Build Docker image tagged `IMAGE_NAME:IMAGE_TAG`
`make build-push`	Build and push image to local kind registry
`make test`	Run unit tests (override with your test command)
`make tag-release`	Create and push signed semver git tag (prompts for version)

Target	What it does
`make deploy-dev`	Build, push, apply Kustomize dev overlay to local cluster
`make deploy-staging`	Apply Kustomize staging overlay (requires staging kubeconfig)
`make k8s-status`	Show pods, services, ingresses across all namespaces
`make rollout-status`	Check rollout health for all deployments

Target	What it does
`make policy-report`	Show Kyverno policy violation report across all namespaces
`make compliance-report`	Generate SOC 2/CIS/ISO 27001 compliance report, fail if score < 80%
`make slo-validate`	Lint SLO YAML files, verify recording rule naming convention

Target	What it does
`make finops-rightsizing`	Analyze CPU/memory rightsizing across all namespaces
`make finops-optimize-pr`	Generate draft optimization PR with rightsizing changes
`make finops-normalize-costs`	Normalize cross-cloud costs to standard unit for comparison
`make finops-reserved-capacity`	Evaluate reserved instance / savings plan recommendations

Target	What it does
`make catalog-validate`	Validate all service and team catalog entries (CI gate equivalent)
`make catalog-codeowners`	Regenerate `.github/CODEOWNERS` from catalog team definitions

Golden Path

Supply Chain Security

Keyless signing, SBOM attestation, SLSA provenance, and Kyverno admission verification — wired as a standard delivery path. Zero private keys are stored anywhere.

code commit → build image → sign (cosign keyless) → SBOM (spdx-json) → SLSA provenance → push to registry → Kyverno verifies at admission

Prerequisites

Requirement	Why required
cosign CLI	Verify signatures and attestations locally
GitHub Actions OIDC federation	Required before keyless signing can mint OIDC-backed certificates
Kyverno installed in cluster	Admission verification and PolicyReport compliance output

Implementation Steps

# Step 1: Configure OIDC first (prerequisite for keyless signing)
# See docs/guides/github-actions-oidc.md

# Step 2: Apply supply chain policies
kubectl apply -f secops/supply-chain/cosign-verify-policy.yaml
kubectl apply -f secops/supply-chain/sbom-policy.yaml
kubectl apply -f secops/supply-chain/slsa-verify.yaml

# Step 3: Wire the reusable workflow (copy from dotnet example)
# ci/github-actions/dotnet/supply-chain-integration.yml

# Step 4: Start in audit mode — non-blocking migration
kubectl label namespace <ns> policy.kyverno.io/supply-chain=audit --overwrite

# Step 5: Verify signature
cosign verify <image> \
  --certificate-identity-regexp 'https://github.com/YOUR-ORG/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com

# Step 6: Verify SBOM attestation
cosign verify-attestation \
  --certificate-identity-regexp 'https://github.com/YOUR-ORG/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --type spdx <image>

# Step 7: Graduate to enforce (after all violations resolved)
kubectl label namespace <ns> policy.kyverno.io/supply-chain- --overwrite

# Step 8: Monitor compliance
kubectl get policyreport -A
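
When the verification commands from Steps 5 and 6 are scripted, building the argument list in one place keeps the identity pattern and issuer consistent. The sketch below only assembles arguments using the flags shown above; the function name is hypothetical, and the organization-name check is an added assumption to keep unexpected characters out of the regular expression:

```python
import re
from typing import Optional

GITHUB_OIDC_ISSUER = "https://token.actions.githubusercontent.com"


def cosign_verify_args(image: str, org: str, attestation_type: Optional[str] = None) -> list[str]:
    """Return argv for `cosign verify`, or `cosign verify-attestation` when a type is given."""
    if not re.fullmatch(r"[A-Za-z0-9-]+", org):
        raise ValueError(f"unexpected characters in organization name: {org!r}")
    args = [
        "cosign",
        "verify-attestation" if attestation_type else "verify",
        "--certificate-identity-regexp", f"https://github.com/{org}/.*",
        "--certificate-oidc-issuer", GITHUB_OIDC_ISSUER,
    ]
    if attestation_type:
        args += ["--type", attestation_type]
    return args + [image]
```

The returned list can be passed to `subprocess.run(args, check=True)`, which raises an exception when cosign exits non-zero.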

Policy Guardrails

Rule	Enforced by
No unsigned images in production	`cosign-verify-policy.yaml` (Enforce)
No images without SBOM in production	`sbom-policy.yaml` (Enforce)
No images without SLSA provenance	`slsa-verify.yaml` (Enforce)
Supply chain tool versions pinned to SHA	Code review + `reusable-supply-chain.yml`
Keyless signing only — no private keys stored	OIDC-only workflow design

Golden Path

FinOps Optimization

Close the loop from a cost alert to a merged PR to verified savings with a repeatable, automated workflow. Every step produces concrete artifacts — no starting from scratch each time.

Identify the Trigger

Confirm which alert fired or which KPI changed in the monthly report. Reference: finops/docs/optimization-runbook.md

Run Cross-Cloud Normalization

python finops/scripts/normalize-cloud-costs.py --by team

Find top movers by team and cloud before selecting an action branch.

Rightsizing Analysis

python finops/scripts/analyze-rightsizing.py --namespace <ns>

Focus on candidates with highest monthly savings and low operational risk first.

Generate Optimization PR Assets

python finops/scripts/generate-optimization-pr.py \
  --namespace <ns> --min-savings 20

Outputs Kustomize patches under optimization-patches/<ns>/ and a pre-filled PR-DESCRIPTION.md.

Test in Staging for 24 Hours

kubectl apply -k optimization-patches/<ns>/

Monitor for OOM kills, error-rate changes, and latency regressions before promoting to production.

Merge & Verify Savings

Merge the PR and validate savings in Grafana Optimization dashboards within 48 hours. If savings materially change the budget forecast, update finops/config/budgets.yaml.

Reserved Capacity (Quarterly Review)

# Risk-scored reserved instance recommendations
python finops/scripts/reserved-capacity-advisor.py \
  --min-savings 500 --max-risk medium --term 1yr

# Safety rules:
# - Never recommend 3-year for workloads < 3 months old
# - Commitments > $10k/year require leadership approval
# - Always test in staging 24h before production changes
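
The first two safety rules are mechanical enough to check in code before a recommendation reaches a reviewer. A minimal sketch, assuming a hypothetical `Recommendation` record; the advisor script's real output format is not shown in this handbook and may differ:

```python
from dataclasses import dataclass

APPROVAL_THRESHOLD_USD = 10_000  # annual commitments above this need leadership approval
MIN_AGE_MONTHS_FOR_3YR = 3       # younger workloads must not receive 3-year terms


@dataclass
class Recommendation:
    workload: str
    term_years: int               # 1 or 3
    annual_commitment_usd: float
    workload_age_months: float


def review(rec: Recommendation) -> list[str]:
    """Return the safety-rule findings for one recommendation (empty means clear)."""
    findings = []
    if rec.term_years == 3 and rec.workload_age_months < MIN_AGE_MONTHS_FOR_3YR:
        findings.append("reject: 3-year term on a workload younger than 3 months")
    if rec.annual_commitment_usd > APPROVAL_THRESHOLD_USD:
        findings.append("needs leadership approval: commitment exceeds $10,000/year")
    return findings
```

The third rule, 24 hours in staging, is a process step and is deliberately left out of the check.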

Reference

Disaster Recovery

Recovery procedures for the three most common disaster scenarios. Use the decision tree to identify which section applies and follow the steps in order.

Is the entire cluster gone?
├── Yes → Section 1: Cluster Recovery
└── No
    └── Is the database corrupted or unavailable?
        ├── Yes → Section 2: Database Recovery
        └── No
            └── Is a namespace/workload corrupted?
                └── Yes → Section 3: Namespace Restore
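
The decision tree checks the widest failure first and narrows from there. Written as code, the ordering becomes explicit. The function name is hypothetical, and the final fallback is an assumption, since the tree does not say what to do when no question is answered with yes:

```python
def recovery_section(cluster_gone: bool, database_down: bool, workload_corrupted: bool) -> str:
    """Walk the decision tree top to bottom and return the section to follow."""
    if cluster_gone:
        return "Section 1: Cluster Recovery"
    if database_down:
        return "Section 2: Database Recovery"
    if workload_corrupted:
        return "Section 3: Namespace Restore"
    # Not covered by the tree: treat it as a regular incident (assumption).
    return "No disaster scenario matched"
```

Because the checks are ordered, a lost cluster always routes to Section 1, even when the database is also unavailable.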

RTO / RPO Summary

Component	RPO (data loss)	RTO (downtime)	Method
Kubernetes workloads	24h (daily backup)	2h	Velero restore to new cluster
AWS RDS	5 min (PITR)	1h	RDS PITR restore
Azure PostgreSQL	5 min (PITR)	1–2h	Flexible Server PITR / geo-restore
GCP Cloud SQL	5 min (PITR)	30 min	Cloud SQL PITR or replica promotion
Secrets (ESO)	0 (live in cloud store)	15 min	Reinstall ESO + apply ClusterSecretStore
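
An RPO is only meaningful if the newest recovery point is actually that fresh. The sketch below compares the age of the latest backup or restorable timestamp against the RPO values from the table; the dictionary keys and function name are hypothetical labels, and obtaining the timestamp from Velero or the cloud provider is left out:

```python
from datetime import datetime, timedelta, timezone

# RPO values copied from the summary table; the keys are hypothetical labels.
RPO = {
    "kubernetes-workloads": timedelta(hours=24),
    "aws-rds": timedelta(minutes=5),
    "azure-postgresql": timedelta(minutes=5),
    "gcp-cloud-sql": timedelta(minutes=5),
}


def within_rpo(component: str, last_recovery_point: datetime, now: datetime) -> bool:
    """True when the newest recovery point is no older than the component's RPO."""
    return now - last_recovery_point <= RPO[component]


# Example: a daily backup that last succeeded 30 hours ago violates the 24h RPO.
now = datetime(2026, 1, 15, 14, 30, tzinfo=timezone.utc)
stale = not within_rpo("kubernetes-workloads", now - timedelta(hours=30), now)
```

A check like this could run on a schedule and alert before a disaster reveals that backups have silently stopped.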

Section 1 — Cluster Recovery

# 1. Provision new cluster from existing Terraform
terraform -chdir=terraform/aws-eks apply \
  -var="project=<project>" -var="environment=production"

# 2. Install Velero (pointing to same backup bucket)
bash backup/velero/aws-install.sh

# 3. Restore latest scheduled backup
velero backup get
velero restore create \
  --from-schedule daily-full-backup \
  --restore-volumes=true
velero restore describe <restore-name> --details

# 4. Verify ESO re-syncs secrets automatically
kubectl get externalsecret -A

# 5. Validate cluster health
kubectl get nodes
kubectl get pods -A
curl -f https://my-app.example.com/health

Section 2 — Database Recovery (AWS RDS PITR)

# Check the latest restorable time, then pick a restore timestamp before the corruption event
aws rds describe-db-instances \
  --db-instance-identifier rds-<project>-production \
  --query 'DBInstances[0].LatestRestorableTime'

# Restore to new instance
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier rds-<project>-production \
  --target-db-instance-identifier rds-<project>-production-restored \
  --restore-time "2026-01-15T14:30:00Z"

# Wait for availability (~10-20 minutes)
aws rds wait db-instance-available \
  --db-instance-identifier rds-<project>-production-restored

# Update connection string in AWS Secrets Manager, then restart pods

Post-Recovery Checklist

All pods are Running (no CrashLoopBackOff)
Ingress endpoints respond with HTTP 200
Database connection strings point to recovered instance
ExternalSecrets are Synced (kubectl get externalsecret -A)
Monitoring/alerting is active (Prometheus targets healthy)
On-call acknowledged — incident ticket updated with timeline
Post-mortem scheduled within 48 hours

Reference

Concepts & Glossary

Canonical terminology used consistently across all templates, guides, and review language in this repository.

Term	Definition
Golden Path	Preferred, opinionated, end-to-end implementation workflow. Not optional guidance — it encodes production experience and enforces guardrails.
Template	Reusable starting file intended for adaptation by teams. Copy it, change the `<-- CHANGE THIS` markers, keep the guardrails.
Guardrail	Mandatory safety or governance control embedded in templates and CI/CD. Not configurable — removing a guardrail is a deliberate exception requiring documented justification.
Baseline	Minimum acceptable standard for production readiness. A service not meeting the baseline is not production-ready, regardless of feature completeness.
Runbook	Action-oriented operational procedure for incidents or routine operations. Linked from alert annotations so on-call engineers see it immediately when an alert fires.
Target	Deployment destination template under `cd/targets/`. One target per cloud platform and deployment model.
Overlay	Environment-specific Kustomize customization on top of base Kubernetes manifests. Only patches what differs — never duplicates the full manifest.
Error Budget	The allowed unreliability calculated from the SLO target. Spend it on risky features; run out and reliability work takes priority over feature work.
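Burn Rate	Speed of error budget consumption relative to the expected rate. 1× = the budget lasts exactly the SLO window. 14.4× = with a 30-day window, 2% of the budget is burned every hour and the budget is exhausted in about 50 hours.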
OIDC	OpenID Connect federation for short-lived cloud authentication. The correct alternative to static long-lived credentials in CI/CD.
ESO	External Secrets Operator. Syncs secrets from cloud secret managers (AWS SM, Azure KV, GCP SM) into Kubernetes Secrets automatically.
GitOps	Deployment model where desired cluster state is declared in Git and a controller continuously reconciles actual state to match. ArgoCD and Flux are the GitOps controllers in this repo.
SBOM	Software Bill of Materials. Machine-readable inventory of all components in a built artifact — enables supply chain risk assessment and CVE impact analysis.
SLSA	Supply-chain Levels for Software Artifacts. Framework for build provenance — proves an artifact was built from a specific source by a specific process.
ADR	Architecture Decision Record. Lightweight document capturing what was decided, why, what alternatives were considered, and what the consequences are.
FinOps	Cloud financial operations — the practice of bringing financial accountability to cloud spend as an engineering discipline, not a monthly finance task.
VPA	Vertical Pod Autoscaler. Generates CPU and memory right-sizing recommendations. Used in Off mode here — recommendations only, no automatic mutation of running pods.
HPA	Horizontal Pod Autoscaler. Scales pod count based on CPU, memory, or custom metrics. Target CPU at 70% to provide headroom before scaling lag causes user-visible latency.
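PDB	PodDisruptionBudget. Ensures a minimum number of pods remain available during voluntary disruptions that go through the eviction API (node drains, node upgrades). Deployment rolling updates are governed by the rollout strategy, not the PDB. Required for all production services.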
Kyverno	Kubernetes-native policy engine. Validates, mutates, and generates Kubernetes resources at admission time. Policies are Kubernetes custom resources — no Rego required.
Falco	Runtime security engine. Watches kernel syscalls and Kubernetes audit events for suspicious behavior patterns in live containers.
Velero	Kubernetes backup and restore tool. Backs up cluster objects and persistent volume snapshots. Used with DB-native backups for complete DR coverage.
Infracost	Terraform PR cost estimation tool. Posts monthly cost delta as a PR comment and blocks merges when cost growth exceeds configured thresholds.
OTel / OpenTelemetry	Vendor-neutral telemetry collection standard. Provides a single API for metrics, logs, and traces across languages and backends.
SLI / SLO / SLA	SLI = the metric measured. SLO = target level for that metric (internal). SLA = external customer-facing commitment derived from the SLO with a penalty clause.
Cosign	Tool for signing and verifying container images and other OCI artifacts. Used in keyless mode here — signatures are tied to OIDC identity, not stored private keys.
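
The Error Budget and Burn Rate entries are linked by simple arithmetic. A worked sketch, assuming a 30-day SLO window (the glossary does not state the window) and hypothetical function names:

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure ratio, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than budgeted the service is failing."""
    return observed_error_ratio / error_budget(slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate


# A 99.9% SLO with 1.44% of requests failing burns budget at roughly 14.4x.
rate = burn_rate(observed_error_ratio=0.0144, slo_target=0.999)
print(round(rate, 1), round(hours_to_exhaustion(rate), 1))
```

At 1× the budget lasts the full 720-hour window. At 14.4× it lasts 720 / 14.4 = 50 hours, which is why that tier pages immediately.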

