Platform Engineering Handbook
DevOps SecOps FinOps
// Beginner → Intermediate Field Guide

DevOps · SecOps & FinOps: Concepts Handbook

A practical reference covering the full engineering lifecycle — from writing your first Dockerfile to owning incident response, cost governance, and everything in between. Every concept explained at two depths.

Source Control CI / CD Kubernetes GitOps SecOps Observability FinOps
01 · Foundations

🧭 The Engineering Lifecycle

The 12-Step Lifecycle Mental Model
Beginner
Every engineering platform is just this loop, made more reliable and faster over time. Understanding where a tool or practice fits in the loop is more valuable than memorizing any individual tool.
Build (Containers & Packages) → Validate (Pre-commit Hooks) → CI (Test, Lint, Scan) → Provision (Terraform / Pulumi) → Deploy (Kubernetes / GitOps) → Observe (Metrics, Logs, Traces)
The golden rule: Every step in this loop is something you want to automate, version-control, and make repeatable. A manual step is a reliability risk. If you're doing something manually more than once — automate it.
02 · Foundations

🌿 Source Control & Team Flow

Branching Strategy
Beginner → Intermediate
What it is: A rule set for how teams create, name, merge, and protect branches in Git. Defines the path code takes from idea to production.
🟢 Beginner

Use short-lived feature branches off main. Merge often. Never commit directly to main. Keep main always deployable.

🟡 Intermediate

Add branch protection rules requiring PR reviews + passing CI. Use release branches only when your cadence demands them. Enforce naming conventions (feat/, fix/).

✕ Anti-Patterns
  • Long-lived feature branches (merge hell)
  • Committing directly to main/master
  • No branch naming convention
  • Merging without CI checks passing
✓ Good Patterns
  • Short-lived branches, merged within days
  • Required PR review + status checks
  • Consistent naming: feat/, fix/, chore/
  • Delete branches after merge
Conventional Commits
Beginner → Intermediate
What it is: A lightweight specification for commit messages. Format: type(scope): description. Enables automated changelogs, semantic versioning, and better history readability.
Commit Format
# Pattern: type(scope): short description
feat(auth): add FIDO2 passkey support        # new feature → MINOR bump
fix(api): correct rate limit header values   # bug fix → PATCH bump
chore(deps): update go to 1.22.1             # maintenance, no version bump
docs(readme): add deployment guide           # documentation only
refactor(payments): extract retry logic      # no behavior change
ci(github): add Trivy container scan step    # CI changes

# Breaking change → MAJOR bump ("!" goes after the scope; the footer follows a blank line)
feat(api)!: remove deprecated v1 endpoints

BREAKING CHANGE: v1/users endpoint removed. Migrate to v2/users.
🟢 Beginner

Learn the 5 common types: feat, fix, chore, docs, refactor. Just using these consistently makes your history dramatically more readable.

🟡 Intermediate

Wire commit types to semantic-release or release-please to automate version bumps and CHANGELOG generation. Enforce the format with commitlint in a commit-msg hook (the message does not exist yet when pre-commit hooks run).
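To make the link between commit type and version bump concrete, here is a minimal sketch in Python. It is not how semantic-release or release-please are implemented; the function name `release_bump` and the regular expression are assumptions written for illustration, following the header shape `type(scope)!: description` from the Conventional Commits specification.

```python
import re

# Header shape: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<bang>!)?: (?P<desc>.+)$"
)


def release_bump(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for one commit message."""
    lines = message.strip().splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError(f"not a conventional commit: {message!r}")
    # A breaking change is marked by "!" in the header or a BREAKING CHANGE footer.
    breaking_footer = any(line.startswith("BREAKING CHANGE:") for line in lines[1:])
    if match["bang"] or breaking_footer:
        return "major"
    # Only feat and fix trigger a release; everything else is maintenance.
    return {"feat": "minor", "fix": "patch"}.get(match["type"], "none")
```

Calling `release_bump("fix(api): correct rate limit header values")` returns `"patch"`, while a message that does not match the header shape raises `ValueError`, which is roughly the behavior a commit-msg check needs.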

Semantic Versioning (SemVer)
Beginner
What it is: Version numbers in the form MAJOR.MINOR.PATCH where each part has a precise meaning based on backward compatibility.
Part | Increments When | Example
MAJOR | Breaking API change: existing clients may break | 2.0.0
MINOR | New feature added, backward compatible | 1.3.0
PATCH | Bug fix, backward compatible | 1.3.7
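The increment rules in the table can be sketched as a small function. This is an illustrative helper (the name `bump_version` is an assumption) that handles plain MAJOR.MINOR.PATCH strings only, not pre-release or build metadata suffixes.

```python
def bump_version(version: str, part: str) -> str:
    """Increment one part of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"          # lower parts reset to zero
    if part == "minor":
        return f"{major}.{minor + 1}.0"    # patch resets to zero
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

For example, `bump_version("1.3.7", "minor")` returns `"1.4.0"`: the lower-order part resets whenever a higher-order part increments.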
03 · Foundations

💻 Developer Experience

Dev Containers
Beginner
What it is: A Docker-based development environment defined in .devcontainer/ that ensures every developer uses identical tool versions, CLI dependencies, and configurations — regardless of their host OS.
The pain it solves: "Works on my machine" is eliminated. When you onboard someone new, they open VS Code, click "Reopen in Container", and have a fully working environment in minutes — same Go version, same Terraform version, same linters as everyone else.
🟢 Beginner

Open the repo in VS Code with the Dev Containers extension. All tools are pre-installed. Run the same make commands as your teammates.

🟡 Intermediate

Pin exact tool versions in the Dockerfile. Add a post-create.sh to auto-run setup (install hooks, configure git). Review pinned versions quarterly.

Makefile / Taskfile — Task Runners
Beginner
What it is: Standardized entrypoints for common developer commands. Instead of every developer remembering 8 different CLI flags, they run make test or task build.
Makefile — Common Targets
# Common pattern: declare phony targets, then one command per target
# VERSION default is an assumption; CI would pass the real value
VERSION ?= dev

.PHONY: lint test build scan ci

lint:
	golangci-lint run ./...

test:
	go test -race -coverprofile=coverage.out ./...
	go tool cover -html=coverage.out -o coverage.html

build:
	docker build --platform linux/amd64 -t myapp:$(VERSION) .

scan:
	trivy image myapp:$(VERSION) --severity CRITICAL,HIGH --exit-code 1

# run the full local CI loop
ci: lint test build scan
Pre-Commit Hooks
Beginner → Intermediate
What it is: Git hooks that run automated checks before a commit or push lands. Catch formatting issues, secret leaks, and obvious bugs without waiting for CI.
.pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: detect-private-key   # blocks commit if private key found
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks             # secret detection on staged files
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.86.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
04 · CI / CD

🔄 Continuous Integration

What CI Does (and Why)
Beginner
What it is: Every time code is pushed to a branch or a PR is opened, an automated pipeline runs: compiles the code, runs tests, checks formatting, scans for vulnerabilities, and produces a build artifact. If any step fails, the PR is blocked.
Nothing → Manual Tests → Tests on Push → Test + Lint + Scan → Full Pipeline + Artifacts
Build once, promote everywhere: A key CI principle is to build your artifact once and promote that exact artifact through environments (dev → staging → prod). Never rebuild per environment — different builds are different code.
GitHub Actions — Anatomy of a Pipeline
Beginner → Intermediate
.github/workflows/ci.yml
name: CI
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Go
        uses: actions/setup-go@v5
        with: { go-version: '1.22' }
      - name: Run linter   # the action installs golangci-lint; do not assume it is on the runner
        uses: golangci/golangci-lint-action@v6
      - name: Run tests
        run: go test -race -coverprofile=coverage.out ./...

  security-scan:
    needs: lint-and-test   # only runs if lint-and-test passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trivy vulnerability scan
        uses: aquasecurity/trivy-action@master   # pin to a release tag or commit SHA in real pipelines
        with:
          scan-type: 'fs'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'   # fail CI if found
Pipeline Parity — Multi-Provider
Intermediate
What it is: Keeping conceptually equivalent pipelines across GitHub Actions, GitLab CI, Azure Pipelines, and Jenkins. The same steps (build → test → scan → publish) implemented in each platform's syntax.
Why it matters: Organizations often run multiple CI platforms (legacy Jenkins + modern GitHub Actions). Parity means teams can migrate without process change — just syntax translation. Maintain shared naming conventions and reusable templates in each system.
05 · CI / CD

🚦 Quality Gates: Linting, SAST & SCA

🔍 Linting & Formatting
Static checks for style, complexity, and obvious bugs. Formatters ensure consistent code style. Examples: golangci-lint, eslint, ruff, terraform fmt. Run on every commit.
🛡️ SAST
Static Application Security Testing — analyzes source code for insecure patterns (SQL injection sinks, hardcoded secrets, unsafe deserialization) without running the app. Tools: semgrep, CodeQL, gosec.
📦 SCA / Dep Scanning
Software Composition Analysis: scans your dependency tree for known CVEs. Third-party dependencies usually make up most of an application's code, so many vulnerabilities enter through them. Tools: Snyk, Dependabot, trivy fs, npm audit.
🐳 Container Scanning
Scans Docker image layers for CVEs. Base image risk ships directly into runtime. Run trivy image in CI and gate on CRITICAL/HIGH severity. Rebuild images regularly to pick up OS patches.
🏗️ IaC Scanning
Security checks against Terraform, Kubernetes manifests, and Dockerfiles. Catches misconfigurations before they're provisioned. Tools: checkov, tfsec, terrascan, kics.
🔑 Secret Detection
Scans code and git history for leaked credentials, API keys, and tokens. Run as pre-commit hook AND in CI. Tools: gitleaks, truffleHog, detect-secrets. Treat every hit as real until proven otherwise.
Beginner sequence: Start with secret detection (highest risk, easy to add) → dependency scanning → SAST → IaC scanning. Each adds protection without overwhelming noise if you start from the highest-signal tools.
06 · Infrastructure

🏗️ Infrastructure as Code

Terraform — Declarative IaC
Beginner → Intermediate
What it is: You write HCL (HashiCorp Configuration Language) files describing the desired state of your cloud infrastructure. Terraform figures out how to make reality match that state. Run terraform plan to preview, terraform apply to execute.
HCL — Terraform Module Structure
# terraform/modules/eks-cluster/main.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"   # pin version! never use latest

  cluster_name    = var.cluster_name
  cluster_version = "1.29"
  vpc_id          = var.vpc_id
  subnet_ids      = var.private_subnet_ids
}

# terraform/environments/production/main.tf
# Environments call modules with environment-specific vars
module "cluster" {
  source       = "../../modules/eks-cluster"
  cluster_name = "prod-eks-01"
  vpc_id       = data.aws_vpc.main.id
}
🟢 Beginner

Separate reusable modules/ from environments/. Always run terraform plan before apply. Start with remote state in S3 + DynamoDB lock table.

🟡 Intermediate

Add Terragrunt for DRY multi-environment configs. Run checkov in CI for policy checks. Use workspace strategies or directory-per-env. Add Infracost for cost estimation in PRs.

Pulumi — IaC with Real Programming Languages
Intermediate
What it is: Like Terraform, but you write infrastructure in TypeScript, Python, Go, or C# instead of HCL. Useful when you need loops, conditionals, and functions that feel more natural in a general-purpose language.
Terraform vs Pulumi: Terraform's HCL is simpler to learn for infrastructure-only teams. Pulumi shines when your infra logic needs real programming abstractions (dynamic resource creation, complex conditional logic). Both are excellent — pick based on your team's background and stick to it.
07 · Infrastructure

Kubernetes Packaging

Core Kubernetes Concepts
Beginner
Deployment
Manages a set of identical pods. Handles rolling updates, rollbacks, and scaling. Your application workload definition.
Service
Stable network endpoint in front of pods. Pods come and go; Services provide a consistent IP/DNS name. Types: ClusterIP, NodePort, LoadBalancer.
ConfigMap / Secret
Externalizes configuration from your container image. ConfigMap for non-sensitive config; Secret for credentials (base64 encoded, not encrypted by default — use external secret stores).
Ingress
HTTP/S routing rules that direct traffic to services. Requires an Ingress Controller (nginx, Traefik, AWS ALB). Add TLS termination here.
Always set: Resource requests/limits (CPU+memory), liveness/readiness probes, Pod Disruption Budgets for critical workloads, and a security context (runAsNonRoot: true). These are the most commonly missing production requirements.
Kustomize vs Helm — When to Use Which
Beginner → Intermediate
⎈ Kustomize
  • Patch-based — overlay diffs on a base
  • No templating language to learn
  • Great for your own app's env variation
  • Built into kubectl
  • Use for: your apps, different env configs
⎈ Helm
  • Template-based — Go templates with values
  • Packaging format with versioned releases
  • Great for distributable, reusable charts
  • Huge ecosystem of community charts
  • Use for: shared infra (Prometheus, cert-manager)
08 · Infrastructure

🔁 GitOps

GitOps Core Principles
Beginner → Intermediate
What it is: Rather than running kubectl apply from a CI pipeline (push model), a controller running inside the cluster watches a Git repo and continuously reconciles cluster state to match it (pull model). The repo is the only truth — no manual kubectl edits.
Dev (Commits to Git) → CI (Build & Update Image Tag) → PR Merge (Config Repo Updated) → Controller (Argo CD / Flux Detects Drift) → Cluster (Auto-Reconciled)
🟢 Beginner

Never edit cluster state with kubectl directly in production. All changes go through Git. If you make a manual change, the GitOps controller will revert it (drift correction).

🟡 Intermediate

Add health checks and sync windows (e.g., no auto-sync during business peak hours). Use app-of-apps pattern in Argo CD for managing many applications. Separate the app repo from the config/infra repo.
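The pull model boils down to a loop that compares desired state (Git) with actual state (cluster) and computes what to change. The sketch below is a conceptual illustration only, not how Argo CD or Flux are implemented; `plan_reconcile` and the dictionary shapes are assumptions.

```python
def plan_reconcile(desired: dict, actual: dict) -> dict:
    """Compare desired state (from Git) with actual state (from the cluster).

    Both arguments map a resource name to its spec. The result lists which
    resources a controller would create, update, or delete to remove drift.
    """
    shared = desired.keys() & actual.keys()
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(name for name in shared if desired[name] != actual[name]),
        "delete": sorted(actual.keys() - desired.keys()),
    }
```

A manually created pod that exists only in the cluster lands in `delete`, and a resource edited by hand lands in `update`, which is why manual changes do not survive the next sync.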

09 · SecOps

🔐 Shift-Left Security

Why "Shift Left"?
Beginner
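What it is: "Shifting left" means running security checks as early as possible, ideally on developer laptops before a line of code is even pushed. The motivation is cost: a commonly cited rule of thumb is that a vulnerability found in design costs roughly 10× less to fix than one found in code review, which in turn costs roughly 100× less than one found in production. Treat the multipliers as orders of magnitude, not measurements.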
Fix in Prod → Pentest Before Release → Scan in CI → Pre-commit Checks → Security in Design
OWASP Top 10 — The Must-Know List
Beginner
# | Risk | Simple Rule
A01 | Broken Access Control | Check authorization on every request server-side. Never trust the client.
A02 | Cryptographic Failures | Enforce TLS 1.2+. Use AES-256-GCM. Ban MD5/SHA-1 for security purposes.
A03 | Injection (SQL, OS, LDAP) | Always use parameterized queries. Never concatenate user input into commands.
A04 | Insecure Design | Threat model before coding. Security is a design requirement, not a patch.
A05 | Security Misconfiguration | IaC scanning in CI. CIS Benchmarks. Default-deny everything.
A06 | Vulnerable Components | SCA scanning + automated dependency updates (Dependabot/Renovate).
A07 | Auth & Session Failures | Use established auth libraries. Enforce MFA. Rotate tokens. Short sessions.
A08 | Software Integrity Failures | Sign artifacts. Verify signatures. Use trusted CI for builds.
A09 | Security Logging & Monitoring Failures | Log authentication and access-control failures. Alert on them.
A10 | Server-Side Request Forgery (SSRF) | Allowlist outbound destinations. Never fetch raw user-supplied URLs.
Vulnerability Severity & SLAs
Beginner → Intermediate
What it is: CVSS (Common Vulnerability Scoring System) scores vulnerabilities 0–10. But CVSS alone doesn't tell you what to fix first. Combine it with EPSS (probability of exploitation) and the CISA KEV catalog (actively exploited in the wild).
Severity | CVSS | Patch SLA | If Unpatchable
CRITICAL | 9.0–10 | 24 hours | Isolate system immediately
HIGH | 7.0–8.9 | 7 days | WAF rule or network ACL
MEDIUM | 4.0–6.9 | 30 days | Document risk acceptance
LOW | 0.1–3.9 | 90 days | Risk acceptance OK
CVSS ≠ priority. A CVSS 7.5 vulnerability with 85% exploitation probability (EPSS) is more urgent than a CVSS 9.0 with 1% probability. Always check the CISA KEV catalog first — those vulnerabilities are actively exploited and must be treated as emergencies.
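The prioritization rule above can be sketched as a sort key: KEV-listed findings first, then by a simple risk score. Multiplying CVSS by EPSS is an illustrative heuristic chosen for this sketch, not an official formula, and the `Finding` class and the CVE identifiers in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve_id: str
    cvss: float    # base score, 0.0 to 10.0
    epss: float    # probability of exploitation, 0.0 to 1.0
    in_kev: bool   # listed in the CISA KEV catalog


def priority_key(finding: Finding) -> tuple:
    # False sorts before True, so KEV entries come first.
    # Within each group, a higher cvss * epss score sorts earlier.
    return (not finding.in_kev, -(finding.cvss * finding.epss))


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Return findings ordered from most to least urgent."""
    return sorted(findings, key=priority_key)
```

With a CVSS 9.0 finding at 1% EPSS, a CVSS 7.5 finding at 85% EPSS, and a CVSS 5.0 finding that appears in KEV, the order comes out as KEV entry, then the 7.5, then the 9.0.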
10 · SecOps

🔑 Identity & Secrets Management

OIDC Federation for CI — No More Static Credentials
Intermediate
What it is: Instead of storing AWS/Azure/GCP credentials as CI secrets (which can be stolen), your CI provider (GitHub Actions, GitLab) becomes a trusted identity provider. It issues short-lived JWT tokens that cloud providers exchange for temporary cloud credentials.
Runner (GitHub Actions Job) → JWT Token (OIDC Token Issued) → Cloud IAM (Validate JWT) → Temp Creds (15-min AWS Role)
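Result: No long-lived cloud credentials are stored in CI, so there is nothing to rotate. Even if a token is intercepted, it expires within minutes (15 minutes in the flow above). Scope trust policies by repository and branch (for GitHub Actions, a subject claim such as repo:<org>/<repo>:ref:refs/heads/main) to limit which pipelines can assume which roles.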
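The branch-scoping idea can be sketched as the check a cloud IAM service performs on the token's claims after it has verified the signature and expiry. This is a simplified illustration, not any provider's actual implementation: `is_trusted` is a hypothetical function, the repository name in the usage note is made up, and signature and expiry validation are deliberately left out.

```python
from fnmatch import fnmatchcase


def is_trusted(claims: dict, allowed_subjects: list[str], audience: str) -> bool:
    """Decide whether already-verified OIDC claims may assume a role.

    allowed_subjects holds exact subjects or wildcard patterns, similar in
    spirit to a StringLike condition in a trust policy.
    """
    if claims.get("aud") != audience:
        return False
    subject = claims.get("sub", "")
    return any(fnmatchcase(subject, pattern) for pattern in allowed_subjects)
```

With `allowed_subjects=["repo:example-org/example-repo:ref:refs/heads/main"]`, a job running on a feature branch of the same repository is rejected because its `sub` claim ends in a different ref. Prefer exact subjects over wildcards wherever possible.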
Secrets Anti-Patterns vs Best Practices
Beginner → Intermediate
✕ Never Do This
  • Hardcode secrets in source code
  • Store secrets in .env files committed to git
  • Share secrets via Slack or email
  • Use same credential across dev/staging/prod
  • Set secrets that never rotate
  • Log secrets in CI pipeline output
✓ Always Do This
  • Use a central secret store (Vault, AWS Secrets Manager)
  • Use dynamic secrets with short TTLs
  • Separate secrets per environment
  • Automate rotation and test it
  • Audit all secret access (who, when, what)
  • Run secret detection as a pre-commit hook
HashiCorp Vault — Dynamic DB Credentials
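# Vault generates a unique, time-limited DB user per request
# App authenticates to Vault using its cloud IAM role or SPIFFE identity

# 1. Configure the database secrets engine
#    (username/password are the admin login Vault uses to create users; placeholder values)
vault write database/config/my-postgres \
    plugin_name=postgresql-database-plugin \
    connection_url="postgresql://{{username}}:{{password}}@db:5432/mydb" \
    allowed_roles="app-role" \
    username="vault-admin" \
    password="<admin-password>"

# 2. Define the role: how users are created and how long they live
vault write database/roles/app-role \
    db_name=my-postgres \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';" \
    default_ttl="1h" \
    max_ttl="24h"

# 3. App requests credentials at startup (TTL: 1 hour)
vault read database/creds/app-role
# → username: v-app-role-xyz789  (auto-created Postgres user)
# → password: <64-char random>   (auto-revoked after 1 hour)

# The app never holds a static credential: Vault generates them fresh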
External Secrets Operator (ESO)
Intermediate
What it is: A Kubernetes controller that reads secrets from external stores (AWS Secrets Manager, Vault, Azure Key Vault) and automatically syncs them as Kubernetes Secret objects. Your Git repo never contains secret values — it only contains references.
ExternalSecret — Kubernetes YAML
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-app-secrets
spec:
  refreshInterval: "1h"           # Re-sync from source every hour
  secretStoreRef:
    name: aws-secrets-manager     # Reference to the SecretStore
    kind: ClusterSecretStore
  target:
    name: my-app-secrets          # Name of the resulting K8s Secret
  data:
    - secretKey: DB_PASSWORD      # Key in the K8s Secret
      remoteRef:
        key: prod/myapp/db        # Path in AWS Secrets Manager
        property: password
11 · SecOps

📜 Policy as Code

Kyverno — Kubernetes Admission Policies
Intermediate
What it is: Kyverno is a policy engine that runs as a Kubernetes admission webhook. Every resource creation/update must pass your policies before it's allowed into the cluster. Policies can validate, mutate, or generate resources.
Kyverno — Require Resource Limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce   # Block if violated (vs. Audit = just warn)
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "CPU and memory limits are required for all containers."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
🟢 Beginner

Start in Audit mode to see what would fail without breaking things. Common beginner policies: require labels, require resource limits, require non-root containers.

🟡 Intermediate

Move critical policies to Enforce mode. Add Kyverno Policy Reports to your observability stack. Namespace-scope enforcement for gradual rollout. Exception workflows for approved deviations.

Conftest & OPA — Policy Checks in CI
Intermediate
What it is: While Kyverno enforces at the cluster admission layer, Conftest runs OPA (Open Policy Agent) Rego policies against files in CI — Terraform plans, Kubernetes manifests, Dockerfiles. Catches violations before they reach the cluster.
Defense in depth: Run Conftest in CI (catches violations in PRs) AND Kyverno in the cluster (catches anything that slips through). Two independent enforcement layers at two different points in the pipeline.
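To show what a CI-side policy check does, here is the "require resource limits" rule written in plain Python. Conftest itself evaluates Rego policies, so this sketch swaps Rego for Python purely for illustration; `missing_limits` is a hypothetical helper that takes a manifest already parsed into a dictionary.

```python
def missing_limits(manifest: dict) -> list[str]:
    """Return one violation message per missing CPU or memory limit."""
    if manifest.get("kind") == "Pod":
        pod_spec = manifest.get("spec", {})
    else:
        # Deployments and similar workloads nest the pod spec under spec.template.
        pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})

    violations = []
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if not limits.get(resource):
                name = container.get("name", "<unnamed>")
                violations.append(f"{name}: missing {resource} limit")
    return violations
```

A CI step would fail the build when the returned list is non-empty and print the messages in the PR, which is the same feedback loop a Rego `deny` rule provides.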
12 · SecOps

⛓️ Supply Chain Security

📋 SBOM
Software Bill of Materials — a machine-readable inventory of every component in your artifact (libraries, OS packages, transitive deps). Generated with syft or cyclonedx. Required for vulnerability impact analysis: "Are we affected by Log4Shell?"
✍️ Artifact Signing
Cryptographically sign container images with cosign (Sigstore). Verify signatures at deploy time via admission policy. Ensures the image in production came from your trusted CI pipeline, not an attacker.
📦 Provenance
Evidence of how and where an artifact was built. SLSA (Supply chain Levels for Software Artifacts) is the framework. Level 2+: builds happen in a hosted CI, generating signed provenance attestations.
📌 Dependency Pinning
Pin base images by digest (FROM ubuntu@sha256:abc... not FROM ubuntu:latest). Pin dependencies at exact versions. Unpinned dependencies are an easy supply chain attack vector: a republished tag or a compromised new release is pulled in on the next build.
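Digest pinning is easy to check mechanically. The sketch below scans Dockerfile text for FROM lines that are not pinned by digest; it is a simplified illustration (the function name is an assumption, and build arguments inside image references are not resolved), not a replacement for a real linter.

```python
import re

# Matches "FROM [--platform=...] image", capturing the image reference.
FROM_LINE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(?P<image>\S+)", re.IGNORECASE
)


def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return base images that are referenced without an @sha256 digest."""
    stage_names = set()
    unpinned = []
    for line in dockerfile_text.splitlines():
        match = FROM_LINE.match(line)
        if match is None:
            continue
        image = match["image"]
        # "scratch" and earlier build stages are not registry images.
        is_local = image == "scratch" or image in stage_names
        if not is_local and "@sha256:" not in image:
            unpinned.append(image)
        tokens = line.split()
        if len(tokens) >= 4 and tokens[-2].lower() == "as":
            stage_names.add(tokens[-1])   # remember "FROM x AS name" aliases
    return unpinned
```

Run it in CI against every Dockerfile in the repository and fail the job when the list is non-empty.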
13 · SecOps

🚨 Runtime Security & Incident Response

Incident Response Lifecycle (PICERL)
Beginner → Intermediate
01 Prepare
02 Identify
03 Contain
04 Eradicate
05 Recover
06 Lessons Learned
🚫 Critical order: Contain before eradicate. Never remove malware or patch the entry point before containing the incident. Doing so destroys forensic evidence, tips off the attacker, and may trigger data destruction. Isolate first, preserve memory and disk images, then eradicate.
Compliance Frameworks — Which One Applies to You?
Intermediate
Framework | When It Applies | Key Focus
SOC 2 Type II | SaaS / cloud services with enterprise customers | Trust Services Criteria: security, availability, processing integrity, confidentiality, privacy
ISO 27001 | Any org seeking certification for third-party trust | ISMS: 93 Annex A controls in the 2022 edition (114 in 2013), annual surveillance audits, recertification every three years
PCI DSS v4.0 | Handling credit card / cardholder data | 12 requirements, mandatory, audited by QSA
GDPR | Processing personal data of EU/UK residents | Lawful basis, 72h breach notification, DPO
HIPAA | US healthcare (PHI handling) | PHI safeguards, BAAs, breach notification
Compliance ≠ Security. Passing a SOC 2 Type II audit means a specific set of controls operated effectively over the audit period. It does not mean you are secure. Use compliance to formalize your program, but build actual security posture around threat-informed defense.
14 · Observability

📡 The Three Pillars

📈 Metrics
What: Numeric measurements over time — request rate, error rate, latency, CPU, memory.

Answer: "Is something wrong? How often? How bad?"

Tools: Prometheus (scrape-based), Grafana (visualization), OpenTelemetry SDK.
  • Track 4 Golden Signals: latency, traffic, errors, saturation
  • Use histograms for latency, not averages
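The advice to prefer histograms over averages is easiest to see with numbers. The sketch below computes percentiles from raw samples with a nearest-rank method; Prometheus histograms approximate the same quantities from bucket counts. The latency values are hypothetical.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 95 fast requests, 5 very slow ones.
latencies_ms = [100] * 95 + [2000] * 5

average = mean(latencies_ms)          # 195: looks healthy
p50 = percentile(latencies_ms, 50)    # 100
p99 = percentile(latencies_ms, 99)    # 2000: what the slowest users experience
```

The average hides the fact that one request in twenty takes two seconds, while the p99 exposes it immediately.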
📋 Logs
What: Timestamped records of discrete events — structured JSON events emitted by your application.

Answer: "What exactly happened at this moment?"

Tools: Loki (log aggregation), Elasticsearch/OpenSearch, Datadog.

  • Always use structured JSON logs
  • Include correlation IDs linking to traces
  • Never log PII or credentials
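A minimal sketch of structured JSON logging with a correlation ID, using only the Python standard library. The field names (`ts`, `trace_id`) and the helper `build_logger` are assumptions for illustration; production setups often use a logging library or the OpenTelemetry SDK instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # trace_id is a hypothetical field supplied via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage: `build_logger().info("payment authorized", extra={"trace_id": "4bf92f35"})` emits one JSON line that a log backend can index, and the `trace_id` field lets you jump from the log entry to the matching trace.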
🔍 Traces
What: End-to-end journey of a single request through multiple services, showing timing at each hop.

Answer: "Where is the bottleneck in this request path?"

Tools: Tempo, Jaeger, Zipkin, OpenTelemetry.

  • Instrument entry points (HTTP handlers, consumers)
  • Propagate context via W3C Trace Context headers
OpenTelemetry — The Unified Standard
Intermediate
What it is: A vendor-neutral standard and SDK for instrumenting your application to emit metrics, logs, and traces in a consistent format. Write instrumentation once, send to any backend (Datadog, Grafana Cloud, AWS X-Ray, Jaeger).
Why it matters: If you instrument with a vendor-specific SDK (Datadog agent, New Relic SDK), you're locked in. OpenTelemetry lets you swap observability backends without re-instrumenting your code. Use the OTel Collector as a routing layer between your apps and backends.
15 · Observability

📊 SLOs & Error Budgets

SLI → SLO → SLA
Beginner → Intermediate
Term | What It Is | Example
SLI (Service Level Indicator) | The actual metric you measure | % of HTTP requests returning 2xx in under 500ms
SLO (Service Level Objective) | Your target for that metric | 99.9% of requests must be successful over a 30-day window
SLA (Service Level Agreement) | External contractual commitment, usually with penalties if missed | 99.5% uptime in the contract (looser than the SLO, which leaves the org a buffer)
Error Budget | The allowed unreliability from the SLO | 99.9% SLO = 0.1% budget = 43.2 min of allowed downtime per 30-day window
🟢 Beginner

Start with one availability SLO per critical service (e.g., "99.5% of checkout requests succeed over a rolling 30-day window"). Track it. That's it. One measured SLO is infinitely better than zero.

🟡 Intermediate

Add latency SLOs. Define error budget burn rate alerts (multi-window: 1h + 6h). When burn rate is too high, halt feature work and focus on reliability. Error budget = the bridge between product and operations teams.
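The error budget and burn rate arithmetic can be sketched in a few lines. The function names are assumptions, and the default paging threshold of 14.4 is an illustrative value (at that rate, one hour consumes 2% of a 30-day budget); tune it to your own policy.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / (1.0 - slo)


def should_page(ratio_1h: float, ratio_6h: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows burn above the threshold.

    Requiring both windows filters out short spikes that have already
    recovered as well as slow burns that are not yet urgent.
    """
    return (burn_rate(ratio_1h, slo) >= threshold
            and burn_rate(ratio_6h, slo) >= threshold)
```

For a 99.9% SLO over 30 days, `error_budget_minutes(0.999)` comes to about 43.2 minutes, and a sustained 0.2% error ratio gives a burn rate of about 2, meaning the budget would be gone in roughly half the window.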

16 · Observability

🔔 Alerting Best Practices

Alert Design Principles
Beginner → Intermediate
📢 Alert on Symptoms
Alert on user-visible impact (high error rate, slow response) not on causes (CPU at 80%). A cause without a symptom might be fine. A symptom always needs response.
🎯 Actionable Only
Every alert must have a runbook. If an alert fires and the responder doesn't know what to do, either the alert is wrong or the runbook is missing. Fix both.
🛤️ Routed by Severity
P1/SEV1: page immediately 24/7. P2: page business hours. P3: ticket + next business day. Use Alertmanager (Prometheus) or PagerDuty routing rules.
🔇 Deduplicate
Group related alerts to prevent alert storms. One database failure should generate one page, not 500 "service can't connect to DB" alerts. Alertmanager's group_by handles this.
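Grouping works by collapsing alerts that share the same values for a chosen set of labels. The sketch below shows the idea in plain Python; it is a conceptual illustration of what a `group_by` setting achieves, not Alertmanager's implementation, and the alert shape is an assumption.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict:
    """Bucket alerts by the values of the group_by labels.

    Each bucket would produce one notification instead of one per alert.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)
```

If 500 services each raise a "cannot reach database" alert with the same `alertname` and `cluster` labels, grouping by those two labels yields a single bucket and therefore a single page.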
Alert fatigue is a security risk. When on-call engineers receive too many low-quality alerts, they start ignoring them. A critical security alert buried in noise gets missed. Ruthlessly prune alerts that don't result in action.
17 · FinOps

💰 The FinOps Operating Loop

What is FinOps?
Beginner
What it is: FinOps (Financial Operations) is a practice where engineering teams take ownership of cloud spending. Instead of receiving a surprise bill at the end of the month, teams have real-time visibility into what they're spending and why, with automated guardrails to prevent waste.
1. Inform (Measure & Allocate Costs) → 2. Optimize (Rightsize & Eliminate Waste) → 3. Govern (Policy Guardrails) → 4. Verify (Measure Savings) → Repeat (Continuous Loop)
Rightsizing Workloads
Beginner → Intermediate
What it is: Matching CPU/memory requests and limits to what workloads actually use. Overprovisioned pods waste node resources; underprovisioned pods get CPU-throttled or OOMKilled. Run VPA (Vertical Pod Autoscaler) in recommendation-only mode (updateMode: "Off") to see what workloads actually need.
VPA Recommendation Check
# Check VPA recommendations for a deployment
kubectl describe vpa my-app -n production

# Output shows what VPA recommends vs what's currently set:
#   Lower Bound:  cpu: 50m   memory: 64Mi   (min safe)
#   Target:       cpu: 120m  memory: 200Mi  (recommended)
#   Upper Bound:  cpu: 400m  memory: 512Mi  (requests above this are likely wasted)

# If your requests are set at: cpu: 2000m  memory: 4Gi
# → You're requesting about 16× the recommended CPU (2000m vs 120m)
# → Reducing saves money and improves cluster density
🟢 Beginner

Start by identifying your 10 largest resource consumers with kubectl top pods -A --sort-by=cpu (or --sort-by=memory). The command shows actual usage, so compare it against the requests in each workload's manifest. Even a 20% reduction in your top 10 consumers can mean significant monthly savings.

🟡 Intermediate

Automate rightsizing recommendations as pull requests using scripts that read VPA recommendations and generate manifest diffs. Gate changes through PR review to prevent accidental under-provisioning.
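A rightsizing script needs to compare Kubernetes CPU quantities such as "2000m" and "2". The sketch below covers only the plain and millicore forms; the function names are assumptions, and a real script would read the values from the VPA object rather than take them as strings.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity ("500m", "2", "0.5") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def overprovision_ratio(requested: str, recommended: str) -> float:
    """How many times larger the request is than the recommendation."""
    return cpu_millicores(requested) / cpu_millicores(recommended)
```

For a request of 2000m against a recommended target of 120m, the ratio is about 16.7, which is the kind of gap worth turning into a pull request.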

18 · FinOps

🏷️ Cost Allocation & Governance

Resource Labeling Strategy
Beginner
What it is: Every cloud resource and Kubernetes workload gets standard labels/tags that enable cost attribution. Without labels, you can't answer "which team owns this $30k/month RDS instance?"
Required Labels — Kubernetes / Cloud
# Kubernetes pod labels (enforced via Kyverno policy)
labels:
  app.kubernetes.io/name: "payments-api"
  # team and env are not part of the standard app.kubernetes.io label set,
  # so use plain or organization-prefixed keys for them
  team: "platform-team"
  env: "production"
  cost-center: "eng-platform-001"

# AWS resource tags (enforce via SCP / tag policy)
tags = {
  Team        = "platform"
  Environment = "production"
  Service     = "payments-api"
  CostCenter  = "eng-platform-001"
}

# Untagged = unallocated cost = no accountability
# Enforce minimum tags at org level via SCP before provisioning
Infracost — Cost Estimation in CI
Intermediate
What it is: Infracost reads your Terraform plan and generates a cost diff comment on every PR — "this change will add $847/month to your AWS bill." Engineers see cost impact before merging, not on the next invoice.
Real impact: A developer accidentally provisions a GPU instance when they meant a regular EC2 — that's a $4,000/month mistake. Infracost in CI catches it at PR review time. Add a policy gate to block PRs that increase cost by more than 20% without a justification comment.
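The 20% policy gate can be sketched as a small function applied to the baseline and proposed monthly costs. This is a hypothetical post-processing step written for illustration, not Infracost's own policy mechanism; how the two cost figures are extracted from the tool's output is left out.

```python
def cost_gate(baseline_monthly: float, proposed_monthly: float,
              has_justification: bool, max_increase_pct: float = 20.0):
    """Return (allowed, increase_pct) for a pull request's cost change."""
    if baseline_monthly <= 0:
        # No baseline: any new spend counts as an unbounded increase.
        increase_pct = float("inf") if proposed_monthly > 0 else 0.0
    else:
        increase_pct = (proposed_monthly - baseline_monthly) / baseline_monthly * 100
    allowed = increase_pct <= max_increase_pct or has_justification
    return allowed, increase_pct
```

A change that takes a $1,000/month stack to $1,847/month is an increase of roughly 85%, so the gate blocks it unless the PR carries a justification.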
19 · Governance

📋 Service Catalog & Ownership

Service Catalog Concepts
Beginner → Intermediate
Every production service should have a catalog entry that answers: Who owns this? What does it depend on? Where are the runbooks? What SLO does it have? What tier is it (critical vs non-critical)?
service-catalog.yaml — Example Entry
services:
  - name: payments-api
    description: "Handles payment processing and fraud detection"
    tier: P0                      # P0 = business-critical
    team: platform-payments
    oncall: "pagerduty://payments-oncall"
    slo:
      availability: "99.9%"
      latency_p99: "500ms"
    runbooks:
      incident: "https://wiki.internal/runbooks/payments-api"
      deployment: "https://wiki.internal/runbooks/payments-deploy"
    dependencies:
      - postgres-primary
      - redis-cache
      - stripe-api (external)
🟢 Beginner

Register every production service. Even a simple list with owner + oncall contact is transformative during an incident. The first question is always "who owns this?"

🟡 Intermediate

Generate CODEOWNERS automatically from catalog metadata to avoid drift. Validate catalog entries in CI (schema check + URL validation). Enforce registration as a production deployment gate.
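Catalog validation in CI can start as a simple required-fields check. The sketch below assumes the catalog has already been parsed into dictionaries; the list of required fields and the https-only rule for runbook links are illustrative choices, not a standard schema.

```python
REQUIRED_FIELDS = ("name", "tier", "team", "oncall")


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one catalog entry (empty if valid)."""
    errors = [f"missing field: {field}" for field in REQUIRED_FIELDS
              if not entry.get(field)]
    for label, url in entry.get("runbooks", {}).items():
        if not str(url).startswith("https://"):
            errors.append(f"runbook {label!r} is not an https URL")
    return errors
```

A CI job would run this over every entry and fail when any list is non-empty, which keeps ownership data from silently rotting.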

20 · Reference

🎯 Learning Path

Beginner → Intermediate Progression
All Levels
Phase | Timeline | Concepts | Goal
Phase 1 | Week 1–2 | Git branching, conventional commits, dev containers, Makefile | Every developer working identically. Code history is readable. No "works on my machine."
Phase 2 | Week 3–4 | Pre-commit hooks, basic CI pipeline, secret detection, dep scanning | Automated quality gates on every push. Security checks running in CI.
Phase 3 | Month 2 | Terraform basics, Kubernetes manifests, container scanning, IaC scanning | Infrastructure defined as code. Deployments reproducible.
Phase 4 | Month 3 | Kustomize/Helm, GitOps with Argo CD, OIDC federation, secrets management | GitOps-driven deployments. No static credentials anywhere.
Phase 5 | Month 4–5 | Policy as code (Kyverno), observability (metrics + logs + traces), SLOs | Governance automated. Reliability is measured and owned.
Phase 6 | Ongoing | FinOps loop, service catalog, supply chain security, incident response playbooks | Full platform engineering maturity. Cost visible. Ownership clear. Incidents structured.
The meta-principle: Every practice in this guide follows the same pattern: Observe first, then enforce. Run your linter in warn mode before failing CI. Run Kyverno in Audit before Enforce. Map traffic flows before micro-segmenting. Run Infracost before blocking PRs. Visibility always comes before control — you cannot govern what you cannot see.
Concept | Primary Tool(s) | Learn First
Branching | Git, GitHub/GitLab | YES (Day 1)
CI Pipeline | GitHub Actions, GitLab CI | YES (Week 1)
Secret Detection | gitleaks, truffleHog | YES (Week 1)
IaC | Terraform, Pulumi | Month 1
Kubernetes | kubectl, k9s, Lens | Month 1–2
GitOps | Argo CD, Flux | Month 2–3
SLOs | Prometheus, Grafana | Month 3–4
Policy as Code | Kyverno, OPA/Conftest | After Kubernetes basics
Supply Chain Sec | cosign, syft, SLSA | After CI/CD mastery
FinOps | Infracost, VPA, Kubecost | After infra is running