// Beginner → Intermediate Field Guide

DevOps · SecOps & FinOps: Concepts Handbook

A practical reference covering the full engineering lifecycle — from writing your first Dockerfile to owning incident response, cost governance, and everything in between. Every concept explained at two depths.

Source Control · CI / CD · Kubernetes · GitOps · SecOps · Observability · FinOps

01 · Foundations

🧭 The Engineering Lifecycle

Before learning tools, understand the loop every platform engineering team runs — from code to shipped, secured, observed, and optimized.

The 12-Step Lifecycle Mental Model

Beginner

Every engineering platform is just this loop, made more reliable and faster over time. Understanding where a tool or practice fits in the loop is more valuable than memorizing any individual tool.

Build (Containers & Packages) → Validate (Pre-commit Hooks) → Test, Lint, Scan → Provision (Terraform / Pulumi) → Deploy (Kubernetes / GitOps) → Observe (Metrics, Logs, Traces)

ℹ

The golden rule: Every step in this loop is something you want to automate, version-control, and make repeatable. A manual step is a reliability risk. If you're doing something manually more than once — automate it.

02 · Foundations

🌿 Source Control & Team Flow

Git branching, commit standards, and versioning — the foundation everything else builds on.

Branching Strategy

Beginner → Intermediate

What it is: A rule set for how teams create, name, merge, and protect branches in Git. Defines the path code takes from idea to production.

🟢 Beginner

Use short-lived feature branches off main. Merge often. Never commit directly to main. Keep main always deployable.

🟡 Intermediate

Add branch protection rules requiring PR reviews + passing CI. Use release branches only when your cadence demands them. Enforce naming conventions (feat/, fix/).

✕ Anti-Patterns

Long-lived feature branches (merge hell)
Committing directly to main/master
No branch naming convention
Merging without CI checks passing

✓ Good Patterns

Short-lived branches, merged within days
Required PR review + status checks
Consistent naming: feat/, fix/, chore/
Delete branches after merge

Conventional Commits

Beginner → Intermediate

What it is: A lightweight specification for commit messages. Format: type(scope): description. Enables automated changelogs, semantic versioning, and better history readability.

Commit Format

# Pattern: type(scope): short description

feat(auth): add FIDO2 passkey support          # new feature → MINOR bump
fix(api): correct rate limit header values    # bug fix → PATCH bump
chore(deps): update go to 1.22.1             # maintenance, no version bump
docs(readme): add deployment guide            # documentation only
refactor(payments): extract retry logic       # no behavior change
ci(github): add Trivy container scan step     # CI changes

# Breaking change — MAJOR bump
feat(api)!: remove deprecated v1 endpoints

BREAKING CHANGE: v1/users endpoint removed. Migrate to v2/users.

🟢 Beginner

Learn the 5 common types: feat, fix, chore, docs, refactor. Just using these consistently makes your history dramatically more readable.

🟡 Intermediate

Wire commit types to semantic-release or release-please to automate version bumps and CHANGELOG generation. Add commitlint as a commit-msg hook to enforce the format (it has to run after the message is written, so a pre-commit hook is too early).
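A minimal sketch of the mapping that tools like semantic-release apply, assuming the Conventional Commits placement of `!` directly before the colon. The function name `bump_for` is made up for illustration, and real tools support more types and configuration than this.

```python
import re

# Header shape: type(scope)!: description  (scope and "!" are optional)
HEADER = re.compile(r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<bang>!)?: (?P<desc>.+)$")


def bump_for(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for one commit message."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError(f"not a conventional commit: {message!r}")
    # A "!" in the header or a BREAKING CHANGE footer forces a major bump.
    if match["bang"] or any(line.startswith("BREAKING CHANGE:") for line in lines[1:]):
        return "major"
    # Only feat and fix change the version; chore, docs, refactor, ci do not.
    return {"feat": "minor", "fix": "patch"}.get(match["type"], "none")
```

A release job would take the highest bump across all commits since the last tag.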

Semantic Versioning (SemVer)

Beginner

What it is: Version numbers in the form MAJOR.MINOR.PATCH where each part has a precise meaning based on backward compatibility.

Part	Increments When	Example
MAJOR	Breaking API change — existing clients may break	`2.0.0`
MINOR	New feature added, backward compatible	`1.3.0`
PATCH	Bug fix, backward compatible	`1.3.7`
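The table's rules can be sketched as a small function. This is an illustration only: it assumes plain MAJOR.MINOR.PATCH strings and ignores pre-release and build metadata, and the name `bump` is hypothetical.

```python
def bump(version: str, part: str) -> str:
    """Increment one part of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"            # breaking change resets minor and patch
    if part == "minor":
        return f"{major}.{minor + 1}.0"      # new feature resets patch
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```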

03 · Foundations

💻 Developer Experience

Dev containers, task runners, and pre-commit hooks — making every developer's machine work the same way.

Dev Containers

Beginner

What it is: A Docker-based development environment defined in .devcontainer/ that ensures every developer uses identical tool versions, CLI dependencies, and configurations — regardless of their host OS.

✅

The pain it solves: "Works on my machine" is eliminated. When you onboard someone new, they open VS Code, click "Reopen in Container", and have a fully working environment in minutes — same Go version, same Terraform version, same linters as everyone else.

🟢 Beginner

Open the repo in VS Code with the Dev Containers extension. All tools are pre-installed. Run the same make commands as your teammates.

🟡 Intermediate

Pin exact tool versions in the Dockerfile. Add a post-create.sh to auto-run setup (install hooks, configure git). Review pinned versions quarterly.

Makefile / Taskfile — Task Runners

Beginner

What it is: Standardized entrypoints for common developer commands. Instead of every developer remembering 8 different CLI flags, they run make test or task build.

Makefile — Common Targets

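# Common pattern: declare phony targets, then one command per target
# VERSION is the image tag; override it with `make build VERSION=1.2.3`
VERSION ?= $(shell git describe --tags --always --dirty)

.PHONY: lint test build scan ci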

lint:
	golangci-lint run ./...

test:
	go test -race -coverprofile=coverage.out ./...
	go tool cover -html=coverage.out -o coverage.html

build:
	docker build --platform linux/amd64 -t myapp:$(VERSION) .

scan:
	trivy image myapp:$(VERSION) --severity CRITICAL,HIGH --exit-code 1

ci: lint test build scan   # run the full local CI loop

Pre-Commit Hooks

Beginner → Intermediate

What it is: Git hooks that run automated checks before a commit or push lands. Catch formatting issues, secret leaks, and obvious bugs without waiting for CI.

.pre-commit-config.yaml

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: detect-private-key      # blocks commit if private key found

  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks                 # secret detection on staged files

  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.86.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate

04 · CI / CD

🔄 Continuous Integration

Automated build, test, and scan on every change — the shared quality gate before merging code.

What CI Does (and Why)

Beginner

What it is: Every time code is pushed to a branch or a PR is opened, an automated pipeline runs: compiles the code, runs tests, checks formatting, scans for vulnerabilities, and produces a build artifact. If any step fails, the PR is blocked.

Nothing → Manual Tests → Tests on Push → Test + Lint + Scan → Full Pipeline + Artifacts

ℹ

Build once, promote everywhere: A key CI principle is to build your artifact once and promote that exact artifact through environments (dev → staging → prod). Never rebuild per environment — different builds are different code.

GitHub Actions — Anatomy of a Pipeline

Beginner → Intermediate

.github/workflows/ci.yml

name: CI

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with: { go-version: '1.22' }

      - name: Run linter
        uses: golangci/golangci-lint-action@v6   # installs golangci-lint, then runs it

      - name: Run tests
        run: go test -race -coverprofile=coverage.out ./...

  security-scan:
    needs: lint-and-test       # only runs if lint-and-test passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trivy vulnerability scan
        uses: aquasecurity/trivy-action@0.28.0   # pin a release tag or commit SHA, not a moving branch
        with:
          scan-type: 'fs'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'      # fail CI if found

Pipeline Parity — Multi-Provider

Intermediate

What it is: Keeping conceptually equivalent pipelines across GitHub Actions, GitLab CI, Azure Pipelines, and Jenkins. The same steps (build → test → scan → publish) implemented in each platform's syntax.

⚠

Why it matters: Organizations often run multiple CI platforms (legacy Jenkins + modern GitHub Actions). Parity means teams can migrate without process change — just syntax translation. Maintain shared naming conventions and reusable templates in each system.

05 · CI / CD

🚦 Quality Gates: Linting, SAST & SCA

The automated checks that stand between your code and production.

🔍 Linting & Formatting

Static checks for style, complexity, and obvious bugs. Formatters ensure consistent code style. Examples: golangci-lint, eslint, ruff, terraform fmt. Run on every commit.

🛡️ SAST

Static Application Security Testing — analyzes source code for insecure patterns (SQL injection sinks, hardcoded secrets, unsafe deserialization) without running the app. Tools: semgrep, CodeQL, gosec.

📦 SCA / Dep Scanning

Software Composition Analysis — scans your dependency tree for known CVEs. Most app vulnerabilities enter through dependencies. Tools: Snyk, Dependabot, trivy fs, npm audit.

🐳 Container Scanning

Scans Docker image layers for CVEs. Base image risk ships directly into runtime. Run trivy image in CI and gate on CRITICAL/HIGH severity. Rebuild images regularly to pick up OS patches.

🏗️ IaC Scanning

Security checks against Terraform, Kubernetes manifests, and Dockerfiles. Catches misconfigurations before they're provisioned. Tools: checkov, tfsec, terrascan, kics.

🔑 Secret Detection

Scans code and git history for leaked credentials, API keys, and tokens. Run as pre-commit hook AND in CI. Tools: gitleaks, truffleHog, detect-secrets. Treat every hit as real until proven otherwise.

✅

Beginner sequence: Start with secret detection (highest risk, easy to add) → dependency scanning → SAST → IaC scanning. Each adds protection without overwhelming noise if you start from the highest-signal tools.
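To make the secret-detection step concrete, here is a toy scanner. It is a sketch, not how gitleaks is implemented: it assumes just two well-known patterns, whereas real scanners ship large rule sets plus entropy checks. The names `RULES` and `scan` are made up for illustration.

```python
import re

# Two well-known shapes only, chosen for illustration.
RULES = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

A pre-commit hook would run this over staged content and exit non-zero when `scan` returns anything.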

06 · Infrastructure

🏗️ Infrastructure as Code

Terraform and Pulumi — defining cloud resources in version-controlled code instead of clicking in the console.

Terraform — Declarative IaC

Beginner → Intermediate

What it is: You write HCL (HashiCorp Configuration Language) files describing the desired state of your cloud infrastructure. Terraform figures out how to make reality match that state. Run terraform plan to preview, terraform apply to execute.

HCL — Terraform Module Structure

# terraform/modules/eks-cluster/main.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"     # pin version! never use latest

  cluster_name    = var.cluster_name
  cluster_version = "1.29"
  vpc_id          = var.vpc_id
  subnet_ids      = var.private_subnet_ids
}

# terraform/environments/production/main.tf
# Environments call modules with environment-specific vars
module "cluster" {
  source             = "../../modules/eks-cluster"
  cluster_name       = "prod-eks-01"
  vpc_id             = data.aws_vpc.main.id
  private_subnet_ids = data.aws_subnets.private.ids   # the module reads var.private_subnet_ids
}

🟢 Beginner

Separate reusable modules/ from environments/. Always run terraform plan before apply. Start with remote state in S3 + DynamoDB lock table.

🟡 Intermediate

Add Terragrunt for DRY multi-environment configs. Run checkov in CI for policy checks. Use workspace strategies or directory-per-env. Add Infracost for cost estimation in PRs.

Pulumi — IaC with Real Programming Languages

Intermediate

What it is: Like Terraform, but you write infrastructure in TypeScript, Python, Go, or C# instead of HCL. Useful when you need loops, conditionals, and functions that feel more natural in a general-purpose language.

ℹ

Terraform vs Pulumi: Terraform's HCL is simpler to learn for infrastructure-only teams. Pulumi shines when your infra logic needs real programming abstractions (dynamic resource creation, complex conditional logic). Both are excellent — pick based on your team's background and stick to it.

07 · Infrastructure

⎈ Kubernetes Packaging

Manifests, Kustomize overlays, and Helm charts — the three layers of Kubernetes config management.

Core Kubernetes Concepts

Beginner

Deployment

Manages a set of identical pods. Handles rolling updates, rollbacks, and scaling. Your application workload definition.

Service

Stable network endpoint in front of pods. Pods come and go; Services provide a consistent IP/DNS name. Types: ClusterIP, NodePort, LoadBalancer.

ConfigMap / Secret

Externalizes configuration from your container image. ConfigMap for non-sensitive config; Secret for credentials (base64 encoded, not encrypted by default — use external secret stores).

Ingress

HTTP/S routing rules that direct traffic to services. Requires an Ingress Controller (nginx, Traefik, AWS ALB). Add TLS termination here.

⚠

Always set: Resource requests/limits (CPU+memory), liveness/readiness probes, Pod Disruption Budgets for critical workloads, and a security context (runAsNonRoot: true). These are the most commonly missing production requirements.

Kustomize vs Helm — When to Use Which

Beginner → Intermediate

⎈ Kustomize

Patch-based — overlay diffs on a base
No templating language to learn
Great for your own app's env variation
Built into kubectl
Use for: your apps, different env configs

⎈ Helm

Template-based — Go templates with values
Packaging format with versioned releases
Great for distributable, reusable charts
Huge ecosystem of community charts
Use for: shared infra (Prometheus, cert-manager)

08 · Infrastructure

🔁 GitOps

Git as the single source of truth. Controllers automatically reconcile cluster state to match what's in the repo.

GitOps Core Principles

Beginner → Intermediate

What it is: Rather than running kubectl apply from a CI pipeline (push model), a controller running inside the cluster watches a Git repo and continuously reconciles cluster state to match it (pull model). The repo is the only truth — no manual kubectl edits.

Dev (Commits to Git) → Build & Update Image Tag → PR Merge (Config Repo Updated) → Controller (Argo CD / Flux Detects Drift) → Cluster (Auto-Reconciled)

🟢 Beginner

Never edit cluster state with kubectl directly in production. All changes go through Git. If you make a manual change, the GitOps controller will revert it (drift correction).

🟡 Intermediate

Add health checks and sync windows (e.g., no auto-sync during business peak hours). Use app-of-apps pattern in Argo CD for managing many applications. Separate the app repo from the config/infra repo.
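The reconcile step can be pictured as a diff between two maps. The sketch below is a simplification under stated assumptions: resources are plain dicts keyed by name, and `plan_reconcile` is a hypothetical helper, not an Argo CD or Flux API. Real controllers compare live Kubernetes objects and only delete when pruning is enabled.

```python
def plan_reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired state (Git) with actual state (cluster), keyed by resource name."""
    return {
        # In Git but not in the cluster: needs creating.
        "create": sorted(desired.keys() - actual.keys()),
        # In both but different: the cluster has drifted or Git has moved on.
        "update": sorted(
            name for name in desired.keys() & actual.keys() if desired[name] != actual[name]
        ),
        # In the cluster but not in Git: a manual change to revert.
        "delete": sorted(actual.keys() - desired.keys()),
    }
```

A controller runs this comparison on a loop, which is why a manual `kubectl` edit shows up under `update` or `delete` on the next pass.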

09 · SecOps

🔐 Shift-Left Security

Moving security checks earlier in the development process — finding vulnerabilities when they're cheapest to fix.

Why "Shift Left"?

Beginner

What it is: By a widely quoted rule of thumb, a vulnerability caught in design is roughly 10× cheaper to fix than one caught in code review, which in turn is roughly 100× cheaper than one found in production. The exact multipliers vary by study, but the direction is consistent. "Shifting left" means running security checks as early as possible, ideally on developer laptops before a line of code is even pushed.

Fix in Prod → Pentest Before Release → Scan in CI → Pre-commit Checks → Security in Design

OWASP Top 10 — The Must-Know List

Beginner

#	Risk	Simple Rule
A01	Broken Access Control	Check authorization on every request server-side. Never trust the client.
A02	Cryptographic Failures	Enforce TLS 1.2+. Use AES-256-GCM. Ban MD5/SHA-1 for security purposes.
A03	Injection (SQL, OS, LDAP)	Always use parameterized queries. Never concatenate user input into commands.
A04	Insecure Design	Threat model before coding. Security is a design requirement, not a patch.
A05	Security Misconfiguration	IaC scanning in CI. CIS Benchmarks. Default-deny everything.
A06	Vulnerable Components	SCA scanning + automated dependency updates (Dependabot/Renovate).
A07	Auth & Session Failures	Use established auth libraries. Enforce MFA. Rotate tokens. Short sessions.
A08	Software Integrity Failures	Sign artifacts. Verify signatures. Use trusted CI for builds.
A09	Security Logging & Monitoring Failures	Log auth and access-control failures. Alert on them. Protect logs from tampering.
A10	Server-Side Request Forgery (SSRF)	Allowlist outbound destinations. Never fetch user-supplied URLs unvalidated.

Vulnerability Severity & SLAs

Beginner → Intermediate

What it is: CVSS (Common Vulnerability Scoring System) scores vulnerabilities 0–10. But CVSS alone doesn't tell you what to fix first. Combine it with EPSS (probability of exploitation) and the CISA KEV catalog (actively exploited in the wild).

Severity	CVSS	Patch SLA	If Unpatchable
CRITICAL	9.0–10	24 hours	Isolate system immediately
HIGH	7.0–8.9	7 days	WAF rule or network ACL
MEDIUM	4.0–6.9	30 days	Document risk acceptance
LOW	0.1–3.9	90 days	Risk acceptance OK

⚠

CVSS ≠ priority. A CVSS 7.5 vulnerability with 85% exploitation probability (EPSS) is more urgent than a CVSS 9.0 with 1% probability. Always check the CISA KEV catalog first — those vulnerabilities are actively exploited and must be treated as emergencies.
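One way to turn that advice into an ordering is sketched below. The scoring rule (KEV first, then CVSS multiplied by EPSS) is an illustrative heuristic, not a standard, and the `Vuln` type and CVE identifiers in the usage example are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Vuln:
    cve: str
    cvss: float   # 0.0 to 10.0 severity score
    epss: float   # 0.0 to 1.0 probability of exploitation
    in_kev: bool  # listed in the CISA KEV catalog


def triage_order(vulns: list[Vuln]) -> list[Vuln]:
    """KEV entries first, then by cvss * epss, highest first."""
    return sorted(vulns, key=lambda v: (v.in_kev, v.cvss * v.epss), reverse=True)
```

With this rule, a CVSS 7.5 finding at 85% EPSS sorts ahead of a CVSS 9.0 finding at 1%, and anything in KEV sorts ahead of both.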

10 · SecOps

🔑 Identity & Secrets Management

OIDC federation, workload identity, dynamic secrets — eliminating long-lived credentials from your systems.

OIDC Federation for CI — No More Static Credentials

Intermediate

What it is: Instead of storing AWS/Azure/GCP credentials as CI secrets (which can be stolen), your CI provider (GitHub Actions, GitLab) becomes a trusted identity provider. It issues short-lived JWT tokens that cloud providers exchange for temporary cloud credentials.

Runner (GitHub Actions Job) → JWT Token (OIDC Token Issued) → Cloud IAM (Validate JWT) → Temp Creds (15-min AWS Role)

✅

Result: No static credentials stored anywhere. No rotation needed. Even if a token is intercepted, it expires in 15 minutes. Scope trust policies by branch (ref:refs/heads/main) to limit which pipelines can assume which roles.
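The branch scoping comes down to comparing claims inside the token. The sketch below mimics that comparison for a GitHub Actions token presented to AWS. It assumes the claims have already been extracted and deliberately omits signature and expiry verification, which the cloud provider performs for you. `claims_allowed` is a hypothetical name.

```python
def claims_allowed(claims: dict, repo: str, branch: str) -> bool:
    """Mimic a cloud trust policy: accept only tokens for one repo and branch."""
    expected_sub = f"repo:{repo}:ref:refs/heads/{branch}"
    return (
        claims.get("iss") == "https://token.actions.githubusercontent.com"
        and claims.get("aud") == "sts.amazonaws.com"
        and claims.get("sub") == expected_sub
    )
```

In AWS the same comparison is expressed as conditions on the role's trust policy rather than in application code.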

Secrets Anti-Patterns vs Best Practices

Beginner → Intermediate

✕ Never Do This

Hardcode secrets in source code
Store secrets in .env files committed to git
Share secrets via Slack or email
Use same credential across dev/staging/prod
Set secrets that never rotate
Log secrets in CI pipeline output

✓ Always Do This

Use a central secret store (Vault, AWS Secrets Manager)
Use dynamic secrets with short TTLs
Separate secrets per environment
Automate rotation and test it
Audit all secret access (who, when, what)
Run secret detection as a pre-commit hook

HashiCorp Vault — Dynamic DB Credentials

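# Vault generates a unique, time-limited DB user per request
# App authenticates to Vault using its cloud IAM role or SPIFFE identity

# 1. Configure the database secrets engine
#    (username/password are Vault's own admin login to Postgres; values here are placeholders)
vault write database/config/my-postgres \
    plugin_name=postgresql-database-plugin \
    connection_url="postgresql://{{username}}:{{password}}@db:5432/mydb" \
    allowed_roles="app-role" \
    username="vault-admin" \
    password="$VAULT_DB_ADMIN_PASSWORD"

# 2. Define the role: how users are created and how long they live
vault write database/roles/app-role \
    db_name=my-postgres \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';" \
    default_ttl=1h \
    max_ttl=24h

# 3. App requests credentials at startup (TTL: 1 hour)
vault read database/creds/app-role
# → username: v-app-role-xyz789  (auto-created Postgres user)
# → password: <random>           (auto-revoked after 1 hour)

# The app never holds a long-lived DB password; Vault issues a fresh one per lease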

External Secrets Operator (ESO)

Intermediate

What it is: A Kubernetes controller that reads secrets from external stores (AWS Secrets Manager, Vault, Azure Key Vault) and automatically syncs them as Kubernetes Secret objects. Your Git repo never contains secret values — it only contains references.

ExternalSecret — Kubernetes YAML

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-app-secrets
spec:
  refreshInterval: "1h"        # Re-sync from source every hour
  secretStoreRef:
    name: aws-secrets-manager  # Reference to the SecretStore
    kind: ClusterSecretStore
  target:
    name: my-app-secrets       # Name of the resulting K8s Secret
  data:
    - secretKey: DB_PASSWORD   # Key in the K8s Secret
      remoteRef:
        key: prod/myapp/db     # Path in AWS Secrets Manager
        property: password

11 · SecOps

📜 Policy as Code

Machine-enforceable governance rules written in version-controlled code — preventing unsafe changes before they become incidents.

Kyverno — Kubernetes Admission Policies

Intermediate

What it is: Kyverno is a policy engine that runs as a Kubernetes admission webhook. Every resource creation/update must pass your policies before it's allowed into the cluster. Policies can validate, mutate, or generate resources.

Kyverno — Require Resource Limits

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce   # Block if violated (vs. Audit = just warn)
  rules:
    - name: check-container-resources
      match:
        resources: { kinds: [Pod] }
      validate:
        message: "CPU and memory limits are required for all containers."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

🟢 Beginner

Start in Audit mode to see what would fail without breaking things. Common beginner policies: require labels, require resource limits, require non-root containers.

🟡 Intermediate

Move critical policies to Enforce mode. Add Kyverno Policy Reports to your observability stack. Namespace-scope enforcement for gradual rollout. Exception workflows for approved deviations.

Conftest & OPA — Policy Checks in CI

Intermediate

What it is: While Kyverno enforces at the cluster admission layer, Conftest runs OPA (Open Policy Agent) Rego policies against files in CI — Terraform plans, Kubernetes manifests, Dockerfiles. Catches violations before they reach the cluster.

✅

Defense in depth: Run Conftest in CI (catches violations in PRs) AND Kyverno in the cluster (catches anything that slips through). Two independent enforcement layers at two different points in the pipeline.
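Conftest policies are written in Rego. As a language-neutral illustration of what such a CI check asserts, here is the resource-limits rule sketched in Python against an already-parsed Pod manifest. `missing_limits` is a hypothetical helper, not part of Conftest or Kyverno.

```python
def missing_limits(manifest: dict) -> list[str]:
    """Return names of containers in a Pod manifest lacking a CPU or memory limit."""
    containers = manifest.get("spec", {}).get("containers", [])
    failures = []
    for container in containers:
        limits = container.get("resources", {}).get("limits", {})
        # Both limits must be present and non-empty.
        if not limits.get("cpu") or not limits.get("memory"):
            failures.append(container.get("name", "<unnamed>"))
    return failures
```

A CI step would fail the PR when the returned list is non-empty, which is the same rule the admission layer enforces later.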

12 · SecOps

⛓️ Supply Chain Security

SBOM, artifact signing, and provenance — knowing what's in your software and that it hasn't been tampered with.

📋 SBOM

Software Bill of Materials — a machine-readable inventory of every component in your artifact (libraries, OS packages, transitive deps). Generated with syft or cyclonedx. Required for vulnerability impact analysis: "Are we affected by Log4Shell?"

✍️ Artifact Signing

Cryptographically sign container images with cosign (Sigstore). Verify signatures at deploy time via admission policy. Ensures the image in production came from your trusted CI pipeline, not an attacker.

📦 Provenance

Evidence of how and where an artifact was built. SLSA (Supply chain Levels for Software Artifacts) is the framework. Level 2+: builds happen in a hosted CI, generating signed provenance attestations.

📌 Dependency Pinning

Pin base images by digest (FROM ubuntu@sha256:abc... not FROM ubuntu:latest). Pin dependencies at exact versions. Unpinned deps are a common supply chain attack vector, because a tag or version range can silently resolve to different content tomorrow.
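A digest-pinning rule is easy to check mechanically. The sketch below flags FROM lines without a sha256 digest. It is intentionally naive: it does not track multi-stage aliases, so a line such as `FROM builder` that refers to an earlier stage would also be reported. `unpinned_bases` is a made-up name.

```python
import re

DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_bases(dockerfile: str) -> list[str]:
    """Return base image references from FROM lines that lack a sha256 digest."""
    unpinned = []
    for line in dockerfile.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip optional flags such as --platform=linux/amd64.
        image = next((p for p in parts[1:] if not p.startswith("--")), None)
        if image and image != "scratch" and not DIGEST.search(image):
            unpinned.append(image)
    return unpinned
```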

13 · SecOps

🚨 Runtime Security & Incident Response

Detecting and responding to threats that only appear at runtime — after all the static checks have passed.

Incident Response Lifecycle (PICERL)

Beginner → Intermediate

01 Prepare → 02 Identify → 03 Contain → 04 Eradicate → 05 Recover → 06 Lessons Learned (Review)

🚫

Critical order: Contain before eradicate. Never remove malware or patch the entry point before containing the incident. Doing so destroys forensic evidence, tips off the attacker, and may trigger data destruction. Isolate first, preserve memory and disk images, then eradicate.

Compliance Frameworks — Which One Applies to You?

Intermediate

Framework	When It Applies	Key Focus
SOC 2 Type II	SaaS / cloud services with enterprise customers	Trust Service Criteria — security, availability, confidentiality
ISO 27001	Any org seeking certification for third-party trust	ISMS: 93 Annex A controls in the 2022 revision (114 in 2013), annual surveillance audits
PCI DSS v4.0	Handling credit card / cardholder data	12 requirements — mandatory, audited by QSA
GDPR	Processing personal data of EU/UK residents	Lawful basis, 72h breach notification, DPO
HIPAA	US healthcare — PHI handling	PHI safeguards, BAAs, breach notification

ℹ

Compliance ≠ Security. Passing a SOC 2 audit means you met a specific set of controls at a point in time. It does not mean you are secure. Use compliance to formalize your program, but build actual security posture around threat-informed defense.

14 · Observability

📡 The Three Pillars

Metrics, logs, and traces — three complementary tools that together let you understand what your system is doing.

📈 Metrics

What: Numeric measurements over time — request rate, error rate, latency, CPU, memory.

Answer: "Is something wrong? How often? How bad?"

Tools: Prometheus (scrape-based), Grafana (visualization), OpenTelemetry SDK.

Track 4 Golden Signals: latency, traffic, errors, saturation
Use histograms for latency, not averages
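A short worked example of why averages mislead for latency. The numbers are invented for illustration: 98 fast requests and 2 very slow ones.

```python
import statistics


def p99(samples: list[float]) -> float:
    """99th percentile via statistics.quantiles (needs at least two samples)."""
    return statistics.quantiles(samples, n=100)[98]


# Latencies in milliseconds (made-up data).
latencies = [20.0] * 98 + [3000.0] * 2

mean_ms = statistics.mean(latencies)   # pulled only modestly by the outliers
p99_ms = p99(latencies)                # exposes the slow tail directly
```

Here the mean stays under 100 ms while the p99 sits at the 3-second outliers, which is the experience the slowest users actually get. Histograms preserve that tail; a single average discards it.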

📋 Logs

What: Timestamped records of discrete events — structured JSON events emitted by your application.

Answer: "What exactly happened at this moment?"

Tools: Loki (log aggregation), Elasticsearch/OpenSearch, Datadog.

Always use structured JSON logs
Include correlation IDs linking to traces
Never log PII or credentials
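A minimal structured-logging sketch using only the Python standard library. The field names (`trace_id`, `logger`) and the `checkout` logger are assumptions for illustration; a production setup would more likely use an established JSON logging library or the OpenTelemetry SDK.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is attached per call via `extra`; None means no active trace.
            "trace_id": getattr(record, "trace_id", None),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment authorized", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Because every line is a JSON object with a `trace_id`, a log backend can filter by that ID and jump straight to the matching trace.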

🔍 Traces

What: End-to-end journey of a single request through multiple services, showing timing at each hop.

Answer: "Where is the bottleneck in this request path?"

Tools: Tempo, Jaeger, Zipkin, OpenTelemetry.

Instrument entry points (HTTP handlers, consumers)
Propagate context via W3C Trace Context headers
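Context propagation mostly means forwarding one header. The sketch below builds and forwards a W3C `traceparent` value by hand to show its shape (version, trace ID, span ID, flags). It skips some validation the specification requires, such as rejecting all-zero IDs, and in practice an OpenTelemetry SDK does this for you. The function names are hypothetical.

```python
import re
import secrets

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID and flags, but issue a fresh span ID for this hop."""
    match = TRACEPARENT.match(incoming)
    if match is None:
        return new_traceparent()  # malformed header: start a new trace
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each service calls `child_traceparent` on the inbound header and sends the result on its outbound requests, so every hop shares one trace ID.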

OpenTelemetry — The Unified Standard

Intermediate

What it is: A vendor-neutral standard and SDK for instrumenting your application to emit metrics, logs, and traces in a consistent format. Write instrumentation once, send to any backend (Datadog, Grafana Cloud, AWS X-Ray, Jaeger).

✅

Why it matters: If you instrument with a vendor-specific SDK (Datadog agent, New Relic SDK), you're locked in. OpenTelemetry lets you swap observability backends without re-instrumenting your code. Use the OTel Collector as a routing layer between your apps and backends.

15 · Observability

📊 SLOs & Error Budgets

Making reliability measurable and turning uptime into a product decision, not just an ops concern.

SLI → SLO → SLA

Beginner → Intermediate

Term	What It Is	Example
SLI Service Level Indicator	The actual metric you measure	% of HTTP requests returning 2xx in under 500ms
SLO Service Level Objective	Your target for that metric	99.9% of requests must be successful over a 30-day window
SLA Service Level Agreement	External contractual commitment, usually with financial penalties if missed	99.5% uptime in the contract (looser than the SLO, leaving a buffer for the org)
Error Budget	The allowed unreliability from the SLO	99.9% SLO = 0.1% budget = 43.2 min of allowed downtime per 30-day window
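The error budget arithmetic is worth seeing once. The sketch assumes an availability SLO measured over a fixed window; over exactly 30 days, 99.9% allows 43.2 minutes, while figures near 43.8 minutes come from using an average calendar month of about 30.44 days. `error_budget_minutes` is a made-up helper name.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0
```

Moving from 99.5% to 99.9% cuts the 30-day budget from 216 minutes to about 43, which is why each extra nine is a product decision and not just an ops target.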

🟢 Beginner

Start with one availability SLO per critical service (e.g., "99.5% of checkout requests succeed in 30 days"). Track it. That's it. One measured SLO is infinitely better than zero.

🟡 Intermediate

Add latency SLOs. Define error budget burn rate alerts (multi-window: 1h + 6h). When burn rate is too high, halt feature work and focus on reliability. Error budget = the bridge between product and operations teams.
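Burn rate is the observed error ratio divided by the ratio the SLO allows, so a value of 1 means the budget runs out exactly at the end of the window. The sketch below shows a two-window check. The threshold of 6 is an assumed example value, not a recommendation from this guide, and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget_ratio = (100.0 - slo_percent) / 100.0
    return error_ratio / budget_ratio


def should_page(ratio_1h: float, ratio_6h: float, slo_percent: float,
                threshold: float = 6.0) -> bool:
    """Page only when both windows burn fast, which filters out short blips."""
    return (burn_rate(ratio_1h, slo_percent) >= threshold
            and burn_rate(ratio_6h, slo_percent) >= threshold)
```

Requiring both windows means a brief spike (high 1h, low 6h) does not page, while a sustained problem does.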

16 · Observability

🔔 Alerting Best Practices

Monitoring without action paths is just noise. Good alerts are actionable, routed, and deduplicated.

Alert Design Principles

Beginner → Intermediate

📢 Alert on Symptoms

Alert on user-visible impact (high error rate, slow response) not on causes (CPU at 80%). A cause without a symptom might be fine. A symptom always needs response.

🎯 Actionable Only

Every alert must have a runbook. If an alert fires and the responder doesn't know what to do, either the alert is wrong or the runbook is missing. Fix whichever applies before the alert fires again.

🛤️ Routed by Severity

P1/SEV1: page immediately, 24/7. P2: page during business hours. P3: ticket, handled next business day. Use Alertmanager (Prometheus) or PagerDuty routing rules.

🔇 Deduplicate

Group related alerts to prevent alert storms. One database failure should generate one page, not 500 "service can't connect to DB" alerts. Alertmanager's group_by handles this.
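Grouping is just bucketing by a chosen set of labels. The sketch below imitates the idea behind `group_by` on plain dicts. It is not Alertmanager code, and the alert shape (`{"labels": {...}}`) is an assumption for illustration.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict[tuple, list[dict]]:
    """Bucket alerts by the values of the chosen labels; one notification per bucket."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)
```

Grouping 500 "can't connect to DB" alerts by `alertname` and `cluster` yields a single bucket, and therefore a single page, because the differing `service` label is not part of the key.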

⚠

Alert fatigue is a security risk. When on-call engineers receive too many low-quality alerts, they start ignoring them. A critical security alert buried in noise gets missed. Ruthlessly prune alerts that don't result in action.

17 · FinOps

💰 The FinOps Operating Loop

Cloud cost control becomes part of engineering, not a monthly finance surprise.

What is FinOps?

Beginner

What it is: FinOps (Financial Operations) is a practice where engineering teams take ownership of cloud spending. Instead of receiving a surprise bill at the end of the month, teams have real-time visibility into what they're spending and why, with automated guardrails to prevent waste.

1. Inform (Measure & Allocate Costs) → 2. Optimize (Rightsize & Eliminate Waste) → 3. Govern (Policy Guardrails) → 4. Verify (Measure Savings) → Repeat (Continuous Loop)

Rightsizing Workloads

Beginner → Intermediate

What it is: Matching CPU/memory requests and limits to what workloads actually use. Overprovisioned pods waste node resources; underprovisioned pods get OOMKilled. Use VPA (Vertical Pod Autoscaler) in recommendation mode to see what workloads actually need.

VPA Recommendation Check

# Check VPA recommendations for a deployment
kubectl describe vpa my-app -n production

# Output shows what VPA recommends vs what's currently set:
# Lower Bound: cpu: 50m  memory: 64Mi   (min safe)
# Target:      cpu: 120m memory: 200Mi  (recommended)
# Upper Bound: cpu: 400m memory: 512Mi  (max observed)

# If your requests are set at: cpu: 2000m memory: 4Gi
# → You're requesting ~17× the recommended CPU (2000m / 120m) and ~20× the memory
# → Reducing saves money and improves cluster density

🟢 Beginner

Start by identifying your 10 largest resource consumers with kubectl top pods -A. Compare requests vs actual usage. Even a 20% reduction in your top 10 consumers can mean significant monthly savings.

🟡 Intermediate

Automate rightsizing recommendations as pull requests using scripts that read VPA recommendations and generate manifest diffs. Gate changes through PR review to prevent accidental under-provisioning.
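A script like that starts by comparing the current request with the VPA target. The sketch below covers CPU only and assumes quantities arrive as whole cores (`"2"`) or millicores (`"120m"`). The helper names are made up, and memory parsing is left out.

```python
def cpu_millicores(quantity: str) -> float:
    """Parse a Kubernetes CPU quantity such as '120m' or '2' into millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000


def overprovision_factor(requested: str, target: str) -> float:
    """Ratio of the current CPU request to the VPA target."""
    return cpu_millicores(requested) / cpu_millicores(target)
```

A request of 2000m against a 120m target gives a factor of roughly 16.7, which a PR-generating script could use to rank workloads by potential savings.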

18 · FinOps

🏷️ Cost Allocation & Governance

Tagging resources, catching budget anomalies, and showing the cost of every Terraform change before it's merged.

Resource Labeling Strategy

Beginner

What it is: Every cloud resource and Kubernetes workload gets standard labels/tags that enable cost attribution. Without labels, you can't answer "which team owns this $30k/month RDS instance?"

Required Labels — Kubernetes / Cloud

# Kubernetes pod labels (enforced via Kyverno policy)
labels:
  app.kubernetes.io/name: "payments-api"
  team: "platform-team"            # custom keys stay outside the reserved kubernetes.io prefix
  env: "production"
  cost-center: "eng-platform-001"

# AWS resource tags (enforce via SCP / tag policy)
tags = {
  Team        = "platform"
  Environment = "production"
  Service     = "payments-api"
  CostCenter  = "eng-platform-001"
}

# Untagged = unallocated cost = no accountability
# Enforce minimum tags at org level via SCP before provisioning

Infracost — Cost Estimation in CI

Intermediate

What it is: Infracost reads your Terraform plan and generates a cost diff comment on every PR — "this change will add $847/month to your AWS bill." Engineers see cost impact before merging, not on the next invoice.

✅

Real impact: A developer accidentally provisions a GPU instance when they meant a regular EC2 — that's a $4,000/month mistake. Infracost in CI catches it at PR review time. Add a policy gate to block PRs that increase cost by more than 20% without a justification comment.
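The policy gate reduces to one comparison. This sketch assumes the monthly baseline and the monthly diff have already been read from the cost estimate. The function and parameter names are hypothetical and this is not an Infracost API.

```python
def cost_gate(baseline_monthly: float, diff_monthly: float,
              has_justification: bool, max_increase: float = 0.20) -> bool:
    """Return True when the PR may merge under the cost policy."""
    if diff_monthly <= 0:
        return True  # savings or no change always pass
    if baseline_monthly <= 0:
        # No baseline means a percentage is undefined; require a justification.
        return has_justification
    increase = diff_monthly / baseline_monthly
    return increase <= max_increase or has_justification
```

For example, adding $847/month to a $4,000/month baseline is a 21% increase, so the PR is blocked until someone adds a justification.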

19 · Governance

📋 Service Catalog & Ownership

A Git-native registry of every service, its owner, runbooks, and operational metadata — so incidents route to the right team immediately.

Service Catalog Concepts

Beginner → Intermediate

Every production service should have a catalog entry that answers: Who owns this? What does it depend on? Where are the runbooks? What SLO does it have? What tier is it (critical vs non-critical)?

service-catalog.yaml — Example Entry

services:
  - name: payments-api
    description: "Handles payment processing and fraud detection"
    tier: P0                  # P0 = business-critical
    team: platform-payments
    oncall: "pagerduty://payments-oncall"
    slo:
      availability: "99.9%"
      latency_p99: "500ms"
    runbooks:
      incident: "https://wiki.internal/runbooks/payments-api"
      deployment: "https://wiki.internal/runbooks/payments-deploy"
    dependencies:
      - postgres-primary
      - redis-cache
      - stripe-api (external)

🟢 Beginner

Register every production service. Even a simple list with owner + oncall contact is transformative during an incident. The first question is always "who owns this?"

🟡 Intermediate

Generate CODEOWNERS automatically from catalog metadata to avoid drift. Validate catalog entries in CI (schema check + URL validation). Enforce registration as a production deployment gate.
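A CI schema check for catalog entries can be very small. The sketch below validates one entry that has already been parsed from YAML into a dict. The set of required fields is an assumption based on the example entry, and `catalog_errors` is a made-up name.

```python
REQUIRED = ("name", "tier", "team", "oncall", "slo", "runbooks")


def catalog_errors(entry: dict) -> list[str]:
    """Return human-readable problems with one service catalog entry."""
    errors = [f"missing field: {field}" for field in REQUIRED if not entry.get(field)]
    # Runbook links must be absolute https URLs so they work from a pager.
    for kind, url in (entry.get("runbooks") or {}).items():
        if not str(url).startswith("https://"):
            errors.append(f"runbook {kind!r} is not an https URL")
    return errors
```

The CI job fails when any entry returns a non-empty list, which keeps ownership data from decaying between incidents.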

20 · Reference

🎯 Learning Path

The recommended order to learn these concepts, from first principles to full platform engineering maturity.

Beginner → Intermediate Progression

All Levels

Phase	Concepts	Goal
Phase 1 Week 1–2	Git branching, conventional commits, dev containers, Makefile	Every developer working identically. Code history is readable. No "works on my machine."
Phase 2 Week 3–4	Pre-commit hooks, basic CI pipeline, secret detection, dep scanning	Automated quality gates on every push. Security checks running in CI.
Phase 3 Month 2	Terraform basics, Kubernetes manifests, container scanning, IaC scanning	Infrastructure defined as code. Deployments reproducible.
Phase 4 Month 3	Kustomize/Helm, GitOps with Argo CD, OIDC federation, secrets management	GitOps-driven deployments. No static credentials anywhere.
Phase 5 Month 4–5	Policy as code (Kyverno), observability (metrics + logs + traces), SLOs	Governance automated. Reliability is measured and owned.
Phase 6 Ongoing	FinOps loop, service catalog, supply chain security, incident response playbooks	Full platform engineering maturity. Cost visible. Ownership clear. Incidents structured.

✅

The meta-principle: Every practice in this guide follows the same pattern: Observe first, then enforce. Run your linter in warn mode before failing CI. Run Kyverno in Audit before Enforce. Map traffic flows before micro-segmenting. Run Infracost before blocking PRs. Visibility always comes before control — you cannot govern what you cannot see.

Concept	Primary Tool(s)	Learn First
Branching	Git, GitHub/GitLab	YES — Day 1
CI Pipeline	GitHub Actions, GitLab CI	YES — Week 1
Secret Detection	gitleaks, truffleHog	YES — Week 1
IaC	Terraform, Pulumi	Month 1
Kubernetes	kubectl, k9s, Lens	Month 1–2
GitOps	Argo CD, Flux	Month 2–3
SLOs	Prometheus, Grafana	Month 3–4
Policy as Code	Kyverno, OPA/Conftest	After Kubernetes basics
Supply Chain Sec	cosign, syft, SLSA	After CI/CD mastery
FinOps	Infracost, VPA, Kubecost	After infra is running